Google
Voice & multimodal agents
Gemini Live
A · Strong · Maturing
Low-latency multimodal streaming — voice, vision, and text in one session.
Native vision inside a live voice session is a real edge, and it is cheap at volume. Graded A because it is a developer API rather than a polished end-user voice product.
Excellent at its job with real, known tradeoffs.