Key Takeaways
- Gemma 4 brings multimodal AI to your pocket: Google DeepMind's Omar Sanseviero confirmed that Gemma 4's smaller models can now process audio, images, and short videos (30-60 seconds) directly on a device, a capability previously reserved for larger, cloud-based models.
- It's a polyglot with 140 languages: Sanseviero highlighted Gemma's robust multilingual support, thanks to an advanced tokenizer inherited from Gemini. This means reliable performance across 140 languages, even when fine-tuned for specific, niche dialects.
- Concrete on-device use cases are here: Forget theoretical applications. Gemma 4 handles real-world tasks like speech recognition, speech-to-translated text, general speech understanding, object detection, pointing, and image captioning, all optimized for phone-level performance.
- But it's not magic (yet): While powerful, Gemma 4 currently doesn't support image segmentation, nor can it process video and synchronized audio together in a single prompt. Founders need to understand these current hard limits before building.
The Edge Gets Smarter: Multimodal AI Arrives On-Device
For years, truly powerful AI lived in the cloud. Complex tasks like understanding images or translating speech required sending data off-device, waiting for a server, and getting a response back. That delay killed a lot of potential real-time applications. Omar Sanseviero, from Google DeepMind, just signaled a significant shift with Gemma 4.
According to Sanseviero, the smaller Gemma models are now capable of understanding a range of inputs directly on-device. “Multimodal wise, the smaller models can understand audio, images, and short videos,” he explained, specifically mentioning 30- to 60-second clips. This isn't just a party trick; it means tasks like instant speech recognition or object detection can run without internet access and without the delay of a network round trip.
Think about the implications: a device that can understand spoken commands and identify objects in its camera feed, all while offline. Sanseviero pointed to specific “use cases that are very optimized for like on-device phone use cases,” including speech-to-translated text and asking questions about an audio file. This move fundamentally changes the calculus for what you can build at the edge.
140 Languages, No Cloud Needed
Beyond just seeing and hearing, Gemma 4 also handles languages. A lot of them. Sanseviero made it clear that multilingual capabilities are a core strength. “Gemma is quite important for the multilingual aspect as well. So Gemma supports these 140 languages.” This isn't just superficial support; it comes from a highly effective tokenizer derived from the larger Gemini models, which helps the model hold up across a vast array of global languages.
For founders eyeing international markets, this is a game-changer. Imagine building an app that offers real-time translation or voice commands in dozens of languages, with strong local accuracy, all without hitting a server. It sidesteps common latency and privacy issues, making truly global products feasible at a new scale. The ability to fine-tune for new languages and still maintain strong performance means more adaptable, geographically diverse applications.
However, Sanseviero was also transparent about current limitations. “We do not have image segmentation, which we know is like one thing that many people have been asking us.” And crucially for rich media, “The other thing we do not support yet is video with audio. So, we can understand like video input or audio input separately, but if you want to pass like in the same prompt both a visual part and the audio part, we still need to do some improvements around that.” These are important guideposts for what to build now versus what to wait for.
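One way to keep these limits from surprising you later is to encode them as a guard in front of whatever on-device runtime you use. The sketch below is an illustration only: `MediaInput`, `split_into_prompts`, and the task name `image_segmentation` are hypothetical and not part of any Gemma API, and treating 60 seconds as a hard ceiling for video is an assumption based on the 30- to 60-second range Sanseviero mentioned.

```python
from __future__ import annotations

from dataclasses import dataclass

# Limits taken from the interview; every name here is hypothetical.
MAX_VIDEO_SECONDS = 60  # assumed ceiling, from the "30- to 60-second" remark
UNSUPPORTED_TASKS = {"image_segmentation"}


@dataclass(frozen=True)
class MediaInput:
    kind: str             # "audio", "image", or "video"
    seconds: float = 0.0  # duration; ignored for images


def split_into_prompts(task: str, inputs: list[MediaInput]) -> list[list[MediaInput]]:
    """Group inputs so that no single prompt mixes video with audio."""
    if task in UNSUPPORTED_TASKS:
        raise ValueError(f"task not supported: {task}")
    for item in inputs:
        if item.kind == "video" and item.seconds > MAX_VIDEO_SECONDS:
            raise ValueError("video clip is longer than the assumed 60-second limit")

    kinds = {item.kind for item in inputs}
    if {"video", "audio"} <= kinds:
        # Audio goes in its own prompt; everything else stays together.
        audio = [i for i in inputs if i.kind == "audio"]
        rest = [i for i in inputs if i.kind != "audio"]
        return [rest, audio]
    return [inputs]
```

Calling `split_into_prompts("captioning", [clip, voice])` with one video and one audio input returns two prompt groups, which your app would send separately and then combine the answers itself. If joint video-and-audio prompts arrive in a later release, this guard is the only place that would need to change.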
What to Do With This
Stop thinking about multimodal AI as something only for massive data centers. Pull up your product roadmap and identify features you scrapped because on-device AI wasn't powerful enough or cheap enough. Specifically, prototype a core loop that uses Gemma 4's on-device audio understanding or its 30- to 60-second video processing for a real-time, offline capability. For example, if you're in field service, could a technician use on-device object detection to identify a part, or a voice command in a local language to log an action, without touching the screen? Your competition is likely still stuck waiting for cloud APIs.