IBL News | New York
Google announced yesterday its next major model, Gemini 2.0 Flash, which adds new multimodal outputs and can natively generate images, audio, and text.
2.0 Flash can also use third-party apps and services, allowing it to tap into Google Search, execute code, and more.
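As a rough illustration of what that tool use looks like from a developer's side, the sketch below grounds a response with Google Search via Google's google-genai Python SDK; the model id and config shapes are assumptions drawn from the SDK as documented around this release, not details confirmed in the announcement.

```python
# Hypothetical sketch: grounding a Gemini 2.0 Flash response with the
# built-in Google Search tool via the google-genai SDK. The model id and
# config are assumptions, not details from Google's announcement.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",  # assumed experimental 2.0 Flash model id
    contents="Summarize today's top AI news.",
    config=types.GenerateContentConfig(
        # Enable Google Search so the model can ground its answer in live results.
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)
print(response.text)
```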
However, the audio and image generation capabilities are launching only for “early access partners,” while the production version of 2.0 Flash will land in January.
In the meantime, Google is releasing an API, the Multimodal Live API, to help developers build apps with real-time audio and video streaming functionality.
According to Google, the API lets developers create real-time, multimodal apps that take audio and video input from cameras or screens.
The API supports the integration of tools to accomplish tasks, and it can handle “natural conversation patterns” such as interruptions along the lines of OpenAI’s Realtime API.
The Multimodal Live API became generally available yesterday.
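To show the shape of that streaming interaction, here is a minimal text-only session sketch using the google-genai Python SDK; the model id, config keys, and session methods are assumptions based on early SDK documentation, and a real app would stream microphone audio or screen frames instead.

```python
# Minimal sketch of a Multimodal Live API session (text in, text out) with
# the google-genai SDK. Model id and method names are assumed from early SDK
# docs, not confirmed by the announcement.
import asyncio
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder key

async def main():
    config = {"response_modalities": ["TEXT"]}  # ["AUDIO"] for spoken replies
    async with client.aio.live.connect(
        model="gemini-2.0-flash-exp", config=config
    ) as session:
        # Send one user turn; the server streams back incremental responses.
        await session.send(input="Hello, Gemini!", end_of_turn=True)
        async for response in session.receive():
            if response.text:
                print(response.text, end="")

asyncio.run(main())
```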
In addition, Google released Jules, an experimental AI-powered code agent built on Gemini 2.0 for coding tasks in Python and JavaScript. Jules creates comprehensive, multi-step plans to address issues, efficiently modifies multiple files, and even prepares pull requests to land fixes directly into GitHub.
Field report…
1. Google’s Gemini 2.0 and 1.5 with Deep Research is the best LLM on the market… “deep research” is VERY VERY impressive.
2. The dedicated iOS App is reaching parity with ChatGPT already.
3. … and $goog is releasing new features and products related to…
— @jason (@Jason) December 12, 2024