Google’s annual developer conference, Google I/O, took place in the early morning of May 15 Taiwan time, with a heavy focus on AI updates. Over the course of the event, the term “AI” was mentioned a total of 122 times.
One of the major updates is the deeper integration of Gemini, which brings “multimodal” capabilities to Search and to Google’s assistants. Starting this year, Google Search will accept video as a query. Google also launched AI Overview, a feature that uses AI to summarize search results. The intelligent assistant Astra can recognize objects and actions in video as it records and respond instantly to related questions. Google further introduced a new large language model, Gemini 1.5 Flash, and a video generation model called Veo.
Demis Hassabis, who leads Google DeepMind, also made his first on-stage appearance at Google I/O.
The first major AI revolution is in the search engine domain. With the integration of Gemini, the search engine has undergone a fundamental update, enabling it to not only recognize audiovisual content but also understand longer and more complex instructions.
Google Search is now capable of “video search.” Previously, Google Search relied primarily on text and images; with the new update, users can shoot a video and supplement it with a spoken or typed question. The search engine analyzes the content of the video and returns a relevant response. For example, a user having technical trouble with a vinyl record player can record a video and ask Google, “Why is it behaving this way?” Google will automatically search for the answer and deliver an AI-generated summary through the AI Overview feature.
AI Overview, a technology Google introduced last year, summarizes and organizes search results at the top of the results page. With the new “multi-step reasoning” capability of the Gemini model, AI Overview can handle complex queries: no matter how long or detailed the request, or how many distinct concerns it bundles together, AI Overview can complete the task without multiple separate searches. For example, a user looking for a new yoga or Pilates studio in Boston who also wants to know about new-member offers and the walking time from Lighthouse Hill can simply search: “Find me the best yoga or Pilates studio in Boston and tell me their new member offers and the time it takes to walk from Lighthouse Hill.” AI Overview will handle the entire query at once.
The second AI revolution is Astra, Google’s vision of a future AI assistant. Demonstrated for the first time at Google I/O, Astra is said to understand the dynamic, complex world the way a human does. Astra is also multimodal, including real-time analysis of video: it can think and react quickly to moving imagery, and it even has memory. During the demonstration, a user walked around filming with their phone. Standing near a window, they asked Astra, “What neighborhood do you think I’m in?” They circled a section of code on a computer screen with an on-screen drawing tool and asked, “What do you think could be improved here?” Near the end of the video, they asked, “Do you remember where I left my glasses?” Astra, having analyzed the earlier frames, located the glasses and answered, “Next to an apple.”
The third AI revolution is in Google Photos. With the new Ask Photos with Gemini feature, photos can be categorized by the objects they contain and tagged with keywords. A user can quickly surface the photo showing their car’s license plate, or retrace the process of a daughter learning to swim: asked “When did my daughter learn the backstroke?”, Gemini quickly searches the related images and answers with the date.
The fourth AI revolution is in Android, where Gemini is expected to become the main vehicle for Google’s AI features. Based on the applications demonstrated at the conference, Gemini can generate memes in chat conversations, answer questions about the rules of a sport while a sports video plays, and, through the Gemini Advanced app, instantly answer questions about an 80-page PDF.
Gemini’s ability to take in a huge number of tokens at once allows it to read an entire economics textbook in a matter of seconds and provide summaries or answer questions about it.
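As a rough illustration of how a developer might use this long-context capability, the sketch below calls the Gemini API through Google’s google-generativeai Python SDK. The file name, prompt, and API-key placeholder are illustrative assumptions, not part of Google’s demo.

```python
# A minimal sketch of long-document summarization with the Gemini API.
# Assumes the google-generativeai package is installed and a valid API
# key is available; the file name and prompt are hypothetical.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

# Load the full text of a long document (e.g., a textbook).
with open("economics_textbook.txt", encoding="utf-8") as f:
    textbook = f.read()

# Pass the instruction and the entire document in one request;
# the long context window means no manual chunking is needed.
response = model.generate_content(
    ["Summarize the key arguments of this textbook.", textbook]
)
print(response.text)
```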
The fifth AI revolution is the update to the Gemini models themselves, centered on “multimodal” and “mass processing” capabilities. Gemini 1.5 Flash, a new member of the Gemini family, is a lighter-weight model positioned between Gemini 1.5 Pro and Gemini Nano. It offers capability comparable to Gemini 1.5 Pro in a lighter, more efficient package: its context window can take in up to a million tokens at once, enough to analyze a document of roughly 1,500 pages or more than 30,000 lines of code in a single request. The lighter model was produced through “knowledge distillation” and is aimed at developers who prioritize speed and cost-effectiveness.
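Google has not published the distillation recipe behind 1.5 Flash, but the general technique can be sketched: a smaller student model is trained to match a larger teacher’s output distribution. The PyTorch loss below is a generic illustration of knowledge distillation, not Google’s implementation; the temperature and weighting values are arbitrary assumptions.

```python
# Generic knowledge-distillation loss: the student learns from both
# the teacher's softened output distribution and the true labels.
# This illustrates the technique in general, not Gemini's training code.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soft targets: KL divergence between temperature-softened
    # student and teacher distributions, scaled by T^2 as is standard.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: ordinary cross-entropy against the ground truth.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```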
Gemini 1.5 Pro, introduced in February this year, is also being upgraded: its context window doubles to 2 million tokens. That is enough to take in a 2-hour video, 22 hours of audio, more than 60,000 lines of code, or over 1.4 million words of text in a single request.
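To put that figure to work, a developer would typically check a document’s size against the window before sending it; the SDK’s count_tokens call supports this. The sketch below is illustrative, and the file name is hypothetical.

```python
# Sketch: checking whether a document fits the 2-million-token window
# before sending it to Gemini 1.5 Pro. The file name is hypothetical.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

with open("codebase_dump.txt", encoding="utf-8") as f:
    text = f.read()

token_info = model.count_tokens(text)
if token_info.total_tokens <= 2_000_000:
    print(f"Fits in one request: {token_info.total_tokens} tokens")
else:
    print("Too large for a single request; split the input")
```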
The sixth AI revolution is video generation with Veo. Given natural-language text prompts, Veo can generate high-quality 1080p videos running over a minute, putting it in the same arena as OpenAI’s Sora. Veo understands filmmaking and visual-technology terminology and can incorporate techniques such as time-lapse photography during generation. Sora, for its part, can generate complex scenes with multiple characters, specific actions, and abundant detail; it understands not only the objects named in a prompt but also how those objects exist in the physical world, producing strikingly realistic scenes.
Just before Google I/O, OpenAI announced GPT-4o, a new model with advanced voice and vision capabilities that gives users an experience close to interacting with a real person. GPT-4o can translate in real time during a conversation, allowing smooth communication between speakers of different languages; it can also tell bedtime stories in rich, expressive voices or walk someone through a simple math problem in a human-like tone.
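For developers, GPT-4o is reachable through OpenAI’s standard chat API. The text-only call below is a minimal sketch; the real-time voice and camera features shown in OpenAI’s demos run over separate streaming interfaces that this example does not cover.

```python
# Minimal sketch: a text request to GPT-4o via the OpenAI Python SDK.
# The live-translation and voice demos rely on real-time audio
# interfaces not shown here; this is only the basic chat call.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user",
         "content": "Translate 'Good morning, how are you?' into French."}
    ],
)
print(response.choices[0].message.content)
```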
With two major AI pioneers announcing their latest technologies within two days, this AI revolution will continue to impact people’s lives.
Editor: Lin Meixin