LLMs / Multimodal Models

Gemini

Gemini is a family of natively multimodal large language models developed by Google DeepMind. It is designed to seamlessly understand, operate across, and combine different types of information, including text, code, audio, image, and video.

Explanation

Unlike traditional models that are often trained on text first and then 'patched' with vision or audio capabilities, Gemini was built to be multimodal from the start. This 'native multimodality' allows it to reason across different formats more naturally—for example, it can analyze a live video feed and provide real-time commentary, or generate code from a hand-drawn UI sketch.

The Gemini ecosystem is categorized into tiers: 'Ultra' for highly complex tasks, 'Pro' for versatile scaling, 'Flash' for high-speed, low-latency performance, and 'Nano' for efficient on-device processing.

A defining technical feature of Gemini (specifically 1.5 Pro) is its massive context window, which can handle up to 2 million tokens, enabling the model to process entire codebases, hour-long videos, or large document sets in a single prompt.
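To give a feel for the scale of a 2-million-token context window, here is a small back-of-the-envelope sketch. The 4-characters-per-token ratio and 1,800-characters-per-page figure are common rules of thumb for English prose, not official Gemini numbers—treat both as assumptions.

```python
# Rough illustration of what fits in Gemini 1.5 Pro's 2M-token context window.
# CHARS_PER_TOKEN is a heuristic for English text, not an official figure.

CONTEXT_WINDOW = 2_000_000  # tokens (Gemini 1.5 Pro)
CHARS_PER_TOKEN = 4         # rough heuristic for English prose

def estimate_tokens(text: str) -> int:
    """Very rough token estimate from character count."""
    return max(1, len(text) // CHARS_PER_TOKEN)

def fits_in_context(texts: list[str], window: int = CONTEXT_WINDOW) -> bool:
    """Check whether a batch of documents fits in a single prompt."""
    return sum(estimate_tokens(t) for t in texts) <= window

# A 300-page book at roughly 1,800 characters per page:
book = "x" * (300 * 1800)
print(estimate_tokens(book))         # ~135,000 tokens
print(fits_in_context([book] * 14))  # about 14 such books fit in 2M tokens
```

By this rough estimate, a single prompt could hold on the order of a dozen full-length books—which is the practical meaning of "entire codebases or massive document sets in one prompt."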

Related Terms