
Automatic Speech Translation

Posted: Wed Feb 05, 2025 10:41 am
by Rina7RS
Image
MathVista (mathematical reasoning in a visual context): 53.0% (0-shot), Gemini Ultra (pixel only)* vs. 49.9% (0-shot), OCR+PA

Video
VATEX (English video captioning, CIDEr): 62.7, Gemini Ultra vs. 56.0, DeepMind Flamingo
Perception Test MCQA (video question answering): 54.7% (0-shot), Gemini Ultra vs. 46.3% (0-shot), SeeLLA

Audio
CoVoST 2 (21 languages, BLEU score): 40.1, Gemini Pro vs. 29.1, Whisper v2
FLEURS (automatic speech recognition, 62 languages, word error rate, lower is better): 7.6%, Gemini Pro vs. 17.6%, Whisper v3

*The Gemini image benchmark results are pixel only; no OCR is used.
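
The FLEURS numbers are word error rates, so it may help to see how that metric is usually computed: the word-level edit distance (substitutions, insertions, deletions) between the reference transcript and the model output, divided by the number of reference words. The sketch below is only an illustration in Python. The function name word_error_rate and the sample sentences are made up for this post, and it skips the text normalization (casing, punctuation) that real benchmark scoring typically applies, so it will not reproduce the figures above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missing)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcript pair: one substituted word out of six.
    wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
    print(f"WER: {wer:.1%}")  # prints "WER: 16.7%"
```

Because the denominator is the reference length, a hypothesis with many extra words can score above 100%, which is one reason the metric is read as "lower is better" rather than as an accuracy.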

Three Gemini models
Ultra: The largest and most powerful model, suitable for highly complex tasks.
Pro: The best model for a wide range of tasks.
Nano: The most efficient model, suitable for on-device tasks.
Gemini's Core Technology
Gemini is a multimodal AI model, which means it can understand and process several types of data, including text, images, video, and audio. This lets it take on tasks that text-only models cannot handle directly, such as answering questions about a video or transcribing and translating speech.

Understanding beyond the text
A key innovation of Gemini is its ability to process non-text data. Because it is trained on large amounts of image, video, and audio data, Gemini can interpret the information in those formats and use it to give richer and more accurate responses.