Automatic Speech Translation

Data used to track, manage, and optimize resources.

Post by Rina7RS »

Modality | Benchmark | Description | Gemini result (model) | Comparison result (model)
Image | MathVista | Mathematical reasoning in a visual context | 53.0% (0-shot), Gemini Ultra (pixel only)* | 49.9% (0-shot), OCR+PA
Video | VATEX | English video captioning (CIDEr) | 62.7, Gemini Ultra | 56.0, DeepMind Flamingo
Video | Perception Test MCQA | Video question answering | 54.7% (0-shot), Gemini Ultra | 46.3% (0-shot), SeViLA
Audio | CoVoST 2 (21 languages) | Automatic speech translation (BLEU score) | 40.1, Gemini Pro | 29.1, Whisper v2
Audio | FLEURS (62 languages) | Automatic speech recognition (word error rate, lower is better) | 7.6%, Gemini Pro | 17.6%, Whisper v3
*Gemini image benchmark results are pixel only: no OCR is used.
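
The FLEURS figures above are word error rates (WER), so it may help to see how that metric is usually computed. The snippet below is a minimal sketch, not the evaluation code behind the reported numbers: the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration. It counts the word-level substitutions, deletions, and insertions needed to turn the system output into the reference transcript, then divides by the number of reference words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper (hypothetical name); splits on whitespace only,
    with no text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
print(round(word_error_rate("the cat sat on the mat", "the cat sit on mat"), 3))
```

With this definition a WER of 7.6% means roughly 7 or 8 word errors per 100 reference words, which is why a lower score is better. Published evaluations typically normalize text (case, punctuation) before scoring, which this sketch does not do.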

Three Gemini models
Ultra: The largest and most capable model, suited to highly complex tasks.
Pro: The best model for scaling across a wide range of tasks.
Nano: The most efficient model, suited to on-device tasks.
Gemini's Core Technology
Gemini is a multimodal AI model: it can understand and process several types of data, including text, images, video, and audio. Because it handles these inputs together, Gemini can take on complex problems and tasks that are difficult for traditional single-modality models.

Understanding beyond the text
A key innovation of Gemini is its ability to process non-text data. Because it is trained on large amounts of image, video, and audio data, Gemini can interpret the information in those data types directly and use it to give richer, more accurate responses.