Large language models (LLMs) have made it easier to extract information from images, whether that means writing a detailed description or recognizing text. In this article, we compare six popular models (Gemini Pro 1.5, Gemini Flash 1.5, ChatGPT-4o, GPT-4o-mini, Claude 3.5 Sonnet and Claude 3.5 Haiku) to identify the best performer on two tasks: image recognition and text extraction.

Question 1: Image Recognition – Describe the Photo Concisely

Ranking of LLM Models

1. Gemini Pro 1.5
Polished and concise, Gemini Pro 1.5 gives detailed yet succinct descriptions that balance accuracy and brevity.
2. Gemini Flash 1.5
Accurate and clear, Gemini Flash 1.5 offers robust image descriptions, although it’s slightly less polished than Pro.
3. ChatGPT-4o
ChatGPT-4o delivers detailed descriptions with clear identification of the outfit and background. However, its phrasing is slightly less refined than that of the top two models.
4. Claude 3.5 Sonnet
Descriptive and thorough, Claude 3.5 Sonnet excels in detail but tends to over-explain, making it less concise than the higher-ranked options.
5. GPT-4o-mini
Simple and effective for basic recognition, GPT-4o-mini provides accurate descriptions but misses some of the finer details.
6. Claude 3.5 Haiku
Claude 3.5 Haiku misinterpreted the image entirely and offered no useful description, which places it last in this category.

Question 2: Text Extraction – Extract the Text Concisely

For text extraction, the models were tested on the following quote:

“You can’t connect the dots looking forward; you can only connect them looking backwards. So you have to trust that the dots will somehow connect in your future.”

Performance Results

Five of the six models reproduced the original text exactly, for 100% accuracy in text extraction. The exception was Claude 3.5 Haiku, which again returned no usable output. The five models with a perfect match were:

Gemini Pro 1.5
Gemini Flash 1.5
ChatGPT-4o
GPT-4o-mini
Claude 3.5 Sonnet
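
A "perfect match" like this can be checked programmatically rather than by eye. The snippet below is a minimal sketch of one way to do it, not the procedure used for this comparison: the function names `normalize` and `extraction_accuracy` are hypothetical, and it assumes that differences in quote style (curly vs. straight) and whitespace should not count as extraction errors.

```python
import difflib
import unicodedata

# The quote used in the text extraction test, written with straight quotes.
REFERENCE = (
    "You can't connect the dots looking forward; you can only connect them "
    "looking backwards. So you have to trust that the dots will somehow "
    "connect in your future."
)

# Assumption: typographic quotes are treated as equivalent to ASCII quotes.
QUOTE_MAP = {"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'}


def normalize(text: str) -> str:
    """Fold curly quotes to ASCII, collapse whitespace, drop wrapping quotes."""
    text = unicodedata.normalize("NFKC", text)
    for curly, plain in QUOTE_MAP.items():
        text = text.replace(curly, plain)
    return " ".join(text.split()).strip('" ')


def extraction_accuracy(extracted: str, reference: str = REFERENCE) -> float:
    """Return a similarity ratio in [0, 1]; 1.0 means an exact match."""
    matcher = difflib.SequenceMatcher(
        None, normalize(extracted), normalize(reference), autojunk=False
    )
    return matcher.ratio()
```

A model output that reproduces the quote exactly scores 1.0, while a refusal or an empty response scores well below that, which makes the gap between the five matching models and Claude 3.5 Haiku easy to quantify.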

Final Rankings

1. Gemini Pro 1.5
2. Gemini Flash 1.5
3. ChatGPT-4o
4. Claude 3.5 Sonnet
5. GPT-4o-mini
6. Claude 3.5 Haiku
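
For readers who want to see how two per-task results can roll up into one overall order, here is a small sketch that sums each model's rank across both questions. This is an illustrative assumption, not a documented scoring method: the helper `final_ranking` is hypothetical, the five models with a perfect text match are treated as tied for first, and Claude 3.5 Haiku is assigned last place (6) on text extraction.

```python
# Rank of each model in Question 1 (image recognition), taken from the list above.
image_rank = {
    "Gemini Pro 1.5": 1,
    "Gemini Flash 1.5": 2,
    "ChatGPT-4o": 3,
    "Claude 3.5 Sonnet": 4,
    "GPT-4o-mini": 5,
    "Claude 3.5 Haiku": 6,
}

# Assumed ranks for Question 2: five models tie for first, Haiku is last.
text_rank = {
    "Gemini Pro 1.5": 1,
    "Gemini Flash 1.5": 1,
    "ChatGPT-4o": 1,
    "Claude 3.5 Sonnet": 1,
    "GPT-4o-mini": 1,
    "Claude 3.5 Haiku": 6,
}


def final_ranking(*rankings: dict[str, int]) -> list[str]:
    """Order models by the sum of their ranks across tasks (lower is better)."""
    models = rankings[0].keys()
    totals = {m: sum(r[m] for r in rankings) for m in models}
    return sorted(totals, key=totals.get)


for position, model in enumerate(final_ranking(image_rank, text_rank), start=1):
    print(f"{position}. {model}")
```

Because the text extraction test produced a five-way tie, the combined order under this scheme is decided entirely by the image recognition ranking.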

The Gemini models led in image recognition and matched the best results in text extraction. Gemini Pro 1.5’s consistent polish and accuracy make it the standout choice for users who need precise, versatile AI performance.

Comparing AI Vision: Which Model Wins in Image Recognition and Text Extraction?