What Can GPT-4V Do? Exploring the New Visual Capabilities of ChatGPT

Fotographer AI, Inc.
Published: November 11, 2023
AI is always a hot topic, constantly evolving and expanding its applications at an incredible pace.
This article will explore GPT-4V, the latest AI model available in ChatGPT, which has recently sparked much interest. We'll cover how it differs from previous GPT models, what it can do, and its potential uses.
What is GPT-4V?

Source: https://openai.com/research/gpt-4v-system-card
GPT-4V is the latest multimodal AI model released by OpenAI in mid-October 2023.
Note: Multimodal AI models can process and relate multiple data types (modalities) such as text, images, numbers, and audio.
It's integrated into ChatGPT, and users with a ChatGPT Plus subscription can already access GPT-4V.
As we'll discuss further, GPT has seemingly gained "eyes," "ears," and a "mouth," strengthening its role as a support tool in people's lives, rather than just an automated response AI.
ChatGPT has gained widespread recognition for its convenience and improved performance compared to previous AI models. Let's examine the differences between GPT-4V and earlier GPT versions.
Differences Between Recent GPT Versions, Including GPT-4V
GPT-3.5 vs. GPT-4
First, let's compare GPT-3.5 and GPT-4.
Overall, there's a significant performance gap between the AI models themselves.
Here are the key differences:
1. GPT-3.5 is available for free, while GPT-4 requires a paid subscription
As is typical of tiered pricing, the less powerful GPT-3.5 is available to free users, while GPT-4 is only available to paying subscribers. This is a difference in sales structure rather than AI capability, but it's important to note.
2. GPT-3.5 accepts only text input, while GPT-4 can technically handle image and audio data
Although users could only input text into GPT-4, the underlying model was technically capable of handling image and audio data.
Those already familiar with this are likely excited about the implementation of this long-awaited feature in GPT-4V.
You can think of it as an evolved version of GPT-4.
3. Higher Output Accuracy
The third difference is the accuracy of the output.
GPT-4 has fewer typos than GPT-3.5 and can handle more complex prompts with multiple instructions, providing more accurate responses.
As you can see, there are performance differences between GPT-3.5 and GPT-4.
Next, let's discuss the differences between GPT-4 and GPT-4V.
GPT-4 vs. GPT-4V
As mentioned earlier, GPT-4V is an evolution of GPT-4, and the degree of improvement is significant.
Previously limited to text data input, GPT-4V now supports image and audio data input and can process mixed data types.
Its ability to handle various data types and hold more natural conversations goes beyond simple Q&A, highlighting its increased maturity as a multimodal AI.
Let's look at the specific differences.
1. GPT-4V Remains Exclusive to Paid Users
Currently, only paying subscribers can access GPT-4V, just like GPT-4.
A more advanced model, such as a future GPT-5, might eventually be offered to free users, but for now, GPT-4V is limited to paid subscribers.
2. GPT-4V Offers About Twice the Performance of GPT-4
GPT-4V is said to offer roughly twice the performance of GPT-4. OpenAI has not published official specifications, but unofficial estimates break down as follows:
- GPT-4: language model with approximately 100 billion parameters, a context size*¹ of approximately 3,000 words, and a context window*² of approximately 1,000 words
- GPT-4V: multimodal language model with approximately 200 billion parameters, a context size of approximately 5,000 words, and a context window of approximately 2,000 words
*1 Context size: the length of input an AI model can process at once.
*2 Context window: the length of past input an AI model can remember.
3. Greater Maturity as a Multimodal AI
The third difference is its greater maturity as a multimodal AI.
As mentioned above, GPT-4 was limited to text input, while GPT-4V supports image input and mixed image and text input.
For example, it can explain objects or people in an image and answer questions related to the image.
This update demonstrates significant progress.
Let's recap what GPT-4V can do.
What Can GPT-4V Do?
We've covered what GPT-4V is and how it differs from previous GPT series. Here are five things GPT-4V can do:
Reading Image Data
First, it can read image data.
As mentioned, unlike GPT-4, you can input image data, not just text, and have the AI analyze it and carry out the given task.
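For developers, the same image-reading capability is exposed through OpenAI's API. Here's a minimal sketch, assuming the openai Python package (v1.x), an OPENAI_API_KEY environment variable, and an illustrative image URL; gpt-4-vision-preview was the vision-capable model name at the time of writing:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask the vision-capable model to analyze an image supplied by URL.
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
```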
Describing Image Data in Language
Second, it can describe image data in language.
Combining image data and language, you can input an image of a famous anime scene and ask the AI to "explain this image." It can then provide a text-based summary of the anime.
It can also identify locations in images and make inferences based on image data containing text or symbols.
As we'll discuss in the use cases, this feature is expected to be valuable for both work and personal use.
Reading Audio Data
Third, it can read audio data.
You can now input audio data in addition to text and images; we'll touch on this again in the voice conversation feature below.
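Behind the scenes, the ChatGPT app transcribes spoken input with OpenAI's Whisper model, and developers can access comparable transcription through the API. A minimal sketch, assuming the openai Python package (v1.x) and an illustrative local file name:

```python
from openai import OpenAI

client = OpenAI()

# Transcribe a local audio file to text with Whisper.
with open("voice_memo.m4a", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )
print(transcript.text)
```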
Reading Inputted Text Data Aloud
Fourth, it can read inputted text data aloud.
While Japanese language support isn't yet implemented, you can translate a Japanese sentence into English and have it read aloud. This can help those with limited English skills communicate more easily.
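Text-to-speech is also available through OpenAI's API, so similar read-aloud behavior can be reproduced outside the app. A minimal sketch; the voice name and input sentence are just examples:

```python
from openai import OpenAI

client = OpenAI()

# Convert an English sentence to spoken audio and save it as an MP3.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",  # one of several preset voices
    input="Excuse me, could you tell me the way to the station?",
)
speech.stream_to_file("spoken.mp3")
```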
Engaging in Conversations with AI Using Audio
Fifth, you can engage in conversations with AI using audio.
This feature is only available on the smartphone app, allowing you to converse with ChatGPT using voice.
It's similar to talking to Siri on an iPhone.
You can choose from five different voices.
While GPT-4V offers significantly expanded capabilities compared to previous GPT series, it also has some weaknesses, which we'll cover below.
GPT-4V's Weaknesses
Identifying Similar Image Data
First, it struggles to identify similar image data.
Think of it like a "spot the difference" game.
While it can compare multiple images and detect differences with some accuracy, there's still room for improvement.
If it could do this as well as a human, it could be used in various situations, such as streamlining (automating) visual inspection tasks in manufacturing.
Currently, significant engineering resources are spent programming such tasks, but GPT may be able to replace this coding in the future.
Processing Information Based on Data After 2023
Second, it struggles to process information based on data after 2023.
GPT-4V is based on information up to 2022 and struggles with tasks that rely solely on information from 2023 onwards.
However, this isn't a major issue and will likely be resolved over time. Consider this a current limitation.
Processing Tasks Requiring Specialized Knowledge or Expertise
Third, it struggles with processing tasks requiring specialized knowledge or expertise.
This isn't unique to GPT-4V, but it still struggles in areas that depend on specialized human expertise or judgment.
Examples include diagnosing illnesses or current conditions from X-ray images in the medical field or determining which questions to ask or what to say during a sales negotiation.
The debate between humans and AI continues, but as long as there are areas where only humans can perform, it's hard to say that AI will replace everything.
Complex Image Analysis
Fourth, it struggles with complex image analysis.
GPT-4V can process image analysis, but it tends to take longer or provide less accurate results for images with many people or objects.
Now that we've discussed GPT-4V's strengths and weaknesses, let's look at some specific examples of how it can be used.
5 Use Cases for GPT-4V
Reading Charts and Graphs to Provide Information Needed for Analysis
First, it can read charts and graphs to provide information needed for analysis.
This is particularly useful for reading statistical information published by government agencies in Japan and for analyzing charts and statistical results in overseas research papers.
Being able to draw insights from information not only within Japan but also from overseas research papers is crucial in business and life. However, reading comprehension of English and technical terms can often be a hurdle.
GPT-4V allows anyone to access necessary information equally.
This has the potential to significantly enhance information-gathering capabilities and is one of GPT-4V's most practical uses.
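When a chart comes from a paper or report saved on your own machine, you can send the image inline as base64 data instead of a URL. A minimal sketch under the same assumptions as the earlier example; the file name and prompt are illustrative:

```python
import base64

from openai import OpenAI

client = OpenAI()

# Encode a local chart image so it can be sent inline as a data URL.
with open("gdp_chart.png", "rb") as f:
    chart_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize the trend in this chart and list three takeaways."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{chart_b64}"}},
            ],
        }
    ],
    max_tokens=400,
)
print(response.choices[0].message.content)
```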
Outputting Sample Code Based on a Desired Image
Second, it can output sample code based on a desired image.
For example, you can create a mockup image of a desired web page, upload it to GPT-4V, and enter the prompt "Create HTML code." The AI will then output sample code.
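The same API pattern applies here: send the mockup image along with a code-generation prompt and save the reply. A minimal sketch; the image URL, prompt wording, and output file name are illustrative:

```python
from openai import OpenAI

client = OpenAI()

# Send a web-page mockup and ask for HTML that reproduces it.
response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Create HTML code that reproduces this page layout."},
                {"type": "image_url", "image_url": {"url": "https://example.com/mockup.png"}},
            ],
        }
    ],
    max_tokens=1000,
)

# Save the generated markup for inspection in a browser.
with open("sample.html", "w", encoding="utf-8") as f:
    f.write(response.choices[0].message.content)
```

In practice the reply may include explanatory text around the markup, so some manual cleanup is usually needed.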
Mastering this technique could allow anyone to create web pages or develop SaaS products.
Obtaining Calculation Results Based on Images Containing Characters or Symbols
Third, it can obtain calculation results based on images containing characters or symbols.
For example, you could take a photo of a school assignment, upload the image, and enter "Create the answer." This could allow anyone to solve math problems.
It reportedly makes mistakes when processing complex problems with a large amount of information, but it can provide accurate answers for relatively simple tasks.
In the future, the ability to memorize answers will be less important than the fundamental ability to think mathematically.
Controlling Calorie Intake Using Images of Meals
Fourth, it can be used to control calorie intake using images of meals.
Similar features are already implemented in some smartphone apps, but you can upload photos of your meals to calculate your calorie intake.
With the recent popularity of health and fitness, this is a common and accessible use case for many people.
Identifying the Location of Objects in Images
Fifth, it can identify the location of objects in images.
It can identify where an uploaded image was taken.
This requires recognizable landmarks to pinpoint a location, such as Tokyo Tower, but it can be useful for tourism and travel.
How to Use GPT-4V
As mentioned earlier, you need to subscribe to ChatGPT Plus, the paid plan for ChatGPT, to use GPT-4V.
It costs around $20 per month. If you see practical applications, consider subscribing.
Points to Note When Using GPT-4V
Information May Not Always Be Accurate
First, the information you receive may not always be accurate.
This applies to AI in general: the information AI outputs is simply the most plausible response, and it may contain errors.
While it's easy to use, always be cautious about the reliability of the information.
Risk of Copyright and Intellectual Property Infringement
Second, there's a risk of copyright and intellectual property infringement.
Because GPT-4V lets you handle images and audio, even greater caution is required than with text alone.
Be especially careful when using it in business or work settings.
Currently, It Cannot Create Content Such as Images
Third, it cannot currently create content such as images.
Since it's often grouped with generative AI, some people ask, "Can it also create images?" GPT-4V can only read image and audio data; it cannot create new content from them.
For generating images, audio, and music, we recommend using dedicated generative AI tools. If you want to generate images, be sure to check out our service, "Fotographer AI."
Summary
We've introduced an overview of GPT-4V, the latest multimodal AI.
Hopefully, this article has improved your understanding of the differences between the AI models in ChatGPT, what GPT-4V can do, and its potential uses, and will support your use of generative AI.
Thank you for reading to the end.

Design your Dreams, Magically.
An AI image synthesis tool that anyone can intuitively use in the browser.