Lesson 06 · ChatGPT Mastery · Pro · ~11 min · 3 modalities

Multimodal ChatGPT: the Swiss Army knife.

ChatGPT isn't a text-only tool anymore: it has eyes, a browser, and a sketchpad. It reads images you upload, searches the live web, and generates visuals. Most people still type at it like it's 2023. Here are three places where breaking out of text saves real time, and exactly where each one falls short.

The mental model

Three capabilities you keep forgetting it has.

It's not that the capabilities are hard — it's that text is the default, so the others never cross your mind. Keep these three on the tip of your tongue:

📎 It can see · read any image you upload
🌐 It can browse · pull live info from the web
🎨 It can draw · generate visuals from a description

Try it · the input bar you ignore

Tap each tool. See what it unlocks.

This is the ChatGPT composer. The buttons to the left of the text box are the multimodal tools most people never touch. Tap each one to see what it does and where it trips up.

Your call · which input?

Pick the right tool for the job.

Four tasks. Reach for the capability that actually fits — sometimes that's still just typing.

Run one multimodal day

For a single workday, deliberately use a non-text capability every time the chance comes up. Whiteboard after a meeting → photograph and upload it. Need a current number → force web search. Need a visual for a deck → generate it. Count how often it actually helped — that's how the habit forms.

What you can do now

Upload images and ask ChatGPT to read, transcribe, or analyze them
Force web search when freshness matters — and skip it for evergreen questions
Generate work-appropriate visuals with ChatGPT's built-in image generation
Know each feature's blind spots: handwriting, latency, text-in-images and faces
Stop defaulting to text when another input would be faster

Up next in ChatGPT Mastery

Lesson 7 · ChatGPT Agent — autonomous task execution

The agentic mode that absorbed Operator and Deep Research into one system: what it can do on its own, and the oversight that keeps it safe.