:微软166页论文解读 GPT (English)

Generated: 2026-06-23 12:43:07

---

Last weekend, I holed up in my study, the 166-page GPT-4V paper spread out before me. I downed three cups of coffee, and my eyes were practically crossed. You might think I'm crazy—but halfway through, I slapped my thigh and yelled, "So that's how it works!" My wife pushed the door open, saw my screen covered in circles and arrows, and a pile of test screenshots on the desk. She sighed, "What crazy thing did you learn this time?"

Don't get me wrong—those 166 pages were well worth it. I spent three nights reading, pausing every so often to open the vision version of ChatGPT Plus and run side-by-side tests, like zeroing in a new weapon. Today, I'm not going to summarize the paper's structure—that would be boring. Instead, let's dive straight into the mind-blowing takeaways I got, along with the pitfalls I stumbled into. After you hear this, I guarantee you'll want to try it yourself right away.

---

First, don't confuse LMMs with LLMs.

The paper hits you with a counterintuitive point right off the bat: You might think GPT-4V is just ChatGPT with eyes, right? Wrong! Large Multimodal Models (LMMs) and Large Language Models (LLMs) are two different beasts. At its core, GPT-4V is a giant language model (LLM) as the brain, connected to a visual encoder as the eyes, trained on massive amounts of image-text pairs. So it has both the reasoning and chat abilities of an LLM and the ability to understand images.

But think about it—it processes images by breaking them into token sequences, not by staring at pixels the way we do. This means it inherently has limits when it comes to tiny details. I never realized this before. Once, I threw in a photo of densely packed parts and asked it to count how many there were. It gave me a bunch of wrong numbers. At the time, I blamed the model. After reading the paper, I realized—that's not its strong suit! The paper itself admits that counting and dense captioning in complex scenes are still weak points. So was it my fault? Actually, I was the one trying to use a cannon to kill a mosquito.

Speaking of which, I can't help but shout: The model's boundaries aren't an excuse for you to slack off—they're a signal for you to adjust your approach!

---

Second, the input method is way more flexible than you think—I almost screamed the first time I tried it.

How do you normally use the vision feature? Either you send text only or images only, right? The part that blew my mind in the paper is that you can interleave images and text in the input! What does that mean? Well, if you drop in a screenshot of a webpage or a PDF with images, and ask it a question based on the whole thing, it can understand the relationships between images and text in the page layout.

I tried it right away: I took a screenshot of a product manual—with a table, a diagram, and printed text—and asked, "What is the normal operating temperature range?" It instantly pulled the numbers from the screenshot, scarily accurate.

Later, I gave it a hand-drawn flowchart, with handwriting as messy as a doctor's prescription. It misread two branch labels. Section 4.4 of the paper, which talks about scene text and document reasoning, is very honest: tasks like these require more advanced prompting techniques, like asking the model to think step by step. Guess what? I added "Please analyze each line of text step by step" at the end of my prompt, and the accuracy shot up! See, just one sentence made a world of difference.

So don't just complain about the model—try different ways of talking to it.

---

Third, visual reference prompting—this is the single most important skill for anyone doing analysis, period!

Section 5 of the paper is dedicated to this. Simply put, you draw circles, arrows, or even write words on the input image, and then ask the model to focus on those marks. I made a mistake at first: I thought just saying "Look at the person in the upper left" was enough. But the model's understanding of spatial orientation didn't match mine, and it often gave irrelevant answers.

Later, I opened the screenshot in PowerPoint, drew a red circle around the target area, and asked, "What color is the person inside the circle wearing?" It got it right on the first try! You've basically given it a laser pointer.

The paper also gives a classic example: a map with several dots drawn by the user, asking GPT-4V to plan a route. That trick is powerful—ambiguity in multi-turn conversations vanishes instantly. Later, when I was doing competitive analysis, I took screenshots of three apps' interfaces side by side, used arrows to mark differences in the payment buttons, and asked, "What are the differences in the payment flows among these three interfaces?" It directly pointed to the areas I marked and compared them one by one. Efficiency went through the roof. After finishing that analysis, I was stunned: I had wasted so much time before going in circles with the model!

---

Fourth, multimodal chains and multimodal plugins—this is the most engineering-relevant part, and it feels straight out of the future.

Section 10 got my blood pumping: linking GPT-4V with search engines, image generators, and object detectors so they work together. The example that excited me most was Bing Image Search: you give it a photo from after the 2023 earthquake, and it doesn't know where it is. But if you hook it up to the search plugin, it can immediately reverse-image-search and locate it to İzmir, Turkey. That's like an external plug-in brain! When it doesn't know something, it goes and looks it up.

I couldn't try that plugin myself (no access), but I found a workaround: first use GPT-4V to identify landmarks or text in the image, get the description, then manually copy it to a search engine to verify. In other words, I created a manual version of a dual-modal chain. It's slow, but it made me fully understand what the paper means by "the plugin returns multimodal information to the model, and the model processes it." If this could be automated in the future, things like accident scene analysis or news photo verification could be done in seconds.

The paper also mentions a multimodal chain based on an extended ReAct framework. Honestly, I don't have the ability to build that myself yet, but the direction is clear: GPT-4V can serve as the hub for multimodal reasoning, pulling in various visual tools and knowledge sources to work together. Just imagine how awesome that would be!

---

Fifth, let me share my thoughts on the overall style of the paper—it's like an experimental photo album that I couldn't stop reading.

The whole thing is 29,618 words with 124 images. Reading it felt less like reading a paper and more like flipping through a photo album with experiment records: each case gave the input, the output, and the researchers' sharp commentary. My favorites were Section 3.4 (zero-shot and few-shot examples in complex scenes) and Chapter 9 (application scenarios). What's Section 3.4 about? It's about teaching the model new tasks just by giving a few examples, without changing the model parameters—that's the hardcore skill of a prompt engineer! And Chapter 9 is even more explosive: assisting with medical report understanding? Yes. Automated grading in education? Yes. Interpreting creative design sketches? Also yes. Since I work in data analysis, I immediately used GPT-4V to interpret whiteboard photos from meetings. It extracted key nodes and arrow flows pretty well—sometimes missing a connection, but more than enough for a first draft!

And the paper is not all fluff; it also lays out the failures: for example, it's easily fooled by counterfactual samples (a picture that looks like a koala but is actually a bear), and it sometimes stumbles on cultural jokes (like East Asian dry humor). I've encountered these myself during testing. So it's not an "almighty god"; it's a real tool that has made a fundamental breakthrough in knowledge and visual understanding, but still needs human cooperation when applied. Seeing that honesty, I almost wanted to bow to Microsoft.

---

Finally, let me give you a few advanced tips I learned the hard way—take notes!

First, always preprocess images before feeding them to GPT-4V! Crop and enlarge key text areas, or add numbered arrows with a drawing tool. It will save you half the trouble. Section 5's visual reference prompting is not just a gimmick—it works, I've tested it.

Second, manually simulating multimodal chains is worth your time. For now, it's not automated, but you can take GPT-4V's output (

:微软166页论文解读 GPT (English)

:微软166页论文解读 GPT (English)

Cael Lee

Ready to get started?