I've Tested Five "Unified" Multimodal Models — BAGEL Is the First One That Actually Works
I've got PTSD from "unified" multimodal models. Proper trauma.
You know the drill. Someone publishes a paper with "Unified Architecture" in the title, and you crack it open to find the same old Frankenstein's monster — a language model spits out a few tokens, a diffusion model generates pixels, and they're held together by some hardcoded bridging layers that feel like they were designed at 4 AM before a deadline. I've tested at least five open-source projects built on this pattern. Back in 2023, not a single one survived three turns of conversation in a real scenario. Context retention was worse than my grandmother's memory after two glasses of sherry.
So when a friend sent me the BAGEL paper link, my response was predictable: "Another Frankenstein?"
Plot twist: it's not.
The Number That Made Me Sit Up
Let's talk metrics. BAGEL scored 44.9 on the paper's intelligent image editing benchmark.
Now, chew on this — Step1X-Edit, a model that's specialised in editing tasks, scored 14.9. That's not a typo. Three times the performance.
And we're not talking about slapping a filter on or removing a watermark. This is proper reasoning-heavy stuff. Tasks like "replace the coffee with tea because it's 3 PM now." The model needs to understand cultural context (afternoon tea is a thing), grasp visual differences between coffee and tea (liquid colour, cup shape, steam behaviour), and maintain lighting consistency — where's the window light coming from? How should the reflection on the table change?
Honestly, previous open-source models couldn't touch this. They'd produce something that looked like a GCSE Photoshop project.
Right, 44.9. Let's park that number and dig into what makes this thing tick.
The Architecture: Sharing a Desk, Not Shouting Across the Hall
I spent an evening going through BAGEL's paper and codebase (github.com/bytedance-seed/BAGEL; the weights are fully open-source. And no, I wasn't using a HuggingFace mirror: last Wednesday afternoon my direct clone crawled along at 200 KB/s. Painful).
Here's what they got right.
BAGEL uses a Mixture of Transformers — 14B total parameters, 7B activated. Two Transformer experts: one handles semantic understanding, the other manages visual generation. They share self-attention layers.
This is the crucial bit. It's not the shallow interaction pattern where the understanding module extracts a few tokens and tosses them over the fence to the generation module. Every single layer is interacting.
Think of it like two people sharing a large desk, constantly glancing at each other's scratch paper. Not shouting down a corridor and mishearing half the message.
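To make the "shared desk" idea concrete, here's a toy sketch in plain Python. It is not BAGEL's code. The names (`mot_layer`, `shared_self_attention`, the `"understand"` and `"generate"` labels) are mine, the attention has no learned projections, and each "expert" is a single random matrix. The only thing it tries to show is the pattern: both kinds of token attend over one mixed sequence, then each is processed by its own expert.

```python
import math
import random

def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def shared_self_attention(tokens):
    """Every token attends over the whole mixed sequence.
    No learned Q/K/V projections, to keep the toy small."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(dim)])
    return out

def mot_layer(tokens, kinds, experts):
    """One toy layer: shared attention, then a per-kind expert transform.
    kinds[i] picks which expert matrix processes token i."""
    attended = shared_self_attention(tokens)
    out = []
    for x, a, kind in zip(tokens, attended, kinds):
        h = matvec(experts[kind], a)
        out.append([xi + math.tanh(hi) for xi, hi in zip(x, h)])  # residual + tanh
    return out

def random_matrix(rng, dim):
    return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]

rng = random.Random(0)
DIM = 4
experts = {"understand": random_matrix(rng, DIM), "generate": random_matrix(rng, DIM)}
text_tokens = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(3)]
image_tokens = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(2)]
kinds = ["understand"] * 3 + ["generate"] * 2

baseline = mot_layer(text_tokens + image_tokens, kinds, experts)

# Nudge one text token (think "the sofa colour") and rerun the layer.
edited_text = [row[:] for row in text_tokens]
edited_text[0][0] += 1.0
changed = mot_layer(edited_text + image_tokens, kinds, experts)

# The image-side outputs move, because they attended to the text tokens directly.
image_shift = sum(
    abs(a - b) for ra, rb in zip(baseline[3:], changed[3:]) for a, b in zip(ra, rb)
)
print(f"total shift in image-token outputs: {image_shift:.4f}")
```

Change a text token and the image-side outputs shift in the very same layer. No separate conditioning vector sits in between, which is the point of the desk analogy.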
I got burned by this in 2023. I was testing an open-source unified model (archived now, won't name names) and asked it to "generate an image of a cat on a sofa, and the sofa colour should match the one mentioned in our previous conversation." Disaster. The understanding module correctly extracted the "sofa colour" constraint from chat history, but the channel to the generation module was too narrow — something like a 128-dimensional conditioning vector. The colour information just... evaporated. The sofa came out a default beige.
BAGEL's shared self-attention design, in theory, demolishes that information bottleneck. Note: I said in theory. Whether it holds up in practice depends entirely on the training data.
The Data Story: Why 45 Million Video Clips Matter
This is where things get interesting.
Most multimodal models are trained on image-text pairs. One image, one caption. Clean, tidy, and boring enough to put you to sleep. But the world doesn't work like that. You watch a cooking video on YouTube: the frame moves, the steps progress, the colour in the pan shifts from pink to golden to deep brown. That's temporal signal. You read a Wikipedia article: text interspersed with diagrams, captions underneath, info boxes on the side. That's interleaved structure.
BAGEL's team did something properly grubby. They created 45 million interleaved sequences extracted from videos. How? They chopped videos into segments and used a lightweight model to generate short descriptions between every two frames — captions limited to 30 tokens, specifically capturing object motion, action transitions, and state changes. They call these "inter-frame captions," and the goal isn't object recognition. It's teaching the model visual temporal logic — water going from boiling to evaporating, a person going from sitting to standing, light shifting from dusk to night.
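Here's roughly how I picture that pipeline, as a minimal sketch rather than the team's actual code. `build_interleaved_sequence`, `truncate_caption` and `fake_captioner` are names I made up, the captioner is a stub standing in for the lightweight model, and I'm counting whitespace-separated words as "tokens" because I don't know which tokenizer they used for the 30-token cap.

```python
from typing import Callable, List, Tuple

MAX_CAPTION_TOKENS = 30  # the cap mentioned above

def truncate_caption(caption: str, max_tokens: int = MAX_CAPTION_TOKENS) -> str:
    """Keep at most max_tokens whitespace-separated words.
    Assumption: words stand in for real tokenizer tokens."""
    return " ".join(caption.split()[:max_tokens])

def build_interleaved_sequence(
    frames: List[str],
    describe_change: Callable[[str, str], str],
) -> List[Tuple[str, str]]:
    """Return [("image", f0), ("text", caption for f0 -> f1), ("image", f1), ...]."""
    sequence: List[Tuple[str, str]] = []
    for index, frame in enumerate(frames):
        sequence.append(("image", frame))
        if index + 1 < len(frames):
            caption = describe_change(frame, frames[index + 1])
            sequence.append(("text", truncate_caption(caption)))
    return sequence

def fake_captioner(before: str, after: str) -> str:
    """Hypothetical placeholder for the lightweight captioning model."""
    return f"scene changes from {before} to {after}"

clip = ["frame_000.jpg", "frame_001.jpg", "frame_002.jpg"]
sequence = build_interleaved_sequence(clip, fake_captioner)
for kind, payload in sequence:
    print(kind, payload)
```

Three frames give you five items: image, text, image, text, image. The captions sit between the frames they describe, so the text is always about what changed.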
Then there's another 20 million interleaved documents scraped from the web. Tutorials, encyclopaedia entries, DIY posts with step-by-step images. Here's the clever bit — they used a "caption-first" strategy. Before each image, they insert a concise description as a conceptual scaffold, then let the model generate.
The way I understand it, this is like giving the model a rough sketch before asking it to paint. It reduces the difficulty. You're not jumping straight from text to pixels — you're going text → structured description → pixels.
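In code, my reading of "caption-first" looks something like this. Treat it as a sketch under assumptions: the block format, `caption_first` and `placeholder_captioner` are all hypothetical, and a real pipeline would call an actual captioning model where I've put a stub.

```python
from typing import Callable, Dict, List

Block = Dict[str, str]  # {"type": "text" | "image" | "caption", "content": ...}

def caption_first(document: List[Block], caption_image: Callable[[str], str]) -> List[Block]:
    """Return a new document with a short caption inserted before every image block."""
    result: List[Block] = []
    for block in document:
        if block["type"] == "image":
            result.append({"type": "caption", "content": caption_image(block["content"])})
        result.append(block)
    return result

def placeholder_captioner(image_ref: str) -> str:
    """Hypothetical stand-in for a real image captioning model."""
    return f"a concise description of {image_ref}"

doc = [
    {"type": "text", "content": "Step 1: whisk the eggs."},
    {"type": "image", "content": "whisked_eggs.jpg"},
    {"type": "text", "content": "Step 2: pour into the pan."},
    {"type": "image", "content": "eggs_in_pan.jpg"},
]

scaffolded = caption_first(doc, placeholder_captioner)
for block in scaffolded:
    print(block["type"], "|", block["content"])
```

During training the model then sees the description immediately before the pixels it has to produce, which is the "rough sketch before painting" step.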
Honestly? This kind of data engineering has zero paper glamour. It's dirty work. But it's more effective than stacking ten extra attention layers.
Emergence: When the Model Suddenly "Gets It"
There's a figure in the paper I've stared at for way too long (Figure 5, go look it up). X-axis is training data volume, from 0 to 4T tokens. Y-axis is task performance.
Basic understanding tasks saturate around 0.18T tokens. Text-to-image generation plateaus around 0.68T.
But intelligent editing? It's flat on the floor for ages — 15, 16, 17 points. Basically random guessing. Then, somewhere after 3T tokens, it takes off. Shoots from 15 to 45.
Absolutely mental.
This is emergence in action. The model behaves like an idiot below a certain scale, then suddenly "gets it" past a critical threshold. And the ablation study shows something fascinating — if you remove the ViT tokens (the visual features handling semantic understanding), intelligent editing performance drops by 16%. This capability isn't magic. It grew out of deep coupling between understanding and generation, nourished by interleaved data.
I'll admit it — seeing this, I started to believe. Not in some grand unified narrative, but in this specific path being viable.
Real-World Testing: Hits and Misses
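BAGEL isn't perfect. 14B parameters is chunky. Only 7B are activated per token, but that saves compute, not memory: all 14B weights still have to be loaded, and at 16-bit precision that's roughly 28 GB, more than the 24 GB on my RTX 4090. So local inference likely means quantisation or offloading, and fine-tuning? Forget it. From community feedback, inference speed is tolerable: roughly 3 to 5 seconds per image, depending on resolution.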
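A quick back-of-envelope check on the VRAM question. This only counts the weights (no activations, caches or framework overhead), and the precision options are just the usual suspects, not anything the BAGEL repo promises to support. `weight_memory_gb` is my own helper.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes).
    Ignores activations, caches and framework overhead."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

TOTAL_PARAMS = 14e9  # total parameters, not the 7B activated per token
VRAM_GB = 24.0       # RTX 4090

for dtype in ("bf16", "int8", "int4"):
    needed = weight_memory_gb(TOTAL_PARAMS, dtype)
    verdict = "fits" if needed < VRAM_GB else "does not fit"
    print(f"{dtype}: {needed:.0f} GB of weights, {verdict} in {VRAM_GB:.0f} GB")
```

The thing to notice is that the total parameter count drives memory. The 7B activated figure tells you about compute per token, not about what has to sit in VRAM.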
They've put up an online demo. I threw a few cases at it.
Case one: "Turn this street photo into a rainy scene, with reflections on the road." Nailed it. The reflections were physically plausible — brightest directly under streetlamps, trailing reflections where vehicles had passed, weaker reflections near pavement edges. This is probably what it learned from all that video data — frame after frame of rainy road surfaces, watched millions of times.
Case two: "Predict what happens in the next second of this photo." It generated the next frame. The person's arm swing continued the previous trajectory, the skirt's flutter direction was consistent. It did mess up — finger details collapsed. Five fingers became six, or looked like melting candles. But the direction was right.
Case three: "Replace the plastic cup in this product photo with a glass one, maintaining the lighting." This one stumbled. The cup material changed to glass correctly, but the refractive light angles were off. It looked like a clumsy Photoshop layer overlay. Probably the training data didn't have enough physically-rendered transparent material samples.
Good, but the details need work.
What This Actually Means
The BAGEL team wrote something in their paper that stuck with me: "The key to narrowing the gap between open-source models and GPT-4o or Gemini is scaling multimodal interleaved data."
Not bigger models. Not fancier architectures. The structure and quality of the data itself.
This is actually quite a plain, unpretentious insight. If you want a model to understand the world, you feed it how the world actually works. The world isn't isolated images. The world is fluid, interleaved, causally connected: in a video, the spatula flips and the ingredients change colour in the next frame. In an encyclopaedia, the text describes a concept and the adjacent diagram illustrates it. In a conversation, the colour mentioned in the previous sentence constrains the next image generation.
That's it.
Weights, code, data construction pipeline — all open-source. github.com/bytedance-seed/BAGEL. Go have a poke around.
I need to figure out local deployment now. Whether my 4090 can handle it... we'll find out.
TL;DR
- BAGEL scores 44.9 on intelligent image editing — 3× better than specialised models like Step1X-Edit (14.9)
- Architecture uses Mixture of Transformers with shared self-attention, not shallow token-passing between modules
- Trained on 45M interleaved video sequences and 20M interleaved web documents, teaching temporal logic and structured reasoning
- Emergent behaviour after roughly 3T tokens: intelligent editing suddenly jumps from ~15 to ~45 points
- Open-source and testable now, though 14B parameters need decent hardware (a 24 GB RTX 4090 is borderline for inference)
- Real-world tests show promise on contextual edits, but struggles with fine details (fingers, transparent materials)
What's your experience with multimodal models? Have you found one that actually works in production? Drop a comment — I'm genuinely curious.
#ai #machinelearning #computervision #opensource #deeplearning #multimodal
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.