Understanding LLM Positional Encoding: RoPE

Generated: 2026-06-21 16:27:05

---

Hey, you know what? The first time I ran into the trouble with positional encoding was during a text generation experiment. Same prompt, I just tweaked the word order—and boom, the model spat out complete nonsense! I was completely stunned: "Don't you have positional encoding? How come you don't recognize me just because I changed the order?"

Later I realized: the Attention mechanism itself is "face-blind". It only cares about who is who, but has zero memory of who stands where. Give it "I chase you" and "you chase me": without positional information, plain (unmasked) self-attention computes exactly the same vector for each word in both sentences, just in shuffled order. For language models, this is a disaster, because changing the order changes everything.
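If you want to see the face-blindness with your own eyes, here's a minimal dependency-free sketch. It assumes a single attention head with identity Q/K/V projections, no causal mask, and made-up toy embeddings (the `emb` values are purely hypothetical), so treat it as an illustration rather than a real model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head attention, identity projections, no positional signal, no mask."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        weights = softmax(scores)
        outputs.append([sum(w * value[j] for w, value in zip(weights, tokens)) for j in range(dim)])
    return outputs

# Hypothetical toy embeddings, chosen only for illustration.
emb = {"I": [1.0, 0.0, 0.5], "chase": [0.0, 1.0, 0.2], "you": [0.3, 0.7, 1.0]}

out_a = self_attention([emb[w] for w in ["I", "chase", "you"]])
out_b = self_attention([emb[w] for w in ["you", "chase", "I"]])

# Compare the vector computed for "I" in the two sentences.
i_in_a, i_in_b = out_a[0], out_b[2]
print(max(abs(a - b) for a, b in zip(i_in_a, i_in_b)))  # difference is at rounding level
```

The word "I" ends up with the same representation whether it is the chaser or the one being chased, which is exactly the problem positional encoding has to solve.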

Over the years, many brilliant minds came up with various solutions. Early approaches split into two camps: absolute positional encoding and relative positional encoding. But each has its own flaws. Absolute ones (learned position embeddings, and in practice Sinusoidal too) extrapolate poorly: train on 2048 positions, suddenly throw 4096 at the model during inference, and it just "loses its memory" and the loss skyrockets. Relative ones (like T5's bias or ALiBi) were on the right track conceptually, but they add a bias term to the attention score matrix, which made them slower and more cumbersome in practice and, at the time, awkward to combine with fused kernels like Flash Attention.

Then along came RoPE—it pulls a brilliant trick: on the surface it looks like absolute positional encoding (directly operating on Query and Key), but inside it achieves the effect of relative positional encoding (the inner product only depends on the distance, not on the absolute positions). It's like holding an apple in your hand—no matter from which angle you look at it, the apple is still the apple, but your "perspective" changes. Word vectors are the same—the semantics stay, but the "viewing angle" (position) shifts.

That's exactly why RoPE has become the de facto standard for LLMs, for just three reasons: great performance, high speed, and extrapolation ability! I've personally compared them on Llama-family models—with the same parameter count, models using RoPE achieved nearly 2 points lower perplexity on long texts than Sinusoidal, and when scaling to 4× the training length during inference, the performance remained rock solid!

---

First question solved, now here's the second one: How does that rotation actually work? What is rotate_half doing in the code?

Don't be scared off by the term "rotation matrix"—it's super simple.

2D Case

Suppose a word vector has only 2 dimensions, with coordinates [x0, x1]. To inject position info at position m, you simply rotate this vector around the origin by an angle of m * θ (where θ is a preset frequency). The formula for the rotated coordinates you learned in high school:


new_x = x0 * cos(mθ) - x1 * sin(mθ)
new_y = x0 * sin(mθ) + x1 * cos(mθ)

That's it? Yes, that's it!
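Here's the 2D formula as a tiny runnable sketch. The function name `rotate_2d` and the quarter-turn value of `theta` are my own illustrative choices, not anything RoPE prescribes.

```python
import math

def rotate_2d(x0, x1, m, theta):
    """Rotate the point (x0, x1) around the origin by the angle m * theta."""
    angle = m * theta
    new_x = x0 * math.cos(angle) - x1 * math.sin(angle)
    new_y = x0 * math.sin(angle) + x1 * math.cos(angle)
    return new_x, new_y

theta = math.pi / 2  # illustrative value: a quarter turn per position
for m in range(4):
    x, y = rotate_2d(1.0, 0.0, m, theta)
    # The direction changes with m, the length stays the same.
    print(m, round(x, 6), round(y, 6), round(math.hypot(x, y), 6))
```

The vector walks around the unit circle as the position grows, but its length never changes. That's the "same apple, different viewing angle" idea in numbers.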

Higher Dimensions

But in large models, each head has 64 or 128 dimensions. How do we handle that? Very easy: chop the vector into 2D pairs, rotate each pair independently, and give each pair its own rotation angle!

For example, a 4D vector [x0, x1, x2, x3] is treated as two 2D vectors: [x0, x1] and [x2, x3]. The first pair rotates by mθ0, the second by mθ1. 64 dimensions means 32 such 2×2 rotation blocks stacked along the diagonal of one big 64×64 matrix.
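The pair-by-pair idea can be sketched without ever building the big matrix. This is a plain-Python illustration, not an optimized implementation: `rope_interleaved` is a name I made up, and it assumes the frequency schedule θ_i = base^(-2i/d) with base 10000 that this post gets to in the frequency section.

```python
import math

def rope_interleaved(x, m, base=10000.0):
    """Rotate adjacent pairs (x[0], x[1]), (x[2], x[3]), ... for position m."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("dimension must be even so it splits into pairs")
    out = []
    for i in range(d // 2):
        theta_i = base ** (-2 * i / d)  # one frequency per pair
        c, s = math.cos(m * theta_i), math.sin(m * theta_i)
        a, b = x[2 * i], x[2 * i + 1]
        out += [a * c - b * s, a * s + b * c]
    return out

x = [1.0, 2.0, 3.0, 4.0]
print(rope_interleaved(x, m=0))  # position 0 leaves the vector unchanged
print(rope_interleaved(x, m=5))  # each pair turned by its own angle
```

Each pair keeps its own length after the rotation, so the whole vector's norm is preserved too.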

Efficient Implementation: `rotate_half`

You might think: if we explicitly construct a giant rotation matrix and multiply, it would be terribly slow, right? Exactly! The first time I implemented RoPE by myself, I naively wrote a matrix multiplication—training one batch took 5 seconds. After switching to the rotate_half operation, it dropped to 0.2 seconds—dozens of times faster! Surprise!

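So what does rotate_half actually do? Let's look at the code directly (this follows the rotate_half style of the Hugging Face Llama implementation; Meta's original Llama code does the same rotation with complex numbers):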


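import torch

def apply_rotary_emb(x, cos, sin):
    # cos/sin broadcast against x and repeat each angle in both halves of the last dim
    x_half = x.chunk(2, dim=-1)  # split into first half and second half
    x_rotated = torch.cat([-x_half[1], x_half[0]], dim=-1)  # rotate_half: [a, b] -> [-b, a]
    return x * cos + x_rotated * sin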

That's it! Let me break it down: for a 2D vector [x0, x1], x_half[0] = [x0], x_half[1] = [x1], so x_rotated = [-x1, x0]. Then x * cos + x_rotated * sin gives [x0*cos - x1*sin, x1*cos + x0*sin], which is exactly the rotation formula! For higher dimensions, the same thing happens for every pair at once.

Here's a subtle detail: why does the code split into the first half and the second half, rather than pairing adjacent (even, odd) indices the way the 4D example above and Sinusoidal do? I was stuck on this question for days. Finally I figured it out: the two layouts are the same rotation up to a fixed reordering of the dimensions. With chunk(2, dim=-1), dim i in the first half pairs with dim i + d/2 in the second half (e.g., dim 0 pairs with dim d/2) instead of with its neighbor. Since Query and Key are rotated with the same layout, and the projection weights that produce them are learned, the dot product doesn't care which layout you pick. What the half split buys you is two contiguous slices instead of strided even/odd indexing. This design is ingeniously simple and also fast to compute!
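To convince yourself that the half split and neighbor pairing give the same attention scores, here's a small dependency-free sketch. All function names (`rope_adjacent`, `rope_half_split`, `to_half_layout`) and the sample vectors are hypothetical, and it assumes the usual base of 10000.

```python
import math

def angles(d, m, base=10000.0):
    """Rotation angle of every pair at position m."""
    return [m * base ** (-2 * i / d) for i in range(d // 2)]

def rope_adjacent(x, m):
    """Pairs (x[2i], x[2i+1]): the layout used in the 4D example."""
    out = []
    for i, a in enumerate(angles(len(x), m)):
        p, q = x[2 * i], x[2 * i + 1]
        out += [p * math.cos(a) - q * math.sin(a), p * math.sin(a) + q * math.cos(a)]
    return out

def rope_half_split(x, m):
    """Pairs (x[i], x[i + d/2]): what the rotate_half trick implements."""
    half = len(x) // 2
    ang = angles(len(x), m)
    cos = [math.cos(a) for a in ang] * 2  # each angle appears in both halves
    sin = [math.sin(a) for a in ang] * 2
    rotated = [-v for v in x[half:]] + x[:half]  # rotate_half(x)
    return [xi * c + ri * s for xi, c, ri, s in zip(x, cos, rotated, sin)]

def to_half_layout(x):
    """Reorder [a0, b0, a1, b1, ...] into [a0, a1, ..., b0, b1, ...]."""
    return x[0::2] + x[1::2]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0.2, -1.0, 0.7, 0.4]
k = [1.5, 0.3, -0.6, 0.9]
score_adjacent = dot(rope_adjacent(q, 3), rope_adjacent(k, 10))
score_half = dot(rope_half_split(to_half_layout(q), 3), rope_half_split(to_half_layout(k), 10))
print(score_adjacent, score_half)  # equal up to floating-point rounding
```

The two layouts differ only by a permutation of the dimensions, and a dot product is blind to a permutation applied to both sides.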

Also, head_dim must be even, otherwise you can't make pairs. I once accidentally set it to an odd number—the code immediately threw an error, and it took me half a day to find the cause. Remember this!

---

Third question: that frequency sequence formula looks familiar, but how exactly does it come about? And why does it enable long-range decay?

RoPE's frequency sequence formula is identical to Sinusoidal:


θ_i = 10000^{-2i / d}

where d is head_dim and i goes from 0 to d/2 - 1.
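To get a feel for the numbers, here's a quick sketch that evaluates the formula. The helper name `rope_frequencies` and `head_dim=128` are just assumptions for the example.

```python
import math

def rope_frequencies(head_dim, base=10000.0):
    """theta_i = base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

thetas = rope_frequencies(head_dim=128)
for i in (0, 16, 32, 48, 63):
    wavelength = 2 * math.pi / thetas[i]  # positions needed for one full turn
    print(f"pair {i:2d}: theta = {thetas[i]:.6f}, wavelength = {wavelength:,.1f} tokens")
```

The first pair completes a full turn in about 6 tokens, while the last pair needs tens of thousands of tokens, so the pairs really do cover very different distance scales.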

You might think: this formula can be replaced with any other number, right? What if I use θ_i = 1000^{-2i/d}? I'll tell you—I tried it myself, and it didn't work! The performance was far worse! The model trained with that had no sense of distance at all—long-range and short-range dependencies got mixed up, a complete mess.

So why 10000, and why does this formula give long-range decay? Because it creates a multi-scale distribution in the frequency domain: low frequencies (small angles per step) handle long-range dependencies, high frequencies (large angles per step) handle short-range dependencies. It's like tuning a radio: low-frequency signals travel far, high-frequency signals fade quickly. Each dimension pair is sensitive to a different "distance scale," covering everything from short to long. The base sets how long the longest scale is: with 10000 the slowest pair needs tens of thousands of tokens for one full turn, with 1000 far fewer, which is why some long-context models raise the base rather than lower it. As a result, the farther apart two positions are, the more the pairs' rotations drift out of phase, and the dot product tends to shrink (it oscillates, but its overall envelope goes down). That's the secret of long-range decay.
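You can watch the decay happen with a toy calculation. This sketch assumes Query and Key are both all-ones vectors, a deliberately artificial choice that makes each pair contribute 2 * cos(distance * θ_i). Real learned vectors behave less cleanly, so read it as a trend, not a guarantee. The names `relative_score` and `mean_score` are hypothetical.

```python
import math

def relative_score(distance, head_dim=128, base=10000.0):
    """q . k after RoPE when q and k are both all-ones vectors.

    Each pair contributes 2 * cos(distance * theta_i), so the score depends
    only on the distance between the two positions.
    """
    return sum(2 * math.cos(distance * base ** (-2 * i / head_dim))
               for i in range(head_dim // 2))

def mean_score(start, width=8):
    """Average the score over a small window to smooth out the oscillation."""
    return sum(relative_score(start + j) for j in range(width)) / width

for start in (1, 10, 100, 1000):
    print(f"distance ~{start:5d}: mean score {mean_score(start):7.2f}")
```

At distance 0 every pair is perfectly aligned and the score equals head_dim. As the distance grows, the fast pairs fall out of phase first and only the slow ones keep contributing.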

Ultimately, this formula wasn't cooked up out of thin air. It's inherited from Sinusoidal positional encoding, but RoPE gives it new life: originally Sinusoidal was just added to the embedding, but now it's multiplied in via rotation. That keeps the multi-scale frequencies, leaves each vector's length untouched (rotations are orthogonal), and makes the inner product depend on relative position.

---

Final question: How does RoPE ensure that the dot product result only depends on the relative position, not the absolute position?

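This is the most brilliant part of RoPE! Think about it: two words at positions m and n. After rotation, their Query and Key become rotated versions. When you compute their dot product, **because the rotation matrices are orthogonal, the two rotations collapse into a single one: R(m)ᵀ R(n) = R(n - m). The score (R(m) q) · (R(n) k) = qᵀ R(n - m) k therefore depends only on the offset n - m, not on m or n themselves.**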
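The relative-position property is easy to check numerically. Here's a minimal plain-Python sketch: the helper `rope`, the sample vectors, and the chosen positions are all made up for illustration, and it assumes adjacent-pair rotation with base 10000.

```python
import math

def rope(x, m, base=10000.0):
    """Rotate adjacent pairs of x by m * theta_i (pure-Python sketch)."""
    d = len(x)
    out = []
    for i in range(d // 2):
        angle = m * base ** (-2 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [x[2 * i] * c - x[2 * i + 1] * s, x[2 * i] * s + x[2 * i + 1] * c]
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0.5, -1.2, 0.3, 0.8]
k = [1.0, 0.4, -0.7, 0.2]

# Same offset n - m = 4 at three different absolute positions.
scores = [dot(rope(q, m), rope(k, m + 4)) for m in (0, 10, 1000)]
print(scores)  # the three values agree up to floating-point rounding

# A different offset gives a different score.
other = dot(rope(q, 0), rope(k, 5))
print(other)
```

Shifting both positions by the same amount leaves the score alone; only changing the gap between them moves it.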


Cael Lee
