Revisiting Position Encoding in Large Models and Its Extrapolation

Generated: 2026-06-20 14:34:54

---


Bro, feeding a 4k model 32k tokens? It's not that mystical

Don't walk away just yet, hear me out. Have you ever been asked this: "Your model was only trained on 4k text, now what makes you think you can feed it 32k?"

Eh, honestly, a few years ago I'd have been completely dumbfounded. Back then it was the BERT era, and position encoding was like a fixed gear: exceed the training length and the model would just die on you, perplexity (PPL) rocketing faster than a SpaceX launch.

But not anymore, folks! Look at GPT-4 Turbo boasting 128k, Baichuan 2 going straight to 192k—that's practically half a novel you can stuff in. Why? It's all about the extrapolation of position encoding!

Speaking of which, let me set the stage. I've been doing model deployment for a few years now, and the pitfalls I've fallen into outnumber the grains of salt I've eaten. From the early days where absolute position encoding would blow up your VRAM at length 512, to now using RoPE trained at 4k and easily extrapolating to 8k—I've leveled those potholes for you. Today, I'm going to dump everything I know, everything I've personally tested, and even some of my hot takes, right into your lap, hoping it clicks for you if you're wrestling with this issue.

Act 1: Absolute Position Encoding—Clumsy, Stupid, and Prone to Blowing Up

Let's start from the root.

The original Transformer used sinusoidal position encoding. Basically, it gave each position a fixed ID card and just added it to the word vector. When I first saw this thing years ago, I thought it was kind of clever. I figured, sure, positions far beyond the training length have never been seen, but in theory you can still compute their encodings.

Later? Later I was like, screw that!

As soon as I ran experiments, once I exceeded the training length, PPL would skyrocket like it was on steroids. Completely unusable.

Where's the root cause? The root is that this thing is added in! Think about it: you're forcefully mashing position info with word vectors, then doing dot products—the influence of that absolute position can never be cleanly eliminated.
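To make the "added in" problem concrete, here's a rough expansion of a single attention score under additive position encoding. The symbols are my own shorthand, not from any particular paper: x_m is the word vector at position m, p_m is its position vector, and W_Q, W_K are the query and key projections.

```latex
\begin{aligned}
s_{mn} &= \big(W_Q (x_m + p_m)\big)^{\top} \big(W_K (x_n + p_n)\big) \\
       &= x_m^{\top} W_Q^{\top} W_K\, x_n
        + x_m^{\top} W_Q^{\top} W_K\, p_n
        + p_m^{\top} W_Q^{\top} W_K\, x_n
        + p_m^{\top} W_Q^{\top} W_K\, p_n
\end{aligned}
```

The two middle terms tie content to the absolute positions m and n separately, and even the last term passes through W_Q^T W_K, so it is no longer a clean function of m - n.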

Let me give you an example: the word "apple" at the start of a sentence versus in the middle. Because the position is different, its query and key both change flavor, and the attention score gets all messed up by the absolute position. Later I read the TENER paper, and one point stuck with me (paraphrasing from memory): the long-range decay of sinusoidal encoding pretty much disappears once it goes through attention.

And that's just one issue. Absolute position encoding also has a fatal flaw: it can't extrapolate. During training, if you set the max length to 512, the model only sees ID cards for those 512 positions. Push the position index past that during inference and, bam, PPL explodes, worse than random guessing. So later, mainstream models largely moved away from it and pivoted to relative position encodings. Learnable absolute embeddings fare even worse here: beyond the training length there is simply no trained embedding for that position, so the model just freaks out.

The Turning Point: RoPE Arrives and Breaks Out!

The first time I saw RoPE (Rotary Position Embedding) was in Jianlin Su's RoFormer paper. Back then, I was up to my ears in various relative position encoding methods (T5's bucketed mapping, Transformer-XL's segment-level recurrence), each more complicated than the last. RoPE's approach was completely different: it twists the position info directly into the query and key using rotation matrices.

That idea? Freaking beautiful!

Traditional methods "add" a position vector to the embedding; RoPE instead "rotates" the whole vector. How? Using complex number multiplication! Specifically, you break a high-dimensional vector into pairs, rotate each pair by an angle that's linearly related to the position. Then when two tokens' query and key do their inner product, the result naturally only depends on their relative position—absolute position cancels out cleanly.
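If you want to see the cancellation on paper, here's a sketch for a single pair of dimensions, treating the pair as one complex number. The notation is mine: q and k are the complex versions of one query pair and one key pair, m and n are their positions, and θ is that pair's rotation frequency.

```latex
\tilde q_m = q\, e^{\mathrm{i} m \theta}, \qquad
\tilde k_n = k\, e^{\mathrm{i} n \theta}
\\[4pt]
\langle \tilde q_m, \tilde k_n \rangle
  = \operatorname{Re}\!\big[\tilde q_m\, \overline{\tilde k_n}\big]
  = \operatorname{Re}\!\big[q\, \bar k\; e^{\mathrm{i}(m-n)\theta}\big]
```

Only m - n survives. Summing over all pairs gives the full dot product, so the whole attention score depends on relative position alone.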

Once in a team meeting, a colleague asked me: "What's the fundamental difference from sinusoidal encoding?"

I said: "Sinusoidal uses addition; RoPE uses multiplication. Addition creates a cross term during the attention dot product that prevents absolute position from being fully eliminated. Multiplication (rotation) preserves the inner product; the difference in rotation angles is exactly the relative position. It inherently turns absolute position into relative position."

I even wrote a little demo at the time: put the same sentence at the beginning and in the middle of a sequence, then compute their attention scores. Guess what? Under RoPE, the two scores were identical! Under sinusoidal, they were a whole chunk apart. From that moment on, I was a true believer.
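Here's a minimal NumPy sketch in the spirit of that demo, not the original script. Assumptions to flag: random vectors stand in for real queries and keys, there are no learned W_Q/W_K projections, and the helper names (rope, sinusoidal, rope_score, additive_score) are made up for illustration. It compares the raw score for the same relative offset (5) at the start of a sequence and 1000 tokens in.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of x by pos * theta_i (a simplified RoPE)."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)  # theta_i = base^(-2i/d)
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

def sinusoidal(pos, d, base=10000.0):
    """Classic sin/cos position vector that gets added to the embedding."""
    theta = base ** (-np.arange(0, d, 2) / d)
    pe = np.empty(d)
    pe[0::2] = np.sin(pos * theta)
    pe[1::2] = np.cos(pos * theta)
    return pe

rng = np.random.default_rng(0)
d = 64
q, k = rng.normal(size=d), rng.normal(size=d)  # stand-ins for a query and a key

def rope_score(m, n):
    return rope(q, m) @ rope(k, n)

def additive_score(m, n):
    return (q + sinusoidal(m, d)) @ (k + sinusoidal(n, d))

# Same relative offset, two different absolute locations.
rope_gap = abs(rope_score(0, 5) - rope_score(1000, 1005))
add_gap = abs(additive_score(0, 5) - additive_score(1000, 1005))
print(f"RoPE gap: {rope_gap:.2e}   additive gap: {add_gap:.2e}")
```

The RoPE gap should sit at floating-point noise, while the additive gap is clearly nonzero because of the cross terms between content and absolute position.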

But! But hold on, bro, there's a huge pitfall here you absolutely need to remember.

Although RoPE can theoretically encode absolute positions of arbitrary length (because the rotation angle is a continuous function), in practice? I tried directly using a RoPE model trained at length 4k to do inference at 8k, and PPL jumped from 7 to 15. It still collapsed. So RoPE is not inherently an extrapolation powerhouse; it's just better than absolute encoding, and still a long, long way from unlimited extrapolation. Even Jianlin Su didn't oversell that point in his paper, but a lot of blog posts do, claiming RoPE can extrapolate infinitely. That's the biggest trap!

The Extrapolation Problem: Why RoPE Collapses Too

To get to the bottom of this, I spent quite a bit of time. I carefully analyzed the attention distribution and finally found: The problem is that when the position index exceeds the training range, the rotation angles in certain dimensions become too large. After the vector is rotated, the model just doesn't recognize it anymore.

Think about the RoPE angular frequency design: for pair index i (i = 0, 1, ..., d/2 - 1), the frequency is θ_i = base^{-2i/d}, with base usually 10000. Here's the pattern: small i means high frequency (fast rotation), large i means low frequency (slow rotation). During training, the farthest position the model sees is L_train. By that position, the high-frequency dimensions have already rotated through many full circles, while the lowest-frequency ones haven't even finished one. When inference goes far beyond L_train, the high-frequency dimensions just keep spinning, and the low-frequency dimensions drift into angles that never showed up during training. The result is that the model can neither use the low frequencies to tell ultra-long-range positions apart (their angles change slowly and are now in unfamiliar territory) nor trust the high frequencies, so attention stops working properly.
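To put rough numbers on the frequency picture, here's a small sketch that counts how many full turns each dimension pair completes by the end of training. The head dimension d = 128 and L_train = 4096 are assumptions picked for illustration, and rope_frequencies is a hypothetical helper name.

```python
import numpy as np

def rope_frequencies(d, base=10000.0):
    """theta_i = base^(-2i/d) for pair index i = 0 .. d/2 - 1."""
    i = np.arange(d // 2)
    return base ** (-2.0 * i / d)

d, L_train = 128, 4096                   # assumed head dim and training length
theta = rope_frequencies(d)
turns = L_train * theta / (2 * np.pi)    # full rotations completed by position L_train

full = int((turns >= 1.0).sum())
print(f"fastest pair (i=0):  {turns[0]:.1f} turns")
print(f"slowest pair (i={d // 2 - 1}): {turns[-1]:.3f} turns")
print(f"{full} of {d // 2} pairs complete at least one full turn within L_train")
```

The pairs that never finish a turn are the ones whose angles land in unseen territory once positions go past L_train.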

I ran an experiment: took a LLaMA model trained at length 4k (RoPE base=10000), fed it 8k text during inference, and looked at the attention score heatmap. Sure enough, the attention in the second half suddenly became very uniform—almost no local focus visible. What does that mean? It means the model could no longer tell who's who!

Later, I read the PI (Position Interpolation) paper, which proposed a very direct approach: scale the position indices. For example, if you trained at 4k and want to run at 8k, change position m to m/2. That way, all positions' rotation angles stay within the training range. The idea is straightforward, but there's a big catch: you must fine-tune! Without fine-tuning, direct scaling made PPL worse by over 40% (I tested it myself, and it was brutal). The upside is that fine-tuning is cheap, just a few hundred steps, because after scaling the model only needs to adapt to a denser spacing of position indices; the relative order hasn't changed.
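A quick sketch of what the scaling does to the angles, under assumed sizes (d = 128, 4k training length, 8k target). rope_angles is a made-up helper, and this only covers the position arithmetic, not the fine-tuning step.

```python
import numpy as np

def rope_angles(positions, d, base=10000.0, scale=1.0):
    """Angle (position * scale) * theta_i for every position and pair index."""
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    return np.outer(positions * scale, theta)

d, L_train, L_target = 128, 4096, 8192   # assumed sizes for illustration
theta = 10000.0 ** (-2.0 * np.arange(d // 2) / d)
positions = np.arange(L_target)

plain = rope_angles(positions, d)                              # no scaling
interp = rope_angles(positions, d, scale=L_train / L_target)   # PI: m -> m / 2

limit = L_train * theta   # per-pair angle bound implied by the training length
print("unscaled angles exceed the training bound:", bool((plain.max(axis=0) > limit).all()))
print("interpolated angles stay inside the bound:", bool((interp.max(axis=0) < limit).all()))
```

Every even position 2m lands exactly on the angle that position m had during training; odd positions fall halfway between two trained positions, which is the denser spacing the model has to get used to.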

First Breakthrough: NTK-Aware Scaled RoPE—This Thing Has Something!

This all starts with NTK (Neural Tangent Kernel). NTK-Aware Scaled RoPE—the name is long and scary, but the idea is actually simple: **instead of changing position indices, just change

Cael Lee
