Without Large-Scale Compute, What LLM Research Is Still Worth Doing as a Student?

Generated: 2026-06-23 00:19:47

---

Have you ever seen one of those labs? Racks of GPUs lining the walls, servers humming away: walking in feels like stepping into a chicken farm.

But you? You don't even have one decent card. Late at night, scrolling through your feed, you see a classmate showing off the experiments running on their A100s, and your heart just sinks: "That's it, I'll never make it into a top conference."

Let me tell you: I was in exactly the same boat. But later I discovered something: people without GPUs actually fight harder.

Think about it. Those big shots with hundreds of GPUs at their disposal: do they have the patience to really dig into the principles? They just throw more compute at the problem and brute-force it. But you? You have to be meticulous and solve problems from the ground up. Necessity forces this skill set on you, it is fundamental, and no one can take that kind of real expertise away from you.

Today I'll lay out the directions that need almost no compute and that even a student can use to break into top conferences. No hype, just the real, muddy pits that I and the people around me have climbed out of ourselves.

---

1. No training? So what can you even research?

A lot of people fall into the same mental trap: believing that working on LLMs requires the full pipeline of "pretraining → fine-tuning → RL → large-scale evaluation." No GPU? No chance.

Wrong. Completely wrong!

LLM research actually splits into two paths:

Scaling model capabilities: bigger models, more data, stronger post‑training — that's the big companies' turf. If you have no GPU, keep away.
Understanding and controlling models: Why does the model answer this way? Where did it go wrong? Can we fix it cheaply?

The second path is tailor‑made for you. You don't need to train. Just take an open‑source model, run inference, extract activations, and do analysis.

Let me tell you a true story.

I once dissected Llama-7B on a single 3090. I used SAEs (sparse autoencoders) to decompose the tangled soup of representations in its hidden layers into interpretable features. That day I found a feature that responded specifically to the "color" attribute, and I could hardly believe it. Later I locked onto another feature that tracked "negative sentiment." Then I ran a crazy experiment: during inference I forcibly flipped that feature's activation to the opposite direction. Guess what? Given a scathing movie review, the model now read it as if it were full of praise.

I posted a screenshot on Twitter. Within two hours, several research groups were asking me how to reproduce it.
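To make the steering experiment concrete, here is a minimal NumPy sketch of the two pieces involved: a ReLU SAE that encodes a hidden-state vector into non-negative feature activations, and a steering step that changes one feature and writes the difference back along that feature's decoder direction. It assumes a standard ReLU SAE design; the weights are random placeholders rather than a trained dictionary, and `SparseAutoencoder` and `steer` are hypothetical names. In a real run, `x` would be a residual-stream activation captured with a forward hook during inference.

```python
import numpy as np


class SparseAutoencoder:
    """ReLU sparse autoencoder over d_model-dimensional hidden states.

    Weights are random placeholders; a real SAE is trained to reconstruct
    activations under an L1 penalty that keeps feature activations sparse.
    """

    def __init__(self, d_model: int, n_features: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(0.0, 0.1, size=(d_model, n_features))
        self.b_enc = np.zeros(n_features)
        # Each decoder row is a unit-norm feature direction in hidden-state space.
        W_dec = rng.normal(0.0, 1.0, size=(n_features, d_model))
        self.W_dec = W_dec / np.linalg.norm(W_dec, axis=1, keepdims=True)
        self.b_dec = np.zeros(d_model)

    def encode(self, x: np.ndarray) -> np.ndarray:
        # Non-negative feature activations for one vector or a batch of rows.
        return np.maximum(0.0, (x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f: np.ndarray) -> np.ndarray:
        # Reconstruct hidden states from feature activations.
        return f @ self.W_dec + self.b_dec


def steer(sae: SparseAutoencoder, x: np.ndarray, feature_idx: int, new_value: float) -> np.ndarray:
    """Add (new_value - current activation) times the feature's decoder direction to x."""
    f = sae.encode(x)
    delta = new_value - f[..., feature_idx]  # scalar, or one value per row
    # Only the change is added back, so whatever the SAE fails to
    # reconstruct in x is left untouched.
    return x + np.multiply.outer(delta, sae.W_dec[feature_idx])
```

To "flip" a feature as in the story, pass its negated activation, e.g. `steer(sae, x, j, -sae.encode(x)[j])`, which subtracts twice the feature's current contribution along its decoder direction. Adding only the delta, instead of replacing `x` with the SAE's reconstruction, avoids injecting the SAE's reconstruction error into the model.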

The key point: the compute requirement was ridiculously low. Running inference on a small 7B model and storing a few layers' activations added negligible memory, and the follow-up analysis could run entirely on a CPU. The most impressive master's student I know ran all his SAE experiments on a free Google Colab T4, open-sourced everything, and ended up with an EMNLP 2025 Oral.
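A back-of-the-envelope check of why the footprint stays small. The sketch assumes fp16 storage (2 bytes per value) and Llama-7B's hidden size of 4096; the 512-token prompt and the three cached layers are arbitrary illustration values, and the helper names are hypothetical.

```python
def gib(n_bytes: float) -> float:
    """Convert bytes to GiB (2**30 bytes)."""
    return n_bytes / 2**30


def weight_memory_bytes(n_params: float, bytes_per_param: int = 2) -> float:
    # fp16/bf16 weights take 2 bytes per parameter.
    return n_params * bytes_per_param


def activation_memory_bytes(n_tokens: int, d_model: int, n_layers_cached: int,
                            bytes_per_value: int = 2) -> int:
    # One d_model-sized residual-stream vector per token per cached layer.
    return n_tokens * d_model * n_layers_cached * bytes_per_value


print(round(gib(weight_memory_bytes(7e9)), 2))        # 13.04 GiB of weights, fits a 24 GB RTX 3090
print(activation_memory_bytes(512, 4096, 3) / 2**20)  # 12.0 MiB of activations per prompt
```

The weights dominate; the cached activations for one prompt are three orders of magnitude smaller and can be moved to CPU memory for analysis.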

Zero cost. Top conference. Can you believe it?

---

2. Just analysis, no training—will reviewers buy it?

Yes, they will. But you have to avoid one huge pitfall: never just "discover" without "validating."

Many newcomers fall into the exact same pattern: extract 5000 activations, cluster them, find a bunch of "concept neurons," and then excitedly write the paper: "Wow, look at all these functions inside the model!" The reviewer shoots it down: "So what? What's the use?"

I got criticized like that too. It hurt. Then I learned: a good interpretability story must have the triple package – explanation + intervention + validation.

Take SAE again. You find a feature that corresponds to "sentiment":

During normal inference, the model outputs "I hate you," and the sentiment feature score is extremely high.
You forcibly lower (or even reverse) that feature's activation, and the output sentiment becomes milder or even neutral.
Then you use a standard sentiment classifier as a judge to quantitatively compare the difference before and after the intervention.
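One way to make the third step quantitative: score each prompt's output with the judge classifier with and without the intervention, then report the mean shift together with a paired permutation test. This is a generic statistics sketch, not the method of any specific paper; `intervention_effect` is a hypothetical helper, and the scores are assumed to come from whatever sentiment classifier you use as the judge.

```python
import random


def intervention_effect(before, after, n_perm=10_000, seed=0):
    """Mean shift in judge scores and a two-sided paired permutation p-value.

    before[i] and after[i] are the judge's scores (e.g. probability of
    "negative") for the same prompt without and with the intervention.
    """
    if len(before) != len(after) or not before:
        raise ValueError("need equal-length, non-empty score lists")
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    observed = sum(diffs) / n
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_perm):
        # Under the null hypothesis the intervention has no effect, so each
        # paired difference is equally likely to have either sign.
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if abs(permuted) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_perm + 1)
    return observed, p_value
```

Pairing matters here: each prompt serves as its own control, so prompt-to-prompt variance in sentiment does not drown out the effect of the intervention.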

Closed loop! The reviewer sees: "Oh, you didn't just find the feature—you can actually control the model's behavior." Scores go through the roof.

An old collaborator of mine at ETH used exactly this strategy, "activation steering + systematic validation," to produce a beautiful piece of work that landed an EMNLP 2025 Oral. He used only 6 shared V100s from the lab. Every intervention was applied at inference time; he never trained a model once.

---

3. With limited compute, which direction is easiest to get results now?

If I had to choose my direction again today, I'd bet on these three without hesitation. I've put real effort and money into testing each one: low resources, high returns.

Direction A: KV‑Cache Optimization – industry needs it so badly they'll call you at midnight

In long-text inference, the KV-cache is the main memory bottleneck. When the model reads a whole novel, it has to keep keys and values for everything in the first hundred chapters, and memory explodes. If you can design an algorithm that dynamically discards unimportant historical tokens (e.g., the scenery descriptions in Chapter 1) and keeps only the core characters and their relationships, which company wouldn't want that?
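The arithmetic behind "memory explodes": every token stores one key and one value vector per layer and per key/value head. The sketch assumes fp16 and uses the published shapes of Llama-2-7B (32 layers, 32 KV heads, head dimension 128) and Llama-3-8B (32 layers, 8 KV heads via grouped-query attention, head dimension 128); `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2, batch=1):
    """Size of the KV-cache in bytes (fp16 by default)."""
    # Factor 2: one key vector and one value vector per KV head, layer and token.
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value


print(kv_cache_bytes(1, 32, 32, 128) / 2**20)        # 0.5 MiB per token (Llama-2-7B shapes)
print(kv_cache_bytes(131_072, 32, 32, 128) / 2**30)  # 64.0 GiB at 128k tokens
print(kv_cache_bytes(131_072, 32, 8, 128) / 2**30)   # 16.0 GiB with 8 KV heads (Llama-3-8B shapes)
```

At book length the cache alone outgrows a single consumer card, which is why every dropped entry translates directly into saved memory.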

The crucial thing? You never need to touch the model parameters! You only change the cache-eviction (equivalently, attention-masking) strategy at inference time. Grab a 7B model, run perplexity and long-text benchmarks on a single card, tweak the dropping strategy, and your experimental data alone can support a systems-level paper.
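A minimal sketch of one possible dropping strategy: always keep a window of recent tokens, and spend the rest of the budget on older tokens that have received the most attention, a heavy-hitter heuristic in the spirit of H2O. This swaps in attention-score ranking for illustration; it is not the entropy-based criterion in the anecdote that follows, and `select_kv_to_keep` and `evict` are hypothetical names. Removing an entry from the cache is equivalent to masking that position out for every future query.

```python
def select_kv_to_keep(attn_scores, budget, recent_window):
    """Return sorted indices of cached tokens to keep.

    attn_scores[i] is the attention token i has accumulated so far
    (e.g. summed over recent queries and heads). The most recent tokens are
    always kept; the remaining budget goes to the highest-scoring older ones.
    """
    n = len(attn_scores)
    if budget >= n:
        return list(range(n))
    if recent_window > budget:
        raise ValueError("recent_window must not exceed budget")
    recent = set(range(n - recent_window, n))
    older = sorted(range(n - recent_window), key=lambda i: attn_scores[i], reverse=True)
    keep = recent | set(older[: budget - recent_window])
    return sorted(keep)


def evict(k_cache, v_cache, keep):
    """Drop every cached key/value whose index is not in keep."""
    return [k_cache[i] for i in keep], [v_cache[i] for i in keep]


scores = [5.0, 0.1, 3.0, 0.2, 0.3, 0.4]
keep = select_kv_to_keep(scores, budget=4, recent_window=2)
print(keep)  # [0, 2, 4, 5]
```

In a real model you would run the selection per layer (often per head) and index the key and value tensors with `keep`; the recent window protects local context, which attention scores alone tend to undervalue for tokens that have only just arrived.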

I guided a first‑year master's student who used exactly this approach – "entropy‑based KV‑cache pruning." On LongBench, task accuracy dropped a tiny bit, but memory usage was cut by 60%. They submitted to ACL, and all three reviewers gave 4 points, commenting: "Simple, effective, and fills a practical deployment gap."

Direction B: Red‑Teaming and Safety Testing – pure brainpower, zero compute barrier

The compute requirement for this direction is so low you won't believe it: practically zero. You only need to call existing APIs (the free tier is enough), or even run small-model inference on a CPU.

What do you research? How clever prompt construction can bypass a model's safety guardrails. For example, exploit how little safety-training data exists for low-resource languages by translating requests into them, or hide dangerous instructions inside seemingly normal code comments and trick the model into executing them.

I once built an automated adversarial-example pipeline on top of cheap GPT-4 API calls: gradient descent searched for attack suffixes on mainstream open-source models (the gradients require white-box access, so that part runs on open weights), and the pipeline automatically tested their jailbreak robustness. That idea alone landed me a top-conference paper. Cost? A few dozen cents from OpenAI.
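For the automatic-testing half of such a pipeline, a common first-pass metric is attack success rate: the fraction of prompts the model answers instead of refusing. The sketch assumes `generate` is any callable mapping a prompt to a response (a local model or an API wrapper) and uses a crude, hypothetical keyword list to detect refusals; keyword matching misses partial compliance, so it is usually backed up by a judge model or manual review.

```python
from typing import Callable, Iterable

# Hypothetical refusal markers; a keyword check is only a rough proxy.
REFUSAL_MARKERS = (
    "i'm sorry", "i am sorry", "i cannot", "i can't",
    "as an ai", "i won't", "i will not",
)


def is_refusal(response: str) -> bool:
    """True if the response contains any refusal marker (case-insensitive)."""
    text = response.strip().lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def attack_success_rate(generate: Callable[[str], str],
                        prompts: Iterable[str], suffix: str = "") -> float:
    """Fraction of prompts whose response is NOT a refusal once suffix is appended."""
    prompts = list(prompts)
    if not prompts:
        raise ValueError("no prompts given")
    hits = sum(not is_refusal(generate(p + suffix)) for p in prompts)
    return hits / len(prompts)
```

Comparing `attack_success_rate(generate, prompts)` against the same call with a candidate `suffix` gives the robustness number per model, and the same harness works unchanged whether `generate` wraps an open-source model or an API.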

What's more, safety is a universal need. Big companies dread public relations and legal risks. If a model has a safety scandal, the entire product line has to slam the brakes. Handing them a jailbreak report will impress an interviewer far more than handing them a fine‑tuning script.

Direction C: Task‑Oriented Structured Pruning of Small Models

Don't believe the marketing hype of "beating GPT-4 with a 2B model." But you absolutely can take a 7B general model, structurally prune it down to its coding ability, and then squeeze it onto a Raspberry Pi.

Here's the workflow: use TransformerLens to hook the activations of every layer and every attention head of a 7B model, and analyze which heads are mainly responsible for code and which handle poetry, cooking, or chit-chat. Then prune the heads and channels unrelated to code, and fine-tune for one or two epochs to recover accuracy. The recovery step needs only a small amount of domain-specific data.
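Once you have a per-head importance score for code, the selection step is simple. The sketch assumes the scores are already computed, for example as the increase in code-task loss when a head is ablated through hooks; `heads_to_keep` and its `keep_fraction` budget are hypothetical choices for illustration. The physical pruning (deleting the matching slices of the Q/K/V and output projections) and the recovery fine-tuning are not shown.

```python
def heads_to_keep(code_importance, keep_fraction):
    """Build a per-layer keep mask from code-importance scores.

    code_importance maps (layer, head) -> how much the code-task loss rises
    when that head is ablated. The top keep_fraction of heads are kept.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(code_importance, key=code_importance.get, reverse=True)
    n_keep = max(1, round(keep_fraction * len(ranked)))
    kept = set(ranked[:n_keep])
    n_layers = 1 + max(layer for layer, _ in code_importance)
    n_heads = 1 + max(head for _, head in code_importance)
    return [[(layer, head) in kept for head in range(n_heads)]
            for layer in range(n_layers)]


scores = {(0, 0): 0.9, (0, 1): 0.01, (1, 0): 0.02, (1, 1): 0.5}
print(heads_to_keep(scores, 0.5))  # [[True, False], [False, True]]
```

A global ranking like this lets some layers lose more heads than others, which matches the observation that capabilities are not spread evenly across depth; a per-layer budget is the simpler alternative if your pruning code needs uniform layer shapes.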

One of my students compressed Llama‑3 8B down to 2.1B. On HumanEval, accuracy dropped only 3%, but inference speed increased by 4×. He took this project to an NVIDIA internship interview. The senior interviewer looked at his code and said, "You're the kind of guy who optimizes at the low level, right?" He got the offer on the spot.

---

4. I always feel like I'm just "tuning code," without theoretical depth. What should I do?

This problem means you've chosen the wrong sub‑field.

Everyone has a different background. Don't force yourself into a direction

Cael Lee
