Duke University's Latest Survey Paper on Interpretable Machine Learning: An 80-Page PDF Walkthrough (English)

Generated: 2026-06-20 19:21:41

---

Last week, I almost got scolded to tears by my boss.

It was over a bad case in our recommendation system: the system recommended a “prostate exam” to a female user. The user freaked out, and my boss grilled me: “Explain this to me. Why?” I stammered out something like, “The model learned it,” but even I didn’t believe what I was saying. I quickly whipped out SHAP and generated a bunch of explanation plots. Then an algorithms engineer walked by, glanced at them, and called me out on the spot: “Your explanation doesn’t match the model output!” At that moment, I wished I could just disappear.

Later, I found an 80-page survey from Duke University on arXiv. I printed it out and spent a weekend reading it cover to cover in a coffee shop.

Don’t get me wrong—I’m not a theoretical researcher. I’ve been writing technical columns for almost ten years, and I spend every day battling with algorithm teams and product managers. But this paper really struck a chord. It doesn’t dwell on those fancy “post-hoc explanation tricks.” Instead, it asks directly: what model should you be using in the first place? It lays out ten hard challenges that even the research community hasn’t fully solved, and analyzes them one by one.

Below, I’ll combine insights from the paper with my own experience and go through which challenges left a strong impression on me and which pitfalls I’ve fallen into myself.

---

10 challenges, each with a real‑world counterpart

Challenge 1: Decision trees—shallow trees are inaccurate, deep trees are incomprehensible

The paper points out that you want trees that are both small and accurate, while also handling missing values and multimodal data—which is really hard to balance. My own pitfall: the product manager complained that the rules were too simple and demanded a more detailed explanation. So I deepened the tree, and online accuracy dropped by 5 percentage points—overfitting. I felt the model was clunky and dangerous; it could blow up at any time.

Challenge 2: Scoring systems (like credit scores)—scores must be integers

The paper says coefficients must be integers, scores must be additive, and you still need sparsity and predictive power. My pitfall: a bank client insisted each feature’s score had to be an integer. Linear regression gave decimal scores, so I forced rounding. The total scores ended up conflicting with the ranking, and the project fell apart. These constraints are very strict.
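To make the rounding problem concrete, here is a minimal sketch with made-up numbers. The two applicants, the two features, and the weights are all hypothetical (they come neither from the paper nor from the bank project). It only shows that rounding each weight independently can flip the order of two applicants.

```python
def score(weights, features):
    """Additive score: sum of weight * feature value."""
    return sum(w * x for w, x in zip(weights, features))


def rank(weights, applicants):
    """Return applicant names sorted from highest to lowest score."""
    return sorted(
        applicants,
        key=lambda name: score(weights, applicants[name]),
        reverse=True,
    )


# Hypothetical numbers, chosen only to make the effect visible.
real_weights = [1.4, 0.6]
int_weights = [round(w) for w in real_weights]  # naive rounding gives [1, 1]

applicants = {"A": (3, 0), "B": (1, 3)}

real_ranking = rank(real_weights, applicants)  # A scores about 4.2, B about 3.2
int_ranking = rank(int_weights, applicants)    # A scores 3, B scores 4

print(real_ranking, int_ranking)
```

Rounding after the fact treats the integer requirement as an afterthought. The sketch suggests the constraint has to be part of the fitting itself, which is exactly what makes the problem hard.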

Challenge 3: Additive models—add interactions and everything falls apart

The paper says maintaining monotonicity constraints while supporting feature interactions creates a dilemma. My pitfall: I tried a Neural Additive Model (NAM). Univariate features worked fine. But as soon as I added interactions, the “product terms” in the model output became meaningless to anyone, and interpretability took a huge hit.

Challenge 4: Case‑based reasoning—“Why was this order rejected? Because a similar order was rejected too.”

The paper says you explain by finding similar cases from the training set, but the definition of “similar” is subjective. My pitfall: in a customer service system, I used “similar orders” to explain why an order was rejected. Users saw the “similar” order and said, “Those are clearly different,” which led to complaints.
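The subjectivity of "similar" is easy to reproduce. The sketch below is an assumption-laden toy: the order features (amount, days since signup), the past orders, and the feature weights are all invented for illustration. It retrieves the nearest past order under two different weightings of the same distance.

```python
import math


def weighted_distance(a, b, weights):
    """Euclidean distance where each feature difference is scaled by a weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def most_similar(query, cases, weights):
    """Name of the stored case closest to the query under the given weights."""
    return min(cases, key=lambda name: weighted_distance(query, cases[name], weights))


# Hypothetical features: (order amount, days since signup).
query = (100, 5)
past_orders = {
    "order_x": (105, 300),  # similar amount, very different account age
    "order_y": (400, 6),    # very different amount, similar account age
}

equal_weights = (1.0, 1.0)   # treats both features alike
amount_light = (0.1, 1.0)    # cares less about the amount

print(most_similar(query, past_orders, equal_weights))
print(most_similar(query, past_orders, amount_light))
```

Both answers are defensible, and each would be shown to the user as "the similar order." Whoever picks the weights is deciding what the explanation says.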

Challenge 5: Supervised concept disentanglement—making neurons correspond to semantic concepts

The paper says the idea sounds great, but collecting supervised signals is hard. My pitfall: I tried to get a model to learn concepts like “has a beard” or “wears glasses.” The data labeling cost was high, and the concepts overlapped (e.g., “beard” vs. “goatee”). The model’s performance was poor in the end.

Challenge 6: Unsupervised concept disentanglement—learning without labels

The paper says the goal is for the model to separate factors like “hairstyle” and “pose” on its own, but there’s no agreed standard for evaluating whether it succeeded. My pitfall: I ran β-VAE, and one latent dimension ended up controlling both glasses and beard at the same time, which made it impossible to interpret.

Challenge 7: Dimensionality reduction (t‑SNE/UMAP)—the business side’s soul‑searching question

The paper says you need to preserve local structure while also being stable and interpretable. My pitfall: I showed a t‑SNE plot to business stakeholders. They asked, “Why are these two points the same color but so far apart?” I struggled to explain the local distance distortion caused by nonlinear mapping.

Challenge 8: Physical/causal constraints—embedding domain knowledge into the model, extremely hard to solve

The paper says baking something like “conservation of energy” into the model makes optimization very difficult. My pitfall: I added a soft constraint in an industrial scenario. Training didn’t converge, and it took me days of tuning penalty coefficients to get it barely working. The process was painful.
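Here is a rough sketch of why the penalty coefficient is so touchy. Everything in it is assumed for illustration: two predicted quantities, a "conservation" rule saying they must sum to a fixed total, a squared-penalty soft constraint, and plain gradient descent. It is not the setup from my industrial project or from the paper.

```python
def fit_with_soft_constraint(targets, total, lam, lr=0.1, steps=50):
    """Minimize (p1 - t1)^2 + (p2 - t2)^2 + lam * (p1 + p2 - total)^2
    with plain gradient descent, starting from zero."""
    t1, t2 = targets
    p1, p2 = 0.0, 0.0
    for _ in range(steps):
        violation = p1 + p2 - total  # how far we are from the conservation rule
        g1 = 2.0 * (p1 - t1) + 2.0 * lam * violation
        g2 = 2.0 * (p2 - t2) + 2.0 * lam * violation
        p1 -= lr * g1
        p2 -= lr * g2
    return p1, p2


def violation_size(p, total):
    """Absolute gap between the predicted sum and the required total."""
    return abs(p[0] + p[1] - total)


# Hypothetical noisy targets that sum to 8, while the rule demands 10.
targets, total = (3.0, 5.0), 10.0

weak = fit_with_soft_constraint(targets, total, lam=1.0)             # converges, rule only loosely met
strong = fit_with_soft_constraint(targets, total, lam=100.0)         # same step size: blows up
retuned = fit_with_soft_constraint(targets, total, lam=100.0, lr=0.001)  # smaller step: stable again

print(violation_size(weak, total), violation_size(strong, total), violation_size(retuned, total))
```

A larger penalty enforces the rule more tightly but makes the loss surface much steeper in one direction, so the step size that worked before now diverges. In a real model there are many such knobs, which is roughly how "a few days of tuning" happens.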

Challenge 9: Rashomon model set—same accuracy, different structures, which one to choose

The paper says there can be many structurally different models with the same predictive accuracy—which one do you trust? My pitfall: I had two decision trees with identical accuracy. One labeled all male users as “not recommended,” the other didn’t. The business team argued endlessly, and finally the boss picked the one that seemed “politically correct.” Can you still call that interpretable?
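The situation is easy to recreate on a toy scale. In the sketch below, the six users, both rules, and the feature names are hypothetical. The two models reach exactly the same accuracy, yet one of them never recommends anything to male users.

```python
# Each row: (is_male, clicked_before, should_recommend). Hypothetical toy data.
users = [
    (1, 1, 1),
    (1, 0, 0),
    (0, 0, 1),
    (0, 1, 1),
    (1, 0, 0),
    (0, 1, 1),
]


def model_a(is_male, clicked_before):
    """Rule A: recommend only to non-male users."""
    return 0 if is_male else 1


def model_b(is_male, clicked_before):
    """Rule B: recommend only to users who clicked before."""
    return 1 if clicked_before else 0


def accuracy(model, rows):
    """Fraction of rows where the rule matches the label."""
    hits = sum(1 for m, c, y in rows if model(m, c) == y)
    return hits / len(rows)


def males_recommended(model, rows):
    """How many male users receive a recommendation under this rule."""
    return sum(1 for m, c, _ in rows if m == 1 and model(m, c) == 1)


print(accuracy(model_a, users), accuracy(model_b, users))
print(males_recommended(model_a, users), males_recommended(model_b, users))
```

Accuracy cannot break the tie here. The choice between the two rules has to come from something outside the metric, which is why the argument ended up on the boss's desk.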

Challenge 10: Interpretable reinforcement learning—policies need to be transparent, but sequential decisions are too complex

The paper says path dependency makes it hard. My pitfall: I used RL for a chatbot. After deployment, it suddenly started aggressively recommending financial products. It took me three days to track down which reward function caused it, and user feedback was terrible.

I’ve fallen into at least half of these pitfalls. For the rest, reading the paper taught me how to avoid them.

---

Five principles that reframe what “interpretable” means

The paper opens with five principles, each worth careful thought.

Principle 1: Interpretable models must follow domain‑specific constraints

A linear model might work in both medicine and credit scoring, but doctors want a ranking of risk factors, while loan officers want integer scores and additivity. There’s no universal “interpretable”; there’s only “interpretable for a given domain.”

Principle 2: Interpretable models don’t necessarily create trust—they can also cause distrust

Transparency doesn’t automatically earn trust. I once had a model that made “education level” its most important feature. The explanation was perfectly clear, but users called it “educational discrimination.” The model was interpretable, but users hated the explanation, and trust collapsed even faster. Transparency only lets people decide whether to trust you; it doesn’t guarantee acceptance.

Principle 3: Interpretability does not have to come at the cost of accuracy—it’s a false dichotomy

The paper has a key line: “Interpretability and accuracy are often a false dichotomy.” I’ve seen it myself: pruning a decision tree may lower accuracy, but adding rule‑based corrections can actually improve online performance because issues are easier to debug. The paper also mentions the “Clever Hans” phenomenon: black‑box models can get the right answer by using spurious correlations. I’ve seen a medical model diagnose a disease based on whether the patient was “wearing gloves,” because in the training data most patients who had tests done were wearing gloves. In a white‑box model, that kind of thing is immediately obvious; a black box hides it. In that sense, black boxes can actually be more fragile.

Principle 4: Performance metrics and interpretability metrics should be iterated together

Our team only ever looked at AUC. We had no quantitative metric for interpretability, and no one optimized for it. So the model became more and more opaque, and eventually nobody dared to touch it. This principle sounds simple but is hard to put into practice: are you willing to prioritize interpretability?
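One low-effort way to start is to put an interpretability proxy in the same report as AUC. The sketch below assumes a linear model and uses "number of nonzero weights" as that proxy. This is my own simplification, not a metric the paper prescribes, and the scores and weights are invented.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def nonzero_weights(weights, tol=1e-9):
    """Assumed complexity proxy: how many features the model actually uses."""
    return sum(1 for w in weights if abs(w) > tol)


def report(name, labels, scores, weights):
    """Put the performance metric and the interpretability proxy in one row."""
    return {
        "model": name,
        "auc": auc(labels, scores),
        "features_used": nonzero_weights(weights),
    }


# Hypothetical validation labels, model scores, and model weights.
labels = [0, 0, 1, 1]
dense = report("dense", labels, [0.1, 0.4, 0.35, 0.8], [0.7, -0.2, 0.05, 1.3, -0.9])
sparse = report("sparse", labels, [0.2, 0.5, 0.45, 0.7], [0.8, 0.0, 0.0, 1.1, 0.0])

print(dense)
print(sparse)
```

Once both numbers sit in the same table, a model that matches the AUC with fewer features stops being invisible in review.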

Principle 5: For high‑stakes decisions, use inherently interpretable models, not post‑hoc explanations of black boxes

This is the paper’s foundation. In law, medicine, and finance, you can’t get away with saying “the model told me so.” You need to be able to point to the model and say, “the decision path is right here, look.” So from the start, choose a transparent model.

---

Why post‑hoc explanations are becoming less and less reliable

The paper keeps hammering one point that I need to highlight separately: **Post‑hoc explanations of black‑box models essentially use a simpler

Cael Lee
