Search Is Broken, and Your Keyword Matching Is Making It Worse
Last year, our search conversion rate dropped 37%. Not a gradual decline—just fell off a cliff. My boss's face went through several shades of purple I'd never seen before.
The culprit? Something properly embarrassing. Users searching for "lightweight waterproof jacket" got tents. Searching for "Apple" returned photos of the fruit rather than the iPhone 15. Our old-school keyword matching had finally, spectacularly failed.
Here's the uncomfortable truth: if you're not using embeddings for vector search, you're essentially guessing what your users actually want. And you're probably wrong.
How Search Actually Works Now (And Why the Old Way Is Dead)
Traditional search engines behave like the world's most literal-minded librarian. Ask for "beginner programming books" and they'll scan the shelves for titles containing exactly those three words. Nothing more, nothing less.
Vector search? It's like that friend who actually understands what you mean. Ask "how do I start coding" and they'll hand you Automate the Boring Stuff with Python without missing a beat.
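Embeddings, at their core, turn anything (text, images, audio) into a fixed-length list of numbers: a vector. Things that are semantically similar end up clustered together in vector space, so "how close are these two vectors" becomes a stand-in for "how similar are these two meanings". It's almost magical, except it's just maths.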
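To make "just maths" concrete, here's a minimal sketch of cosine similarity, the usual closeness measure for embeddings. The three-dimensional vectors are hand-made for illustration (an assumption, not output from any real model); production embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made toy "embeddings", purely illustrative.
toy_embeddings = {
    "waterproof jacket": [0.9, 0.8, 0.1],
    "rain shell": [0.8, 0.9, 0.2],
    "camping tent": [0.2, 0.3, 0.9],
}

query_vec = toy_embeddings["waterproof jacket"]
for name, vec in toy_embeddings.items():
    print(f"{name}: {cosine_similarity(query_vec, vec):.3f}")
```

With these toy numbers, the rain shell scores far closer to the jacket than the tent does, which is the whole trick: nearest neighbours in vector space stand in for "means roughly the same thing".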
I first touched this stuff in 2019, building a recommendation system at a large tech company. I was ridiculously naive about it. Thought converting words to vectors couldn't possibly be that complicated.
Spoiler: it was.
On launch day, users who'd clicked on children's education articles got recommended... let's just say "adult entertainment content." I've never wanted the ground to swallow me faster. I skipped that team's next three social events.
Going All-In on Vector Search? Here's Your Invoice
Don't do it. I mean it. I've paid these tuition fees so you don't have to.
Mistake #1: Exact matching goes out the window
A user once searched for "iPhone 15 Pro Max 256GB black" on our platform. Vector search enthusiastically returned "smartphone buying guide," "flagship comparison 2024," and not a single actual product SKU.
Vector search is brilliant at understanding intent. It's rubbish at matching model numbers and precise specifications. That user wanted to buy a phone, not read a history of mobile computing.
Mistake #2: Technical terminology becomes a disaster
In 2022, I worked on a medical project where a doctor searched for "ST-segment elevation myocardial infarction." Vector search returned "heart disease precautions" and "how to manage angina"—all generic, all useless.
The head of A&E called our CTO directly. Screamed at him for ten minutes about how our system could have delayed critical care. I didn't sleep properly for three days after that call.
Mistake #3: Long-tail queries lose crucial details
E-commerce scenario: someone searches "birthday gift for girlfriend under £40 practical." Pure vector search latches onto "gift" and "girlfriend" while completely dropping the budget constraint and "practical" qualifier.
We recommended an £80 Jo Malone perfume. The user screenshotted it, posted it on social media, and the comments section was just rows of laughing emojis. Our marketing team nearly quit.
Actually, let me correct something here—Mistake #3 wasn't entirely vector search's fault. During our post-mortem, we discovered the query understanding layer had already truncated those long-tail modifiers. The problem was the entire pipeline, not just vector search playing silly beggars.
So what do smart teams do? Hybrid search.
How to Build Hybrid Search Without Losing Your Mind
Hybrid search combines keyword matching (good old BM25) with vector search (ANN semantic matching). But most people just dump both result sets together and call it a day. That's the approach of someone who's done a weekend course.
The Architecture We Actually Run in Production:
Layer 1: Dual-path recall
- Keyword recall: Elasticsearch with BM25 for exact matches, model numbers, proper nouns, SKUs
- Vector recall: bge-large-en-v1.5 (or the Chinese equivalent bge-large-zh-v1.5) converts all product descriptions and user queries into vectors, chucked into Milvus for ANN retrieval
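For a feel of what the keyword path contributes, here's a toy BM25 scorer in plain Python. It's a sketch, not what Elasticsearch runs internally: whitespace tokenisation, the `k1` and `b` defaults, and the three sample documents are all assumptions for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with Okapi BM25 (toy version)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "iphone 15 pro max 256gb black",
    "smartphone buying guide",
    "flagship comparison 2024",
]
scores = bm25_scores("iPhone 15 Pro Max 256GB black", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.2f}  {doc}")
```

Only the actual product listing shares any terms with the query, so it's the only document with a non-zero score. That literal-mindedness is exactly what you want for SKUs and model numbers, and exactly what the vector path can't give you.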
Layer 2: Fusion ranking
This is where the real differentiation happens. I've seen too many teams use fixed weights—keyword 0.3, vector 0.7—then sit back and wait for results. That's basically guessing.
The proper approach is dynamically adjusting weights based on user intent. Here's the rough idea:
# Pseudocode: grasp the concept, don't copy-paste
def dynamic_fusion(query, bm25_results, vector_results):
    # Figure out what kind of query this is
    if is_exact_match_query(query):  # Contains model numbers, specs, identifiers
        weight_kw = 0.8
        weight_vec = 0.2
    elif is_semantic_query(query):  # Descriptive, fuzzy phrasing
        weight_kw = 0.2
        weight_vec = 0.8
    else:  # Mixed intent
        weight_kw = 0.5
        weight_vec = 0.5
    return merge_and_rerank(bm25_results, vector_results, weight_kw, weight_vec)
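If you want something you can actually run, here's one way the pseudocode could be fleshed out. Everything beyond the weights is an assumption: the two `is_*` heuristics are crude stand-ins for a trained intent classifier, min-max normalisation is one of several ways to make BM25 and cosine scores comparable, and the document ids and scores are invented.

```python
def is_exact_match_query(query):
    """Hypothetical heuristic: any token mixing letters and digits (e.g. 256GB)."""
    return any(
        any(c.isdigit() for c in tok) and any(c.isalpha() for c in tok)
        for tok in query.split()
    )

def is_semantic_query(query):
    """Hypothetical heuristic: four or more words reads as descriptive phrasing."""
    return len(query.split()) >= 4

def pick_weights(query):
    """Return (keyword_weight, vector_weight) for the query."""
    if is_exact_match_query(query):
        return 0.8, 0.2
    if is_semantic_query(query):
        return 0.2, 0.8
    return 0.5, 0.5

def min_max(scores):
    """Rescale a {doc_id: score} dict to [0, 1] so the two paths are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def dynamic_fusion(query, bm25_results, vector_results):
    """Weighted sum of normalised scores; a doc missing from one path scores 0 there."""
    w_kw, w_vec = pick_weights(query)
    kw, vec = min_max(bm25_results), min_max(vector_results)
    fused = {
        doc: w_kw * kw.get(doc, 0.0) + w_vec * vec.get(doc, 0.0)
        for doc in set(kw) | set(vec)
    }
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Invented scores for illustration only.
bm25_results = {"sku-phone-256-black": 12.4, "buying-guide": 1.1}
vector_results = {"buying-guide": 0.82, "flagship-comparison": 0.80, "sku-phone-256-black": 0.74}

ranked = dynamic_fusion("iPhone 15 Pro Max 256GB black", bm25_results, vector_results)
for doc, score in ranked:
    print(f"{score:.2f}  {doc}")
```

Because "256GB" trips the exact-match heuristic, the keyword path gets the 0.8 weight and the SKU lands on top even though the vector path ranked it last. Normalising before mixing matters: raw BM25 scores are unbounded, cosine scores aren't, and adding them unscaled lets one path drown out the other.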
We also trained a query intent classification model: DistilBERT (itself a distilled BERT), fine-tuned on three months of user click logs. After deployment, click-through rates improved by 23%.
I'm slightly sceptical of that number, honestly. I reckon 3-5 percentage points came from other optimisations we shipped simultaneously. What I can confirm: return rates dropped 15%, because users actually found what they wanted.
Reranking Is Where Grown-Ups Separate from Beginners
Hybrid search is the engine. Reranking is the gearbox. Without it, all the horsepower in the world won't get you moving.
Three Approaches I've Actually Tried:
Approach 1: Cross-encoder fine-ranking
Used Cohere's rerank-v3 to rescore the top 50 fused results. The quality improvement was genuinely stunning. The latency increase? 200ms. My boss said "we need to maintain user experience." I said "great, give me more machines." He said "there's no budget."
Right then.
Approach 2: Feature engineering + LambdaMART
Extracted 30+ features: text similarity scores, keyword hit rates, historical CTR, price matching, category relevance, brand alignment... then fed everything into LambdaMART for ranking. Results were decent, but maintaining those features nearly drove two interns to quit. One of them told me on his last day, "I don't think I'm cut out for tech." I still feel guilty about that.
Approach 3: Multi-stage funnel (what we run now)
- Dual recall: top 100 from each path
- Fusion ranking: trim to top 50
- Cross-encoder fine-ranking: down to top 20
- Business rules layer (deduplication, blocklist, diversity control): final top 10
After deploying this, search relevance jumped 42% (A/B test, statistically significant with p=0.003), and P99 latency stayed under 300ms. Though I should mention—on peak sales days, that 300ms figure occasionally spiked to 480ms. An ops engineer called me at 2 AM, and we just stared at the monitoring dashboard in silence for five minutes.
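The funnel's shape is easier to see as code. This is a skeleton only: the stage functions are placeholders I've made up (real recall would hit Elasticsearch and Milvus, real reranking would call a cross-encoder), and the only thing taken from our setup is the 100 / 50 / 20 / 10 cut-offs.

```python
def run_funnel(query, recall_paths, fuse, rerank, apply_rules,
               recall_k=100, fused_k=50, rerank_k=20, final_k=10):
    """Run the four stages in order, trimming the candidate list after each one."""
    candidate_lists = [path(query)[:recall_k] for path in recall_paths]
    fused = fuse(query, candidate_lists)[:fused_k]
    reranked = rerank(query, fused)[:rerank_k]
    return apply_rules(reranked)[:final_k]

# --- Placeholder stages, purely for illustration ---
def keyword_recall(query):
    return [f"kw-{i}" for i in range(150)]

def vector_recall(query):
    return [f"vec-{i}" for i in range(150)]

def interleave_fuse(query, candidate_lists):
    """Stand-in fusion: round-robin across paths, dropping repeats."""
    seen, merged = set(), []
    for group in zip(*candidate_lists):
        for doc in group:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

def identity_rerank(query, docs):
    return docs  # a cross-encoder would reorder the list here

BLOCKLIST = {"kw-0"}

def drop_blocklisted(docs):
    return [doc for doc in docs if doc not in BLOCKLIST]

results = run_funnel(
    "waterproof jacket",
    recall_paths=[keyword_recall, vector_recall],
    fuse=interleave_fuse,
    rerank=identity_rerank,
    apply_rules=drop_blocklisted,
)
print(results)
```

Note the business rules run on the top 20, not the top 10. Filtering after the final trim would leave you short whenever a rule removes something, so keep some slack above the number you actually display.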
The Case That Made Me Rethink Everything
Before last year's Black Friday equivalent, we noticed something odd. Users searching for "waterproof jacket" were getting roughly equal scores for 3-in-1 jackets and single-layer shells. But data showed 87% of them eventually bought the 3-in-1 version.
The problem? Our embeddings model was generic. It couldn't grasp the subtle distinctions within a vertical product category. We did two things:
- Fine-tuned embeddings using historical interaction data: Click pairs, add-to-cart pairs, and purchase pairs as positive samples; random sampling as negatives; trained with contrastive learning. Used LoRA on bge-large, only 18GB VRAM, ran fine on a single A100
- Added contextual features at reranking: Winter months boosted thermal insulation parameters; user behaviour history weighted professional-grade products for known hikers
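"Contrastive learning" can mean several losses. As a hedged illustration, here's InfoNCE, a common choice for this kind of fine-tuning; I'm not claiming it's the exact objective we trained with. The 2-d vectors and the temperature of 0.05 are assumptions picked to keep the arithmetic readable.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce_loss(query_vec, positive_vec, negative_vecs, temperature=0.05):
    """-log softmax probability of the positive among positive + negatives."""
    logits = [dot(query_vec, positive_vec) / temperature]
    logits += [dot(query_vec, neg) / temperature for neg in negative_vecs]
    # Log-sum-exp with the max subtracted for numerical stability.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]

# Toy vectors: the query, a product the user bought, and two random products.
query = [1.0, 0.0]
purchased = [0.98, 0.2]
random_a = [0.0, 1.0]
random_b = [-1.0, 0.0]

loss_when_aligned = info_nce_loss(query, purchased, [random_a, random_b])
loss_when_misaligned = info_nce_loss(query, random_a, [purchased, random_b])
print(f"aligned: {loss_when_aligned:.4f}, misaligned: {loss_when_misaligned:.4f}")
```

The loss is near zero when the purchased item is already the query's nearest neighbour and large when a random product sits closer. Training nudges the embeddings until the first case is the norm, which is how the model learns that "waterproof jacket" shoppers mostly mean the 3-in-1.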
After fine-tuning, top-5 hit rate went from 68% to 83%. But I'll be honest: fine-tuning embeddings is grunt work. The data cleaning phase nearly claimed another intern. We spent three solid weeks filtering out fraudulent orders and misclicks. I still can't look at those weekly reports.
Five Mistakes You'll Almost Certainly Make
Mistake 1: Choosing the wrong embeddings model
For non-English scenarios, don't default to OpenAI's text-embedding-3-large. It's expensive and often underperforms on language-specific benchmarks. For Chinese, open models such as bge-large-zh-v1.5 and text2vec-large-chinese score competitively on C-MTEB (the Chinese MTEB leaderboard), and they're free. In our production environment, bge-large-zh-v1.5 averages about 8ms per embedding.
Mistake 2: Ignoring vector database index types
Faiss's IVF and HNSW perform radically differently at different scales. Under 1 million vectors, use HNSW: queries are fast, and at that size the slower graph build and extra memory are easy to live with. Over 100 million, switch to IVF+PQ or your memory will just... evaporate. Don't ask how I know. Rolling back code at 3 AM leaves scars.
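Some back-of-envelope arithmetic shows why. The formulas below are rough rules of thumb, not exact Faiss accounting: HNSW stores the full float32 vector plus roughly `M * 2` four-byte graph links per vector, while IVF+PQ stores a compressed code plus an 8-byte id (coarse centroids and codebooks are ignored as small). The values `M=32` and 64-byte PQ codes are assumptions; 1024 is the bge-large embedding dimension.

```python
def hnsw_bytes(n_vectors, dim, m=32):
    """Approximate HNSW memory: raw float32 vectors plus graph links."""
    return n_vectors * (dim * 4 + m * 2 * 4)

def ivfpq_bytes(n_vectors, code_bytes=64, id_bytes=8):
    """Approximate IVF+PQ memory: one compressed code and one id per vector."""
    return n_vectors * (code_bytes + id_bytes)

def to_gb(n_bytes):
    return n_bytes / 1e9

DIM = 1024  # bge-large embedding size

for n in (1_000_000, 100_000_000):
    print(
        f"{n:>11,} vectors: "
        f"HNSW ~{to_gb(hnsw_bytes(n, DIM)):.1f} GB, "
        f"IVF+PQ ~{to_gb(ivfpq_bytes(n)):.1f} GB"
    )
```

At 1 million vectors HNSW needs a little over 4 GB, which any decent box shrugs off. At 100 million it's around 435 GB against roughly 7 GB for IVF+PQ. You pay for that compression in recall, so measure it on your own queries before committing.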
Mistake 3: Skipping A/B tests and shipping straight to production
A colleague added a "positive review rate" feature to our ranking. Offline evaluation showed a 5% improvement. After launch, conversion dropped 8%. Why? Highly-rated products tend to be more expensive. Users saw things they couldn't afford and bounced. Optimising without business context is just academic masturbation. I've got that quote taped to my monitor.
Mistake 4: Serving vector search results as final output
Always add a business rules layer: deduplication, blocklists, inventory filtering, price boundaries, compliance checks. Otherwise you'll find discontinued products—or worse—in your search results. Then Legal will invite you for a "chat." We learned this the hard way when a delisted product appeared in search results and users started asking if we were going bankrupt.
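A rules layer doesn't need to be clever. Here's a minimal sketch; the `Product` fields, the rule set, and the sample data are all hypothetical, and a real one would pull stock and compliance status from live systems rather than from flags on the object.

```python
from dataclasses import dataclass

@dataclass
class Product:
    sku: str
    price: float
    in_stock: bool
    delisted: bool = False

def apply_business_rules(results, blocklist, max_price=None):
    """Filter ranked results, preserving order: dedupe, blocklist, stock, price."""
    seen, kept = set(), []
    for product in results:
        if product.sku in seen or product.sku in blocklist:
            continue
        if product.delisted or not product.in_stock:
            continue
        if max_price is not None and product.price > max_price:
            continue
        seen.add(product.sku)
        kept.append(product)
    return kept

ranked = [
    Product("perfume-deluxe", 80.0, in_stock=True),
    Product("travel-mug", 15.0, in_stock=True),
    Product("travel-mug", 15.0, in_stock=True),      # duplicate
    Product("old-kettle", 20.0, in_stock=True, delisted=True),
    Product("wool-scarf", 25.0, in_stock=False),
]

final = apply_business_rules(ranked, blocklist={"recalled-toy"}, max_price=40.0)
print([p.sku for p in final])
```

With a £40 ceiling passed in from the query, the £80 perfume never reaches the user, and neither does the delisted kettle. Hard constraints like these belong in deterministic code, not in the hope that a ranking model has learned them.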
Mistake 5: Not monitoring for embedding drift
Regularly check similarity scores for known synonyms and near-synonyms. We caught "PCR test" and "COVID test" drifting from 0.95 similarity to 0.7 in December 2022 because the upstream model updated its training data. If our monitoring alert hadn't fired, search quality would have cratered silently.
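The check itself is only a few lines. This sketch assumes you keep a small table of reference pairs with their baseline similarity and re-embed them on a schedule; the `embed` callable, the 0.1 alert threshold, and the fake vectors are illustrative assumptions, not our production values.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def check_drift(embed, baseline_pairs, max_drop=0.1):
    """Return (term_a, term_b, baseline, current) for pairs that drifted too far."""
    alerts = []
    for (term_a, term_b), baseline in baseline_pairs.items():
        current = cosine_similarity(embed(term_a), embed(term_b))
        if baseline - current > max_drop:
            alerts.append((term_a, term_b, baseline, round(current, 3)))
    return alerts

# Fake embedding lookup standing in for a call to the real model.
fake_vectors = {
    "PCR test": [1.0, 0.0],
    "COVID test": [0.7, 0.714],
    "sofa": [0.6, 0.8],
    "couch": [0.6, 0.8],
}

baseline_pairs = {
    ("PCR test", "COVID test"): 0.95,
    ("sofa", "couch"): 0.99,
}

alerts = check_drift(fake_vectors.__getitem__, baseline_pairs)
for alert in alerts:
    print("DRIFT:", alert)
```

Only the PCR/COVID pair trips the alert here. Run this against every new model version before it touches the index, and remember that a changed model means re-embedding the whole corpus, not just the queries.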
TL;DR
- Pure keyword search is dead. Pure vector search is unreliable. Hybrid search is the sweet spot
- Dynamic fusion weighting based on query intent beats fixed weights every time
- Reranking—especially with a multi-stage funnel—is where the real gains live
- Fine-tune your embeddings on domain-specific data. Generic models don't understand your vertical
- Always run A/B tests. Always. Your intuition about what works is probably wrong
I've been tinkering with something new lately: using LLMs for query understanding and result explanation. Imagine someone searches "commuting outfit ideas." An LLM can infer they're probably an office worker, want smart-casual suggestions, and might have budget constraints—then pass structured parameters to the search system. It's far more flexible than current query intent classifiers.
The catch? Inference latency is still all over the place. I tested GPT-4o for query rewriting in December 2024 and got 400ms+ per request. Completely unusable in production. Maybe the 2025 model releases will change things.
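For what it's worth, the "structured parameters" half is the easy part. Here's a sketch of parsing and validating an LLM's reply, with a fallback to plain keywords when the reply is unusable. The JSON schema, field names, and the hard-coded reply are all my own invention; no real LLM API is called.

```python
import json
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class StructuredQuery:
    keywords: List[str]
    style: Optional[str] = None
    max_price: Optional[float] = None

def parse_llm_response(raw_response, original_query):
    """Turn the model's JSON reply into search parameters, or fall back safely."""
    try:
        data = json.loads(raw_response)
        max_price = data.get("max_price")
        return StructuredQuery(
            keywords=[str(k) for k in data["keywords"]],
            style=data.get("style"),
            max_price=float(max_price) if max_price is not None else None,
        )
    except (json.JSONDecodeError, AttributeError, KeyError, TypeError, ValueError):
        # Malformed reply: degrade to the raw query rather than failing the search.
        return StructuredQuery(keywords=original_query.split())

# Hard-coded stand-in for what a model might return.
fake_reply = '{"keywords": ["smart-casual", "office"], "style": "smart-casual", "max_price": 100}'

parsed = parse_llm_response(fake_reply, "commuting outfit ideas")
fallback = parse_llm_response("Sure! Here are some ideas...", "commuting outfit ideas")
print(parsed)
print(fallback)
```

The fallback is the bit that matters. LLM output is untrusted input, so the search path has to keep working when the reply is chatty, truncated, or late.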
What search disasters has your team survived? Which embeddings models are you running? Drop a comment—I read every single one, genuinely. And if this article helped you, forward it to that colleague who's still using MySQL LIKE for search. They need saving.
#searchsystems #embeddings #vectordatabase #RAG #machinelearning #systemarchitecture
Cael Lee
Full-stack developer with 8+ years of experience. Currently building AI-powered developer tools. I've tested 20+ AI API providers and coding assistants.