Intent classification in chatbots is not just an ML task. It is a routing layer that determines resource consumption of the entire system. Wrong architecture forces expensive LLM calls for trivial queries, burning budget and increasing latency.
Previously, we discussed how fine-tuned BERT can be a better alternative to LLM-based classification. Now, let’s consider an even simpler solution: embeddings + a shallow classifier — when it works, when it fails, and whether you need to fine-tune the encoder.
The canonical architecture
The pattern splits inference into two stages:
- Feature extraction: Frozen encoder (MiniLM, E5, BGE, or similar) converts text to dense vector. Single forward pass, shared across all downstream tasks
- Classification head: Lightweight model (Logistic Regression, Linear SVM) operates on the vector. Inference is matrix multiplication — microseconds on CPU
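The two stages can be sketched end to end. This is a minimal, runnable illustration: the `encode` function below is a hash-seeded random projection standing in for a real frozen encoder (in production it would be something like `SentenceTransformer("all-MiniLM-L6-v2").encode`), and the training texts and intent names are made up.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stage 1: feature extraction. A stand-in for a frozen encoder so the sketch
# runs without a model download; the same text maps to the same vector
# within a process.
def encode(texts, dim=384):
    return np.stack([
        np.random.default_rng(abs(hash(t)) % 2**32).standard_normal(dim)
        for t in texts
    ])

# Stage 2: lightweight head trained on the frozen vectors.
X_train = encode(["reset my password", "where is my card", "card never arrived"])
y_train = ["password_reset", "card_tracking", "card_arrival"]
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Inference: one encoder forward pass plus one matrix multiply.
probs = clf.predict_proba(encode(["reset my password"]))
print(clf.classes_[probs.argmax()])
```

Swapping in a real encoder changes only Stage 1; the head and the inference path stay identical.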
Why this often wins over fine-tuned BERT
Real-world benchmarks (Banking77, CLINC150) show the trade-offs:
| Metric | Embeddings + LR | Fine-tuned BERT |
|---|---|---|
| Accuracy | 92–93% | 93–95% |
| Latency (CPU) | 4–20ms | 50–300ms |
| Training time | Seconds | Hours |
| Multi-tenant scaling | One encoder for all | Separate model per task |
The accuracy gap is 1–3%. The latency gap is 5–15x. For most production cases, this trade-off favors embeddings.
The core limitation: frozen encoder cannot adapt to domain-specific vocabulary. If terminology differs significantly from pretraining corpus, semantically different intents may map to similar regions in embedding space.
Once the architecture is set, embedding quality becomes the main variable.
Embedding dimensionality
Default 768 dimensions are often excessive for intent routing. Dimensionality directly impacts memory (index size scales linearly) and latency (scoring is O(d) per candidate).
Practical approaches to reduction:
| Method | Description | Typical quality loss |
|---|---|---|
| PCA post-processing | Reduce 768→256 on your dataset | <1% (often) |
| Matryoshka (MRL) models | Truncate vector at inference | Depends on cutoff |
| Binary quantization | 1 bit per dimension | 2–5%, but 32x memory savings |
Run ablation before accepting d=768 as given. For routing tasks, 256 dimensions are often sufficient.
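The PCA ablation is a few lines with scikit-learn. The Gaussian matrix below is a stand-in for your real embedding matrix, where retained variance is typically far higher than for random data; run this on actual encoder outputs before deciding.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for a (num_examples, 768) matrix of real encoder embeddings.
X = np.random.default_rng(42).standard_normal((1000, 768))

# Fit the 768 -> 256 projection on your own data and measure what it costs.
pca = PCA(n_components=256, random_state=0).fit(X)
X_reduced = pca.transform(X)  # (1000, 256): a 3x smaller index

print(X_reduced.shape)
print(f"variance retained: {pca.explained_variance_ratio_.sum():.2f}")
```

Rerun the downstream classifier on `X_reduced` and compare validation accuracy against the full-dimensional baseline; that delta, not the variance number alone, is the quantity to accept or reject.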
Shallow classifier selection
In practice, two linear models dominate:
Logistic Regression
Default choice. Outputs calibrated probabilities — critical for confidence-based routing. Inference is dot product. Works well when embedding space is linearly separable (which it usually is with modern encoders).
Linear SVM
Better generalization on small or noisy data (<50 samples per class). Margin-based optimization is less sensitive to outliers. Downside: no native probability output — requires calibration wrapper.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# Default choice: calibrated probabilities without extra machinery
clf = LogisticRegression(max_iter=1000, C=1.0)

# For small/noisy data: SVM wrapped to recover probability output
clf = CalibratedClassifierCV(
    LinearSVC(C=1.0, max_iter=10000),
    method='sigmoid',
)
```
Non-linear models (Random Forest, deep MLP) typically do not improve over linear ones. The representation is already separable — added complexity increases latency without accuracy gain.
Training data strategies
Real labeled data: High accuracy on frequent queries, expensive to collect, poor coverage of rare cases.
Synthetic data (LLM-generated): Fast to scale, but risks distribution shift and style bias. Critical point: generate hard negatives — queries that share keywords with an intent but have different meaning.
Hybrid approach: Real data for frequent queries (“head”), synthetic for rare cases (“tail”). Apply round-trip filtering: generate example → classify with stronger model → discard if mismatch.
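Round-trip filtering reduces to a few lines once the stronger model is behind a callable. In this sketch, `verifier` is a toy stand-in for that stronger model (an LLM or a larger classifier), and the example texts are invented.

```python
# Keep a synthetic example only if the stronger model maps it back to the
# intent it was generated for; otherwise discard it as off-intent or drifted.
def round_trip_filter(candidates, verifier):
    """candidates: (text, intended_intent) pairs; verifier: text -> intent."""
    kept, dropped = [], []
    for text, intent in candidates:
        (kept if verifier(text) == intent else dropped).append((text, intent))
    return kept, dropped

# Toy verifier standing in for an LLM or a stronger classifier.
verifier = lambda text: "card_arrival" if "arrive" in text else "other"

kept, dropped = round_trip_filter(
    [("when will my card arrive?", "card_arrival"),
     ("cancel my card", "card_arrival")],  # off-intent generation
    verifier,
)
print(len(kept), len(dropped))  # → 1 1
```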
Hard negatives matter more than volume. Example: query “What documents are NOT needed for mortgage?” should not match intent “Mortgage document list”. Adding such near-miss examples to training improves Out-of-Scope (OOS) detection — AUROC can jump from 0.91 to 0.98.
Confidence and routing
Logistic Regression outputs calibrated probabilities by design. The practical challenge is not calibration — it is setting thresholds.
Per-class thresholds
Global threshold (e.g., 0.7 for all intents) fails when intents have different cluster densities in embedding space. Narrow intent like “password reset” needs higher threshold (0.9) than broad intent like “general complaint” (0.6).
Automatic threshold formula: T_i = max(0.5, μ_i - α·σ_i) where μ and σ are mean and standard deviation of confidence scores on validation set for intent i.
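The formula translates directly into code. The intent names and confidence scores below are illustrative; in practice they come from top-1 confidences on a held-out validation set.

```python
import numpy as np

# T_i = max(0.5, mu_i - alpha * sigma_i), computed per intent from
# validation confidences.
def per_class_thresholds(confidences, alpha=1.0, floor=0.5):
    """confidences: dict of intent -> top-1 confidence scores on validation."""
    return {
        intent: max(floor, scores.mean() - alpha * scores.std())
        for intent, scores in confidences.items()
    }

val_conf = {
    "password_reset": np.array([0.95, 0.97, 0.93, 0.96]),     # tight cluster
    "general_complaint": np.array([0.55, 0.80, 0.65, 0.70]),  # broad intent
}
thresholds = per_class_thresholds(val_conf, alpha=1.0)
print(thresholds)  # the narrow intent gets the stricter threshold
```

The `alpha` knob trades precision against fallback rate: larger values lower the thresholds and let more borderline queries through.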
Tiered routing
Do not make routing binary. Three zones work better:
| Confidence | Action |
|---|---|
| > τ_high | Deterministic response or API call |
| τ_low < c ≤ τ_high | Route to LLM for verification |
| < τ_low | Fallback: human handoff or “I don’t know” |
Middle zone catches uncertain cases without expensive LLM calls for obvious ones. This reduces average cost by 40–60% compared to routing everything through LLM.
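The three zones come down to two comparisons. The tau values here are illustrative and should be tuned per deployment, ideally per class as described above.

```python
TAU_HIGH, TAU_LOW = 0.85, 0.40  # illustrative; tune on validation data

def route(intent, confidence):
    if confidence > TAU_HIGH:
        return ("deterministic", intent)  # canned response or API call
    if confidence > TAU_LOW:
        return ("llm_verify", intent)     # uncertain: escalate to LLM
    return ("fallback", None)             # OOS or human handoff

print(route("password_reset", 0.93))  # → ('deterministic', 'password_reset')
print(route("card_arrival", 0.55))    # → ('llm_verify', 'card_arrival')
print(route("unknown", 0.20))         # → ('fallback', None)
```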
Out-of-scope detection
Out-of-Scope queries (the user asks something outside the known intents) are tricky: the classifier will still output some class with non-trivial confidence.
Options:
- Train explicit OOS class using samples from open-domain datasets
- Use distance to class centroids in embedding space — large distance signals OOS
- Set conservative τ_low and route uncertain queries to fallback
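The centroid-distance option can be sketched in a few lines. The 2-d centroids and the `max_cos_dist` cutoff below are illustrative; real centroids are per-class means of training embeddings, and the cutoff is tuned on validation data with known OOS examples.

```python
import numpy as np

# Flag a query as OOS when its cosine distance to every intent centroid is
# large, regardless of what the classifier's softmax says.
def is_oos(query_vec, centroids, max_cos_dist=0.5):
    q = query_vec / np.linalg.norm(query_vec)
    sims = [c @ q / np.linalg.norm(c) for c in centroids.values()]
    return (1.0 - max(sims)) > max_cos_dist

centroids = {
    "password_reset": np.array([1.0, 0.0]),
    "card_arrival": np.array([0.0, 1.0]),
}
print(is_oos(np.array([0.9, 0.1]), centroids))    # near a centroid → False
print(is_oos(np.array([-1.0, -1.0]), centroids))  # far from both → True
```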
Common failure modes
Most failures are not classifier bugs, but representation limits.
Negation blindness
Mean pooling loses syntactic structure. “Denies chest pain” embeds similarly to “has chest pain” because dominant tokens are the same.
Mitigation: For high-stakes domains (medical, legal), add rule-based layer or LLM verification for edge cases.
Fine-grained intent collapse
Semantically close intents (card_arrival vs card_tracking) overlap in embedding space. Linear classifier cannot separate what encoder did not distinguish.
Mitigation: Merge into single intent with sub-classification, or fine-tune encoder with contrastive loss on hard pairs.
Do you need to fine-tune the encoder?
Short answer: usually no. Off-the-shelf embeddings (E5, BGE, OpenAI, Cohere) handle most intent classification tasks.
When frozen embeddings are sufficient
- Domain is close to general text (support tickets, FAQ, e-commerce)
- Intents are semantically distinct
- You have enough labeled data for the classifier head
- Accuracy on validation set meets requirements
Signs that encoder quality is the bottleneck
- Close intents “collapse” — confusion matrix shows systematic errors between specific pairs
- Accuracy plateaus despite sufficient training data
- Domain has specialized terminology (medical codes, legal jargon, internal product names)
Alternatives before full fine-tuning
- Try stronger encoder: switch from MiniLM to E5-large or BGE-large. Latency increases but may solve the problem
- Contrastive fine-tuning on hard pairs: lighter than full fine-tuning. Collect pairs of confused intents, train with contrastive loss. Often sufficient
- Add verification layer: keep the fast classifier for 80% of queries, route confused pairs to LLM or cross-encoder for final decision
- LoRA/Adapters: if you need fine-tuning but serve multiple tenants, parameter-efficient methods keep a single base model with lightweight per-task modules
When full fine-tuning is justified
- Significant domain shift (medical, legal, highly specialized)
- Classification depends on syntax (negations, conditionals)
- Sentence-pair tasks where cross-attention matters
- You have >1000 samples per class and can afford training time
Fine-tuning adapts embedding space itself — attention weights learn domain-specific patterns. But cost is real: hours of training, separate model per task, higher inference latency.
Production considerations
For chatbot latency (<100ms end-to-end):
- Use small encoder (MiniLM, DistilBERT, TinyBERT)
- Export to ONNX, apply INT8 quantization — gives 2–5x speedup
- Add semantic cache for frequent queries — hash lookup is faster than any model
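The cache point can be illustrated with an exact-match sketch: normalize the query, hash it, and reuse the stored decision. A true semantic cache would also match near-duplicate embeddings; this minimal version covers only verbatim repeats, and the `classify` callable is a stand-in for the encoder + head pipeline.

```python
import hashlib

_cache = {}

def cached_route(query, classify):
    # Whitespace/case normalization before hashing catches trivial variants.
    key = hashlib.sha256(" ".join(query.lower().split()).encode()).hexdigest()
    if key not in _cache:
        _cache[key] = classify(query)  # only a cache miss pays for the model
    return _cache[key]

calls = []
classify = lambda q: (calls.append(q), "password_reset")[1]  # toy stand-in

cached_route("Reset my password", classify)
cached_route("reset  my password", classify)  # normalizes to the same key
print(len(calls))  # → 1
```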
Summary
Embeddings + shallow classifier is a strong production baseline. It scales across tenants, retrains in seconds, and handles few-shot scenarios well.
The decision flow:
- Start with off-the-shelf encoder + Logistic Regression
- Check confusion matrix — if specific pairs collapse, try stronger encoder or contrastive tuning
- Add tiered routing with per-class thresholds
- Invest in hard negative generation — improves decision boundaries more than adding positive samples
- Fine-tune encoder only if domain shift is significant and alternatives fail
Component metrics (F1, accuracy) are diagnostic tools. Optimize for business metrics: resolution rate, cost per query, fallback rate.
| Approach | Latency | Memory | When to use |
|---|---|---|---|
| Frozen encoder + LR | <20ms | O(1) | Default starting point |
| Stronger encoder + LR | 30–50ms | O(1) | Close intents collapse |
| Contrastive fine-tuning | same as chosen encoder | O(1) | Hard pairs, moderate domain shift |
| Full fine-tuning | 50–300ms | O(N) for N tasks | Significant domain shift, syntax-dependent |