-
Domain-driven design for AI systems: architectural patterns and production experience
Exploring how domain-driven design principles (bounded contexts, anti-corruption layer, ubiquitous language, domain events) enable modularity, safety, and traceability in production AI and LLM systems.
-
Semantic prompt caching: when LLM-judge beats exact match
Standard prompt caching requires an exact prefix match. An LLM judge can instead validate semantic equivalence, rescuing cache hits on paraphrases at the cost of a controllable latency overhead.
-
The reranking trap: when cross-encoders make things worse
Cross-encoders and LLM rerankers promise better retrieval precision, but a 200% latency penalty, diversity collapse, and production failures reveal when this expensive step becomes counterproductive.
-
Structured output engineering for production LLMs
Moving from 85% parse rates to production-grade reliability: constrained decoding guarantees the output format, Pydantic validation ensures correctness, and token optimization cuts costs by 50%.
-
The chunk size dilemma: identifying the optimal value in RAG systems
Choosing a chunk size is non-trivial: too small loses context; too large dilutes semantics through mean pooling. A systematic methodology for identifying the sweet spot.