Why agent memory degrades in production

Agent memory frameworks expose a clean read path: retrieve relevant context, inject it into the prompt. The write path — what gets stored, when, and how existing facts are updated or removed — receives significantly less attention in both documentation and implementation.

The benchmark gap

RankSquire benchmarked Mem0 v0.8.2 on LongMemEval and measured 93.4% accuracy. The same system, running over 50,000 real sessions across 30 days, dropped to 49.0% effective accuracy. The primary cause: flat memory structures accumulating a 38% staleness rate — contradicted facts stored as separate ADD events rather than resolved at write time.

Based on this audit, RankSquire approximated the relationship between real-world accuracy and two observable variables, staleness rate and memory store size:

Production_Accuracy ≈ Benchmark − (0.22 × Staleness_Rate) − (0.15 × log₁₀(Entities))

This is not a retrieval problem. Retrieval is faithfully surfacing stale facts that should have been updated when new information arrived. The write path was never designed.

How the write path fails

Four problems compound independently. Solving only one leaves the others intact.

Single-pass extraction — The default approach, one LLM call that simultaneously detects, structures, and extracts, has a joint accuracy problem. If you extract 20 fields at 97% per-field accuracy and the per-field errors are independent, the probability that the entire record is correct is 0.97^20, or approximately 54% (xmemory). Memory acts as a system of record. A hallucinated value during the initial write silently corrupts every future read from that entry.
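The 54% figure follows from multiplying per-field accuracies. A minimal sketch of that arithmetic, assuming field errors are independent (the function name is illustrative and not taken from any library):

```python
def joint_record_accuracy(per_field_accuracy: float, num_fields: int) -> float:
    """Probability that every field in a record is correct.

    Assumes each field is extracted correctly with the same probability
    and that errors are independent across fields.
    """
    if not 0.0 <= per_field_accuracy <= 1.0:
        raise ValueError("per_field_accuracy must be between 0 and 1")
    if num_fields < 0:
        raise ValueError("num_fields must be non-negative")
    return per_field_accuracy ** num_fields


if __name__ == "__main__":
    # Show how quickly whole-record accuracy decays as the field count grows.
    for fields in (1, 5, 10, 20):
        print(fields, round(joint_record_accuracy(0.97, fields), 3))
```

Under the same assumption, splitting extraction into smaller staged calls only helps if each stage can be validated on its own, since the product still spans every field that ends up in the record.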

No retrieval before write — Committing a new fact without first checking what is already stored produces semantic duplicates and unresolved contradictions. Governed Memory uses hard cosine similarity thresholds: above 0.92, the write is skipped as a duplicate; an offline job uses 0.95 to merge near-duplicates. Without explicit thresholds, the decision falls to LLM judgment under context pressure — unreliable at scale.
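A retrieval-before-write gate can be sketched in a few lines. This is an illustration of the idea only, not Governed Memory's implementation: the in-memory list, the linear scan, and the function names are assumptions, and only the 0.92 threshold comes from the text above.

```python
import math
from typing import List, Sequence, Tuple

# From the text: writes whose similarity to an existing memory is above
# this value are skipped as duplicates.
DUPLICATE_THRESHOLD = 0.92


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def write_if_new(
    store: List[Tuple[str, List[float]]],
    text: str,
    embedding: Sequence[float],
    threshold: float = DUPLICATE_THRESHOLD,
) -> bool:
    """Check existing memories first; append only if none is a near-duplicate.

    Returns True if the memory was written, False if the write was skipped.
    The store is a hypothetical in-memory list scanned linearly.
    """
    for _, existing in store:
        if cosine_similarity(existing, embedding) > threshold:
            return False
    store.append((text, list(embedding)))
    return True
```

A production store would replace the linear scan with a vector index query, but the decision stays the same: compare against what is already stored, then apply a fixed threshold instead of asking the model to judge.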

Append-only semantics — Mem0’s v2 architecture stored both “I use React” and “I use Vue” as independent ADD events when the user’s stack changed, relying on search ranking to surface current truth. This works until the store is large enough that ranking becomes unreliable. The minimum viable write decision requires four explicit operations: ADD (no relevant memory exists), UPDATE (new information supersedes old), DELETE (fact explicitly revoked), NOOP (already accurately represented).
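The four operations can be made explicit in code. In the sketch below, `decide` is a rule-based stand-in for the decision an LLM would make in a real pipeline, and the key-value store is a deliberate simplification. All names are hypothetical.

```python
from enum import Enum
from typing import Dict, Optional


class WriteOp(Enum):
    ADD = "ADD"        # no relevant memory exists
    UPDATE = "UPDATE"  # new information supersedes old
    DELETE = "DELETE"  # fact explicitly revoked
    NOOP = "NOOP"      # already accurately represented


def decide(existing: Optional[str], incoming: Optional[str]) -> WriteOp:
    """Rule-based stand-in for the write decision.

    incoming=None models an explicit revocation of the fact.
    """
    if incoming is None:
        return WriteOp.DELETE if existing is not None else WriteOp.NOOP
    if existing is None:
        return WriteOp.ADD
    if existing == incoming:
        return WriteOp.NOOP
    return WriteOp.UPDATE


def apply_write(store: Dict[str, str], key: str, incoming: Optional[str]) -> WriteOp:
    """Look up the existing value for `key`, decide, then mutate the store."""
    op = decide(store.get(key), incoming)
    if op in (WriteOp.ADD, WriteOp.UPDATE):
        assert incoming is not None  # guaranteed by decide()
        store[key] = incoming
    elif op is WriteOp.DELETE:
        del store[key]
    return op
```

With this shape, the React-to-Vue change becomes a single UPDATE on one key instead of two competing ADD events that search ranking has to arbitrate.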

No eviction policy — Without garbage collection, stores explode. RankSquire audited a production system with 3.2 million memory entries and found 97.8% low-utility. Storage costs had spiked 300%. Scaling from 50k to 200k entities without graph resolution caused a 14x increase in retrieval latency — from 52ms to 728ms at p95. The bottleneck was not the retrieval algorithm.
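One simple eviction policy is an idle-time rule: drop entries that have not been read within some window. The sketch below assumes each entry tracks a `last_accessed` timestamp. The field names and the choice of window are hypothetical, and a real policy would likely also weigh utility or importance.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict


@dataclass
class MemoryEntry:
    text: str
    last_accessed: datetime


def evict_stale(store: Dict[str, MemoryEntry], now: datetime, max_idle_days: int) -> int:
    """Delete entries not accessed within the last `max_idle_days` days.

    Mutates `store` in place and returns the number of evicted entries.
    """
    cutoff = now - timedelta(days=max_idle_days)
    # Collect keys first so the dict is not modified while iterating over it.
    stale_keys = [key for key, entry in store.items() if entry.last_accessed < cutoff]
    for key in stale_keys:
        del store[key]
    return len(stale_keys)
```

Passing `now` explicitly keeps the pass deterministic and easy to test, which matters for a job that deletes data on a schedule.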

When to write

Two anti-patterns dominate.

Batch compaction — buffering conversations and flushing memory when the context window fills — is lossy compression disguised as a memory strategy. OpenClaw documented the result after three months of production use: LLMs summarizing under context pressure produce unintended recency bias. Older sessions are permanently lost. The system silently truncates memory files at 20,000 characters.

Turn-level extraction for everything — running heavy LLM extraction after every exchange — wastes tokens on transient data. Most conversational content is not worth persisting.

The pattern that holds up is two-timescale operation:

SLM = Small Language Model, LTM = Long-Term Memory

Timescale	Operation	Mechanism
Online (real-time)	Lightweight acquisition	Fast embeddings or SLM into mid-term buffer
Offline (async)	Consolidation into LTM	Heavy LLM extraction, dedup, contradiction resolution

RecMem only triggers heavy extraction when sustained recurrence is observed — when an incoming interaction clusters semantically with enough past exchanges to prove the topic is durable. LightMem formalizes this as a dedicated SLM for online writes, with a heavier LLM running asynchronously for long-term consolidation. Mem0 made async_mode=True the production default for exactly this reason.
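The recurrence trigger can be illustrated with a small mid-term buffer. This is a rough sketch of the general idea, not RecMem's or LightMem's actual algorithm: the class name, the similarity threshold, and the recurrence count are all assumed values chosen for illustration.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class MidTermBuffer:
    """Cheap online buffer that flags topics recurring often enough to consolidate."""

    def __init__(self, similarity_threshold: float = 0.8, min_recurrences: int = 3) -> None:
        # Both defaults are hypothetical, not taken from any cited system.
        self.similarity_threshold = similarity_threshold
        self.min_recurrences = min_recurrences
        self._embeddings: List[List[float]] = []

    def observe(self, embedding: Sequence[float]) -> bool:
        """Record one exchange.

        Returns True when the exchange is similar to at least
        `min_recurrences` earlier exchanges, meaning heavy extraction
        should be scheduled for this topic.
        """
        neighbors = sum(
            1
            for past in self._embeddings
            if cosine_similarity(past, embedding) >= self.similarity_threshold
        )
        self._embeddings.append(list(embedding))
        return neighbors >= self.min_recurrences
```

The online step costs only an embedding and a similarity count per exchange. The expensive LLM call is deferred until the buffer signals that a topic has proven durable.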

Contradiction resolution at write time

Contradictions that represent genuine transitions — not errors — need a different treatment than simple UPDATE overwrites. LinkedIn’s Cognitive Memory Agent uses a temporal arbiter that generates a reconciliation summary rather than overwriting history: “User utilized React until November 2025 but has since transitioned their primary stack to Vue.” The agent can reason about the transition, not just the current state.
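One way to preserve a transition instead of overwriting it is to store validity intervals per value. The sketch below is an assumption about how such a record could look, not LinkedIn's implementation. Their arbiter generates a summary, whereas this example builds one from a fixed template, and the start date in the usage note is invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class FactVersion:
    value: str
    valid_from: date
    valid_to: Optional[date] = None  # None means this version is still current


@dataclass
class TemporalFact:
    subject: str
    attribute: str
    versions: List[FactVersion] = field(default_factory=list)

    def supersede(self, new_value: str, as_of: date) -> None:
        """Close the current version (if any) and append the new one."""
        if self.versions and self.versions[-1].valid_to is None:
            self.versions[-1].valid_to = as_of
        self.versions.append(FactVersion(new_value, as_of))

    def current(self) -> Optional[str]:
        """Return the value that is still valid, or None if there is none."""
        if self.versions and self.versions[-1].valid_to is None:
            return self.versions[-1].value
        return None

    def summary(self) -> str:
        """Template-based reconciliation summary covering the full history."""
        parts = []
        for version in self.versions:
            if version.valid_to is None:
                parts.append(f"{version.value} since {version.valid_from.isoformat()}")
            else:
                parts.append(
                    f"{version.value} from {version.valid_from.isoformat()} "
                    f"until {version.valid_to.isoformat()}"
                )
        return f"{self.subject} {self.attribute}: " + "; ".join(parts)
```

For example, calling `supersede("React", date(2024, 1, 1))` and then `supersede("Vue", date(2025, 11, 1))` on a `TemporalFact("User", "primary stack")` leaves both versions in the history, with `current()` returning `"Vue"`.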

For automated contradiction detection at scale, MemoryLake implements three-level conflict detection: logical conflicts (direct contradictions), implicit knowledge conflicts (requiring inference), and hallucination conflicts — with resolution rules that favor recency while preserving conflict history.

What the write path actually costs

At scale, memory operations account for 60% of total agent system costs — not LLM inference. This is the number that surprises most teams when they first see production billing.

The ProMem framing is correct: memory is write-once, read-many. Investing in rigorous write-path validation (staged extraction, retrieval-before-write operations, deduplication gates) increases initial token cost. But the errors that cheap extraction lets through are paid for again on every future read of the corrupted entry.

One concrete tradeoff: adding entity resolution to prevent graph node explosion adds 80–120ms to write latency. For real-time pipelines requiring sub-100ms writes, this belongs in the async offline job, not the hot path.

What this requires structurally

A write path that holds under production conditions:

Session ends
  → async extraction fires (focused prompt: preferences, decisions, constraints, corrections)
  → for each extracted fact:
      retrieve existing memories for this entity/topic
      LLM selects ADD / UPDATE / DELETE / NOOP
      write with timestamp + session provenance
  → scheduled: consolidation pass (episodic → semantic compression)
  → scheduled: eviction pass (drop facts un-accessed in N days)

Three things that are easy to skip:

Provenance on every write — Trace which session introduced a bad fact. Audit trails are how corrective deletes get applied reliably.
Contradiction detection at write time — Without it, conflicts surface at retrieval time and force the LLM to resolve them under context pressure.
Observability on the write path — The read path is easy to instrument. The write path is invisible unless you make it visible. Log every extraction, every ADD/UPDATE/DELETE/NOOP decision, and every consolidation pass.
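Provenance and write-path logging can share one record. The sketch below uses Python's standard `logging` and `dataclasses` modules. The record fields, logger name, and helper functions are hypothetical and would need to match whatever the memory store actually persists.

```python
import json
import logging
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import List

logger = logging.getLogger("memory.write_path")  # hypothetical logger name


@dataclass(frozen=True)
class WriteRecord:
    operation: str   # one of ADD / UPDATE / DELETE / NOOP
    key: str
    value: str
    session_id: str  # provenance: which session produced this write
    written_at: str  # ISO 8601 timestamp in UTC


def record_write(
    audit_log: List[WriteRecord],
    operation: str,
    key: str,
    value: str,
    session_id: str,
) -> WriteRecord:
    """Append a provenance record and emit it as a structured log line."""
    record = WriteRecord(
        operation=operation,
        key=key,
        value=value,
        session_id=session_id,
        written_at=datetime.now(timezone.utc).isoformat(),
    )
    audit_log.append(record)
    logger.info("memory_write %s", json.dumps(asdict(record)))
    return record


def writes_from_session(audit_log: List[WriteRecord], session_id: str) -> List[WriteRecord]:
    """Find every write a given session introduced, e.g. to plan corrective deletes."""
    return [record for record in audit_log if record.session_id == session_id]
```

Logging NOOP decisions alongside the mutating ones is intentional here: a write path that silently skips facts is as hard to debug as one that silently corrupts them.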

The 44-point accuracy gap between benchmark and production (93.4% versus 49.0%) is not a model limitation. It is a write path that was never designed.