Why Hashing LLM Output Will Always Fail
We ran backfill twice. Same signals, same model, temperature=0. Fifteen duplicates appeared. Our content hash caught zero of them.
The bug that broke our assumption
We ran backfill twice on the same repo. Same signals, same prompts, same model, temperature set to zero. Fifteen duplicate memories appeared in the database.
Our content hash — SHA-256 of the raw LLM output — caught zero of them. The system was supposed to be idempotent. It was not.
temperature=0 is a lie
Setting temperature to zero does not produce deterministic output. The causes are mechanical:
- Floating-point arithmetic is not associative. GPU kernels parallelize matrix multiplications across thousands of cores. The reduction order changes the result.
- Batch size changes the output. In one test, Qwen3-235B produced 80 unique completions across 1,000 identical requests at temperature=0. Which batch your request lands in depends on what else the server is processing at that moment.
- MoE routing is load-dependent. Same prompt, different concurrent requests, different expert allocation, different output.
- Hardware matters. GEMM kernel results across different GPU models differ on the order of 1e-4 per operation, and those discrepancies compound token by token as each sampled token conditions the next.
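The first bullet is easy to see without a GPU. Plain Python floats show that addition order alone changes the result, which is exactly the effect a parallel reduction reproduces at scale:

```python
# Summing the same three numbers in two groupings yields two different doubles.
# A GPU kernel that reduces in a nondeterministic order hits this constantly.
left = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right = 0.1 + (0.2 + 0.3)  # 0.6
print(left == right)       # False: floating-point addition is not associative
```

Two groupings, two bit patterns. Inside a logit computation, a one-ulp difference is enough to flip an argmax and send the rest of the completion down a different path.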
OpenAI's seed parameter reduced variation by ~60% but didn't eliminate it. The GitHub issue requesting full determinism was closed as "not planned." Anthropic doesn't offer a seed parameter at all.
Key takeaway
One academic study measured accuracy gaps of up to 70% between nominally identical runs at temperature=0. The contract that powers Git, Docker, and IPFS — same content, same bytes, same hash — is fundamentally incompatible with LLM-generated content.
The three-layer funnel
We rebuilt dedup as a funnel, cheapest-first:
Layer 1: Signal URL dedup. The input — a GitHub event URL, a Slack message timestamp — is perfectly deterministic. Check if the signal URL exists before any processing. Backfill re-runs went from 60s + 7 LLM calls to 2s + 0 LLM calls.
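A minimal sketch of Layer 1, with hypothetical names (`seen`, `handle_signal`, and the in-memory set standing in for a database lookup are ours, not the real pipeline's):

```python
from typing import Callable, Set

def handle_signal(signal_url: str, seen: Set[str], process: Callable[[str], None]) -> bool:
    """Return True if the signal was processed, False if skipped as a duplicate.

    The check runs before any LLM call: the input URL is deterministic,
    so an exact-match lookup is all Layer 1 needs.
    """
    if signal_url in seen:
        return False  # Layer 1 hit: zero processing, zero LLM calls
    seen.add(signal_url)
    process(signal_url)
    return True

# Example: a backfill re-run sees the same GitHub event URL twice.
seen: Set[str] = set()
processed = []
handle_signal("https://github.com/org/repo/issues/1", seen, processed.append)  # True
handle_signal("https://github.com/org/repo/issues/1", seen, processed.append)  # False
```

In production the set would be a unique index on the signal URL column, so the check is one indexed query instead of a membership test.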
Layer 2: Normalized content hash. Lowercase, strip punctuation, collapse whitespace, then SHA-256. Catches "Team chose Zustand." vs "team chose zustand" — the most common LLM formatting variations.
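Layer 2 can be sketched in a few lines; the exact normalization rules here (which characters count as punctuation, how whitespace collapses) are an assumption, not the pipeline's actual regexes:

```python
import hashlib
import re

def normalized_hash(text: str) -> str:
    """SHA-256 over a normalized form: lowercase, punctuation stripped,
    whitespace collapsed. Formatting noise hashes to the same digest."""
    t = text.lower()
    t = re.sub(r"[^\w\s]", "", t)        # strip punctuation
    t = re.sub(r"\s+", " ", t).strip()   # collapse whitespace
    return hashlib.sha256(t.encode("utf-8")).hexdigest()

# The two variations from the article collide, as intended:
print(normalized_hash("Team chose Zustand.") == normalized_hash("team chose zustand"))  # True
```

This stays exact-match cheap (one hash, one indexed lookup) while absorbing the most common LLM formatting drift. Anything it misses falls through to Layer 3.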
Layer 3: Embedding similarity. The expensive fallback. pgvector cosine search at threshold 0.78. Catches "Team adopted Zustand for client-side state management" vs "The team chose Zustand for managing client state."
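The core of Layer 3 is a cosine comparison against the 0.78 threshold. A sketch, assuming embeddings are already computed; in production this is a pgvector query (its `<=>` operator returns cosine distance, i.e. 1 − similarity) rather than Python math:

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def is_semantic_duplicate(a: Sequence[float], b: Sequence[float], threshold: float = 0.78) -> bool:
    """Layer 3 check: flag as duplicate when similarity clears the threshold."""
    return cosine_similarity(a, b) >= threshold

# Equivalent pgvector query shape (illustrative, table/column names are ours):
#   SELECT id FROM memories
#   WHERE embedding <=> %(query)s <= 1 - 0.78
#   ORDER BY embedding <=> %(query)s LIMIT 1;
```

The threshold is the tuning knob: too low and distinct memories merge, too high and paraphrases like the two Zustand sentences slip through.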
Layer 1 catches ~90% at near-zero cost. Layer 2 catches ~5%. Layer 3 handles the remaining ~5% that require semantic understanding.
The lesson
The first version of our dedup was 3 lines: hash the content, check the database, skip if exists. It caught 0% of the duplicates that mattered.
The version that works is a multi-layer system with 5 different similarity thresholds tuned for different contexts. It's not elegant. It's correct.
If you're building an LLM pipeline and you're hashing the output for dedup — hash the input instead. The only thing that's deterministic in an LLM system is what you put in, not what comes out.