MemPalace Review: Benchmark Claims vs Reality

When MemPalace launched on April 5, 2026, it came with a bold claim: "the highest-scoring AI memory system ever benchmarked." Within 48 hours, the project had 7,000+ GitHub stars, was trending across AI Twitter, and had endorsements from prominent tech voices.

Within those same 48 hours, the developer community tore the benchmarks apart.

This MemPalace review examines what was claimed, what independent analysis found, and where the gap between marketing and implementation actually lies. It isn't a hit piece; it's a technically honest assessment of what MemPalace does, what it doesn't, and what the scores mean for anyone evaluating agent memory systems.

What MemPalace Claims

MemPalace's marketing rests on four pillars:

  1. 96.6% LongMemEval recall in raw mode, 100% with Haiku reranking — "first perfect score ever recorded"
  2. 100% LoCoMo — "more than 2x Mem0's score"
  3. 92.9% ConvoMem — compared against Mem0's 49.0%
  4. 30x AAAK compression — described as "lossless" with "zero information loss"

These numbers, combined with the celebrity factor of co-creator Milla Jovovich, drove a viral adoption cycle that crossed from developer circles into mainstream tech media.

But benchmark scores without methodology context are meaningless. Let's look at what each number actually measures.

LongMemEval: The Headline Number

LongMemEval is the standard benchmark for evaluating AI memory systems. MemPalace reports two scores: 96.6% in raw mode and 100% in hybrid mode with Haiku reranking.

The 100% Score: Textbook Overfitting

The perfect score was achieved by a process that violates benchmark integrity: the team identified the three specific questions the system failed, engineered targeted fixes for those exact questions, then retested on the same set.

This is textbook overfitting. You cannot fix the test cases you got wrong and claim a perfect score on the test you've already seen. The held-out score — the honest number — was 98.4%.
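The held-out discipline the team skipped is straightforward to mechanize. Here is a minimal sketch (the dataset and split fraction are illustrative, not anyone's actual protocol): tune against the dev split, report only the held-out split.

```python
import random

def split_benchmark(questions, dev_frac=0.2, seed=7):
    """Split a benchmark into a dev set (for tuning) and a held-out set.

    Any fix engineered while looking at dev failures must be scored on
    the held-out set it has never seen -- never on the same questions
    the fix was tuned against.
    """
    rng = random.Random(seed)
    shuffled = questions[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * dev_frac)
    return shuffled[:cut], shuffled[cut:]  # (dev, held_out)

dev, held_out = split_benchmark(list(range(500)))
assert not set(dev) & set(held_out)  # tuning set and reporting set are disjoint
```

Fixing the three failing questions and re-running the same set collapses this separation, which is exactly why the resulting 100% is not a reportable score.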

To their credit, the team eventually acknowledged this. But the 100% claim drove the initial viral wave and remains prominent in early coverage.

The 96.6% Score: It Measures ChromaDB, Not MemPalace

The more important finding comes from the lhl/agentic-memory independent analysis, the most thorough code review of MemPalace published to date.

Their conclusion: the 96.6% LongMemEval score runs on raw, uncompressed conversation text using ChromaDB's default embedding model. The palace structure — wings, rooms, halls — is not involved in this benchmark.

The headline number that drove 19,500 GitHub stars measures how well ChromaDB's default embeddings perform on verbatim text retrieval. It tells you nothing about MemPalace's novel architecture.

Palace Features Reduce Accuracy

An independent benchmark reproduction on M2 Ultra (GitHub Issue #39) confirmed what the code analysis suggested:

Configuration            | LongMemEval Score | Delta from Raw
Raw mode (ChromaDB only) | 96.6%             | Baseline
Room-based retrieval     | 89.4%             | -7.2 percentage points
AAAK compression         | 84.2%             | -12.4 percentage points

The palace architecture that MemPalace markets as its differentiator makes retrieval worse. This is the single most important finding in any MemPalace review: the novel features are net-negative for retrieval performance.

The Metric Matters Too

MemPalace reports recall_any@5 — whether the correct memory appears anywhere in the top 5 results. This is a retrieval metric, not an end-to-end accuracy metric. It doesn't measure whether the system actually answers questions correctly.

Other systems in the space (Mem0 at 49.0%, Zep at 63.8%, Hindsight at 91.4%) report end-to-end QA accuracy — a harder metric. Comparing retrieval recall to QA accuracy side-by-side, as MemPalace's marketing does, is a metric mismatch that inflates MemPalace's relative position.
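For readers unfamiliar with the distinction, recall_any@k can be computed in a few lines, and it never sees the model's final answer, which is what end-to-end QA accuracy grades. A minimal sketch:

```python
def recall_any_at_k(retrieved_ids, gold_ids, k=5):
    """recall_any@k: 1 if ANY gold memory appears in the top-k results.

    This scores the retriever only. It says nothing about whether the
    downstream LLM, given those results, answers the question correctly
    (which is what end-to-end QA accuracy measures).
    """
    return int(any(mid in gold_ids for mid in retrieved_ids[:k]))

# A gold memory buried at rank 5 still counts as a full "hit"...
assert recall_any_at_k(["m9", "m2", "m7", "m4", "m1"], {"m1"}) == 1
# ...even though a QA metric could still score 0 if the model answers wrongly.
```

Because a system can score high recall_any@5 while answering questions badly, putting this number next to other systems' QA accuracy is not an apples-to-apples comparison.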

LoCoMo: The Retrieval Bypass

MemPalace claims 100% on LoCoMo. The methodology reveals why this number is meaningless.

The benchmark used top_k=50 retrieval against datasets containing 19–32 sessions. When top_k exceeds the corpus size, you retrieve everything. The "memory system" contributes nothing — you're testing whether Claude Sonnet can do reading comprehension on the entire dataset handed to it at once.

Without reranking, MemPalace scores 60.3% R@10 on LoCoMo. For comparison, HiMem reports 83–89% on the same benchmark with legitimate retrieval settings.

The LoCoMo benchmark, as run by MemPalace, doesn't test memory retrieval. It tests LLM reading comprehension with all context provided.
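The bypass is easy to see in code. A toy retriever (illustrative names, not MemPalace's actual code) shows that once top_k meets or exceeds the corpus size, the ranking function stops mattering entirely:

```python
def retrieve(corpus, scores, top_k):
    """Return the top_k highest-scoring items from the corpus."""
    ranked = sorted(corpus, key=lambda s: scores[s], reverse=True)
    return ranked[:top_k]

sessions = [f"session_{i}" for i in range(32)]            # LoCoMo-sized corpus
scores = {s: (i * 37) % 100 for i, s in enumerate(sessions)}  # any scoring fn

# With top_k=50 against 32 sessions, every session is returned regardless
# of how good or bad the scores are -- the "retriever" contributes nothing:
assert set(retrieve(sessions, scores, top_k=50)) == set(sessions)
```

Swap in a random scoring function and the top_k=50 result is identical, which is the whole problem: the benchmark configuration makes retrieval quality unobservable.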

The Scalability Problem This Reveals

The top_k=50 configuration isn't just a benchmark trick — it exposes MemPalace's fundamental architectural limitation. The system's approach is to retrieve everything and let the LLM sort it out. That works when your corpus is 32 sessions. After a year of daily agent conversations — tens of thousands of sessions, millions of tokens — there's no way to fit it all in context, and MemPalace has no selective retrieval mechanism to fall back on.

This is why scale-focused benchmarks matter. The BEAM benchmark tests memory systems at up to 10 million tokens, where context-stuffing is impossible. Hindsight scored 64.1% at the 10M tier — 58% ahead of the next-best system. MemPalace has no published BEAM results because its architecture can't operate at that scale.

ConvoMem: The Small Sample Problem

MemPalace reports 92.9% on ConvoMem, compared to Mem0's 49.0%. The number sounds dramatic — nearly double.

But the test used only 50 items per category (300 total) from a dataset of 75,000+ QA pairs. This is statistically underpowered. The confidence intervals would be enormous, and no significance testing was reported. Drawing a "more than 2x" comparison from 300 data points sampled from 75,000 is not rigorous.
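To put numbers on "statistically underpowered": a 95% Wilson score interval for 92.9% measured on only 300 items spans several percentage points. A quick sketch:

```python
import math

def wilson_ci(p_hat, n, z=1.96):
    """95% Wilson score interval for a binomial proportion."""
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# 92.9% measured on 300 sampled items out of 75,000+:
lo, hi = wilson_ci(0.929, 300)
print(f"95% CI: [{lo:.1%}, {hi:.1%}]")  # roughly [89%, 95%] -- several points wide
```

An interval that wide, with no significance testing reported, is thin evidence for a headline "more than 2x" comparison.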

AAAK Compression: Not Lossless

AAAK is MemPalace's custom compression format, marketed as "30x lossless compression" with "zero information loss."

Three problems:

  1. It's lossy, not lossless. LongMemEval accuracy drops from 96.6% to 84.2% when AAAK is enabled — a 12.4 percentage point quality loss. "Zero information loss" is directly contradicted by the benchmark data.

  2. The compression ratio was miscalculated. Token counting used len(text)//3 instead of a real tokenizer. The team acknowledged and fixed this after the community caught it.

  3. The decode method doesn't reconstruct. AAAK decode is string splitting — there's no text reconstruction algorithm. It's lossy summarization marketed as lossless compression.
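The token-counting problem is easy to demonstrate. The sketch below contrasts the reported character heuristic with even a crude whitespace tokenizer (a stand-in for a real tokenizer such as tiktoken, used here only to show how far the heuristic can drift on unfavorable text):

```python
def naive_token_count(text):
    """The heuristic reportedly used: characters // 3."""
    return len(text) // 3

def whitespace_token_count(text):
    """Crude stand-in for a real tokenizer. Even this exposes the drift;
    a BPE tokenizer would give different numbers again."""
    return len(text.split())

# On text dominated by long words, the heuristic overcounts by more than 10x,
# which directly distorts any compression ratio computed from it:
text = "supercalifragilisticexpialidocious " * 100
print(naive_token_count(text), whitespace_token_count(text))  # 1166 vs 100
```

Any "30x" ratio built on the heuristic inherits this error, which is why the team's post-hoc fix was necessary, and why the corrected ratio differs from the marketed one.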

The Palace Architecture Paradox

This is the core tension in any MemPalace review: the thing that makes MemPalace MemPalace — the spatial palace metaphor with wings, rooms, and halls — doesn't contribute to retrieval performance.

The palace hierarchy adds metadata filtering to ChromaDB queries. Metadata filtering is a standard database technique — every vector database supports it. The spatial metaphor makes it sound novel, but filtering by wing=project_alpha AND room=api_design is functionally identical to filtering by project=project_alpha AND topic=api_design in any vector store.
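The equivalence is easy to demonstrate with a toy in-memory store (field names are illustrative, not MemPalace's actual schema):

```python
# A tiny stand-in for vector-store metadata filtering -- the operation every
# vector DB (ChromaDB included, via its `where` clause) already supports.
memories = [
    {"text": "Use REST for the public API", "wing": "project_alpha", "room": "api_design"},
    {"text": "Sprint retro notes",          "wing": "project_alpha", "room": "planning"},
    {"text": "Logo color palette",          "wing": "project_beta",  "room": "branding"},
]

def filter_by(items, **conditions):
    """Plain equality filtering on metadata fields."""
    return [m for m in items if all(m.get(k) == v for k, v in conditions.items())]

# "Palace" filtering selects a subset by exact metadata match -- renaming the
# keys to project/topic would change nothing about the operation:
palace = filter_by(memories, wing="project_alpha", room="api_design")
assert [m["text"] for m in palace] == ["Use REST for the public API"]
```

The spatial vocabulary is a naming layer over this operation, not a new retrieval mechanism.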

Independent analysis from Thin Signal's benchmark teardown confirmed that the 96.6% path "uses zero MemPalace-specific logic" — the palace architecture isn't used in the raw benchmark, making it "a ChromaDB benchmark, not a MemPalace benchmark."

Missing Features: README vs Codebase

Multiple independent reviews have documented features described in MemPalace's README that don't exist in the code:

Feature                   | README Claims                              | Code Reality
Contradiction detection   | Automatic inconsistency flagging           | knowledge_graph.py has zero "contradict" occurrences; only identical-triple deduplication exists
Fact checking             | Integrated fact verification               | fact_checker.py exists but isn't wired into any operations
Multi-hop graph traversal | Knowledge graph navigation                 | Flat triple lookups only; no graph traversal
Hall-based retrieval      | Memory type corridors for targeted recall  | Halls are metadata strings, not used in retrieval ranking
Entity resolution         | Entity tracking across conversations       | Naive slug conversion only ("alice_obrien")
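To make the multi-hop row concrete, here is the difference between a flat triple lookup and actual graph traversal, sketched over a hypothetical three-triple store (the triples and function names are invented for illustration):

```python
from collections import deque

triples = [
    ("alice", "works_on", "project_alpha"),
    ("project_alpha", "uses", "postgres"),
    ("postgres", "runs_on", "server_7"),
]

def flat_lookup(subject):
    """Flat triple lookup -- what the code reviews say actually exists."""
    return [(p, o) for s, p, o in triples if s == subject]

def multi_hop(subject, max_hops=3):
    """BFS over triples -- the traversal the README describes."""
    seen, frontier, paths = {subject}, deque([(subject, [])]), []
    while frontier:
        node, path = frontier.popleft()
        if len(path) >= max_hops:
            continue
        for s, p, o in triples:
            if s == node and o not in seen:
                seen.add(o)
                paths.append(path + [(s, p, o)])
                frontier.append((o, path + [(s, p, o)]))
    return paths

assert flat_lookup("alice") == [("works_on", "project_alpha")]  # one hop, stops
assert len(multi_hop("alice")) == 3  # alice -> project_alpha -> postgres -> server_7
```

A flat lookup can never answer "what server does Alice's project depend on?"; only the traversal can, and the reviews found no traversal in the codebase.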

A README listing features that don't exist in the codebase is a significant concern for any team evaluating MemPalace for production use. Search relevance engineer John Berryman's analysis raised a separate but related concern: MemPalace's semantic-only retrieval can't surface cross-domain context (like tech preferences when discussing product features), meaning the system fails silently on exactly the queries where persistent memory should help most.

Security and Maturity Concerns

Beyond benchmarks and features, this MemPalace review flagged two practical concerns:

Security

  • No input sanitization — User inputs are passed directly to storage and retrieval without validation, creating a prompt injection surface
  • No write gating — Any input is stored without content validation
  • stdout corruption — MemPalace writes human-readable startup text to stdout instead of stderr, corrupting JSON message streams when used as an MCP server with Claude Desktop

Maturity

At its viral launch, MemPalace had:

  • 7 commits
  • 4 test files covering 21 modules
  • A repository created approximately 7 days before public release
  • No visible CI/CD pipeline

For context, this means the project went from creation to "highest-scoring AI memory system ever benchmarked" in under a week. Draw your own conclusions about the depth of testing behind those claims.

What MemPalace Gets Right

A fair MemPalace review should acknowledge genuine strengths:

  • Local-first architecture — Zero API costs, complete data privacy. ChromaDB and SQLite run entirely on your machine.
  • Verbatim storage — No information is lost through summarization. The raw conversation text is preserved in full.
  • Community response — After criticism surfaced, Ben Sigman posted a transparent acknowledgment: "The dev community tore it apart. This is how open-source projects can improve." The team updated their README to thank critics and stated they "would rather be right than impressive."
  • The palace metaphor — While it doesn't improve retrieval, the spatial organization is a genuinely intuitive way for humans to think about memory structure.

What Honest Benchmarks Look Like: Hindsight

If MemPalace's benchmarks represent what not to do, what does honest benchmarking look like?

Hindsight by Vectorize scored 91.4% on LongMemEval. That number may look lower than MemPalace's 96.6%, but it measures fundamentally different things:

  • It measures the actual system — TEMPR multi-strategy retrieval (semantic + keyword + graph + temporal), entity resolution, knowledge graph, and cross-encoder reranking are all engaged during the benchmark
  • No hand-tuning — The score comes from a standard benchmark run, not targeted fixes for failing questions
  • The architecture contributes — Unlike MemPalace, where the novel features reduce accuracy, Hindsight's architecture is load-bearing. Remove the entity resolution or graph traversal and the score drops.
  • It scales — Hindsight achieved SOTA on the BEAM benchmark at 10 million tokens (64.1%, 58% ahead of the next system), proving the retrieval architecture works at real-world scale — not just on small corpora where you can retrieve everything

Hindsight also ships features that MemPalace describes but doesn't implement: real entity resolution, multi-hop knowledge graph traversal, mental models that auto-update, and configurable disposition traits. Setup takes minutes through native OAuth 2.1 or locally with embedded PostgreSQL (pg0), compared to MemPalace's Python environment and manual hook configuration.
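Multi-strategy retrieval generally means fusing per-strategy relevance scores into a single ranking. The toy sketch below illustrates the general idea only; it is not Hindsight's actual TEMPR implementation, and the strategy names and weights are invented:

```python
def fuse(candidates, weights):
    """Combine per-strategy relevance scores into one ranking.

    A generic weighted-sum fusion sketch -- NOT Hindsight's TEMPR code.
    Each candidate carries a score per strategy; missing strategies score 0.
    """
    def total(c):
        return sum(weights[s] * c["scores"].get(s, 0.0) for s in weights)
    return sorted(candidates, key=total, reverse=True)

docs = [
    {"id": "a", "scores": {"semantic": 0.9}},
    {"id": "b", "scores": {"semantic": 0.5, "keyword": 0.9, "temporal": 0.8}},
]
weights = {"semantic": 0.4, "keyword": 0.3, "graph": 0.15, "temporal": 0.15}
assert fuse(docs, weights)[0]["id"] == "b"  # strong on several strategies beats
                                            # strong on semantics alone
```

The point of the sketch: a semantic-only system like MemPalace has exactly one column in this table, which is why it fails silently on queries where keyword, graph, or temporal signals carry the answer.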

Verdict

MemPalace is a 7-day-old project that went viral on the strength of benchmark scores achieved on small datasets where retrieving everything bypasses the need for real retrieval. The novel features — rooms, halls, AAAK compression — make retrieval worse, not better. Key advertised features don't exist in the code. And the retrieve-all approach that inflates its benchmark numbers can't scale beyond toy-scale corpora.

None of this means MemPalace is worthless. The ChromaDB ingest pipeline works. The local-first approach is genuinely valuable. And the team's response to criticism was admirably transparent.

But for anyone evaluating agent memory systems for production use, the gap between MemPalace's marketing and its reality is too wide to ignore. If you need AI memory that works as advertised, with honest benchmarks and an architecture that contributes rather than regresses performance, Hindsight is where we'd point you.


Further Reading