What 94.6% memory accuracy means for your users

Your users notice when an agent forgets their preferences or asks the same question twice. Hindsight scores 94.6% on LongMemEval, ahead of every other system compared on this page (GPT-4o, Zep, and Supermemory) in each of the five evaluation dimensions. Our results were independently reproduced by academic AI researchers and published for peer review.

Overall LongMemEval Scores

Composite scores across all five evaluation dimensions. Higher means your agent remembers more and forgets less.

System	Composite score
GPT-4o	60.2%
Zep	71.2%
Supermemory	85.2%
Hindsight	94.6%

Per-Dimension Breakdown

How each system performs across the five core memory capabilities that your users depend on.

Dimension	GPT-4o	Zep	Supermemory	Hindsight
Single-Session	72.1%	78.4%	88.3%	96.2%
Cross-Session	55.8%	68.1%	83.7%	93.8%
Temporal Reasoning	48.3%	74.6%	81.2%	92.1%
Knowledge Update	58.7%	65.3%	84.9%	95.4%
Multi-Hop Reasoning	66.1%	69.8%	88.1%	95.1%
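
To relate the per-dimension numbers to the composite scores, the sketch below averages the five dimension scores for each system. This is only an illustration: the page does not say how the composite is computed, so the unweighted mean used here is an assumption, and the function name `unweighted_composite` is hypothetical. A composite weighted by the number of questions per dimension could differ slightly.

```python
from statistics import mean

# Per-dimension scores (%) copied from the table above, in row order:
# Single-Session, Cross-Session, Temporal Reasoning,
# Knowledge Update, Multi-Hop Reasoning.
SCORES = {
    "GPT-4o": [72.1, 55.8, 48.3, 58.7, 66.1],
    "Zep": [78.4, 68.1, 74.6, 65.3, 69.8],
    "Supermemory": [88.3, 83.7, 81.2, 84.9, 88.1],
    "Hindsight": [96.2, 93.8, 92.1, 95.4, 95.1],
}


def unweighted_composite(dimension_scores):
    """Plain average of the dimension scores, rounded to one decimal.

    Assumption: every dimension counts equally. The benchmark's real
    composite may weight dimensions differently.
    """
    return round(mean(dimension_scores), 1)


if __name__ == "__main__":
    for system, scores in SCORES.items():
        print(f"{system}: {unweighted_composite(scores)}%")
```

Under this assumption the plain mean matches the published composites for GPT-4o (60.2%), Zep (71.2%), and Supermemory (85.2%), and gives 94.5% for Hindsight against the published 94.6%. A gap of that size is what a differently weighted composite, or rounding of the per-dimension figures, would produce.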

Methodology

All scores on this page come from LongMemEval, a comprehensive benchmark for evaluating long-term memory in conversational AI systems. The benchmark was developed by independent researchers and is the most rigorous public evaluation of agent memory capabilities available today.

LongMemEval tests five core dimensions of memory performance:

Single-Session — recalling information within the same conversation.
Cross-Session — retaining facts across separate conversations over time.
Temporal Reasoning — understanding when events occurred and reasoning about time.
Knowledge Update — correctly updating beliefs when information changes.
Multi-Hop Reasoning — combining multiple stored facts to answer complex queries.

Together, these dimensions provide a holistic picture of how well a memory system supports real-world agent workflows where conversations span days, weeks, or months.

Feature Comparison

Performance is only part of the picture. You also need to own your data and deploy on your terms.

System	Score	License	Self-Host
Hindsight	94.6%	MIT	Yes (Docker)
Supermemory	85.2%	Closed	Enterprise only
Zep	71.2%	Mixed	Via Graphiti
GPT-4o	60.2%	N/A	No

Explore the full benchmark results

Start building agents that learn

Open source, MIT licensed. Self-host or use Hindsight Cloud.

