What 94.6% memory accuracy means for your users
Your users notice when an agent forgets their preferences or asks the same question twice. Hindsight scores 94.6% on LongMemEval and leads every memory system compared on this page in each of the five evaluation dimensions. Our results were independently reproduced by academic AI researchers and published for peer review.
Overall LongMemEval Scores
Composite scores across all five evaluation dimensions. Higher means your agent remembers more and forgets less.
Per-Dimension Breakdown
How each system performs across the five core memory capabilities that your users depend on.
| Dimension | GPT-4o | Zep | Supermemory | Hindsight |
|---|---|---|---|---|
| Single-Session | 72.1% | 78.4% | 88.3% | 96.2% |
| Cross-Session | 55.8% | 68.1% | 83.7% | 93.8% |
| Temporal Reasoning | 48.3% | 74.6% | 81.2% | 92.1% |
| Knowledge Update | 58.7% | 65.3% | 84.9% | 95.4% |
| Multi-Hop Reasoning | 66.1% | 69.8% | 88.1% | 95.1% |
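The page does not state how the composite score is derived from the five dimensions. As a rough sanity check, the sketch below assumes an unweighted mean, which is an assumption and not the documented LongMemEval scoring rule, and applies it to the figures in the table. The names `SCORES` and `unweighted_composite` are illustrative only.

```python
from statistics import mean

# Per-dimension scores (percent) copied from the table above, in row order:
# Single-Session, Cross-Session, Temporal Reasoning, Knowledge Update, Multi-Hop.
SCORES = {
    "GPT-4o":      [72.1, 55.8, 48.3, 58.7, 66.1],
    "Zep":         [78.4, 68.1, 74.6, 65.3, 69.8],
    "Supermemory": [88.3, 83.7, 81.2, 84.9, 88.1],
    "Hindsight":   [96.2, 93.8, 92.1, 95.4, 95.1],
}


def unweighted_composite(scores):
    """Plain mean of the dimension scores, rounded to one decimal place.

    ASSUMPTION: every dimension counts equally. The real benchmark may
    weight dimensions differently (for example by question count).
    """
    return round(mean(scores), 1)


if __name__ == "__main__":
    for system, scores in SCORES.items():
        print(f"{system}: {unweighted_composite(scores)}%")
```

Under that assumption the result matches the published composites for GPT-4o (60.2%), Zep (71.2%), and Supermemory (85.2%), and comes out at 94.5% for Hindsight, 0.1 point below the published 94.6%. A gap that small would be consistent with a composite weighted by the number of questions per dimension, but the page does not confirm this.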
Methodology
All scores on this page come from LongMemEval, a benchmark for evaluating long-term memory in conversational AI systems. It was developed by independent researchers and is among the most rigorous public evaluations of agent memory available today.
LongMemEval tests five core dimensions of memory performance:
- Single-Session — recalling information within the same conversation.
- Cross-Session — retaining facts across separate conversations over time.
- Temporal Reasoning — understanding when events occurred and reasoning about time.
- Knowledge Update — correctly updating beliefs when information changes.
- Multi-Hop Reasoning — combining multiple stored facts to answer complex queries.
Together, these dimensions provide a holistic picture of how well a memory system supports real-world agent workflows where conversations span days, weeks, or months.
Feature Comparison
Performance is only part of the picture. You also need to own your data and deploy on your terms.
| System | LongMemEval Score | License | Self-Host |
|---|---|---|---|
| Hindsight | 94.6% | MIT | Yes (Docker) |
| Supermemory | 85.2% | Closed | Enterprise only |
| Zep | 71.2% | Mixed | Via Graphiti |
| GPT-4o | 60.2% | N/A | No |
Start building agents that learn
Open source, MIT licensed. Self-host or use Hindsight Cloud.
94.6% on LongMemEval, the highest score of any memory system compared on this page