RAG is Dead! Long Live RAG!


Yesterday Google announced Gemini 1.5, which features very long context windows of up to 1 million tokens. This is quite an advancement over the longest-context models previously available, such as GPT-4 Turbo with a 128K context window and Claude 2.1 with a 200K context window. Google claims to have tested context windows as large as 10 million tokens.

According to the details published in the technical paper, Gemini is also much better than other models at actually making use of all the tokens you can stuff into the context. Google uses needle-in-the-haystack tests inspired by work done by Greg Kamradt: a fact is placed at different locations in the context window, and the model is then tested on how well it can recall that fact.
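For readers who haven't seen these tests, here is a minimal sketch of how such a harness can be built: a known fact (the needle) is buried at a chosen depth inside filler text, and the model is asked to recall it. This is a simplified illustration, not Kamradt's or Google's actual harness; `call_model` is a placeholder for whatever LLM API you use, and the context sizes and depths are illustrative.

```python
# Minimal sketch of a needle-in-the-haystack harness. `call_model` is a
# placeholder for whatever LLM API you use; sizes and depths are illustrative.

def build_haystack(needle: str, filler: str, depth: float, target_tokens: int) -> str:
    """Bury `needle` at a relative `depth` (0.0 = start, 1.0 = end) inside filler text."""
    target_chars = target_tokens * 4  # rough heuristic: ~4 characters per token of English text
    haystack = (filler * (target_chars // len(filler) + 1))[:target_chars]
    insert_at = int(len(haystack) * depth)
    return haystack[:insert_at] + "\n" + needle + "\n" + haystack[insert_at:]

def run_recall_sweep(call_model, needle: str, question: str, expected: str):
    """Sweep context sizes and needle depths, recording whether the model recalls the fact."""
    filler = "The quick brown fox jumps over the lazy dog. "
    results = []
    for tokens in (8_000, 32_000, 128_000):
        for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
            context = build_haystack(needle, filler, depth, tokens)
            answer = call_model(f"{context}\n\nQuestion: {question}")
            results.append((tokens, depth, expected.lower() in answer.lower()))
    return results
```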

The testing Greg did against the original 128K context window GPT-4 model showed that recall degraded after 73K tokens and that facts at the beginning of the context window were always recalled, regardless of the size of the context. So although the context window was large, only some of it was effective, and if you want to guarantee that the model can recall the information you send it, it's best to send a small context window. This image summarizes his findings:

[Graph: GPT-4 recall by context window size and fact placement depth]
Credit: <a href="https://x.com/GregKamradt/status/1722386725635580292?s=20">Greg Kamradt post on X</a>

Google performed similar testing against Gemini 1.5, and found that for the needle-in-the-haystack test it achieves >99% recall up to millions of tokens for text, audio, and video modalities. No doubt, this is quite an impressive step forward.

Does this mean that RAG is no longer needed? If we can jam the context window with millions of tokens, why bother doing targeted retrieval of relevant context at all? This breakthrough seems to lead to that conclusion, but it’s not so simple.

Here are 5 reasons why you still need to do RAG even though Gemini 1.5 has smashed the context window barrier.

1 - Chunks of the context window are still lost

The needle-in-the-haystack test is an interesting and valuable experiment. However, the work most models do in an AI application is not finding Where's Waldo in a sea of tokens. The model is synthesizing multiple facts from the context window in order to generate a quality response. Google admitted that the needle-in-the-haystack test was the "simplest possible setup" and ran additional tests where the model had to recall multiple needles in the haystack.

In these tests, the recall results are less impressive:

[Scatter plot: multi-needle recall for Gemini 1.5 and GPT-4 Turbo across context lengths]
Credit: <a href="https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf">Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context</a>

From the chart you can see that although Gemini 1.5 is better than GPT-4 Turbo (within the overlapping context window range) and can maintain its recall capabilities all the way to 1M tokens, the average recall hovers around 60%. The context window may be filled with many relevant facts, but 40% or more of them are "lost" to the model. If you want to make sure the model actually uses the context you send it, you are best off curating it first and only sending the most relevant context. In other words, doing traditional RAG.

If you think about it, this makes sense. Humans also reason worse when given too much context. This is why the executive summary exists. And that's why (and this is dating me) Sgt. Joe Friday wants "Just the facts".

2 - High latency

The Gemini 1.5 model is in developer preview, so we may not be seeing its ultimate performance, but the early indications are that Gemini 1.5 is very slow to process large context windows, taking anywhere from 30 seconds to a minute.

Credit: <a href="https://twitter.com/granawkins/status/1758495328268095533?s=12&t=I8ZMRCLNN6QihPnSqo5IlQ">Post by Grant on X</a>

This makes sense. A foundation model requires a lot of compute and memory just to run, so it is not going to be fast at processing large amounts of data. A vector database, on the other hand, is purpose-built software for processing large amounts of data in a small amount of time. A good vector database can retrieve relevant context from a large corpus in less than a second.

If your AI app is at all interactive, there is no way a user is going to wait 30 seconds for a response. Even if your app is not interactive, high latency equals low throughput, so good luck scaling this up.
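To make the contrast concrete, here is a toy, in-memory illustration of the kind of top-k similarity search a vector database performs under the hood. The `embed` function in the usage sketch is a stand-in for whatever embedding model you use; a real vector database layers approximate-nearest-neighbor indexes, metadata filtering, and persistence on top of this basic idea.

```python
import numpy as np

def top_k_chunks(query_vec: np.ndarray, chunk_vecs: np.ndarray, chunks: list[str], k: int = 5) -> list[str]:
    """Return the k chunks whose embeddings are most similar to the query (cosine similarity)."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q                        # cosine similarity against every stored chunk
    best = np.argsort(scores)[::-1][:k]   # indices of the k highest-scoring chunks
    return [chunks[i] for i in best]

# Usage sketch: `embed` is a placeholder for your embedding model.
# chunk_vecs = np.array([embed(c) for c in chunks])
# context = "\n\n".join(top_k_chunks(embed(question), chunk_vecs, chunks, k=8))
# prompt = f"{context}\n\nQuestion: {question}"
```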

3 - It's gonna cost you

During the developer preview, Gemini 1.5 is free and Google has not announced any pricing yet. If we assume that it will be priced competitively with OpenAI's GPT-4 Turbo model, which it is comparable to in benchmarks, then you are looking at $0.01 per 1K input tokens. One million input tokens will cost $10.

That's for each generation. So your RAG chatbot costs $10 to answer one question in model inference alone. I hope you have deep pockets.
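To put rough numbers on the comparison (assuming GPT-4 Turbo-style pricing of $0.01 per 1K input tokens as a stand-in, since Gemini 1.5 pricing has not been announced, and assuming an illustrative 4K tokens of curated context per question):

```python
PRICE_PER_1K_INPUT_TOKENS = 0.01  # assumed GPT-4 Turbo-style pricing, in USD

def input_cost(tokens: int) -> float:
    """Input-token cost of a single generation at the assumed price."""
    return tokens / 1_000 * PRICE_PER_1K_INPUT_TOKENS

print(input_cost(1_000_000))  # stuff-the-window: $10.00 per question
print(input_cost(4_000))      # curated RAG context: $0.04 per question
```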

4 - No way to tune results

One of the common problems we see with RAG applications is that at some point they start generating poor-quality responses. Users complain that the application isn't answering their questions correctly or that it is outright hallucinating. When this happens, you need to find and fix the problem.

When using a traditional RAG pipeline, there are many places you can look to improve the performance of your application. You can tweak the retrieval from the vector database, change the embedding model, adjust the chunking strategy, or just get better source data (the sketch at the end of this section illustrates these knobs).

If you are using a stuff-the-context-window-with-1M-tokens strategy, you really have only one knob to play with: get better source data. Because you are basically giving the model all the data that you have (assuming it fits in the 1M token limit), you can't tune anything. You can only try sending different blobs of context. Iteratively debugging this will be painful, especially since you may have to wait up to a minute just to get a response back from the model.
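For contrast, here is a sketch of the tuning knobs a traditional RAG pipeline exposes. The names and defaults are illustrative, not taken from any particular framework; each one can be adjusted independently when you are debugging poor responses.

```python
from dataclasses import dataclass

@dataclass
class RagPipelineConfig:
    """Illustrative tuning knobs for a traditional RAG pipeline (names are made up for this sketch)."""
    embedding_model: str = "some-embedding-model"  # swap models if retrieval quality is poor
    chunk_size: int = 512                # tokens per chunk; smaller chunks give more precise retrieval
    chunk_overlap: int = 64              # overlap between chunks so facts aren't split mid-sentence
    top_k: int = 8                       # how many chunks to retrieve per question
    similarity_threshold: float = 0.75   # drop chunks that aren't relevant enough
    rerank: bool = True                  # optionally rerank retrieved chunks before prompting

# With a stuff-the-context-window approach, effectively none of these knobs exist:
# the only remaining lever is the quality of the source data itself.
```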

5 - Sorry, still doing RAG

If you are thinking that Gemini 1.5's super-long context window means you get out of having to build RAG pipelines, you aren't going to be so lucky. You are still augmenting the model's generation, just in a more brute-force way. You still need to retrieve fresh data quickly and send it to the model before it generates a response. Retrieving 1M tokens' worth of data with reasonable latency and keeping those tokens up to date with the latest data is still a data engineering problem. Your RAG pipelines will look different, but they are still RAG pipelines.

Conclusion

Google's Gemini 1.5 model has taken great strides forward by supporting a context window of up to 1M tokens. The step up in the state of the art from 200K to 1M tokens is truly impressive. However, this is not the death knell for RAG and RAG pipelines that some are suggesting. Although Gemini 1.5 is much better than competing models at using all of its extended context, it still loses chunks of it in tests that are closer to real-world scenarios. As expected, it takes a long time for Gemini 1.5 to generate a response with huge context windows, which presents some significant engineering challenges, not to mention potentially very high costs per response. If you do sort that out and get your application to some early users, you are going to run into another engineering challenge if the results are poor: there is very little you can do to tune the system. Going all-in on huge context windows doesn't get you out of doing data engineering, or even out of augmenting the generation with some sort of retrieval.

Gemini 1.5 may change the landscape of RAG, but it certainly isn't going to kill it. To borrow a turn of phrase from Mark Twain: the rumors of the death of RAG are greatly exaggerated.
