Less is More: How Good RAG Design Lets You Use Smaller Language Models

In the race to build better AI applications, there’s often an assumption that bigger is better when it comes to language models. However, with careful attention to retrieval-augmented generation (RAG) pipeline design and prompt engineering, smaller, more cost-effective models can often perform just as well as their larger counterparts. Here’s how you can optimize your RAG implementation to get excellent results with smaller models.
The Power of Relevant Context
The key insight is this: when you provide a language model with highly relevant context, it doesn’t need to rely as heavily on its trained knowledge and “intelligence” to generate good responses. A well-designed RAG pipeline shifts that burden from the model to the retrieval system, which is why smaller, more efficient models can hold their own.
Three Pillars of Efficient RAG Design
1. Smart Retrieval Strategies
The first step in building an efficient RAG pipeline is ensuring your retrieval process returns highly relevant information. Several techniques can help:
- Query Rewriting: Use a large language model (LLM) to reformulate user questions for better retrieval. For example, a vague question like “When does it expire?” can be rewritten to include contextual information: “When does a retrieval endpoint token expire?” (A minimal sketch of this step appears below.)
- Contextual Enhancement: Include known context (like the current topic or user activity) in retrieval queries.
- Optimized Embedding Models: Choose the right embedding model for your use case. Modern options like OpenAI’s text-embedding-3-small and text-embedding-3-large offer excellent performance for general use cases, while specialized providers like Voyage AI offer domain-specific models for areas like code, legal, and finance.
For instance, a general-purpose embedding model might work well for customer service content, but a specialized model may serve you better for technical documentation or financial reports. Test the candidates against your actual data and use case to find the best balance of performance and cost, rather than assuming any single approach is best.
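As a concrete illustration of the query-rewriting step, here is a minimal sketch that asks a small chat model to turn a vague follow-up into a self-contained retrieval query. It assumes the OpenAI Python SDK; the model choice, prompt wording, and function name are placeholders for whatever you actually use.

```python
# Minimal query-rewriting sketch (illustrative, not a prescribed implementation).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REWRITE_INSTRUCTIONS = (
    "Rewrite the user's question so it can stand alone as a search query. "
    "Resolve pronouns and vague references using the provided context. "
    "Return only the rewritten question."
)

def rewrite_query(question: str, context: str) -> str:
    """Turn a vague follow-up question into a self-contained retrieval query."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any small, fast chat model works for rewriting
        messages=[
            {"role": "system", "content": REWRITE_INSTRUCTIONS},
            {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# Asked while the user is reading the retrieval endpoint docs,
# "When does it expire?" should come back as something like
# "When does a retrieval endpoint token expire?"
print(rewrite_query("When does it expire?", "User is viewing the retrieval endpoint token documentation"))
```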
2. Relevance Filtering with Reranking
One of the most powerful techniques for improving RAG efficiency is implementing strict relevance filtering through reranking:
- Use a reranking model to score the relevance of each retrieved result
- Set a minimum relevance threshold (e.g., 0.5)
- Only pass results above this threshold to your language model
This approach prevents your model from having to process irrelevant information, which can lead to confused or hallucinated responses. When your model only receives highly relevant context, it can focus on generating accurate responses rather than trying to reconcile conflicting or irrelevant information.
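As a rough sketch of what this filtering step can look like in code, the snippet below scores each retrieved chunk with a reranker and drops anything under the threshold. It assumes Cohere's rerank endpoint, which returns relevance scores between 0 and 1; any reranking model that scores query/document pairs can be swapped in, and 0.5 is only a starting threshold to tune.

```python
# Relevance filtering with a reranker (sketch; adapt to your reranker of choice).
import cohere

co = cohere.Client(api_key="YOUR_COHERE_API_KEY")
RELEVANCE_THRESHOLD = 0.5  # tune against your own data

def filter_relevant(query: str, documents: list[str], top_n: int = 10) -> list[str]:
    """Keep only the retrieved chunks the reranker scores above the threshold."""
    reranked = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=documents,
        top_n=top_n,
    )
    return [
        documents[result.index]
        for result in reranked.results
        if result.relevance_score >= RELEVANCE_THRESHOLD
    ]
```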
3. Precise Prompt Engineering
With relevant context in hand, careful prompt engineering becomes your final tool for getting the most out of smaller models. Key elements include:
- Clear instructions about using only provided context
- Explicit directions about handling missing information
- Context-specific guidance about the topic at hand
- Clear formatting requirements for responses
For example, a well-crafted anti-hallucination prompt might look like:
This is very important: if there is no relevant information in the texts or
there are no available texts, respond with "I'm sorry, I couldn't find an answer to your question."
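Putting those elements together, the prompt-assembly step might look roughly like the sketch below. The wording and helper function are illustrative rather than an exact production prompt; the point is that the instructions, the filtered context, and the fallback behavior all travel together.

```python
# Assembling a context-grounded prompt (illustrative sketch).
SYSTEM_PROMPT = """You are a support assistant for our product.
Answer the user's question using ONLY the numbered texts provided.
This is very important: if there is no relevant information in the texts or
there are no available texts, respond with "I'm sorry, I couldn't find an
answer to your question."
Keep answers concise and cite the text numbers you relied on."""

def build_messages(question: str, context_chunks: list[str]) -> list[dict]:
    """Combine instructions, filtered context, and the user question."""
    numbered = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(context_chunks))
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Texts:\n{numbered}\n\nQuestion: {question}"},
    ]
```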
Real-World Results
At Vectorize, we’ve been testing these principles with our AI assistant, which helps users navigate our product. Using a relatively small model (Llama 3.1 70B) combined with careful retrieval optimization and prompt engineering, we’re seeing promising results:
- Good response times for user queries
- More cost-effective operation compared to larger models
- Reliable answers when the context is well-matched
- Encouraging early feedback from users
What’s particularly interesting is that by focusing on retrieval quality and relevance filtering rather than raw model size, we’ve been able to get good results from a smaller, more efficient model. Our experience suggests that investing in the RAG pipeline itself can be more valuable than simply using a larger language model.
Implementation Tips
To implement this efficient approach in your own RAG applications:
1. Focus on Retrieval Quality
- Implement query rewriting for ambiguous questions
- Test multiple embedding models against your specific use case
- Consider both general-purpose and specialized embedding options
- Regularly evaluate retrieval performance with real user queries (a simple evaluation sketch follows this list)
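One lightweight way to do that evaluation is to keep a small set of real user queries labeled with the document each one should surface, then track how often that document lands in the top-k results. The sketch below assumes a search(query, k) function returning (document_id, score) pairs; that interface is hypothetical and stands in for your own retriever.

```python
# Hit-rate@k over a labeled query set (sketch; `search` is your retriever).
def hit_rate_at_k(labeled_queries: list[tuple[str, str]], search, k: int = 5) -> float:
    """labeled_queries: (query, expected_document_id) pairs."""
    hits = 0
    for query, expected_id in labeled_queries:
        top_ids = [doc_id for doc_id, _score in search(query, k)]
        if expected_id in top_ids:
            hits += 1
    return hits / len(labeled_queries)

# Run the same labeled set against each embedding model you are considering
# and compare the resulting hit rates before committing to one.
```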
2. Optimize Your Filtering
- Choose appropriate reranking models
- Test different relevance thresholds
- Monitor filtered results to ensure valuable content isn’t being excluded (see the logging sketch after this list)
- Consider adjusting thresholds based on use case
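One way to keep an eye on what the filter discards is to log "near misses": results that score just below the threshold. If useful chunks keep showing up there, the threshold is probably too strict. The sketch below is one way to do that; the threshold and margin values are placeholders to tune against your own data.

```python
# Auditing what the relevance filter throws away (sketch).
import logging

logger = logging.getLogger("rag.filtering")
THRESHOLD = 0.5
NEAR_MISS_MARGIN = 0.1  # how far below the threshold still counts as a near miss

def split_by_relevance(scored_results: list[tuple[str, float]]) -> list[str]:
    """scored_results: (chunk, relevance_score) pairs from the reranker."""
    kept = []
    for chunk, score in scored_results:
        if score >= THRESHOLD:
            kept.append(chunk)
        elif score >= THRESHOLD - NEAR_MISS_MARGIN:
            # Near miss: excluded from the prompt, but logged for review.
            logger.info("Near-miss chunk (score=%.2f): %.80s", score, chunk)
    return kept
```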
3. Monitor and Iterate
- Track user feedback
- Monitor relevance scores
- Analyze cases where smaller models struggle
- Adjust thresholds and prompts based on results
Conclusion
The key insight we’ve gained helping our users build RAG applications is that model size isn’t the most important factor. While there’s often pressure to use the largest, most powerful language models available, our experience shows that intelligent pipeline design can be far more impactful than raw model size.
Think of it like giving a subject matter expert clear, relevant reference materials versus giving a general expert a mountain of loosely related documents. The subject matter expert with focused materials will likely provide better answers, even though the general expert might have broader knowledge.
The future of RAG isn’t about using the biggest models; it’s about building smarter pipelines that make efficient use of smaller, more cost-effective models. By focusing on the quality of your retrieval and filtering systems, you can build applications that are not just more efficient, but also more reliable and practical for production use.
Remember: it’s not about how much your model knows, but how effectively you can give it the right information at the right time.