Less is More: How Good RAG Design Lets You Use Smaller Language Models

In the race to build better AI applications, there’s often an assumption that bigger is better when it comes to language models. However, with careful attention to retrieval-augmented generation (RAG) pipeline design and prompt engineering, smaller, more cost-effective models can often perform just as well as their larger counterparts. Here’s how you can optimize your RAG implementation to get excellent results with smaller models.
The Power of Relevant Context
The key insight is this: when you provide a language model with highly relevant context, it doesn’t need to rely as heavily on its trained knowledge and “intelligence” to generate good responses. A well-designed RAG pipeline shifts that burden from the model to the retrieval system, which is why smaller, more efficient models can hold their own.
Three Pillars of Efficient RAG Design
1. Smart Retrieval Strategies
The first step in building an efficient RAG pipeline is ensuring your retrieval process returns highly relevant information. Several techniques can help:
- Query Rewriting: Use a large language model (LLM) to reformulate user questions for better retrieval. For example, a vague question like “When does it expire?” can be rewritten to include contextual information: “When does a retrieval endpoint token expire?” (A minimal sketch of this step appears below.)
- Contextual Enhancement: Include known context (like the current topic or user activity) in retrieval queries.
- Optimized Embedding Models: Choose the right embedding model for your use case. Modern options like OpenAI’s text-embedding-3-small and text-embedding-3-large offer excellent performance for general use cases, while specialized providers like Voyage AI offer domain-specific models for areas like code, legal, and finance.
For instance, a general-purpose embedding model might work well for customer service content, but a specialized model may serve you better for technical documentation or financial reports. Test the candidates against your actual data and use case to find the best balance of performance and cost, rather than assuming any single approach is best.
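As a concrete illustration of the query-rewriting step, here is a minimal sketch that asks a small chat model to turn a vague follow-up into a self-contained retrieval query. It assumes the OpenAI Python SDK; the model choice, prompt wording, and function name are placeholders for whatever you actually use.

```python
# Minimal query-rewriting sketch (illustrative, not a prescribed implementation).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REWRITE_INSTRUCTIONS = (
    "Rewrite the user's question so it can stand alone as a search query. "
    "Resolve pronouns and vague references using the provided context. "
    "Return only the rewritten question."
)

def rewrite_query(question: str, context: str) -> str:
    """Turn a vague follow-up question into a self-contained retrieval query."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any small, fast chat model works for rewriting
        messages=[
            {"role": "system", "content": REWRITE_INSTRUCTIONS},
            {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# Asked while the user is reading the retrieval endpoint docs,
# "When does it expire?" should come back as something like
# "When does a retrieval endpoint token expire?"
print(rewrite_query("When does it expire?", "User is viewing the retrieval endpoint token documentation"))
```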
2. Relevance Filtering with Reranking
One of the most powerful techniques for improving RAG efficiency is implementing strict relevance filtering through reranking:
- Use a reranking model to score the relevance of each retrieved result
- Set a minimum relevance threshold (e.g., 0.5)
- Only pass results above this threshold to your language model
This approach prevents your model from having to process irrelevant information, which can lead to confused or hallucinated responses. When your model only receives highly relevant context, it can focus on generating accurate responses rather than trying to reconcile conflicting or irrelevant information.
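As a rough sketch of what this filtering step can look like in code, the snippet below scores each retrieved chunk with a reranker and drops anything under the threshold. It assumes Cohere's rerank endpoint, which returns relevance scores between 0 and 1; any reranking model that scores query/document pairs can be swapped in, and 0.5 is only a starting threshold to tune.

```python
# Relevance filtering with a reranker (sketch; adapt to your reranker of choice).
import cohere

co = cohere.Client(api_key="YOUR_COHERE_API_KEY")
RELEVANCE_THRESHOLD = 0.5  # tune against your own data

def filter_relevant(query: str, documents: list[str], top_n: int = 10) -> list[str]:
    """Keep only the retrieved chunks the reranker scores above the threshold."""
    reranked = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=documents,
        top_n=top_n,
    )
    return [
        documents[result.index]
        for result in reranked.results
        if result.relevance_score >= RELEVANCE_THRESHOLD
    ]
```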
3. Precise Prompt Engineering
With relevant context in hand, careful prompt engineering becomes your final tool for getting the most out of smaller models. Key elements include:
- Clear instructions about using only provided context
- Explicit directions about handling missing information
- Context-specific guidance about the topic at hand
- Clear formatting requirements for responses
For example, a well-crafted anti-hallucination prompt might look like:
This is very important: if there is no relevant information in the texts or
there are no available texts, respond with "I'm sorry, I couldn't find an answer to your question."
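Putting those elements together, the prompt-assembly step might look roughly like the sketch below. The wording and helper function are illustrative rather than an exact production prompt; the point is that the instructions, the filtered context, and the fallback behavior all travel together.

```python
# Assembling a context-grounded prompt (illustrative sketch).
SYSTEM_PROMPT = """You are a support assistant for our product.
Answer the user's question using ONLY the numbered texts provided.
This is very important: if there is no relevant information in the texts or
there are no available texts, respond with "I'm sorry, I couldn't find an
answer to your question."
Keep answers concise and cite the text numbers you relied on."""

def build_messages(question: str, context_chunks: list[str]) -> list[dict]:
    """Combine instructions, filtered context, and the user question."""
    numbered = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(context_chunks))
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Texts:\n{numbered}\n\nQuestion: {question}"},
    ]
```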
Real-World Results
At Vectorize, we’ve been testing these principles with our AI assistant, which helps users navigate our product. Using a relatively small model (Llama 3.1 70B) combined with careful retrieval optimization and prompt engineering, we’re seeing promising results:
- Good response times for user queries
- More cost-effective operation compared to larger models
- Reliable answers when the context is well-matched
- Encouraging early feedback from users
What’s particularly interesting is that by focusing on retrieval quality and relevance filtering rather than raw model size, we’ve been able to get good results from a smaller, more efficient model. Our experience suggests that investing in the RAG pipeline itself can be more valuable than simply using a larger language model.
Implementation Tips
To implement this efficient approach in your own RAG applications:
1. Focus on Retrieval Quality
- Implement query rewriting for ambiguous questions
- Test multiple embedding models against your specific use case
- Consider both general-purpose and specialized embedding options
- Regularly evaluate retrieval performance with real user queries (a simple evaluation sketch follows this list)
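One lightweight way to do that evaluation is to keep a small set of real user queries labeled with the document each one should surface, then track how often that document lands in the top-k results. The sketch below assumes a search(query, k) function returning (document_id, score) pairs; that interface is hypothetical and stands in for your own retriever.

```python
# Hit-rate@k over a labeled query set (sketch; `search` is your retriever).
def hit_rate_at_k(labeled_queries: list[tuple[str, str]], search, k: int = 5) -> float:
    """labeled_queries: (query, expected_document_id) pairs."""
    hits = 0
    for query, expected_id in labeled_queries:
        top_ids = [doc_id for doc_id, _score in search(query, k)]
        if expected_id in top_ids:
            hits += 1
    return hits / len(labeled_queries)

# Run the same labeled set against each embedding model you are considering
# and compare the resulting hit rates before committing to one.
```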
2. Optimize Your Filtering
- Choose appropriate reranking models
- Test different relevance thresholds
- Monitor filtered results to ensure valuable content isn’t being excluded (see the logging sketch after this list)
- Consider adjusting thresholds based on use case
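One way to keep an eye on what the filter discards is to log "near misses": results that score just below the threshold. If useful chunks keep showing up there, the threshold is probably too strict. The sketch below is one way to do that; the threshold and margin values are placeholders to tune against your own data.

```python
# Auditing what the relevance filter throws away (sketch).
import logging

logger = logging.getLogger("rag.filtering")
THRESHOLD = 0.5
NEAR_MISS_MARGIN = 0.1  # how far below the threshold still counts as a near miss

def split_by_relevance(scored_results: list[tuple[str, float]]) -> list[str]:
    """scored_results: (chunk, relevance_score) pairs from the reranker."""
    kept = []
    for chunk, score in scored_results:
        if score >= THRESHOLD:
            kept.append(chunk)
        elif score >= THRESHOLD - NEAR_MISS_MARGIN:
            # Near miss: excluded from the prompt, but logged for review.
            logger.info("Near-miss chunk (score=%.2f): %.80s", score, chunk)
    return kept
```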
3. Monitor and Iterate
- Track user feedback
- Monitor relevance scores
- Analyze cases where smaller models struggle
- Adjust thresholds and prompts based on results
Conclusion
The key insight we’ve gained helping our users build RAG applications is that model size isn’t the most important factor. While there’s often pressure to use the largest, most powerful language models available, our experience shows that intelligent pipeline design can be far more impactful than raw model size.
Think of it like giving a subject matter expert clear, relevant reference materials versus giving a general expert a mountain of loosely related documents. The subject matter expert with focused materials will likely provide better answers, even though the general expert might have broader knowledge.
The future of RAG isn’t about using the biggest models; it’s about building smarter pipelines that make efficient use of smaller, more cost-effective models. By focusing on the quality of your retrieval and filtering systems, you can build applications that are not just more efficient, but also more reliable and practical for production use.
Remember: it’s not about how much your model knows, but how effectively you can give it the right information at the right time.