Understanding Data Formats in RAG

Chris Latimer•September 5, 2024

RAG pipelines are transformational, period. They transform data into insights, extracting every ounce of value and wasting nothing. To get the best out of them, though, you need to ensure every component of your RAG pipeline works as intended. For that, you need to know how to work with data: how to process it, what to expect, what to avoid and what the limitations are.

Working with different types of data brings countless opportunities, challenges and solutions. That is exactly why you need to understand data formats when working with RAG pipelines. They impact the functionality, behavior and outcomes of your pipeline big time.

That is mainly because they contain the input your output is built on. The raw material has to taste good to make the end product delicious. How and why do data formats hold so much power? Let’s dive in and find out.

To lay the groundwork, you will come across three types of data. These are unstructured, semi-structured, and structured. All three have enormous potential and great benefits. Yet, all three work very differently, fulfill different needs and require unique approaches.

Unstructured Data: The Raw Text Goldmine

This is the most common and arguably the easiest type of data to work with. RAG pipelines and unstructured data are a match made in heaven, as long as the correct tools and techniques are used to process this data. It is like an ocean of text that flows without a predefined format or structure. So, it is text-heavy. It carries complex and nuanced information with a lot of variables, relationships and layers of meaning.

Its beauty lies in its abundance and flexibility. It’s everywhere around us: the articles we read on the web, social media content, phone galleries, the books lining our shelves, chats, emails, e-commerce activity and so on.

The primary advantage of unstructured data in RAG systems is its richness of meaning and insight. This makes it an ideal source for training language models and building deep context. If you want a RAG system that can work as a parallel human brain, generate human-like text, or understand nuanced queries, unstructured data reigns supreme. It’s particularly useful for tasks that require a deep knowledge of behaviors and concepts, and an understanding of colloquial language.

However, the very nature of unstructured data that makes it flexible also makes it messy. Unstructured data can be noisy. It contains irrelevant information, and the burden of filtering that information out falls on the RAG system. Extracting specific pieces of information from unstructured data requires advanced processing, immense computational power and constant evaluation.
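
To make the filtering burden concrete, here is a minimal sketch of two common preprocessing steps: dropping noisy lines and splitting the remaining text into overlapping chunks. The noise markers, chunk size and overlap are assumptions chosen for illustration, not recommended values, and a real pipeline would likely use more sophisticated cleaning.

```python
import re

# Hypothetical markers for lines we treat as noise; tune these for your own data.
NOISE_MARKERS = ("subscribe", "cookie", "try free")


def clean_text(raw: str) -> str:
    """Drop empty or noisy lines and collapse runs of whitespace."""
    kept = []
    for line in raw.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if any(marker in stripped.lower() for marker in NOISE_MARKERS):
            continue
        kept.append(re.sub(r"\s+", " ", stripped))
    return " ".join(kept)


def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into chunks of `size` words that share `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

Calling `chunk_words(clean_text(raw_page))` would give you text chunks ready to embed. The overlap is there so that a sentence cut at a chunk boundary still appears whole in at least one chunk.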

This type of data is best for producing qualitative answers to queries. Unstructured data can give great reasoning, logic and creative insights. It does not make for the best datasets, though. Its advantages lean toward the abstract far more than toward numbers.

Semi-Structured Data: The Middle Child

Moving along the spectrum, we come to semi-structured data. This is a hybrid format that combines elements of both unstructured and structured data. Meshed together, they offer a middle ground that can be incredibly useful in certain RAG applications. Semi-structured data is like a partially arranged filing cabinet: there’s some level of organization, but it’s not as rigid or predictable as fully structured data. It is an organized mess, you could say.

Semi-structured data often contains both text and tabular information. So there is detailed data as well as the kind of data you would find in tables and graphs. There is some complexity but also some dimensionality.

This data may include tags, labels or markers that provide some level of organization. Examples of semi-structured data include PDF documents, XML and JSON files, spreadsheets, and HTML web pages. These formats often contain rich textual content, but they also have more organized elements like tables, metadata, or tagged sections.

Semi-structured data is versatile. It offers more organization than raw, unfiltered, unstructured data. It is, therefore, easier to parse and extract specific information. At the same time, it retains some of the richness and flexibility of unstructured text.

Don’t get too carried away though. This data has its own challenges. It is a concoction of two types of data, so it has inherent issues. It can be very tricky to process effectively. There’s room for inconsistencies, formatting issues, errors in recognizing relationships between data points, human error in processing and so on. All of these can complicate data processing, retrieval and results.
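
As a rough illustration of both the convenience and the inconsistency, the sketch below parses a JSON record into free text plus metadata. The field names (`title`, `body`, `tags`, `source`) are hypothetical, and the handling of missing or oddly formatted fields is just one possible approach.

```python
import json


def to_document(raw_json: str) -> dict:
    """Split a JSON record into searchable text and structured metadata."""
    record = json.loads(raw_json)

    # The free-text parts become the content we would embed.
    # Either field may be missing, so skip whatever is absent.
    parts = (record.get("title"), record.get("body"))
    text = " ".join(part for part in parts if part)

    # Tags are sometimes a list and sometimes a comma-separated string;
    # normalize both shapes to a list.
    tags = record.get("tags") or []
    if isinstance(tags, str):
        tags = [tag.strip() for tag in tags.split(",") if tag.strip()]

    return {
        "text": text,
        "metadata": {"tags": tags, "source": record.get("source", "unknown")},
    }
```

The organized elements end up as metadata you can filter on, while the text keeps its richness for semantic search. The defensive checks are there because records from the same source do not always share the same shape.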

Structured Data: The Organized Powerhouse

Then you’ve got your serious, objective, structured data. This data is rigid and straightforward. It does not bend. Facts are facts, after all.

Structured data has a predefined data model. It is organized in a very predictable way. This data is often stored in relational databases or tables, where each field has a defined purpose and format. Common examples of structured data include CSV files, well-organized API responses, and knowledge graphs. This one is consistent and precise.

The main advantage of structured data in RAG systems is its efficiency and reliability. It has fewer nuances and less fluff, so it is quick to retrieve. The results are more accurate. It is perfect for precise querying and for pipelines that focus on factual Q&A.
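
Here is a minimal sketch of what precise querying can look like, using Python’s built-in `sqlite3` module with an in-memory database. The `products` table, its columns and its rows are made up purely for illustration.

```python
import sqlite3
from typing import Optional


def build_catalog() -> sqlite3.Connection:
    """Create an in-memory table with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE products (sku TEXT PRIMARY KEY, name TEXT, price_cents INTEGER)"
    )
    conn.executemany(
        "INSERT INTO products VALUES (?, ?, ?)",
        [("A-100", "Widget", 1999), ("B-200", "Gadget", 4950)],
    )
    return conn


def price_of(conn: sqlite3.Connection, sku: str) -> Optional[int]:
    """Return the exact stored price for a SKU, or None if it is unknown."""
    row = conn.execute(
        "SELECT price_cents FROM products WHERE sku = ?", (sku,)
    ).fetchone()
    return row[0] if row is not None else None
```

Because every field has a defined purpose, the lookup either returns the exact stored value or nothing at all. There is no fuzzy matching to second-guess, which is what makes this kind of data a good fit for factual answers.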

Choosing the Right Format for Your RAG System

In practice, many effective RAG systems use a combination of these three types of data. In some modules you may need to rely more heavily on one of the data types. A RAG system designed to assist with medical queries will do better with:

Unstructured data from medical literature for broad medical knowledge.
Semi-structured data from clinical guidelines for more specific recommendations.
Structured data from drug databases for precise information about medications and dosages.
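
One way to picture such a combination is a step that gathers snippets retrieved from each kind of source and assembles them into a single context for the language model. The sketch below is only an assumption about how that might look: the `Snippet` type, the labels and the choice to list structured facts first are all hypothetical.

```python
from dataclasses import dataclass

# Assumed ordering: precise facts first, broad background last.
SOURCE_ORDER = ("structured", "semi-structured", "unstructured")


@dataclass
class Snippet:
    source_type: str  # one of the values in SOURCE_ORDER
    text: str


def build_context(snippets: list[Snippet]) -> str:
    """Group retrieved snippets by source type and label each line."""
    lines = []
    for source_type in SOURCE_ORDER:
        for snippet in snippets:
            if snippet.source_type == source_type:
                lines.append(f"[{source_type}] {snippet.text}")
    return "\n".join(lines)
```

Labeling each snippet with where it came from lets the model, and anyone debugging the pipeline, see which kind of data backs each part of an answer.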

You can settle on a single type only if you can afford to skip either qualitative or objective response generation. Otherwise, you will need a combination that includes all three to some degree. And remember, there’s no one-size-fits-all solution. It all depends on what you are trying to achieve.