Introducing Automatic Metadata Extraction: Supercharge Your RAG Pipelines with Structured Information

Chris Bartholomew • May 23, 2025

Automatic Metadata Extraction is now available in Vectorize! This feature extracts structured information from your documents automatically as they move through your RAG pipelines, improving your retrieval capabilities and giving your language models more context.

The Challenge of Unstructured Data

Organizations deal with vast amounts of unstructured documents daily – from technical specifications and financial reports to legal contracts and customer communications. While these documents contain valuable information, extracting specific data points has traditionally been a manual, time-consuming process.

Retrieval Augmented Generation (RAG) has already revolutionized how we interact with these documents by enabling semantic search and providing relevant context to language models. However, there’s been a missing piece: the ability to automatically extract and leverage structured information from within these documents.

Introducing Automatic Metadata Extraction

Our new Automatic Metadata Extraction feature bridges this gap by using Vectorize’s Iris model to analyze documents and extract structured information based on defined schemas. Iris handles the extraction of structured text from visually complex documents, while our metadata extraction system applies your schema to identify and populate the relevant fields, with no manual labeling required. This extracted metadata enhances your retrieval capabilities in several ways:

Improved Filtering: Use extracted metadata fields for precise filtering during retrieval
Enhanced Context: Provide more structured information to your language models
Better Organization: Categorize and classify documents automatically
Deeper Insights: Extract specific data points like prices, part numbers, or technical specifications
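
To make the filtering idea concrete, here is a minimal sketch in plain Python. It does not use the Vectorize API; the chunk layout, the field names (part_number, price), and the filter_chunks helper are all assumptions made for illustration.

```python
# Hypothetical retrieved chunks. The "metadata" layout and field names
# are illustrative assumptions, not the format Vectorize returns.
chunks = [
    {"text": "Bracket A-100 supports loads up to 50 kg.",
     "metadata": {"part_number": "A-100", "price": 12.5}},
    {"text": "Bracket B-200 supports loads up to 80 kg.",
     "metadata": {"part_number": "B-200", "price": 19.0}},
    {"text": "General installation guidance.", "metadata": {}},
]


def filter_chunks(chunks, **required):
    """Keep only chunks whose metadata matches every required key/value."""
    return [
        chunk for chunk in chunks
        if all(chunk["metadata"].get(key) == value
               for key, value in required.items())
    ]


# Narrow the candidate set to a single part before ranking or generation.
matches = filter_chunks(chunks, part_number="A-100")
```

In a real pipeline the filter would typically be applied by the vector store at query time rather than in application code, but the matching logic is the same.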

Two Types of Metadata for Different Needs

Automatic Metadata Extraction supports two types of metadata, each serving different purposes:

Document Metadata

Document metadata is extracted by analyzing the entire document and is ideal for high-level information that applies to the document as a whole:

Title and author
Document type and classification
Publication date
Summary or conclusions

This metadata is attached to each chunk of the document, ensuring this high-level context is available regardless of which chunk is retrieved.
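
The propagation described above can be sketched as follows. This is an illustration only: the attach_document_metadata helper and the field names are hypothetical, and the real pipeline does this for you.

```python
def attach_document_metadata(chunk_texts, document_metadata):
    """Return one record per chunk, each carrying a copy of the
    document-level metadata (hypothetical record layout)."""
    return [
        # Copy the dict so later per-chunk edits do not leak across chunks.
        {"text": text, "metadata": dict(document_metadata)}
        for text in chunk_texts
    ]


document_metadata = {"title": "Q1 Research Report", "document_type": "report"}
records = attach_document_metadata(
    ["Revenue grew in the first quarter.", "Risks include supply delays."],
    document_metadata,
)
```

Whichever chunk is retrieved, its record includes the title and document type.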

Section Metadata

Section metadata is applied at the chunk level and is perfect for more specific and detailed information that varies throughout the document:

Part numbers
Items purchased
Values in dollars
Technical specifications
Status information

For each chunk, the model determines if it matches one of your defined section metadata schemas and extracts the relevant information.
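
The per-chunk control flow looks roughly like the sketch below. A regular expression stands in for the model here, which is a deliberate simplification: the real feature uses a model, and the part-number pattern, the part_numbers field, and the function name are assumptions for illustration.

```python
import re
from typing import Optional

# Stand-in for the model: a hypothetical part-number format such as "PN-1234".
PART_NUMBER = re.compile(r"\b[A-Z]{2}-\d{4}\b")


def extract_section_metadata(chunk_text: str) -> Optional[dict]:
    """Return section metadata if the chunk matches the schema, else None."""
    found = PART_NUMBER.findall(chunk_text)
    if not found:
        # The chunk does not match this schema, so nothing is attached.
        return None
    return {"part_numbers": found}


matched = extract_section_metadata("Replace valve PN-1234 and seal PN-5678.")
unmatched = extract_section_metadata("This section covers general safety.")
```

The key point is that section metadata is optional per chunk: chunks that do not match a schema simply carry no section fields.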

The Visual Schema Editor: No JSON Required

Creating metadata schemas is easy with our new visual schema editor. You don’t need to write JSON directly – instead, you can:

Start from Blank: Create a schema from scratch
Use a Template: Start with a pre-defined schema for common document types like receipts or invoices
Generate from Document: Upload a sample document and have the model automatically generate a suggested schema

The visual editor makes it simple to define properties, set types, add descriptions, and preview how the schema will be applied.
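
For readers who want a mental model of what the editor produces, here is a hedged sketch of a schema with properties, types, and descriptions, written as a Python dictionary in a JSON Schema-like shape. The exact format Vectorize stores is not shown in this post, so the structure, the field names, and the conforms helper are assumptions.

```python
# Illustrative schema for an invoice (assumed shape, not the Vectorize format).
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {
            "type": "string",
            "description": "Name of the company issuing the invoice",
        },
        "invoice_date": {
            "type": "string",
            "description": "Date the invoice was issued",
        },
        "total_usd": {
            "type": "number",
            "description": "Invoice total in US dollars",
        },
    },
}

# Map schema type names to the Python types used for checking.
PYTHON_TYPES = {"string": str, "number": (int, float)}


def conforms(record, schema):
    """Check that every field in the record is declared with a matching type."""
    for name, value in record.items():
        spec = schema["properties"].get(name)
        if spec is None or not isinstance(value, PYTHON_TYPES[spec["type"]]):
            return False
    return True


ok = conforms({"vendor": "Acme", "total_usd": 120.5}, invoice_schema)
bad = conforms({"total_usd": "120.5"}, invoice_schema)
```

Clear descriptions matter most here, since they are what tells the model what each field means.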

Real-World Applications

Here are a few ways you can use Automatic Metadata Extraction:

Financial Services

Financial institutions can use document metadata to classify research reports by sector, region, and analyst, while section metadata extracts specific financial metrics, stock recommendations, and price targets. This allows their analysts or AI agents to filter retrieval results by specific metrics or recommendations.

Manufacturing

Manufacturing companies can extract part numbers, specifications, and compatibility information from technical documentation. Their engineers (or AI agents) can then filter search results by specific part numbers or technical requirements, making it easier to find relevant information across thousands of documents.

Healthcare

Healthcare providers can use document metadata to classify medical literature by specialty and research type, while section metadata extracts specific treatments, dosages, and outcomes. This allows medical professionals (or AI agents) to filter retrieval results by specific medical criteria.

Enhancing Retrieval with Metadata in Chunks

One of the most powerful features of Automatic Metadata Extraction is the ability to add extracted metadata to your text chunks. This can significantly improve retrieval quality by:

Making high-level document context available in every chunk
Emphasizing important information for the LLM to use when generating answers
Providing additional context that might not be explicit in the text
Ensuring consistent information is available across all chunks from the same document

This is especially beneficial for documents that span many chunks or contain specific values, such as part numbers or dollar amounts, that need to be matched exactly.
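
One simple way to picture adding metadata to chunk text is to prepend it as a short header before the chunk is embedded. The sketch below assumes a "key: value" header format and a hypothetical enrich_chunk_text helper; the layout Vectorize actually uses may differ.

```python
def enrich_chunk_text(text, metadata):
    """Prepend metadata as "key: value" lines so it is embedded with the text."""
    header = "\n".join(f"{key}: {value}" for key, value in metadata.items())
    # Leave the chunk untouched when there is no metadata to add.
    return f"{header}\n\n{text}" if header else text


enriched = enrich_chunk_text(
    "Torque the bolts to 40 Nm.",
    {"title": "Service Manual", "part_number": "PN-1234"},
)
```

Because the title and part number now sit inside the chunk text, they influence both the embedding and what the LLM sees at generation time.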

Getting Started

Ready to try Automatic Metadata Extraction? Here’s how to get started:

Create a Metadata Schema: Use the visual schema editor to define what information you want to extract
Test with the Extraction Tester: Verify your schema works as expected with sample documents
Enable in Your Pipeline: Add metadata extraction to your RAG pipeline and configure metadata settings

For detailed instructions, check out our documentation on Automatic Metadata Extraction.

Conclusion

Automatic Metadata Extraction represents a significant advancement in how organizations can work with unstructured documents in their RAG pipelines. By automatically extracting structured information and making it available for filtering and context enhancement, we’re helping you get more value from your documents and providing more precise information to your users.

We’re excited to see how you’ll use this feature to enhance your RAG applications. If you have questions or ideas for improvements, we’d love to hear them.
