Introducing Automatic Metadata Extraction: Supercharge Your RAG Pipelines with Structured Information

Automatic Metadata Extraction is now available in Vectorize! It extracts structured information from your documents automatically, so your RAG pipelines can filter retrieval more precisely and give your language models more context.
The Challenge of Unstructured Data
Organizations deal with vast amounts of unstructured documents daily – from technical specifications and financial reports to legal contracts and customer communications. While these documents contain valuable information, extracting specific data points has traditionally been a manual, time-consuming process.
Retrieval Augmented Generation (RAG) has already revolutionized how we interact with these documents by enabling semantic search and providing relevant context to language models. However, there’s been a missing piece: the ability to automatically extract and leverage structured information from within these documents.
Introducing Automatic Metadata Extraction
Our new Automatic Metadata Extraction feature bridges this gap by using Vectorize’s Iris model to analyze documents and extract structured information based on schemas you define. Iris handles the extraction of structured text from visually complex documents, while our metadata extraction system applies your schema to identify and populate the relevant fields, with no manual labeling required. This extracted metadata enhances your retrieval capabilities in several ways:
- Improved Filtering: Use extracted metadata fields for precise filtering during retrieval
- Enhanced Context: Provide more structured information to your language models
- Better Organization: Categorize and classify documents automatically
- Deeper Insights: Extract specific data points like prices, part numbers, or technical specifications
Two Types of Metadata for Different Needs
Automatic Metadata Extraction supports two types of metadata, each serving different purposes:
Document Metadata
Document metadata is extracted by analyzing the entire document and is ideal for high-level information that applies to the document as a whole:
- Title and author
- Document type and classification
- Publication date
- Summary or conclusions
This metadata is attached to each chunk of the document, ensuring this high-level context is available regardless of which chunk is retrieved.
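To make the idea of attaching document metadata to every chunk concrete, here is a minimal sketch in plain Python. It is not Vectorize's API: the `Chunk` class, the `attach_document_metadata` function, and the sample field names are all hypothetical, chosen only to illustrate the behavior described above.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A piece of document text plus its metadata (hypothetical structure)."""
    text: str
    metadata: dict = field(default_factory=dict)


def attach_document_metadata(chunks: list[Chunk], document_metadata: dict) -> list[Chunk]:
    """Return new chunks that each carry a copy of the document-level metadata."""
    return [
        # Chunk-level keys win on conflict, so section-specific values are kept.
        Chunk(text=c.text, metadata={**document_metadata, **c.metadata})
        for c in chunks
    ]


# Illustrative values only.
doc_meta = {"title": "Pump Maintenance Guide", "document_type": "manual"}
chunks = [Chunk("Replace the seal every 500 hours."), Chunk("Torque bolts to 40 Nm.")]
enriched = attach_document_metadata(chunks, doc_meta)
```

Whichever chunk is retrieved, its `metadata` now includes the title and document type, which is the property the paragraph above describes.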
Section Metadata
Section metadata is applied at the chunk level and is perfect for more specific and detailed information that varies throughout the document:
- Part numbers
- Items purchased
- Values in dollars
- Technical specifications
- Status information
For each chunk, the model determines whether it matches one of your defined section metadata schemas and, if it does, extracts the relevant information.
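The per-chunk matching step can be sketched as follows. This is an illustration under loud assumptions: simple regular expressions stand in for the model, and the schema names, the `PN-1234` part number format, and the function names are hypothetical rather than anything Vectorize defines.

```python
import re
from typing import Callable, Optional

# Each hypothetical schema is an extractor that returns fields, or None if
# the chunk does not match. In the real feature a model makes this decision;
# regular expressions are used here only as a stand-in.
Extractor = Callable[[str], Optional[dict]]


def extract_part(text: str) -> Optional[dict]:
    match = re.search(r"\bPN-(\d{4})\b", text)
    return {"part_number": f"PN-{match.group(1)}"} if match else None


def extract_price(text: str) -> Optional[dict]:
    match = re.search(r"\$(\d+(?:\.\d{2})?)", text)
    return {"value_usd": float(match.group(1))} if match else None


SECTION_SCHEMAS: dict[str, Extractor] = {"part": extract_part, "price": extract_price}


def extract_section_metadata(text: str) -> dict:
    """Return fields from the first schema that matches, or {} if none match."""
    for name, extractor in SECTION_SCHEMAS.items():
        fields = extractor(text)
        if fields is not None:
            return {"schema": name, **fields}
    return {}
```

A chunk that matches no schema simply gets no section metadata, so unrelated passages are left untouched.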
The Visual Schema Editor: No JSON Required
Creating metadata schemas is easy with our new visual schema editor. You don’t need to write JSON directly – instead, you can:
- Start from Blank: Create a schema from scratch
- Use a Template: Start with a pre-defined schema for common document types like receipts or invoices
- Generate from Document: Upload a sample document and have the model automatically generate a suggested schema
The visual editor makes it simple to define properties, set types, add descriptions, and preview how the schema will be applied.
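If you are curious what a schema with properties, types, and descriptions might look like underneath, here is a rough sketch. The format the visual editor actually produces is not shown in this post, so the layout below borrows general JSON Schema conventions as an assumption, and `invoice_schema`, its field names, and the `conforms` helper are hypothetical.

```python
# A hypothetical schema for invoices: each property has a type and a description.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Identifier printed on the invoice"},
        "total_usd": {"type": "number", "description": "Invoice total in US dollars"},
        "paid": {"type": "boolean", "description": "Whether the invoice is marked as paid"},
    },
}

# Map the schema's type names to Python types for a minimal check.
PYTHON_TYPES = {"string": str, "number": (int, float), "boolean": bool}


def conforms(record: dict, schema: dict) -> bool:
    """Check that every field in record is declared and has the declared type."""
    properties = schema["properties"]
    for key, value in record.items():
        if key not in properties:
            return False
        declared = properties[key]["type"]
        # bool is a subclass of int in Python, so reject it for "number" explicitly.
        if declared == "number" and isinstance(value, bool):
            return False
        if not isinstance(value, PYTHON_TYPES[declared]):
            return False
    return True
```

A check like this is one way to sanity-test extracted records against the types you declared before relying on them for filtering.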

Real-World Applications
Here are a few ways you can use Automatic Metadata Extraction:
Financial Services
Financial institutions can use document metadata to classify research reports by sector, region, and analyst, while section metadata extracts specific financial metrics, stock recommendations, and price targets. This allows their analysts or AI agents to filter retrieval results by specific metrics or recommendations.
Manufacturing
Manufacturing companies can extract part numbers, specifications, and compatibility information from technical documentation. Their engineers (or AI agents) can then filter search results by specific part numbers or technical requirements, making it easier to find relevant information across thousands of documents.
Healthcare
Healthcare providers can use document metadata to classify medical literature by specialty and research type, while section metadata extracts specific treatments, dosages, and outcomes. This allows medical professionals (or AI agents) to filter retrieval results by specific medical criteria.
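The filtering these examples rely on can be sketched in a few lines. This is a generic illustration, not Vectorize's retrieval API: the `Hit` class, the `filter_hits` function, and the sample part numbers are hypothetical, and a real vector database would typically apply such filters on the server side.

```python
from dataclasses import dataclass, field


@dataclass
class Hit:
    """A retrieved chunk with its similarity score and metadata (hypothetical)."""
    text: str
    score: float
    metadata: dict = field(default_factory=dict)


def filter_hits(hits: list[Hit], **criteria) -> list[Hit]:
    """Keep hits whose metadata equals every criterion, best score first."""
    matching = [
        h for h in hits
        if all(h.metadata.get(key) == value for key, value in criteria.items())
    ]
    return sorted(matching, key=lambda h: h.score, reverse=True)


# Illustrative data only.
hits = [
    Hit("Seal kit for pump A", 0.71, {"part_number": "PN-1234"}),
    Hit("Seal kit for pump B", 0.93, {"part_number": "PN-9999"}),
    Hit("Install guide for PN-1234", 0.88, {"part_number": "PN-1234", "document_type": "manual"}),
]
for_part = filter_hits(hits, part_number="PN-1234")
```

Here the highest-scoring hit overall is dropped because it refers to a different part, which is the kind of precision that semantic similarity alone does not guarantee.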
Enhancing Retrieval with Metadata in Chunks
One of the most powerful features of Automatic Metadata Extraction is the ability to add extracted metadata to your text chunks. This can significantly improve retrieval quality by:
- Making high-level document context available in every chunk
- Emphasizing important information for the LLM to use when generating answers
- Providing additional context that might not be explicit in the text
- Ensuring consistent information is available across all chunks from the same document
This is especially beneficial for documents that span many chunks or that contain specific value strings, such as part numbers or prices.
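One simple way to picture metadata being added to chunk text is a short header placed ahead of the chunk body before it is embedded or passed to the LLM. The sketch below assumes a plain "key: value" header; the actual format Vectorize uses is not described in this post, and `render_chunk_with_metadata` is a hypothetical name.

```python
def render_chunk_with_metadata(text: str, metadata: dict) -> str:
    """Prefix chunk text with a 'key: value' header, keys in sorted order."""
    if not metadata:
        return text
    # Sorting keeps the header stable across chunks from the same document.
    header = "\n".join(f"{key}: {metadata[key]}" for key in sorted(metadata))
    return f"{header}\n\n{text}"


rendered = render_chunk_with_metadata(
    "Torque bolts to 40 Nm.",
    {"title": "Pump Maintenance Guide", "part_number": "PN-1234"},
)
```

With the header in place, a chunk that never mentions the document title or part number in its own text still carries that context.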

Getting Started
Ready to try Automatic Metadata Extraction? Here’s how to get started:
- Create a Metadata Schema: Use the visual schema editor to define what information you want to extract
- Test with the Extraction Tester: Verify your schema works as expected with sample documents
- Enable in Your Pipeline: Add metadata extraction to your RAG pipeline and configure metadata settings
For detailed instructions, check out our documentation on Automatic Metadata Extraction.
Conclusion
Automatic Metadata Extraction changes how organizations can work with unstructured documents in their RAG pipelines. By automatically extracting structured information and making it available for filtering and context enhancement, we’re helping you get more value from your documents and give your users more precise information.
We’re excited to see how you’ll use this feature to enhance your RAG applications. If you have questions or ideas for improvements, we’d love to hear them.