
PageIndex Deep Dive: The Good, The Bad, and The Ugly of Vectorless RAG
- Stephen Jones
- AI
- February 28, 2026
What if everything we know about RAG is built on a flawed assumption?
That is the provocative claim behind PageIndex, a project from VectifyAI that has accumulated 19,000+ GitHub stars in under a year. Their argument: similarity is not relevance, and the entire chunk-embed-retrieve pipeline that powers most RAG systems today is fundamentally limited. Instead, PageIndex builds hierarchical tree structures from documents and uses LLM reasoning to navigate them. No vectors, no chunking, no embedding databases.
I spent some time digging through the codebase, the benchmarks, and the competitive landscape. Here is what I found.
What PageIndex Actually Does
At its core, PageIndex is a two-stage system:
Stage 1, Index Generation. Given a PDF or Markdown document, PageIndex uses an LLM (GPT-4o by default) to build a hierarchical JSON tree. It detects the table of contents, maps sections to physical page numbers, recursively splits oversized nodes, and generates summaries. The output is a structured representation of the document: a machine-readable table of contents with metadata.
Stage 2, Reasoning-Based Retrieval. When you query the system, instead of searching a vector database for “similar” chunks, an LLM reasons over the tree structure. It reads section titles and summaries, decides which branch to explore, drills down, extracts information, and follows cross-references. This mimics how a human expert actually navigates a 200-page annual report.
The pipeline has a self-correcting mechanism too. It verifies its own TOC extraction, and if accuracy drops below 60%, it cascades through three fallback modes (TOC with page numbers, TOC without page numbers, pure LLM segmentation).
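A minimal sketch of that cascading-fallback logic might look like the following. The extractor functions, their return shapes, and the threshold handling are assumptions for illustration, not PageIndex's actual internals:

```python
# Hypothetical sketch of the cascading fallback described above: try each
# TOC-extraction mode in order, accept the first whose verified accuracy
# clears the threshold, else keep the best attempt seen.
def build_toc(pages, extractors, accuracy_threshold=0.6):
    best = None
    for extract in extractors:
        toc, accuracy = extract(pages)
        if best is None or accuracy > best[1]:
            best = (toc, accuracy)
        if accuracy >= accuracy_threshold:
            return toc
    return best[0]

# Stub extractors standing in for the three fallback modes.
def toc_with_pages(pages):
    return (["1. Intro (p.1)"], 0.5)

def toc_without_pages(pages):
    return (["1. Intro"], 0.7)

def pure_llm_segmentation(pages):
    return (["Segment A"], 0.9)

toc = build_toc(["page text"],
                [toc_with_pages, toc_without_pages, pure_llm_segmentation])
# The first mode misses the threshold, so the second mode's result wins.
```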
Indexing a document is a single call:
from pageindex import page_index

result = page_index(
    doc="/path/to/annual-report.pdf",
    model="gpt-4o-2024-11-20",
    max_pages_per_node=10,
    if_add_node_summary="yes"
)
# Returns: hierarchical JSON tree with section summaries
The output looks like this:
{
  "doc_name": "2023-annual-report.pdf",
  "structure": [
    {
      "title": "Risk Management",
      "node_id": "0005",
      "start_index": 42,
      "end_index": 67,
      "summary": "Covers market risk, credit risk, and operational risk...",
      "nodes": [
        {
          "title": "Interest Rate Risk",
          "node_id": "0006",
          "start_index": 45,
          "end_index": 52,
          "summary": "Details hedging strategies and sensitivity analysis..."
        }
      ]
    }
  ]
}
Every retrieved answer traces back to specific page numbers and section paths. No opacity.
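To make the Stage 2 navigation concrete, here is a sketch of a descent over a tree shaped like the JSON above. In PageIndex the branch choice is an LLM call; the keyword-overlap scorer below is a stand-in so the example runs offline:

```python
# Sketch of reasoning-based retrieval over a PageIndex-style tree.
# `choose_branch` stands in for the LLM's branch-selection decision.
def choose_branch(query, nodes):
    def score(node):
        text = (node["title"] + " " + node.get("summary", "")).lower()
        return sum(w in text for w in query.lower().split())
    return max(nodes, key=score)

def navigate(query, nodes, path=()):
    """Drill down until a leaf, recording the section path taken."""
    node = choose_branch(query, nodes)
    path = path + (node["title"],)
    children = node.get("nodes")
    if children:
        return navigate(query, children, path)
    return path, (node["start_index"], node["end_index"])

tree = [{
    "title": "Risk Management", "node_id": "0005",
    "start_index": 42, "end_index": 67,
    "summary": "Covers market risk, credit risk, and operational risk",
    "nodes": [{
        "title": "Interest Rate Risk", "node_id": "0006",
        "start_index": 45, "end_index": 52,
        "summary": "Details hedging strategies and sensitivity analysis",
    }],
}]

path, pages = navigate("interest rate hedging", tree)
# path -> ("Risk Management", "Interest Rate Risk"); pages -> (45, 52)
```

The returned path is exactly the audit trail described above: section titles plus a physical page range.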
The Problem It Solves
If you have built a RAG system, you have fought the chunking war. The core tension is irreconcilable:
- Small chunks (256-512 tokens) embed cleanly and match precisely, but lose all surrounding context. “Revenue grew 3%.” Which company? Which quarter?
- Large chunks (1024+ tokens) preserve context but dilute semantic focus, making similarity search less reliable.
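The trade-off is easy to see with a naive fixed-size chunker (a sketch, not any particular library's implementation):

```python
# Naive fixed-size chunker illustrating the context-loss trade-off.
def chunk(words, size):
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

doc = "ACME Corp Q3 2024 results . Revenue grew 3 % .".split()

small = chunk(doc, 6)    # tight chunks embed cleanly...
large = chunk(doc, 100)  # ...one big chunk keeps context but dilutes focus

# small[1] is "Revenue grew 3 % ." -- which company? which quarter?
```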
The community has layered increasingly sophisticated fixes on top: Anthropic’s Contextual Retrieval prepends per-chunk context (reducing retrieval failure by 49%). Jina AI’s Late Chunking embeds the full document first, then extracts chunk vectors from token-level representations. Semantic chunking groups sentences by embedding similarity.
But these are all patches on the same architecture. PageIndex asks: what if we just skip the chunking entirely?
The Good
A Genuinely Novel Architecture
This is not another “improved chunking strategy” paper. PageIndex replaces the entire vector similarity paradigm with tree-based reasoning. The AlphaGo inspiration is apt. Just as AlphaGo uses Monte Carlo Tree Search to reason through game states rather than pattern-matching board positions, PageIndex navigates document hierarchies through explicit LLM reasoning rather than embedding similarity.
The ~2,500 lines of Python are lean and readable. Dependencies are minimal: OpenAI SDK, PyMuPDF, tiktoken, and that is essentially it. No PyTorch, no FAISS, no vector database. The entire system is an LLM talking to a JSON tree.
The FinanceBench Numbers Are Real
VectifyAI’s Mafin 2.5 system (built on PageIndex) achieved 98.7% accuracy on FinanceBench, a benchmark of financial Q&A over SEC filings and earnings reports. For context, traditional vector-based RAG solutions typically score 60-80% on this dataset. That is not a marginal improvement; it is a different tier of performance.
This makes sense when you think about it. Financial documents are highly structured. When an analyst asks “What was Microsoft’s operating margin in Q3 2024?”, the answer lives in a specific section of a specific table in a specific filing. Vector similarity search might surface chunks that mention operating margins. Tree-based reasoning can navigate directly to Financials, Income Statement, Q3, Operating Margin.
Explainability by Design
Every answer comes with a traceable path through the document tree: “I found this on pages 45-52, under Risk Management, Interest Rate Risk.” For regulated industries (finance, healthcare, legal) this is not a nice-to-have. It is a compliance requirement.
Vector RAG gives you “here are the top-5 most similar chunks” with cosine similarity scores. That is not an explanation anyone can audit.
Cross-Reference Handling
When a document says “see Appendix G for methodology details,” chunk-based systems are blind to this. The chunk containing the reference and the chunk containing Appendix G are separate islands in vector space. PageIndex’s tree navigation can follow these references because it understands document structure, not just text similarity.
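A sketch of that reference-following, under assumptions: a regex spots "see Appendix G" pointers, and a flat title index over the tree resolves them. Neither is PageIndex's actual mechanism, but the structural point holds:

```python
import re

# Detect "see Appendix X" pointers and resolve them against the tree
# by section title. Regex and flat index are illustrative assumptions.
REF_PATTERN = re.compile(r"see (Appendix [A-Z])", re.IGNORECASE)

def index_by_title(nodes, idx=None):
    idx = {} if idx is None else idx
    for n in nodes:
        idx[n["title"].lower()] = n
        index_by_title(n.get("nodes", []), idx)
    return idx

def resolve_references(section_text, tree):
    idx = index_by_title(tree)
    return [idx[m.lower()] for m in REF_PATTERN.findall(section_text)
            if m.lower() in idx]

tree = [
    {"title": "Methodology", "start_index": 10, "end_index": 20, "nodes": []},
    {"title": "Appendix G", "start_index": 180, "end_index": 195, "nodes": []},
]
refs = resolve_references("See Appendix G for methodology details.", tree)
# refs[0] is the Appendix G node, pages 180-195
```

A chunk-based system has no equivalent move: the reference and its target land in unrelated regions of vector space.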
The Bad
Complete LLM Dependency
Every operation in PageIndex requires GPT-4o API calls. TOC detection? LLM call. Physical page mapping? LLM call. Summary generation? LLM call. Error correction? LLM call. For a 100-page document, you are looking at 50-200 API calls during indexing.
There is no offline mode, no local model support out of the box, and no way to run this without an OpenAI API key. The CHATGPT_API_KEY environment variable name tells you everything about where this system’s allegiance lies. While you could theoretically swap in another OpenAI-compatible API, the prompts are tuned for GPT-4o.
Cost Scales Uncomfortably
Indexing a typical 100-page PDF costs roughly $0.50-$5.00 at current GPT-4o pricing. That is fine for a handful of important documents. It is not fine for ingesting 50,000 documents into a knowledge base.
But here is the real cost concern: PageIndex requires LLM calls at retrieval time too, not just indexing time. Every query triggers a reasoning chain through the tree. Traditional vector RAG pays once at embedding time, then retrieval is essentially free (a cosine similarity lookup). PageIndex pays at both ends.
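A back-of-envelope model makes the "pays at both ends" point concrete. All prices and counts below are illustrative assumptions, not measured figures:

```python
# Toy cost model: indexing cost scales with documents, query cost
# scales with traffic. PageIndex pays on both axes; vector RAG's
# per-lookup cost is effectively zero once embeddings exist.
def total_cost(n_docs, n_queries, index_cost_per_doc, cost_per_query):
    return n_docs * index_cost_per_doc + n_queries * cost_per_query

# Assumed: $2/doc LLM indexing, $0.02/query reasoning chain vs.
# $0.01/doc embedding and ~free similarity lookups.
pageindex_cost = total_cost(1_000, 100_000, 2.00, 0.02)
vector_rag_cost = total_cost(1_000, 100_000, 0.01, 0.0)
# pageindex_cost -> 4000.0, vector_rag_cost -> 10.0
```

Under these assumptions the gap is dominated by query volume, which is exactly why high-traffic applications feel it most.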
The Table of Contents Dependency
PageIndex works best when documents have clear, well-structured tables of contents. The system has fallback modes for documents without a TOC, but these rely on pure LLM segmentation, which is slower and potentially less accurate. If you are processing messy, unstructured documents (email threads, Slack exports, support tickets), the tree metaphor starts breaking down.
Only One Benchmark
The 98.7% FinanceBench number is impressive, but it is the only public benchmark. FinanceBench specifically tests financial document Q&A, a domain where hierarchical, structured documents are the norm. We have no data on how PageIndex performs on technical documentation, academic papers, legal contracts, mixed-media documents, or non-English content.
One stellar benchmark does not prove general superiority.
The Ugly
The Data Sovereignty Problem Nobody Mentions
This is the gap that matters most for anyone in a regulated environment. PageIndex sends every page of every document through OpenAI’s API, both at index time and at query time. There is no on-premises option. There is no local model support without significant rework.
The post recommends PageIndex for financial analysis and regulatory compliance. But for HIPAA, SOC 2, or financial compliance environments, sending document content to a third-party API is likely disqualifying without a Business Associate Agreement or equivalent data processing agreement in place. OpenAI’s data retention policies, training data inclusion practices, and access logging are outside your control entirely.
This is not the same concern as vendor lock-in. Lock-in is a business risk. Data sovereignty is a compliance and security risk. They deserve separate treatment, and most coverage of PageIndex conflates them.
The Scalability Question Nobody Is Answering
Vector databases scale to billions of documents. Pinecone, Weaviate, and Qdrant are built for exactly this. An approximate nearest-neighbour lookup runs in sublinear, roughly logarithmic, time.
PageIndex’s tree-based reasoning requires an LLM to read and reason over the tree structure for every query. What happens when you have 10,000 documents, each with their own tree? 100,000? The project does not address multi-document search at all. There is no mechanism for deciding which document tree to even start searching.
For single-document or small-collection use cases, this is fine. For enterprise knowledge bases with millions of documents, this is an unsolved problem.
Latency Is Inherently Higher
A vector similarity search returns results in milliseconds. PageIndex’s multi-step reasoning chain (read tree, select branch, drill down, extract, assess, possibly repeat) takes seconds. For interactive chat applications where users expect sub-second responses, this is a meaningful trade-off.
The “Reasoning” Framing Oversells
PageIndex describes its retrieval as “reasoning-based” and draws the AlphaGo parallel. But let us be precise about what is happening: an LLM is reading a JSON tree and making selection decisions. This is closer to structured prompting than to Monte Carlo Tree Search. AlphaGo’s tree search involved value networks trained on millions of games and explicit exploration-exploitation trade-offs. PageIndex’s “reasoning” is GPT-4o following instructions.
This is not a criticism of the approach. It works well. But the framing occasionally oversells the sophistication of what is happening under the hood.
How Does It Compare?
Here is where PageIndex sits in the broader landscape:
vs. Traditional Chunk + Vector RAG. PageIndex wins on accuracy for structured documents and explainability. Vector RAG wins on cost, latency, scalability, and versatility across document types.
vs. Anthropic’s Contextual Retrieval / Jina’s Late Chunking. These are fixes layered on top of the vector paradigm. They acknowledge the context-loss problem and compensate. PageIndex sidesteps the problem entirely. But these approaches maintain the cost and speed advantages of vector search.
vs. Layout-Aware Parsers (Unstructured, LlamaParse, Docling). These tools solve document parsing, not retrieval. They are complementary to both vector RAG and PageIndex. In fact, PageIndex would benefit from better PDF parsing (it currently uses PyMuPDF/PyPDF2, which struggle with complex layouts).
vs. Vision-Based Approaches (ColPali, Vision-Guided Chunking). A June 2025 paper showed vision-guided chunking improved RAG accuracy by 14% by treating PDF pages as images. PageIndex also offers a vision RAG mode. These approaches are converging on the same insight: text-only processing loses too much information.
vs. Agentic RAG. PageIndex is essentially a specialised form of agentic RAG: an LLM agent with a structured index to navigate. LlamaIndex and LangChain offer similar agent-based retrieval patterns, but without PageIndex’s pre-built tree structure. The tree is what makes the reasoning practical rather than exploratory.
When Should You Actually Use This?
Use PageIndex when:
- You are working with well-structured, long-form documents (financial reports, regulatory filings, technical manuals)
- Accuracy matters more than speed or cost
- You need auditable, traceable retrieval paths
- Your document collection is small to medium (dozens to hundreds, not millions)
- Cross-referencing within documents is important
- You have verified that sending document content to OpenAI’s API meets your compliance requirements
Stick with vector RAG when:
- You are processing large volumes of diverse, unstructured content
- Sub-second latency matters
- Cost per query needs to be near-zero
- You need to scale to millions of documents
- Your documents lack clear hierarchical structure
Consider a hybrid approach:
- Use vector search to select the right document(s) from a large collection
- Then use PageIndex-style tree reasoning for precise extraction within those documents
This hybrid pattern, coarse retrieval via vectors then fine retrieval via reasoning, is where the field is heading. It is also the architectural insight behind approaches like OpenViking’s hierarchical context tiers, which use L0 abstracts for cheap scanning and L2 full content for deep reads. The scalability of vectors combined with the precision of structured reasoning.
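The hybrid pattern can be sketched as a two-stage pipeline. Both stages below are stand-ins (keyword overlap for the vector stage, title matching for the reasoning stage), chosen so the example runs offline:

```python
# Stage 1: cheap coarse selection over a large collection
# (an ANN vector lookup in production).
def coarse_select(query, docs):
    words = set(query.lower().split())
    return max(docs, key=lambda d: len(words & set(d["abstract"].lower().split())))

# Stage 2: PageIndex-style descent within the chosen document's tree
# (an LLM reasoning chain in production).
def fine_navigate(query, tree):
    words = set(query.lower().split())
    return max(tree, key=lambda n: len(words & set(n["title"].lower().split())))

docs = [
    {"name": "2023-annual-report.pdf",
     "abstract": "annual report revenue risk management",
     "tree": [{"title": "Risk Management", "start_index": 42}]},
    {"name": "employee-handbook.pdf",
     "abstract": "vacation policy onboarding",
     "tree": [{"title": "Benefits", "start_index": 5}]},
]

doc = coarse_select("risk management disclosures", docs)
node = fine_navigate("risk management disclosures", doc["tree"])
# doc -> the annual report; node -> its Risk Management section
```

The expensive reasoning step only ever sees one document's tree, which is what keeps the pattern tractable at collection scale.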
Getting Started
If you want to try it yourself:
git clone https://github.com/VectifyAI/PageIndex.git
cd PageIndex
pip3 install -r requirements.txt
export CHATGPT_API_KEY=sk-your-key-here
# Index a PDF
python3 run_pageindex.py --pdf_path /path/to/document.pdf
# Output lands in ./results/{pdf_name}_structure.json
There is also a cloud-hosted version at chat.pageindex.ai if you want to skip the setup, and an MCP server for Claude/Cursor integration.
The Bigger Picture
PageIndex is interesting not just for what it does, but for what it represents. The RAG ecosystem is moving from “chunk and search” toward “reason and navigate.” Anthropic’s contextual retrieval, Jina’s late chunking, vision-based approaches, and now fully vectorless systems like PageIndex are all symptoms of the same realisation: vector similarity is a proxy for relevance, and proxies have limits.
Whether PageIndex specifically becomes the standard is less important than the architectural shift it embodies. The next generation of document retrieval systems will likely combine the scalability of vectors with the precision of structured reasoning. For now, if you are building RAG for financial analysis, regulatory compliance, or any domain where “close enough” is not good enough, PageIndex is worth a serious look. Just go in with your eyes open about the cost, latency, scalability, and data sovereignty trade-offs.
PageIndex GitHub | FinanceBench Dataset | Anthropic Contextual Retrieval | Jina Late Chunking | Vision-Guided Chunking


