Research · A/B benchmark

OWLGraph beats vector RAG by 11.3pp on the same dataset.

We ran 100 questions across 15 categories (including paraphrase, category lookup, constraint-join, and multi-hop) through two retrieval pipelines pointed at the same database, the same reader model, and the same LLM judge. Only the retrieval strategy varied.

100 questions
15 categories
gpt-5-mini reader model

The top-line numbers.

OWLGraph is more accurate and more expensive. The right comparison isn't either column on its own — it's cost per correct answer in your eval suite.

Metric                  OWLGraph   Naive     Δ
LLM-Acc                 0.950      0.837     +11.3pp
Contain-Acc             0.899      0.787     +11.2pp
Perfect (LLM-Acc=1.0)   94/100     80/100    +14
Wrong (LLM-Acc<0.4)     4/100      13/100    −9
Cost / question         $0.0005    $0.0001   4.1×
Latency / question      20.5s      5.2s      3.9×

Controlled setup — only retrieval varies.

  • Same database, both runs. 48 cleaned Wikipedia passages about pizza; 35 typed entities; 33 typed triples; a Manchester-style ontology covering Pizza, PizzaTopping, and ~40 subclasses.
  • Same embedder. OpenAI text-embedding-3-small for both vector retrieval and the typed-entity index.
  • Same reader. gpt-5-mini for both pipelines and for the LLM judge.
  • Same 100 questions. Generated from the ground-truth entity / triple pairs, then categorized into 15 buckets by question shape (paraphrase, topping-class, reverse-topping, place-locatedIn, …).
  • Only the retrieval strategy differs. Naive: BM25 + vector cosine on passage chunks. OWLGraph: typed-entity resolution → ontology traversal → passage retrieval over the chain, with a small ReAct loop.

12 questions where OWLGraph was perfect and naive was wrong.

They cluster in two categories: topping-class lookups and reverse-topping queries. Both require ontological reasoning that vector similarity can't perform.

pizza-024 · B.topping-class

Q: What category of pizza topping is Ham?

Gold: meat

OWLGraph

"Ham is a meat topping. Ontology entity record for 'Ham' lists types: PizzaTopping, MeatTopping."

Naive

"The passages show ham is used as a topping on Hawaiian pizza (listed as 'either ham or bacon') but do not specify a topping category for ham."

pizza-070 · E.reverse-topping

Q: Which pizzas have Oregano as a topping?

Gold: Pizza Marinara

OWLGraph

"Pizza marinara. Evidence: 'Pizza marinara… is seasoned with tomatoes, extra virgin olive oil, oregano, and garlic.' — Chunk 0x1."

Naive

"The provided passages do not state that any specific pizzas have oregano as a topping."

The pattern: when the question asks about typed entities (their category, their relationships), naive RAG returns the chunks that mention the entity but not the ones that answer the question. The typed traversal closes that gap.
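
The gap can be illustrated with a toy typed lookup. The sketch below is not OWLGraph's implementation: the SUBCLASS_OF and ENTITY_TYPES tables, the MeatTopping → PizzaTopping edge, and the function names are hypothetical, and only the Ham record (types PizzaTopping, MeatTopping) comes from the example above. It shows how a category question can become a walk over asserted types rather than a search for a passage that happens to state the category.

```python
# Toy ontology: class -> parent class. The subclass edge is an assumption
# made for illustration; it is not taken from the benchmark ontology.
SUBCLASS_OF = {
    "MeatTopping": "PizzaTopping",
}

# Entity -> asserted types. The Ham record mirrors the pizza-024 example.
ENTITY_TYPES = {
    "Ham": ["PizzaTopping", "MeatTopping"],
}


def depth(cls):
    """Count subclass edges from cls up to its root class."""
    steps = 0
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        steps += 1
    return steps


def most_specific_type(entity):
    """Return the deepest asserted type of an entity, or None if unknown."""
    types = ENTITY_TYPES.get(entity)
    if not types:
        return None
    return max(types, key=depth)


def topping_category(entity):
    """Answer 'what category of topping is X?' from types alone."""
    cls = most_specific_type(entity)
    # The root class carries no category information.
    if cls is None or cls == "PizzaTopping" or not cls.endswith("Topping"):
        return None
    return cls[: -len("Topping")].lower()


print(topping_category("Ham"))  # prints "meat"
```

A passage-only retriever has no equivalent of most_specific_type: if no chunk states the category in words, there is nothing for similarity search to find.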

Most of the gain is concentrated in three categories.

On a third of question types (paraphrase, simple lookups), both pipelines are at parity. This is the right shape — OWLGraph is doing extra work where extra work matters.

Category            n    OWLGraph   Naive   Δ
B.topping-class     18   1.00       0.61    +0.39
F.place-locatedIn   5    1.00       0.60    +0.40
F.place-yes         5    1.00       0.60    +0.40
E.reverse-topping   8    1.00       0.69    +0.31
A.toppings          6    1.00       0.92    +0.08
C.disjointness      15   1.00       1.00    ±0.00
A.origin            7    1.00       1.00    ±0.00
G.paraphrase        10   1.00       1.00    ±0.00
D.veg-inference     6    0.58       0.92    −0.33
E.reverse-origin    5    0.80       0.93    −0.13
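
Weighting each category's Δ by its question count shows where the overall gain comes from. The sketch below uses only the rows listed in the table (85 of the 100 questions) and assumes Δ is the difference in mean LLM-Acc within the category; the names ROWS and contributions are illustrative.

```python
# (category, n, delta) copied from the per-category table.
ROWS = [
    ("B.topping-class", 18, 0.39),
    ("F.place-locatedIn", 5, 0.40),
    ("F.place-yes", 5, 0.40),
    ("E.reverse-topping", 8, 0.31),
    ("A.toppings", 6, 0.08),
    ("C.disjointness", 15, 0.00),
    ("A.origin", 7, 0.00),
    ("G.paraphrase", 10, 0.00),
    ("D.veg-inference", 6, -0.33),
    ("E.reverse-origin", 5, -0.13),
]

TOTAL_QUESTIONS = 100


def contributions(rows, total=TOTAL_QUESTIONS):
    """Percentage points each category adds to the overall accuracy delta."""
    return {name: 100.0 * n * delta / total for name, n, delta in rows}


contrib = contributions(ROWS)
# Largest contributors first.
for name, pp in sorted(contrib.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {pp:+.2f}pp")
print(f"{'listed rows total':<20} {sum(contrib.values()):+.2f}pp")
```

Under these assumptions, topping-class alone contributes about 7pp, and the listed rows sum to roughly 11.35pp, close to the +11.3pp headline (the table deltas are rounded).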

Where naive wins.

In two cases out of 100, naive RAG was perfect and OWLGraph was wrong. Both are honest failures of the typed approach: when the question is about a property that appears in the passage text but is not represented as a typed edge ("Is Pizza Marinara vegetarian?"), the ontology traversal can land on the wrong evidence subgraph and miss what was sitting in a chunk. This failure mode is worth knowing about.

The fix isn't conceptual — it's coverage. Improving the ontology to encode "vegetarian" as a derived property (no MeatTopping ∈ toppings) closes most of this gap. OWLGraph ships with the tools to make those refinements visible at query time.
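
One way to read the derived property is as a rule over topping types: a pizza is vegetarian when none of its toppings is typed as MeatTopping. This is a minimal sketch, not the shipped ontology. The TOPPINGS and TOPPING_TYPES tables are hypothetical; the Marinara toppings come from the passage quoted earlier, the Ham typing from the pizza-024 record, and the non-meat class names are invented for illustration.

```python
# Topping -> asserted types. Only the Ham entry mirrors the benchmark;
# the other class names are assumptions.
TOPPING_TYPES = {
    "Tomato": {"PizzaTopping", "VegetableTopping"},
    "Olive oil": {"PizzaTopping", "OilTopping"},
    "Oregano": {"PizzaTopping", "HerbTopping"},
    "Garlic": {"PizzaTopping", "VegetableTopping"},
    "Ham": {"PizzaTopping", "MeatTopping"},
}

# Pizza -> toppings. The Hawaiian list is partial: only ham is mentioned
# in the example above.
TOPPINGS = {
    "Pizza Marinara": ["Tomato", "Olive oil", "Oregano", "Garlic"],
    "Hawaiian pizza": ["Ham"],
}


def is_vegetarian(pizza):
    """Derived property: no MeatTopping among the pizza's toppings.

    Returns True or False when the ontology can decide, and None when
    coverage is missing (unknown pizza, or an untyped topping with no
    meat topping found).
    """
    toppings = TOPPINGS.get(pizza)
    if toppings is None:
        return None
    has_untyped = False
    for topping in toppings:
        types = TOPPING_TYPES.get(topping)
        if types is None:
            has_untyped = True
        elif "MeatTopping" in types:
            return False  # one meat topping settles the question
    return None if has_untyped else True


print(is_vegetarian("Pizza Marinara"))  # True
print(is_vegetarian("Hawaiian pizza"))  # False
```

Returning None instead of guessing keeps a coverage gap visible, so the caller can fall back to passage evidence rather than answer from an incomplete subgraph.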

Cost per question vs cost per correct answer.

Naive is cheaper per question ($0.0001 vs $0.0005), but it can become more expensive per acceptable answer once wrong answers carry a cost. The real number to track in your stack is:

cost_per_acceptable_answer = (cost_per_call × calls) / (n_correct − retry_factor × n_wrong_that_reached_user)

Here retry_factor is the penalty for a wrong answer reaching a customer, counted in correct answers forfeited. For regulated or trust-sensitive applications that can be 10–100×, which flips the per-question economics. OWLGraph is the right choice when wrong answers are expensive; vector RAG is the right choice when they aren't.
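
The formula can be evaluated against the top-line table. The sketch below makes several assumptions that are not stated in the benchmark: one call per question, the "Perfect" counts stand in for n_correct, every "Wrong" answer reaches the user, and a non-positive denominator is reported as infinite cost. Function and variable names are illustrative.

```python
import math


def cost_per_acceptable_answer(cost_per_call, calls, n_correct,
                               n_wrong_reached_user, retry_factor):
    """(cost_per_call * calls) / (n_correct - retry_factor * n_wrong)."""
    net_acceptable = n_correct - retry_factor * n_wrong_reached_user
    if net_acceptable <= 0:
        # Wrong answers cost more than the correct ones are worth.
        return math.inf
    return (cost_per_call * calls) / net_acceptable


# Per-question cost, Perfect and Wrong counts from the top-line table.
PIPELINES = {
    "OWLGraph": dict(cost_per_call=0.0005, calls=100,
                     n_correct=94, n_wrong_reached_user=4),
    "Naive": dict(cost_per_call=0.0001, calls=100,
                  n_correct=80, n_wrong_reached_user=13),
}


def compare(retry_factor):
    """Cost per acceptable answer for each pipeline at one retry_factor."""
    return {
        name: cost_per_acceptable_answer(retry_factor=retry_factor, **params)
        for name, params in PIPELINES.items()
    }


for rf in (0, 5.5, 10):
    print(rf, compare(rf))
```

Under these assumptions naive is cheaper at retry_factor = 0, the two pipelines break even near retry_factor ≈ 5, and by retry_factor = 10 the naive denominator is negative, which the sketch reports as infinite cost.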

Try it on your data.

The dataset, the questions, and the scoring scripts are all in the platform repo. The same retrieval module ships in production: point it at your corpus and check whether the same shape holds.