Evaluating RAG
General info
- Roughly, it’s evaluating:
  - context: how relevant/correct the retrieved chunks are
  - answer: how good the generated claims are
  - (plus the interplay in between, e.g. whether the answer comes from the context, regardless of whether either is relevant to the query)
- Or, in RAGChecker’s terms: retriever metrics, generator metrics, and overall metrics (see the RAGChecker section below).
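A toy lexical sketch of that split, purely illustrative (this is not how any of the libraries below compute their metrics; the function names and the token-overlap proxy are made up here):

```python
def token_overlap(a: str, b: str) -> float:
    """Fraction of a's tokens that also appear in b (a crude lexical proxy)."""
    a_tokens, b_tokens = set(a.lower().split()), set(b.lower().split())
    return len(a_tokens & b_tokens) / max(len(a_tokens), 1)


def crude_rag_eval(query: str, contexts: list[str], answer: str) -> dict[str, float]:
    joined_context = " ".join(contexts)
    return {
        # Retriever side: does any retrieved chunk look related to the query?
        "context_relevance": max(token_overlap(query, c) for c in contexts),
        # Generator side: does the answer look grounded in the context,
        # regardless of whether either is relevant to the query?
        "faithfulness": token_overlap(answer, joined_context),
        # Overall: does the answer look related to the query at all?
        "answer_relevance": token_overlap(answer, query),
    }
```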
Sources / libs
- Especially Usage Pattern (Response Evaluation) - LlamaIndex (usage sketch after this list)
  - Faithfulness: context (=source) + answer => whether the answer comes from the available context (i.e. whether it was hallucinated)
  - Relevancy: query + context + answer => whether both the answer and the context are relevant to the specific query
- All of LlamaIndex’s eval modules: Modules - LlamaIndex
  - CorrectnessEvaluator: compares query + answer to a reference answer with another LLM as judge: Correctness Evaluator - LlamaIndex
  - Embedding Similarity Evaluator - LlamaIndex: response + reference semantic similarity via embeddings (SemanticSimilarityEvaluator)
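A minimal sketch of the evaluators above, following LlamaIndex’s response-evaluation usage pattern. Assumes llama-index-core plus the llama-index-llms-openai package and an OpenAI key for the judge LLM; the toy document, question, and reference answer are made up, and the judge model choice is arbitrary:

```python
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.evaluation import (
    CorrectnessEvaluator,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    SemanticSimilarityEvaluator,
)
from llama_index.llms.openai import OpenAI

judge_llm = OpenAI(model="gpt-4o-mini")

# Tiny toy index so the example is self-contained.
index = VectorStoreIndex.from_documents([Document(text="The Eiffel Tower is in Paris.")])
query = "Where is the Eiffel Tower?"
response = index.as_query_engine(llm=judge_llm).query(query)
reference = "The Eiffel Tower is located in Paris, France."

# Faithfulness: response + its source nodes only; was the answer hallucinated?
faithfulness = FaithfulnessEvaluator(llm=judge_llm)
print(faithfulness.evaluate_response(response=response).passing)

# Relevancy: query + context + answer; are both relevant to the question?
relevancy = RelevancyEvaluator(llm=judge_llm)
print(relevancy.evaluate_response(query=query, response=response).passing)

# Correctness: query + answer compared against a reference answer by another LLM.
correctness = CorrectnessEvaluator(llm=judge_llm)
print(correctness.evaluate(query=query, response=str(response), reference=reference).score)

# Semantic similarity: embedding similarity between response and reference.
similarity = SemanticSimilarityEvaluator()
print(similarity.evaluate(response=str(response), reference=reference).score)
```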
DeepEval’s metrics as given in the LlamaIndex docs:
```python
from deepeval.integrations.llama_index import (
    DeepEvalAnswerRelevancyEvaluator,
    DeepEvalFaithfulnessEvaluator,
    DeepEvalContextualRelevancyEvaluator,
    DeepEvalSummarizationEvaluator,
    DeepEvalBiasEvaluator,
    DeepEvalToxicityEvaluator,
)
```
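These implement the same evaluate_response interface as LlamaIndex’s own evaluators, so usage looks roughly like the sketch below (hedging on version details; `response` is assumed to be a LlamaIndex Response object, e.g. the one from the sketch above):

```python
from deepeval.integrations.llama_index import DeepEvalFaithfulnessEvaluator

# `response` = a llama_index Response from a query engine (see earlier sketch).
evaluator = DeepEvalFaithfulnessEvaluator()
result = evaluator.evaluate_response(
    query="Where is the Eiffel Tower?",
    response=response,
)
print(result.passing, result.feedback)
```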
RAGChecker
- RAGChecker: RAGChecker: A Fine-grained Framework For Diagnosing RAG
- LlamaIndex integration docs: RAGChecker: A Fine-grained Evaluation Framework For Diagnosing RAG - LlamaIndex
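A sketch of standalone usage following the ragchecker README; the model names, the toy record, and the exact JSON field names are assumptions to check against the current README:

```python
import json

from ragchecker import RAGChecker, RAGResults
from ragchecker.metrics import all_metrics

# One made-up record in RAGChecker's expected input format.
rag_results = RAGResults.from_json(json.dumps({
    "results": [
        {
            "query_id": "0",
            "query": "Where is the Eiffel Tower?",
            "gt_answer": "The Eiffel Tower is in Paris, France.",
            "response": "It is located in Paris.",
            "retrieved_context": [
                {"doc_id": "d1", "text": "The Eiffel Tower is in Paris."}
            ],
        }
    ]
}))

# The extractor splits responses into claims; the checker verifies them against
# the retrieved context / ground truth. Model names here are placeholders.
evaluator = RAGChecker(
    extractor_name="openai/gpt-4o-mini",
    checker_name="openai/gpt-4o-mini",
    batch_size_extractor=8,
    batch_size_checker=8,
)

# Computes retriever metrics (claim recall, context precision), generator metrics
# (faithfulness, noise sensitivity, hallucination, ...) and overall precision/recall/F1.
evaluator.evaluate(rag_results, all_metrics)
print(rag_results)
```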