
15 Apr 2025

Evaluating NLG

Papers

Approaches

Main goal: agreement (correlation) with human judgement

Kinds

Reference-free

  • Use an LLM to estimate the probability of the generated text
    • the assumption being that better texts get higher probability
    • downsides: unreliable / low correspondence with human judgements according to 1
  • Ask an LLM how coherent / grammatical / criterion_name the output is from 0 to 5
    • downsides: LLMs will often pick 3, but G-Eval works around that by weighting the candidate scores by their token probabilities (minimal sketch below)
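
A minimal sketch of that G-Eval-style workaround, in Python, assuming we already have the judge LLM's probability for each candidate score token (the numbers below are invented):

    # Probability-weighted score instead of taking the single most likely token.
    def weighted_score(token_probs: dict[str, float]) -> float:
        total = sum(token_probs.values())
        return sum(int(tok) * p for tok, p in token_probs.items()) / total

    # Hypothetical probabilities for a prompt like
    # "Rate the coherence of this advert from 0 to 5":
    probs = {"0": 0.01, "1": 0.02, "2": 0.07, "3": 0.45, "4": 0.35, "5": 0.10}
    print(weighted_score(probs))  # ~3.41 rather than a flat "3"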

Reference-based

  • BLEU/ROUGE and friends (n-gram based)
    • low correlation with human judgements
  • (Semantic) similarity between two texts based on embeddings (BERTScore/MoverScore); minimal sketch below
  • GPTScore2: assigns higher probability to texts that better fit the given framing
  • FACTUAL CONSISTENCY of generated summaries: FactCC, QAGS4
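
A minimal sketch of the embedding-similarity idea (whole-text cosine similarity rather than BERTScore's token-level matching), assuming the sentence-transformers package; the model name is just one common choice:

    # Reference-based scoring via embedding cosine similarity.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    reference = "Car has 4 wheels"
    generated = "ROBUST car with 4 EXTRA LARGE WHEELS made of AUSTRALIAN METAL"

    emb_ref, emb_gen = model.encode([reference, generated], convert_to_tensor=True)
    score = util.cos_sim(emb_ref, emb_gen).item()  # in [-1, 1], higher = more similar
    print(f"similarity: {score:.3f}")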

Framings (for GPT/LLM-based metrics)

  • Form-filling (G-Eval)
  • Conditional generation (= scoring by the probability of the output): GPTScore (minimal sketch below)
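
A minimal sketch of the conditional-generation framing, assuming a small Hugging Face causal LM (gpt2 here, purely as a stand-in for the much larger models GPTScore actually uses): score the output by its average token log-probability given an aspect-specific instruction.

    # GPTScore-style: average log-prob of the output conditioned on an instruction.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    def gpt_score(instruction: str, output: str) -> float:
        prompt_ids = tok(instruction, return_tensors="pt").input_ids
        output_ids = tok(output, return_tensors="pt").input_ids
        input_ids = torch.cat([prompt_ids, output_ids], dim=1)
        with torch.no_grad():
            logits = model(input_ids).logits
        # Log-prob of each token given everything before it.
        log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
        targets = input_ids[:, 1:]
        token_lp = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
        # Average only over the output tokens, not the instruction.
        return token_lp[:, prompt_ids.shape[1] - 1:].mean().item()

    instruction = "Write a fluent, coherent advert for the following product:\n"
    print(gpt_score(instruction, "A robust car with four extra-large wheels."))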

How to evaluate evaluations

Meta-evaluators

  • Meta-evaluators are a thing! (= datasets based on human ratings)
    • G-Eval1 uses
      • SummEval 5
      • TopicalChat6
      • QAGS4 (for evaluating factual consistency/hallucinations)
    • (image from the G-Eval paper1)
  • Correlation with human ratings

Stats

  • Many papers use various correlation statistics to compare metric scores against human ratings (minimal sketch below).
  • G-Eval (p. 6)
    • Spearman, Kendall-Tau correlations
    • Krippendorff’s Alpha
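
A minimal sketch of that comparison, using scipy's rank correlations on invented per-example scores:

    # How well does an automatic metric track human ratings?
    from scipy import stats

    human_ratings = [4.5, 2.0, 3.5, 5.0, 1.5, 4.0]        # e.g. averaged annotator scores
    metric_scores = [0.81, 0.35, 0.60, 0.90, 0.20, 0.70]  # e.g. BERTScore / G-Eval outputs

    rho, rho_p = stats.spearmanr(human_ratings, metric_scores)
    tau, tau_p = stats.kendalltau(human_ratings, metric_scores)

    print(f"Spearman rho: {rho:.3f} (p={rho_p:.3f})")
    print(f"Kendall tau:  {tau:.3f} (p={tau_p:.3f})")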

On a specific task

Task formulation

  • Input:
    • Advert for a car.

      Car has 4 wheels

    • Target profile of possible client

      family with 10 kids 5 dogs living in the Australian bush

  • Output:
    • Advert targeted for that profile:

      ROBUST car with 4 EXTRA LARGE WHEELS made of AUSTRALIAN METAL able to hold 12 KIDS and AT LEAST 8 DOGS

Concrete approaches

  • Factual correctness / hallucinations
    • Framed as summarization evaluation: FactCC, QAGS4
  • GPTScore, G-Eval based on yet-to-be-determined criteria
    • Coherence, consistency, fluency, engagement, etc. — the criteria used by meta-evaluators?
    • GPTScore aspects
    • TODO: ask the car company what they care about besides factual correctness and grammar
  • information density-ish? Use an LLM to extract key-value pairs of the main info in the advert (number_of_wheels: 4), formulate a question from each, and score adverts higher the more of those questions they answer (see the sketch below)
    • Inspired by QAGS
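
A minimal sketch of that idea, assuming the key-value extraction has already happened; the facts and the answer check below are naive placeholders, where a real pipeline would use an LLM or a QA model as in QAGS:

    # QAGS-inspired "information coverage": fraction of extracted facts
    # whose value shows up in the generated advert.
    def coverage_score(facts: dict[str, str], advert: str) -> float:
        text = advert.lower()
        answered = sum(1 for value in facts.values() if value.lower() in text)
        return answered / len(facts)

    # Facts extracted (hypothetically, by an LLM) from the source advert:
    facts = {"number_of_wheels": "4", "product": "car"}

    generated = "ROBUST car with 4 EXTRA LARGE WHEELS made of AUSTRALIAN METAL"
    print(coverage_score(facts, generated))  # 1.0: both facts are present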

TODO

  • LLM-as-a-judge and comparative assessment
  • think about human eval + meta-eval
  • think about what we want from the car company
    • metrics they care about

  1. G-Eval: <_(@liuGEvalNLGEvaluation2023) “G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment” (2023) / Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, Chenguang Zhu / http://arxiv.org/abs/2303.16634 / 10.48550/arXiv.2303.16634 _>

  2. <_(@fuGPTScoreEvaluateYou2023) “GPTScore: Evaluate as You Desire” (2023) / Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, Pengfei Liu / http://arxiv.org/abs/2302.04166 / 10.48550/arXiv.2302.04166 _>

  3. <_(@zhongUnifiedMultiDimensionalEvaluator2022) “Towards a Unified Multi-Dimensional Evaluator for Text Generation” (2022) / Ming Zhong, Yang Liu, Da Yin, Yuning Mao, Yizhu Jiao, Pengfei Liu, Chenguang Zhu, Heng Ji, Jiawei Han / http://arxiv.org/abs/2210.07197 / 10.48550/arXiv.2210.07197 _>

  4. <_(@wangAskingAnsweringQuestions2020) “Asking and Answering Questions to Evaluate the Factual Consistency of Summaries” (2020) / Alex Wang, Kyunghyun Cho, Mike Lewis / http://arxiv.org/abs/2004.04228 / 10.48550/arXiv.2004.04228 _>

  5. <_(@fabbriSummEvalReevaluatingSummarization2021) “SummEval: Re-evaluating Summarization Evaluation” (2021) / Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, Dragomir Radev / https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00373/100686/SummEval-Re-evaluating-Summarization-Evaluation / 10.1162/tacl_a_00373 _>

  6. <_(@gopalakrishnanTopicalChatKnowledgeGroundedOpenDomain2023) “Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations” (2023) / Karthik Gopalakrishnan, Behnam Hedayatnia, Qinlang Chen, Anna Gottardi, Sanjeev Kwatra, Anu Venkatesh, Raefer Gabriel, Dilek Hakkani-Tur / http://arxiv.org/abs/2308.11995 / 10.48550/arXiv.2308.11995 _>
