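Evaluate the quality of generated text from multiple dimensions, such as consistency and fluency.
Each dimension has 30K pseudo-labeled samples. The dataset was constructed by the authors: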
we first design specific rules for several commonly evaluated dimensions to construct pseudo data, and then combine them to train the evaluator.
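A minimal sketch of the "rules per dimension, then combine" idea in the quote above. The notes do not record the actual rules, so both rules here are hypothetical stand-ins: swapping two adjacent tokens to produce a fluency negative, and pairing a source with another sample's output to produce a consistency negative. The function names and the record layout are likewise assumptions made for illustration.

```python
import random


def swap_adjacent_tokens(text, rng):
    """Hypothetical fluency rule: swap one random pair of adjacent tokens."""
    tokens = text.split()
    if len(tokens) < 2:
        return text
    i = rng.randrange(len(tokens) - 1)
    tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return " ".join(tokens)


def build_pseudo_data(pairs, rng):
    """Build labeled pseudo samples for two dimensions and combine them.

    pairs: list of (source, output) tuples; at least two are needed so that
    the consistency negative can borrow a different sample's output.
    """
    data = []
    for idx, (source, output) in enumerate(pairs):
        # The original pair serves as a positive for every dimension.
        for dim in ("fluency", "consistency"):
            data.append({"dimension": dim, "source": source,
                         "output": output, "label": 1})
        # Fluency negative: same content, damaged word order.
        data.append({"dimension": "fluency", "source": source,
                     "output": swap_adjacent_tokens(output, rng), "label": 0})
        # Consistency negative: output taken from the next sample.
        mismatched = pairs[(idx + 1) % len(pairs)][1]
        data.append({"dimension": "consistency", "source": source,
                     "output": mismatched, "label": 0})
    # Mix the dimensions into one training set.
    rng.shuffle(data)
    return data


if __name__ == "__main__":
    demo_pairs = [
        ("the cat sat on the mat", "a cat sat"),
        ("stocks fell sharply today", "stocks fell"),
    ]
    for record in build_pseudo_data(demo_pairs, random.Random(0)):
        print(record)
```

Each input pair yields four records (one positive and one negative per dimension), so the combined set stays balanced across dimensions and labels.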
Task types: summarization and dialogue.
Experimental validation: the baseline metrics compared against include BLEU, METEOR, ROUGE, BERTScore, ...
Human-annotated data: to verify that the proposed evaluator is qualified, we need to calculate correlations with human scores in each benchmark.
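A rough sketch of the correlation check, assuming one evaluator score and one human score per sample. The notes do not say which coefficient is used; Pearson and Spearman are shown here as two common choices. The function names are illustrative, not taken from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (std_x * std_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Made-up scores, for illustration only.
    evaluator_scores = [0.91, 0.40, 0.75, 0.22, 0.60]
    human_scores = [4.5, 2.0, 4.0, 1.5, 3.0]
    print("Pearson :", pearson(evaluator_scores, human_scores))
    print("Spearman:", spearman(evaluator_scores, human_scores))
```

In practice this would be computed separately for each dimension and each benchmark, so that an evaluator that tracks human fluency ratings but not consistency ratings is not hidden by an aggregate number.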
Train the evaluator for 1-3 epochs. Supervised method.
Conditional text generation: for example, in machine translation the goal is to generate a hypothesis h = (h1, ..., hm) based on a given source text s = (s1, ..., sn).
require human judgments to train (i.e., supervised me