Evaluating the document-level machine translation (DLMT) ability of 11 LLMs, based on the framework of Wang et al. (2023).
According to Wang et al. (2023), LLMs show strong "Cross-sentence Contextual Awareness" when translating long documents (DLMT). A model can therefore go beyond translating sentences in isolation and keep the whole document globally consistent.
In Chinese fiction (for example, the mZPRT corpus), omitted subjects, known as zero pronouns (ZP), are the main cause of failure for traditional NMT. Research indicates that LLMs, with their large parameter counts and long context windows, can infer and restore the omitted subject from the surrounding context, which is the key reason the Qwen series came out ahead in this evaluation.
“在东校区?” 周图明显犹豫了一下。不管是学长还是老师都曾警告过。不要靠近。陈歌看着周图的眼睛:“相比较你们,其实他们更加接近希望。”
"At the East Campus?" Zhou Tu hesitated. Both seniors and teachers had warned him. Do not approach. Chen Ge looked into his eyes: "Compared to you, they are actually closer to hope."
"In the East Campus?" Zhou Tu hesitated. Senior or teacher warned. Don't go near. Chen Ge looked into Zhou Tu's eyes: "Compared with you, they are closer to hope."
“学霸”与“学渣”在教育体系中的分层,本质上是资源的错位分配。
The stratification between "top students" and "struggling students" reflects resource misallocation.
The layers between "academic masters" and "academic dregs" are misallocated resources.
To keep the evaluation transparent and reproducible, we publish the core algorithm logic and the prompt design here.
Every model must complete a round trip: ZH → EN (Forward), then EN → ZH (Backward). The three texts (source, forward translation, back-translation) are then compared to measure semantic loss.
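# Core translation engine logic
def _process_single_file(self, content):
    # Stage 1: forward translation (ZH -> EN) produces the primary output
    translated_en = self.engine.translate(content, "Chinese", "English")
    # Stage 2: back-translation (EN -> ZH) to check semantic consistency;
    # skipped when the forward pass returned nothing
    translated_zh = None
    if translated_en:
        translated_zh = self.engine.translate(translated_en, "English", "Chinese")
    return {"en": translated_en, "zh_back": translated_zh}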
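Before any LLM-based judging, the source and its back-translation can be compared with a cheap lexical check. The sketch below is not the consistency metric used in this evaluation. It swaps in a plain character-bigram Dice overlap as a rough proxy, and the names `char_bigrams` and `bigram_dice` are hypothetical.

```python
from collections import Counter


def char_bigrams(text: str) -> Counter:
    """Count overlapping character bigrams, ignoring whitespace."""
    chars = [ch for ch in text if not ch.isspace()]
    return Counter(zip(chars, chars[1:]))


def bigram_dice(source_zh: str, back_zh: str) -> float:
    """Dice overlap of character bigrams, in [0.0, 1.0] (assumed proxy, not the real metric)."""
    a, b = char_bigrams(source_zh), char_bigrams(back_zh)
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0  # nothing to compare (empty or single-character inputs)
    shared = sum((a & b).values())  # multiset intersection of bigram counts
    return 2 * shared / total


source = "不要靠近。"
print(bigram_dice(source, source))      # identical text gives the maximum overlap
print(bigram_dice(source, "别走近。"))   # a paraphrase shares fewer bigrams
```

A lexical overlap like this penalizes valid paraphrases, so it is best treated as a sanity check alongside a semantic judgment rather than a replacement for one.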
prompt = f"""
As an expert professional translator, evaluate the quality of the translation.
Source ({src_lang}): {source}
Translation ({tgt_lang}): {translation}
Criteria:
- accuracy (0.0-1.0): Faithfulness to the original source meaning.
- fluency (0.0-1.0): Naturalness and grammatical correctness.
- style (0.0-1.0): Preservation of the genre's tone (e.g., horror, news).
- total (0.0-1.0): An overall weighted average.
Return ONLY a JSON object.
"""
| Metric | Weight | Description |
|---|---|---|
| Forward Score | 40% | Forward translation quality (ZH → EN) |
| Backward Score | 30% | Back-translation quality (EN → ZH) |
| Consistency | 30% | Semantic agreement between the source and the back-translation |
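The weights in the table suggest a simple linear combination. The sketch below assumes each component score is on the same 0.0 to 1.0 scale as the judge scores and that the final score is their weighted sum. The names `WEIGHTS` and `final_score` are hypothetical.

```python
# Weights taken from the table above; they sum to 1.0.
WEIGHTS = {"forward": 0.40, "backward": 0.30, "consistency": 0.30}


def final_score(forward: float, backward: float, consistency: float) -> float:
    """Weighted sum of the three component scores (each assumed to be in [0.0, 1.0])."""
    parts = {"forward": forward, "backward": backward, "consistency": consistency}
    for name, value in parts.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name}={value} is outside [0.0, 1.0]")
    return sum(WEIGHTS[name] * value for name, value in parts.items())


# 0.40 * 0.9 + 0.30 * 0.8 + 0.30 * 0.7
print(round(final_score(0.9, 0.8, 0.7), 2))
```

With these weights the forward translation carries the most influence, while the two round-trip measures together account for the remaining 60%.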