Research Report 2026

LLM MT Arena Report

基于 Wang et al. (2023) 理论,评估 11 款 LLM 的篇章级翻译能力。

Evaluating 11 LLMs in Document-Level MT based on Wang et al. (2023) framework.

01. 理论背景 01. Theoretical Foundation

篇章建模能力 (Discourse Modeling)

Discourse Modeling Ability

基于 Wang et al. (2023) 的研究,LLM 在处理篇章级翻译(DLMT)时具有显著的“跨句上下文感知(Cross-sentence Contextual Awareness)”。这使得模型不仅能翻译单个句子,还能维持整个文档的全局一致性。

According to Wang et al. (2023), LLMs demonstrate superior 'Cross-sentence Contextual Awareness' in DLMT, enabling global consistency beyond sentence-level alignment.

零代词恢复 (Zero Pronoun Recovery)

Zero Pronoun Recovery (ZP)

在中文小说(如 mZPRT 语料库)中,主语缺失(ZP)是传统 NMT 翻译失效的主因。研究表明,LLM 凭借庞大的参数量和长上下文窗口,能精准推断并补全省略的主语,这是本次评估中 Qwen 系列胜出的关键。

In Chinese fiction, subject omission (ZP) causes traditional NMT to fail. LLMs' large parameter counts and long context windows allow accurate inference of missing subjects, a key differentiator for the Qwen series.

02. 竞技场排名:全量多维对比 02. Arena Rankings: Full Comparison

Overall (总榜)
Fiction (小说)
News (新闻)
Social (社交)
Stability (稳定性)

03. 案例分析:质性解读与深度对比 03. Case Analysis & Qualitative Contrast

Fiction / Narrative Coherence

案例 1:小说场景中的零代词还原 (ZP Recovery)

Case 1: Zero Pronoun Recovery in Narrative

Original (ZH)

“在东校区?” 周图明显犹豫了一下。不管是学长还是老师都曾警告过。不要靠近。陈歌看着周图的眼睛:“相比较你们,其实他们更加接近希望。”

Qwen 3.5-27b (Win)

"At the East Campus?" Zhou Tu hesitated. Both seniors and teachers had warned him. Do not approach. Chen Ge looked into his eyes: "Compared to you, they are actually closer to hope."

Tencent MT (Loss)

"In the East Campus?" Zhou Tu hesitated. Senior or teacher warned. Don't go near. Chen Ge looked into Zhou Tu's eyes: "Compared with you, they are closer to hope."

深度分析: Qwen 成功推断出警告的对象是“周图”,并在英文中补全了代词“him”。传统 NMT 在这里产生了“主语失踪”的断层,证明了 LLM 的篇章建模优势。
In-depth: Qwen correctly inferred that "Zhou Tu" was the object of the warning, adding "him" in the output. Traditional MT suffered from subject loss, proving LLMs' discourse modeling edge.
News / Terminology Consistency

案例 2:新闻报道中的术语对齐

Case 2: Terminology Alignment in News

Original (ZH)

“学霸”与“学渣”在教育体系中的分层,本质上是资源的错位分配。

Qwen 3.5-27b

The stratification between "top students" and "struggling students" reflects resource misallocation.

Minimax M2.7

The layers between "academic masters" and "academic dregs" are misallocated resources.

深度分析: Qwen 选用了符合英文地道表达的“top students”和“struggling students”。而对比模型选用了“academic dregs”(学术残渣),显示出其在处理网络文化与地道表达时的风格偏移。
In-depth: Qwen chose idiomatic terms like "top/struggling students," whereas the other model used "academic dregs," showing a stylistic deviation in cultural translation.

04. 附录:评估流程与算法解读 04. Appendix: Methodology & Algorithm

为确保评估体系的透明度与可复现性,我们在此公开核心算法逻辑与 Prompt 设计方案。

To ensure transparency and reproducibility, we disclose our core algorithmic logic and Prompt design.

A. 闭环回译 (Closed-Loop Back-Translation)

A. Closed-Loop Back-Translation

所有模型必须完成从 ZH → EN(Forward)到 EN → ZH(Backward)的双向流转,并对比三者(原文、正译、回译)的语义损耗。

All models must complete a bidirectional pass: ZH → EN (Forward) followed by EN → ZH (Backward). Semantic loss is then compared across the three texts (source, forward translation, and back-translation).

core/translator.py (Python)
# Core translation-engine execution logic
def _process_single_file(self, content):
    # Stage 1: forward translation (initial output)
    translated_en = self.engine.translate(content, "Chinese", "English")

    # Stage 2: back-translation (verifies semantic consistency)
    if translated_en:
        translated_zh = self.engine.translate(translated_en, "English", "Chinese")
        return {"en": translated_en, "zh_back": translated_zh}
    return None  # forward pass produced no output; skip back-translation
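The report does not show how the semantic loss between the source and the back-translation is computed (the Consistency score in the appendix is LLM-judged). As a purely illustrative proxy, and an assumption of ours rather than project code, a surface-similarity ratio can flag obviously lossy round trips before any LLM judging runs:

```python
from difflib import SequenceMatcher

def roundtrip_similarity(source_zh: str, back_translated_zh: str) -> float:
    """Surface similarity between the original and the back-translation.

    Returns a ratio in [0.0, 1.0]; 1.0 means the strings are identical.
    Crude character-level proxy only: the report's actual consistency
    score is produced by an LLM judge, not by this heuristic.
    """
    return SequenceMatcher(None, source_zh, back_translated_zh).ratio()

# A lossless round trip scores 1.0; an unrelated back-translation scores low.
perfect = roundtrip_similarity("今天天气很好", "今天天气很好")
lossy = roundtrip_similarity("今天天气很好", "明天下雨")
```

Such a cheap filter could skip LLM judging for round trips that are clearly broken, though it says nothing about paraphrases that preserve meaning with different wording.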

B. LLM Judge 评分提示词 (Prompt Design)

B. LLM Judge Prompt Design

core/evaluator.py (LLM prompt)
prompt = f"""
As an expert professional translator, evaluate the quality of the translation.
Source ({src_lang}): {source}
Translation ({tgt_lang}): {translation}

Criteria:
- accuracy (0.0-1.0): Faithfulness to the original source meaning.
- fluency (0.0-1.0): Naturalness and grammatical correctness.
- style (0.0-1.0): Preservation of the genre's tone (e.g., horror, news).
- total (0.0-1.0): An overall weighted average.

Return ONLY a JSON object.
"""
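The prompt asks for "ONLY a JSON object", but in practice judge models sometimes wrap their answer in prose or formatting. A defensive parser like the following sketch (our assumption, not code from core/evaluator.py) keeps the pipeline robust:

```python
import json
import re

def parse_judge_reply(reply: str) -> dict:
    """Pull the score object out of an LLM judge reply.

    Models occasionally surround the JSON with extra text even when told
    to return only a JSON object, so we grab the first {...} span and
    validate that all four expected score keys are present.
    """
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in judge reply")
    scores = json.loads(match.group(0))
    missing = {"accuracy", "fluency", "style", "total"} - scores.keys()
    if missing:
        raise ValueError(f"judge reply missing keys: {sorted(missing)}")
    return scores

# Example: a reply that ignores the "ONLY a JSON object" instruction.
reply = 'Here are the scores: {"accuracy": 0.9, "fluency": 0.85, "style": 0.8, "total": 0.86}'
scores = parse_judge_reply(reply)
```

Raising on missing keys (rather than defaulting to 0.0) avoids silently dragging a model's average down when the judge misbehaves.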

C. 评分权重 (The Recipe)

C. Scoring Recipe

Metric           Weight   Description
Forward Score    40%      正向翻译质量 (Forward translation quality, ZH → EN)
Backward Score   30%      回译质量 (Back-translation quality, EN → ZH)
Consistency      30%      源文与回译的语义契合度 (Semantic agreement between source and back-translation)
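Given the weights above, the total reduces to a simple weighted sum. A minimal sketch of that arithmetic (illustrative, not project code):

```python
# Scoring recipe from the table above: 40% forward, 30% backward, 30% consistency.
WEIGHTS = {"forward": 0.4, "backward": 0.3, "consistency": 0.3}

def arena_score(forward: float, backward: float, consistency: float) -> float:
    """Combine the three sub-scores (each in [0.0, 1.0]) into one total."""
    total = (WEIGHTS["forward"] * forward
             + WEIGHTS["backward"] * backward
             + WEIGHTS["consistency"] * consistency)
    return round(total, 4)

# 0.4 * 0.9 + 0.3 * 0.8 + 0.3 * 0.7 = 0.36 + 0.24 + 0.21 = 0.81
example = arena_score(0.9, 0.8, 0.7)
```

Because the weights sum to 1.0, a model that scores 1.0 on all three axes receives exactly 1.0 overall.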