
MERaLiON-TextLLM: Cross-Lingual Understanding of Large Language Models in Chinese, Indonesian, Malay, and Singlish.

Xin Huang, Tarun Kumar Vangani, Minh Duc Pham, Xunlong Zou, Bin Wang, Zhengyuan Liu, Ai Ti Aw

CoRR (2025)

Abstract
Multilingual large language models (MLLMs) have shown impressive capabilities across a variety of languages. However, efficacy can differ greatly across language families, especially for those with limited linguistic resources. This report presents MERaLiON-TextLLM, a series of open-source language models specifically tailored to improve understanding and generation in Chinese, Indonesian, Malay, and Singlish. The initially released model is built on Llama-3-8B-Base and refined through a meticulously crafted process of continued pre-training and weight merging. Our approach achieves performance improvements across benchmarks in these languages, exceeding the capabilities of the official Llama-3 models. We provide the model checkpoints as a resource to support further research and development in cross-lingual language understanding.
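Since the checkpoints are released for further research, they can presumably be loaded with standard Hugging Face tooling. The snippet below is a minimal sketch of that workflow; the repository name MERaLiON/MERaLiON-TextLLM is a placeholder assumption not confirmed by the abstract, so check the official release for the actual model ID.

```python
# Minimal sketch of loading a released checkpoint with Hugging Face
# transformers. The model ID below is a placeholder assumption; consult
# the official release for the actual repository name.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MERaLiON/MERaLiON-TextLLM"  # hypothetical repository name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# The model targets Chinese, Indonesian, Malay, and Singlish, so a
# prompt in any of these languages should be handled.
inputs = tokenizer("Apa khabar? Terangkan cuaca hari ini.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```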

Key points: This paper presents MERaLiON-TextLLM, an open-source multilingual large language model optimized for Chinese, Indonesian, Malay, and Singlish, improving understanding and generation in these languages and surpassing the official Llama-3 models on the relevant benchmarks.

Method: Starting from Llama-3-8B-Base, the team refined the model through a carefully designed pipeline of continued pre-training and weight merging.
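The abstract does not disclose the exact merging recipe. A common baseline is linear interpolation of parameter tensors between the continued-pre-trained model and its base, sketched below; the interpolation weight alpha and the local checkpoint path are illustrative assumptions, not the authors' settings.

```python
# Illustrative sketch of linear weight merging between two checkpoints
# of the same architecture (a base model and its continued-pre-trained
# variant). This is the common interpolation baseline, not necessarily
# the recipe used in the paper.
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
tuned = AutoModelForCausalLM.from_pretrained("./continued-pretrain-ckpt")  # hypothetical path

alpha = 0.5  # interpolation weight, an assumption; typically tuned on dev sets

with torch.no_grad():
    tuned_state = tuned.state_dict()
    merged_state = {}
    for name, base_param in base.state_dict().items():
        # merged = (1 - alpha) * base + alpha * tuned, element-wise per tensor
        merged_state[name] = (1 - alpha) * base_param + alpha * tuned_state[name]

base.load_state_dict(merged_state)
base.save_pretrained("./merged-model")
```

In practice, alpha is chosen by evaluating candidate merges on held-out benchmarks in the target languages, trading off retained base-model ability against the gains from continued pre-training.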

Experiments: Evaluation covered benchmarks in the four target languages; the abstract does not name the specific datasets, but the results show clear performance gains for MERaLiON-TextLLM on these languages. The model checkpoints are released to support further research and development.