This repository implements the method for training multilingual sentence embeddings in the paper Modular Sentence Encoders: Separating Language Specialization from Cross-Lingual Alignment.
Abstract: Multilingual sentence encoders (MSEs) are commonly obtained by training multilingual language models to map sentences from different languages into a shared semantic space. As such, they are subject to the curse of multilinguality, a loss of monolingual representational accuracy due to parameter sharing. Another limitation of MSEs is the trade-off between performance on different tasks: cross-lingual alignment training distorts the optimal monolingual structure of the semantic spaces of individual languages, harming the utility of sentence embeddings in monolingual tasks; cross-lingual tasks, such as cross-lingual semantic similarity and zero-shot transfer for sentence classification, may also require conflicting cross-lingual alignment strategies. In this work, we address both issues by means of modular training of sentence encoders. We first train language-specific monolingual modules to mitigate negative interference between languages (i.e., the curse). We then align all non-English sentence embeddings to the English embedding space by training cross-lingual alignment adapters, preventing interference with the monolingual specialization from the first step. We train the cross-lingual adapters with two different types of data to resolve the conflicting requirements of different cross-lingual tasks. Monolingual and cross-lingual results on semantic textual similarity and relatedness, bitext mining and sentence classification show that our modular solution achieves better and more balanced performance across all tasks compared to full-parameter training of monolithic multilingual sentence encoders, especially benefiting low-resource languages.

Contact person: Yongxin Huang
Don't hesitate to send us an e-mail or report an issue if something is broken (and it shouldn't be) or if you have further questions.
This project requires Python 3.10. To install the requirements:
pip install -r requirements.txt
The script focus.py is used for training language-specific tokenizers and initializing the new embeddings with FOCUS:
python focus.py \
--model_name sentence-transformers/LaBSE \
--train_data_path data/language_adaptation/deu_train.txt \
--tokenizer_save_path model/tokenizer/deu_labse \
--embedding_save_path model/embedding/deu_labse
Parameter description
--model_name is the model name or path of the multilingual source model.
--train_data_path is the path of the .txt file with monolingual text data.
--tokenizer_save_path is the path to save the new tokenizer.
--embedding_save_path is the path to save the new embedding matrix.
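The saved tokenizer and embedding matrix are consumed by train_mlm.py in the next step. As a rough illustration of what the initialization amounts to, the sketch below swaps the source model's input embeddings for the FOCUS-initialized matrix; the file name embedding.pt and the tensor format are assumptions, since the actual on-disk layout is defined by focus.py.
# Minimal sketch: plugging a FOCUS-initialized embedding matrix into the source model.
# Assumptions: the tokenizer is a standard Hugging Face tokenizer directory and the
# embedding matrix is a torch tensor saved as "embedding.pt" (hypothetical file name).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("model/tokenizer/deu_labse")
model = AutoModel.from_pretrained("sentence-transformers/LaBSE")

new_embeddings = torch.load("model/embedding/deu_labse/embedding.pt")  # expected shape: (len(tokenizer), hidden_size)
model.resize_token_embeddings(len(tokenizer))
with torch.no_grad():
    model.get_input_embeddings().weight.copy_(new_embeddings)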
The script train_mlm.py is used for continual monolingual MLM pre-training to obtain each monolingual model:
python train_mlm.py \
--train_file data/language_adaptation/deu_train.txt \
--validation_file data/language_adaptation/deu_val.txt \
--model_name_or_path sentence-transformers/LaBSE \
--tokenizer_name model/tokenizer/deu_labse \
--embedding_path model/embedding/deu_labse \
--max_seq_length 256 \
--per_device_train_batch_size 128 \
--gradient_accumulation_steps 2 \
--output_dir model/mlm_model/deu
Parameter description
--train_file is the path of the .txt file with monolingual text data for training.
--validation_file is the path of the .txt file with monolingual text data for validation.
--model_name_or_path is the model name or path of the multilingual source model.
--tokenizer_name is the path to the monolingual tokenizer created in Step 1.
--embedding_path is the path to the monolingual embedding matrix created in Step 1.
--output_dir is the path to save the adapted model.
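A quick way to sanity-check the adapted monolingual model is a fill-mask query. This is only an illustrative check, assuming the output_dir holds a standard Hugging Face MLM checkpoint and that the new tokenizer keeps the [MASK] token.
# Illustrative sanity check only; assumes model/mlm_model/deu is a standard
# Hugging Face MLM checkpoint produced by train_mlm.py. Adjust the mask token
# if the new tokenizer uses a different one.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="model/mlm_model/deu")
print(fill_mask("Berlin ist die [MASK] von Deutschland."))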
First, we need to create monolingual paraphrase data in each target language with machine translation. The script data/paraphrase/translate_paraphrase_data.py provides code for the translation.
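For illustration, such a translation step could look roughly like the sketch below; the MT model (Helsinki-NLP/opus-mt-en-de here) and the data format are placeholders, and the actual pipeline used in this repository is implemented in data/paraphrase/translate_paraphrase_data.py.
# Illustrative only: translating English paraphrase pairs into German with a
# generic MT model. The MT model and data format actually used are defined by
# data/paraphrase/translate_paraphrase_data.py.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
pairs = [("A man is playing a guitar.", "Someone is playing an instrument.")]
translated_pairs = [
    (translator(a)[0]["translation_text"], translator(b)[0]["translation_text"])
    for a, b in pairs
]
print(translated_pairs)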
After we have prepared the training data, we can use the script train_sentence_encoder.py for monolingual sentence embedding training:
python train_sentence_encoder.py \
--langs deu \
--model_name_or_path model/mlm_model/deu \
--max_seq_length 64 \
--learning_rate 2e-5 \
--train_batch_size 32 \
--num_epochs 1 \
--output_path model/mono_encoder/labse_deu \
--train_data_dir data/paraphrase \
--train_type mono
Parameter description
--langs specifies the language(s) of the training data. For monolingual specialization, we always use one target language. For training multilingual baselines, you can pass multiple languages, e.g. --langs eng deu.
--model_name_or_path is the model name or path of the model trained in Step 2.
--output_path is the path to save the monolingual sentence encoder.
--train_data_dir is the directory containing the paraphrase data files created by the script data/paraphrase/translate_paraphrase_data.py.
--train_type should be either "mono" (monolingual) or "cross" (cross-lingual). For monolingual specialization, we always set it to "mono". "cross" can be used for training multilingual baselines, such as Singlec.
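Once trained, the monolingual encoder can be used like any sentence-transformers model for monolingual similarity scoring, for example (assuming the checkpoint saved at --output_path is a standard sentence-transformers model):
# Assumes model/mono_encoder/labse_deu is a standard sentence-transformers checkpoint.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("model/mono_encoder/labse_deu")
embeddings = encoder.encode(
    ["Das ist ein Beispielsatz.", "Dies ist nur ein Beispiel."],
    convert_to_tensor=True,
)
print(util.cos_sim(embeddings[0], embeddings[1]))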
After training monolingual sentence encoders, we train cross-lingual adapters to align them using the script train_cla_adapter.py:
python train_cla_adapter.py \
--pivot_lang eng \
--target_lang deu \
--pivot_model_name_or_path model/mono_encoder/labse_eng \
--target_model_name_or_path model/mono_encoder/labse_deu \
--max_seq_length 128 \
--learning_rate 1e-4 \
--train_batch_size 128 \
--num_epochs 1 \
--output_path model/cla_adapter/labse_deu_adapter \
--train_data_dir data/paraphrase
Parameter description
--pivot_lang is the pivot language. We always use English as the pivot language and align each non-English encoder to the English encoder.
--pivot_model_name_or_path is the name or path of the pivot language model, i.e. the English encoder trained in Step 3 of the monolingual specialization.
--target_lang is the language of the encoder that should be aligned to the English encoder.
--target_model_name_or_path is the name or path of the target language model trained in Step 3 of the monolingual specialization.
--output_path is the path to save the cross-lingual alignment adapter (CLA adapter).
--train_data_dir is the directory containing the paraphrase data files created by the script data/paraphrase/translate_paraphrase_data.py.
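As a rough illustration of what the alignment buys you, the sketch below scores parallel English-target pairs across the two monolingual spaces. How the CLA adapter is attached to the target encoder at inference time is determined by the evaluation scripts, so both encoders are simply passed in here as ready-made SentenceTransformer objects.
# Hedged sketch of checking cross-lingual alignment between the English pivot
# encoder and an adapter-augmented target encoder. How the CLA adapter is loaded
# into the target encoder is defined by the evaluation scripts in this repository.
from sentence_transformers import SentenceTransformer, util

def mean_translation_similarity(pivot_encoder, target_encoder, eng_sents, tgt_sents):
    # Average cosine similarity of parallel sentence pairs encoded by the pivot
    # (English) encoder and the target-language encoder (with CLA adapter).
    eng_emb = pivot_encoder.encode(eng_sents, convert_to_tensor=True)
    tgt_emb = target_encoder.encode(tgt_sents, convert_to_tensor=True)
    return util.cos_sim(eng_emb, tgt_emb).diagonal().mean().item()

# Hypothetical usage:
# pivot = SentenceTransformer("model/mono_encoder/labse_eng")
# target = ...  # German encoder with the CLA adapter activated
# print(mean_translation_similarity(pivot, target, ["A dog runs."], ["Ein Hund rennt."]))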
The script eval_sts_str.py is used for the evaluation on the STS and STR datasets. See this README and the script eval_data_utils.py on how to download and use the evaluation data.
The script eval_belebele.py is for the evaluation on Belebele.
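For reference, STS/STR-style evaluation boils down to comparing predicted cosine similarities against gold scores with Spearman correlation, roughly as in the following sketch. The sentences and gold scores below are made up; the actual data loading and scoring live in eval_data_utils.py and eval_sts_str.py.
# Illustrative STS-style scoring only; sentences and gold scores are made up.
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("model/mono_encoder/labse_deu")
sents1 = ["Ein Mann spielt Gitarre.", "Eine Frau schneidet Gemüse.", "Ein Hund rennt im Park."]
sents2 = ["Jemand spielt ein Instrument.", "Ein Kind schläft.", "Ein Hund läuft draußen."]
gold = [4.0, 0.5, 4.5]

emb1 = encoder.encode(sents1, convert_to_tensor=True)
emb2 = encoder.encode(sents2, convert_to_tensor=True)
pred = util.cos_sim(emb1, emb2).diagonal().tolist()
corr, _ = spearmanr(pred, gold)
print(corr)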
Please use the following citation:
@article{huang2024modularsentenceencodersseparating,
title={Modular Sentence Encoders: Separating Language Specialization from Cross-Lingual Alignment},
author={Yongxin Huang and Kexin Wang and Goran Glavaš and Iryna Gurevych},
year={2024},
url={https://arxiv.org/abs/2407.14878},
journal={ArXiv preprint},
volume={abs/2407.14878},
}
This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.