Code for paper "A Balanced Data Approach for Evaluating Cross-Lingual Transfer: Mapping the Linguistic Blood Bank"

Abstract: We show that the choice of pretraining languages affects downstream cross-lingual transfer for BERT-based models. We inspect zero-shot performance in balanced data conditions to mitigate data size confounds, classifying pretraining languages that improve downstream performance as donors, and languages that are improved in zero-shot performance as recipients. We develop a method of quadratic time complexity in the number of languages to estimate these relations, instead of an exponential exhaustive computation of all possible combinations. We find that our method is effective on a diverse set of languages spanning different linguistic features and two downstream tasks. Our findings can inform developers of large-scale multilingual language models in choosing better pretraining configurations.


language-graph

Supporting repo for "A Balanced Data Approach for Evaluating Cross-Lingual Transfer: Mapping the Linguistic Blood Bank",
Dan Malkin, Tomasz Limisiewicz, Gabriel Stanovsky,
NAACL 2022.

Web exploration


To view our results, please visit this URL.

run evaluations and gather results

To deploy our results locally, skip to 'run streamlit'.

To deploy your own results, run evaluations on all of your desired models as specified in the Evaluate MRR section. Then gather all of the results into a dataframe and save it in a location of your choice. Note that the dynamic visualization is intended for bidirectional relations, so make sure to evaluate all pairs in your language set. Currently, evaluation is only supported for languages among the 22 used in our experiments. To see your results, pass the gathered dataframe explicitly using '--df_path' when calling 'launch_interface.py', as explained next.
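As a minimal sketch, gathering per-pair results into a dataframe could look like the following, assuming (hypothetically) one JSON result file per evaluated (source, target) pair with an 'mrr' field; adapt the file layout and column names to whatever your evaluation runs actually produce:

    import glob
    import json
    import os

    import pandas as pd

    rows = []
    # Hypothetical layout: one result file per evaluated pair, e.g. results/ru_ar.json
    # containing {"mrr": ...}.
    for path in glob.glob("results/*.json"):
        source, target = os.path.basename(path).replace(".json", "").split("_")
        with open(path) as f:
            result = json.load(f)
        rows.append({"source": source, "target": target, "mrr": result["mrr"]})

    df = pd.DataFrame(rows)
    df.to_csv("gathered_results.csv", index=False)  # pass this path via --df_path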

run streamlit

Install the requirements found in the language-graph/visualization_tool directory using:

pip install -r requirements.txt

Then, in your terminal, change to the project root directory and run:

streamlit run visualization_tool/launch_interface.py --server.port=PORT 

Finally, visit the generated URL printed in your console.

Installation and Requirements

Tested on Python 3.7.

From the root directory, run:

pip install -r requirements.txt

Getting the data

Download

Download the processed Wikipedia data at https://drive.google.com/file/d/1q5eOxc-cNT1YXV2eVG8jqZBLsPEQ2_Ld/view?usp=sharing and unpack it to a directory of your choice.

Information measures

To compute the information approximation we use in our work, first save all tokens of the desired corpus into a file TOKENS_FILE. This is done by simply running the tokenizer on each line of the data and writing each token to the file on its own line. Then run the following from the project's root directory:

python data/info_analysis.py -t TOKENS_FILE

This prints the total number of tokens, followed by the number of unique tokens and the ratio between them. An example of such a processed tokens file is:

he
he
llo
llo
world
world

Running the script with this file will output:

python data/info_analysis.py -t data/example_files/example_tokens.txt
...
INFO:root:total tokens:6, unique tokens:3, ratio:0.5
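For reference, a tokens file like the one above could be produced with a short script along these lines, assuming a trained WordPiece vocabulary (all paths are placeholders):

    from tokenizers import BertWordPieceTokenizer

    # Placeholder paths: a trained WordPiece vocab and a raw-text corpus, one sentence per line.
    tokenizer = BertWordPieceTokenizer("TOKENIZER_PATH/vocab.txt")

    with open("DATA_PATH/en/train.txt") as corpus, open("TOKENS_FILE", "w") as out:
        for line in corpus:
            # add_special_tokens=False keeps [CLS]/[SEP] out of the counts
            for token in tokenizer.encode(line.strip(), add_special_tokens=False).tokens:
                out.write(token + "\n")  # one token per line, as info_analysis.py expects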

Training the models

Before you start: training configurations

In order to run training, you must create a training configuration that includes the paths to the training data, the model parameters, and the training procedure parameters. Each run is specified via a model config and a training procedure config. To create a config, run the following:

python src/model/train_utils.py ARGS

For example, to create a model config and training config for a monolingual English model with 6 hidden layers and 8 attention heads, run:

python src/model/train_utils.py -o ~/LMs/en -c model pretrain --pt_train_data_list DATA_PATH/en/train.txt --pt_eval_data_list DATA_PATH/en/test.txt --pt_num_epochs 3 --pt_batch_size 8 --pt_name pt_config --tokenizer_path TOKENIZER_PATH --vocab_size 100000 --hidden_size 512 --max_len 128 --num_attention 8 --num_hidden 6 --model_name en_model

This will output pt_config.json and en_model.json into ~/LMs/en, where pt_config.json contains:

{"train_data_paths_list": ["DATA_PATH/en/train.txt"], 
"eval_data_paths_list": ["DATA_PATH/en/test.txt"], 
"num_epochs": 3, 
"batch_size": 8}

and en_model.json contains:

{"tokenizer_path": "TOKENIZER_PATH", "hidden_layer_size": 512, "vocab_size": 100000, "max_sent_len": 128, "num_hidden": 6, "num_attention": 8}

The ARGS are specified as follows. Specify the config type(s) to produce (pretrain config, finetune config, model parameters config):

    '-c', '--config_types', nargs='+', help="a list from {'pretrain', 'finetune', 'model'}"

Paths to training data lists (finetune or pretrain) as well as the output path under which all configs will be generated:

    '-o','--out_path', type=str, 
    '--pt_train_data_list', nargs='+'
    '--pt_eval_data_list', nargs='+'
    '--ft_train_data_list', nargs='+'
    '--ft_eval_data_list', nargs='+'

Arguments to define the training pipeline and model parameters:

    '--pt_num_epochs',type=int
    '--pt_batch_size',type=int
    '--ft_num_epochs',type=int
    '--ft_batch_size',type=int
    '--tokenizer_path',type=str
    '--vocab_size',type=int
    '--hidden_size',type=int
    '--max_len',type=int
    '--num_hidden',type=int
    '--num_attention',type=int

Arguments to define the names of the generated config files (the script will output model_name.json, pt_name.json, and ft_name.json):

    '--model_name',type=str
    '--pt_name',type=str
    '--ft_name',type=str

Tokenizer

Train a BertWordPieceTokenizer from the tokenizers library (by Hugging Face) on your desired data. All desired languages should be included here.
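For example, a multilingual WordPiece tokenizer covering all pretraining languages could be trained roughly as follows (file paths are placeholders, and the vocab size should match --vocab_size in your model config; the paper's exact settings may differ):

    from tokenizers import BertWordPieceTokenizer

    # Placeholder corpora: one raw-text file per pretraining language.
    files = ["DATA_PATH/en/train.txt", "DATA_PATH/ru/train.txt", "DATA_PATH/ar/train.txt"]

    tokenizer = BertWordPieceTokenizer(lowercase=False)
    tokenizer.train(
        files=files,
        vocab_size=100000,
        special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
    )
    tokenizer.save_model("TOKENIZER_PATH")  # writes vocab.txt into this directory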

Base model

To train a base model, run the following script from the project root directory:

bash src/scripts/train_pretrained.sh PRETRAIN_SOURCE_NAME MODEL_CONFIG_PATH PRETRAIN_CONFIG_PATH OUTPUT_DIR SEED

This will output a model specified by MODEL_CONFIG_PATH and PRETRAIN_CONFIG_PATH into OUTPUT_DIR/PRETRAIN_SOURCE_NAME/PRETRAIN_SOURCE_NAME_SEED.

Finetuned model on top of an existing one

To train a finetuned MLM model on top of the previously trained model in OUTPUT_DIR/PRETRAIN_SOURCE_NAME/PRETRAIN_SOURCE_NAME_SEED, run the following from the project root directory:

bash src/scripts/train_finetuned.sh PRETRAIN_SOURCE_NAME FINETUNED_TARGET_NAME PRETRAIN_CONFIG_PATH FINETUNE_CONFIG_PATH MODEL_CONFIG_PATH OUTPUT_DIR SEED

Make sure PRETRAIN_CONFIG_PATH, OUTPUT_DIR, and MODEL_CONFIG_PATH are identical to those used for the base model. This will take the model at OUTPUT_DIR/PRETRAIN_SOURCE_NAME/PRETRAIN_SOURCE_NAME_SEED, finetune it on the data specified by FINETUNE_CONFIG_PATH, and output the result into OUTPUT_DIR/PRETRAIN_SOURCE_NAME/FINETUNED_TARGET_NAME_SEED.

Example: train an Arabic model on top of a Russian monolingual model:

bash src/scripts/train_pretrained.sh ru MODEL_CONFIG_PATH PRETRAIN_CONFIG_PATH OUTPUT_DIR 10    # outputs to OUTPUT_DIR/ru/ru_10/
bash src/scripts/train_finetuned.sh ru ar PRETRAIN_CONFIG_PATH FINETUNE_CONFIG_PATH MODEL_CONFIG_PATH OUTPUT_DIR 10    # outputs to OUTPUT_DIR/ru/ar_10/

Evaluate MRR

Given a model saved in MODEL_DIR_PATH and defined by the aforementioned config file at MODEL_CONFIG_PATH, to evaluate its performance on given data EVAL_DATA and save the results in OUTPUT_DIR_PATH, run the following script from the root directory:

bash src/scripts/eval_model_mrr.sh MODEL_DIR_PATH MODEL_CONFIG_PATH EVAL_DATA OUTPUT_DIR_PATH
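For intuition, mean reciprocal rank over masked-token predictions can be computed along the lines of the sketch below; this only illustrates the metric and is not the script's exact implementation (paths are placeholders):

    import torch
    from transformers import BertForMaskedLM, BertTokenizerFast

    # Placeholder paths: the trained model directory and the tokenizer vocab directory.
    tokenizer = BertTokenizerFast.from_pretrained("TOKENIZER_PATH")
    model = BertForMaskedLM.from_pretrained("MODEL_DIR_PATH").eval()

    def sentence_mrr(sentence: str) -> float:
        """Mask each token in turn and average the reciprocal rank of the true token."""
        ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
        reciprocal_ranks = []
        for i in range(1, len(ids) - 1):  # skip [CLS] and [SEP]
            masked = ids.clone()
            masked[i] = tokenizer.mask_token_id
            with torch.no_grad():
                logits = model(masked.unsqueeze(0)).logits[0, i]
            rank = (logits > logits[ids[i]]).sum().item() + 1
            reciprocal_ranks.append(1.0 / rank)
        return sum(reciprocal_ranks) / max(len(reciprocal_ranks), 1)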

Downstream training

To train the downstream task on top of a given language model, follow the instructions at https://github.com/google-research/xtreme.

To truncate the data as we did, limit the files output by https://github.com/google-research/xtreme/blob/master/utils_preprocess.py to the desired number of examples (2000 for POS, 5000 for NER). This can be done manually by truncating the files, or by changing the preprocess function to stop after X examples; a sketch of the manual option is shown below.
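A minimal sketch of the manual option, assuming CoNLL-style files where a blank line separates consecutive examples (the file names below are placeholders):

    def truncate_conll(in_path: str, out_path: str, max_examples: int) -> None:
        """Copy the first max_examples blank-line-separated examples from in_path."""
        kept = 0
        with open(in_path) as fin, open(out_path, "w") as fout:
            for line in fin:
                if kept >= max_examples:
                    break
                fout.write(line)
                if line.strip() == "":  # a blank line marks the end of an example
                    kept += 1

    # e.g. 2000 examples for POS, 5000 for NER
    truncate_conll("udpos/train-en.tsv", "udpos/train-en.truncated.tsv", 2000)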

Make sure to additionally preprocess the non-English training files as well (the default behaviour processes only the English training files) using the same code.
