Abstract: Producing sentence embeddings in an unsupervised way is valuable for natural language matching and retrieval problems in practice. In this work, we conduct a thorough examination of unsupervised sentence embeddings derived from pretrained models. We study four pretrained models and conduct extensive experiments on seven datasets regarding sentence semantics. We have three main findings. First, averaging all token vectors works better than using only the [CLS] vector. Second, combining both top and bottom layers is better than using only top layers. Lastly, an easy whitening-based vector normalization strategy with fewer than 10 lines of code consistently boosts performance.
WhiteningBERT
Source code and data for the paper WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach.
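The whitening-based normalization mentioned in the abstract can be sketched in under 10 lines of NumPy. The snippet below is an illustrative version of the general recipe (center the sentence embeddings, then decorrelate them using the SVD of their covariance matrix); the function and variable names are ours and not necessarily identical to the evaluation scripts.

```python
import numpy as np

def whitening(embeddings):
    # embeddings: (num_sentences, dim) array of sentence vectors
    mu = embeddings.mean(axis=0, keepdims=True)
    cov = np.cov((embeddings - mu).T)          # (dim, dim) covariance matrix
    u, s, _ = np.linalg.svd(cov)               # cov = u @ diag(s) @ u.T
    w = u @ np.diag(1.0 / np.sqrt(s))          # whitening transformation
    return (embeddings - mu) @ w               # whitened sentence embeddings
```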
Preparation
git clone https://github.com/Jun-jie-Huang/WhiteningBERT.git
pip install -r requirements.txt
cd examples/evaluation
Usage
Datasets
We use seven STS datasets, including STSBenchmark, SICK-Relatedness, STS12, STS13, STS14, STS15, STS16.
The processed data can be found in ./examples/datasets/.
Run
- To run a quick demo:
python evaluation_stsbenchmark.py \
--pooling aver \
--layer_num 1,12 \
--whitening \
--encoder_name bert-base-cased
Specify --pooling with cls or aver to choose between using the [CLS] token and averaging all tokens. Also specify --layer_num to choose which layers to combine, separated by commas (a rough illustration of both options is given in the sketch after this list).
- To enumerate all possible combinations of two layers and automatically evaluate each combination in turn:
python evaluation_stsbenchmark_layer2.py \
--pooling aver \
--whitening \
--encoder_name bert-base-cased
- To enumerate all possible combinations of N layers:
python evaluation_stsbenchmark_layerN.py \
--pooling aver \
--whitening \
--encoder_name bert-base-cased \
--combination_num 4
- To save the embeddings of the sentences:
python evaluation_stsbenchmark_save_embed.py \
--pooling aver \
--layer_num 1,12 \
--whitening \
--encoder_name bert-base-cased \
--summary_dir ./save_embeddings
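For reference, the snippet below is a rough illustration (not the exact evaluation code) of what --pooling and --layer_num do with a Hugging Face transformers model: it averages the hidden states of the selected layers and then either takes the [CLS] vector or averages all token vectors. The helper name embed and its defaults are ours.

```python
import itertools
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased", output_hidden_states=True)

def embed(sentence, layers=(1, 12), pooling="aver"):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        # hidden_states is a tuple: embedding-layer output plus one tensor per layer
        hidden_states = model(**inputs).hidden_states
    # Combine the selected layers by element-wise averaging
    combined = torch.stack([hidden_states[l] for l in layers]).mean(dim=0)
    if pooling == "cls":
        return combined[0, 0]          # vector of the [CLS] token
    return combined[0].mean(dim=0)     # average over all token vectors

# Enumerating all two-layer combinations, as the layer2 script does:
layer_pairs = itertools.combinations(range(1, 13), 2)
```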
A list of PLMs you can select:
- bert-base-uncased, bert-large-uncased
- roberta-base, roberta-large
- bert-base-multilingual-uncased
- sentence-transformers/LaBSE
- albert-base-v1, albert-large-v1
- microsoft/layoutlm-base-uncased, microsoft/layoutlm-large-uncased
- SpanBERT/spanbert-base-cased, SpanBERT/spanbert-large-cased
- microsoft/deberta-base, microsoft/deberta-large
- google/electra-base-discriminator
- google/mobilebert-uncased
- microsoft/DialogRPT-human-vs-rand
- distilbert-base-uncased
- ......
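For example, to run the quick demo with a different PLM from this list, pass its name via --encoder_name:
python evaluation_stsbenchmark.py \
--pooling aver \
--layer_num 1,12 \
--whitening \
--encoder_name roberta-base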
Acknowledgements
The code is adapted from the repositories of the EMNLP 2019 paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks and the EMNLP 2020 paper An Unsupervised Sentence Embedding Method by Mutual Information Maximization.