This dataset provides masked sentences together with the multi-token phrases that were masked out of them.
 
We offer 3 datasets: a general-purpose dataset extracted from the Wikipedia and Books corpora, and 2 additional datasets extracted from PubMed abstracts.
Please be aware that the PubMed data does not reflect the most current or accurate data available from NLM (it is not being updated).
For these datasets, the columns provided for each datapoint are as follows:
text- the original sentence
span- the span (phrase) which is masked out
span_lower- the lowercase version of span
range- the range in the text string which will be masked out (this is important because span might appear more than once in text)
freq- the corpus frequency of span_lower
masked_text- the masked version of text, in which span is replaced with [MASK]
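
To illustrate how the columns relate to each other, here is a minimal Python sketch. It assumes that range is a (start, end) pair of character offsets into text with an exclusive end; the exact encoding of range is not specified here, so check it against the data before relying on this. The record and the mask_span helper are hypothetical and were made up for illustration; they are not taken from the dataset or from the authors' code.

```python
def mask_span(text, start, end, mask_token="[MASK]"):
    """Replace the characters text[start:end] with mask_token."""
    return text[:start] + mask_token + text[end:]


# Hypothetical record, invented for illustration only.
# The span "New York" occurs twice in the text, so the range is
# what tells us which occurrence is masked (here, the second one).
record = {
    "text": "New York is north of New York Harbor.",
    "span": "New York",
    "range": (21, 29),  # ASSUMPTION: character offsets, end exclusive
}

start, end = record["range"]

# The range should select exactly the span.
assert record["text"][start:end] == record["span"]

span_lower = record["span"].lower()
masked_text = mask_span(record["text"], start, end)

print(span_lower)
print(masked_text)
```

A plain string replacement of span would be ambiguous in this example, which is why the sketch slices by range instead.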


Additionally, we provide a small (3K) dataset with human annotations.

For a full description of the datasets and the annotation procedure, please see:
Oren Kalinsky, Guy Kushilevitz, Alex Libov, Yoav Goldberg 
Simple and Effective Multi-Token Completion from Masked Language Models, Findings of EACL 2023.