There are actually quite a few pretraining tasks for NLU:
The figure above is a taxonomy I defined myself (downloadable here). The objectives whose gains are both general and clear are still the generative token-level ones: the task is hard and the granularity is fine. On top of that, you can add small tweaks to boost performance on a particular class of downstream tasks (for IR, for example, you really need to train good sentence-level representations).
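To make "generative token-level pretraining" concrete, here is a minimal sketch of standard BERT-style dynamic masking (15% of tokens selected as targets, of which 80% become [MASK], 10% a random token, 10% are kept). The function signature and the exact rates are just the common defaults, not something taken from the figure above.

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, special_ids, mlm_prob=0.15):
    """BERT-style MLM masking: select ~15% of positions as prediction targets,
    then 80% of those become [MASK], 10% a random token, 10% stay unchanged."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # choose prediction targets, never touching special tokens ([CLS]/[SEP]/[PAD])
    prob = torch.full(labels.shape, mlm_prob)
    for sid in special_ids:
        prob.masked_fill_(input_ids == sid, 0.0)
    masked = torch.bernoulli(prob).bool()
    labels[~masked] = -100  # positions ignored by the cross-entropy loss

    # 80% of targets -> [MASK]
    to_mask = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked
    input_ids[to_mask] = mask_token_id

    # half of the rest (10% overall) -> random token; the remaining 10% are kept
    to_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked & ~to_mask
    input_ids[to_random] = torch.randint(vocab_size, labels.shape)[to_random]

    return input_ids, labels
```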
This earlier article surveyed the pretrained models built by the major Chinese companies:
A few common optimizations in pretraining:
Beyond NLU, there are also plenty of variations on NLG pretraining tasks:
Some of my earlier pretraining paper walkthroughs:
---
The pretraining mind map can be downloaded here:
MLM and NSP correspond to two granularities: token-level and sentence-level tasks. There are actually quite a few tasks that can stand in for NSP; I'm pasting the task descriptions below verbatim from the paper (a small code sketch of the QT-style objective follows the list). If you're interested, see On Losses for Modern Language Models:
Token-level tasks:
1. Term Frequency prediction (TF): Regression predicting a token’s frequency in the rest of the document. The frequency is re-scaled between 0 and 10 per document.
2. Term Frequency-Inverse Document Frequency prediction (TF-IDF): Regression predicting a token’s tf-idf that has been re-scaled between 0 and 10 per document.
3. Span Boundary Objective (SBO): Predict the masked token given the embeddings of the adjacent tokens.
4. Trigram-Shuffling (TGS): 6-way classification predicting the original order of shuffled tri-grams.
5. Token Corruption Prediction (TCP): Binary classification of whether a token has been corrupted (inserted, replaced, permuted) or not.
6. Capitalization Prediction (Cap.): Binary, whether a token is capitalized or not.
7. Token Length Prediction (TLP): Regression to predict the length of the WordPiece token.
Sentence-level tasks:
8. Next Sentence Prediction (NSP): Binary, whether the second sentence follows the first or comes from a separate document.
9. Adjacent Sentence Prediction (ASP): 3-way classification of whether the second sentence follows the first, precedes the first, or the two come from separate documents.
10. Sentence Ordering (SO): Binary, predicting if the two sentences are in or out of order.
11. Sentence Distance Prediction (SDP): 3-way classification of whether the second sentence follows the first, the two sentences are noncontiguous but from the same document, or they come from separate documents.
12. Sentence Corruption Prediction (SCP): Binary classification of whether the tokens in a sentence have been corrupted (inserted, replaced, permuted) or not.
13. Quick Thoughts variant (QT): Split each batch into two, where the second half contains the subsequent sentences of the first half (e.g. with batch size 32, sentence 17 follows sentence 1, sentence 18 follows sentence 2,...). We use an energy-based model to predict the correct continuation for each sentence in the first half where the energy between two sentences is defined by the negative cosine similarity of their [CLS] embeddings. We use one model to encode both halves concurrently. See Figure 1 of the paper.
14. FastSent variant (FS): Split each batch into two, where the second half contains the subsequent sentences of the first half (same as QT above). The loss is defined as cross-entropy between 1.0 and the cosine similarity of a sentence [CLS] embedding and the other sentence token embeddings ([CLS] embedding from the first half with token embeddings from the second half and [CLS] embeddings from the second half with token embeddings from the first half). We use one model to encode both halves concurrently.
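To make the QT / FS descriptions above a bit more concrete, below is a minimal sketch of the QT-style objective: the energy between two sentences is the negative cosine similarity of their [CLS] embeddings, and each sentence in the first half of the batch is scored against every candidate continuation in the second half. Softmax cross-entropy over the similarity matrix is one common way to train such an energy-based matcher; the function names and batch-split interface here are placeholders, not the paper's code.

```python
import torch
import torch.nn.functional as F

def quick_thoughts_loss(cls_first_half, cls_second_half):
    """QT-style loss: cls_first_half[i] and cls_second_half[i] are the [CLS]
    embeddings of a sentence and its true continuation (same batch split as
    described above); the other rows of cls_second_half act as negatives.
    Energy between two sentences = -cosine similarity, so minimizing
    cross-entropy over the similarity matrix drives the true pair to the
    lowest energy."""
    a = F.normalize(cls_first_half, dim=-1)   # (B/2, hidden)
    b = F.normalize(cls_second_half, dim=-1)  # (B/2, hidden)
    sims = a @ b.t()                          # pairwise cosine similarities
    targets = torch.arange(sims.size(0), device=sims.device)
    return F.cross_entropy(sims, targets)     # true continuations lie on the diagonal

# usage: run one encoder over the whole batch, then split the [CLS] embeddings
# loss = quick_thoughts_loss(cls_emb[:batch_size // 2], cls_emb[batch_size // 2:])
```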
P.S. Regarding the claim in the question that NSP doesn't help much: the usual explanation is that the task is too easy, so the model ends up relying on shallow lexical features. However, there are also papers showing experimentally that in certain settings (e.g., small-scale pretrained models), BERT-style pretraining (MLM+NSP) actually beats RoBERTa-style (MLM only):
So this is still an open question.
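As a reference point for what "BERT style (MLM+NSP)" means on the data side, here is a minimal sketch of how NSP pairs are usually built (50% true next sentence, 50% a sentence from another document); the pair would then go through MLM masking like the sketch earlier. The corpus format (a list of documents, each a list of sentences) is an assumption for illustration.

```python
import random

def make_nsp_pair(docs, doc_idx, sent_idx):
    """Build one NSP example: (sentence_a, sentence_b, is_next).
    docs: list of documents, each a list of sentences (assumed format)."""
    sent_a = docs[doc_idx][sent_idx]
    if random.random() < 0.5 and sent_idx + 1 < len(docs[doc_idx]):
        # positive: the actual next sentence from the same document
        return sent_a, docs[doc_idx][sent_idx + 1], 1
    # negative: a random sentence from a different document
    other = random.choice([i for i in range(len(docs)) if i != doc_idx])
    return sent_a, random.choice(docs[other]), 0
```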