Beyond MLM, what other effective training tasks are there for pretrained models?

Suggestions from user rumor-lee:

There are quite a few pretraining tasks for NLU:

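The figure above is a taxonomy I defined myself; it can be downloaded here. The tasks with the most general and noticeable gains are still generative token-level pretraining tasks, which are difficult and fine-grained. On top of these, small improvements can be added to boost the model on a particular class of downstream tasks (for IR, for example, you need to train sentence-level representations well).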

This earlier article surveyed the pretrained models of the major Chinese tech companies:

There are several common optimizations in pretraining:

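Replace plain MLM with WWM (whole word masking) to bring in more knowledge of Chinese words and phrases; both Motian and BERTSG do this.
Multi-task training. Motian, for example, adds a search click/impression task; BERTSG draws on Cross-Thought and contrastive learning to learn more sentence-level features, and also adds article title generation and paragraph order prediction tasks; Pangu's encoder is based on StructBERT, which adds WSO (shuffled word order) and an improved NSP task.
Staged pretraining. Following BERT, Motian pretrains in two stages, first at sequence length 128 and then at 512. For the encoder-decoder architecture, Pangu first trains the StructBERT-based encoder, then attaches the decoder and trains the generative model, keeping MLM for the first 90% of training time and dropping it for the last 10%.
Motian's blog also describes a way to remove the pretraining/fine-tuning mismatch of MLM: instead of masking, tokens are replaced with random words or synonyms, which also brought some improvement.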
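To make the first point concrete, here is a minimal sketch of how WWM differs from per-token masking: the masking decision is made once per word, and every piece of a chosen word is masked together. The function name `whole_word_mask`, the 0.5 masking probability, and the example segmentation are assumptions for illustration only; real implementations (including Motian and BERTSG) may select and corrupt tokens differently, for example with BERT's 80/10/10 mask/random/keep split.

```python
import random

MASK = "[MASK]"

def whole_word_mask(words, mask_prob=0.15, seed=None):
    """Mask whole words rather than individual tokens.

    words: list of words, each given as a list of its tokens
           (characters for Chinese, WordPieces for English).
    Returns (tokens, labels): labels[i] is the original token at
    masked positions and None everywhere else.
    """
    rng = random.Random(seed)
    tokens, labels = [], []
    for pieces in words:
        # One decision per word, shared by all of its pieces.
        masked = rng.random() < mask_prob
        for piece in pieces:
            tokens.append(MASK if masked else piece)
            labels.append(piece if masked else None)
    return tokens, labels

# Hypothetical word segmentation of a short Chinese sentence.
words = [["预", "训", "练"], ["模", "型"], ["很"], ["有", "用"]]
tokens, labels = whole_word_mask(words, mask_prob=0.5, seed=0)
print(tokens)
print(labels)
```

With per-token MLM, a model can often recover "训" from the visible "预" and "练"; masking the whole word removes that shortcut, which is the usual motivation given for WWM.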

Beyond NLU tasks, there is also plenty of variety on the NLG side:

Here are some of my earlier write-ups on pretraining papers:


The pretraining mind map can be downloaded here:

Suggestions from user tylin98:

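MLM and NSP are tasks at two different granularities, token level and sentence level respectively. There are actually quite a few tasks that can replace NSP. I am pasting some task descriptions here; if you are interested, see the paper On Losses for Modern Language Models: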

Token-level tasks:

1. Term Frequency prediction (TF): Regression predicting a token’s frequency in the rest of the document. The frequency is re-scaled between 0 and 10 per document.

2. Term Frequency-Inverse Document Frequency prediction (TF-IDF): Regression predicting a token’s tf-idf that has been re-scaled between 0 and 10 per document.

3. Sentence Boundary Objective (SBO): Predict the masked token given the embeddings of the adjacent tokens.

4. Trigram-Shuffling (TGS): 6-way classification predicting the original order of shuffled tri-grams.

5. Token Corruption Prediction (TCP): Binary classification of whether a token has been corrupted (inserted, replaced, permuted) or not.

6. Capitalization Prediction (Cap.): Binary, whether a token is capitalized or not.

7. Token Length Prediction (TLP): Regression to predict the length of the WordPiece token.
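As an illustration of one of the tasks above, TGS can be framed as follows: three consecutive items have exactly 3! = 6 possible orderings, so the index of the applied permutation serves as a 6-way class label. This is only a sketch of how such training examples could be built; the label encoding and the helper names `shuffle_trigram` and `restore_trigram` are assumptions, not taken from the paper.

```python
import itertools
import random

# All 6 orderings of three positions; the list index is the class label.
# Label 0 is the identity permutation (0, 1, 2), i.e. "not shuffled".
PERMUTATIONS = list(itertools.permutations(range(3)))

def shuffle_trigram(trigram, rng):
    """Apply a random permutation to a trigram; return (shuffled, label)."""
    label = rng.randrange(len(PERMUTATIONS))
    perm = PERMUTATIONS[label]
    shuffled = [trigram[i] for i in perm]
    return shuffled, label

def restore_trigram(shuffled, label):
    """Invert the permutation identified by label."""
    perm = PERMUTATIONS[label]
    original = [None] * 3
    for pos, src in enumerate(perm):
        original[src] = shuffled[pos]
    return original

rng = random.Random(0)
trigram = ["the", "quick", "fox"]
shuffled, label = shuffle_trigram(trigram, rng)
print(shuffled, label)
print(restore_trigram(shuffled, label))
```

During pretraining the model would see only `shuffled` and be trained to predict `label`; `restore_trigram` is included just to show that the label fully determines the original order.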

Sentence-level tasks:

8. Next Sentence Prediction (NSP): Binary, whether the second sentence follows the first or comes from a separate document.

9. Adjacent Sentence Prediction (ASP): 3-way classification whether the second sentence follows the first, precedes the first, or they come from separate documents.

10. Sentence Ordering (SO): Binary, predicting if the two sentences are in or out of order.

11. Sentence Distance Prediction (SDP): 3-way classification of whether the second sentence follows the first, the two sentences are noncontiguous from the same document, or come from separate documents.

12. Sentence Corruption Prediction (SCP): Binary classification of whether any tokens in a sentence have been corrupted (inserted, replaced, permuted) or not.

13. Quick Thoughts variant (QT): Split each batch into two, where the second half contains the subsequent sentences of the first half (e.g. with batch size 32, sentence 17 follows sentence 1, sentence 18 follows sentence 2,...). We use an energy-based model to predict the correct continuation for each sentence in the first half where the energy between two sentences is defined by the negative cosine similarity of their [CLS] embeddings. We use one model to encode both halves concurrently. See Figure 1.

14. FastSent variant (FS): Split each batch into two, where the second half contains the subsequent sentences of the first half (same as QT above). The loss is defined as cross-entropy between 1.0 and the cosine similarity of a sentence [CLS] embedding and the other sentence token embeddings ([CLS] embedding from the first half with token embeddings from the second half and [CLS] embeddings from second half with token embeddings from the first half). We use one model to encode both halves concurrently.
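The QT description above can be read as an in-batch contrastive objective. The sketch below is one possible interpretation, not the paper's implementation: it assumes the energies are turned into probabilities with a softmax over negative energy (so the logits are plain cosine similarities, with no temperature), and that the target for sentence i in the first half is sentence i in the second half. The function names and the toy embeddings are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def quick_thoughts_loss(cls_embeddings):
    """Mean cross-entropy of picking the true continuation within a batch.

    cls_embeddings: list of [CLS] vectors for one batch, ordered so that
    item i + half is the sentence that follows item i.
    """
    n = len(cls_embeddings)
    if n == 0 or n % 2 != 0:
        raise ValueError("batch size must be a positive even number")
    half = n // 2
    first, second = cls_embeddings[:half], cls_embeddings[half:]
    total = 0.0
    for i, anchor in enumerate(first):
        # Energy is -cosine, so the logit (negative energy) is +cosine.
        logits = [cosine(anchor, candidate) for candidate in second]
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]  # negative log-probability of the true pair
    return total / half

# Toy batch of 4: sentences 3 and 4 follow sentences 1 and 2.
aligned = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
swapped = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
print(quick_thoughts_loss(aligned))
print(quick_thoughts_loss(swapped))
```

The loss is lower when each first-half embedding is closest to its own continuation, which is the behavior the objective is meant to reward. The other sentences in the second half act as negatives, so no separate negative sampling is needed.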

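P.S. Regarding the claim in the question that NSP contributes little: the usual explanation has been that the task is too simple in form, so the model latches onto shallow lexical features. However, some papers show experimentally that in certain settings (small-scale pretrained models, for example), BERT-style pretraining (MLM+NSP) outperforms RoBERTa-style pretraining (MLM only):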

So this remains an open question.
