加和不加从结果上是等价的,BEiT在实现中去掉是为了fp16训练过程中数值稳定。
Both (i.e., with or without key.bias) are equivalent in terms of calculation results. They are canceled by the softmax function.
Softmax(q,k) = exp(q.weight * key.weight + q.bias * key.weight + q.weight * key.bias + q.bias * key.bias) / Z
Because the query is the same over all the keys, so the term (q.weight * key.bias + q.bias * key.bias) remains the same across all the keys, which in turn can be cancelled without affecting the softmax results.
exp(a)/(exp(a)+ exp(b)) == exp(a+C)/(exp(a+C)+ exp(b+C))
本站所有内容均为互联网搜索引擎提供的公开搜索信息,本站不存储任何数据与内容,任何内容与数据均与本站无关,如有需要请联系相关搜索引擎包括但不限于百度,google,bing,sogou 等
© 2025 tinynews.org All Rights Reserved. 百科问答小站 版权所有