[Q&A] Does the BERT network architecture use multi-head self-attention?

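After finishing RNNs I moved on to BERT, which performs better. I loaded a BERT model from the transformers pretrained library I found online, but when I print it, the structure looks something like this: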

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(21128, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
          (intermediate): BertIntermediate(
            (dense): Linear(in_features=768, out_features=3072, bias=True)
            (intermediate_act_fn): GELUActivation()
          )
          (output): BertOutput(
            (dense): Linear(in_features=3072, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
        (1): BertLayer(...)
        (2): BertLayer(...)
        (3): BertLayer(...)
        (4): BertLayer(...)
        (5): BertLayer(...)
        (6): BertLayer(...)
        (7): BertLayer(...)
        (8): BertLayer(...)
        (9): BertLayer(...)
        (10): BertLayer(...)
        (11): BertLayer(...)
      )
    )
    (pooler): BertPooler(
      (dense): Linear(in_features=768, out_features=768, bias=True)
      (activation): Tanh()
    )
  )
  (dropout): Dropout(p=0.1, inplace=False)
  (classifier): Linear(in_features=768, out_features=14, bias=True)
)

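You can see that in this implementation the embeddings are followed by twelve BertLayer modules. Inside each BertLayer the input has size 768, and three projections (query, key, value) of size 768 each are produced from it. That looks like everything needed for one complete self-attention, so where is the multi-head part?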

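Or have I misunderstood, and the printout actually means that after the embeddings there are 12 parallel self-attention modules computed side by side, rather than one after another?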

All the material I have found online says BERT has multi-head self-attention.
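For anyone comparing notes, here is a minimal NumPy sketch of one way the heads could be hidden inside those 768-to-768 Linear layers: a single fused projection is reshaped into several smaller heads, so no separate per-head module would appear in the printout. This is not the transformers source code. The head count of 12 (so 64 dimensions per head) is an assumption based on the usual bert-base configuration, biases are omitted, and the weights, `seq_len`, and the helper names `softmax` and `split_heads` are made up for illustration.

```python
import numpy as np

hidden_size = 768                      # matches the Linear(768, 768) layers in the printout
num_heads = 12                         # assumption: usual bert-base setting, not shown in the printout
head_size = hidden_size // num_heads   # 64 dimensions per head
seq_len = 5                            # arbitrary toy sequence length

rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, hidden_size))

# One fused weight matrix per role (random stand-ins for query/key/value weights).
W_q = rng.standard_normal((hidden_size, hidden_size)) * 0.02
W_k = rng.standard_normal((hidden_size, hidden_size)) * 0.02
W_v = rng.standard_normal((hidden_size, hidden_size)) * 0.02

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def split_heads(t):
    # (seq_len, hidden_size) -> (num_heads, seq_len, head_size)
    return t.reshape(seq_len, num_heads, head_size).transpose(1, 0, 2)

q = split_heads(x @ W_q)
k = split_heads(x @ W_k)
v = split_heads(x @ W_v)

# Scaled dot-product attention, computed for all heads at once.
scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_size)   # (num_heads, seq_len, seq_len)
probs = softmax(scores)
context = probs @ v                                      # (num_heads, seq_len, head_size)

# Concatenate the heads back into one 768-wide vector per token.
merged = context.transpose(1, 0, 2).reshape(seq_len, hidden_size)

# Head 0 computed on its own, using only the first 64 columns of each weight matrix.
cols = slice(0, head_size)
q0, k0, v0 = x @ W_q[:, cols], x @ W_k[:, cols], x @ W_v[:, cols]
head0 = softmax(q0 @ k0.T / np.sqrt(head_size)) @ v0

print(merged.shape, np.allclose(merged[:, :head_size], head0))
```

If this reading is right, the 12 in the ModuleList would be the number of stacked layers, applied one after another, and would be unrelated to the number of heads inside each layer. The actual head count should be readable from the model config (`num_attention_heads` in transformers' `BertConfig`).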