new_name_who_dis_ t1_ivtav2j wrote

No, I'm not saying to discard BERT. You still use BERT as the encoder and put an MDN-style network on top as the final layer. It could even still be a self-attention layer, just trained with the MDN loss function. An MDN (mixture density network) isn't a different architecture; it's just a different loss function for your final output, one that isn't deterministic but instead allows for multiple possible outputs (a mixture distribution rather than a single point estimate).
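To make that concrete, here's a minimal sketch of the idea under some assumptions: a small MDN head sits on top of an encoder's pooled output (e.g. BERT's [CLS] vector, stubbed here with a random tensor so the snippet runs standalone), and training minimizes the mixture negative log-likelihood. The names `MDNHead` and `mdn_nll` are illustrative, not from any particular library:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MDNHead(nn.Module):
    """Maps an encoder's pooled output to a K-component diagonal-Gaussian mixture."""
    def __init__(self, hidden_dim: int, out_dim: int, n_components: int = 5):
        super().__init__()
        self.n_components = n_components
        self.out_dim = out_dim
        # One linear layer per set of mixture parameters.
        self.pi = nn.Linear(hidden_dim, n_components)                    # mixture weights
        self.mu = nn.Linear(hidden_dim, n_components * out_dim)          # component means
        self.log_sigma = nn.Linear(hidden_dim, n_components * out_dim)   # log std devs

    def forward(self, h):
        B = h.size(0)
        log_pi = F.log_softmax(self.pi(h), dim=-1)                       # (B, K)
        mu = self.mu(h).view(B, self.n_components, self.out_dim)         # (B, K, D)
        sigma = self.log_sigma(h).view(B, self.n_components, self.out_dim).exp()
        return log_pi, mu, sigma

def mdn_nll(log_pi, mu, sigma, y):
    """Negative log-likelihood of targets y under the predicted mixture (the MDN loss)."""
    y = y.unsqueeze(1)                                                   # (B, 1, D)
    # Per-component log density, summed over output dims, then mixed via logsumexp.
    comp_logprob = torch.distributions.Normal(mu, sigma).log_prob(y).sum(-1)  # (B, K)
    return -torch.logsumexp(log_pi + comp_logprob, dim=-1).mean()

if __name__ == "__main__":
    h = torch.randn(8, 768)   # stand-in for a BERT pooled output (batch of 8)
    y = torch.randn(8, 2)     # possibly multi-modal regression targets
    head = MDNHead(hidden_dim=768, out_dim=2)
    loss = mdn_nll(*head(h), y)
    loss.backward()
    print(loss.item())
```

The key point is that the encoder is untouched; only the output layer and loss change. Because the head predicts several weighted Gaussians instead of one point, it can assign probability to multiple plausible answers rather than averaging them into one.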

1