HjalmarLucius t1_iqtl4yv wrote
Attention introduces multiplicative relationships between inputs, i.e. terms like x*y, whereas ordinary feed-forward layers only combine inputs additively (affine transforms plus pointwise nonlinearities).
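A minimal sketch of what's meant here (shapes and weight names are made up): in dot-product attention, each score is an inner product of two input projections, so the inputs enter multiplicatively rather than just additively.

```python
import numpy as np

def attention(X, Wq, Wk, Wv):
    """Single-head dot-product attention over rows of X (toy version)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # score[i, j] = q_i . k_j -- an x*y-style multiplicative interaction
    # between inputs i and j, which no purely additive layer can produce.
    scores = Q @ K.T / np.sqrt(K.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over each row
    return weights @ V

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))                      # 4 tokens, dim 8
out = attention(X, *(rng.standard_normal((8, 8)) for _ in range(3)))
```

The weights are random here; the point is only where the multiplication between inputs happens, not a trained model.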
graphicteadatasci t1_iqv0m4y wrote
This is the one. A DNN may be a universal function approximator, but only if the data and number of parameters are infinite. With infinite data we could learn y as parameters, and multiplying those parameters by x would give x*y. But we don't have infinite data or infinite parameters, and even if we did, we don't have a stable method for training indefinitely. So we need other mechanisms.
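A toy illustration of the limitation being described (setup is made up): fit a purely linear/additive model to a multiplicative target z = x*y, then fit again with the product supplied as an explicit feature. The additive model leaves almost all the variance unexplained; with the multiplicative feature the residual collapses.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 2))
z = X[:, 0] * X[:, 1]                      # multiplicative target

# Additive model: linear in x and y plus a bias term.
A = np.c_[X, np.ones(len(X))]
coef, *_ = np.linalg.lstsq(A, z, rcond=None)
linear_resid = z - A @ coef                # residual stays large

# Same fit once the x*y interaction is available as a feature.
B = np.c_[A, X[:, 0] * X[:, 1]]
coef2, *_ = np.linalg.lstsq(B, z, rcond=None)
mult_resid = z - B @ coef2                 # residual is essentially zero
```

For independent standard-normal x and y, the best additive fit has coefficients near zero, so the residual variance stays close to the variance of z itself.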
timonix t1_iqw49rz wrote
I saw a similar architecture where the outputs of one network were the filter coefficients of a second CNN.
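That's the hypernetwork / dynamic-filter idea; a minimal sketch of it (architecture and names are hypothetical, weights untrained): one small network maps a conditioning vector to convolution coefficients, which a second stage then applies to the signal. This is again a multiplicative interaction, since one input's transform multiplies the other input.

```python
import numpy as np

rng = np.random.default_rng(0)
W_hyper = rng.standard_normal((4, 3))      # "hypernetwork" weights (untrained)

def dynamic_conv(signal, condition):
    # The filter coefficients are an *output* of the first network,
    # not fixed learned parameters of the conv layer itself.
    kernel = np.tanh(condition @ W_hyper)
    return np.convolve(signal, kernel, mode="same")

signal = rng.standard_normal(16)
condition = rng.standard_normal(4)
out = dynamic_conv(signal, condition)      # conv filtered by a predicted kernel
```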