Nobody else has mentioned ResNets yet. They have something like higher-order weights with f(x) = σ(W1σ(W0x+b0)+b1) + σ(W0x+b0). Highway networks take it a step further with f(x) = σ(W0x+b0)σ(W1x+b1) + xσ(W2x+b2), where the products are elementwise, so the gates multiply the activations and the input directly. However, both were designed to address gradient-flow problems in deep networks.
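For anyone who wants to poke at these, here is a rough sketch of the two formulas in plain Python. The function names (`residual_block`, `highway_layer`) and the list-of-lists weight layout are my own choices for illustration, not from any library, and I'm assuming the products in the highway formula are elementwise.

```python
import math


def sigmoid(v):
    """Elementwise logistic function σ applied to a list of floats."""
    return [1.0 / (1.0 + math.exp(-t)) for t in v]


def affine(W, x, b):
    """Compute Wx + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def residual_block(x, W0, b0, W1, b1):
    """f(x) = σ(W1 σ(W0 x + b0) + b1) + σ(W0 x + b0), as written in the comment."""
    h = sigmoid(affine(W0, x, b0))       # first layer, also reused as the skip term
    out = sigmoid(affine(W1, h, b1))     # second layer stacked on top of h
    return [o + hi for o, hi in zip(out, h)]


def highway_layer(x, W0, b0, W1, b1, W2, b2):
    """f(x) = σ(W0 x + b0) σ(W1 x + b1) + x σ(W2 x + b2), products assumed elementwise."""
    h = sigmoid(affine(W0, x, b0))       # candidate activation
    t = sigmoid(affine(W1, x, b1))       # gate multiplying the activation
    c = sigmoid(affine(W2, x, b2))       # gate multiplying the raw input
    return [hi * ti + xi * ci for hi, ti, xi, ci in zip(h, t, x, c)]


# Toy check with all-zero weights, so every σ(...) is exactly 0.5.
Z = [[0.0, 0.0], [0.0, 0.0]]
z = [0.0, 0.0]
x = [2.0, -4.0]

print(residual_block(x, Z, z, Z, z))       # [1.0, 1.0]
print(highway_layer(x, Z, z, Z, z, Z, z))  # [1.25, -1.75]
```

The highway output depends on x through the term x·σ(W2x+b2), which is where the multiplicative, "higher-order" interaction between the input and a learned function of the input shows up. In the residual form the two terms are only added.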