femboyxx98 t1_j4vlsfj wrote
Reply to [P] RWKV 14B Language Model & ChatRWKV : pure RNN (attention-free), scalable and parallelizable like Transformers by bo_peng
Have you compared it against modern transformer implementations, e.g. ones using FlashAttention, which can provide a 3x-5x speedup by itself?
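
(For reference, a rough way to time the FlashAttention path for such a comparison; this is a sketch assuming PyTorch >= 2.0 and a CUDA GPU that supports the flash kernel, and the benchmark shapes are made up:)

```python
# Rough timing sketch for PyTorch's FlashAttention-backed SDPA.
# Assumes PyTorch >= 2.0 and a CUDA GPU with flash kernel support;
# the shapes below are arbitrary benchmark values.
import time
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) - fp16 is needed for the flash kernel
q = torch.randn(8, 16, 2048, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Force the FlashAttention backend (context-manager API as of PyTorch 2.0/2.1).
with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                    enable_math=False,
                                    enable_mem_efficient=False):
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(100):
        F.scaled_dot_product_attention(q, k, v, is_causal=True)
    torch.cuda.synchronize()
    print(f"{(time.time() - t0) / 100 * 1e3:.3f} ms/iter")
```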
femboyxx98 t1_jc601pw wrote
Reply to comment by PM_ME_JOB_OFFER in [R] Training Small Diffusion Model by crappr
The actual implementation of most models is quite simple, and he often reuses the same building blocks. The challenge is obtaining the dataset and actually training the models (including the hyperparameter search), and since he doesn't provide any trained weights himself, it's hard to know whether his implementations even work out of the box.
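
(One cheap sanity check before committing to a full training run is to overfit a single small batch and confirm the loss collapses to near zero; if it doesn't, the implementation or the training loop is likely broken. A hypothetical sketch with a toy model standing in for the real one:)

```python
# Hypothetical "overfit one batch" sanity check. The toy MLP stands in
# for whatever model the repo exposes; the point is only the procedure.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One fixed random batch: 16 samples, 10 classes.
x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))

for step in range(500):
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final loss: {loss.item():.4f}")  # should be close to zero
```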