femboyxx98 t1_j4vlsfj wrote
Reply to [P] RWKV 14B Language Model & ChatRWKV : pure RNN (attention-free), scalable and parallelizable like Transformers by bo_peng
Have you compared it against modern transformer implementations, e.g. ones using FlashAttention, which can provide a 3x-5x speedup by itself?
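
(For reference, a rough way to time the FlashAttention path for such a comparison; this is a sketch assuming PyTorch >= 2.0 and a CUDA GPU that supports the flash kernel, and the benchmark shapes are made up:)

```python
# Rough timing sketch for PyTorch's FlashAttention-backed SDPA.
# Assumes PyTorch >= 2.0 and a CUDA GPU with flash kernel support;
# the shapes below are arbitrary benchmark values.
import time
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) - fp16 is needed for the flash kernel
q = torch.randn(8, 16, 2048, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Force the FlashAttention backend (context-manager API as of PyTorch 2.0/2.1).
with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                    enable_math=False,
                                    enable_mem_efficient=False):
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(100):
        F.scaled_dot_product_attention(q, k, v, is_causal=True)
    torch.cuda.synchronize()
    print(f"{(time.time() - t0) / 100 * 1e3:.3f} ms/iter")
```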
femboyxx98 t1_jc601pw wrote
Reply to comment by PM_ME_JOB_OFFER in [R] Training Small Diffusion Model by crappr
The actual implementation of most models is quite simple, and he often reuses the same building blocks. The challenge is obtaining the dataset and actually training the models (including the hyperparameter search), and since he doesn't provide any trained weights himself, it's hard to know whether his implementations even work out of the box.
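
(One cheap sanity check before committing to a full training run is to overfit a single small batch and confirm the loss collapses to near zero; if it doesn't, the implementation or the training loop is likely broken. A hypothetical sketch with a toy model standing in for the real one:)

```python
# Hypothetical "overfit one batch" sanity check. The toy MLP stands in
# for whatever model the repo exposes; the point is only the procedure.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One fixed random batch: 16 samples, 10 classes.
x, y = torch.randn(16, 32), torch.randint(0, 10, (16,))

for step in range(500):
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final loss: {loss.item():.4f}")  # should be close to zero
```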