
suflaj t1_iqt876f wrote

To develop a translation system you will need loads of data. To process all that data, you will need loads of processing power. To get good results, you will need large models. From there it depends on how much time you have. If you have a year, then a 2080 Ti will be fine; it can pretrain a BERT model in several months. If you have a month, you should probably consider a few 3090s. If you have a week, then renting 8xA100 rigs might be your best bet.

Overall I'd first focus on collecting the data. 10000 pages of data, assuming 50 sentences per page (and that is a generous guess), is nowhere near enough to develop a translation system. Aim for several tens of millions of sentences, ideally 100-200 million sentence pairs if you wish to outperform Google Translate, or consider developing a handmade tool instead.
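To put that gap in perspective, here's a quick back-of-envelope calculation using the numbers above (the per-page sentence count is the same generous guess):

```python
# Rough corpus-size estimate, using the assumptions from this thread.
pages = 10_000
sentences_per_page = 50  # generous guess, as noted above
available_pairs = pages * sentences_per_page

target_low = 100_000_000  # lower end of the suggested 100-200M pair range

print(f"available sentence pairs: {available_pairs:,}")            # 500,000
print(f"shortfall vs. low target: {target_low // available_pairs}x")  # 200x
```

So even under generous assumptions, 10000 pages yields roughly half a million pairs, about 200x short of the low end of the target range.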

The GPUs you mentioned will only be able to run LSTM-CNN models, which will never compete with Google Translate (which is a hand-tuned system accompanied by a transformer model). You need at least a 1080 Ti/2080 Ti, and even that is fairly weak; you'll need months to get anything out of it.


wrsage OP t1_isj9bko wrote

Thank you for the detailed information. I think I will go for a single 2080 Ti or 3080; a 3090 is out of my budget for now. By 10000 pages I meant tens of thousands, so I have quite a lot of material from my own work. And I worked with my colleagues to make a dictionary, so I also have a lot of (questionable) raw data. I don't know if those word pairs will be of any help.
