zaemis t1_jdtm2zm wrote on March 27, 2023 at 2:20 AM

Reply to [D] Simple Questions Thread by AutoModerator

I'm going to train a gpt model (distilgpt2) in a language other than english. At this point I'm just teaching it the language - not worrying about further abilities such as Q&A, I expect that to be later with fine-tuning. Anyway, my dataset is currently a csv with [id, text] and each text is a paragraph.

It is my understanding that only 512 characters/tokens are going to be fed in (depending on my max_length, but my point is that it'll probably be less than the entire length of the paragraph), and beyond that will be ignored. If I were to break the paragraphs into 512-word chunks, I could make better use of the dataset. But most likely those subsequent chunks wouldn't start a phrase or sentence - it'd be starting in the middle of a sentence.

For example, "The quick brown fox jumped over the lazy sleeping dog." might be broken up into two samples. "The quick brown fox jumped over the lazy" and "sleeping dog."

Is it a problem if I use text samples that don't "start properly?"