
macORnvidia OP t1_j0z24ie wrote

>For example, the largest Parquet file I've cleaned in pandas was about 7 million rows and about 10 GB of just text. Pandas can run queries through it in a few seconds.

Using RAPIDS? Like cuDF?


sayoonarachu t1_j0zlw4i wrote

No. I was just using plain pandas (CPU) for quick regex work, removing and replacing text in rows. It was just for a hobby project. The data was scraped from the Midjourney and Stable Diffusion Discord servers, so there were millions of rows of duplicate and poor-quality prompts, which I had pandas delete. In the end, the number of unique rows with more than 50 characters came to about 700k, which was then used to train GPT-Neo 125M.
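For anyone curious, a minimal sketch of that kind of cleanup in pandas might look like this. The column name `prompt`, the file paths, and the specific regex patterns are my own illustrative assumptions, not the commenter's actual code:

```python
import pandas as pd

# Load the scraped prompts (path and column name are assumptions).
df = pd.read_parquet("scraped_prompts.parquet")

# Simple regex cleanup: strip tag-like markup and collapse whitespace.
df["prompt"] = (
    df["prompt"]
    .str.replace(r"<[^>]+>", "", regex=True)
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

# Drop duplicate prompts and keep only rows longer than 50 characters.
df = df.drop_duplicates(subset="prompt")
df = df[df["prompt"].str.len() > 50]

df.to_parquet("cleaned_prompts.parquet", index=False)
```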

I didn't know about cuDF. Thanks 😅
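Since cuDF exposes a largely pandas-compatible API, the same pipeline can often be moved to the GPU mostly by swapping the import. A rough sketch, again with the column name and path as assumptions:

```python
import cudf

# Same cleanup as the pandas version, but running on the GPU.
gdf = cudf.read_parquet("scraped_prompts.parquet")
gdf = gdf.drop_duplicates(subset="prompt")
gdf = gdf[gdf["prompt"].str.len() > 50]
gdf.to_parquet("cleaned_prompts.parquet")
```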
