Comments

in-your-own-words t1_irmn9d4 wrote

Randomly permute the rows of the table, and then take the first X% of them for training.
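A minimal sketch of that idea, assuming the table is already loaded into a pandas dataframe called df and using an 80% training fraction as an example:

    import numpy as np

    # df is assumed to be the full table; 80/20 is an example split ratio.
    shuffled = df.iloc[np.random.permutation(len(df))].reset_index(drop=True)
    n_train = int(0.8 * len(shuffled))

    train_df = shuffled.iloc[:n_train]
    test_df = shuffled.iloc[n_train:]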

1

eingrid2 t1_irmnf02 wrote

You could shuffle the dataframe (df.sample(frac=1)) and then just take the first 80% of samples as train and the other 20% as test.

You could also use sklearn's train_test_split.
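A minimal sketch of the sklearn route, assuming df is the dataframe loaded from the CSV and that the class column is named "label" (both assumptions):

    from sklearn.model_selection import train_test_split

    # 80/20 split; stratify keeps the class proportions similar in train and test.
    train_df, test_df = train_test_split(
        df, test_size=0.2, random_state=42, stratify=df["label"]
    )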

1

in-your-own-words t1_irmnurg wrote

Yes, there are dozens of ways of doing it. I encourage you to figure it out yourself. If you can't design and implement your own test & evaluation experiments for ML, you will end up doing the world more harm than good by dabbling in it. The entire ML field suffers from extremely weak T&E, and from lots of people who have only learned to stuff inputs into functions.

Some hints:

  • There may be functions within standard machine learning software libraries that produce train/test splits given tabular data input.

  • There may be functions that will produce a random permutation of rows of a table.

  • There may be functions that produce random permutations of numbers from 0 to N, where you specify N. If N is the number of rows in your table, you could create a new column of these random numbers and then sort the table on that column.

  • You may want to consider class imbalance in your dataset. If your classes are imbalanced, apply your train/test split independently to class 1 and class 0 so that the resulting split contains the same proportion of 1s and 0s in both the train and test partitions.

  • Consider using an outer cross-validation approach, where you run your experiment for k different train/test splits (see the sketch after this list). When you report your metrics, look at the distribution of each metric over the k experiments. Report the median, interquartile range, 5th and 95th percentiles, and outliers for each metric over the k trials.

  • Version-control your code and tag the commit that produces the results you report. Include this tag or the commit hash with your reported results.
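A minimal sketch combining the stratified-split and outer cross-validation hints with scikit-learn; X (a numpy feature matrix), y (binary labels), the RandomForestClassifier, and the F1 metric are all placeholders and assumptions, not a prescription:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score
    from sklearn.model_selection import StratifiedKFold

    # X: numpy feature matrix, y: numpy array of 0/1 labels (both assumed to exist).
    scores = []
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # k = 5 outer splits

    for train_idx, test_idx in skf.split(X, y):
        model = RandomForestClassifier(random_state=42)  # placeholder model
        model.fit(X[train_idx], y[train_idx])
        scores.append(f1_score(y[test_idx], model.predict(X[test_idx])))

    # Report the distribution over the k trials, not a single number.
    print("median:", np.median(scores))
    print("IQR:", np.percentile(scores, 75) - np.percentile(scores, 25))
    print("5th/95th percentiles:", np.percentile(scores, [5, 95]))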

3

eingrid2 t1_irmnv0u wrote

If all the images are in one folder, you can define a variable like path_to_dataset = "your_path", split normally, and then just prepend the dataset path to each image name.
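A minimal sketch of that, assuming a hypothetical "image_name" column in the dataframe:

    import os

    path_to_dataset = "your_path"  # root folder that contains all the images

    # Prepend the dataset path to each image name ("image_name" is an assumed column).
    df["image_path"] = df["image_name"].apply(
        lambda name: os.path.join(path_to_dataset, name)
    )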

1

PassionatePossum t1_irmr78w wrote

You seem to be quite new at this (no offense, but otherwise you wouldn't be asking for code for such a trivial task), so I would like to give you some advice on how to do this right. Others have already told you how to implement a random split, which is generally good advice. However, the underlying assumption is that the images themselves are not somehow correlated with one another.

I've actually seen people take video frames (where, of course, each frame looks barely different from the previous one), randomly sample those frames into training/test sets, and then brag about their incredibly good performance. Any performance measurement you do on such a dataset is worthless.

So how you sample training/test data is something you should think about carefully (i.e., are the training/validation/test sets actually independent of one another?).

Under the assumption that the images are independent of one another, a random split is a good idea. If that isn't the case (and without more information, nobody here can tell you whether it is), you need some other way to split the data (e.g. by video).
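A minimal sketch of such a group-aware split with scikit-learn's GroupShuffleSplit, assuming a hypothetical "video_id" column that records which video each frame came from:

    from sklearn.model_selection import GroupShuffleSplit

    # df is assumed to have a "video_id" column; all frames from one video stay together.
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    train_idx, test_idx = next(splitter.split(df, groups=df["video_id"]))

    train_df = df.iloc[train_idx]
    test_df = df.iloc[test_idx]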

3

Street_Excitement_14 t1_irmupgk wrote

u/in-your-own-words explained it nicely, but I feel you lack the basics, so here are my two cents:

  1. You have two different entities: (A) a CSV file giving, for each image, its path and its output class, and (B) the images themselves.
  2. Load your CSV file into a dataframe (i.e. in pandas) and shuffle it.
  3. Create a new column that indicates whether the image in that row is for training or for testing. (You can take the indexes and split them, or do a stratified split, i.e. take 20% of the data from each class as test.) If you get stuck, search the web or ask on Stack Overflow. pandas, sklearn and numpy are your friends.
  4. Now you have a dataframe with the image path, image class and train/test indicator. Using this dataframe, you can create train and test folders and copy/move the corresponding images into those folders easily (see the sketch after this list). You can also create two separate CSV files from your dataframe, one for train and one for test, and save those as well.
  5. Alternatively, you can feed data to the learning algorithm directly from your dataframe: the loader function takes an image path as input, loads the image, and feeds it to the algorithm.
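A minimal sketch of step 4, assuming hypothetical column names "image_path", "class", and a train/test indicator called "split":

    import os
    import shutil

    import pandas as pd

    df = pd.read_csv("labels.csv")  # assumed CSV with path, class, and split columns

    # Create train/<class> and test/<class> folders, then copy each image into place.
    for split in ("train", "test"):
        for cls in df["class"].unique():
            os.makedirs(os.path.join(split, str(cls)), exist_ok=True)

    for _, row in df.iterrows():
        dest = os.path.join(row["split"], str(row["class"]),
                            os.path.basename(row["image_path"]))
        shutil.copy(row["image_path"], dest)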

Don't ask for code :) As I have said, pandas is mostly your friend.

1

Zealousideal_Low1287 t1_irmuxdc wrote

Well, you're clearly way too lazy / unaware to look it up for yourself. You're asking another commenter to actually write the code for you for this insanely trivial task. And if this is a problem for you, doing anything remotely technical with the actual machine learning will be way, way beyond you.

1

in-your-own-words t1_irmv7ca wrote

My pedagogical method is more Socratic and comes from an engineering perspective. I think discovering how to find the names of the mainstream tools, how to find the documentation, and how to learn to read, understand, and rely on it is ultimately the most beneficial and empowering thing for a developer.

1

redditnit21 OP t1_irmvsly wrote

Thanks for such a good answer; I will keep in mind everything you said and start learning the basics. I don't know why the other guys are just straight-up criticising me. What I did was split the images into two folders, train and test, and then further sort them into class folders (Class 1 and Class 2). Now I am thinking of using a train data generator for training.
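If "train data generator" refers to Keras's ImageDataGenerator, here is a minimal sketch for the folder layout described above (train/Class 1 and train/Class 2); the directory name, image size, and batch size are assumptions:

    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    # Assumes the layout described above: train/<class name>/<image files>.
    train_datagen = ImageDataGenerator(rescale=1.0 / 255)

    train_gen = train_datagen.flow_from_directory(
        "train",                 # assumed path to the training folder
        target_size=(224, 224),  # assumed image size
        batch_size=32,
        class_mode="binary",     # two classes: Class 1 and Class 2
    )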

1

Street_Excitement_14 t1_irmwhvt wrote

You are welcome. My advice is not to do anything manually; do it with pandas. E.g. you can use pandas' .loc to filter the training data and write that data to the training folder, etc. If you get stuck at any point, search the internet or ask. Good luck, have a nice day :)
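For instance, a minimal sketch of the .loc filtering, assuming the hypothetical "split" indicator column from earlier:

    # Select only the rows marked as training data ("split" is an assumed column name).
    train_df = df.loc[df["split"] == "train"]
    test_df = df.loc[df["split"] == "test"]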

1

redditnit21 OP t1_irn55fi wrote

I am not asking people to do my task. I was just asking them to tell me the command, and I already said I am sorry for that. You should be ashamed of behaving horribly instead of giving constructive criticism. Shame on you.

0

Zealousideal_Low1287 t1_irn5pmm wrote

Because I have already searched for and categorised all the public datasets for such tasks and contacted the appropriate people about commercial licenses. I was asking, in order to find more people to talk to about licensing their private data.

Nice try though pal. Maybe just move on.

1

redditnit21 OP t1_irn6vek wrote

And I searched on the internet for how to split the CSV file according to image paths, but I only found one method, which splits it into different folders. I didn't find any solution based on pandas.

Nice try!

0

Zealousideal_Low1287 t1_irn8pa7 wrote

I must say, I find it really weird that someone who would ask people online to write trivially simple code for them would be this defensive. Can you not look at yourself and think, huh maybe something is wrong with my attitude?

1