Comments

in-your-own-words t1_irmn9d4 wrote

Randomly permute the rows of the table, and then take the first X% of them for training.
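A minimal sketch of that idea, assuming the table is already loaded into a pandas dataframe called df and using an 80% training fraction as an example:

    import numpy as np

    # df is assumed to be the full table; 80/20 is an example split ratio.
    shuffled = df.iloc[np.random.permutation(len(df))].reset_index(drop=True)
    n_train = int(0.8 * len(shuffled))

    train_df = shuffled.iloc[:n_train]
    test_df = shuffled.iloc[n_train:]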

1

eingrid2 t1_irmnf02 wrote

You could shuffle the dataframe (df.sample(frac=1)) and then just take the first 80% of samples as train and the other 20% as test.

You could also use sklearn's train_test_split.
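A minimal sketch of the sklearn route, assuming df is the dataframe loaded from the CSV and that the class column is named "label" (both assumptions):

    from sklearn.model_selection import train_test_split

    # 80/20 split; stratify keeps the class proportions similar in train and test.
    train_df, test_df = train_test_split(
        df, test_size=0.2, random_state=42, stratify=df["label"]
    )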

1

in-your-own-words t1_irmnurg wrote

Yes, there are dozens of ways of doing it. I encourage you to figure it out yourself. If you can't design and implement your own test & evaluation experiments for ML, you will end up doing the world more harm than good by dabbling in it. The entire ML field suffers from extremely weak T&E, and from lots of people who have only learned to stuff inputs into functions.

Some hints:

  • There may be functions within standard machine learning software libraries that produce train/test splits given tabular data input.

  • There may be functions that will produce a random permutation of rows of a table.

  • There may be functions that produce random permutations of numbers from 0 to N, where you specify N. If N is the number of rows in your table, you could create a new column of these random numbers and then sort the table on that column.

  • You may want to consider class imbalance in your dataset. If your classes are imbalanced, apply your train/test split independently to class 1 and class 0 so that the resulting split contains the same proportion of 1s and 0s in both the train and test partitions.

  • Consider using an outer cross-validation approach, where you run your experiment for k different train/test splits (see the sketch after this list). When you report your metrics, look at the distribution of each metric over the k experiments. Report the median, interquartile range, 5th and 95th percentiles, and outliers for each metric over the k trials.

  • Version-control your code and tag the commit that produces the results you report. Include this tag or the commit hash with your reported results.
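A minimal sketch combining the stratified-split and outer cross-validation hints with scikit-learn; X (a numpy feature matrix), y (binary labels), the RandomForestClassifier, and the F1 metric are all placeholders and assumptions, not a prescription:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import f1_score
    from sklearn.model_selection import StratifiedKFold

    # X: numpy feature matrix, y: numpy array of 0/1 labels (both assumed to exist).
    scores = []
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)  # k = 5 outer splits

    for train_idx, test_idx in skf.split(X, y):
        model = RandomForestClassifier(random_state=42)  # placeholder model
        model.fit(X[train_idx], y[train_idx])
        scores.append(f1_score(y[test_idx], model.predict(X[test_idx])))

    # Report the distribution over the k trials, not a single number.
    print("median:", np.median(scores))
    print("IQR:", np.percentile(scores, 75) - np.percentile(scores, 25))
    print("5th/95th percentiles:", np.percentile(scores, [5, 95]))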

3

eingrid2 t1_irmnv0u wrote

If all the images are in one folder, you can define a variable like path_to_dataset = "your_path", split normally, and then just prepend the dataset path to each image name.
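A minimal sketch of that, assuming a hypothetical "image_name" column in the dataframe:

    import os

    path_to_dataset = "your_path"  # root folder that contains all the images

    # Prepend the dataset path to each image name ("image_name" is an assumed column).
    df["image_path"] = df["image_name"].apply(
        lambda name: os.path.join(path_to_dataset, name)
    )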

1

PassionatePossum t1_irmr78w wrote

You seem to be quite new at this (no offense, but otherwise you wouldn't be asking for code for such a trivial task), so I would like to give you some advice on how to do this right. Others have already told you how to implement a random split, which is generally good advice. However, the underlying assumption is that the images themselves are not somehow correlated with one another.

I've actually seen people take video frames (where, of course, each frame looks barely different from the previous one), randomly sample those frames into training/test sets, and then brag about their incredibly good performance. Any performance measurement you do on such a dataset is worthless.

So how you sample training/test data is something you should think about carefully (i.e., are the training/validation/test sets actually independent of one another?).

Under the assumption that the images are independent of one another, a random split is a good idea. If that isn't the case (and without more information, nobody here can tell you whether it is), you need some other way to split the data (e.g. by video).
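A minimal sketch of such a group-aware split with scikit-learn's GroupShuffleSplit, assuming a hypothetical "video_id" column that records which video each frame came from:

    from sklearn.model_selection import GroupShuffleSplit

    # df is assumed to have a "video_id" column; all frames from one video stay together.
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    train_idx, test_idx = next(splitter.split(df, groups=df["video_id"]))

    train_df = df.iloc[train_idx]
    test_df = df.iloc[test_idx]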

3

Street_Excitement_14 t1_irmupgk wrote

u/in-your-own-words explained it nicely, but I feel you lack the basics, so here are my two cents:

  1. You have two different entities: (A) a CSV file giving, for each image, its path and its output class, and (B) the images themselves.
  2. Load your CSV file into a dataframe (i.e. in pandas) and shuffle it.
  3. Create a new column that indicates whether the image in that row is for training or for testing. (You can take the indexes and split them, or do a stratified split, i.e. take 20% of the data from each class as test.) If you get stuck, search the web or ask on Stack Overflow. pandas, sklearn and numpy are your friends.
  4. Now you have a dataframe with the image path, image class and train/test indicator. Using this dataframe, you can create train and test folders and copy/move the corresponding images into those folders easily (see the sketch after this list). You can also create two separate CSV files from your dataframe, one for train and one for test, and save those as well.
  5. Alternatively, you can feed data to the learning algorithm directly from your dataframe: the loader function takes an image path as input, loads the image, and feeds it to the algorithm.
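A minimal sketch of step 4, assuming hypothetical column names "image_path", "class", and a train/test indicator called "split":

    import os
    import shutil

    import pandas as pd

    df = pd.read_csv("labels.csv")  # assumed CSV with path, class, and split columns

    # Create train/<class> and test/<class> folders, then copy each image into place.
    for split in ("train", "test"):
        for cls in df["class"].unique():
            os.makedirs(os.path.join(split, str(cls)), exist_ok=True)

    for _, row in df.iterrows():
        dest = os.path.join(row["split"], str(row["class"]),
                            os.path.basename(row["image_path"]))
        shutil.copy(row["image_path"], dest)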

Don't ask for code :) As I have said, pandas is mostly your friend.

1

Zealousideal_Low1287 t1_irmuxdc wrote

Well, you're clearly way too lazy / unaware to look it up for yourself. You're asking another commenter to actually write the code for you for this insanely trivial task. And if this is a problem for you, doing anything remotely technical with the actual machine learning will be way, way beyond you.

1

in-your-own-words t1_irmv7ca wrote

My pedagogical method is more Socratic and comes from an engineering perspective. I think discovering how to find the names of the mainstream tools, how to find the documentation, and how to learn to read, understand, and rely on it is ultimately the most beneficial and empowering thing for a developer.

1

redditnit21 OP t1_irmvsly wrote

Thanks for such a good answer; I will keep in mind everything you said and start learning the basics. I don't know why the other guys are just straight-up criticising me. What I did was split the images into two folders, train and test, and then further sort them into class folders (Class 1 and Class 2). Now I am thinking of using a train data generator for training.
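If "train data generator" refers to Keras's ImageDataGenerator, here is a minimal sketch for the folder layout described above (train/Class 1 and train/Class 2); the directory name, image size, and batch size are assumptions:

    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    # Assumes the layout described above: train/<class name>/<image files>.
    train_datagen = ImageDataGenerator(rescale=1.0 / 255)

    train_gen = train_datagen.flow_from_directory(
        "train",                 # assumed path to the training folder
        target_size=(224, 224),  # assumed image size
        batch_size=32,
        class_mode="binary",     # two classes: Class 1 and Class 2
    )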

1

Street_Excitement_14 t1_irmwhvt wrote

You are welcome. My advice is not to do anything manually; do it with pandas. E.g. you can use pandas' .loc to filter the training data and write that data to the training folder, etc. If you get stuck at any point, search the internet or ask. Good luck, have a nice day :)
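For instance, a minimal sketch of the .loc filtering, assuming the hypothetical "split" indicator column from earlier:

    # Select only the rows marked as training data ("split" is an assumed column name).
    train_df = df.loc[df["split"] == "train"]
    test_df = df.loc[df["split"] == "test"]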

1

redditnit21 OP t1_irn55fi wrote

I am not asking people to do my task. I was just asking them to tell me the command, and I already said I am sorry for that. You should be ashamed of behaving horribly instead of giving constructive criticism. Shame on you.

0

Zealousideal_Low1287 t1_irn5pmm wrote

Because I have already searched for and categorised all the public datasets for such tasks and contacted the appropriate people about commercial licenses. I was asking, in order to find more people to talk to about licensing their private data.

Nice try though pal. Maybe just move on.

1

redditnit21 OP t1_irn6vek wrote

And I searched on the internet for how to split the CSV file according to image paths, but I only found one method, which splits it into different folders. I didn't find any solution based on pandas.

Nice try!

0

Zealousideal_Low1287 t1_irn8pa7 wrote

I must say, I find it really weird that someone who would ask people online to write trivially simple code for them would be this defensive. Can you not look at yourself and think, huh maybe something is wrong with my attitude?

1