Submitted by spiritualquestions t3_10ha8gp in MachineLearning

I am starting on a problem similar to the one below for my work.

There is a score 1-4 (1 is bad, 4 is very good) of a person's back sprain recovery. The data we have are back sprain recovery scores recorded after 2 weeks, 3 months, and 6 months, along with information (features) about their behavior like sleep, medications, diet, and exercise.

We want to predict their 2 week, 3 month, and 6 month back sprain recovery scores based on their initial behavior inputs. For example, given a user sleeps 8 hours a day, consumes x amount of sugar, does physical therapy 4 days a week, and takes x medication, what will their recovery scores be at 2 weeks, 3 months, and 6 months?

The training data would look like:

| Sleep Average | Medication | Days of Physical Therapy | Diet | Week 2 recovery score | Month 3 recovery score | Month 6 recovery score |
|---|---|---|---|---|---|---|
| 9 hours per night | Advil | 4 days/week | Healthy | 2 | 3 | 4 |
| 5 hours per night | None | 0 days/week | Unhealthy | 1 | 2 | 2 |

I want a model (or multiple models) to predict 3 values: the 2 week, 3 month, and 6 month scores. I am not familiar with time series, but it seems like the data may be too sparse for that.

Should I be using time series here, or should I create 3 classification models?

1

Comments


suflaj t1_j57ce83 wrote

This looks like something for XGBoost. In that case you're looking at the XGBRegressor class.

Your X are the first 4 features, your Y are the 3 outputs. You will need to convert the medication to a one-hot vector representation, and the diet will presumably be enumerated into whole numbers sorted by healthiness.

3

spiritualquestions OP t1_j57f0az wrote

Thanks for getting back to me!

Would this be considered multi output regression? Also, why would I not want to use multi output classification? For clarification, the scores are discrete, so there is no score of 1.2; they are either 1, 2, 3, or 4. Or they could even be treated as "severe", "bad", "medium", "good".

1

suflaj t1_j57gnky wrote

Well, this is a regression task, not classification. You could classify 1, 2, 3 and 4 for each output, but it seems like the underlying recovery is continuous. You can always clamp and round your result, e.g. with y = max(1, min(4, round(x))). With classification you could argmax a class, but then you'll overfit more easily. You would probably benefit from the bias coming from the regression task itself, which tells the algorithm that 2 is close to 3 and 1, but far away from 4.
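The clamp-and-round step above can be sketched as a small helper (the function name is arbitrary):

```python
def to_score(x: float) -> int:
    """Map a continuous regression output onto the discrete 1-4 scale:
    round to the nearest integer, then clamp into [1, 4]."""
    return max(1, min(4, round(x)))

print(to_score(2.4))   # 2
print(to_score(4.7))   # 4  (clamped from 5)
print(to_score(0.3))   # 1  (clamped from 0)
```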

1

Professional_Ball_58 t1_j57rolp wrote

But wouldn't this also work for multi classification? If the numbers weren't shown to be numbers and, say, the y values were good, bad, etc., wouldn't this be a classification problem?

2

suflaj t1_j57rz2o wrote

It would, but if you did classification you are enabling the model to overfit on the data more easily. You can represent it as a classification problem (classification problems are just regression with a cutoff), but it naturally seems like more of a regression problem.

1

Professional_Ball_58 t1_j57s95g wrote

That's interesting. So if we are able to convert the labels into a numeric value where higher numbers are better and vice versa, then regression is better? Can you please expand on what bias would help in this case?

1

suflaj t1_j57te64 wrote

It's not necessarily better, but it will help you if your data is not abundant.

For example, if you look at it as regression, then the model uses your features and tries to figure out how correlated they are with the grade. Your grade is continuous and monotonic, meaning that if the features contribute in "sane" ways to the grade, it will map easily.

If you consider it a classification problem, then each class basically has its own degree of freedom. This could cause your model to be overconfident, whereas with the regression solution your model is at the very least going to try to fit a continuous monotonic function.

With the regression task, you are implicitly telling your model that grade 2 is better than 1 and worse than 3. But with a classification model, because each class can be independent, your model can only learn this from the data itself. Which means that if your data is insufficient for the model to learn it, it won't work, whereas with a regression task, if your data is insufficient, it might still interpolate correctly.

1

Cll_dataEthics t1_j595kbd wrote

I agree that this is a regression problem. You could try fitting 3 different linear regression models (one for each of the three recovery time periods). This would have the added benefit of providing interpretability and not just predictions.
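The three-separate-models idea can be sketched as follows; the data and feature columns are illustrative only. Inspecting each model's coefficients is where the interpretability benefit comes from:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy features: [sleep_hours, pt_days_per_week, healthy_diet_flag]
X = np.array([[9.0, 4, 1], [5.0, 0, 0], [7.0, 2, 1]])

# One target vector per recovery horizon.
targets = {
    "week_2":  np.array([2, 1, 2]),
    "month_3": np.array([3, 2, 3]),
    "month_6": np.array([4, 2, 3]),
}

# Fit one linear regression per horizon; coefficients show how each
# feature shifts the predicted recovery score for that horizon.
models = {}
for horizon, y in targets.items():
    models[horizon] = LinearRegression().fit(X, y)
    print(horizon, models[horizon].coef_)
```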

1