Submitted by monkeysingmonkeynew t3_10e7cls in MachineLearning

I'm building some random forest models in sklearn using a dataset that updates daily. I want to take advantage of the new stream of data, which could indicate changes in the X-y relationship; however, I've also found that my model performs better with more data. The problem is that it takes a seriously long time to run (the dataset is around 250,000 rows and 50 features). Is there an approach where one builds the model at the beginning of the data stream, and then updates the parameters with new data as it arrives, instead of continuously retraining the model on the entire dataset every day? Many thanks!

6

Comments

thiru_2718 t1_j4piklu wrote

Interesting question. My intuition is that if you could maintain a continuously updated cache of the metric you're using to split your branches (i.e. continuously compute mutual information for each fork), and we assume your new data roughly follows the same distribution as your old data, you may be able to get away with only modifying the downstream branches of your trees, which should be more efficient.

But if that assumption isn't true, then the new data changes your trees closer to the root, and there's little benefit.

2

blimpyway t1_j4pqj1b wrote

Search for "online random forests"; there are quite a few papers and articles on the subject. I assume they aren't a simple application of normal RFs, otherwise it wouldn't be worth publishing a paper on the subject.

1

__lawless t1_j4pzotl wrote

A lot of folks here already mentioned online learning and the resources for it. However, I am going to offer a very hacky solution inspired by the idea of boosting. Suppose you had a regression model already trained. Make predictions for the new training batch and calculate the residual errors. Now train a new random forest model on those residuals. For inference, just pass the features to both models and sum the results.

1

BenoitParis t1_j4qbih9 wrote

Hoeffding Trees come to mind. The keyword you are looking for is 'online learning'. Apparently there's a python package dedicated to that:

https://scikit-multiflow.readthedocs.io/en/stable/api/api.html
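A streaming loop there might look something like the sketch below (class names and parameters are my assumptions from the scikit-multiflow docs, so check the linked API reference; synthetic data stands in for the daily batches):

    import numpy as np
    from skmultiflow.meta import AdaptiveRandomForestRegressor

    rng = np.random.default_rng(0)
    model = AdaptiveRandomForestRegressor(n_estimators=10)

    # Feed one synthetic "day" of data at a time; partial_fit updates the
    # existing trees instead of retraining from scratch.
    for day in range(30):
        X_day = rng.normal(size=(1_000, 50))
        y_day = 2 * X_day[:, 0] + rng.normal(size=1_000)
        model.partial_fit(X_day, y_day)

    preds = model.predict(rng.normal(size=(5, 50)))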

But 250,000 rows is not that many. Since your time requirements are daily, I'd consider looking for other algorithms or implementations in other languages before going that route.

6

Repulsive_Tart3669 t1_j4qqivs wrote

This should be considered first. For instance, gradient-boosted trees, which are mostly implemented in C/C++ and have GPU compute backends: XGBoost, CatBoost, and LightGBM. Given daily updates, you'll have enough time not only to train a model but also to optimize its hyperparameters. In my experience, XGBoost + Ray Tune work just fine.
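For a sense of scale, a GPU-backed XGBoost run on data roughly the size described in the post might look like this (a sketch only; the parameters are illustrative, and a hyperparameter search with Ray Tune would wrap around it):

    import numpy as np
    import xgboost as xgb

    # Synthetic stand-in for the ~250k x 50 dataset from the post.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(250_000, 50))
    y = 2 * X[:, 0] + rng.normal(size=250_000)

    dtrain = xgb.DMatrix(X[:200_000], label=y[:200_000])
    dvalid = xgb.DMatrix(X[200_000:], label=y[200_000:])

    params = {
        "objective": "reg:squarederror",
        "tree_method": "gpu_hist",  # use "hist" on a CPU-only machine
        "max_depth": 8,
        "eta": 0.1,
    }

    booster = xgb.train(
        params,
        dtrain,
        num_boost_round=500,
        evals=[(dvalid, "valid")],
        early_stopping_rounds=20,
    )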

2

monkeysingmonkeynew OP t1_j4r6lwj wrote

This sounds pretty cool, but I don't follow every step. By "calculate the errors" do you mean, for example, subtracting the predicted probabilities from the actual outcomes?

Also, I didn't get your last part about inference; what exactly are you referring to there?

2

__lawless t1_j4r9ebs wrote

OK, let me elaborate a bit. Imagine the old model is called m_0. Your newly obtained training data is X, y (features and labels, respectively). Now calculate the residual error, which is the difference between y and the prediction of m_0: dy = y - m_0(X). Now train a new model m_1, whose features and labels are X and dy. Finally, at inference time the prediction is the sum of the two models: y_pred = m_0(X_new) + m_1(X_new).
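A minimal sklearn sketch of the idea (synthetic data, made-up forest sizes):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)

    # m_0: the existing model, trained once on the historical data.
    X_hist = rng.normal(size=(5_000, 50))
    y_hist = 2 * X_hist[:, 0] + rng.normal(size=5_000)
    m_0 = RandomForestRegressor(n_estimators=200).fit(X_hist, y_hist)

    # New daily batch: fit m_1 on the residuals of m_0.
    X_new = rng.normal(size=(1_000, 50))
    y_new = 2 * X_new[:, 0] + rng.normal(size=1_000)
    dy = y_new - m_0.predict(X_new)
    m_1 = RandomForestRegressor(n_estimators=50).fit(X_new, dy)

    # Inference: sum the two models' predictions.
    X_query = rng.normal(size=(10, 50))
    y_pred = m_0.predict(X_query) + m_1.predict(X_query)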

1

ClayStep t1_j4s2dfp wrote

Hackiest solution I can think of:

Just add new trees, trained on the new data, to the forest, and weight the trees by how new the data is... (assuming we care more about the new data)

(probably a terrible idea)
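One way this could be sketched in sklearn is to keep a small forest per daily batch and combine their predictions with recency weights (the decay factor and forest sizes below are made up):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    forests = []  # one small forest per daily batch, newest last

    for day in range(7):
        X_day = rng.normal(size=(1_000, 50))
        y_day = 2 * X_day[:, 0] + rng.normal(size=1_000)
        forests.append(RandomForestRegressor(n_estimators=25).fit(X_day, y_day))

    # Exponential recency weighting: the newest forest gets the largest weight.
    decay = 0.8
    weights = np.array([decay ** (len(forests) - 1 - i) for i in range(len(forests))])
    weights /= weights.sum()

    X_query = rng.normal(size=(10, 50))
    y_pred = sum(w * f.predict(X_query) for w, f in zip(weights, forests))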

2

monkeysingmonkeynew OP t1_j4un2xm wrote

OK, I can almost see this working, thanks for the suggestion. The only thing that would prevent me from implementing this solution is that taking the sum of the two models lets m_1 contribute as much to the result as m_0. However, I expect a single day's data to be noisy, so I would need the contribution of the new day's data to be down-weighted somehow.
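One possible way to do that is to shrink m_1's contribution by a factor between 0 and 1, much like the learning rate in boosting (a sketch only; alpha is a hypothetical value that would need tuning on held-out data):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X_old = rng.normal(size=(2_000, 50))
    y_old = 2 * X_old[:, 0] + rng.normal(size=2_000)
    X_new = rng.normal(size=(500, 50))
    y_new = 2 * X_new[:, 0] + rng.normal(size=500)

    m_0 = RandomForestRegressor(n_estimators=100).fit(X_old, y_old)
    m_1 = RandomForestRegressor(n_estimators=50).fit(X_new, y_new - m_0.predict(X_new))

    alpha = 0.3  # hypothetical down-weighting factor for the new day's model
    X_query = rng.normal(size=(10, 50))
    y_pred = m_0.predict(X_query) + alpha * m_1.predict(X_query)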

1

Equivalent-Way3 t1_j4wjuxe wrote

XGBoost can do this and you can set its hyperparameters so that it's a random forest
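If I read the XGBoost docs on random forests right, the forest-style configuration is roughly one boosting round, many parallel trees, and row/column subsampling. A sketch with illustrative values:

    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10_000, 50))
    y = 2 * X[:, 0] + rng.normal(size=10_000)
    dtrain = xgb.DMatrix(X, label=y)

    rf_params = {
        "objective": "reg:squarederror",
        "learning_rate": 1.0,        # no shrinkage, as in a plain random forest
        "num_parallel_tree": 100,    # grow 100 trees within the single round
        "subsample": 0.8,            # bootstrap-like row subsampling per tree
        "colsample_bynode": 0.8,     # feature subsampling at each split
    }

    rf_booster = xgb.train(rf_params, dtrain, num_boost_round=1)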

1

Equivalent-Way3 t1_j50y33r wrote

Yep, very simple. Say you have a model called model1 that you trained already; then you just use the xgb_model argument in your next training call.

In R (Python should be the same or close to it)

new_model <- xgb.train(data = new_data, xgb_model = model1, blah blah blah)
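In Python the equivalent, as far as I know, is the xgb_model argument of xgb.train, which continues training from an existing booster instead of refitting on the full history. A self-contained sketch (the daily_dmatrix helper is hypothetical, standing in for however you build a DMatrix from each day's data):

    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)

    def daily_dmatrix():
        # Hypothetical helper producing one synthetic day's worth of data.
        X = rng.normal(size=(1_000, 50))
        y = 2 * X[:, 0] + rng.normal(size=1_000)
        return xgb.DMatrix(X, label=y)

    params = {"objective": "reg:squarederror", "eta": 0.1}

    model = xgb.train(params, daily_dmatrix(), num_boost_round=100)

    # Next day: pass the existing booster via xgb_model to continue training
    # on the new batch only.
    model = xgb.train(params, daily_dmatrix(), num_boost_round=20, xgb_model=model)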

1