I'm building some random forest models in sklearn using a dataset that updates daily. I want to take advantage of the new stream of data, which could indicate changes in the X-y relationship; however, I've also found that my model performs better with more data. The problem is that training takes a seriously long time (the dataset is around 250,000 rows and 50 features). Is there an approach where one builds the model at the beginning of the data stream and then updates the parameters with new data as it arrives, instead of retraining the model on the entire dataset every day? Many thanks!

Comments

SatoshiNotMe t1_j4pp5zy wrote on January 17, 2023 at 11:48 AM

This is called Online Learning, as opposed to Batch Learning. It’s a somewhat neglected topic in terms of available packages, but there is one here (it has decision trees, not RF):

https://github.com/online-ml/river

There is a nice interview with the author on the ML Podcast

https://podcasts.apple.com/us/podcast/the-machine-learning-podcast/id1626358243?i=1000577393019
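To make the online/batch distinction concrete, here is a minimal sketch of an online learner in plain Python. The `learn_one` / `predict_one` method names and the dict-of-features input mirror the interface river uses, but the `OnlineLinearRegressor` class and `progressive_validation` helper are hypothetical stand-ins written for illustration, not river code.

```python
class OnlineLinearRegressor:
    """Toy linear model updated one example at a time with SGD (illustrative only)."""

    def __init__(self, lr=0.1):
        self.lr = lr
        self.w = {}    # feature name -> weight
        self.b = 0.0   # intercept

    def predict_one(self, x):
        return self.b + sum(self.w.get(k, 0.0) * v for k, v in x.items())

    def learn_one(self, x, y):
        # One gradient step on the squared error of this single example.
        err = self.predict_one(x) - y
        self.b -= self.lr * err
        for k, v in x.items():
            self.w[k] = self.w.get(k, 0.0) - self.lr * err * v


def progressive_validation(model, stream):
    """Predict first, then learn: every example is scored before the model sees it."""
    total_abs_err, n = 0.0, 0
    for x, y in stream:
        total_abs_err += abs(model.predict_one(x) - y)
        model.learn_one(x, y)
        n += 1
    return total_abs_err / n


# Synthetic stream where y = 2 * a; the model never stores past examples.
stream = [({"a": (i % 10) / 10}, 2 * (i % 10) / 10) for i in range(5000)]
model = OnlineLinearRegressor(lr=0.1)
mean_abs_err = progressive_validation(model, stream)
```

The point of the predict-then-learn loop is that the cost per day depends only on that day's rows, not on the full history.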

BenoitParis t1_j4qbih9 wrote on January 17, 2023 at 3:02 PM

Hoeffding Trees come to mind. The keyword you are looking for is 'online learning'. Apparently there's a python package dedicated to that:

https://scikit-multiflow.readthedocs.io/en/stable/api/api.html
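For context, a Hoeffding tree decides when a leaf has seen enough examples to split by comparing the gap between the two best candidate splits against the Hoeffding bound. A minimal sketch of that test follows; the function names and the example numbers are assumptions chosen for illustration, not the scikit-multiflow API.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Epsilon such that, with probability 1 - delta, the true mean of a
    variable with the given range is within epsilon of the mean of n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_best_gain, value_range, delta, n):
    """Split once the observed gap between the top two splits exceeds the bound."""
    return (best_gain - second_best_gain) > hoeffding_bound(value_range, delta, n)


# The bound shrinks as more examples reach the leaf.
eps_200 = hoeffding_bound(value_range=1.0, delta=1e-7, n=200)
eps_800 = hoeffding_bound(value_range=1.0, delta=1e-7, n=800)
```

Because the bound falls as 1/sqrt(n), quadrupling the number of examples at a leaf halves the gap needed to commit to a split.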

But 250,000 rows is not that many. Since you only need to retrain daily, I'd consider other algorithms, or implementations in other languages, before that.

Repulsive_Tart3669 t1_j4qqivs wrote on January 17, 2023 at 4:41 PM

This should be considered first. For instance, gradient-boosted trees, which are mostly implemented in C/C++ and have GPU compute backends: XGBoost, CatBoost and LightGBM. Given daily updates, you'll have enough time not only to train a model, but also to optimize its hyperparameters. In my experience, XGBoost + Ray Tune works just fine.

monkeysingmonkeynew OP t1_j4r539v wrote on January 17, 2023 at 6:10 PM

Yes, it's OK if I run it once a day, but I need to backtest two years of data, so it's not feasible on a laptop or affordable on a GPU.

thiru_2718 t1_j4piklu wrote on January 17, 2023 at 10:25 AM

Interesting question. My intuition is that if you could maintain a continuously updated cache of the metric you're using to split your branches (i.e. continuously compute mutual information for each fork), and we assume your new data roughly follows the same distribution as your old data, you may be able to get away with only modifying the downstream branches of your trees, which should be more efficient.

But if that assumption isn't true, then the new data changes your trees closer to the root, and there's little benefit.
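A rough sketch of what such a cache could look like for a single fork: keep running counts of which side each example goes to and which label it has, and recompute the mutual information from the counts whenever needed. The `SplitStats` class is a hypothetical helper, and sklearn's trees do not expose anything like it; this only illustrates the bookkeeping.

```python
import math
from collections import Counter


class SplitStats:
    """Running (side, label) counts for one candidate split (hypothetical helper)."""

    def __init__(self):
        self.joint = Counter()  # (goes_left, label) -> count
        self.n = 0

    def update(self, goes_left, label):
        # O(1) per new example; no past rows need to be stored.
        self.joint[(goes_left, label)] += 1
        self.n += 1

    def mutual_information(self):
        """Mutual information in bits between split side and label."""
        side, lab = Counter(), Counter()
        for (s, l), c in self.joint.items():
            side[s] += c
            lab[l] += c
        mi = 0.0
        for (s, l), c in self.joint.items():
            # p(s, l) * log2( p(s, l) / (p(s) * p(l)) )
            mi += (c / self.n) * math.log2(c * self.n / (side[s] * lab[l]))
        return mi


informative = SplitStats()
for goes_left, label in [(True, 1), (True, 1), (False, 0), (False, 0)]:
    informative.update(goes_left, label)

useless = SplitStats()
for goes_left, label in [(True, 1), (True, 0), (False, 1), (False, 0)]:
    useless.update(goes_left, label)
```

If the score of a fork near the root drifts as new data arrives, that is the signal that the distribution assumption has broken and the tree needs rebuilding from higher up.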

monkeysingmonkeynew OP t1_j4pjoxj wrote on January 17, 2023 at 10:40 AM

Thanks! I'll muse this over

ClayStep t1_j4s2dfp wrote on January 17, 2023 at 9:34 PM

Hackiest solution I can think of:

Just add new trees to the forest trained on the new data and weight the trees by how new the data is...(assuming we care more about the new data)

(probably a terrible idea)
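A minimal sketch of the weighting part, assuming each tree is tagged with the age of the data it was trained on and that weights decay exponentially with a chosen half-life (both are assumptions, not something stated above). In scikit-learn, setting `warm_start=True` and raising `n_estimators` before calling `fit` again adds trees fit on the data passed to that call, but the forest averages its trees uniformly, so a weighted average like this would have to be computed by hand from the per-tree predictions.

```python
def recency_weights(ages_in_days, half_life_days):
    """Normalized weights that halve for every `half_life_days` of age."""
    raw = [0.5 ** (age / half_life_days) for age in ages_in_days]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_forest_predict(tree_predictions, weights):
    """Weighted average of per-tree predictions.

    tree_predictions[t][i] is tree t's prediction for row i.
    """
    n_rows = len(tree_predictions[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, tree_predictions))
        for i in range(n_rows)
    ]


# Two trees: one trained on today's data, one on data from a week ago.
weights = recency_weights(ages_in_days=[0, 7], half_life_days=7)
blended = weighted_forest_predict([[3.0], [0.0]], weights)
```

With a 7-day half-life, the week-old tree gets half the raw weight of today's tree, so the weights come out as 2/3 and 1/3.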

blimpyway t1_j4pqj1b wrote on January 17, 2023 at 12:04 PM

Search for "online random forests"; there are quite a few papers and articles on the subject. I assume they aren't a simple use of normal RFs, or it wouldn't be worth publishing a paper on the subject.

__lawless t1_j4pzotl wrote on January 17, 2023 at 1:32 PM

A lot of folks here already mentioned online learning and the resources for it. However, I am going to offer a very hacky solution inspired by the idea of boosting. Suppose you already had a trained regression model. Make predictions for the new training batch and calculate the errors. Now train a new random forest model on the residual errors. For inference, just pass the features to both models and sum the results.

monkeysingmonkeynew OP t1_j4r6lwj wrote on January 17, 2023 at 6:20 PM

This sounds pretty cool, but I don't follow every step. By "calculate the errors", do you mean, for example, subtracting the predicted probabilities from the actual outcome?

Also, I didn't get your last part about inference, what exactly are you referring to there?

__lawless t1_j4r9ebs wrote on January 17, 2023 at 6:37 PM

OK, let me elaborate a bit. Imagine the old model is called m_0. Your newly obtained training data is X, y (features and labels, respectively). Now calculate the residual error, which is the difference between y and the prediction of m_0: dy = y - m_0(X). Now train a new model m_1 with X as the features and dy as the labels. Finally, at inference time the prediction is the sum of the two models: y_pred = m_0(X_new) + m_1(X_new).
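The steps above can be sketched as follows. The helpers only assume models with `fit(X, y)` and `predict(X)` methods, which is the interface sklearn's `RandomForestRegressor` exposes; `MeanModel` is a hypothetical stand-in used here so the example runs without any dependencies.

```python
class MeanModel:
    """Stand-in regressor that always predicts the training mean (illustrative only)."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def fit_residual_model(m_0, make_model, X, y):
    """Train m_1 on dy = y - m_0(X) using only the new batch."""
    dy = [yi - pi for yi, pi in zip(y, m_0.predict(X))]
    return make_model().fit(X, dy)


def predict_sum(m_0, m_1, X_new):
    """y_pred = m_0(X_new) + m_1(X_new)."""
    return [a + b for a, b in zip(m_0.predict(X_new), m_1.predict(X_new))]


# m_0 was trained on old data where y was around 1; the new batch has y around 3.
m_0 = MeanModel().fit([[0.0], [1.0], [2.0]], [1.0, 1.0, 1.0])
X, y = [[3.0], [4.0]], [3.0, 3.0]
m_1 = fit_residual_model(m_0, MeanModel, X, y)
y_pred = predict_sum(m_0, m_1, [[5.0]])
```

Only the new batch is touched when fitting m_1, so the daily cost no longer grows with the size of the history.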

monkeysingmonkeynew OP t1_j4un2xm wrote on January 18, 2023 at 11:13 AM

OK, I can almost see this working, thanks for the suggestion. The only thing that would prevent me from implementing this solution is that taking the sum of the two models would let m_1 contribute as much to the result as m_0. However, I expect a single day's data to be noisy, so I would need the contribution of the new day's data to be down-weighted somehow.
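One possible way to down-weight the new model, borrowed from the shrinkage idea in boosting, is to predict m_0(X_new) + w * m_1(X_new) with a weight w between 0 and 1. The sketch below picks w by least squares on a held-out validation set and clips it to [0, 1]; the function names, the clipping, and the use of a held-out set are all assumptions made for illustration.

```python
def fit_blend_weight(y_val, pred0, pred1):
    """Least-squares w for y ~ pred0 + w * pred1, clipped to [0, 1]."""
    num = sum((y - p0) * p1 for y, p0, p1 in zip(y_val, pred0, pred1))
    den = sum(p1 * p1 for p1 in pred1)
    if den == 0.0:
        return 0.0  # m_1 predicts zero everywhere, so its weight is irrelevant
    return min(1.0, max(0.0, num / den))


def blended_prediction(pred0, pred1, w):
    """y_pred = m_0(X) + w * m_1(X), given the two models' predictions."""
    return [p0 + w * p1 for p0, p1 in zip(pred0, pred1)]


# Held-out targets sit halfway between m_0 alone and the plain sum.
w = fit_blend_weight(y_val=[1.5, 2.5], pred0=[1.0, 2.0], pred1=[1.0, 1.0])
y_pred = blended_prediction([1.0, 2.0], [1.0, 1.0], w)
```

A fixed small w chosen by hand would also work; fitting it on held-out data just lets the validation error decide how much to trust the noisy day.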

Equivalent-Way3 t1_j4wjuxe wrote on January 18, 2023 at 7:27 PM

XGBoost can do this and you can set its hyperparameters so that it's a random forest

monkeysingmonkeynew OP t1_j4z1fum wrote on January 19, 2023 at 6:26 AM

Thanks! Do you have any more info on how to do it with XGBoost?

Equivalent-Way3 t1_j50y33r wrote on January 19, 2023 at 5:13 PM

Yep very simple. Say you have model1 that you trained already, then you just use the xgb_model argument in your next training.

In R (Python should be the same or close to it)

new_model <- xgb.train(data = new_data, xgb_model = model1, blah blah blah)
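A rough Python counterpart, as a sketch: `xgboost.train` also accepts an `xgb_model` argument for continuing from an existing booster. The parameter values below follow the usual recipe for random-forest-style training in XGBoost (`num_parallel_tree` trees per round, row and column subsampling, `learning_rate` of 1), but the specific numbers and both helper names are assumptions. Note that continuing training adds a new boosting round on top of the old model, so the result is a boosted sequence of forests rather than one larger forest.

```python
def random_forest_params(n_trees, objective="reg:squarederror"):
    """XGBoost settings under which each boosting round grows a random forest."""
    return {
        "objective": objective,
        "learning_rate": 1.0,         # no shrinkage within the forest round
        "num_parallel_tree": n_trees, # trees grown per boosting round
        "subsample": 0.8,             # row subsampling per tree (assumed value)
        "colsample_bynode": 0.8,      # feature subsampling per split (assumed value)
    }


def continue_training(prev_booster, X_new, y_new, params, rounds=1):
    """Add `rounds` boosting rounds fit on the new batch to an existing booster."""
    import xgboost as xgb  # imported lazily; requires the xgboost package

    dnew = xgb.DMatrix(X_new, label=y_new)
    return xgb.train(params, dnew, num_boost_round=rounds, xgb_model=prev_booster)


params = random_forest_params(n_trees=100)
```

Passing a smaller `learning_rate` for the continuation rounds is one way to shrink the contribution of a single noisy day.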
