Submitted by monkeysingmonkeynew t3_10e7cls in MachineLearning
I'm building some random forest models in sklearn using a dataset that updates daily. I want to take advantage of the new stream of data, which could indicate changes in the X-y relationship, but I've also found that my model performs better with more data. The problem is that retraining takes a seriously long time (the dataset is around 250,000 rows and 50 features). Is there an approach where one builds the model at the beginning of the data stream and then updates its parameters as new data arrives, instead of retraining on the entire dataset every day? Many thanks!
SatoshiNotMe t1_j4pp5zy wrote
This is called Online Learning, as opposed to Batch Learning. It’s a somewhat neglected topic in terms of available packages, but there is one here (it has decision trees, not RF):
https://github.com/online-ml/river
There is a nice interview with the author on The Machine Learning Podcast:
https://podcasts.apple.com/us/podcast/the-machine-learning-podcast/id1626358243?i=1000577393019
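To make the online-vs-batch distinction concrete, here is a minimal pure-Python sketch of the update-as-data-arrives pattern. It is not river's API and not a random forest: the class `OnlineLogisticRegression`, its `learn_one` / `predict_proba_one` methods, and the synthetic `daily_stream` generator are all hypothetical names made up for illustration, and a simple logistic regression trained by SGD stands in for the real model. The point is only that each new day's rows are scored first, then used to update the model once, with no retraining on past data.

```python
import math
import random


class OnlineLogisticRegression:
    """Hypothetical minimal online learner: one SGD step per sample."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba_one(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x, y):
        # Gradient of the log loss for a single (x, y) pair.
        err = self.predict_proba_one(x) - y
        for i, xi in enumerate(x):
            self.w[i] -= self.lr * err * xi
        self.b -= self.lr * err


def daily_stream(n_days, rows_per_day, seed=0):
    """Synthetic stand-in for a dataset that grows by one batch per day."""
    rng = random.Random(seed)
    for _ in range(n_days):
        batch = []
        for _ in range(rows_per_day):
            x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
            y = 1 if x[0] + x[1] > 0 else 0
            batch.append((x, y))
        yield batch


model = OnlineLogisticRegression(n_features=2)
daily_accuracy = []
for batch in daily_stream(n_days=20, rows_per_day=200):
    correct = 0
    for x, y in batch:
        # Test-then-train: score the new row before learning from it.
        pred = 1 if model.predict_proba_one(x) >= 0.5 else 0
        correct += int(pred == y)
        model.learn_one(x, y)
    daily_accuracy.append(correct / len(batch))
```

Each day costs only as much as that day's batch, regardless of how much history has accumulated. The trade-off is that the model sees every row once, so whether it matches a forest retrained on all 250,000 rows is something to check on your own data.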