Submitted by monkeysingmonkeynew t3_10e7cls in MachineLearning
I'm building some random forest models in sklearn using a dataset that updates daily. I want to take advantage of the new stream of data, which could indicate changes in the X-y relationship; however, I've also found that my model performs better with more data. The problem is that it takes a seriously long time to run (the dataset is around 250,000 rows and 50 features). Is there an approach where one builds the model at the beginning of the data stream and then updates its parameters as new data arrives, instead of retraining the model on the entire dataset every day? Many thanks!
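One partial workaround within sklearn (not a true online update, just a sketch of the `warm_start` option): existing trees are never modified, but setting `warm_start=True` and bumping `n_estimators` lets you grow additional trees on each new batch without refitting the old ones. The variables `X_initial`, `y_initial`, and `daily_batches` below are hypothetical placeholders for your own data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# warm_start=True keeps already-fitted trees and only fits the newly added ones
rf = RandomForestRegressor(n_estimators=200, warm_start=True, n_jobs=-1)
rf.fit(X_initial, y_initial)           # full fit on the historical data once

for X_day, y_day in daily_batches:     # hypothetical stream of daily batches
    rf.n_estimators += 20              # reserve room for 20 extra trees
    rf.fit(X_day, y_day)               # only the 20 new trees see this batch
```

The obvious caveat is that the old trees never see the new data, so the ensemble just averages the old and new regimes; if the X-y relationship genuinely drifts you'd still want periodic full retrains, or a streaming-tree library (e.g. River's Hoeffding-tree-based models) built for this.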
thiru_2718 t1_j4piklu wrote
Interesting question. My intuition is that if you could maintain a continuously-updated cache of the metric you're using to split your branches (i.e., continuously recompute the mutual information at each split), and we assume your new data roughly follows the same distribution as your old data, you may be able to get away with only modifying the downstream branches of your trees, which should be more efficient.
But if that assumption isn't true, then the new data changes your trees closer to the root, and there's little benefit.
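To make the cached-metric idea concrete, here is a rough sketch (a hypothetical class, not an existing library API) of how the split statistic at one node could be kept up to date: it holds class counts on either side of a candidate threshold and recomputes the information gain (the mutual information between the split side and the label) in O(#classes) per incoming sample, rather than rescanning all rows. This is essentially the bookkeeping that streaming decision trees such as Hoeffding trees do.

```python
import math
from collections import Counter

class SplitStats:
    """Incrementally maintained statistics for one candidate split."""

    def __init__(self, feature_idx, threshold):
        self.feature_idx = feature_idx
        self.threshold = threshold
        self.left = Counter()    # class counts where x[feature] <= threshold
        self.right = Counter()   # class counts where x[feature] >  threshold

    def update(self, x, y):
        # route the new sample to one side and bump its class count
        side = self.left if x[self.feature_idx] <= self.threshold else self.right
        side[y] += 1

    @staticmethod
    def _entropy(counts):
        n = sum(counts.values())
        if n == 0:
            return 0.0
        return -sum((c / n) * math.log2(c / n) for c in counts.values() if c > 0)

    def information_gain(self):
        # H(label) - weighted H(label | side): the metric to re-check after updates
        total = self.left + self.right
        n = sum(total.values())
        if n == 0:
            return 0.0
        n_l, n_r = sum(self.left.values()), sum(self.right.values())
        return (self._entropy(total)
                - (n_l / n) * self._entropy(self.left)
                - (n_r / n) * self._entropy(self.right))
```

In that setup, a node would only be restructured when the cached gain of its current split drops noticeably below that of some competing split, which is what makes it cheaper than refitting the whole tree each day.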