TheFlyingDrildo t1_j4xw5oi wrote

Susan Athey and Stefan Wager's push towards generalized random forests is a major step forward in expanding the types of estimation tasks random forests are useful for, while simultaneously providing the theory for large-sample inference.

An underlying perspective in their research (and most modern random forest theoretical research) is that random forests are effectively kernel regressors, with the forest construction adaptively and implicitly defining the kernel. The component that most influences the adaptivity of the kernel is the splitting rule - what defines how two child nodes are formed from a parent node.
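To make the kernel view concrete, here's a minimal sketch (using scikit-learn's RandomForestRegressor as a stand-in, with bootstrapping turned off so the identity is exact) of recovering the implicit kernel weights from a fitted forest - the kernel-weighted average of the training responses reproduces the forest's own prediction:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 5))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(500)

# bootstrap=False so each tree is fit on the full sample and the kernel
# identity below holds exactly; trees still differ through max_features.
forest = RandomForestRegressor(
    n_estimators=200, min_samples_leaf=5, max_features=2,
    bootstrap=False, random_state=0,
).fit(X, y)

def forest_kernel_weights(forest, X_train, x_test):
    """Implicit kernel: weight each training point by how often it shares a leaf with x_test."""
    train_leaves = forest.apply(X_train)                   # (n_train, n_trees) leaf ids
    test_leaves = forest.apply(x_test.reshape(1, -1))[0]   # (n_trees,) leaf ids
    weights = np.zeros(X_train.shape[0])
    for t in range(train_leaves.shape[1]):
        in_leaf = train_leaves[:, t] == test_leaves[t]
        weights[in_leaf] += 1.0 / in_leaf.sum()            # each tree spreads unit mass over its leaf
    return weights / train_leaves.shape[1]                 # average over trees

w = forest_kernel_weights(forest, X, X[0])
print(w @ y, forest.predict(X[:1])[0])  # kernel-weighted mean == forest prediction
```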

In the way things are implemented right now, a few techniques have been chosen for computational ease: random subspacing (controlled by an mtry hyperparameter), axis-aligned splits, and standard CART splitting rules. I think there is still a lot of work to be done here. An example of an interesting direction with respect to splitting rules is the Distributional Random Forests paper.
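For reference, these choices surface directly as hyperparameters in standard implementations (scikit-learn shown here as one example; names differ across libraries, and the axis-aligned splitting is baked in rather than configurable):

```python
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(
    n_estimators=500,
    max_features="sqrt",        # random subspacing: the mtry analog, features considered per split
    criterion="squared_error",  # standard CART splitting rule for regression
)
# Splits here are always axis-aligned; alternative splitting rules such as the
# distributional splits mentioned above require a different implementation.
```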

Edit: In terms of other hyperparameters that people care about, I have a few comments. The depth of the trees should be controlled by a min_samples_leaf parameter, which controls the local vs global trade-off in the kernel. It should pretty much always be selected in a problem-specific manner with a hyperparameter search, but generally should be quite small. Its choice is closely related to the n_trees hyperparameter, which should always be as large as you can afford computationally. An interesting research direction however may be how to adaptively figure out what value of n_trees is "good enough" - which there has been some work on through the analysis of the Purely Uniform Random Forests model.
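As a concrete illustration of that recipe (a sketch with scikit-learn naming; any implementation with these knobs works the same way): fix n_estimators as large as the budget allows and tune only min_samples_leaf.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(1000, 10))
y = X[:, 0] ** 2 + X[:, 1] + 0.1 * rng.standard_normal(1000)

# n_estimators fixed as large as the compute budget allows; only min_samples_leaf
# is searched, since it controls the local-vs-global bandwidth of the implicit kernel.
search = GridSearchCV(
    RandomForestRegressor(n_estimators=500),
    param_grid={"min_samples_leaf": [1, 2, 5, 10, 20, 50]},
    cv=5,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
```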

Lastly, bootstrapping versus subsampling (and the subsampling percentage). I believe random forests should always have the honesty property, which naturally pushes us towards subsampling for the extra flexibility in the percentage of data points in the leaves. There could be work done here to determine the appropriate percentage for the split, likely based on convergence rates in learning the tree vs estimating the leaves. There is definitely a strong interaction here with the min_samples_leaf hyperparameter. However, the extra variability induced by bootstrapping (and using out-of-bag samples for honesty) may have desirable properties for the kernel learning, though I believe it is subsampling that makes the large-sample inference theory tractable within our current understanding. Another worthy area of research.
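To illustrate the honesty idea (a minimal sketch of the general construction, not the exact grf procedure): each tree is grown on a subsample drawn without replacement, with the structure learned on one half and the leaf means re-estimated on the disjoint other half.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def honest_forest_predict(X, y, X_test, n_trees=200, subsample_frac=0.5,
                          min_samples_leaf=5, seed=0):
    """Sketch of an honest forest: tree structure from one half of a subsample,
    leaf means re-estimated on the other, disjoint half."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    preds = np.zeros((n_trees, X_test.shape[0]))
    for t in range(n_trees):
        idx = rng.choice(n, size=int(subsample_frac * n), replace=False)  # subsample, not bootstrap
        half = len(idx) // 2
        grow, est = idx[:half], idx[half:]
        tree = DecisionTreeRegressor(min_samples_leaf=min_samples_leaf, max_features="sqrt",
                                     random_state=int(rng.integers(1 << 31)))
        tree.fit(X[grow], y[grow])            # structure learned on the grow half only
        grow_leaves = tree.apply(X[grow])
        est_leaves = tree.apply(X[est])
        test_leaves = tree.apply(X_test)
        for leaf in np.unique(test_leaves):
            mask = est_leaves == leaf
            # honest leaf estimate; fall back to the grow half if the estimation
            # half never reaches this leaf
            val = y[est][mask].mean() if mask.any() else y[grow][grow_leaves == leaf].mean()
            preds[t, test_leaves == leaf] = val
    return preds.mean(axis=0)

# Usage: y_hat = honest_forest_predict(X_train, y_train, X_test)
```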
