Bayesian Additive Regression Trees (BART)
A Bayesian approach to nonparametric function estimation using regression trees.
Regression trees rely on recursive binary partitioning of the predictor space into a set of hyper-rectangles to approximate an unknown function.
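To make the partitioning concrete, here is a small illustrative Python sketch (not part of BART itself, and assuming scikit-learn and NumPy are available): a depth-2 regression tree splits a two-dimensional predictor space into axis-aligned rectangles and predicts a constant in each.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Illustrative data: two predictors, one noisy response.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.sin(4 * X[:, 0]) + X[:, 1] + rng.normal(scale=0.1, size=200)

# A depth-2 tree: at most 4 terminal nodes, i.e. 4 hyper-rectangles.
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["x1", "x2"]))  # the binary splits defining the rectangles
```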
BART is defined by two major concepts:
Sum-Of-Trees Model
Regularized Priors
Sum-Of-Trees Model
Drawbacks of a single decision tree:
It tends to over-emphasize interactions between the variables.
It has difficulty capturing linear relationships.
Alternative: fit many small models (sequential weak learners) using a back-fitting algorithm, giving the sum-of-trees model
Y = g(x; T1, M1) + ⋯ + g(x; Tm, Mm) + ε,  ε ∼ N(0, σ²)
where Tj → a binary regression tree
Mj → its terminal-node parameters (the μij)
Back-Fitting Algorithm:
Fit a small regression tree.
Get the fitted values from that tree.
Compute the residuals = observed values − fitted values.
Fit the next regression tree on those residuals.
The process continues until m trees have been fitted (see the sketch after this list).
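As a rough, non-Bayesian illustration of this fit-to-residuals idea (the Bayesian treatment with priors and MCMC comes later), here is a minimal Python sketch assuming scikit-learn; the number of trees, tree depth, and shrinkage constant are illustrative choices, not BART defaults.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_sum_of_trees(X, y, m=50, max_depth=2, shrinkage=0.1):
    """Fit m small regression trees sequentially, each to the residuals
    left by the previous trees (a boosting-style caricature of backfitting)."""
    trees, residual = [], np.asarray(y, dtype=float).copy()
    for _ in range(m):
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)                      # fit a small regression tree
        residual -= shrinkage * tree.predict(X)    # residuals = observed - fitted
        trees.append(tree)
    return trees

def predict_sum_of_trees(trees, X, shrinkage=0.1):
    # Use the same shrinkage that was used during fitting.
    return shrinkage * sum(tree.predict(X) for tree in trees)
```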
Regularized Priors
Any tree-based model can easily overfit the data.
Boosted regression trees limit the tree depth and use shrinkage, multiplying the fit of each tree by a small constant determined by cross-validation.
Instead of relying on cross-validation as boosted regression trees do, BART places regularizing priors on the depth of each tree, the shrinkage of the terminal-node means, and all other parameters:
A prior preferring trees to be small (few terminal nodes).
A prior shrinking the means at the terminal nodes toward 0 (thereby limiting the effect of the individual tree components by keeping them small).
A prior on the residual variance σ².
By using priors, we obtain a richer set of information than the point estimates of classical regression methods.
The number of trees m remains a tuning parameter, chosen mainly for computational reasons, since the model output is robust to it.
The Tj prior:
The tree prior p(Tj) is specified by three aspects:
the probability that a node at depth d = 0, 1, 2, … is nonterminal, given by α(1 + d)^(−β) with α ∈ (0,1) and β ∈ [0,∞) (illustrated in the sketch after this list);
the distribution on the splitting variable assignments at each interior node. Usually, this is the uniform prior on available variables
the distribution on the splitting rule assignment in each interior node, conditional on the splitting variable. Usually, this is the uniform prior on the discrete set of available splitting values.
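A quick numerical check of this depth penalty, a minimal Python sketch using the default values α = 0.95 and β = 2 recommended by Chipman et al. (2010):

```python
# Probability that a node at depth d is nonterminal: alpha * (1 + d)^(-beta).
def p_nonterminal(d, alpha=0.95, beta=2.0):
    return alpha * (1.0 + d) ** (-beta)

for d in range(4):
    print(f"depth {d}: P(nonterminal) = {p_nonterminal(d):.3f}")
# depth 0: 0.950, depth 1: 0.238, depth 2: 0.106, depth 3: 0.059
# Deeper nodes are increasingly likely to be terminal, so trees stay small.
```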
The μij ∣ Tj prior:
For convenience, we first shift and rescale y so that the observed transformed values range from y_min = −0.5 to y_max = 0.5. The prior is then:
μij ∼ N(0, σμ²), where σμ = 0.5/(k√m). This prior shrinks the tree parameters μij toward zero, limiting the effect of the individual tree components by keeping them small.
Note that as k and/or m increases, this prior becomes tighter and applies greater shrinkage to the μij.
Chipman et al. (2010) found that values of k between 1 and 3 yield good results.
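A quick check of how k and m control this shrinkage (a minimal sketch; the particular k and m values are just examples):

```python
import numpy as np

# Prior standard deviation of each terminal-node mean, sigma_mu = 0.5 / (k * sqrt(m)),
# after y has been rescaled to the range (-0.5, 0.5).
def sigma_mu(k, m):
    return 0.5 / (k * np.sqrt(m))

for k in (1, 2, 3):
    for m in (50, 200):
        print(f"k={k}, m={m}: sigma_mu = {sigma_mu(k, m):.4f}")
# Larger k and/or m gives a smaller sigma_mu, i.e. stronger shrinkage of the mu_ij toward 0.
```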
The σ prior:
We use the inverse chi-square distribution: σ² ∼ νλ/χ²ν,
where ν is the degrees of freedom and λ is the scale.
We then pick a value of ν between 3 and 10 to get an appropriate shape, and a value of λ so that the qth quantile of the prior on σ is located at σ̂, a rough data-based estimate of σ (for example, the sample standard deviation of y); that is, P(σ < σ̂) = q. We consider values of q such as 0.75, 0.90 or 0.99 to center the distribution below σ̂.
For automatic use, Chipman et al. (2010) recommend the default setting (ν,q)=(3,0.90). It is not recommended to choose ν<3 because it seems to concentrate too much mass on very small σ values, which leads to overfitting.
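Given ν, q, and σ̂, the scale λ can be obtained in closed form from the chi-square quantile function. A minimal sketch using SciPy (the value σ̂ = 1.0 below is only a placeholder for whatever rough data-based estimate is used):

```python
from scipy.stats import chi2

def calibrate_lambda(sigma_hat, nu=3, q=0.90):
    """Choose lambda so that P(sigma < sigma_hat) = q under sigma^2 ~ nu*lambda/chi2_nu.
    Since sigma^2 = nu*lambda/X with X ~ chi2(nu),
    P(sigma^2 < sigma_hat^2) = P(X > nu*lambda/sigma_hat^2) = q,
    hence lambda = sigma_hat^2 * chi2.ppf(1 - q, nu) / nu."""
    return sigma_hat**2 * chi2.ppf(1.0 - q, df=nu) / nu

# Default setting (nu, q) = (3, 0.90) with a placeholder estimate sigma_hat = 1.0:
print(calibrate_lambda(sigma_hat=1.0))  # ~0.195
```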
Bayesian backfitting MCMC algorithm - Gibbs Sampler
We initialize the MCMC chain with m simple single-node trees.
Then iterations are repeated until satisfactory convergence is obtained.
At each iteration:
Given the observed data y, the Bayesian setup induces a posterior distribution over all the unknowns that determine the sum-of-trees model,
p((T1, M1), ..., (Tm, Mm), σ | y).
The Gibbs sampler here entails m successive draws of (Tj, Mj), each conditional on the other trees and σ, carried out in two successive steps:
Tj | Rj, σ
Mj | Tj, Rj, σ
where Rj = y − Σ_{k≠j} g(x; Tk, Mk) is the partial residual based on the fit of all trees except the jth.
Finally, the draw of Mj is simply a set of independent draws of the terminal-node μij's from normal distributions, and each iteration ends with a draw of σ from its full conditional given all the trees and terminal-node parameters (a sketch of this loop structure follows).
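To make the loop structure concrete, here is a deliberately simplified Python sketch in which every tree is kept as a single terminal node, so the tree-structure moves are omitted and only the μj and σ draws remain; this is a toy illustration of the backfitting Gibbs loop, not the full BART sampler.

```python
import numpy as np

def toy_backfitting_gibbs(y, n_iter=1000, m=50, k=2.0, nu=3, lam=0.1, seed=0):
    """Gibbs sampler for a sum of m single-node 'trees' (constants) plus noise.
    For each j: form the partial residual R_j, draw mu_j from its normal full
    conditional; then draw sigma^2 from its inverse chi-square full conditional."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = len(y)
    sigma_mu = 0.5 / (k * np.sqrt(m))   # prior sd of each terminal-node mean
    mu = np.zeros(m)                    # initialize with m single-node components
    sigma2 = np.var(y)
    draws = []
    for _ in range(n_iter):
        for j in range(m):
            r_j = y - (mu.sum() - mu[j])               # partial residual without component j
            post_var = 1.0 / (n / sigma2 + 1.0 / sigma_mu**2)
            post_mean = post_var * r_j.sum() / sigma2
            mu[j] = rng.normal(post_mean, np.sqrt(post_var))
        resid = y - mu.sum()
        # sigma^2 | rest ~ (nu*lam + sum(resid^2)) / chi2_{nu + n}
        sigma2 = (nu * lam + np.sum(resid**2)) / rng.chisquare(nu + n)
        draws.append((mu.copy(), np.sqrt(sigma2)))
    return draws
```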
Posterior inference statistics
The induced sequence of sum-of-trees functions is f*(·) = g(·; T1*, M1*) + ⋯ + g(·; Tm*, Mm*), one function per posterior draw of ((T1, M1), ..., (Tm, Mm)).
Thus, by running the algorithm long enough after a suitable burn-in period, the sequence of f* draws, say f*_1, ..., f*_K, may be regarded as an approximate, dependent sample of size K from p(f | y).
To estimate f(x) or predict Y at a particular x, in-sample or out-of-sample, a natural choice is the average of the after-burn-in sample f*_1, ..., f*_K, which approximates the posterior mean E(f(x) | y).
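For example, given an array of after-burn-in draws evaluated at a set of test points, the posterior mean and pointwise credible intervals can be summarized as follows (a minimal sketch; the (K, n_points) shape convention is an assumption):

```python
import numpy as np

def summarize_posterior(f_draws, interval=0.95):
    """f_draws[k, i] = f*_k(x_i): the k-th after-burn-in draw evaluated at x_i.
    Returns the posterior mean estimate and pointwise credible interval bounds."""
    lower = (1.0 - interval) / 2.0
    post_mean = f_draws.mean(axis=0)                         # approximates E(f(x) | y)
    ci = np.quantile(f_draws, [lower, 1.0 - lower], axis=0)  # e.g. 2.5% and 97.5% quantiles
    return post_mean, ci
```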