Bayesian Additive Regression Trees (BART)
A Bayesian approach to nonparametric function estimation using regression trees.
Regression trees rely on recursive binary partitioning of the predictor space into a set of hyper-rectangles to approximate an unknown function.
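To make the partitioning concrete, here is a small illustrative Python sketch (not part of BART itself, and assuming scikit-learn and NumPy are available): a depth-2 regression tree splits a two-dimensional predictor space into axis-aligned rectangles and predicts a constant in each.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Illustrative data: two predictors, one noisy response.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.sin(4 * X[:, 0]) + X[:, 1] + rng.normal(scale=0.1, size=200)

# A depth-2 tree: at most 4 terminal nodes, i.e. 4 hyper-rectangles.
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["x1", "x2"]))  # the binary splits defining the rectangles
```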
BART is defined by two major concepts:
Sum-Of-Trees Model
Regularized Priors
Sum-Of-Trees Model
Drawbacks of a single decision tree:
It tends to over-emphasize interactions between the variables.
It has difficulty capturing linear relationships.
Alternative: fit many small models (sequential weak learners) using a back-fitting algorithm, giving the sum-of-trees model
Y = g(x; T1, M1) + ⋯ + g(x; Tm, Mm) + ε,  ε ∼ N(0, σ²)
where Tj → a binary regression tree
Mj → its terminal-node parameters (the μij)
Back-Fitting Algorithm:
Fit a small regression tree.
Get the fitted values from that tree.
Compute the residuals = observed values − fitted values.
Fit the next regression tree on those residuals.
The process continues until m trees have been fitted (see the sketch after this list).
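As a rough, non-Bayesian illustration of this fit-to-residuals idea (the Bayesian treatment with priors and MCMC comes later), here is a minimal Python sketch assuming scikit-learn; the number of trees, tree depth, and shrinkage constant are illustrative choices, not BART defaults.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_sum_of_trees(X, y, m=50, max_depth=2, shrinkage=0.1):
    """Fit m small regression trees sequentially, each to the residuals
    left by the previous trees (a boosting-style caricature of backfitting)."""
    trees, residual = [], np.asarray(y, dtype=float).copy()
    for _ in range(m):
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)                      # fit a small regression tree
        residual -= shrinkage * tree.predict(X)    # residuals = observed - fitted
        trees.append(tree)
    return trees

def predict_sum_of_trees(trees, X, shrinkage=0.1):
    # Use the same shrinkage that was used during fitting.
    return shrinkage * sum(tree.predict(X) for tree in trees)
```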
Regularized Priors
Any tree-based model can easily overfit the data.
Boosted regression trees limit the tree depth and use shrinkage, multiplying the fit of each tree by a small constant determined by cross-validation.
Instead of relying on cross-validation as boosted regression trees do, BART places regularizing priors on the depth of each tree, the shrinkage of the terminal-node means, and all other parameters:
A prior preferring trees to be small (few terminal nodes).
A prior shrinking the means at the terminal nodes toward 0 (thereby limiting the effect of the individual tree components by keeping them small).
A prior on the residual variance σ².
By using priors, we obtain a richer set of information than the point estimates of classical regression methods.
The number of trees m remains a tuning parameter, chosen mainly for computational reasons, since the model output is robust to it.
The Tj prior:
The tree prior p(Tj) is specified by three aspects:
the probability that a node at depth d = 0, 1, 2, … is nonterminal, given by α(1 + d)^(−β) with α ∈ (0,1) and β ∈ [0,∞) (illustrated in the sketch after this list);
the distribution on the splitting variable assignments at each interior node. Usually, this is the uniform prior on available variables
the distribution on the splitting rule assignment in each interior node, conditional on the splitting variable. Usually, this is the uniform prior on the discrete set of available splitting values.
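A quick numerical check of this depth penalty, a minimal Python sketch using the default values α = 0.95 and β = 2 recommended by Chipman et al. (2010):

```python
# Probability that a node at depth d is nonterminal: alpha * (1 + d)^(-beta).
def p_nonterminal(d, alpha=0.95, beta=2.0):
    return alpha * (1.0 + d) ** (-beta)

for d in range(4):
    print(f"depth {d}: P(nonterminal) = {p_nonterminal(d):.3f}")
# depth 0: 0.950, depth 1: 0.238, depth 2: 0.106, depth 3: 0.059
# Deeper nodes are increasingly likely to be terminal, so trees stay small.
```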
The μij ∣ Tj prior:
For convenience, we first shift and rescale y so that the observed transformed values range from y_min = −0.5 to y_max = 0.5. The prior is then:
μij ∼ N(0, σμ²), where σμ = 0.5/(k√m). This prior shrinks the tree parameters μij toward zero, limiting the effect of the individual tree components by keeping them small.
Note that as k and/or m increases, this prior becomes tighter and applies greater shrinkage to the μij.
Chipman et al. (2010) found that values of k between 1 and 3 yield good results.
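A quick check of how k and m control this shrinkage (a minimal sketch; the particular k and m values are just examples):

```python
import numpy as np

# Prior standard deviation of each terminal-node mean, sigma_mu = 0.5 / (k * sqrt(m)),
# after y has been rescaled to the range (-0.5, 0.5).
def sigma_mu(k, m):
    return 0.5 / (k * np.sqrt(m))

for k in (1, 2, 3):
    for m in (50, 200):
        print(f"k={k}, m={m}: sigma_mu = {sigma_mu(k, m):.4f}")
# Larger k and/or m gives a smaller sigma_mu, i.e. stronger shrinkage of the mu_ij toward 0.
```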
The σ prior:
We use the inverse chi-square distribution: σ² ∼ νλ/χ²ν,
where ν is the degrees of freedom and λ is the scale.
We then pick a value of ν between 3 and 10 to get an appropriate shape, and a value of λ so that the qth quantile of the prior on σ is located at σ̂, a rough data-based estimate of σ (for example, the sample standard deviation of y); that is, P(σ < σ̂) = q. We consider values of q such as 0.75, 0.90 or 0.99 to center the distribution below σ̂.
For automatic use, Chipman et al. (2010) recommend the default setting (ν,q)=(3,0.90). It is not recommended to choose ν<3 because it seems to concentrate too much mass on very small σ values, which leads to overfitting.
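Given ν, q, and σ̂, the scale λ can be obtained in closed form from the chi-square quantile function. A minimal sketch using SciPy (the value σ̂ = 1.0 below is only a placeholder for whatever rough data-based estimate is used):

```python
from scipy.stats import chi2

def calibrate_lambda(sigma_hat, nu=3, q=0.90):
    """Choose lambda so that P(sigma < sigma_hat) = q under sigma^2 ~ nu*lambda/chi2_nu.
    Since sigma^2 = nu*lambda/X with X ~ chi2(nu),
    P(sigma^2 < sigma_hat^2) = P(X > nu*lambda/sigma_hat^2) = q,
    hence lambda = sigma_hat^2 * chi2.ppf(1 - q, nu) / nu."""
    return sigma_hat**2 * chi2.ppf(1.0 - q, df=nu) / nu

# Default setting (nu, q) = (3, 0.90) with a placeholder estimate sigma_hat = 1.0:
print(calibrate_lambda(sigma_hat=1.0))  # ~0.195
```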
Bayesian backfitting MCMC algorithm - Gibbs Sampler
We initialize the MCMC chain with m simple single-node trees.
Then iterations are repeated until satisfactory convergence is obtained.
At each iteration:
Given the observed data y, the Bayesian setup induces a posterior distribution over all the unknowns that determine the sum-of-trees model,
p((T1, M1), ..., (Tm, Mm), σ | y).
The Gibbs sampler here entails m successive draws of (Tj, Mj), each conditional on the other trees and σ, carried out in two successive steps:
Tj | Rj, σ
Mj | Tj, Rj, σ
where Rj = y − Σ_{k≠j} g(x; Tk, Mk) is the partial residual based on the fit of all trees except the jth.
Finally, the draw of Mj is simply a set of independent draws of the terminal-node μij's from normal distributions, and each iteration ends with a draw of σ from its full conditional given all the trees and terminal-node parameters (a sketch of this loop structure follows).
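To make the loop structure concrete, here is a deliberately simplified Python sketch in which every tree is kept as a single terminal node, so the tree-structure moves are omitted and only the μj and σ draws remain; this is a toy illustration of the backfitting Gibbs loop, not the full BART sampler.

```python
import numpy as np

def toy_backfitting_gibbs(y, n_iter=1000, m=50, k=2.0, nu=3, lam=0.1, seed=0):
    """Gibbs sampler for a sum of m single-node 'trees' (constants) plus noise.
    For each j: form the partial residual R_j, draw mu_j from its normal full
    conditional; then draw sigma^2 from its inverse chi-square full conditional."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = len(y)
    sigma_mu = 0.5 / (k * np.sqrt(m))   # prior sd of each terminal-node mean
    mu = np.zeros(m)                    # initialize with m single-node components
    sigma2 = np.var(y)
    draws = []
    for _ in range(n_iter):
        for j in range(m):
            r_j = y - (mu.sum() - mu[j])               # partial residual without component j
            post_var = 1.0 / (n / sigma2 + 1.0 / sigma_mu**2)
            post_mean = post_var * r_j.sum() / sigma2
            mu[j] = rng.normal(post_mean, np.sqrt(post_var))
        resid = y - mu.sum()
        # sigma^2 | rest ~ (nu*lam + sum(resid^2)) / chi2_{nu + n}
        sigma2 = (nu * lam + np.sum(resid**2)) / rng.chisquare(nu + n)
        draws.append((mu.copy(), np.sqrt(sigma2)))
    return draws
```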
Posterior inference statistics
The induced sequence of sum-of-trees functions is f*(·) = g(·; T1*, M1*) + ⋯ + g(·; Tm*, Mm*), one function per posterior draw of ((T1, M1), ..., (Tm, Mm)).
Thus, by running the algorithm long enough after a suitable burn-in period, the sequence of f* draws, say f*_1, ..., f*_K, may be regarded as an approximate, dependent sample of size K from p(f | y).
To estimate f(x) or predict Y at a particular x, in-sample or out-of-sample, a natural choice is the average of the after-burn-in sample f*_1, ..., f*_K, which approximates the posterior mean E(f(x) | y).
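For example, given an array of after-burn-in draws evaluated at a set of test points, the posterior mean and pointwise credible intervals can be summarized as follows (a minimal sketch; the (K, n_points) shape convention is an assumption):

```python
import numpy as np

def summarize_posterior(f_draws, interval=0.95):
    """f_draws[k, i] = f*_k(x_i): the k-th after-burn-in draw evaluated at x_i.
    Returns the posterior mean estimate and pointwise credible interval bounds."""
    lower = (1.0 - interval) / 2.0
    post_mean = f_draws.mean(axis=0)                         # approximates E(f(x) | y)
    ci = np.quantile(f_draws, [lower, 1.0 - lower], axis=0)  # e.g. 2.5% and 97.5% quantiles
    return post_mean, ci
```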