Bayesian Additive Regression Trees (BART)

  • A Bayesian approach to nonparametric function estimation using regression trees.

    • Regression trees rely on recursive binary partitioning of the predictor space into a set of hyper-rectangles to approximate an unknown function.

  • BART is defined by two major concepts:

    • Sum-Of-Trees Model

    • Regularized Priors


Sum-Of-Trees Model


  • Drawbacks of a single decision tree:

    • It tends to over-emphasize interactions between the variables.

    • It has difficulty capturing linear relationships.

  • Alternative: fit many small models (sequential weak learners) using a back-fitting algorithm.

  • The sum-of-trees model is

    Y = g(x; T1, M1) + g(x; T2, M2) + ... + g(x; Tm, Mm) + ε,   ε ∼ N(0, σ²)

where Tj → a binary regression tree (its structure and splitting rules)

Mj → the set of terminal node parameters (μ1j, ..., μbj) of Tj

  • Back-fitting algorithm (see the sketch after this list):

    • Fit a small regression tree.

    • Get the fitted values from that tree.

    • Estimate the residuals = observed values − fitted values.

    • Fit the next regression tree to these residuals.

    • The process continues until m trees have been fitted.
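
  • Below is a minimal sketch of this residual-fitting loop, assuming scikit-learn's DecisionTreeRegressor as the weak learner and purely illustrative data. It only illustrates the back-fitting idea; BART itself draws each tree from its posterior rather than fitting it greedily.

    # Sketch of sequential residual fitting with m small trees (illustration only).
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(200, 3))                      # toy predictors
    y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] + rng.normal(scale=0.1, size=200)

    m = 50                       # number of small trees
    trees = []
    residual = y.copy()          # start from the raw response
    for _ in range(m):
        tree = DecisionTreeRegressor(max_depth=2)       # keep each learner weak
        tree.fit(X, residual)
        residual -= tree.predict(X)                     # residual = observed - fitted so far
        trees.append(tree)

    # The sum-of-trees prediction is the sum of the individual tree fits.
    y_hat = np.sum([t.predict(X) for t in trees], axis=0)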


Regularized Priors


  • Any tree-based model can easily overfit the data.

  • Boosted regression trees limit the tree depth and use shrinkage, multiplying the fit of each tree by a small constant determined by cross-validation.

  • Instead of relying on cross-validation as boosted regression trees do, BART places regularizing priors on the depth of each tree, the shrinkage of the terminal node means, and all other parameters:

    • Prior preference for trees to be small (few terminal nodes)

    • Prior shrinking the means at the terminal nodes toward 0 (thereby limiting the effect of the individual tree components by keeping them small)

    • Prior on σ

  • Because the Bayesian setup yields full posterior distributions, we obtain a richer set of information than the point estimates of classical regression methods.

  • The number of trees m remains a tuning parameter, chosen mainly on computational grounds, since the model output is robust to this parameter.


  • The Tj prior:

    • The Tj prior, p(Tj), is specified by three aspects:

      • the probability that a node at depth d = 0, 1, 2, … is nonterminal, given by α(1+d)^(−β) with α ∈ (0,1) and β ∈ [0,∞). Smaller α and larger β favor shallower trees (see the sketch after this list).

      • the distribution on the splitting variable assignments at each interior node. Usually, this is the uniform prior on the available variables.

      • the distribution on the splitting rule assignment in each interior node, conditional on the splitting variable. Usually, this is the uniform prior on the discrete set of available splitting values.
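
  • A short sketch of the depth prior, evaluated at α = 0.95 and β = 2, the default values recommended by Chipman et al. (2010); those defaults are the only assumption beyond the formula above.

    # Probability that a node at depth d is nonterminal: alpha * (1 + d) ** (-beta).
    # alpha = 0.95, beta = 2 are the defaults recommended by Chipman et al. (2010).
    alpha, beta = 0.95, 2.0
    for d in range(5):
        p_nonterminal = alpha * (1 + d) ** (-beta)
        print(f"depth {d}: P(nonterminal) = {p_nonterminal:.3f}")
    # depth 0: 0.950, depth 1: 0.238, depth 2: 0.106, ... -> strong preference for small trees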

  • The μij ∣ Tj prior:

    • For convenience, y is first shifted and rescaled so that the observed transformed values range from ymin = −0.5 to ymax = 0.5; the prior is then:
      μij ∼ N(0, σμ²)  where σμ = 0.5 / (k√m)   (see the sketch after this list)

    • This prior has the effect of shrinking the tree parameters μij toward zero, limiting the effect of the individual tree components by keeping them small. 

    • Note that as k and/or m is increased, this prior will become tighter and apply greater shrinkage to the μij. 

    • Chipman et al. (2010) found that values of k between 1 and 3 yield good results.
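
  • A small sketch of how the prior scale σμ = 0.5 / (k√m) tightens as k or m grows; the particular values of k and m below are arbitrary examples.

    import numpy as np

    # Prior scale of the terminal node means: sigma_mu = 0.5 / (k * sqrt(m)).
    # Larger k (more shrinkage) or more trees m both tighten the prior on each mu_ij.
    for k in (1, 2, 3):
        for m in (50, 200):
            sigma_mu = 0.5 / (k * np.sqrt(m))
            print(f"k={k}, m={m}: sigma_mu = {sigma_mu:.4f}")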

  • The σ prior:

    • BART uses the (conjugate) inverse chi-square distribution: σ² ∼ νλ / χ²ν

    • where ν is the degrees of freedom and λ is the scale.

    • We then pick a value of ν between 3 and 10 to get an appropriate shape, and a value of λ so that the qth quantile of the prior on σ is located at σ̂, that is, P(σ < σ̂) = q, where σ̂ is a rough data-based overestimate of σ (for example, the sample standard deviation of y). Values of q such as 0.75, 0.90 or 0.99 center the distribution below σ̂ (see the sketch after this list).

    • For automatic use, Chipman et al. (2010) recommend the default setting (ν,q)=(3,0.90). It is not recommended to choose ν<3 because it seems to concentrate too much mass on very small σ values, which leads to overfitting.
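
  • Below is a sketch of this calibration, assuming σ̂ has already been computed (for example, as np.std(y)) and using SciPy's chi-square quantile function. Solving P(σ < σ̂) = q under the scaled inverse chi-square prior gives λ = σ̂² · χ²ν,1−q / ν, where χ²ν,1−q is the (1−q) quantile of a chi-square with ν degrees of freedom.

    # Calibrate lambda so that P(sigma < sigma_hat) = q under sigma^2 ~ nu * lambda / chi2_nu.
    from scipy.stats import chi2

    def calibrate_lambda(sigma_hat, nu=3, q=0.90):
        # P(sigma < sigma_hat) = P(chi2_nu > nu * lam / sigma_hat**2) = q
        # => nu * lam / sigma_hat**2 = chi2.ppf(1 - q, nu)
        return sigma_hat ** 2 * chi2.ppf(1 - q, df=nu) / nu

    lam = calibrate_lambda(sigma_hat=1.0, nu=3, q=0.90)   # default (nu, q) = (3, 0.90)
    print(lam)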

Bayesian backfitting MCMC algorithm - Gibbs Sampler


  • We initialize the MCMC chain with m simple single-node trees.

  • Then iterations are repeated until satisfactory convergence is obtained. 

    • At each iteration, the sampler draws each (Tj, Mj) in turn and then σ from their conditional posteriors.

      • Given the observed data y, the Bayesian setup induces a posterior distribution on all the unknowns that determine the sum-of-trees model:

p( (T1, M1), ..., (Tm, Mm), σ | y )

  • The Gibbs sampler here entails m successive draws of (Tj, Mj), each conditional on the other m − 1 trees and σ, followed by a draw of σ from its full conditional given all m trees and y.

  • Each (Tj, Mj) draw is carried out in two successive steps:

Tj | Rj, σ
Mj | Tj, Rj, σ

where Rj denotes the partial residuals of y based on the current fits of all trees except the jth.

  • Finally, the draw of Mj is simply a set of independent draws of the terminal node μij's from a normal distribution.
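
  • The overall loop structure can be sketched as below. The helpers draw_tree, draw_leaf_values and draw_sigma are hypothetical placeholders for the actual conditional draws described in Chipman et al. (2010) (the Metropolis-Hastings tree moves, the conjugate normal draws of the μij's, and the inverse chi-square draw of σ).

    # Skeleton of the backfitting Gibbs sampler (schematic only).
    # draw_tree, draw_leaf_values and draw_sigma are hypothetical placeholders
    # for the conditional draws described in Chipman et al. (2010).
    import numpy as np

    def bart_gibbs(X, y, m, n_iter, draw_tree, draw_leaf_values, draw_sigma):
        trees = [None] * m                           # m single-node trees to start
        fits = [np.zeros_like(y) for _ in range(m)]  # current fit of each tree
        sigma = y.std()
        draws = []
        for _ in range(n_iter):
            for j in range(m):
                # Partial residual: y minus the fits of all trees except the j-th.
                R_j = y - (sum(fits) - fits[j])
                trees[j] = draw_tree(trees[j], X, R_j, sigma)        # T_j | R_j, sigma
                fits[j] = draw_leaf_values(trees[j], X, R_j, sigma)  # fit from M_j | T_j, R_j, sigma
            sigma = draw_sigma(y - sum(fits))                        # sigma | (T_1,M_1),...,(T_m,M_m), y
            draws.append((sum(fits).copy(), sigma))                  # store the f* draw and sigma
        return draws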

Posterior inference statistics


  • Each posterior draw of ((T1, M1), ..., (Tm, Mm)) induces a draw of the sum-of-trees function f*(·) = g(·; T1, M1) + ... + g(·; Tm, Mm).


  • Thus, by running the algorithm long enough after a suitable burn-in period, the sequence of f* draws, say f*1, ..., f*K, may be regarded as an approximate, dependent sample of size K from p(f | y).

  • To estimate f(x) or predict Y at a particular x, in-sample or out-of-sample, a natural choice is the average of the after-burn-in sample f*1, ..., f*K, which approximates the posterior mean E(f(x) | y).
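
  • A short sketch of these posterior summaries, assuming the after-burn-in draws are stored in an array of shape (K, n) whose k-th row is f*k evaluated at n points x (for example, collected from the sampler skeleton above).

    import numpy as np

    # posterior_draws: array of shape (K, n); row k holds f*_k evaluated at n points x.
    def summarize_posterior(posterior_draws, lower=0.025, upper=0.975):
        point_estimate = posterior_draws.mean(axis=0)      # approximates E(f(x) | y)
        ci_low, ci_high = np.quantile(posterior_draws, [lower, upper], axis=0)
        return point_estimate, ci_low, ci_high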