A weak classifier is one that performs just slightly better than random guessing. Boosting refers to the general problem of producing a very accurate classifier by combining rough and moderately inaccurate weak classifiers. Boosting thus consists of sequentially applying the weak classifiers to repeatedly modified versions of the data in order to produce a sequence of weak predictions $h_t(x)$, $t = 1, 2, \ldots, T$, where $h_t: X \to \{-1, +1\}$ and $X$ denotes the domain of the training observations. The predictions are then combined through a weighted majority vote to produce the final prediction.
More specifically, the algorithm takes as input a training observation set $Z = \{(x_1, y_1), \ldots, (x_M, y_M)\}$, where the predictor variables $x_i \in X$ belong to the same domain and each label $y_i$ belongs to $\{-1, +1\}$.
The principal idea of a boosting algorithm is to consider a set of weights $\{w_t(i)\}$, $t = 1, \ldots, T$, $i = 1, \ldots, M$, over the training observations $Z$.
Initially all of the weights are set to $w_1(i) = 1/M$, $i = 1, \ldots, M$, but each classifier $h_{t-1}$ produces a new distribution of weights $w_t(i)$ by modifying the individual weights: the weights of misclassified observations are increased and the weights of correctly classified observations are decreased. The weak classifier is thereby forced to focus on the hard examples in the training set, and observations that are difficult to classify correctly receive ever-increasing influence. Each successive classifier is thus forced to concentrate on those observations that are missed by the previous ones in the sequence.
More formally, as Freund and Schapire (1999) show, the AdaBoost algorithm is as follows:
Initialize $w_1(i) = 1/M$, $i = 1, \ldots, M$.
For t = 1, . . . , T:
• Fit a classifier $h_t(x)$ to the training data using weights $w_t(i)$.
• Compute
$$\varepsilon_t = \frac{\sum_{i=1}^{M} w_t(i)\, I\big(y_i \neq h_t(x_i)\big)}{\sum_{i=1}^{M} w_t(i)} \qquad (1)$$
• Compute
$$\alpha_t = \frac{1}{2}\ln\!\left(\frac{1-\varepsilon_t}{\varepsilon_t}\right) \qquad (2)$$
• Update
$$w_{t+1}(i) = \frac{w_t(i)\exp\big(-\alpha_t\, y_i\, h_t(x_i)\big)}{Z_t}, \quad i = 1, \ldots, M \qquad (3)$$
where $Z_t$ is a normalization factor (selected so that $w_{t+1}$ will be a distribution). The final output will be
$$H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right) \qquad (4)$$
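To make the procedure concrete, the following is a minimal Python sketch of the steps above. The `weak_learner(X, y, w)` routine is an assumption of this sketch (not part of the original text): it stands for any base learner that fits a classifier to the sample weighted by $w$ and returns a callable producing labels in $\{-1, +1\}$.

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    """Sketch of the AdaBoost loop for labels y in {-1, +1}.

    `weak_learner(X, y, w)` is an assumed helper: it fits a classifier to
    the sample weighted by w and returns a callable h with h(X) in {-1, +1}.
    """
    M = len(y)
    w = np.full(M, 1.0 / M)                          # w_1(i) = 1/M
    classifiers, alphas = [], []

    for t in range(T):
        h = weak_learner(X, y, w)                    # fit h_t using weights w_t(i)
        pred = h(X)
        eps = np.sum(w * (pred != y)) / np.sum(w)    # weighted error, eq. (1)
        if eps <= 0.0 or eps >= 0.5:                 # stop if perfect or no better than random
            break
        alpha = 0.5 * np.log((1.0 - eps) / eps)      # eq. (2)
        w = w * np.exp(-alpha * y * pred)            # reweight the observations, eq. (3)
        w /= np.sum(w)                               # divide by Z_t so w_{t+1} is a distribution
        classifiers.append(h)
        alphas.append(alpha)

    def H(X_new):
        # weighted majority vote of the weak predictions, eq. (4)
        return np.sign(sum(a * h(X_new) for a, h in zip(alphas, classifiers)))

    return H
```

The early exit when $\varepsilon_t \geq 0.5$ simply reflects the requirement that each weak classifier perform better than random guessing.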
Given a set of candidate models, Bayesian model averaging consists of taking a weighted average of the individual predictions, with weights proportional to the posterior probability of each model.
Thus, when we have a set of candidate models $h_t$, $t = 1, \ldots, T$, for our training set $Z = \{(x_1, y_1), \ldots, (x_M, y_M)\}$, it is also possible to use the Bayesian model-averaging approach as a way of improving on the individual predictions, taking a weighted average of the predictions with weights proportional to the rate of success $s_t / \sum_{j=1}^{T} s_j$ of the model $h_t$ on the training set $Z$, where $s_t = \sum_{i=1}^{M} I\big(y_i = h_t(x_i)\big)$ (see Hastie et al., 2001, for a formal demonstration). Therefore, the Bayesian model-averaging approach gives weight to each model depending on how well it fits.
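As a rough illustration of this weighting scheme (a sketch under the assumption that each candidate model is a callable returning labels in $\{-1, +1\}$), the success-rate weights and the resulting weighted prediction could be computed as follows:

```python
import numpy as np

def success_rate_weights(models, X, y):
    # s_t = sum_i I(y_i = h_t(x_i)); weights proportional to s_t / sum_j s_j
    s = np.array([np.sum(model(X) == y) for model in models], dtype=float)
    return s / s.sum()

def weighted_vote(models, weights, X_new):
    # weighted average of the individual predictions, thresholded back to {-1, +1}
    return np.sign(sum(w * model(X_new) for w, model in zip(weights, models)))
```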
On the other hand, we also consider committee methods, which take a simple unweighted average of the predictions from each model, giving equal probability to each model.
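Under the same assumptions as the sketch above, the committee prediction simply averages the individual predictions with equal weight $1/T$:

```python
import numpy as np

def committee_vote(models, X_new):
    # unweighted average of the predictions; every model receives weight 1/T
    return np.sign(sum(model(X_new) for model in models) / len(models))
```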