• Figures not yet added
• From here on I'm less confident…

Chap01 SVM

All lines found by PLA separate the data equally well, but WHICH line is best?
→ the more error it can tolerate, the better (the farther the closest point is from the separating line, the better)
→ fat hyperplane (large margin) (how wide can the separating line be?)

Standard large-margin hyperplane problem

max fatness(w) (max margin)
fatness(w) = min distance($x_n$, w), over n = 1 ~ N
→ with the bias separated out: min distance($x_n$, b, w) – (1)

$w^T x = 0$ → remove $x_0$ from $x$ to get $x'$ → the hyperplane becomes $w^T x' + b = 0$

$w$ is perpendicular to the hyperplane $w^T x' + b = 0$
distance(x, b, w) = distance from x to the hyperplane

$distance(x_n, b, w) = (1/\|w\|) \cdot y_n(w^T x_n + b)$ – (2)

specialize: scale (b, w) so that $\min_n y_n(w^T x_n + b) = 1$

→ margin(b, w) $= 1/\|w\|$

$y_n(w^T x_n + b) \ge 1$ for all n is a relaxed (necessary) constraint for $\min_n y_n(w^T x_n + b) = 1$
if the optimum had all $y_n(w^T x_n + b) \ge p > 1$ → $(b/p, w/p)$ would still be feasible with a larger margin — contradiction
so under $y_n(w^T x_n + b) \ge 1$, the optimum automatically satisfies $\min_n y_n(w^T x_n + b) = 1$

max $1/\|w\|$ ⇔ min $\frac{1}{2} w^T w$

can solve (b, w) subject to the N inequality constraints
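The minimization above can be handed to any solver that accepts a quadratic objective with linear inequality constraints. A minimal sketch with scipy's generic SLSQP solver (a toy 4-point data set; a dedicated QP package would be used in practice):

```python
import numpy as np
from scipy.optimize import minimize

# toy separable data (illustrative)
X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0], [3.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

def objective(v):                 # v = (b, w1, w2): minimize (1/2) w^T w
    w = v[1:]
    return 0.5 * w @ w

# N inequality constraints: y_n (w^T x_n + b) >= 1
cons = [{"type": "ineq",
         "fun": lambda v, xn=xn, yn=yn: yn * (v[1:] @ xn + v[0]) - 1.0}
        for xn, yn in zip(X, y)]

res = minimize(objective, x0=np.zeros(3), constraints=cons)
b, w = res.x[0], res.x[1:]
# margin = min distance of any point to the separating hyperplane
margin = (y * (X @ w + b)).min() / np.linalg.norm(w)
```

For this toy set the optimum is w = (1, −1), b = −1, giving margin $1/\sqrt{2}$.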

Support Vector Machine(SVM)

support vector: boundary data (the points touching the edge of the fat line)

gradient descent? not easy with constraints
but we have:

• (convex) quadratic objective function in (b, w)
• linear constraints in (b, w)

can use quadratic programming (QP): easy

Linear Hard-Margin (all $y_n(w^T x_n + b) > 0$, i.e. perfectly separated) SVM

analogous to 'weight-decay regularization' with the constraint Ein = 0

Restricted dichotomies (insisting on a fat line): if margin < ρ, the dichotomy does not count
fewer dichotomies → smaller VC dimension → better generalization

works well combined with a (nonlinear) feature transform

Chap02 dual SVM

what if d_trans is big, or infinite? (a very complicated transform)
goal: SVM without dependence on d_trans

(QP of d_trans+1 variables -> N variables)

Use Lagrange multipliers: Dual SVM: treat the multipliers $\alpha_n$ as the variables to solve for

1. convex
2. feasible (a solution exists)
3. linear constraints

What is the optimal solution of the Dual SVM?

b can be eliminated ($\partial L/\partial b = 0 \Rightarrow \sum_n \alpha_n y_n = 0$)

substituting $w = \sum_n \alpha_n y_n z_n$ back in gives

Karush-Kuhn-Tucker (KKT) conditions

primal-inner optimal → $\alpha_n = 0$ or $1 - y_n(w^T z_n + b) = 0$ (complementary slackness)
⇒ at the optimum all 'Lagrange terms' disappear

$w = \sum_n \alpha_n y_n z_n$
b is only range-constrained by primal feasibility, but $b = y_n - w^T z_n$ for any n with $\alpha_n > 0$ (complementary slackness ⇒ that point lies exactly on the SVM boundary)

⇒ both b and w can be computed from the SVs alone

SVM: identify the useful points (SVs)

Standard hard-margin SVM dual

when N is big, $q_{n,m} = y_n y_m z_n^T z_m$ forms a dense N×N matrix, which is very big (N > 30000 needs > 3 GB of RAM)
use a special QP solver

Comparing SVM and PLA

w = linear combination of data => w represented by data
SVM: represent w by only SV
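This representation can be checked with sklearn's SVC on a made-up toy set (linear kernel, large C to approximate hard margin): `dual_coef_` stores $\alpha_n y_n$ for the SVs only, so w is recoverable from the SVs alone.

```python
import numpy as np
from sklearn.svm import SVC

# illustrative separable toy data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)      # large C ~ hard margin
# dual_coef_ holds alpha_n * y_n for the support vectors only
w_from_sv = (clf.dual_coef_ @ clf.support_vectors_).ravel()
n_sv = len(clf.support_)                          # usually far fewer than N
```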

Primal: special scaling of (b, w)
Dual: find the SVs and their Lagrange multipliers

Chap03 Kernel SVM

how to compute $z_n^T z_m$ faster (kernel trick: do the Z-space inner product in X-space)
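For instance, the degree-2 polynomial kernel $(1 + x^T x')^2$ equals the inner product of an explicit 6-dimensional transform, so the Z-space inner product never requires building z (a 2-D sketch with made-up points):

```python
import numpy as np

def phi2(x):
    # explicit degree-2 transform whose inner product equals (1 + x.x')^2
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 * x1, s * x1 * x2, x2 * x2])

def K2(x, xp):
    # the same quantity computed directly in X-space
    return (1.0 + x @ xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
lhs = phi2(x) @ phi2(xp)   # inner product in Z-space
rhs = K2(x, xp)            # kernel evaluation in X-space
```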

the kernel parameter γ affects which points become SVs

infinite-dimensional kernel (Gaussian)

Taylor-expand the exponential → an infinite-dimensional transform

large γ ⇒ sharp Gaussians ⇒ 'overfit'?

Choosing a kernel

linear kernel: no transform at all; try linear first; fast to compute
polynomial: transformed; fewer restrictions; strong physical control (choose the degree); for large degrees K tends toward extreme values
→ usually use only a modest degree
infinite-dimensional (Gaussian):
most powerful
less numerical difficulty than poly (the exponent is only a quadratic expression)
one parameter only
cons: mysterious (no explicit w), and can be too powerful

defining a new kernel is hard:
Mercer's condition: K must be symmetric and every kernel matrix positive semi-definite (necessary and sufficient)
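A numeric sanity check of the symmetry and positive semi-definite parts of Mercer's condition for the Gaussian kernel, on random toy points:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))   # 30 random toy points in 3-D

def gaussian_K(X, gamma=0.5):
    # kernel matrix K[n, m] = exp(-gamma * ||x_n - x_m||^2)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

K = gaussian_K(X)
eigvals = np.linalg.eigvalsh(K)   # all should be >= 0 (up to roundoff)
min_eig = eigvals.min()
```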

Chap04 Soft-Margin SVM

overfit reasons: the transform & hard margin (insisting on separating everything)

Soft-Margin – tolerate mistakes with an error penalty; only the correctly classified points must satisfy the margin condition

with 0/1 error counting, it is no longer a QP (and error sizes are ignored)
error size: the distance $\xi_n$ by which a point violates the fat boundary

parameter C: large when we want fewer margin violations
small when we want a larger margin and can tolerate some violation

Soft-margin Dual: fold the new constraints into the minimization (Lagrange multipliers again)

C is exactly the upper bound of $\alpha_n$: $0 \le \alpha_n \le C$

Kernel Soft Margin SVM

more flexible: always solvable

(3) → solve b (using a free SV with $0 < \alpha_n < C$):

physical meaning

not an SV ($\alpha_n = 0$): $C - \alpha_n \ne 0$ → $\xi_n = 0$ (no violation)
unbounded/free SV ($0 < \alpha_n < C$, □) → $\xi_n = 0$ → exactly on the fat boundary
bounded SV ($\alpha_n = C$, △) → $\xi_n \ge 0$ (may violate; not necessarily on the boundary)
→ only bounded SVs can violate the margin
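These cases can be observed with sklearn's SVC on overlapping toy data: `dual_coef_` exposes the $\alpha_n$, and the free SVs ($0 < \alpha_n < C$) should sit on the fat boundary, i.e. $y_n(w^T x_n + b) \approx 1$:

```python
import numpy as np
from sklearn.svm import SVC

# illustrative overlapping toy data
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, (30, 2)), rng.normal(1, 1, (30, 2))])
y = np.array([-1] * 30 + [1] * 30)

C = 1.0
clf = SVC(kernel="linear", C=C).fit(X, y)
alpha = np.abs(clf.dual_coef_).ravel()        # alpha_n of each SV
free = alpha < C - 1e-6                       # 0 < alpha_n < C
# y_n * (w^T x_n + b) for every SV
sv_margins = y[clf.support_] * clf.decision_function(X[clf.support_])
free_margins = sv_margins[free]               # should all be ~1
```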

difficult to optimize (C, γ) analytically → use validation

SVM validation

leave-one-out error <= #SV/N

→ this property can guide parameter selection (avoid models with too many SVs)
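The bound can be checked empirically on a separable toy set with a linear SVC (illustrative data; the theorem is stated for SVM's leave-one-out error):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2, 1, (15, 2)), rng.normal(2, 1, (15, 2))])
y = np.array([-1] * 15 + [1] * 15)

# leave-one-out error: retrain N times, each leaving one example out
errs = []
for tr, te in LeaveOneOut().split(X):
    clf = SVC(kernel="linear", C=10.0).fit(X[tr], y[tr])
    errs.append(clf.predict(X[te])[0] != y[te][0])
loo_err = float(np.mean(errs))

# the bound: number of SVs of the full model, divided by N
sv_bound = len(SVC(kernel="linear", C=10.0).fit(X, y).support_) / len(X)
```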

Chap05 Kernel Logistic Regression

large margin <=> fewer choices <=> L2 regularization toward short w
soft margin <=> a special error measure
larger C (in soft-margin SVM or in regularization) <=> smaller Lagrange multiplier λ <=> less regularization

We can extend SVM to other learning models!

view $(w^T z_n + b)$ as a linear score (the f(x) of PLA)

Err_svm is an upper bound of Err_0/1
(hinge error measure: $\max(0, 1 - y s)$)
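The upper-bound claim is easy to verify numerically over a grid of y·score values (both errors written as functions of $ys$):

```python
import numpy as np

def err_01(ys):
    # 0/1 error as a function of y * score: wrong when ys <= 0
    return (ys <= 0).astype(float)

def err_hinge(ys):
    # hinge error: max(0, 1 - y * score)
    return np.maximum(0.0, 1.0 - ys)

ys = np.linspace(-3, 3, 601)
gap = err_hinge(ys) - err_01(ys)   # should never be negative
```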

Err_sce: the (scaled) cross-entropy error of logistic regression, similar in shape to SVM's hinge

L2-regularized logistic regression is similar to SVM

→ use the SVM answer as the starting point for logistic regression? not faster (wastes SVM's advantage)
→ return the SVM answer directly as an approximate LogReg solution (return θ(w^T x + b))? loses LogReg's meaning (maximum likelihood)
⇒ add two degrees of freedom: return θ(A·(w^T x + b) + B)
→ often A > 0 (same direction), B ≈ 0 (little shift needed)

Platt’s Model

fine-tune the kernel SVM solution in Z-space with LogReg → used to approximate the LogReg solution in Z-space (not the exact Z-space optimum)

solve a simple two-variable LogReg to get (A, B)
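A sketch of this two-parameter fit with sklearn (for brevity the scores come from the training set, whereas Platt's method proper uses a separate validation set; the data are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-1, 1, (40, 2)), rng.normal(1, 1, (40, 2))])
y = np.array([0] * 40 + [1] * 40)

svm = SVC(kernel="rbf", C=1.0).fit(X, y)
scores = svm.decision_function(X).reshape(-1, 1)  # the SVM's linear score

# one-dimensional logistic regression on the SVM score gives (A, B)
platt = LogisticRegression().fit(scores, y)
A, B = platt.coef_[0, 0], platt.intercept_[0]
```

As the note above predicts, A comes out positive (same direction as the SVM score).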

Representer Theorem: for any L2-regularized linear problem, some optimal w is a linear combination of the $z_n$: $w_* = \sum_n \beta_n z_n$

→ solve for β

Kernel Logistic Regression (KLR)
= linear model of β

= linear model of w
with embedded-in-kernel transform & L2 regularizer

soft-margin SVM ≈ L2-regularized logistic regression with a special error measure: hinge

Chap 06 Support Vector Regression(SVR)

ridge regression: regression with (L2) regularization

Kernel Ridge Regression

$g(x) = w^T z = \sum_n \beta_n z_n^T z = \sum_n \beta_n K(x_n, x)$

the kernel gives high flexibility
linear: $O(d^3 + d^2 N)$
kernel: depends on the data size, $O(N^3)$; not fast when the data set is large

LS(least-squares)SVM = kernel ridge regression:
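The closed-form dual solution $\beta = (\lambda I + K)^{-1} y$ can be checked against sklearn's KernelRidge with the same RBF kernel (toy 1-D regression data; variable names are illustrative):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, (25, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=25)

gamma, lam = 0.5, 1.0
# Gaussian kernel matrix on the training points
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq)

# beta = (lam * I + K)^{-1} y  -- the analytic kernel ridge solution
beta = np.linalg.solve(lam * np.eye(len(X)) + K, y)

kr = KernelRidge(alpha=lam, kernel="rbf", gamma=gamma).fit(X, y)
```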

⇒ implies long computation time (β is dense)
⇒ find a sparse β?

tube regression:

insensitive error: tolerate a small deviation (err = 0 inside the tube; beyond it, only the excess counts)
the error grows more slowly (linearly) outside the tube

minimize the regularizer plus the amounts by which points exceed the tube's upper and lower edges

⇒ as long as the tube is wide enough, β is sparse
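The ε-insensitive error and the resulting sparsity can be sketched with sklearn's SVR on noisy toy data (points inside the tube contribute no error and do not become SVs):

```python
import numpy as np
from sklearn.svm import SVR

def tube_err(s, y, eps):
    # epsilon-insensitive error: zero inside the tube, linear outside
    return np.maximum(0.0, np.abs(s - y) - eps)

rng = np.random.default_rng(6)
X = rng.uniform(-3, 3, (60, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=60)

svr = SVR(kernel="rbf", C=1.0, epsilon=0.2).fit(X, y)
sparsity = len(svr.support_) / len(X)   # fraction of nonzero beta_n
```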

Linear, SVM Summary

first row: less used due to worse performance
third row: less used due to dense B
fourth row: popular in LIBSVM

Chap07 Blending and Bagging

Selection: rely on one single hypothesis
Aggregation: mix or combine hypotheses
select the most trustworthy from their usual performance
=> validation
mix the predictions => vote, with different numbers of ballots per hypothesis
combine predictions conditionally (in some situations, give more ballots to friend t)

Aggregation can act as:

1. feature transform(?): making the hypotheses stronger
2. regularization(?): moderating them
i.e. it can work both the accelerator and the brake

uniform blending: one vote per model; take the (unweighted) average

expected performance of A = expected deviation to consensus + performance of consensus
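For squared error this identity holds exactly and can be checked numerically at a single fixed x (made-up predictions and target):

```python
import numpy as np

rng = np.random.default_rng(7)
g = rng.normal(size=10)   # predictions of 10 hypotheses at one point x
y = 0.3                   # target value at that x

G = g.mean()              # consensus (uniform blending)
lhs = ((g - y) ** 2).mean()                   # avg performance of the g_t
rhs = ((g - G) ** 2).mean() + (G - y) ** 2    # avg deviation + consensus perf
```

The consensus can never do worse than the average individual, since the deviation term is non-negative.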

linear blending: weighted (linear) average, with weights ≥ 0

any blending (stacking): can use any non-linear model (powerful, but risks overfitting)

compute $g_1^-, g_2^-, \ldots$ on training data minus validation
$\Phi^-(x) = (g_1^-(x), g_2^-(x), \ldots)$
transform the validation data to $Z = (\Phi^-(x), y)$
compute $\tilde g$ = AnyModel(Z, Y)
return $G = \tilde g(\Phi(x))$
with $\Phi = (g_1, g_2, \ldots)$ retrained on all the data

for linear blending: compute α = LinearModel(Z, Y)
return $G = \alpha^T \Phi(x)$

learning: learn and aggregate at the same time

bootstrapping: simulate new data sets from the limited data at hand
bootstrap data: draw N examples from the original data (with replacement)
virtual aggregation
bootstrap aggregation (bagging): train each g on bootstrap data rather than the original data
→ a meta-algorithm on top of any [base algorithm]
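A minimal bagging sketch with decision trees as the base algorithm (illustrative toy data; for brevity accuracy is measured in-sample):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
N, T = len(X), 25

votes = np.zeros((T, N))
for t in range(T):
    idx = rng.integers(0, N, size=N)      # bootstrap: N draws with replacement
    g = DecisionTreeClassifier().fit(X[idx], y[idx])
    votes[t] = g.predict(X)

G_pred = (votes.mean(axis=0) > 0.5).astype(int)   # uniform vote
acc = (G_pred == y).mean()                         # in-sample accuracy of G
```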

Chap08 Adaptive Boosting (AdaBoost)

$g_t = \mathrm{argmin} \sum_n u_n^{(t)} \cdot err_n$
$g_{t+1} = \mathrm{argmin} \sum_n u_n^{(t+1)} \cdot err_n$

want g_t to look poor under $u^{(t+1)}$: err rate = (weight sum of mistakes) / (weight sum of mistakes + weight sum of correct) = 1/2
⇒ want: weight sum of correct = weight sum of mistakes

incorrect weights *= S
correct weights *= 1/S, with $S = \sqrt{(1-\epsilon)/\epsilon}$

→ works because err rate ε ≤ 1/2
→ incorrect weights ↑, correct weights ↓, pushing the weighted err rate to 1/2
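A numeric check of this reweighting (made-up weights; g_t is pretended to err on every third example): after scaling, the same g_t has weighted error rate exactly 1/2.

```python
import numpy as np

u = np.linspace(0.1, 1.0, 20)          # current example weights (illustrative)
correct = np.arange(20) % 3 != 0       # pretend g_t errs on every 3rd example

eps = u[~correct].sum() / u.sum()      # weighted error rate of g_t
S = np.sqrt((1 - eps) / eps)           # scaling factor

# scale up the mistakes, scale down the correct examples
u_next = np.where(correct, u / S, u * S)
eps_next = u_next[~correct].sum() / u_next.sum()   # should be exactly 1/2
```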

$u^{(1)}$ can be set to 1/N for every example, so $g_1$ achieves min Ein
aggregating G uniformly would hurt performance (later $g_t$ are trained on skewed weights)

→ set $\alpha_t = \ln(S_t)$ ($S_t$ = scaling factor)
if err rate = 1/2 → $S_t = 1$ → $\alpha_t = 0$ (useless $g_t$ gets no vote)
if err rate = 0 → $S_t = \infty$ → $\alpha_t = \infty$ (perfect $g_t$ gets all the votes)

decision stump: three parameters: which feature, threshold (the line), direction (+/−); can achieve Ein ≤ 1/2

Chap09 Decision Tree

A traditional learning model that realizes conditional aggregation

Path View:
G(x) = sum(q(x) * g(x)) over all paths
q = condition (is x on this path?)
g = base hypothesis: just a constant, the leaf at the end of the path

Recursive View:
G(x) = sum([b(x) == c] * Gc(x))
G: full tree
b: branching criteria
Gc: sub-tree hypothesis

advantages: human-explainable, simple, efficient, handles missing features, handles categorical features easily, handles multiclass easily
Ex. C&RT, C4.5, J48…

four choices: number of branches, branching
criteria, termination criteria, & base hypothesis

C&RT (Classification and Regression Tree):
a tree which is fully grown, with constant leaves
C = 2 (binary tree), branching by decision stump
$g_t(x)$ = the most likely output under this part (the most common $y_n$ for classification, or the mean of $y_n$ for regression)
→ branch so the parts become as pure as possible (ideally all outputs within a part the same)

impurity = variance (regression), or 1 − proportion of the most common $y_n$ (classification)

popular to use :
Gini for classification
regression error for regression
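The Gini impurity is $1 - \sum_k p_k^2$ over the class proportions in a part; a tiny sketch:

```python
import numpy as np

def gini(y):
    # Gini impurity: 1 - sum_k p_k^2 over the class proportions of y
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

pure = gini(np.array([1, 1, 1, 1]))    # one class only -> impurity 0
mixed = gini(np.array([1, 1, 0, 0]))   # 50/50 split -> impurity 1/2
```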

termination criteria:

1. all $y_n$ are the same: impurity = 0 → make a leaf
2. all $x_n$ are the same: cannot branch further

if all $x_n$ are different: the fully-grown tree gets Ein = 0
but low-level subtrees are built with very little data → overfit

regularizer: number of leaves
argmin(Ein(G) + c * number of leaves(G))

Surrogate branch: when a feature is missing during prediction, branch with a pre-trained surrogate feature that mimics the best branch

Chap10 Random Forest

Random Forest = bagging + fully-grown random-subspace random-combination C&RT decision tree

highly parallel; bagging reduces the variance of the decision trees

ways to increase decision-tree diversity:

1. randomly sample features from x (a random subspace of X)

→ efficient; can be used with any learning model
e.g. with 10000 features, learn using only 100 randomly chosen dimensions

2. apply low-dimensional random projections to x → new features (oblique splits); random combination

Out-of-bag

out-of-bag (OOB): examples not sampled after N drawings
drawing N times from N examples, P(a given example is never drawn) ≈ 1/e
⇒ could use the un-drawn data to validate each g (usually not done; each g is only one part of G)
⇒ use the un-drawn data to validate G: $E_{oob} = \sum_n err(y_n, G_n^-(x_n))$, where $G_n^-$ excludes the g's trained on $x_n$

Eoob: self-validation
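The ≈ 1/e fraction is easy to confirm by simulation (toy sizes; the exact expectation is $(1-1/N)^N$):

```python
import numpy as np

rng = np.random.default_rng(10)
N, trials = 1000, 200
oob_fracs = []
for _ in range(trials):
    drawn = rng.integers(0, N, size=N)    # bootstrap: N draws of N examples
    oob = N - len(np.unique(drawn))       # examples never drawn
    oob_fracs.append(oob / N)
avg_oob = float(np.mean(oob_fracs))       # should be close to 1/e ~ 0.368
```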

Feature Selection

want to remove redundant, irrelevant features…

learn a subset-transform for the final hypothesis

advantages: interpretability, removes 'feature noise', efficiency
disadvantages: total computation time increases, risk of 'feature-selection overfit', over-interpretation

decision tree: built-in feature selection

idea: rate the importance of every feature
linear model: look at the magnitude of each $w_i$
non-linear model: not easy to estimate directly

idea: random test
put random values into a feature and check how much performance drops; the bigger the drop, the more important the feature

random value

• by original P(X = x)
• bootstrap, permutation

re-evaluating performance is expensive (takes long to compute)
importance(i) = $E_{oob}(G, D) - E_{oob}(G, D^{(p)})$, where $D^{(p)}$ = data with feature i permuted
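A sketch of the permutation test (illustrative data with one informative and one irrelevant feature; for brevity it compares training accuracy before and after permuting, rather than the OOB error the formula above uses):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(11)
N = 300
x_signal = rng.normal(size=N)              # informative feature
x_noise = rng.normal(size=N)               # irrelevant feature
X = np.column_stack([x_signal, x_noise])
y = (x_signal > 0).astype(int)             # only feature 0 matters

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

def perm_drop(model, X, y, i, rng):
    # accuracy drop when feature i's column is randomly permuted
    Xp = X.copy()
    Xp[:, i] = rng.permutation(Xp[:, i])
    return model.score(X, y) - model.score(Xp, y)

drop_signal = perm_drop(rf, X, y, 0, rng)  # large drop expected
drop_noise = perm_drop(rf, X, y, 1, rng)   # near-zero drop expected
```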

strength-correlation decomposition
s = average voting margin (votes for the winner minus votes for the runner-up, …) of G
ρ = similarity among the $g_t$
cf. bias-variance decomposition