Big Data Analytics Notes 2


1 Linear Model

$$y_i=\theta_1 x_{i1} + \theta_2 x_{i2} + \dots + \theta_{n_p} x_{in_p} + \varepsilon_i,\quad i=1,2,\dots,n_d.$$
$$y = X\theta + \varepsilon$$
where $\varepsilon$ is the noise and $\theta$ is the vector of unknown parameters.

  • The linear model is parametric with $n_p$ parameters.
  • If an intercept $\theta_0$ is added, it corresponds to a column of ones in the design matrix $X$.
  • The design matrix $X$ has dimension $n_d \times n_p$, or $n_d \times (n_p+1)$ when $\theta_0$ is included, with $n_d > n_p$.
  • The traditional assumption on the noise $\varepsilon$ is that it is iid (independent and identically distributed) normal: $\varepsilon_i \sim \mathcal{N}(0,\sigma^2),\quad i=1,2,\dots,n_d$

1.1 parameter estimation

The parameters of the linear model can be estimated via a cost function or via the **maximum likelihood principle**. Under the normal noise assumption above, both approaches give the same optimal $\theta$. The difference is one of interpretation: the cost function reduces the discrepancy between predictions and data, while maximum likelihood estimation (MLE) maximizes the likelihood of observing the data given the parameters.

1.2 Cost Function

The goal: reduce the discrepancy between predictions and data
$\Longrightarrow$ reduce the MSE of the predictions

$$\begin{aligned} MSE(\hat{y}) &= \frac{1}{n_d}||y-\hat{y}||^2 = \frac{1}{n_d}\sum^{n_d}_{i=1}(y_i - \hat{y}_i)^2 \\ &= \frac{1}{n_d}||y-X\hat{\theta}||^2 = \frac{1}{n_d}\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\hat{\theta})^2 \end{aligned}$$
Let $J(\theta)$ be the cost function. We look for the estimate $\hat{\theta}$ of $\theta$ that minimizes $J(\theta)$, where
$$J(\theta) = ||y-X\theta||^2 = \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2$$
so $\hat\theta=\arg\underset{\theta}{\min}\, J(\theta)$. This is the same as the ordinary least squares (OLS) estimator, because OLS minimizes the squared noise norm $||\varepsilon(\theta)||^2=\sum^{n_d}_{i=1}\varepsilon_i^2=\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2$. The OLS estimator $\hat\theta$ is therefore
$$\hat\theta = \arg\underset{\theta}{\min}\, ||\varepsilon(\theta)||^2 = \arg\underset{\theta}{\min} \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2$$

In summary, the ordinary least squares (OLS) estimator $\hat\theta$ is

$$\hat\theta = \arg\underset{\theta}{\min}\, ||\varepsilon(\theta)||^2 = \arg\underset{\theta}{\min} \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2$$

Cost function:

$J(\theta)=\sum^{n_d}_{i=1}(y_i - h_\theta(x_i))^2$, where $h_\theta(x)$ is called the hypothesis and $h_\theta(x_i)=x_i^\mathsf{T}\theta$ for linear models. So the estimator obtained by minimizing the cost function with $h_\theta(x_i)=x_i^\mathsf{T}\theta$ and the OLS estimator obtained by minimizing the squared noise norm $||\varepsilon(\theta)||^2$ coincide.

How to get $\hat\theta$:

  • If $J(\theta)$ is convex: gradient descent approximates $\hat\theta$.
  • If $J(\theta)$ is not convex: use other numerical optimization schemes to approximate $\hat\theta$.

1.3 Normal equation to get $\hat\theta$

If $X^\mathsf{T}X$ is invertible,
$$\hat\theta = (X^\mathsf{T}X)^{-1}X^\mathsf{T}y$$
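
As a minimal sketch of the normal equation, the snippet below simulates data from the linear model and recovers $\hat\theta$ with NumPy. The sample sizes, the "true" parameter vector and the noise level are illustrative assumptions, not values from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)

n_d, n_p = 200, 3                          # illustrative numbers of data points and parameters
theta_true = np.array([1.5, -2.0, 0.5])    # hypothetical "true" parameters

X = rng.normal(size=(n_d, n_p))            # design matrix, n_d x n_p
eps = rng.normal(scale=0.1, size=n_d)      # iid normal noise
y = X @ theta_true + eps

# Normal equation: solve (X^T X) theta = X^T y rather than forming the inverse explicitly.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq solves the same least-squares problem without forming X^T X.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Solving the linear system instead of computing $(X^\mathsf{T}X)^{-1}$ is a common numerical choice; both routes give the same estimate when $X^\mathsf{T}X$ is well conditioned.
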
Gauss-Markov assumptions

  • $E(\varepsilon_i) = 0$
  • $Var(\varepsilon_i)=\sigma^2$
  • $Cov(\varepsilon_i,\varepsilon_j)=0$ for $i \neq j$

Under these assumptions:

  • $X$ should have full column rank (linearly independent columns), so that $X^\mathsf{T}X$ is non-singular.
  • $E(\hat\theta)=\theta$
  • $Var(\hat\theta)=\sigma^2(X^\mathsf{T}X)^{-1}$
  • $\hat\theta \sim \mathcal{N}(\theta,\sigma^2(X^\mathsf{T}X)^{-1})$
    (under the additional assumption $\varepsilon_i \sim \mathcal{N}(0,\sigma^2),\quad i=1,2,\dots,n_d$)
  • The likelihood is
    $$\mathcal{L}(y,X|\theta,\sigma^2)=\prod^{n_d}_{i=1}(2\pi\sigma^2)^{-1/2} \exp\left(-\frac{(y_i - x_i^\mathsf{T}\theta)^2}{2\sigma^2}\right)=(2\pi\sigma^2)^{-n_d/2} \exp\left(-\frac{1}{2\sigma^2}\sum^{n_d}_{i=1}(y_i - x_i^\mathsf{T}\theta)^2\right)$$
  • The associated MLE:
    $\hat\theta = (X^\mathsf{T}X)^{-1}X^\mathsf{T}y$, which coincides with OLS for normal noise, i.e. it equals $\underset{\theta}{\arg\min} \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2$
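
The step from the likelihood to the normal equation can be written out as a short derivation. This is a sketch under the normal noise assumption stated above. Taking the logarithm of the likelihood gives

$$\log\mathcal{L}(y,X|\theta,\sigma^2) = -\frac{n_d}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum^{n_d}_{i=1}(y_i - x_i^\mathsf{T}\theta)^2$$

The first term does not depend on $\theta$, so maximizing over $\theta$ is the same as minimizing $\sum^{n_d}_{i=1}(y_i - x_i^\mathsf{T}\theta)^2 = ||y-X\theta||^2$. Setting the gradient to zero,

$$\nabla_\theta ||y-X\theta||^2 = -2X^\mathsf{T}(y-X\theta) = 0 \;\Longrightarrow\; X^\mathsf{T}X\theta = X^\mathsf{T}y \;\Longrightarrow\; \hat\theta = (X^\mathsf{T}X)^{-1}X^\mathsf{T}y$$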

Limitations of the normal equation for large $n_p$

  • Computational cost of $(X^\mathsf{T}X)^{-1}$
  • Singularity: with large $n_p$, the columns of $X$ may become highly correlated, so $X^\mathsf{T}X$ becomes (nearly) singular

1.4 Gradient descent process to get $\hat\theta$

The logic: start with some $\theta = (\theta_0,\theta_1)$, then keep changing $(\theta_0,\theta_1)$ to reduce $J(\theta_0,\theta_1)$ until reaching $\underset{\theta_0,\theta_1}{\min}J(\theta_0,\theta_1)$.

1.4.1 Algorithm

  1. Initialize $(\theta_1,\theta_2,\dots,\theta_n)$
  2. Set $(\tilde\theta_1,\tilde\theta_2,\dots,\tilde\theta_n)=(\theta_1,\theta_2,\dots,\theta_n)$
  3. Update every $\theta_i$ with $\theta_i=\tilde\theta_i - a \cdot \frac{\partial}{\partial\theta_i}J(\tilde\theta_1,\tilde\theta_2,\dots,\tilde\theta_n)$, where $a$ is the learning rate
  4. Repeat steps 2 and 3 until $\theta_i$ converges

Gradient descent algorithm for linear regression

Cost function for the multiple linear regression model:
$$J(\theta)=\frac{1}{2n_d}\sum_{i=1}^{n_d}(x_i^\mathsf{T} \theta - y_i)^2$$
The updating step of gradient descent for linear regression with this cost function:
$$\theta_j^{(k+1)}=\theta_j^{(k)}-a\frac{1}{n_d}\sum_{i=1}^{n_d}(x_i^\mathsf{T}\theta^{(k)}-y_i)x_{ij}$$
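
The update rule above can be sketched in NumPy as follows. The function name `gradient_descent`, the default learning rate, the fixed iteration count and the simulated data are assumptions made for illustration.

```python
import numpy as np

def gradient_descent(X, y, a=0.1, n_iter=1000):
    """Batch gradient descent for J(theta) = 1/(2 n_d) * sum((x_i^T theta - y_i)^2)."""
    n_d, n_p = X.shape
    theta = np.zeros(n_p)                  # initialise all parameters at zero
    for _ in range(n_iter):
        residual = X @ theta - y           # x_i^T theta - y_i for every i
        grad = X.T @ residual / n_d        # j-th entry: (1/n_d) * sum_i residual_i * x_ij
        theta = theta - a * grad           # all theta_j are updated simultaneously
    return theta

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=100)])  # column of ones (intercept) + one covariate
y = X @ np.array([2.0, -3.0]) + rng.normal(scale=0.1, size=100)

theta_gd = gradient_descent(X, y)
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)   # normal-equation solution, for comparison
```

Because this cost function is convex, the iterates approach the normal-equation solution when the learning rate is small enough.
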

1.4.2 Learning Rate

The learning rate $a$ used in the algorithm is a positive constant ($a > 0$).

  • Learning rate can affect convergence and speed of convergence:
    Too small learning rate might lead to slow convergence, while too large learning rate might lead to non-convergence or divergence.
  • Can be selected using a validation set or cross-validation.
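
To illustrate the effect of the learning rate, the sketch below records the cost $J(\theta)=\frac{1}{2n_d}||X\theta-y||^2$ over the iterations for three values of $a$. The helper `cost_history`, the three rates and the simulated data are hypothetical choices for this example only.

```python
import numpy as np

def cost_history(X, y, a, n_iter=50):
    """Run gradient descent and record J(theta) = 1/(2 n_d) * ||X theta - y||^2 at each iteration."""
    n_d = X.shape[0]
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(n_iter):
        residual = X @ theta - y
        history.append(residual @ residual / (2 * n_d))
        theta = theta - a * (X.T @ residual) / n_d
    return np.array(history)

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=100)

slow = cost_history(X, y, a=0.001)   # very small rate: the cost decreases, but slowly
good = cost_history(X, y, a=0.5)     # moderate rate: the cost drops quickly
bad = cost_history(X, y, a=2.5)      # too large a rate: the cost grows, the iteration diverges
```

Plotting the three histories against the iteration number is one simple way to compare candidate learning rates.
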

1.4.3 Stopping criterion

Stop iterating when the chosen error measure falls below a tolerance close to 0:

  • Absolute error:
    $$\varepsilon_{abs}=|J(\theta^{(k+1)})-J(\theta^{(k)})|$$
  • Relative error:
    $$\varepsilon_{rel}=\left|\frac{J(\theta^{(k+1)})-J(\theta^{(k)})}{J(\theta^{(k+1)})}\right|$$
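
A possible way to use these error measures as a stopping rule is sketched below. The function `gd_with_tolerance`, its default tolerance and the iteration cap are assumptions for illustration. The relative version assumes $J(\theta^{(k+1)}) \neq 0$.

```python
import numpy as np

def gd_with_tolerance(X, y, a=0.1, tol=1e-10, relative=False, max_iter=100_000):
    """Gradient descent on J(theta) = 1/(2 n_d) * ||X theta - y||^2 with a tolerance-based stop."""
    n_d = X.shape[0]

    def cost(theta):
        return np.sum((X @ theta - y) ** 2) / (2 * n_d)

    theta = np.zeros(X.shape[1])
    J_old = cost(theta)
    for k in range(1, max_iter + 1):
        grad = X.T @ (X @ theta - y) / n_d
        theta = theta - a * grad
        J_new = cost(theta)
        err = abs(J_new - J_old)           # absolute error
        if relative:
            err = err / abs(J_new)         # relative error; assumes J_new is not 0
        if err < tol:
            break
        J_old = J_new
    return theta, k                        # estimate and number of iterations used

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = X @ np.array([0.5, 1.5]) + rng.normal(scale=0.1, size=100)

theta_abs, k_abs = gd_with_tolerance(X, y)
theta_rel, k_rel = gd_with_tolerance(X, y, relative=True)
```
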

2 Logistic Regression

Output: $\{0,1\}$
Decision boundary: $x_i^{\mathsf{T}}\theta = 0$
Hypothesis: $h_\theta(x_i)=g(x_i^\mathsf{T}\theta) =\frac{1}{1+\exp(-x_i^\mathsf{T}\theta)}$

Cost function for a single observation:
$$J(\theta_0,\theta_1)=\left\{ \begin{array}{ll} -\log(h_{(\theta_0,\theta_1)}(x)), & \text{if } y=1 \\ -\log(1-h_{(\theta_0,\theta_1)}(x)), & \text{if } y=0 \end{array} \right.$$

General form of the cost function for $n$ samples:
$$J(\theta)=-\frac{1}{n}\sum^n_{i=1}\Big(y_i \log(h_\theta(x_i)) + (1-y_i)\log(1-h_\theta(x_i))\Big)$$


  • $x_i^\mathsf{T}\theta \ge 0 \Longrightarrow h_\theta(x_i) \ge 0.5 \Longrightarrow \hat{y}=1$
    If $y=1$, then $J(\theta) \rightarrow 0$ as $h_\theta(x_i) \rightarrow 1$.
    If $y=0$, then $J(\theta) \rightarrow \infty$ as $h_\theta(x_i) \rightarrow 1$.
  • $x_i^\mathsf{T}\theta < 0 \Longrightarrow h_\theta(x_i) < 0.5 \Longrightarrow \hat{y}=0$
    If $y=1$, then $J(\theta) \rightarrow \infty$ as $h_\theta(x_i) \rightarrow 0$.
    If $y=0$, then $J(\theta) \rightarrow 0$ as $h_\theta(x_i) \rightarrow 0$.
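
The hypothesis, the cost function and the decision rule above can be sketched in a few lines of NumPy. The function names and the two-point toy data set are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    """J(theta) = -(1/n) * sum(y_i log h_i + (1 - y_i) log(1 - h_i))."""
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def predict(theta, X):
    """Predict 1 when x_i^T theta >= 0 (equivalently h >= 0.5), else 0."""
    return (X @ theta >= 0).astype(int)

# Toy data: first column is the intercept, second a single covariate.
X = np.array([[1.0, 2.0],
              [1.0, -2.0]])
theta = np.array([0.0, 1.0])

cost_matching = logistic_cost(theta, X, np.array([1, 0]))   # labels agree with the predictions
cost_flipped = logistic_cost(theta, X, np.array([0, 1]))    # labels contradict the predictions
y_hat = predict(theta, X)
```

The cost is small when the labels agree with the predictions and large when they are flipped, in line with the limits listed above.
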

3 Avoid Overfitting

More overfitting $\Longrightarrow$ higher variance of the predictions
More underfitting $\Longrightarrow$ higher bias of the predictions

3.1 Penalizing parameters

$$\sum^{n_d}_{i=1}(y_i - h_\theta(x_i))^2 + \lambda r(\theta)$$
where $\lambda r(\theta)$ is the penalty, $r(\theta)$ is the parameter penalty function, and $\lambda$ is the regularization parameter.

Regularization parameter $\lambda$

  • $\lambda \rightarrow 0$: the regularized regression estimates tend to the (unpenalized) linear regression estimates
  • $\lambda \rightarrow \infty$: the parameters are penalized more strongly, which reduces over-fitting to the data (at the risk of under-fitting if $\lambda$ is too large)
  • Select $\lambda$ via cross-validation

Parameter penalty functions $r(\theta)$
Some typical parameter penalty functions (compare with distance measures):

  • $d$-variation function: $r(\theta)=\left(\sum^{n_p}_{i=1}|\theta_i|^d\right)^{1/d}$
  • $L_1$ norm: $r(\theta)=\sum^{n_p}_{i=1}|\theta_i|$
  • squared $L_2$ norm: $r(\theta)=\sum^{n_p}_{i=1}|\theta_i|^2$
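
As a small numerical check of these penalty functions, the sketch below evaluates them for one example vector. The function names and the vector $(3,-4)$ are chosen only for illustration.

```python
import numpy as np

def penalty_d(theta, d):
    """d-variation penalty: (sum |theta_i|^d)^(1/d)."""
    return np.sum(np.abs(theta) ** d) ** (1.0 / d)

def penalty_l1(theta):
    """L1 norm: sum |theta_i|."""
    return np.sum(np.abs(theta))

def penalty_l2_squared(theta):
    """Squared L2 norm: sum theta_i^2."""
    return np.sum(theta ** 2)

theta = np.array([3.0, -4.0])
r_l1 = penalty_l1(theta)              # |3| + |-4|
r_l2sq = penalty_l2_squared(theta)    # 3^2 + 4^2
r_d2 = penalty_d(theta, 2)            # square root of the squared L2 norm
```

For $d=1$ the $d$-variation function reduces to the $L_1$ norm, and for $d=2$ it is the square root of the squared $L_2$ norm.
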

3.2 Regression model with penalty function

3.2.1 Lasso regression

Lasso regression uses the $L_1$ norm as the penalty function. Lasso regression "zeroes out" coefficients, so it performs variable selection and, to some extent, parameter shrinkage.

  • Cost function
    $$J_L(\theta)=||y-X\theta||^2 + \lambda||\theta||_1= \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2 + \lambda\sum^{n_p}_{j=1}|\theta_j|$$

  • Gradient of the cost function
    $$\left(\frac{\partial J_L(\theta)}{\partial \theta_1},\dots, \frac{\partial J_L(\theta)}{\partial \theta_{n_p}}\right)$$
    For the squared-error term,
    $$\frac{\partial}{\partial \theta_1}\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2 = -2\sum^{n_d}_{i=1}x_{i1}(y_i - x_i^{\mathsf{T}}\theta)$$
    The penalty term $\lambda|\theta_j|$ is not differentiable at $\theta_j=0$.

  • Lasso estimate
    $$\hat\theta_L=\underset{\theta}{\arg\min}\left(\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2 + \lambda\sum^{n_p}_{j=1}|\theta_j| \right)$$
    There is no closed-form solution, but $J_L(\theta)$ is convex, so the estimate can be computed numerically, for example with the least angle regression algorithm.
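
To make the "zeroing out" behaviour concrete, here is a minimal sketch that minimizes $||y-X\theta||^2 + \lambda||\theta||_1$ numerically. Note that it uses cyclic coordinate descent with soft-thresholding, a different and simpler technique than the least angle regression algorithm named in these notes. The function names, the value of $\lambda$ and the simulated data are assumptions.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho towards zero by t; values with |rho| <= t become exactly 0."""
    return np.sign(rho) * np.maximum(np.abs(rho) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise ||y - X theta||^2 + lam * ||theta||_1 one coordinate at a time."""
    n_d, n_p = X.shape
    theta = np.zeros(n_p)
    z = np.sum(X ** 2, axis=0)                        # sum_i x_ij^2 for each column j
    for _ in range(n_iter):
        for j in range(n_p):
            # residual with the contribution of coordinate j removed
            r_j = y - X @ theta + X[:, j] * theta[j]
            rho = X[:, j] @ r_j
            # exact minimiser of the cost in theta_j with the other coordinates held fixed
            theta[j] = soft_threshold(rho, lam / 2.0) / z[j]
    return theta

rng = np.random.default_rng(3)
theta_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0])    # hypothetical sparse parameter vector
X = rng.normal(size=(100, 5))
y = X @ theta_true + rng.normal(scale=0.1, size=100)

theta_lasso = lasso_coordinate_descent(X, y, lam=20.0)
```

With this setup the coefficients of the three irrelevant covariates are set exactly to zero, while the two relevant ones are slightly shrunk towards zero.
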

3.2.2 Ridge regression

Ridge regression uses the squared $L_2$ norm as the penalty function. It does not "zero out" coefficients, i.e. it cannot perform variable selection, but rather shrinks the parameter values.

  • Cost function
    $$J_R(\theta)=||y-X\theta||^2 + \lambda||\theta||_2^2= \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2 + \lambda\sum^{n_p}_{j=1}\theta_j^2$$

  • Ridge estimate
    $$\hat\theta_R = (X^\mathsf{T} X + \lambda I)^{-1}X^\mathsf{T}y$$
    This is the closed-form solution.
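
The closed-form ridge estimate is straightforward to compute. The sketch below is one way to do it in NumPy. The function name `ridge_estimate`, the values of $\lambda$ and the simulated data are illustrative assumptions.

```python
import numpy as np

def ridge_estimate(X, y, lam):
    """Closed-form ridge estimate: solve (X^T X + lam * I) theta = X^T y."""
    n_p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_p), X.T @ y)

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 4))
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

theta_ols = ridge_estimate(X, y, lam=0.0)      # lam = 0 gives the OLS estimate
theta_r1 = ridge_estimate(X, y, lam=1.0)
theta_r100 = ridge_estimate(X, y, lam=100.0)   # larger lam shrinks the estimate further
```

The norm of the estimate decreases as $\lambda$ grows, which is the shrinkage described above.
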

The ridge estimate is biased because:

$$\begin{aligned} E(\hat\theta_R) &= E\left( (X^\mathsf{T} X + \lambda I)^{-1}X^\mathsf{T}y \right) \\ &=(X^\mathsf{T} X + \lambda I)^{-1}X^\mathsf{T} E(y)\\ &=(X^\mathsf{T} X + \lambda I)^{-1}X^\mathsf{T}X \theta\\ &=(X^\mathsf{T} X + \lambda I)^{-1}((X^\mathsf{T}X)^{-1})^{-1}\theta\\ &=\left[(X^\mathsf{T}X)^{-1}(X^\mathsf{T} X + \lambda I)\right]^{-1} \theta\\ &=\left[ I + \lambda(X^\mathsf{T}X)^{-1}\right]^{-1}\theta \end{aligned}$$
$X^\mathsf{T}X$ is positive definite and $\lambda > 0$ by definition, so $\left[ I + \lambda(X^\mathsf{T}X)^{-1}\right]^{-1} \neq I$ and hence $E(\hat\theta_R)\neq \theta$ (for $\theta \neq 0$).
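
A simple special case shows the size of the bias. Assume, purely for illustration, an orthonormal design with $X^\mathsf{T}X = I$. Then

$$E(\hat\theta_R) = \left[ I + \lambda I\right]^{-1}\theta = \frac{1}{1+\lambda}\,\theta$$

so every component of $\theta$ is shrunk by the same factor $\frac{1}{1+\lambda}$. The bias vanishes as $\lambda \rightarrow 0$ and grows with $\lambda$.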

Although ridge regression does not perform variable selection, it has a grouping effect: the coefficients of strongly correlated variables are shrunk towards each other, so a group of correlated variables tends to be kept together rather than one of them being singled out. Ridge regression can also resolve near multicollinearity.

3.2.3 Elastic net

The elastic net penalty is a compromise between Lasso and ridge.

  • Cost function
    $$J_E(\theta)=||y-X\theta||^2 + \lambda_1||\theta||_1 + \lambda_2||\theta||_2^2$$
    If $X$ is an $n_d\times n_p$ design matrix,
    $$J_E(\theta)=\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2 + \lambda_1\sum^{n_p}_{j=1}|\theta_j| + \lambda_2\sum^{n_p}_{j=1}\theta_j^2$$
  • Elastic net estimate
    $$\hat\theta_E =\underset{\theta}{\arg\min}\left(||y-X\theta||^2 + \lambda_1||\theta||_1 + \lambda_2||\theta||_2^2\right)$$

There is no closed-form solution for $\hat\theta_E$, but $J_E(\theta)$ is convex (strictly convex for $\lambda_2 > 0$), so it has a unique minimum.
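
The sketch below only evaluates the elastic net cost function, to show how it reduces to the Lasso cost for $\lambda_2=0$ and to the ridge cost for $\lambda_1=0$. The function name and the tiny example values are assumptions. Minimizing the cost would still require a numerical solver.

```python
import numpy as np

def elastic_net_cost(theta, X, y, lam1, lam2):
    """J_E(theta) = ||y - X theta||^2 + lam1 * ||theta||_1 + lam2 * ||theta||_2^2."""
    residual = y - X @ theta
    return residual @ residual + lam1 * np.sum(np.abs(theta)) + lam2 * np.sum(theta ** 2)

X = np.eye(2)                      # tiny 2 x 2 design matrix, for illustration only
y = np.array([1.0, 2.0])
theta = np.array([1.0, 1.0])

j_elastic = elastic_net_cost(theta, X, y, lam1=0.5, lam2=0.25)
j_lasso = elastic_net_cost(theta, X, y, lam1=0.5, lam2=0.0)    # Lasso cost as a special case
j_ridge = elastic_net_cost(theta, X, y, lam1=0.0, lam2=0.25)   # ridge cost as a special case
```
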

4 Learning curve

Increase prediction accuracy:

  • Variable selection: increase or reduce the number of covariates
  • Add polynomial features, e.g. $\{x^2, x_1x_2\}$
  • Regularized regression (Lasso, ridge regression and elastic net)
  • Collect more data

Learning curve:

  • A learning curve shows a metric of prediction accuracy, such as the cost function $J(\theta)$ or another error metric
  • The metric is plotted as a function of a parameter that affects it (for example, the size of the training set)
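
As a minimal sketch of how a learning curve could be computed, the code below fits OLS on training sets of increasing size and records the training and validation MSE. Using the training set size as the varied parameter, as well as the split, the sizes and the simulated data, are assumptions made for this example.

```python
import numpy as np

def mse(X, y, theta):
    """Mean squared error of the linear predictions X @ theta."""
    return np.mean((y - X @ theta) ** 2)

rng = np.random.default_rng(5)
theta_true = np.array([1.0, -2.0, 0.5])              # hypothetical "true" parameters
X = rng.normal(size=(300, 3))
y = X @ theta_true + rng.normal(scale=0.5, size=300)

X_train, y_train = X[:200], y[:200]                  # training data
X_val, y_val = X[200:], y[200:]                      # held-out validation data

sizes = [5, 10, 20, 50, 100, 200]                    # training set sizes to try
train_err, val_err = [], []
for m in sizes:
    # fit OLS on the first m training points only
    theta_hat, *_ = np.linalg.lstsq(X_train[:m], y_train[:m], rcond=None)
    train_err.append(mse(X_train[:m], y_train[:m], theta_hat))
    val_err.append(mse(X_val, y_val, theta_hat))
```

Plotting `train_err` and `val_err` against `sizes` gives the learning curve. A large gap between the two curves typically points to overfitting, while two curves that are both high point to underfitting.
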


