Big Data Analytics 笔记 2
1 Linear Model
y i = θ 1 x i 1 θ 2 x i 2 . . . θ n p x i n p ε i , i = 1 , 2 , . . . , n d . y_i=\theta_1 x_{i1} \theta_2 x_{i2} ... \theta_{n_p} x_{in_p} \varepsilon_i,~~~~~i=1,2,...,n_d. yi=θ1xi1 θ2xi2 ... θnpxinp εi, i=1,2,...,nd.
or,
y = X θ ε y = X \theta \varepsilon y=Xθ ε
where ε \varepsilon ε is noise, θ \theta θ is the vector of unkown parameters.
- The linear model is parametric with n p n_p np parameters
- If adding an interception θ 0 \theta_0 θ0, then the intercept is a column of 1 in the design matrix X X X
- For dimension of design matrix X : n d × n p X:~n_d \times n_p X: nd×np or X : n d × ( n p 1 ) X:~n_d \times (n_p 1) X: nd×(np 1) for adding θ 0 \theta_0 θ0, n d > n p n_d > n_p nd>np
- The traditional assumption of distribution of noise (iid noise, means noise with distribution) ε \varepsilon ε: ε i ∼ N ( 0 , σ 2 ) , i = 1 , 2 , . . . , n d \varepsilon_i ~\sim ~ \mathcal{N}(0,\sigma^2),~~i=1,2,...,n_d εi ∼ N(0,σ2), i=1,2,...,nd
1.1 parameter estimation
For parameter estimation of Linear Model: via the cost function or ** maximum likelihood principle**, these two approaches coincide to the same optimal solution of the parameter θ \theta θ, the difference is: cost function reduces the discrepancy between predictions and data, while MLE maximizes the likelihood of observing data given parameters.
1.2 Cost Function
The goal: reduces the discrepancy between predictions and data
⟹ \Longrightarrow ⟹ reduce the MSE of predictions
M S E ( y ^ ) = 1 n d ∣ ∣ y − y ^ ∣ ∣ 2 = 1 n d ∑ i = 1 n d ( y i − y ^ i ) 2 = 1 n d ∣ ∣ y − X θ ^ ∣ ∣ 2 = 1 n d ∑ i = 1 n d ( y i − x i T θ ^ ) 2 MSE(\hat{y}) = \frac{1}{n_d}||y-\hat{y}||^2 = \frac{1}{n_d}\sum^{n_d}_{i=1}(y_i - \hat{y}_i)^2 \\ = \frac{1}{n_d}||y-X\hat{\theta}||^2 = \frac{1}{n_d}\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\hat{\theta})^2 MSE(y^)=nd1∣∣y−y^∣∣2=nd1i=1∑nd(yi−y^i)2=nd1∣∣y−Xθ^∣∣2=nd1i=1∑nd(yi−xiTθ^)2
Let J ( θ ) J(\theta) J(θ) be cost function, find θ ^ \hat{\theta} θ^ of θ \theta θ to minimize J ( θ ) J(\theta) J(θ), where
J ( θ ) = ∣ ∣ y − X θ ∣ ∣ 2 = ∑ i = 1 n d ( y i − x i T θ ) 2 J(\theta) = ||y-X\theta||^2 = \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2 J(θ)=∣∣y−Xθ∣∣2=i=1∑nd(yi−xiTθ)2
so θ ^ = arg m i n θ J ( θ ) \hat\theta=\arg \underset{\theta} {min} J(\theta) θ^=argθminJ(θ), thia is the same as ordinary least squares (OLS) estimator, because OLS for θ = ∣ ∣ ε ∣ ∣ 2 = ∑ i = 1 n d ε 2 = ∑ i = 1 n d ( y i − x i T θ ) 2 \theta = ||\varepsilon||^2=\sum^{n_d}_{i=1}\varepsilon^2=\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2 θ=∣∣ε∣∣2=∑i=1ndε2=∑i=1nd(yi−xiTθ)2, so OLS estimator for θ ^ \hat\theta θ^:
θ ^ = arg m i n θ ∣ ∣ ε ( θ ) ∣ ∣ 2 = arg m i n θ ( y i − x i T θ ) 2 \hat\theta = \arg \underset{\theta} {min} ||\varepsilon(\theta)||^2 = \arg \underset{\theta} {min} (y_i - x_i^{\mathsf{T}}\theta)^2 θ^=argθmin∣∣ε(θ)∣∣2=argθmin(yi−xiTθ)2
The ordinary least squares (OLS) θ ^ \hat\theta θ^:
θ ^ = arg m i n θ ∣ ∣ ε ( θ ) ∣ ∣ 2 = arg m i n θ ( y i − x i T θ ) 2 \hat\theta = \arg \underset{\theta} {min} ||\varepsilon(\theta)||^2 = \arg \underset{\theta} {min} (y_i - x_i^{\mathsf{T}}\theta)^2 θ^=argθmin∣∣ε(θ)∣∣2=argθmin(yi−xiTθ)2
Cost Function:
J ( θ ) = ∑ i = 1 n d ( y i − h θ ( x i ) ) 2 J(\theta)=\sum^{n_d}_{i=1}(y_i - h_\theta(x_i))^2 J(θ)=i=1∑nd(yi−hθ(xi))2, where h θ ( x ) h_\theta(x) hθ(x) n is called the hypothesis and h θ ( x i ) = x i T θ h_\theta(x_i)=x_i^\mathsf{T}\theta hθ(xi)=xiTθ for linear models. So, the estimator obtained by minimizing the cost function for h θ ( x i ) = x i T θ h_\theta(x_i)=x_i^\mathsf{T}\theta hθ(xi)=xiTθ and the OLS estimator obtained by minimizing the squared noise norm ∣ ∣ ε ( θ ) ∣ ∣ 2 ||\varepsilon(\theta)||^2 ∣∣ε(θ)∣∣2 coincide
Get θ ^ \hat\theta θ^:
{ g r a d i e n t d e s c e n t a p p r o x i m a t e s θ ^ , J ( θ ) i s c o n v e x o t h e r n u m e r i c a l o p t i m i z a t i o n s c h e m e s θ ^ , J ( θ ) i s n o t c o n v e x \left\{ \begin{array}{cc} \mathrm{gradient ~descent ~approximates~} \hat\theta, ~~~~~ J(\theta)~\mathrm{is ~convex} & \\ \mathrm{ other~ numerical~ optimization~ schemes~} \hat\theta, ~~~~~ J(\theta)~\mathrm{is~ not~ convex} & \end{array} \right. {gradient descent approximates θ^, J(θ) is convexother numerical optimization schemes θ^, J(θ) is not convex
1.3 Normal equation to get θ ^ \hat\theta θ^
For invertible ( X T X ) − 1 (X^\mathsf{T}X)^{-1} (XTX)−1,
θ ^ = ( X T X ) − 1 X T y \hat\theta = (X^\mathsf{T}X)^{-1}X^\mathsf{T}y θ^=(XTX)−1XTy
Gauss-Markov assumptions
- E ( ε i ) = 0 E(\varepsilon_i) = 0 E(εi)=0
- V a r ( ε i ) = σ 2 Var(\varepsilon_i)=\sigma^2 Var(εi)=σ2
- C o v ( ε i , ε j ) = 0 Cov(\varepsilon_i,\varepsilon_j)=0 Cov(εi,εj)=0
Under this assumption:
- X X X should be non-singular to allow its columns to be linearly independent.
- E ( θ ^ ) = θ E(\hat\theta)=\theta E(θ^)=θ
- V a r ( θ ^ ) = σ 2 ( X T X ) − 1 Var(\hat\theta)=\sigma^2(X^\mathsf{T}X)^{-1} Var(θ^)=σ2(XTX)−1
- θ ^ ∼ N ( θ , σ 2 ( X T X ) − 1 ) , i = 1 , 2 , . . . , n d \hat\theta~\sim ~ \mathcal{N}(\theta,\sigma^2(X^\mathsf{T}X)^{-1}),~~i=1,2,...,n_d θ^ ∼ N(θ,σ2(XTX)−1), i=1,2,...,nd
( under the assumption of ε i ∼ N ( 0 , σ 2 ) , i = 1 , 2 , . . . , n d \varepsilon_i ~\sim ~ \mathcal{N}(0,\sigma^2),~~i=1,2,...,n_d εi ∼ N(0,σ2), i=1,2,...,nd) - The maximum likelihood is
L ( y , X ∣ θ , σ 2 ) = ∏ i = 1 n d ( 2 π σ 2 ) − n d / 2 e x p ( − 1 2 σ 2 ∑ i = 1 n d ( y i − x i T θ ) 2 ) \mathcal{L}(y,X|\theta,\sigma^2)=\prod^{n_d}_{i=1}(2\pi\sigma^2)^{-n_d/2} exp\left(-\frac{1}{2\sigma^2}\sum^{n_d}_{i=1}(y_i - x_i^\mathsf{T}\theta)^2\right) L(y,X∣θ,σ2)=i=1∏nd(2πσ2)−nd/2exp(−2σ21i=1∑nd(yi−xiTθ)2) - The associated MLE:
θ ^ = ( X T X ) − 1 X T y \hat\theta = (X^\mathsf{T}X)^{-1}X^\mathsf{T}y θ^=(XTX)−1XTy, coincide with OLS for normal noise, equal to a r g m i n θ ( y i − x i T θ ) 2 \underset{\theta} {argmin} (y_i - x_i^{\mathsf{T}}\theta)^2 θargmin(yi−xiTθ)2
Limitations of normal equation with large n p n_p np
- computational cost of ( X T X ) − 1 (X^\mathsf{T}X)^{-1} (XTX)−1
- singularity, with large n p n_p np, its columns might turn to be highly correlated
1.4 Gradient Descent Process to get θ ^ \hat\theta θ^
The Logic: start with θ = ( θ 0 , θ 1 ) \theta = (\theta_0,\theta_1) θ=(θ0,θ1) , then keep to change ( θ 0 , θ 1 ) (\theta_0,\theta_1) (θ0,θ1) to reduce J ( θ 0 , θ 1 ) J(\theta_0,\theta_1) J(θ0,θ1) until reach m i n θ 0 , θ 1 J ( θ 0 , θ 1 ) \underset{\theta_0,\theta_1} {min}J(\theta_0,\theta_1) θ0,θ1minJ(θ0,θ1)
1.4.1 Algorithm
- Initialize ( θ 1 , θ 2 , . . . , θ n ) (\theta_1,\theta_2,...,\theta_n) (θ1,θ2,...,θn)
- ( θ ~ 1 , θ ~ 2 , . . . , θ ~ n ) = ( θ 1 , θ 2 , . . . , θ n ) (\tilde\theta_1,\tilde\theta_2,...,\tilde\theta_n)=(\theta_1,\theta_2,...,\theta_n) (θ~1,θ~2,...,θ~n)=(θ1,θ2,...,θn)
- Update θ i \theta_i θi with θ i = θ ~ i − a ⋅ ∂ ∂ θ i J ( θ ~ 1 , θ ~ 2 , . . . , θ ~ n ) \theta_i=\tilde\theta_i - a \cdot \frac{\partial}{\partial\theta_i}J(\tilde\theta_1,\tilde\theta_2,...,\tilde\theta_n) θi=θ~i−a⋅∂θi∂J(θ~1,θ~2,...,θ~n), where a a a is the learning rate
- Repeat 2, 3 until θ i \theta_i θi converges
Gradient descent algorithm for linear regression
cost function for multiple linear regression model:
J ( θ ) = 1 2 n d ∑ i = 1 n d ( x i T θ − y i ) 2 J(\theta)=\frac{1}{2n_d}\sum_{i=1}^{n_d}(x_i^\mathsf{T} \theta - y_i)^2 J(θ)=2nd1i=1∑nd(xiTθ−yi)2
The updating step of gradient descent for linear regression with the given cost function:
θ j ( k 1 ) = θ j ( k ) − a 1 n d ∑ i = 1 n d ( x i T θ ( k ) − y i ) x i j \theta_j^{(k 1)}=\theta_j^{(k)}-a\frac{1}{n_d}\sum_{i=1}^{n_d}(x_i^\mathsf{T}\theta^{(k)}-y_i)x_{ij} θj(k 1)=θj(k)−and1i=1∑nd(xiTθ(k)−yi)xij
1.4.2 Learning Rate
The learning rate a a a used in the algorithm should be > 0, constant.
- Learning rate can affect convergence and speed of convergence:
Too small learning rate might lead to slow convergence, while too large learning rate might lead to non-convergence or divergence. - Can be selected using a validation set or cross-validation.
1.4.3 Stoping Convergence
Check the error tolerance close to 0:
- Absolute error tolerance:
ε a b s = ∣ J ( θ ( k 1 ) ) − J ( θ k ) ∣ \varepsilon_{abs}=|J(\theta^{(k 1)})-J(\theta^{k})| εabs=∣J(θ(k 1))−J(θk)∣ - Relative error tolerance:
ε r e l = ∣ J ( θ ( k 1 ) ) − J ( θ k ) J ( θ ( k 1 ) ) ∣ \varepsilon_{rel}=\left|\frac{J(\theta^{(k 1)})-J(\theta^{k})}{J(\theta^{(k 1)})}\right| εrel=∣∣∣∣J(θ(k 1))J(θ(k 1))−J(θk)∣∣∣∣
2 Logistic Regression
Output: { 0 , 1 } \{0,1\} {0,1}
Decision boundary: x i T θ x_i^{\mathsf{T}}\theta xiTθ
Hypothesis: h θ ( x i ) = g ( x i T θ ) = 1 1 e x p ( − x i T θ ) h_\theta(x_i)=g(x_i^\mathsf{T}\theta) =\frac{1}{1 exp(-x_i^\mathsf{T}\theta)} hθ(xi)=g(xiTθ)=1 exp(−xiTθ)1
Cost function:
J ( θ 0 , θ 1 ) = { − l o g ( h ( θ 0 , θ 1 ) ( x ) ) , i f y = 1 − l o g ( 1 − h ( θ 0 , θ 1 ) ( x ) ) , i f y = 0 J(\theta_0,\theta_1)=\left\{ \begin{array}{cc} -log(h_{(\theta_0,\theta_1)}(x)),~~~~~~~~~~~if~y=1 & \\ -log(1-h_{(\theta_0,\theta_1)}(x)),~~~~~~~~~~~if~y=0 & \end{array} \right. J(θ0,θ1)={−log(h(θ0,θ1)(x)), if y=1−log(1−h(θ0,θ1)(x)), if y=0
General form for n samples cost function:
J ( θ ) = − 1 n ∑ i = 1 n ( y i l o g ( h θ ( x i ) ) ( 1 − y i ) l o g ( 1 − h θ ( x i ) ) ) J(\theta)=-\frac{1}{n}\sum^n_{i=1}(y_i log(h_\theta(x_i)) (1-y_i)log(1-h_\theta(x_i))) J(θ)=−n1i=1∑n(yilog(hθ(xi)) (1−yi)log(1−hθ(xi)))
Classification:
- x i T θ ≥ 0 ⟹ h θ ( x i ) ≥ 0.5 ⟹ y ^ = 1 x_i^\mathsf{T}\theta \ge 0 ~\Longrightarrow ~h_\theta(x_i) \ge 0.5 ~\Longrightarrow ~ \hat{y}=1 xiTθ≥0 ⟹ hθ(xi)≥0.5 ⟹ y^=1
y = 1 y=1 y=1, then J ( θ ) → 0 J(\theta)~\rightarrow~0 J(θ) → 0
y = 0 , J ( θ ) → ∞ y=0,~J(\theta)~\rightarrow~\infty y=0, J(θ) → ∞ - x i T θ < 0 ⟹ h θ ( x i ) < 0.5 ⟹ y ^ = 0 x_i^\mathsf{T}\theta < 0 ~\Longrightarrow ~h_\theta(x_i) < 0.5 ~\Longrightarrow ~ \hat{y}=0 xiTθ<0 ⟹ hθ(xi)<0.5 ⟹ y^=0
y = 1 y=1 y=1, then J ( θ ) → ∞ J(\theta)~\rightarrow~\infty J(θ) → ∞
y = 0 , J ( θ ) → 0 y=0,~J(\theta)~\rightarrow~0 y=0, J(θ) → 0
3 Avoid Overfitting
higher overfitting ⟹ \Longrightarrow ⟹ higher variance of the predictions
higher underfitting ⟹ \Longrightarrow ⟹ higher bias of the predictions
3.1 Penalizing parameters
∑ i = 1 n d ( y i − h θ ( x i ) ) 2 λ r ( θ ) \sum^{n_d}_{i=1}(y_i - h_\theta(x_i))^2 \lambda r(\theta) i=1∑nd(yi−hθ(xi))2 λr(θ), where λ r ( θ ) \lambda r(\theta) λr(θ) is the Penalty, r ( θ ) r(\theta) r(θ) is the parameter penalty, λ \lambda λ is the regularization parameter.l
regularization parameter λ \lambda λ
- λ → 0 \lambda ~ \rightarrow~0 λ → 0 regularized regression tend to linear regression estimates
- λ → ∞ \lambda ~ \rightarrow~\infty λ → ∞ parameters get penalized more, the less the over-fitting to the data
- select λ \lambda λ via cross-validation
parameter penalty functions r ( θ ) r(\theta) r(θ)
Some typical parameter penalty functions (参考distance)
- d − v a r i a t i o n d-variation d−variation function: r ( θ ) = ( ∑ i = 1 n p ∣ θ i ∣ d ) 1 / d r(\theta)=\left(\sum^{n_p}_{i=1}|\theta_i|^d\right)^{1/d} r(θ)=(∑i=1np∣θi∣d)1/d
- L 1 n o r m : r ( θ ) = ∑ i = 1 n p ∣ θ i ∣ L_{1}~norm:~r(\theta)=\sum^{n_p}_{i=1}|\theta_i| L1 norm: r(θ)=∑i=1np∣θi∣
- s q u a r e d L 2 n o r m r ( θ ) = ∑ i = 1 n p ∣ θ i ∣ 2 squared~L_2~norm~r(\theta)=\sum^{n_p}_{i=1}|\theta_i|^2 squared L2 norm r(θ)=∑i=1np∣θi∣2
3.2 Regression model with penalty function
3.2.1 Lasso regression
Lasso regression uses L 1 n o r m L_1~norm L1 norm as penalty function. Lasso regression “zeroes out” coefficients, so it performs variable selection and to some extent parameter shrinkage.
-
Cost function
J L ( θ ) = ∣ ∣ y − X θ ∣ ∣ 2 λ ∣ ∣ θ ∣ ∣ 1 = ∑ i = 1 n d ( y i − x i T θ ) 2 λ ∑ j = 1 n d ∣ θ j ∣ J_L(\theta)=||y-X\theta||^2 \lambda||\theta||_1= \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2 \lambda\sum^{n_d}_{j=1}|\theta_j| JL(θ)=∣∣y−Xθ∣∣2 λ∣∣θ∣∣1=i=1∑nd(yi−xiTθ)2 λj=1∑nd∣θj∣ -
Gradient of Cost function
∂ J L ( θ ) ∂ θ 1 , . . . ∂ J L ( θ ) ∂ θ n d \frac{\partial J_L{(\theta)}}{\partial \theta_1},... \frac{\partial J_L{(\theta)}}{\partial \theta_{nd}} ∂θ1∂JL(θ),...∂θnd∂JL(θ)
note:
∂ ∑ i = 1 n d ( y i − x i T θ ) 2 ∂ θ 1 = 2 x 1 i ∑ i = 1 n d ( y i − x i T θ ) \frac{\partial \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2}{\partial \theta_1} = 2x_{1i}\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta) ∂θ1∂∑i=1nd(yi−xiTθ)2=2x1ii=1∑nd(yi−xiTθ)
- Lasso estimate
θ ^ L = a r g m i n θ ( ∑ i = 1 n d ( y i − x i T θ ) 2 λ ∑ j = 1 n d ∣ θ j ∣ ) \hat\theta_L=\underset{\theta}{argmin}\left(\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2 \lambda\sum^{n_d}_{j=1}|\theta_j| \right) θ^L=θargmin(i=1∑nd(yi−xiTθ)2 λj=1∑nd∣θj∣)
Can’t be solved, but J L ( θ ) J_L(\theta) JL(θ) is convex, using the least angle regression algorithm to solve.
3.2.2 Ridge regression
Ridge regression uses L 2 n o r m L_2~norm L2 norm as penalty function. So it does not “zero out” coefficients, i.e. it can not perform variable selection, but rather shrinks parameter values.
-
Cost function
J R ( θ ) = ∣ ∣ y − X θ ∣ ∣ 2 λ ∣ ∣ θ ∣ ∣ 2 2 = ∑ i = 1 n d ( y i − x i T θ ) 2 λ ∑ j = 1 n d ( θ j ) 2 J_R(\theta)=||y-X\theta||^2 \lambda||\theta||_2^2= \sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2 \lambda\sum^{n_d}_{j=1}(\theta_j)^2 JR(θ)=∣∣y−Xθ∣∣2 λ∣∣θ∣∣22=i=1∑nd(yi−xiTθ)2 λj=1∑nd(θj)2 -
ridge estimate
θ ^ R = ( X T X λ I ) − 1 X T y \hat\theta_R = (X^\mathsf{T} X \lambda I)^{-1}X^\mathsf{T}y θ^R=(XTX λI)−1XTy
This is the closed form solution.
The ridge estimate is biased because:
E ( θ ^ R ) = E ( ( X T X λ I ) − 1 X T y ) = ( X T X λ I ) − 1 X T E ( y ) = ( X T X λ I ) − 1 X T X θ = ( X T X λ I ) − 1 ( ( X T X ) − 1 ) − 1 θ = [ ( X T X ) − 1 ( X T X λ I ) ] − 1 θ = [ I λ ( X T X ) − 1 ] − 1 θ \begin{aligned} E(\hat\theta_R) &= E\left( (X^\mathsf{T} X \lambda I)^{-1}X^\mathsf{T}y \right) \\ &=(X^\mathsf{T} X \lambda I)^{-1}X^\mathsf{T} E(y)\\ &=(X^\mathsf{T} X \lambda I)^{-1}X^\mathsf{T}X \theta\\ &=(X^\mathsf{T} X \lambda I)^{-1}((X^\mathsf{T}X)^{-1})^{-1}\theta\\ &=\left[(X^\mathsf{T}X)^{-1}(X^\mathsf{T} X \lambda I)\right]^{-1} \theta\\ &=\left[ I \lambda(X^\mathsf{T}X)^{-1}\right]^{-1}\theta \end{aligned} E(θ^R)=E((XTX λI)−1XTy)=(XTX λI)−1XTE(y)=(XTX λI)−1XTXθ=(XTX λI)−1((XTX)−1)−1θ=[(XTX)−1(XTX λI)]−1θ=[I λ(XTX)−1]−1θ
X T X X^\mathsf{T}X XTX is positive definite, λ > 0 \lambda > 0 λ>0 by definition, so E ( θ ^ R ) ≠ θ E(\hat\theta_R)\neq \theta E(θ^R)=θ
Tips:
Although ridge regression does not perform variable selection, it performs grouped selection. So if one variable amongst a group of correlated ones is selected, ridge regression automatically includes the whole group. Ridge regression can resolve near multicollinearity.
3.2.3 Elastic net
The elastic net penalty is a compromise between Lasso and ridge.
- Cost function
J E ( θ ) = ∣ ∣ y − X θ ∣ ∣ 2 λ 1 ∣ ∣ θ ∣ ∣ 1 λ 2 ∣ ∣ θ ∣ ∣ 2 2 J_E(\theta)=||y-X\theta||^2 \lambda_1||\theta||_1 \lambda_2||\theta||_2^2 JE(θ)=∣∣y−Xθ∣∣2 λ1∣∣θ∣∣1 λ2∣∣θ∣∣22
if X is an n d × n p n_d\times n_p nd×np design matrix
J E ( θ ) = ∑ i = 1 n d ( y i − x i T θ ) 2 λ 1 ∑ j = 1 n p ∣ θ j ∣ λ 2 ∑ j = 1 n p θ j 2 J_E(\theta)=\sum^{n_d}_{i=1}(y_i - x_i^{\mathsf{T}}\theta)^2 \lambda_1\sum^{n_p}_{j=1}|\theta_j| \lambda_2\sum^{n_p}_{j=1}\theta_j^2 JE(θ)=i=1∑nd(yi−xiTθ)2 λ1j=1∑np∣θj∣ λ2j=1∑npθj2 - elastic net estimate
θ ^ E = a r g m i n θ ( ∣ ∣ y − X θ ∣ ∣ 2 λ 1 ∣ ∣ θ ∣ ∣ 1 λ 2 ∣ ∣ θ ∣ ∣ 2 2 ) \hat\theta_E =\underset{\theta}{argmin}\left(||y-X\theta||^2 \lambda_1||\theta||_1 \lambda_2||\theta||_2^2\right) θ^E=θargmin(∣∣y−Xθ∣∣2 λ1∣∣θ∣∣1 λ2∣∣θ∣∣22)
no closed form solution for θ E \theta_E θE, but J E ( θ ) J_E(\theta) JE(θ) is convex to have a unique minimum
4 Learning curve
Increase prediction accuracy:
- variable selection: increase or reduce number of covariates
- add polynomial features: { x 2 , x 1 x 2 } \{x^2, x_1x_2\} {x2,x1x2}
- regularized regression (Lasso, ridge regression and elastic nets):
- Collect more data
Learning curve:
- A learning curve is a metric of prediction accuracy, like cost function J ( θ ) J(\theta) J(θ)
- Or error metric (function of a parameter that affects the metric)
这篇好文章是转载于:学新通技术网
- 版权申明: 本站部分内容来自互联网,仅供学习及演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,请提供相关证据及您的身份证明,我们将在收到邮件后48小时内删除。
- 本站站名: 学新通技术网
- 本文地址: /boutique/detail/tanhiefaki
-
photoshop保存的图片太大微信发不了怎么办
PHP中文网 06-15 -
photoshop扩展功能面板显示灰色怎么办
PHP中文网 06-14 -
word里面弄一个表格后上面的标题会跑到下面怎么办
PHP中文网 06-20 -
《学习通》视频自动暂停处理方法
HelloWorld317 07-05 -
TikTok加速器哪个好免费的TK加速器推荐
TK小达人 10-01 -
Android 11 保存文件到外部存储,并分享文件
Luke 10-12 -
excel下划线不显示怎么办
PHP中文网 06-23 -
微信公众号没有声音提示怎么办
PHP中文网 03-31 -
微信运动停用后别人还能看到步数吗
PHP中文网 07-22 -
excel打印预览压线压字怎么办
PHP中文网 06-22