Machine Learning Note - Convex Optimization
11 November 2019
Introduction
I’ve been taking an online Machine Learning class recently. This post contains my notes on the convex optimization part.
Contents
Optimization
1. Overview
- AI problem = Model + Optimization (Similar to Program = Data Structure + Algorithm)
- Models: DL, LR, SVM, XGBoost, NN, RL
- Optimization: GD/SGD, Adagrad/Adam, Newton’s, L-BFGS, Coordinate Descent, EM.
2. Standard Form
Any optimization problem can be written in the following standard form:
\[\begin{align} \text{Minimize} \ \ \ & f_0(x) \\ \text{s.t.} \ \ &f_i(x)\le 0, \ \ i={1,...,K}\\ & g_j(x)=0, \ \ j={1,...,L} \\ \end{align}\]
Examples:
- Linear Regression: $\min_w \ \|Xw-y\|_2^2$
- Logistic Regression: $\min_w \ -\sum_{n=1}^{N}\left(y_n\log(p_n)+(1-y_n)\log(1-p_n)\right)$
- Support Vector Machine: $\min_{w,b} \ \frac{1}{2}\|w\|^2 \ \ \text{s.t.} \ \ y_i(w^Tx_i+b)\ge1$
- Collaborative Filtering: $\min_{U,V} \ \sum_{(i,j)\in\Omega}(r_{ij}-u_i^Tv_j)^2$
- K-Means: $\min \ \sum_{k=1}^{K}\sum_{x_i\in C_k}\|x_i-\mu_k\|^2$
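As a small illustration of the standard form, here is a sketch of how such a problem could be handed to a generic solver. It assumes SciPy is available and uses a made-up toy objective with one inequality and one equality constraint; it is not one of the models listed above.

```python
import numpy as np
from scipy.optimize import minimize

# Toy instance of the standard form (hypothetical example):
#   minimize f0(x) = (x0 - 1)^2 + (x1 - 2)^2
#   s.t.     f1(x) = x0 + x1 - 2 <= 0
#            g1(x) = x0 - x1      = 0
f0 = lambda x: (x[0] - 1.0) ** 2 + (x[1] - 2.0) ** 2

constraints = [
    # SciPy's "ineq" convention is fun(x) >= 0, so f1(x) <= 0 becomes -f1(x) >= 0.
    {"type": "ineq", "fun": lambda x: -(x[0] + x[1] - 2.0)},
    {"type": "eq", "fun": lambda x: x[0] - x[1]},
]

result = minimize(f0, x0=np.zeros(2), method="SLSQP", constraints=constraints)
print(result.x)  # expected to be close to [1, 1]
```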
3. Categories
- Convex vs Non-Convex
- Continuous vs Discrete
- Constrained vs Unconstrained
- Smooth vs Non-smooth
Solving a non-convex optimization problem:
- For some simple functions, we can use brute-force search to evaluate all feasible solutions.
- Relax some constraints and convert the problem into a convex one.
- Directly solve the problem with an optimization algorithm and try to find a good local optimum.


Convex Optimization
1. Convex Set:
$C$ is a convex set if $\forall x,y \in C, \forall a \in [0, 1]: ax+(1-a)y\in C$.

Convex Set vs. Non-Convex Set:

Examples of Convex Sets:
- $R^n$
- Nonnegative orthant $R^n_+$
- Norm ball: $\{x : \|x\| \le 1\}$
- Affine set: all solutions of a linear system $Ax = b$.
- Halfspace: all solutions of a linear inequality $a^Tx \le b$.
Theorem: The intersection of convex sets is convex.
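A quick numerical sanity check of this theorem (a sketch, not a proof, assuming NumPy is available): sample points from the intersection of the unit ball and a halfspace, and verify that random convex combinations stay inside it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Intersection C of the unit ball {||x|| <= 1} and the halfspace {a^T x <= b}.
a, b = np.array([1.0, 1.0]), 0.5
in_C = lambda x: np.linalg.norm(x) <= 1.0 and a @ x <= b

# Rejection-sample points that lie in C.
points = [p for p in rng.uniform(-1, 1, size=(5000, 2)) if in_C(p)]

# Every convex combination of two points in C should stay in C.
for _ in range(1000):
    x, y = points[rng.integers(len(points))], points[rng.integers(len(points))]
    t = rng.uniform()
    assert in_C(t * x + (1 - t) * y)
print("all convex combinations stayed inside the intersection")
```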
2. Convex Function
2.1 Definition
Definition. Let $X$ be a convex set in a real vector space and let $f:X\to R$ be a function.
- $f$ is called convex if: \(\forall x_1, x_2 \in X, \forall t \in [0, 1]: f(tx_1+(1-t)x_2)\le tf(x_1)+(1-t)f(x_2)\)
- $f$ is called strictly convex if: \(\forall x_1\neq x_2 \in X, \forall t \in (0, 1): f(tx_1+(1-t)x_2)< tf(x_1)+(1-t)f(x_2)\)

Property: A sum of convex functions is convex.
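A quick numerical spot check of the definition and of this property (a sketch assuming NumPy, not a proof): pick two convex functions, form their sum, and test the defining inequality at random points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two convex functions and their sum.
f = lambda x: x ** 2
g = lambda x: abs(x)
h = lambda x: f(x) + g(x)

# Spot-check h(t*x1 + (1-t)*x2) <= t*h(x1) + (1-t)*h(x2) at random points.
for _ in range(10000):
    x1, x2 = rng.uniform(-10, 10, size=2)
    t = rng.uniform()
    assert h(t * x1 + (1 - t) * x2) <= t * h(x1) + (1 - t) * h(x2) + 1e-9
print("no violations found")
```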
2.2 First Order Convexity Condition
Theorem. Suppose $f: R^n \to R$ is differentiable. Then $f$ is convex $\iff f(y)\ge f(x)+\nabla f(x)^T(y-x), \forall x, y \in dom(f).$
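As a small illustration (again a numerical spot check assuming NumPy, not a proof), the first order condition can be tested for the convex function $f(x)=\|x\|_2^2$, whose gradient is $\nabla f(x)=2x$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Convex function f(x) = ||x||^2 and its gradient 2x.
f = lambda x: x @ x
grad_f = lambda x: 2 * x

# Spot-check the first order condition f(y) >= f(x) + grad_f(x)^T (y - x).
for _ in range(10000):
    x, y = rng.normal(size=(2, 3))
    assert f(y) >= f(x) + grad_f(x) @ (y - x) - 1e-9
print("first order condition held at all sampled points")
```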

2.3 Second Order Convexity Condition
Theorem. Suppose $f: R^n \to R$ is twice differentiable. Then $f$ is convex $\iff \nabla^2f(x)\succeq0, \forall x \in dom(f).$
Note that we use $\succeq$ (succeeds or equals) here instead of $\ge$ (greater than or equals). $A\succeq0$ means that $A$ is a positive semi-definite (PSD) matrix (i.e., all of the eigenvalues of $A$ are nonnegative).
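A common way to check $A\succeq0$ numerically for a symmetric matrix is to look at its smallest eigenvalue; the sketch below (assuming NumPy) does exactly that.

```python
import numpy as np

def is_psd(A, tol=1e-10):
    """Check positive semi-definiteness of a symmetric matrix via its eigenvalues."""
    return np.all(np.linalg.eigvalsh(A) >= -tol)

print(is_psd(np.array([[2.0, 1.0], [1.0, 2.0]])))   # True: eigenvalues 1 and 3
print(is_psd(np.array([[1.0, 0.0], [0.0, -1.0]])))  # False: eigenvalue -1
```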
3. Proof of Convexity
Example 1. Prove that the linear (affine) function $f(x)=b^Tx + c$ is convex.
Proof: By definition, we need to show that $\forall x_1, x_2 \in dom(f), \forall t \in [0,1]$: $f(tx_1+(1-t)x_2)\le tf(x_1)+(1-t)f(x_2)$, i.e.:
\[\begin{align} b^T(tx_1+(1-t)x_2)+c &\le t(b^Tx_1+c) + (1-t)(b^Tx_2 + c)\\ tb^Tx_1+(1-t)b^Tx_2+c &\le tb^Tx_1+tc + (1-t)b^Tx_2 + (1-t)c\\ c &\le tc + (1-t)c\\ c &\le c \end{align}\]Each line is equivalent to the previous one, and the last inequality clearly holds (in fact with equality), hence the linear function $f(x)=b^Tx+c$ is convex.
Example 2. Prove that the quadratic function $f(x)=\frac{1}{2}x^TAx + b^Tx + c$, with symmetric $A\succeq 0$, is convex.
Proof: We will use second order convexity condition.
Let $\nabla$ be the partial derivative operator $\frac{\partial}{\partial x}$. Since $A$ is symmetric:
\[\frac{\partial f}{\partial x} = \nabla f = Ax + b \\ \frac{\partial^2 f}{\partial x^2} = \nabla^2f = A\]Since $A\succeq 0$, $\nabla^2f$ is a positive semi-definite matrix. Thus, the quadratic function is convex.
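The derivatives above can be sanity-checked numerically, e.g. with a finite-difference Hessian on a randomly generated symmetric PSD matrix $A=M^TM$ (a sketch with made-up data, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random symmetric PSD matrix A = M^T M and quadratic f(x) = 0.5 x^T A x + b^T x + c.
M = rng.normal(size=(3, 3))
A, b, c = M.T @ M, rng.normal(size=3), 1.0
f = lambda x: 0.5 * x @ A @ x + b @ x + c

def numerical_hessian(f, x, eps=1e-5):
    """Finite-difference Hessian; exact (up to rounding) for a quadratic."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i], np.eye(n)[j]
            H[i, j] = (f(x + eps*e_i + eps*e_j) - f(x + eps*e_i)
                       - f(x + eps*e_j) + f(x)) / eps**2
    return H

x0 = rng.normal(size=3)
print(np.allclose(numerical_hessian(f, x0), A, atol=1e-3))  # True: Hessian equals A
print(np.all(np.linalg.eigvalsh(A) >= -1e-10))              # True: A is PSD
```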
Example 4. Prove that the p-norm ($p\ge1$) is convex. \(\|x\|_p=(\sum_{i=1}^{n}|x_i|^p)^{1/p}\)
Proof: Let \(f(x)=\|x\|_p\). For $\lambda \in [0,1]$, using the Minkowski inequality (triangle inequality) and the homogeneity of the norm, we have:
\[\|\lambda x_1+(1-\lambda)x_2\|_p \le \|\lambda x_1\|_p + \|(1-\lambda) x_2\|_p = \lambda\| x_1\|_p + (1-\lambda)\| x_2\|_p\]So, by definition, the p-norm is convex.
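As a quick numerical spot check (a sketch assuming NumPy, not a proof), the defining inequality can be tested for a few values of $p\ge1$ using `np.linalg.norm`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Spot-check convexity of the p-norm for several p >= 1.
for p in [1, 1.5, 2, 3]:
    f = lambda x: np.linalg.norm(x, ord=p)
    for _ in range(1000):
        x1, x2 = rng.normal(size=(2, 4))
        t = rng.uniform()
        assert f(t * x1 + (1 - t) * x2) <= t * f(x1) + (1 - t) * f(x2) + 1e-9
print("no violations found")
```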
Example 4. Prove that the logistic regression loss is convex: \(L(w)=-\sum_{n=1}^{N}\left(y_n\log(p_n) + (1-y_n)\log(1-p_n)\right)\)
where: \(p_n = \frac{1}{1+e^{-w^Tx_n}}\)
Proof: We will use the second order convexity condition to prove this.
Let $\nabla$ denote the partial derivative operator $\frac{\partial}{\partial w}$.
\[\begin{align} \nabla p&=\nabla \frac{1}{(1+e^{-{w^Tx}})} \\ &=-\frac{1}{(1+e^{-{w^Tx}})^2} \nabla(1+e^{-{w^Tx}}) \\ &=-\frac{1}{(1+e^{-{w^Tx}})^2} e^{-{w^Tx}} \nabla(-w^Tx) \\ &=-\frac{1}{(1+e^{-{w^Tx}})^2} e^{-{w^Tx}}\cdot(-x) \\ &=\frac{e^{-{w^Tx}}}{(1+e^{-{w^Tx}})^2} \cdot x \\ &=\frac{1}{(1+e^{-{w^Tx}})}\frac{e^{-{w^Tx}}}{(1+e^{-{w^Tx}})} \cdot x \\ &=\frac{1}{(1+e^{-{w^Tx}})}\frac{1+e^{-{w^Tx}}-1}{(1+e^{-{w^Tx}})} \cdot x \\ &=\frac{1}{(1+e^{-{w^Tx}})}(1-\frac{1}{(1+e^{-{w^Tx}})}) \cdot x \\ &=p(1-p)x \end{align}\]Now we can calculate the first order derivative:
\[\begin{align} \frac{\partial L(w)}{\partial w} &=\nabla{L(w)} \\ &=-\sum_{n=1}^{N}(y_n\nabla \log(p_n) + (1-y_n)\nabla \log(1-p_n)) \\ &=-\sum_{n=1}^{N}(y_n \frac{1}{p_n}\nabla p_n + (1-y_n)\frac{1}{1-p_n}\nabla (1-p_n)) \\ &=-\sum_{n=1}^{N}(y_n \frac{p_n(1-p_n)}{p_n}x_n + (1-y_n)\frac{-p_n(1-p_n)}{(1-p_n)}x_n) \\ &=-\sum_{n=1}^{N}(y_n(1-p_n)x_n - (1-y_n)p_nx_n) \\ &=-\sum_{n=1}^{N}(y_n - y_np_n - p_n + y_np_n)x_n \\ &=-\sum_{n=1}^{N}(y_n-p_n)x_n \\ &=\sum_{n=1}^{N}(p_n-y_n)x_n \end{align}\]Now we calculate the second order derivative:
\[\begin{align} \frac{\partial^2 L(w)}{\partial w^2} &=\sum_{n=1}^{N}\nabla((p_n-y_n)x_n) \\ &=\sum_{n=1}^{N}\nabla(p_nx_n) \\ &=\sum_{n=1}^{N}(p_n(1-p_n)x_n{x_n}^T) \end{align}\]Let $H$ denote this Hessian. Then, we need to prove that $H$ is PSD:
\[\begin{align} \forall v\in R^n: v^THv &= \sum_{n=1}^{N}v^Tp_n(1-p_n)x_n{x_n}^Tv \\ &= \sum_{n=1}^{N}p_n(1-p_n)(x_n^Tv)^2 \\ \end{align}\]Since $p_n$ is the output of a sigmoid, we have $p_n\in(0,1)$ and hence $p_n(1-p_n) \geq 0$, which leads to $v^THv \geq 0$.
Thus, $H$ is PSD and we have proved that the logistic regression loss is a convex function.
Note: To prove a matrix $H$ is PSD, we need to prove that $\forall v\in R^n, v^THv \ge0.$
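Finally, the Hessian $H=\sum_{n=1}^{N}p_n(1-p_n)x_n{x_n}^T$ can be formed on randomly generated toy data and its smallest eigenvalue checked; this is a numerical sanity check of the proof above, not part of it (a sketch assuming NumPy; the data is made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Random data for a toy logistic regression problem.
N, d = 100, 5
X = rng.normal(size=(N, d))
w = rng.normal(size=d)

# p_n = sigmoid(w^T x_n); Hessian of the loss is sum_n p_n (1 - p_n) x_n x_n^T.
p = 1.0 / (1.0 + np.exp(-X @ w))
H = (X * (p * (1 - p))[:, None]).T @ X

# PSD check: all eigenvalues of the (symmetric) Hessian are nonnegative.
print(np.linalg.eigvalsh(H).min() >= -1e-10)  # True
```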