## Machine Learning Note - Convex Optimization

11 November 2019

### Introduction

I’ve been taking an online Machine Learning class recently. This post contains my notes on the convex optimization part.

### Optimization

#### 1. Overview

- AI problem = Model + Optimization (similar to Program = Data Structure + Algorithm)
- Models: DL, LR, SVM, XGBoost, NN, RL
- Optimization algorithms: GD/SGD, Adagrad/Adam, Newton's method, L-BFGS, Coordinate Descent, EM (see the gradient descent sketch below).
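
To make the optimization side concrete, here is a minimal sketch of vanilla gradient descent on a toy objective; the learning rate, stopping rule, and the quadratic example are my own illustrative assumptions, not from the course.

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, tol=1e-8, max_iter=1000):
    """Minimize a differentiable function given its gradient."""
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:  # stop once the gradient is tiny
            break
        x = x - lr * g               # step against the gradient
    return x

# Toy objective f(x) = (x - 3)^2 with gradient 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3.0), x0=np.array([0.0]))
print(x_min)  # ~3.0, the unique minimizer
```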

#### 2. Standard Form

Any optimization problem can be written in the following standard form:

$$\begin{aligned}
\text{minimize} \quad & f_0(x) \\
\text{subject to} \quad & f_i(x) \le 0, \quad i = 1, \dots, m \\
& h_j(x) = 0, \quad j = 1, \dots, p
\end{aligned}$$

where $f_0$ is the objective function, the $f_i$ are the inequality constraints, and the $h_j$ are the equality constraints.

Examples:

- Linear Regression: $\min_w \sum_{n=1}^{N} (w^Tx_n - y_n)^2$ (solved numerically in the sketch after this list)

- Logistic Regression: $\min_w -\sum_{n=1}^{N} \left[ y_n\log p_n + (1-y_n)\log(1-p_n) \right]$, where $p_n = \sigma(w^Tx_n)$

- Support Vector Machine: $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $y_n(w^Tx_n + b) \ge 1, \forall n$

- Collaborative Filtering: $\min_{U,V} \sum_{(i,j) \in \Omega} ({u_i}^Tv_j - r_{ij})^2$, where $\Omega$ is the set of observed ratings

- K-Means: $\min_{\mu, C} \sum_{k=1}^{K} \sum_{x_n \in C_k} \|x_n - \mu_k\|^2$
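
To see one of these solved numerically in the unconstrained standard form, here is a minimal least-squares sketch using `scipy.optimize.minimize`; the synthetic data and the choice of BFGS are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # synthetic design matrix
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)  # noisy targets

# Unconstrained standard form: minimize f0(w) = ||Xw - y||^2
f0 = lambda w: np.sum((X @ w - y) ** 2)
res = minimize(f0, x0=np.zeros(3), method="BFGS")
print(res.x)  # close to w_true, since the objective is convex
```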

#### 3. Categories

- Convex vs Non-Convex
- Continuous vs Discrete
- Constrained vs Unconstrained
- Smooth vs Non-smooth

Approaches to solving a non-convex optimization problem:

- For some simple functions, we can use brute-force search to evaluate all feasible solutions (see the sketch after this list).
- Relax some constraints and convert the problem into a convex one.
- Directly solve the problem with an optimization algorithm and try to find a good local optimum.
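
As a toy illustration of the brute-force option, here is a coarse grid search over a non-convex one-dimensional function; the objective, feasible interval, and grid resolution are my own illustrative choices.

```python
import numpy as np

# A non-convex objective with several local minima.
f = lambda x: np.sin(3 * x) + 0.1 * x ** 2

# Brute-force search: evaluate f on a dense grid of feasible points.
grid = np.linspace(-5.0, 5.0, 10001)
values = f(grid)
best = grid[np.argmin(values)]
print(best, f(best))  # approximate global minimizer on [-5, 5]
```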

### Convex Optimization

#### 1. Convex Set

$C$ is a convex set if $\forall x, y \in C$ and $\forall a \in [0, 1]$: $ax + (1-a)y \in C$.

**Convex Set vs. Non-Convex Set:** intuitively, a set is convex if the line segment between any two of its points lies entirely inside the set; a set with a "dent" (e.g., a crescent) is not convex.

**Examples of Convex Set:**

- $R^n$
- The non-negative orthant $R^n_+$
- Norm ball: $\|x\| \le 1$
- Affine set: all solutions of a system of linear equations $Ax = b$.
- Halfspace: all solutions of a linear inequality $a^Tx \le b$.

**Theorem: The intersection of convex sets is convex.**
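
The proof follows in one line from the definition; a sketch:

```latex
% Proof sketch: any intersection of convex sets is convex.
\begin{proof}
Let $C = \bigcap_i C_i$ with each $C_i$ convex, and take $x, y \in C$ and
$a \in [0, 1]$. For every $i$ we have $x, y \in C_i$, so convexity of $C_i$
gives $ax + (1-a)y \in C_i$. Since this holds for every $i$, the point
$ax + (1-a)y$ lies in $\bigcap_i C_i = C$, hence $C$ is convex.
\end{proof}
```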

#### 2. Convex Function

##### 2.1 Definition

**Definition.** Let $X$ be a convex set in a real vector space and let $f:X\to R$ be a function.

- $f$ is called **convex** if: $\forall x_1, x_2 \in X, \forall t \in [0, 1]$: $f(tx_1 + (1-t)x_2) \le tf(x_1) + (1-t)f(x_2)$
- $f$ is called **strictly convex** if: $\forall x_1 \ne x_2 \in X, \forall t \in (0, 1)$: $f(tx_1 + (1-t)x_2) < tf(x_1) + (1-t)f(x_2)$

**Property: The sum of convex functions is a convex function.**
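
This also follows directly from the definition; a sketch for the sum of two convex functions (the general case follows by induction):

```latex
% Proof sketch: the sum of two convex functions is convex.
\begin{proof}
Let $h = f + g$ with $f, g$ convex on $X$, and take $x, y \in X$, $t \in [0, 1]$.
\begin{align*}
h(tx + (1-t)y) &= f(tx + (1-t)y) + g(tx + (1-t)y) \\
               &\le t f(x) + (1-t) f(y) + t g(x) + (1-t) g(y) \\
               &= t\,h(x) + (1-t)\,h(y).
\end{align*}
\end{proof}
```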

##### 2.2 First Order Convexity Condition

**Theorem.** Suppose $f: R^n \to R$ is *differentiable*. Then, $f$ is convex $\iff f(y) \ge f(x) + \nabla f(x)^T(y-x), \forall x, y \in dom(f).$
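
A quick numerical sanity check of the first-order condition on the convex function $f(x) = x^2$; the sample points are arbitrary.

```python
import numpy as np

f = lambda x: x ** 2     # a convex function
grad = lambda x: 2 * x   # its gradient

rng = np.random.default_rng(1)
xs, ys = rng.normal(size=1000), rng.normal(size=1000)

# First-order condition: f(y) >= f(x) + grad(x) * (y - x) for all x, y.
print(np.all(f(ys) >= f(xs) + grad(xs) * (ys - xs) - 1e-12))  # True
```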

##### 2.3 Second Order Convexity Condition

**Theorem.** Suppose $f: R^n \to R$ is **twice differentiable**. Then $f$ is convex $\iff \nabla^2f(x)\succeq0, \forall x \in dom(f).$

Note that we use $\succeq$ (succeeds or equals) here instead of $\ge$ (greater than or equals). $A \succeq 0$ means that $A$ is a positive semi-definite (PSD) matrix (i.e., all of the eigenvalues of $A$ are non-negative).
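
Numerically, the condition can be checked by inspecting the eigenvalues of the Hessian. A sketch with a quadratic whose Hessian is a random PSD matrix (constructed as $B^TB$, which is always PSD):

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.normal(size=(4, 4))
A = B.T @ B  # B^T B is always symmetric positive semi-definite

# For f(x) = 0.5 * x^T A x + b^T x + c, the Hessian is A everywhere.
eigenvalues = np.linalg.eigvalsh(A)
print(eigenvalues)                    # all >= 0 (up to round-off)
print(np.all(eigenvalues >= -1e-10))  # True: A is PSD, so f is convex
```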

#### 3. Proof of Convexity

**Example 1.** Prove that the linear function $f(x) = b^Tx + c$ is convex.

**Proof:** By definition, $\forall x_1, x_2 \in dom(f)$ and $\forall t \in [0, 1]$, we have:

$$f(tx_1 + (1-t)x_2) = b^T(tx_1 + (1-t)x_2) + c = t(b^Tx_1 + c) + (1-t)(b^Tx_2 + c) = tf(x_1) + (1-t)f(x_2)$$

The convexity inequality holds (with equality), hence the linear function $f(x) = b^Tx + c$ is convex.

**Example 2.** Prove that the quadratic function $f(x) = \frac{1}{2}x^TAx + b^Tx + c$, where $A \succeq 0$, is convex.

**Proof:** We will use the second order convexity condition.

Let $\nabla$ be the partial derivative operator $\frac{\partial}{\partial x}$. Since $A \succeq 0$ implies $A$ is symmetric, we have:

$$\nabla f(x) = Ax + b$$

$$\nabla^2 f(x) = A$$

Since $A \succeq 0$, $\nabla^2 f$ is a positive semi-definite matrix. Thus, the quadratic function is convex.

**Example 3.** Prove that the p-norm $f(x) = \|x\|_p$ is convex.

**Proof:** Let $t \in [0, 1]$. Using the Minkowski inequality (triangle inequality) and the absolute homogeneity of norms, we have:

$$\|tx + (1-t)y\|_p \le \|tx\|_p + \|(1-t)y\|_p = t\|x\|_p + (1-t)\|y\|_p$$

So by definition, the p-norm is convex.
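
A quick numerical spot-check of this inequality with NumPy; the points, $p = 3$, and $t = 0.3$ are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
p, t = 3, 0.3
x, y = rng.normal(size=5), rng.normal(size=5)

lhs = np.linalg.norm(t * x + (1 - t) * y, ord=p)
rhs = t * np.linalg.norm(x, ord=p) + (1 - t) * np.linalg.norm(y, ord=p)
print(lhs <= rhs + 1e-12)  # True: the convexity inequality holds
```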

**Example 4.** Prove that the logistic regression loss is convex:

$$L(w) = -\sum_{n=1}^{N}\left[y_n\log p_n + (1-y_n)\log(1-p_n)\right]$$

where:

$$p_n = \sigma(w^Tx_n) = \frac{1}{1 + e^{-w^Tx_n}}$$

**Proof:** We will use the second order convexity condition to prove this.

Let $\nabla$ denote the partial derivative operator $\frac{\partial}{\partial w}$, and note that the sigmoid satisfies $\nabla p_n = p_n(1-p_n)x_n$.

Now we can calculate the first order derivative:

$$\nabla L(w) = \sum_{n=1}^{N}(p_n - y_n)x_n$$

Now we calculate the second order derivative:

$$\nabla^2 L(w) = \sum_{n=1}^{N}p_n(1-p_n)x_n{x_n}^T = H$$

Then, we need to prove that each matrix $p_n(1-p_n)x_n{x_n}^T$ is PSD. For any nonzero vector $v$:

$$v^T\left(p_n(1-p_n)x_n{x_n}^T\right)v = p_n(1-p_n)({x_n}^Tv)^2$$

Since $p_n$ is the output of a sigmoid, $p_n \in (0, 1)$, so $p_n(1-p_n) \ge 0$ and $({x_n}^Tv)^2 \ge 0$, which leads to $v^THv \ge 0$.

Thus, $H$ is PSD and we have proved that the logistic regression loss is a convex function.

**Note:** To prove a matrix $H$ is PSD, we need to prove that $\forall v \in R^n, v \neq 0$: $v^THv \ge 0.$
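
To close the loop, here is a numerical check that this Hessian is PSD on synthetic data; the data, weights, and dimensions are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))      # 50 samples x_n, 3 features
w = rng.normal(size=3)

p = 1.0 / (1.0 + np.exp(-X @ w))  # sigmoid probabilities p_n
# Hessian of the logistic loss: H = sum_n p_n (1 - p_n) x_n x_n^T
H = (X * (p * (1 - p))[:, None]).T @ X

eigenvalues = np.linalg.eigvalsh(H)
print(np.all(eigenvalues >= -1e-10))  # True: H is PSD, the loss is convex
```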