Machine Learning Note - Convex Optimization

11 November 2019


I’ve been taking an online Machine Learning class recently. This post is my note on convex optimization part.



1. Overview

  • AI problem = Model + Optimization (Similar to Program = Data Structure + Algorithm)
  • Models: DL, LR, SVM, XGBoost, NN, RL
  • Optimization: GD/SGD, Adagrad/Adam, Newton’s, L-BFGS, Coordinate Descent, EM.

2. Standard Form

Any optimization problem can be represented as the following standard form:


  • Linear Regression:
  • Logistic Regression:
  • Support Vector Machine:
  • Collaborative Filtering:
  • K-Means

3. Categories

  1. Convex vs Non-Convex
  2. Continuous vs Discrete
  3. Constrained vs Unconstrained
  4. Smooth vs Non-smooth

Solving a non-convex optimization problem:

  1. For some simple function, we can use brute-force search to validate all feasible solutions.
  2. Relax some constrains and convert the problem to a convex function.
  3. Directly solving the problem using a optimization algorithm, try to find a better local optimal.
Convex function has a unique global optimal.
Convex function has a unique global optimal.
Non-convex function has a more than one local optimals.
Non-convex function has more than one local optimals.

Convex Optimization

1. Convex Set:

$\forall x,y \in C, \forall a \in [0, 1].$ $C$ is a convex set if $ax+(1-a)y\in C$ .

An example of convex set
An example of convex set

Convex Set VS Non-Convex Set:

Left: Convex, Middle: Non-Convex, Right: Non-Convex

Examples of Convex Set:

  • $R^n$
  • Positive Integers $R^n_+$
  • Norm $|x| \le 1$
  • Affine set: All solution of a linear equation set $Ax = b$.
  • Halfspace: All solution of a inequality $Ax \le b$.

Theroem: Intersection of convex sets is convex.

2. Convex Function

2.1 Definition

Definition. Let $X$ be a convex set in a real vector space and let $f:X\to R$ be a function.

  • $f$ is called convex if:
  • $f$ is called strickly convex if:
Illustration of the convex definition.

Property: Summation of convex functions is convex function.

2.2 First Order Convexity Condition

Theorem. Suppose $f: R^n \to R$ is differentiable. Then, $f$ is convex $\iff f(y)\ge f(x)+\nabla(x)^T(y-x), \forall x, y \in dom(f).$

First order convexity condition.
2.3 Second Order Convexity Condtion

Theorem. Suppose $f: R^n \to R$ is twice differentiable. Then $f$ is convex $\iff \nabla^2f(x)\succeq0, \forall x \in dom(f).$

Note that we use $\succeq$ (succeeds or equals) here instead of $\ge$ (greater than or equals). $A\succeq0$ means that $A$ is a positive semi-definite (PSD) matrix. (i.e.; all of the eigenvalues of $A$ are positive.

3. Proof of Convexity

Example 1. Prove that Linear Function: $f(x)=b^Tx + c$ is convex.

Proof: By definition, $\forall x_1, x_2 \in dom(f)$, we have:

The inequality holds, hence the linear function $f(x)=bTx+c$ is convex.

Example 2. Prove that qualdratic function: $f(x)=\frac{1}{2}x^TAx + b^Tx + c, \forall A\succeq 0 \ $ is convex.

Proof: We will use second order convexity condition.

Let $\nabla$ be the partial derivative operator $\frac{\partial}{\partial x}$

Since $A\succeq 0$, $\nabla^2f$ is a positive semi-definite matrix. Thus, the qualdratic function is convex.

Example 3. Prove that p-norm is convex. .

Proof: Let , using Minkowski inequality (triangle inequality), we have:

So by definition, p-norm is convex.

Example 4. Prove that logistic regression loss is convex.


Proof: We will the second order convexity condition to prove this.

Let $\nabla$ denotes partial derivative operator $\frac{\partial}{\partial w}$

Now we can calculate the first order derivative:

Now we calculate the second order derivative:

Then, we need to prove that matrix $p_n(1-p_n)x_n{x_n}^T$ is PSD:

Since $p_n$ is a sigmoid function, we have $p_n(1-p_n) \geq 0$, which leads to $v^THv \geq 0$.

Thus, H is PSD and we proved that the logistic regression loss is a convex function.

Note: To prove a matrix $H$ is PSD, we need to prove that $\forall v\in R, v\neq0, v^THv \ge0.$