TOC

Introduction

In optimization, duality refers to a pair of closely related problems, known as the primal problem and the dual problem. The primal problem is the original optimization problem that one wants to solve, while the dual problem is a different optimization problem that is derived from the primal problem.

The concept of duality is important in convex optimization because it provides useful insights into the structure of optimization problems and allows for the development of efficient algorithms for solving them. Specifically, the duality theorem states that under certain conditions, the optimal value of the primal problem is equal to the optimal value of the dual problem. This provides a way to check the optimality of a solution obtained from one problem by verifying its optimality in the other problem. Duality also allows for the development of optimization algorithms that solve the dual problem rather than the primal problem. This can be advantageous in certain situations, such as when the dual problem is easier to solve or has a simpler structure.

The Lagrangian

Let’s start the with the standard form of the optimization problem:

minimize \(f_0(x)\), subject to \(f_i(x) \leq 0, i=1..m, h_i(x) = 1..p, x \in R^n\). The domain D is intersection of all domains, not empty and the optimal value is \(p^*\). The main idea of Lagrangian duality is to augment the constraints into the object functions weightedly. As follows:

\(L: R^n x R^m x R^p \to R : L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i (x)\).

The domain \(L = D x R^m x R^p\). \(\lambda_i, \nu_i\) are called the Lagrange multipliers. The problem becomes the Lagrangian dual function: \(g: R^m x R^p \to R\) so that we find the minimum value of the Lagrangian over x: for \(\lambda \in R^m, \nu in R^p\):

\(g(\lambda, \nu) = inf_{x \in D} L(x, \lambda, \nu) = inf_{x \in D} (f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x))\).

The dual function will give us the lower bounds of the optimal value \(p^*\) of the problem. For any \(\lambda \geq 0\) and \(\nu: g(\lambda, \nu) \leq p^*\). We can prove it as follows. Suppose x is a feasible point: \(f_i(x) \leq 0, h_i(x) = 0, \lambda \geq 0\). So \(\sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x) \leq 0\). Therefore the Lagrangian \(L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^p \nu_i h_i(x) \leq f_o(x)\). This makes \(g(\lambda, \nu) = inf_{x \in D} L(x, \lambda, \nu) \leq L(x, \lambda, \nu) \leq f_0(x)\). Since \(g(\lambda, \nu) \leq f_0(x)\) for all feasible x, it is followed that \(g(\lambda, \nu) \leq p^*\).

To find the best lower bound, we consider the problem: maximize \(g(\lambda, \nu)\) subject to \(\lambda \geq 0\). This is called the Lagrange dual problem. The optimal value of the Larange dual problem, \(d^*\), is the best lower bound on \(p^*: d^* \leq p^*\), this equality is called weak duality. The difference \(p^* - d^*\) is referred to as the optimal duality gap. If equality happens, the gap is zero, we call it the strong duality.

Strong duality does not hold in general, but if the primal problem is convex: minimize \(f_0(x)\) subject to \(f_i(x) \leq 0, i=1..m, Ax=b\) with \(f_0,...f_m\) convex, we usually have strong duality. There are many results establishing conditions on the problem, beyong convexity, to make the strong duality holds. Those conditions are called constraint qualifications. One simple constraint qualification is Slater’s condition: there exist an \(x \in relint D\) such that \(f_i(x) < 0, i=1..m, Ax = b\). Such a point can be called strictly feasible, since the inequality constraints hold with strict inequalities. Slater’s theorem states that strong duality holds if Slater’s condition holds and the problem is convex.

Mixed strategies in matrix games

Consider a game of two players. Player 1 makes a move \(k \in \{1,...n\}\) and player 2 makes a move \(l \in \{1,...m\}\). Player 1 then make payment \(P_{kl}\) to player 2 where \(P \in R^{nxm}\) is the payoff matrix for the game. The goal of player 1 is to make the payment as small as possible, while the goal of player 2 is to maximize it. The players use mixed strategies, with the following probability distribution over the choices: \(prob(k=i) = u_i, i=1...n; prob(l=i)=v_i, i=1...m\). The expected payoff from player 1 to player 2 is then \(\sum_{k=1}^n \sum_{l=1}^m u_k v_l P_{kl} = u^T P v\). Player 1 would choose u to minimize \(u^T P v\) while player 2 wishes to maximize it.

If player 2 knows player 1’s strategy in advance, she would try to maximize the expected payoff \(sup \{ u^T P v \mid v \geq 0, \textbf{1}^T v = 1 \} = max_{i=1..m} (P^T u)_i\). Player 1 choose to minimize this payoff: \(max_{i=1..m} (P^T u)_i\) subject to \(u \geq 0, \textbf{1}^T u = 1\). The smallest expected payoff is denoted \(p^*_1\).

If player 1 knows player 2’s strategy in advance, she would try to minimize \(u^T P v : inf\{u^T P v \mid u \geq 0, \textbf{1}^T u = 1 \} = min_{i=1...n} (Pv)_i\). Player 2 then maximize this with her v: \(min_{i=1...n} (Pv)_i\) subject to \(v \geq 0, \textbf{1}^T v = 1\). The optimal value of this scenario is \(p^*_2\).

It seems that knowing opponent’s strategy is advantegous, \(p_1^* \geq p_2^*\), but with mixed strategies, using duality, we can prove that \(p_1^* = p_2^*\).

Let’s start by formulating the problem as Linear Programming: minimize t subject to \(u \geq 0, \textbf{1}^T u = 1, P^T u \leq t\textbf{1}, t \in R\). Adding the multiplier \(\lambda\) for \(P^T u \leq t \textbf{1}, \mu\) for \(u \geq 0\) and $ \nu \(for\) \textbf{1}^T u = 1 $$, the Lagrangian is:

\(t + \lambda^T (P^T u - t \textbf{1}) - \mu^T u + \nu(1-\textbf{1}^T u) = \nu + (1 - \textbf{1}^T \lambda) t + (P \lambda - \nu \textbf{1} - \mu)^T u\).

The dual function is \(g(\lambda, \mu, \nu) = \begin{cases} \nu, \textbf{1}^T \lambda = 1, P \lambda - \nu \textbf{1} = \mu \\ - \infty \text{ otherwise} \\ \end{cases}\)

The dual problem is then: maximize \(\nu\) subject to \(\lambda \geq 0, \text{1}^T \lambda = 1, \mu \geq 0, P \lambda - \nu \textbf{1} = \mu\). This is equivalent to: maximize \(\nu\) subject to \(\lambda \geq 0, \textbf{1}^T \lambda = 1, P \lambda \geq \nu \textbf{1}\). Those two problems become equal, LPs are feasible, we have strong duality, the optimal values of the two problems are equal.

More examples

Find the minimum and maximum of the function \(f_0(x,y) = x + y\) such that \(f_1(x,y) = x^2 + y^2 = 2\)! The Lagragian is \(L(x,y,\lambda) = x+y+\lambda(x^2+y^2-2)\). Set the derivatives to be zero:

\[\nabla_{x,y,\lambda} L(x,y,\lambda) = 0\]

This is equivalent to \(\begin{cases} 1 + 2 \lambda x = 0 \\ 1 + 2 \lambda y = 0 \\ x^2 + y^2 = 2 \\ \end{cases}\) Hence, \(\begin{cases} x = y = \frac{-1}{2 \lambda} \\ \lambda^2 = \frac{1}{4} \\ \end{cases}\) \(\Rightarrow \lambda = \pm \frac{1}{2}\)

And we have two roots \((x,y) \in \{(1,1),(-1,-1)\}\). The max \(f_0(x,y) = f_0(1,1) = 2\) and the min \(f_0(x,y) = f_0(-1,-1) = -2\).

Another examples on two probability distributions and their cross entropy. As we know, the cross entropy function is to measure the similarity of two probability distributions. If the value is small the two distributions are close. Let’s take a probability distribution \(p = {[p_1, p_2...p_n]}^T, p_i \in {[0,1]}, \sum_{i=1}^n p_i = 1\). For another probability distribution \(q = {[q_1, q_2...q_n]}, q_i \neq 0, \forall i: f_0(q) = - \sum_{i=1}^n p_i log(q_i)\) is the cross entropy function between p and q. We find q to minimize the cross entropy. With the constraint \(\sum_{i=1}^n q_i = 1\), the Lagrangian is: \(L(q_1, q_2..q_n, \lambda) = - \sum_{i=1}^n p_i log(q_i) + \lambda(\sum_{i=1}^n q_i - 1)\). To find optimum, we set the derivatives to zero:

\[\nabla_{q_1,...q_n, \lambda} L(q_1, ...q_n, \lambda) = 0\]

This is equivalent to:

\[\begin{cases} -\frac{p_i}{q_i} + \lambda = 0, i = 1..n \\ q_1 + q_2 +... + q_n = 1 \\ \end{cases}\]

So, \(p_i = \lambda q_i\), hence, \(1 = \sum_{i=1}^n p_i = \lambda \sum_{i=1}^n q_i = \lambda \Rightarrow \lambda = 1 \Rightarrow q_i = p_i, \forall i\). The cross entropy function achieve minimum when the two probability distributions are the same.

Optimality conditions

Complementary slackness

Suppose we have strong duality, let \(x^*\) be a primal optimal and \((\lambda^*, \nu^*)\) be a dual optimal point. This means:

\(f_0(x^*) = g(\lambda^*, \nu^*)\) \(= inf_x(f_0(x) + \sum_{i=1}^m \lambda_i^* f_i(x) + \sum_{i=1}^p \nu_i^* h_i(x))\) \(\leq f_0(x^*) + \sum_{i=1}^m \lambda_i^* f_i(x^*) + \sum_{i=1}^p \nu_i^* h_i(x^*))\) \(\leq f_0(x^*)\)

The first line states that the optimal duality gap is zero. The second line is the definition of the dual function. The third line follows since the infimum of the Lagrangian over x is less than or equal to its value at \(x = x^*\). The last inequality follows from \(\lambda_i^* \geq 0, f_i(x^*) \leq 0, i=1..m, h_i(x^*) = 0, i=1..p\). The two inequalities in this chain hold with equality.

An important conclusion from this: \(\sum_{i=1}^m \lambda_i^* f_i(x^*) = 0\), so \(\lambda_i^* f_i(x^*) = 0, i=1...m\). This condition is called the complementary slackness. The complementary slackness condition is expressed as: \(\lambda_i^* > 0 \Rightarrow f_i(x^*) = 0\) or equivalently \(f_i(x^*) < 0 \Rightarrow \lambda_i^* = 0\).

KKT optimality condition

Assume that the functions \(f_0, ...f_m, h_1, ...h_p\) are differentiable. For non convex problem, assume strong duality, \(x^*, (\lambda^*, \nu^*)\) be the primal and dual optimal points. Since \(x^*\) minimizes \(L(x, \lambda^*, \nu^*)\) over x, the gradient vanishes at \(x^*\):

\[\nabla f_0(x^*) + \sum_{i=1}^m \lambda_i^* \nabla f_i(x^*) + \sum_{i=1}^p \nu_i^* \nabla h_i(x^*) = 0\]

Thus,

\(f_i(x^*) \leq 0, i=1...m\) \(h_i(x^*) = 0, i=1,...p\) \(\lambda_i^* \geq 0, i=1...m\) \(\lambda_i^* f_i(x^*) = 0, i=1...m\) \(\nabla f_0(x^*) + \sum_{i=1}^m \lambda_i^* \nabla f_i(x^*) + \sum_{i=1}^p \nu_i^* \nabla h_i(x^*) = 0\) which are called the Karush-Kuhn-Tucker (KKT) conditions. To summarize, any optimization problem with differentiable objective and constraint functions for which strong duality obtains, any pair of primal and dual optimal points must satisfy the KKT conditions.