Convex Optimization
-
Problem Setup
-
\(
\begin{align*}
\text{Time} &\quad& \text{choice 1} &\quad& \text{choice 2} &\quad& \cdots &\quad& \text{choice n}\\
1 && x^{(1)}_1, l^{(1)}_1 && x^{(1)}_2, l^{(1)}_2 && \cdots && x^{(1)}_n, l^{(1)}_n\\
2 && x^{(2)}_1, l^{(2)}_1 && x^{(2)}_2, l^{(2)}_2 && \cdots && x^{(2)}_n, l^{(2)}_n\\
\vdots && \vdots && \vdots && && \vdots \\
t && x^{(t)}_1, l^{(t)}_1 && x^{(t)}_2, l^{(t)}_2 && \cdots && x^{(t)}_n, l^{(t)}_n\\
\vdots && \vdots && \vdots && && \vdots \\
T && x^{(T)}_1, l^{(T)}_1 && x^{(T)}_2, l^{(T)}_2 && \cdots && x^{(T)}_n, l^{(T)}_n\\
\end{align*}
\)
Conditions:
\(
\forall t, i: \; x^{(t)}_i \geq 0; \qquad
\sum \limits_{i=1}^n x^{(t)}_i = 1
\)
i.e. each \(x^{(t)}\) is a point on the probability simplex
-
At step \(t\) the algorithm has loss \(\sum \limits_{i=1}^n x^{(t)}_i \cdot l^{(t)}_i\)
-
\(l^{(t)}_i\) is unknown ahead of time; it is revealed only after \(x^{(t)}\) is played
-
Regret: actual performance minus best fixed strategy w/ hindsight
\(R_T = \left[\sum \limits_{t=1}^T \sum \limits_{i=1}^n x^{(t)}_i l^{(t)}_i\right] - \left[\min \limits_{i=1,\ldots,n} \sum \limits_{t=1}^T l^{(t)}_i\right]\)
i.e. comparing against someone who knew all the losses in advance but is only allowed to make one fixed choice
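As a concrete sketch of the definition (function and variable names are my own, assuming numpy), regret compares the algorithm's realized loss to the best single column in hindsight:
```python
import numpy as np

def regret(X, L):
    """X: T x n plays (each row on the simplex); L: T x n losses.
    Returns algorithm loss minus loss of the best fixed choice in hindsight."""
    alg_loss = np.sum(X * L)            # sum_t sum_i x_i^(t) * l_i^(t)
    best_fixed = L.sum(axis=0).min()    # min_i sum_t l_i^(t)
    return alg_loss - best_fixed
```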
-
Offline optimum: regret = 0
-
The choices are made online, since we do not have access to the losses beforehand
-
Multiplicative Weights
-
Stores weights \(w^{(t)}_i\) initialized to 1
-
parameterized by \(0<\varepsilon<\frac{1}{2}\)
-
update: \(w^{(t+1)}_i = w^{(t)}_i \cdot (1-\varepsilon)^{l^{(t)}_i} \)
the update penalizes weights whose choices generate loss
-
At time \(t\) play \(x^{(t)}_i = \frac{w^{(t)}_i}{\sum_j w^{(t)}_j}\)
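A minimal runnable sketch of the algorithm as stated (the function name and parameters are my own):
```python
import numpy as np

def multiplicative_weights(losses, eps=0.1):
    """Run MW on a T x n loss matrix with entries in [0, 1].
    Returns the plays x^(t), one row per round."""
    T, n = losses.shape
    w = np.ones(n)                      # weights initialized to 1
    plays = np.zeros((T, n))
    for t in range(T):
        plays[t] = w / w.sum()          # play x^(t) = normalized weights
        w *= (1 - eps) ** losses[t]     # penalize weights that generated loss
    return plays
```
With the `regret` helper above, `regret(multiplicative_weights(L, eps), L)` can be checked against the guarantee below.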
-
Guarantees \(R_T \leq O(\sqrt{T\log n})\)
And \(\frac{R_T}{T} \leq O(\sqrt{\frac{\log n}{T}})\), so \(\frac{R_T}{T} \rightarrow 0\) as \(T \rightarrow \infty\)
-
Regret bound:
\(\begin{align*}
\sum \limits_{t=1}^T \sum \limits_{i=1}^n x^{(t)}_i l^{(t)}_i &\quad \leq &\min \limits_{x^*}\sum \limits_{t=1}^T \sum \limits_{i=1}^n x^{*}_i l^{(t)}_i &\quad+ \varepsilon T + \frac{\ln n}{\varepsilon} \\
\text{loss of algorithm} && \text{offline optimum} & \qquad \text{regret}
\end{align*}\)
Where \(x^*\) is a fixed strategy (a point on the simplex)
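Filling in how the \(O(\sqrt{T\log n})\) guarantee follows from this bound (a standard step not spelled out above): minimize the regret term over \(\varepsilon\), assuming \(T \geq 4\ln n\) so that \(\varepsilon \leq \frac{1}{2}\):
\(
\frac{d}{d\varepsilon}\left(\varepsilon T + \frac{\ln n}{\varepsilon}\right) = T - \frac{\ln n}{\varepsilon^2} = 0
\;\Rightarrow\; \varepsilon = \sqrt{\frac{\ln n}{T}}
\;\Rightarrow\; R_T \leq 2\sqrt{T \ln n}
\)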
-
Multiplicative Weights Analysis
-
Assume \(l^{(t)}_i \in [0,1]\), \(\varepsilon \in (0, \frac{1}{2})\)
-
Definitions:
\(L_t = \sum \limits_{i=1}^n x^{(t)}_i l^{(t)}_i\); loss at time \(t\)
\(L^* = \min \limits_{i} \sum \limits_{t=1}^T l^{(t)}_i\); offline optimum
\(\varepsilon T + \frac{\ln n}{\varepsilon}\); regret
\(W_t = \sum \limits_{i=1}^n w^{(t)}_i\); total weight at time \(t\)
-
Lemma 1:
\(W_{T+1} \geq (1-\varepsilon)^{L^*}\)
Proof:
Unrolling the update, \(w^{(T+1)}_i = (1-\varepsilon)^{l^{(1)}_i + l^{(2)}_i + \cdots + l^{(T)}_i}\)
\(W_{T+1} \geq w^{(T+1)}_i, \forall i\), since all weights are nonnegative
For the best fixed choice \(i^*\), \(L^* = l^{(1)}_{i^*} + \cdots + l^{(T)}_{i^*}\), so \(W_{T+1} \geq w^{(T+1)}_{i^*} = (1-\varepsilon)^{L^*}\)
-
Lemma 2:
\(W_{T+1} \leq n \cdot \prod \limits_{t=1}^T (1-\varepsilon L_t)\)
Proof:
\(W_1 = n\) since all weights are initialized as 1
Claim that \(W_{t+1} \leq W_t \cdot (1-\varepsilon L_t)\); the lemma follows by chaining this over \(t\)
\(\begin{align*}
W_{t+1} &= \sum \limits_{i=1}^{n} w^{(t+1)}_{i} = \sum \limits_{i=1}^{n} w^{(t)}_{i} \cdot (1-\varepsilon)^{l^{(t)}_{i}} \\
&= W_{t} \sum_{i=1}^{n} x^{(t)}_{i} \cdot (1-\varepsilon)^{l^{(t)}_{i}} \\
&\leq W_{t} \sum_{i=1}^{n} x^{(t)}_{i}\left(1-\varepsilon \cdot l^{(t)}_{i}\right) \\
&= W_{t} \cdot\left(\sum_{i=1}^{n} x^{(t)}_{i}-\varepsilon \sum_{i=1}^{n} x^{(t)}_{i} l^{(t)}_{i}\right) \\
&= W_t \cdot (1-\varepsilon L_t)
\end{align*}\)
The inequality uses \((1-\varepsilon)^{l} \leq 1-\varepsilon l\) for \(l \in [0,1]\), by convexity of \((1-\varepsilon)^l\) in \(l\)
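A throwaway numeric check of that convexity bound (assuming only numpy; not part of the proof):
```python
import numpy as np

eps = 0.3
l = np.linspace(0.0, 1.0, 101)
# (1 - eps)^l lies below the chord 1 - eps*l on [0, 1], by convexity in l
assert np.all((1 - eps) ** l <= 1 - eps * l + 1e-12)
```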
-
From lemmas 1 and 2:
\(\begin{align*}
(1-\varepsilon)^{L^*} &\leq n \prod\limits_{t=1}^{T}\left(1-\varepsilon L_{t}\right) \\
\ln (1-\varepsilon)^{L^*} &\leq \ln \left(n \prod_{t=1}^{T}\left(1-\varepsilon L_{t}\right)\right)
\\
L^{*} \ln (1-\varepsilon) &\leq \ln n+\sum_{t=1}^{T} \ln \left(1-\varepsilon L_{t}\right)
\end{align*}\)
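Finishing the derivation (a standard step filled in here): on the right use \(\ln(1-x) \leq -x\), and on the left use \(\ln(1-\varepsilon) \geq -\varepsilon - \varepsilon^2\), valid for \(\varepsilon \leq \frac{1}{2}\):
\(\begin{align*}
-L^{*}(\varepsilon+\varepsilon^{2}) &\leq \ln n - \varepsilon \sum_{t=1}^{T} L_{t} \\
\sum_{t=1}^{T} L_{t} &\leq L^{*}(1+\varepsilon) + \frac{\ln n}{\varepsilon} \leq L^{*} + \varepsilon T + \frac{\ln n}{\varepsilon}
\end{align*}\)
using \(L^{*} \leq T\); this is exactly the regret bound stated earlier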
-
Multiplicative Weights: More General Losses
-
Follow the Leader
-
A more general case / parent of MW
-
Setup:
\(\begin{align*}
\text{Time} &\quad& \text{Algorithm} &\quad& \text{Loss} &\quad& \text{Loss of Algorithm} \\
t=1 && x^{(1)} \in K && f_1:K \rightarrow \mathbb{R} && f_1(x^{(1)}) \\
t=2 && x^{(2)} \in K && f_2:K \rightarrow \mathbb{R} && f_2(x^{(2)}) \\
\vdots && \vdots && \vdots && \vdots \\
\end{align*}\)
-
at time \(t\): \(x^{(t)} = \arg\min\limits_{x \in K} f_{1}(x)+\cdots+f_{t-1}(x)\)
-
FTL can do badly in an interesting way:
it keeps changing its mind, flipping between sub-optimal choices
-
Regret can be bounded with respect to how much the algorithm changes its mind (see the sketch below)
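A sketch of the standard bad instance on two choices (the scaffolding and names are mine; the loss sequence is the classic construction): a small first loss puts the leader on choice 2, then the losses alternate so the leader flips every round, FTL pays \(\approx T\), and either fixed choice pays \(\approx T/2\):
```python
import numpy as np

def ftl_linear(losses):
    """FTL over the simplex with linear losses f_t(x) = <l^(t), x>:
    the minimizer of a linear function over the simplex is a vertex,
    so FTL plays the coordinate with smallest cumulative loss."""
    cum = np.zeros(losses.shape[1])
    total = 0.0
    for l in losses:
        total += l[np.argmin(cum)]  # play the current leader
        cum += l
    return total

T = 100
losses = np.array([[0.5, 0.0]] +
                  [[0.0, 1.0] if t % 2 else [1.0, 0.0] for t in range(1, T)])
print(ftl_linear(losses))          # ~T: FTL pays 1 almost every round
print(losses.sum(axis=0).min())    # ~T/2: best fixed choice
```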
-
FTL Analysis
-
Theorem: \(R_T \leq \sum\limits_{t=1}^{T} f_{t}\left(x^{(t)}\right)-f_{t}\left(x^{(t+1)}\right)\)
i.e. bounded by how much the algorithm's choice changes from round to round (as measured by \(f_t\))
-
Proof: Induction on \(T\), via the lemma
\(\sum\limits_{t=1}^{T} f_{t}\left(x^{(t+1)}\right) \leq \min \limits_{x \in K} \sum\limits_{t=1}^{T} f_{t}(x)\)
base case: \(T=1\) holds by definition, since \(x^{(2)}\) minimizes \(f_1\)
assume true for \(T\), prove for \(T+1\):
\(\begin{align*}
\sum_{t=1}^{T+1} f_{t}\left(x^{(t+1)}\right) &=f_{T+1}\left(x^{(T+2)}\right)+\sum_{t=1}^{T} f_{t}\left(x^{(t+1)}\right) \\
& \leq f_{T+1}\left(x^{(T+2)}\right)+\sum_{t=1}^{T} f_{t}\left(x^{(T+2)}\right) \qquad \text{(inductive hypothesis at } x = x^{(T+2)}\text{)} \\
& = \sum_{t=1}^{T+1} f_{t}\left(x^{(T+2)}\right) = \min_{x \in K}\sum_{t=1}^{T+1} f_{t}(x)
\end{align*}\)
The theorem follows by subtracting: \(R_T = \sum\limits_{t=1}^{T} f_t(x^{(t)}) - \min\limits_{x \in K}\sum\limits_{t=1}^{T} f_t(x) \leq \sum\limits_{t=1}^{T} f_t(x^{(t)}) - f_t(x^{(t+1)})\)
-
Follow the Regularized Leader
-
define a regularizer \(R: K \rightarrow \mathbb{R}\)
-
Algorithm: \(x^{(t)}=\arg\min\limits_{x \in K} f_{1}(x)+ \cdots +f_{t-1}(x)+R(x)\)
-
Pick an \(R\) that has a unique minimum and is very smooth
-
New regret bound, for any fixed \(x \in K\):
\(\sum_{t=1}^{T} f_{t}\left(x^{(t)}\right)-\sum_{t=1}^{T} f_{t}(x)
\leq\left(\sum_{t=1}^{T}
f_{t}\left(x^{(t)}\right)-f_{t}\left(x^{(t+1)}\right)\right)+R(x)-R\left(x^{(1)}\right)\)
i.e. \(R_T \leq \left(\sum_{t=1}^{T}
f_{t}\left(x^{(t)}\right)-f_{t}\left(x^{(t+1)}\right)\right)+\max _{x \in K}
R(x)-R\left(x^{(1)}\right) \)
-
\(R\) should favor spread-out points, to fight the tendency of \(x\) to become very concentrated:
\(R(x)\) is smaller if the distribution \(x\) is spread out, larger if it is concentrated
-
FTRL with (negative) entropy as the regularizer = MW (see the sketch below)
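A sketch checking that correspondence numerically (the closed form is standard; names are mine): with linear losses \(f_t(x) = \langle l^{(t)}, x\rangle\) and regularizer \(R(x) = \frac{1}{\eta}\sum_i x_i \ln x_i\), the FTRL minimizer over the simplex is \(x^{(t)}_i \propto e^{-\eta \sum_{s<t} l^{(s)}_i}\), which matches MW once the bases are aligned via \(e^{-\eta} = 1-\varepsilon\):
```python
import numpy as np

def ftrl_entropy_play(cum_loss, eta):
    """Closed-form FTRL step with negative-entropy regularizer:
    argmin_x <cum_loss, x> + (1/eta) sum_i x_i ln x_i over the simplex
    is the softmax of -eta * cum_loss."""
    w = np.exp(-eta * (cum_loss - cum_loss.min()))  # shift for stability
    return w / w.sum()

def mw_play(cum_loss, eps):
    """MW play: weights (1 - eps)^(cumulative loss), normalized."""
    w = (1 - eps) ** (cum_loss - cum_loss.min())
    return w / w.sum()

cum = np.array([3.0, 1.0, 2.5])
eps = 0.1
eta = -np.log(1 - eps)              # align bases: e^{-eta} = 1 - eps
print(ftrl_entropy_play(cum, eta))  # identical to the MW distribution
print(mw_play(cum, eps))
```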