# 1. Introduction

Marcin Andrychowicz, Misha Denil, Sergio Gómez Colmenarejo, Nando de Freitas, et al. Learning to learn by gradient descent by gradient descent. NIPS 2016.

Learning to learn is a very exciting topic for a host of reasons, not least of which is the fact that we know that the type of backpropagation currently done in neural networks is implausible as a mechanism that the brain is actually likely to use: there is no Adam optimizer nor automatic differentiation in the brain! Something else has to be doing the optimization of our brain’s neural network, and most likely that something else is itself a neural network!

$\theta^* = \arg\min_{\theta\in\Theta} f(\theta)$

$\theta_{t+1} = \theta_t - \alpha \nabla f(\theta_t)$

• optimizer (the optimizing network): $g$, with parameters $\phi$
• optimizee (the original problem being optimized): with parameters $\theta$

$\theta_{t+1} = \theta_t + g_t(\nabla f(\theta_t), \phi)$

An RNN maintains a hidden state that stores information about the history of the optimization, so an LSTM optimizer can adapt to a particular optimization process from a global view of that history: its parameters stay "smart" at every time step, a kind of global smartness that adapts at every moment.
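To make the contrast between the two update rules above concrete, here is a minimal Python sketch. The linear map standing in for $g$ is purely illustrative (in the paper $g$ is an LSTM with parameters $\phi$), and the function names are mine.

```python
import numpy as np

# Hand-designed rule: theta_{t+1} = theta_t - alpha * grad_f(theta_t)
def sgd_step(theta, grad, alpha=0.1):
    return theta - alpha * grad

# Learned rule: theta_{t+1} = theta_t + g(grad_f(theta_t); phi), where g is itself
# a network with its own parameters phi (an LSTM in the paper).  The linear map
# below is only a placeholder so the shape of the idea is visible.
def learned_step(theta, grad, phi):
    g = phi["W"] @ grad + phi["b"]
    return theta + g
```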

## 1.1. Transfer Learning and Generalization

[I strongly suspect this section was added at a reviewer's request.]

This is in contrast to the ordinary approach of characterizing properties of interesting problems analytically and using these analytical insights to design learning algorithms by hand.

In this framework, generalization means

• the ability to transfer knowledge between different problems,
• learning structure that is common across different problems,
• the ability to apply the learned optimizer to more general optimization problems.

# 2. Learning to Learn with an RNN

## 2.1. Problem Setup

$\theta^* = \arg\min_{\theta\in\Theta} f(\theta)$

$\mathcal L(\phi) = \mathbb E_f[f(\theta^*)]$

• Objective function $f$ (as I understand it, this is just the loss)
• The final optimized parameters of the optimizee are $\theta^*$, written as $\theta^*(f,\phi)$: they depend on $f$ because different objectives lead to different optimal parameters, and on $\phi$ because the optimum is produced by the optimizer, whose parameters are $\phi$
• The final loss is $f(\theta^*)$
• The optimizer's loss is therefore the expectation of this final loss, $\mathbb E_f[f(\theta^*)]$. Why an expectation? Presumably because the optimizer is trained over a distribution of objective functions $f$, not a single instance.

$\mathcal L(\phi) = \mathbb E_f\left[ \sum_{t=1}^T\omega_tf(\theta_t) \right]$

$\theta_{t+1} = \theta_t + g_t, \qquad [\,g_t, h_{t+1}\,] = {\rm lstm}(\nabla_t, h_t, \phi)$

$\omega_t \in \mathbb R_{\geq0}$ are arbitrary weights for each optimization step, and $\nabla_t = \nabla_\theta f(\theta_t)$.

$\mathcal L(\phi) = \mathbb E_f[f(\theta^*(f,\phi))] = \mathbb E_f\left[ \sum_{t=1}^T\omega_tf(\theta_t) \right]$

• Meta-optimizer: the objective's loss should be small across the entire optimization horizon (a weighted sum over all steps; a sketch of this unrolled computation follows below)
• Traditional optimizer: for the current objective, it only needs the loss at this step to be smaller than at the previous step
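A sketch of how the meta-loss $\mathcal L(\phi)$ could be computed by unrolling the learned optimizer for $T$ steps on a sampled objective. `optimizer_net` and `sample_f` are hypothetical interfaces, not code from the paper; following the paper, gradients are not propagated through $\nabla_t$ itself.

```python
import torch

def meta_loss(optimizer_net, sample_f, T=20, weights=None):
    """Unroll the learned optimizer for T steps on a sampled objective f and
    accumulate the weighted loss  L(phi) = sum_t w_t * f(theta_t)."""
    f, theta = sample_f()            # optimizee objective and its initial parameters (requires_grad=True)
    weights = weights if weights is not None else [1.0] * T  # the paper uses w_t = 1 for all t
    state, total = None, 0.0
    for t in range(T):
        loss = f(theta)
        total = total + weights[t] * loss
        # Gradient of the optimizee loss w.r.t. its parameters; retain_graph so the
        # accumulated meta-loss can later be backpropagated into phi.  No create_graph,
        # mirroring the paper's assumption that d(grad_t)/d(phi) = 0.
        grad, = torch.autograd.grad(loss, theta, retain_graph=True)
        g, state = optimizer_net(grad, state)   # [g_t, h_{t+1}] = lstm(grad_t, h_t, phi)
        theta = theta + g                       # theta_{t+1} = theta_t + g_t
    return total                                # minimize this w.r.t. the optimizer parameters phi
```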

## 2.2. Coordinatewise LSTM Optimizer

One challenge in applying RNNs in our setting is that we want to be able to optimize at least tens of thousands of parameters. Optimizing at this scale with a fully connected RNN is not feasible as it would require a huge hidden state and an enormous number of parameters. To avoid this difficulty we will use an optimizer m which operates coordinatewise on the parameters of the objective function, similar to other common update rules like RMSprop and ADAM. This coordinatewise network architecture allows us to use a very small network that only looks at a single coordinate to define the optimizer and share optimizer parameters across different parameters of the optimizee.

Adrien Lucas Ecoffet's commentary[1]: The “coordinatewise” section is phrased in a way that is a bit confusing to me, but I think it is actually quite simple: what it means is simply this: every single “coordinate” has its own state (though the optimizer itself is shared), and information is not shared across coordinates. What I wasn’t 100% sure about is what a “coordinate” is supposed to be. My guess, however, is that it is simply a weight or a bias, which I think is confirmed by my experiments. In other words, if we have a network with 100 weights and biases, there will be 100 hidden states involved in optimizing it, which means that effectively there will be 100 instances of our optimizer network running in parallel as we optimize.
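A minimal PyTorch-style sketch of the coordinatewise idea: one small LSTM is shared across all coordinates, and per-coordinate hidden states are obtained by treating each scalar parameter as a separate batch element. The two-layer, 20-unit LSTM follows the paper's description; the class and method names are mine.

```python
import torch
import torch.nn as nn

class CoordinatewiseLSTMOptimizer(nn.Module):
    """One small LSTM shared by every coordinate of the optimizee; each coordinate
    keeps its own hidden state."""
    def __init__(self, hidden_size=20):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, num_layers=2)
        self.out = nn.Linear(hidden_size, 1)   # maps the hidden state to the update g_t

    def forward(self, grad, state=None):
        n = grad.numel()
        # Treat every scalar parameter as an independent batch element:
        # shape (seq_len=1, batch=n, features=1), so hidden states are per-coordinate.
        x = grad.reshape(1, n, 1)
        y, state = self.lstm(x, state)
        update = self.out(y).reshape(grad.shape)
        return update, state
```

Because the LSTM weights are shared across coordinates while only the hidden states are per-coordinate, the optimizer stays tiny even when the optimizee has tens of thousands of parameters.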

## 2.3. Preprocessing and Postprocessing

$$\nabla^k \rightarrow \begin{cases} \left( \dfrac{\log(\vert\nabla\vert)}{p},\ \operatorname{sgn}(\nabla) \right) & \text{if } \vert\nabla\vert \geq e^{-p} \\ (-1,\ e^{p}\nabla) & \text{otherwise} \end{cases}$$

where $p>0$ is a parameter controlling how small gradients are disregarded.

Adrien Lucas Ecoffet 的解读[1]： With this formula, if the first parameter is greater than -1, it is a log of gradient, otherwise it is a flag indicating that the neural net should look at the second parameter. Likewise, if the second parameter is -1 or 1, it is the sign of the gradient, but if it is between -1 and 1 it is a scaled version of the gradient itself, exactly what we want!
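A small numpy sketch of this preprocessing rule; the paper uses $p = 10$, and the function name is mine.

```python
import numpy as np

def preprocess_grad(grad, p=10.0):
    """Map each gradient coordinate to the 2-vector described above: large gradients
    become (log|grad| / p, sign(grad)); tiny ones become (-1, exp(p) * grad)."""
    grad = np.asarray(grad, dtype=np.float64)
    big = np.abs(grad) >= np.exp(-p)
    first = np.where(big, np.log(np.abs(grad) + 1e-300) / p, -1.0)   # small epsilon avoids log(0)
    second = np.where(big, np.sign(grad), np.exp(p) * grad)
    return np.stack([first, second], axis=-1)
```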

# 3. Experiments

## 3.1. 10-Dimensional Functions

$f(\theta)=\vert\vert W\theta-y \vert\vert_2^2$

Adrien Lucas Ecoffet 的解读[1]： These are pretty simple: our optimizer is supposed to find a 10-element vector called $\theta$ that, when multiplied by a $10\times 10$ matrix called $W$, is as close as possible to a 10-element vector called $y$. Both $y$ and $W$ are generated randomly. The error is simply the squared error.
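A numpy sketch of this synthetic problem family; drawing the elements of $W$ and $y$ from a Gaussian follows the paper, while the function names and the analytic-gradient helper are mine.

```python
import numpy as np

def sample_quadratic(dim=10, seed=None):
    """Sample one instance f(theta) = ||W theta - y||_2^2 with Gaussian W and y."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((dim, dim))
    y = rng.standard_normal(dim)
    f = lambda theta: float(np.sum((W @ theta - y) ** 2))
    grad = lambda theta: 2.0 * W.T @ (W @ theta - y)   # analytic gradient, handy for sanity checks
    return f, grad
```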

## 3.2. MNIST (MLP)

In this experiment we test whether trainable optimizers can learn to optimize a small neural network on MNIST. We train the optimizer to optimize a base network and explore a series of modifications to the network architecture and training procedure at test time.

• an optimizee with 40 hidden units
• an optimizee with 2 layers of 20 units each
• an optimizee using the ReLU activation function (all three variants are sketched below)
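For concreteness, a PyTorch sketch of these optimizees: the base network with a single hidden layer of 20 sigmoid units follows the paper, and the variant layer sizes follow the bullets above.

```python
import torch.nn as nn

# Base optimizee the LSTM optimizer is trained on, plus the three test-time variants.
base_mlp = nn.Sequential(nn.Linear(784, 20), nn.Sigmoid(), nn.Linear(20, 10))
wide_mlp = nn.Sequential(nn.Linear(784, 40), nn.Sigmoid(), nn.Linear(40, 10))      # 40 hidden units
deep_mlp = nn.Sequential(nn.Linear(784, 20), nn.Sigmoid(),
                         nn.Linear(20, 20), nn.Sigmoid(), nn.Linear(20, 10))       # 2 layers of 20 units
relu_mlp = nn.Sequential(nn.Linear(784, 20), nn.ReLU(), nn.Linear(20, 10))         # ReLU activation
```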

## 3.3. CIFAR (CNN)

The optimizee is a network with both convolutional and fully connected layers: three convolutional layers with pooling, followed by a fully connected layer of 32 units. All activations are ReLU, and batch normalization is used.
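A PyTorch sketch of an optimizee with this shape; the channel counts and kernel sizes are my assumptions, only the overall structure (three conv+pool blocks with batch norm and ReLU, then a 32-unit fully connected layer) follows the description above.

```python
import torch.nn as nn

def conv_block(c_in, c_out):
    # conv + batch norm + ReLU + pooling, as described above
    return nn.Sequential(nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(), nn.MaxPool2d(2))

cifar_optimizee = nn.Sequential(
    conv_block(3, 16), conv_block(16, 16), conv_block(16, 16),   # 32x32 -> 4x4 feature maps
    nn.Flatten(),
    nn.Linear(16 * 4 * 4, 32), nn.ReLU(),                        # 32-unit fully connected layer
    nn.Linear(32, 10),
)
```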

The left-most plot displays the results of using the optimizer to fit a classifier on a held-out test set. The additional two plots on the right display the performance of the trained optimizer on modified datasets which only contain a subset of the labels, i.e. the CIFAR-2 dataset only contains data corresponding to 2 of the 10 labels. Additionally we include an optimizer LSTM-sub which was only trained on the held-out labels.

http://www.cs.toronto.edu/~kriz/cifar.html (python version, 163 MB). The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.