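Why this post

The main purpose of this post is to record my impressions from reading deep learning papers (really, it is to finish a homework assignment).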

Variants of the Softmax function

$ softmax(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}} $

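What is wrong with this formula?

Overflow: for a large $x_i$, $e^{x_i}$ can exceed the largest value the machine can represent, so the computation fails or returns infinity.
Underflow: for a very negative $x_i$, $e^{x_i}$ is so close to 0 that it is rounded to 0. If every term underflows, the denominator becomes 0 and the result is undefined.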

So $softmax(x)_i$ needs to be computed, and in some cases defined, differently.
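A common remedy for both problems is to subtract the largest input from every $x_i$ before exponentiating. Mathematically the result is unchanged, because the factor $e^{-\max(x)}$ cancels between numerator and denominator. The sketch below is only an illustration in plain Python; the function names `naive_softmax` and `stable_softmax` are my own and do not come from any paper or library.

```python
import math

def naive_softmax(xs):
    """Direct translation of the formula; fails for large or very negative inputs."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(xs):
    """Softmax with the maximum input subtracted from every element first."""
    shift = max(xs)
    # Every exponent is now <= 0, so math.exp cannot overflow.
    exps = [math.exp(x - shift) for x in xs]
    # The largest element contributes exp(0) = 1, so total is at least 1.
    total = sum(exps)
    return [e / total for e in exps]
```

With inputs such as `[1000.0, 1000.0]` the naive version raises `OverflowError` inside `math.exp`, while the shifted version still returns equal probabilities for the two entries.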

Large-Margin Softmax Loss for Convolutional Neural Networks

paper link: text

motivation

current softmax loss does not explicitly encourage intra-class compactness and inter-class-separability. Our key intuition is that the separability between sample and parameter can be factorized into amplitude ones and angular ones with cosine similarity: $ W_c x = \|W_c\|_2 \|x\|_2 \cos(\theta_c) $, where c is the class index, and the corresponding parameters $W_c$ of the last fully connected layer can be regarded as the linear classifier of class c. Under softmax loss, the label prediction decision rule is largely determined by the angular similarity to each class since softmax loss uses cosine distance as classification score. The purpose of this paper, therefore, is to generalize the softmax loss to a more general large-margin softmax (L-Softmax) loss in terms of angular similarity, leading to potentially larger angular separability between learned features. This is done by incorporating a preset constant m multiplying with the angle between sample and the classifier of ground truth class. m determines the strength of getting closer to the ground truth class, producing an angular margin. One shall see, the conventional softmax loss becomes a special case of the L-Softmax loss under our proposed framework. Our idea is verified by Fig. 2 where the learned features by L-Softmax become much more compact and well separated.

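Put plainly: the standard softmax loss does not explicitly push features of the same class together or features of different classes apart. The key intuition is that the score between a sample and the class parameters factors into a magnitude part and an angular part:

$ W_c x = \|W_c\|_2 \|x\|_2 \cos(\theta_c) $

Here c is the class index, and the parameters $W_c$ of the last fully connected layer act as the linear classifier for class c.

Under the softmax loss, the predicted label is determined largely by the angular similarity to each class, because the classification score depends on $\cos(\theta_c)$.

The paper therefore generalizes the softmax loss to a large-margin softmax (L-Softmax) loss defined in terms of angular similarity, which can lead to larger angular separation between learned features. This is done by multiplying the angle between the sample and the classifier of its ground-truth class by a preset constant m. The value of m controls how strongly a sample is pulled toward its ground-truth class, and this produces an angular margin. The conventional softmax loss is a special case of L-Softmax in this framework (m = 1). Fig. 2 of the paper supports the idea: features learned with L-Softmax are visibly more compact within a class and better separated between classes.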
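To make the effect of m concrete, here is a small sketch of the target-class score $\|W_c\|_2 \|x\|_2 \cos(\theta_c)$ with the angle multiplied by m. The excerpt above only says that the angle is multiplied by m; the piecewise function `psi` below is the form I recall from the L-Softmax paper, $(-1)^k \cos(m\theta) - 2k$ on $[k\pi/m, (k+1)\pi/m]$, and should be checked against the paper before being relied on. All function names are hypothetical, and only the single target-class logit is computed, not the full loss.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def angle(w, x):
    """Angle theta between weight vector w and feature x, in [0, pi]."""
    cos_theta = dot(w, x) / (norm(w) * norm(x))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, cos_theta)))

def psi(theta, m):
    """Assumed margin function: (-1)^k cos(m*theta) - 2k on [k*pi/m, (k+1)*pi/m]."""
    k = min(int(theta * m / math.pi), m - 1)
    return (-1) ** k * math.cos(m * theta) - 2 * k

def margin_logit(w, x, m):
    """Target-class logit with angular margin m; m = 1 reduces to the plain dot product."""
    return norm(w) * norm(x) * psi(angle(w, x), m)
```

For the same w and x, a larger m gives a smaller target-class logit, so the network has to shrink the angle to the ground-truth classifier further to reach the same score. That is the angular margin described above.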

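In a classification task, the goal is to make intra-class distances as small as possible and inter-class distances as large as possible.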