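ILSVRC: the ImageNet Large-Scale Visual Recognition Challenge uses roughly 1,200,000 images in 1000 classes as its training set, selected from the full ImageNet dataset (21,841 categories, 14,197,122 images). On this benchmark the network achieved the best results reported at the time: “top-1 and top-5 error rates of 37.5% and 17.0%” (Krizhevsky 等, 2017, p. 84)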
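
A minimal sketch of how top-1 and top-5 error rates like the ones quoted above are computed: a prediction counts as correct at top-k if the true label is among the k highest-scoring classes. The function name `top_k_error` and the toy scores are illustrative assumptions, not taken from the paper.

```python
def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is NOT among the k highest-scoring classes."""
    misses = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score; keep the k best.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(labels)


# Toy example: 3 images, 4 classes (made-up scores, not results from the paper).
scores = [
    [0.1, 0.6, 0.2, 0.1],  # true class 1: hit at top-1
    [0.5, 0.1, 0.3, 0.1],  # true class 2: miss at top-1, hit at top-2
    [0.4, 0.3, 0.2, 0.1],  # true class 3: miss at both
]
labels = [1, 2, 3]
```

With 1000 classes, as in ILSVRC, the same function would be called with k=1 and k=5.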

“The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully connected layers with a final 1000-way softmax.” (Krizhevsky 等, 2017, p. 84)

“To make training faster, we used nonsaturating neurons and a very efficient GPU implementation of the convolution operation.” (Krizhevsky 等, 2017, p. 84)

“To reduce overfitting in the fully connected layers we employed a recently developed regularization method called “dropout” that proved to be very effective.” (Krizhevsky 等, 2017, p. 84)


“But objects in realistic settings exhibit considerable variability, so to learn to recognize them it is necessary to use much larger training sets. And indeed, the shortcomings of small image datasets have been widely recognized (e.g., Ref. 25), but it has only recently become possible to collect labeled datasets with millions of images. The new larger datasets include LabelMe [28], which consists of hundreds of thousands of fully segmented images, and ImageNet [7], which consists of over 15 million labeled high-resolution images in over 22,000 categories.” (Krizhevsky 等, 2017, p. 85)

“To learn about thousands of objects from millions of images, we need a model with a large learning capacity. However, the immense complexity of the object recognition task means that this problem cannot be specified even by a dataset as large as ImageNet, so our model should also have lots of prior knowledge to compensate for all the data we do not have.” (Krizhevsky 等, 2017, p. 85)

“In the end, the network’s size is limited mainly by the amount of memory available on current GPUs and by the amount of training time that we are willing to tolerate.” (Krizhevsky 等, 2017, p. 85)

“ImageNet is a dataset of over 15 million labeled high-resolution images belonging to roughly 22,000 categories. The images were collected from the web and labeled by human labelers using Amazon’s Mechanical Turk crowd-sourcing tool. Starting in 2010, as part of the Pascal Visual Object Challenge, an annual competition called the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) has been held. ILSVRC uses a subset of ImageNet with roughly 1000 images in each of 1000 categories. In all, there are roughly 1.2 million training images, 50,000 validation images, and 150,000 testing images.” (Krizhevsky 等, 2017, p. 85)

“ImageNet consists of variable-resolution images, while our system requires a constant input dimensionality. Therefore, we down-sampled the images to a fixed resolution of 256 × 256. Given a rectangular image, we first rescaled the image such that the shorter side was of length 256, and then cropped out the central 256 × 256 patch from the resulting image. We did not preprocess the images in any other way, except for subtracting the mean activity over the training set from each pixel. So we trained our network on the (centered) raw RGB values of the pixels.” (Krizhevsky 等, 2017, p. 85)
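
The rescale-then-crop geometry described above can be sketched as follows. This only computes sizes and the crop box; the function names and the rounding of the longer side are assumptions for illustration, since the paper does not spell them out.

```python
def resize_then_center_crop(width, height, target=256):
    """Return (new_width, new_height, crop_box) for the rescale-and-crop step.

    crop_box is (left, top, right, bottom) in the rescaled image.
    """
    scale = target / min(width, height)
    # The shorter side becomes exactly `target`; rounding the longer side is an assumption.
    new_w = target if width <= height else round(width * scale)
    new_h = target if height <= width else round(height * scale)
    left = (new_w - target) // 2
    top = (new_h - target) // 2
    return new_w, new_h, (left, top, left + target, top + target)


def subtract_mean(pixel, mean_pixel):
    """Center one RGB pixel by subtracting the training-set mean at that position."""
    return tuple(p - m for p, m in zip(pixel, mean_pixel))
```

For a 512 × 384 input, the shorter side (384) is scaled to 256 and the central 256 columns of the rescaled image are kept.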

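(1) tanh and sigmoid are saturating activation functions; ReLU and its variants are non-saturating activation functions. Non-saturating activation functions have the following main advantages: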



“This is demonstrated in Figure 1, which shows the number of iterations required to reach 25% training error on the CIFAR-10 dataset for a particular four-layer convolutional network.” (Krizhevsky 等, 2017, p. 85)

“A four-layer convolutional neural network with ReLUs (solid line) reaches a 25% training error rate on CIFAR-10 six times faster than an equivalent network with tanh neurons (dashed line).” (Krizhevsky 等, 2017, p. 86)
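
A small sketch of what "saturating" means in practice: the gradient of tanh shrinks toward zero for large inputs, while the ReLU gradient stays at 1 for any positive input. The function names are illustrative; this does not reproduce the CIFAR-10 experiment.

```python
import math


def tanh_grad(x):
    """Derivative of tanh(x): 1 - tanh(x)^2, which vanishes as |x| grows."""
    return 1.0 - math.tanh(x) ** 2


def relu(x):
    """ReLU activation: max(0, x)."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0
```

Because gradient descent steps are proportional to these derivatives, units stuck in the flat region of tanh learn slowly, which is one plausible reading of the speedup the quotes above report.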


“A single GTX 580 GPU has only 3GB of memory, which limits the maximum size of the networks that can be trained on it. It turns out that 1.2 million training examples are enough to train networks which are too big to fit on one GPU.” (Krizhevsky 等, 2017, p. 86)

“Current GPUs are particularly well-suited to cross-GPU parallelization, as they are able to read from and write to one another’s memory directly, without going through host machine memory.” (Krizhevsky 等, 2017, p. 86)

How the two GPUs are used: “The parallelization scheme that we employ essentially puts half of the kernels (or neurons) on each GPU, with one additional trick: the GPUs communicate only in certain layers. This means that, for example, the kernels of layer 3 take input from all kernel maps in layer 2. However, kernels in layer 4 take input only from those kernel maps in layer 3 which reside on the same GPU.” (Krizhevsky 等, 2017, p. 86)


“local normalization scheme aids generalization” (Krizhevsky 等, 2017, p. 86)


“We also verified the effectiveness of this scheme on the CIFAR-10 dataset: a four-layer CNN achieved a 13% test error rate without normalization and 11% with normalization.” (Krizhevsky 等, 2017, p. 86)
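
The notes above do not quote the normalization formula itself. As a hedged sketch, the paper's local response normalization divides each activation by (k + α · sum of squared activations over n adjacent kernel maps)^β, with k = 2, n = 5, α = 10^-4, β = 0.75. The function below applies this at a single spatial position; the function name and the use of integer division for n/2 are assumptions.

```python
def local_response_norm(activations, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize a list of activations (one per kernel map) at one spatial position."""
    last = len(activations) - 1
    out = []
    for i, a in enumerate(activations):
        # Sum of squares over up to n neighbouring kernel maps, clipped at the ends.
        lo = max(0, i - n // 2)
        hi = min(last, i + n // 2)
        sum_sq = sum(x * x for x in activations[lo:hi + 1])
        out.append(a / (k + alpha * sum_sq) ** beta)
    return out
```

Large activations in neighbouring maps increase the denominator, so strongly responding kernels damp their neighbours.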

(4) Overlapping Pooling: using overlapping pooling layers to reduce overfitting

“To be more precise, a pooling layer can be thought of as consisting of a grid of pooling units spaced s pixels apart, each summarizing a neighborhood of size z × z centered at the location of the pooling unit. If we set s = z, we obtain traditional local pooling as commonly employed in CNNs. If we set s < z, we obtain overlapping pooling. This is what we use throughout our network, with s = 2 and z = 3. This scheme reduces the top-1 and top-5 error rates by 0.4% and 0.3%, respectively, as compared with the nonoverlapping scheme s = 2, z = 2, which produces output of equivalent dimensions. We generally observe during training that models with overlapping pooling find it slightly more difficult to overfit.” (Krizhevsky 等, 2017, p. 87)
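
A one-dimensional sketch of the two pooling schemes compared above. It assumes no padding and windows that start at each pooling unit (rather than being centered on it), which keeps the arithmetic simple; the function names are illustrative.

```python
def pooled_length(n, z, s):
    """Number of pooling outputs for an input of length n, window z, stride s (no padding)."""
    return (n - z) // s + 1


def max_pool_1d(values, z, s):
    """Max pooling over windows of size z taken every s positions."""
    return [max(values[i:i + z]) for i in range(0, len(values) - z + 1, s)]


signal = [1, 3, 2, 5, 4, 0, 6]
overlapping = max_pool_1d(signal, z=3, s=2)     # windows share one element
nonoverlapping = max_pool_1d(signal, z=2, s=2)  # windows are disjoint
```

For an odd input length such as 13, both settings (z = 3, s = 2 and z = 2, s = 2) give the same number of outputs, which matches the "equivalent dimensions" remark in the quote.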


“the net contains eight layers with weights; the first five are convolutional and the remaining three are fully connected. The output of the last fully connected layer is fed to a 1000-way softmax which produces a distribution over the 1000 class labels.” (Krizhevsky 等, 2017, p. 87)
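
A rough check of the "60 million parameters" figure for the eight weight layers. The kernel shapes below are taken from the paper's architecture description rather than from the notes above, and the 6 × 6 × 256 input to the first fully connected layer is an assumption about the final pooled feature map. Kernels in layers 2, 4, and 5 see only the maps on their own GPU, hence the halved input depths (48 and 192).

```python
# (name, kernel_height, kernel_width, input_depth_seen_by_each_kernel, number_of_kernels)
CONV_LAYERS = [
    ("conv1", 11, 11, 3, 96),
    ("conv2", 5, 5, 48, 256),    # same-GPU maps only: 96 / 2
    ("conv3", 3, 3, 256, 384),   # connected across both GPUs
    ("conv4", 3, 3, 192, 384),   # same-GPU maps only: 384 / 2
    ("conv5", 3, 3, 192, 256),   # same-GPU maps only: 384 / 2
]

# (name, inputs, outputs); the 6*6*256 input size is an assumption.
FC_LAYERS = [
    ("fc6", 6 * 6 * 256, 4096),
    ("fc7", 4096, 4096),
    ("fc8", 4096, 1000),
]


def count_parameters():
    """Return (per_layer_counts, total), counting weights plus one bias per output unit."""
    counts = {}
    for name, kh, kw, depth, kernels in CONV_LAYERS:
        counts[name] = kh * kw * depth * kernels + kernels
    for name, n_in, n_out in FC_LAYERS:
        counts[name] = n_in * n_out + n_out
    return counts, sum(counts.values())


per_layer, total = count_parameters()
```

Under these assumptions the three fully connected layers account for most of the total, which is consistent with dropout being applied there.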

“We employ two distinct forms of data augmentation, both of which allow transformed images to be produced from the original images with very little computation, so the transformed images do not need to be stored on disk” (Krizhevsky 等, 2017, p. 87)

“The first form of data augmentation consists of generating image translations and horizontal reflections. We do this by extracting random 224 × 224 patches (and their horizontal reflections) from the 256 × 256 images and training our network on these extracted patches. This increases the size of our training set by a factor of 2048, though the resulting training examples are, of course, highly interdependent. Without this scheme, our network suffers from substantial overfitting, which would have forced us to use much smaller networks. At test time, the network makes a prediction by extracting five 224 × 224 patches (the four corner patches and the center patch) as well as their horizontal reflections (hence 10 patches in all), and averaging the predictions made by the network’s softmax layer on the ten patches.” (Krizhevsky 等, 2017, p. 87)
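
A sketch of the crop bookkeeping described above, working with crop offsets rather than pixel data. The names are illustrative assumptions. The factor of 2048 matches 32 × 32 offsets times 2 reflections; counting offsets 0 through 32 inclusive would give 33 × 33 positions instead, so the exact offset range used here is an assumption.

```python
import random

IMAGE_SIZE = 256
CROP_SIZE = 224
MAX_OFFSET = IMAGE_SIZE - CROP_SIZE  # 32

# 32 horizontal offsets x 32 vertical offsets x 2 (original or mirrored).
AUGMENTATION_FACTOR = MAX_OFFSET ** 2 * 2


def random_crop_spec(rng):
    """Training time: pick a random (left, top, flip) for one 224 x 224 patch."""
    left = rng.randrange(MAX_OFFSET)
    top = rng.randrange(MAX_OFFSET)
    flip = rng.random() < 0.5
    return left, top, flip


def ten_crop_specs():
    """Test time: four corners plus center, each with and without horizontal flip."""
    m = MAX_OFFSET
    offsets = [(0, 0), (m, 0), (0, m), (m, m), (m // 2, m // 2)]
    return [(left, top, flip) for (left, top) in offsets for flip in (False, True)]


def average_predictions(predictions):
    """Average per-class softmax outputs over the patches."""
    n = len(predictions)
    return [sum(column) / n for column in zip(*predictions)]
```

At test time the network would be run once per entry of `ten_crop_specs()` and the ten softmax vectors passed to `average_predictions`.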









“The second form of data augmentation consists of altering the intensities of the RGB channels in training images.” (Krizhevsky 等, 2017, p. 88)

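Principal component analysis (PCA) is performed on the RGB channel values, and the principal components are then randomly perturbed. These perturbations shift the colors of the image slightly, which augments the image data and increases its diversity and richness. The gain from this, however, is limited.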
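
A hedged sketch of the color perturbation: in the paper, each image has the quantity [p1, p2, p3][α1λ1, α2λ2, α3λ3]^T added to every pixel, where p_i and λ_i are the eigenvectors and eigenvalues of the 3 × 3 RGB covariance matrix and each α_i is drawn from a zero-mean Gaussian with standard deviation 0.1. The code assumes the eigenvectors and eigenvalues are already computed and passed in; the function names are illustrative.

```python
import random


def pca_color_shift(eigvecs, eigvals, rng, sigma=0.1):
    """Return the RGB offset added to every pixel of one image.

    eigvecs[i] is the i-th eigenvector (length 3) of the RGB covariance matrix,
    eigvals[i] the matching eigenvalue; both are assumed to be precomputed.
    """
    alphas = [rng.gauss(0.0, sigma) for _ in eigvals]
    shift = [0.0, 0.0, 0.0]
    for vec, lam, alpha in zip(eigvecs, eigvals, alphas):
        for channel in range(3):
            shift[channel] += vec[channel] * alpha * lam
    return shift


def apply_shift(pixel, shift):
    """Add the same RGB offset to one pixel."""
    return tuple(p + s for p, s in zip(pixel, shift))
```

The α values are drawn once per image, so the whole image shifts along the principal color directions instead of receiving independent per-pixel noise.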


“The recently introduced technique, called “dropout” [12], consists of setting to zero the output of each hidden neuron with probability 0.5. The neurons which are “dropped out” in this way do not contribute to the forward pass and do not participate in backpropagation. So every time an input is presented, the neural network samples a different architecture, but all these architectures share weights.” (Krizhevsky 等, 2017, p. 88)

“This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons. At test time, we use all the neurons but multiply their outputs by 0.5, which is a reasonable approximation to taking the geometric mean of the predictive distributions produced by the exponentially-many dropout networks.” (Krizhevsky 等, 2017, p. 88)

“We use dropout in the first two fully connected layers of Figure 2. Without dropout, our network exhibits substantial overfitting. Dropout roughly doubles the number of iterations required to converge.” (Krizhevsky 等, 2017, p. 88)
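
A minimal sketch of dropout as described in the quotes above: during training each hidden output is zeroed with probability 0.5, and at test time all outputs are kept but scaled by 0.5. The function names are illustrative, and the random generator is passed in explicitly so the behaviour is reproducible.

```python
import random


def dropout_train(activations, rng, p_drop=0.5):
    """Training: zero each activation independently with probability p_drop."""
    return [0.0 if rng.random() < p_drop else a for a in activations]


def dropout_test(activations, p_drop=0.5):
    """Test: keep every activation but scale it by the keep probability."""
    keep = 1.0 - p_drop
    return [a * keep for a in activations]
```

Note that this follows the paper's convention of scaling at test time; many current libraries instead scale the surviving activations during training.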

“We trained our models using stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005. We found that this small amount of weight decay was important for the model to learn. In other words, weight decay here is not merely a regularizer: it reduces the model’s training error. The update rule for weight w was $u_{i+1} := 0.9 \cdot u_i - 0.0005 \cdot \epsilon \cdot w_i - \epsilon \cdot \left\langle \frac{\partial L}{\partial w} \Big|_{w_i} \right\rangle_{D_i}$, $w_{i+1} := w_i + u_{i+1}$, where i is the iteration index, u is the momentum variable, ε is the learning rate, and $\left\langle \frac{\partial L}{\partial w} \Big|_{w_i} \right\rangle_{D_i}$ is the average over the ith batch D_i of the derivative of the objective with respect to w, evaluated at w_i.” (Krizhevsky 等, 2017, p. 88)
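
A sketch of one update step for a single scalar weight, using the momentum (0.9) and weight decay (0.0005) quoted above: the momentum variable becomes 0.9 · u − 0.0005 · ε · w − ε · (batch-averaged gradient), and the weight is then incremented by it. The function name is an assumption, and the learning rate in the example is only an illustrative value.

```python
def sgd_momentum_step(w, u, grad, lr, momentum=0.9, weight_decay=0.0005):
    """One update of weight w and momentum variable u.

    grad is the batch-averaged derivative of the objective with respect to w.
    """
    u_next = momentum * u - weight_decay * lr * w - lr * grad
    w_next = w + u_next
    return w_next, u_next


# Illustrative values only: start at w = 1.0 with zero momentum.
w1, u1 = sgd_momentum_step(w=1.0, u=0.0, grad=0.5, lr=0.01)
```

The weight decay term is scaled by the learning rate, so it shrinks together with ε when the learning rate is reduced.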


“We initialized the weights in each layer from a zero-mean Gaussian distribution with standard deviation 0.01. We initialized the neuron biases in the second, fourth, and fifth convolutional layers, as well as in the fully connected hidden layers, with the constant 1. This initialization accelerates the early stages of learning by providing the ReLUs with positive inputs. We initialized the neuron biases in the remaining layers with the constant 0.” (Krizhevsky 等, 2017, p. 88)
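
The initialization scheme above can be sketched as follows. The layer names (`conv1` through `fc8`) are hypothetical labels for the five convolutional and three fully connected layers, not identifiers from the paper.

```python
import random

# Layers whose biases start at 1: conv layers 2, 4, 5 and the fully connected hidden layers.
BIAS_ONE_LAYERS = {"conv2", "conv4", "conv5", "fc6", "fc7"}


def initial_bias(layer_name):
    """Constant bias initialization: 1 for the listed layers, 0 for the rest."""
    return 1.0 if layer_name in BIAS_ONE_LAYERS else 0.0


def initial_weights(count, rng, std=0.01):
    """Draw `count` weights from a zero-mean Gaussian with standard deviation 0.01."""
    return [rng.gauss(0.0, std) for _ in range(count)]
```

A bias of 1 means a ReLU unit starts with a positive input for most examples, so its gradient is nonzero from the first iterations.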




“Computing similarity by using Euclidean distance between two 4096-dimensional, real-valued vectors is inefficient, but it could be made efficient by training an autoencoder to compress these vectors to short binary codes. This should produce a much better image retrieval method than applying autoencoders to the raw pixels [16], which does not make use of image labels and hence has a tendency to retrieve images with similar patterns of edges, whether or not they are semantically similar.” (Krizhevsky 等, 2017, p. 90)













“Their capacity can be controlled by varying their depth and breadth” (Krizhevsky 等, 2017, p. 85)


“All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.” (Krizhevsky 等, 2017, p. 85)




“ReLUs have the desirable property that they do not require input normalization to prevent them from saturating.” (Krizhevsky 等, 2017, p. 86)


“The network has learned a variety of frequency- and orientation-selective kernels, as well as various colored blobs.” (Krizhevsky 等, 2017, p. 89)


“If two images produce feature activation vectors with a small Euclidean separation, we can say that the higher levels of the neural network consider them to be similar.” (Krizhevsky 等, 2017, p. 89)
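
A small sketch of similarity search by Euclidean distance between feature vectors, as described above. The vectors here are short toy lists standing in for the 4096-dimensional activations, and the function names are illustrative.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(query, database, k=1):
    """Indices of the k database vectors closest to the query, nearest first."""
    ranked = sorted(range(len(database)),
                    key=lambda i: euclidean_distance(query, database[i]))
    return ranked[:k]


# Toy feature vectors (made up for illustration).
database = [[5.0, 5.0], [1.0, 2.0], [0.0, 0.0]]
closest = nearest_neighbors([1.0, 1.0], database, k=2)
```

This brute-force scan compares the query against every stored vector, which is the cost the paper suggests reducing with short binary codes.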




“It is notable that our network’s performance degrades if a single convolutional layer is removed.” (Krizhevsky 等, 2017, p. 90)


“Ultimately we would like to use very large and deep convolutional nets on video sequences where the temporal structure provides very helpful information, that is, missing or far less obvious in static images.” (Krizhevsky 等, 2017, p. 90)



