VGG16 Introduction and Implementation
After the rise of CNNs, many derived models started to appear. One of the most frequently mentioned is VGG16, which is today's topic. The post is split into three parts:
- VGG16 overview
- Code
- Paper walkthrough
VGG16 Overview
The name VGG comes from the Visual Geometry Group at the University of Oxford. Its contribution was to use more hidden layers and train on a large number of images, successfully pushing accuracy to around 90%. VGG16 has 16 weight layers in total: 13 convolutional layers and 3 fully connected layers (see the figure below). Part of VGG16's success was also that GPUs were available for acceleration at the time.
The most important idea in VGG is the heavy use of 3×3 convolution layers, a small stride (strides=1), and 2×2 pooling. The authors argue that smaller conv layers increase the amount of information captured. Moreover, compared with the 7×7 conv layers used in AlexNet, stacked 3×3 conv layers also provide more non-linearity. VGG16 also demonstrated that a "deeper" network really does outperform a shallower one: a model stacked from smaller filters can keep pushing accuracy up.
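The parameter argument above can be checked with quick arithmetic: three stacked 3×3 conv layers cover the same 7×7 receptive field as a single 7×7 layer, yet use fewer weights (and insert a ReLU after each layer, hence the extra non-linearity). A minimal sketch, with the channel count C = 256 chosen as an arbitrary example:

```python
# Compare weight counts: three stacked 3x3 convs vs one 7x7 conv,
# both covering a 7x7 receptive field over C channels (biases ignored).
def conv_params(kernel, in_ch, out_ch):
    """Number of weights in one conv layer with a kernel x kernel filter."""
    return kernel * kernel * in_ch * out_ch

C = 256  # example channel count, same in and out as in the paper's argument

three_3x3 = 3 * conv_params(3, C, C)  # 27 * C^2
one_7x7 = conv_params(7, C, C)        # 49 * C^2

print(three_3x3, one_7x7)  # 1769472 3211264
```

So the 3×3 stack needs 27C² weights against 49C² for a single 7×7 layer, while seeing the same input region.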
Code
If you want to try it, check the official Keras documentation — you will find it surprisingly simple to use (note: setting include_top to False means you have to add your own classification head afterwards). For a hands-on example of this, see my earlier post 🗂 Transfer Learning + Grad-CAM with Flask, which does transfer learning on top of VGG16.
from tensorflow.keras.applications import VGG16

model = VGG16(weights='imagenet', include_top=True)
Of course, you can also build the architecture by hand (matching the architecture diagram above) and train it yourself:
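The full hand-built model and training code are in the Colab linked at the end; as a quick sanity check, here is a sketch that enumerates the 13 conv + 3 FC configuration and verifies VGG16's well-known parameter count (assuming 224×224×3 inputs and 1000 classes):

```python
# VGG16 configuration: (out_channels, num_convs) per block; every conv is
# 3x3 with stride 1 and "same" padding, and every block ends in 2x2 max pooling.
blocks = [(64, 2), (128, 2), (256, 3), (512, 3), (512, 3)]  # 13 conv layers
fc = [4096, 4096, 1000]                                     # 3 FC layers

params, in_ch, spatial = 0, 3, 224
for out_ch, n_convs in blocks:
    for _ in range(n_convs):
        params += (3 * 3 * in_ch + 1) * out_ch  # weights + one bias per filter
        in_ch = out_ch
    spatial //= 2  # each 2x2 max pool halves the feature map side

in_units = spatial * spatial * in_ch  # 7 * 7 * 512 = 25088 after five pools
for out_units in fc:
    params += (in_units + 1) * out_units
    in_units = out_units

print(params)  # 138357544, matching Keras's model.summary() for VGG16
```

Counting only layers with weights gives 13 + 3 = 16 — hence the "16" in VGG16.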
The Colab notebook is attached at the bottom; feel free to run it, but remember to enable the GPU (Runtime -> Change runtime type -> Hardware accelerator: GPU) to speed up execution.
Paper Walkthrough
Now that we roughly understand the VGG16 architecture and code, let's dig deeper through the paper!
Abstract
Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3*3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16–19 weight layers.
The paper's main contribution and finding: 3×3 convolutions are already sufficient to capture features and improve accuracy.
1. Introduction
For instance, the best-performing submissions to the ILSVRC-2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014) utilised smaller receptive window size and smaller stride of the first convolutional layer. In this paper, we address another important aspect of ConvNet architecture design — its depth. To this end, we fix other parameters of the architecture, and steadily increase the depth of the network by adding more convolutional layers.
Convolutional networks have become increasingly important in computer vision in recent years: for instance, ILSVRC submissions modified the first convolution layer to use a smaller stride and kernel size. The authors instead try another direction: steadily increasing depth by adding more convolution layers. (The attached image is a meme I really like.)
2. CONVNET CONFIGURATIONS
The image is passed through a stack of convolutional (conv.) layers, where we use filters with a very small receptive field: 3 × 3.
In one of the configurations we also utilise 1 × 1 convolution filters, which can be seen as a linear transformation of the input channels (followed by non-linearity). The convolution stride is fixed to 1 pixel;
The only preprocessing is to subtract the mean pixel value (computed over the training set) from each pixel of the 224×224 image. The image is then fed through a model composed of 3×3, stride-1 conv layers; 3×3 is chosen because it is the smallest size that can still capture left/right and up/down features. Some of the other configurations also experiment with 1×1 convolutions.
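The mean-subtraction step can be sketched with NumPy. The per-channel RGB means below are the commonly quoted ImageNet training-set values used with VGG-style preprocessing; treat them as illustrative numbers, not values verified against the paper:

```python
import numpy as np

# Commonly quoted ImageNet RGB channel means for VGG-style preprocessing.
MEAN_RGB = np.array([123.68, 116.779, 103.939], dtype=np.float32)

def preprocess(image):
    """Subtract the per-channel mean from a (224, 224, 3) RGB image."""
    return image.astype(np.float32) - MEAN_RGB

img = np.full((224, 224, 3), 128, dtype=np.uint8)  # dummy gray image
out = preprocess(img)
print(out[0, 0])  # roughly [4.32, 11.221, 24.061]
```

Centering the inputs this way is the network's only preprocessing; there is no per-image normalization of variance.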
A stack of convolutional layers (which has a different depth in different architectures) is followed by three Fully-Connected (FC) layers: the first two have 4096 channels each, the third performs 1000-way ILSVRC classification and thus contains 1000 channels (one for each class). The final layer is the soft-max layer.
After the convolution layers, the authors add three fully connected layers with 4096, 4096, and 1000 neurons respectively, followed by a softmax that outputs probabilities.
To find the best model, the authors compare the six configurations shown in the figure below. Among them, A-LRN inserts an LRN (Local Response Normalization) layer, which normalizes locally: it amplifies neurons with relatively large responses and suppresses the smaller ones.
3. CLASSIFICATION FRAMEWORK
The initialisation of the network weights is important, since bad initialisation can stall learning due to the instability of gradient in deep nets. To circumvent this problem, we began with training the configuration A (Table 1), shallow enough to be trained with random initialisation. Then, when training deeper architectures, we initialised the first four convolutional layers and the last three fullyconnected layers with the layers of net A (the intermediate layers were initialised randomly). We did not decrease the learning rate for the pre-initialised layers, allowing them to change during learning.
Weight initialization matters: a bad initialization can make learning unstable or stall it entirely in deep nets. The authors' approach is to first train the shallower model A (see the table above), then use its trained weights to initialize the first four conv layers and the last three FC layers of the deeper models, with the intermediate layers initialized randomly.
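This initialization scheme can be sketched abstractly. Below, each network is just a dict of named weight arrays — a toy illustration of the copy pattern, not the actual training code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: net A has 8 conv + 3 FC layers (the shallowest configuration),
# the deeper net has 13 conv + 3 FC layers (the VGG16 configuration).
net_a = {f"conv{i}": rng.normal(size=(3, 3)) for i in range(1, 9)}
net_a.update({f"fc{i}": rng.normal(size=(4, 4)) for i in range(1, 4)})

net_deep = {f"conv{i}": rng.normal(size=(3, 3)) for i in range(1, 14)}
net_deep.update({f"fc{i}": rng.normal(size=(4, 4)) for i in range(1, 4)})

# Copy the first four conv layers and the three FC layers from trained net A;
# all intermediate layers keep their random initialization.
for name in ["conv1", "conv2", "conv3", "conv4", "fc1", "fc2", "fc3"]:
    net_deep[name] = net_a[name].copy()

print(np.array_equal(net_deep["conv1"], net_a["conv1"]))  # True: copied
print(np.array_equal(net_deep["conv5"], net_a["conv5"]))  # False: random
```

Note the copied layers are not frozen: as the quote says, the pre-initialized layers keep learning at the normal learning rate.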
4. CLASSIFICATION EXPERIMENTS
Dataset. In this section, we present the image classification results achieved by the described ConvNet architectures on the ILSVRC-2012 dataset (which was used for ILSVRC 2012–2014 challenges). The dataset includes images of 1000 classes, and is split into three sets: training (1.3M images), validation (50K images), and testing (100K images with held-out class labels). The classification performance is evaluated using two measures: the top-1 and top-5 error.
The dataset has 1000 classes, and evaluation uses the top-1 and top-5 measures.
(1) Top-1 accuracy: a prediction counts as correct only if the class with the highest probability is the ground-truth answer.
(2) Top-5 accuracy: a prediction counts as correct if the ground-truth answer appears among the five classes with the highest probabilities.
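Both metrics can be computed with a few lines of NumPy. A minimal sketch with made-up scores for three samples over five classes:

```python
import numpy as np

def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    top_k = np.argsort(scores, axis=1)[:, -k:]  # indices of the k largest scores
    hits = [label in row for row, label in zip(top_k, labels)]
    return float(np.mean(hits))

scores = np.array([
    [0.10, 0.60, 0.10, 0.10, 0.10],   # predicts class 1
    [0.30, 0.20, 0.25, 0.15, 0.10],   # predicts class 0
    [0.05, 0.05, 0.10, 0.20, 0.60],   # predicts class 4
])
labels = np.array([1, 2, 0])

print(top_k_accuracy(scores, labels, k=1))  # ~0.333: only the first sample hits
print(top_k_accuracy(scores, labels, k=3))  # ~0.667: sample 2's label 2 is now in its top 3
```

The paper reports the complementary error rates (top-1 error = 1 - top-1 accuracy).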
First, we note that using local response normalisation (A-LRN network) does not improve on the model A without any normalisation layers. We thus do not employ normalisation in the deeper architectures (B–E).
In this first experiment, the authors found that adding LRN brought no improvement, so it is dropped in the other models.
The error rate of our architecture saturates when the depth reaches 19 layers, but even deeper models might be beneficial for larger datasets. We also compared the net B with a shallow net with five 5 × 5 conv. layers, which was derived from B by replacing each pair of 3 × 3 conv. layers with a single 5 × 5 conv. layer (which has the same receptive field as explained in Sect. 2.3). The top-1 error of the shallow net was measured to be 7% higher than that of B (on a center crop), which confirms that a deep net with small filters outperforms a shallow net with larger filters.
In the experiments, the error saturates once the depth reaches 19 layers (as shown in the figure below). The authors also derived a shallower net from B by replacing each pair of 3×3 conv layers with a single 5×5 layer; it performed worse, with a top-1 error 7% higher.
Finally, scale jittering at training time (S ∈ [256; 512]) leads to significantly better results than training on images with fixed smallest side (S = 256 or S = 384), even though a single scale is used at test time. This confirms that training set augmentation by scale jittering is indeed helpful for capturing multi-scale image statistics.
During training, the authors found that scale jittering effectively improves performance: the training image's smallest side S is sampled from a range (here 256-512), and a fixed-size crop (224×224 in this paper) is then taken, augmenting the image set.
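The sampling logic can be sketched as follows. This is a sketch of the geometry only (scale sampling, rescaled dimensions, crop offset); real code would use an image library for the actual resize and crop:

```python
import random

random.seed(0)

def jittered_crop_geometry(width, height, s_min=256, s_max=512, crop=224):
    """Sample a scale S, rescale so the smallest side equals S, then pick
    a random crop offset for a crop x crop patch (VGG-style scale jittering)."""
    s = random.randint(s_min, s_max)            # jittered smallest side
    scale = s / min(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    x = random.randint(0, new_w - crop)         # random crop position
    y = random.randint(0, new_h - crop)
    return (new_w, new_h), (x, y)

size, offset = jittered_crop_geometry(640, 480)
print(size, offset)  # rescaled dimensions and a valid 224x224 crop offset
```

Because S varies per sample, the same object appears at many apparent sizes during training, which is what helps the network capture multi-scale statistics.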
Finally, the authors ensemble the trained models and compare against other entries; as shown in the figure below, you can see VGG's improvement.
5. Conclusion
In this work we evaluated very deep convolutional networks (up to 19 weight layers) for large-scale image classification. It was demonstrated that the representation depth is beneficial for the classification accuracy, and that state-of-the-art performance on the ImageNet challenge dataset can be achieved using a conventional ConvNet architecture (LeCun et al., 1989; Krizhevsky et al., 2012) with substantially increased depth. Our results yet again confirm the importance of depth in visual representations.
In this paper, the authors successfully showed that in computer vision, deep models outperform shallow ones, while also noting that the gains from ever-increasing depth have a limit.
That's all for today; thanks for reading patiently 🎉🎉! If you like my content, please clap or leave a comment! And hit follow to keep up with new articles. Feedback is always welcome 🙂~
Colab
https://colab.research.google.com/drive/1cWS-9kQKZdwklA9g5UPVRZpStBTAH0dZ?usp=sharing