ResNet Introduction with TensorFlow 2.0

Jun 24, 2021

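In the previous post we worked through VGG16 (🗂 link) and saw that simply stacking more layers eventually hits a limit. Over time, researchers worked out ways around that limit, and today's topic is one of them: ResNet. I'll start with the paper and finish with a TensorFlow implementation.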

Paper Walkthrough

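ResNet was proposed by Kaiming He and colleagues in 2015. It took first place in that year's ImageNet competition and also won the CVPR 2016 Best Paper Award. Many later image classification and detection models have used it as their baseline.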

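Let's first pin down the problem ResNet sets out to solve: as depth increases, accuracy fails to improve along with it (see the figure below). The first obstacle the paper names is vanishing/exploding gradients, which hamper the network's convergence.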

Vanishing / Exploding Gradients

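A quick explanation: when the weights are updated, backpropagation keeps multiplying derivatives layer after layer. If the activation function's derivative is greater than 1, the gradient grows exponentially through the layers and explodes. If it is less than 1, the gradient shrinks exponentially and vanishes. A more detailed explanation is linked here.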
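Here is a rough numeric sketch of that effect. It assumes every layer contributes the same constant factor to the backpropagated gradient, which is a simplification (real networks have a different factor at every layer). The name `gradient_scale` and the factors 0.5, 1.0, and 1.5 are made up for illustration.

```python
def gradient_scale(per_layer_factor: float, depth: int) -> float:
    """Multiply `depth` identical per-layer derivative factors, as the chain rule does."""
    scale = 1.0
    for _ in range(depth):
        scale *= per_layer_factor
    return scale


if __name__ == "__main__":
    # Factors below 1 shrink toward 0, factors above 1 blow up, exactly 1 stays put.
    for factor in (0.5, 1.0, 1.5):
        scales = [gradient_scale(factor, depth) for depth in (5, 20, 50)]
        print(f"factor={factor}: {scales}")
```

Even a factor fairly close to 1 becomes extreme once it is multiplied across dozens of layers, which is why the problem gets worse with depth.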

This problem, however, has been largely addressed by normalized initialization [23, 9, 37, 13] and intermediate normalization layers [16], which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation [22].

The paper's authors also tried the familiar Batch Normalization. It effectively handles most of the problem, but it cannot be called a complete solution.

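ResNet's main innovation is its use of identity mapping. The paper states: "The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart." Put simply, a deeper model should not have a higher error. So the authors add the input of a group of stacked layers directly to that group's output, giving H(x) = F(x) + x.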

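In the figure above, the curved line on the right is the identity mapping and the path on the left is the residual mapping. This helps training because, if the model is already optimal at this point, training should easily push the residual mapping to 0. Only the identity mapping is then left carrying the signal, so the layers in between are effectively skipped. Since they are skipped, performance should not get worse as the model grows deeper.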
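To make the "residual goes to 0, so the block becomes an identity" argument concrete, here is a minimal pure-Python sketch. It is not the paper's block: small fully connected layers stand in for the convolutions, and the ReLU the paper applies after the addition is left out so the identity case stays visible. The helper names (`dense`, `relu`, `residual_block`, `zeros`) are my own.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dense(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(v: Vector) -> Vector:
    """Element-wise max(0, value)."""
    return [max(0.0, a) for a in v]


def residual_block(x: Vector, w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> Vector:
    """Compute H(x) = F(x) + x, where F is dense -> ReLU -> dense."""
    fx = dense(relu(dense(x, w1, b1)), w2, b2)  # residual mapping F(x)
    return [f + xi for f, xi in zip(fx, x)]     # add the identity shortcut


def zeros(rows: int, cols: int) -> Matrix:
    """Matrix of zeros with the given shape."""
    return [[0.0] * cols for _ in range(rows)]


x = [1.0, -2.0, 3.0]

# All residual weights at zero: F(x) = 0, so the block passes x through unchanged.
identity_out = residual_block(x, zeros(3, 3), [0.0] * 3, zeros(3, 3), [0.0] * 3)

# Non-zero weights: the block outputs x plus whatever F(x) has learned.
eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
learned_out = residual_block(x, eye, [0.0] * 3, eye, [0.0] * 3)
```

With zero weights the output equals the input, so the block only has to learn a correction on top of x rather than the whole mapping H(x).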

The formulation of F(x) + x can be realized by feedforward neural networks with “shortcut connections” (Fig. 2). Shortcut connections [2, 34, 49] are those skipping one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers (Fig. 2). Identity shortcut connections add neither extra parameter nor computational complexity. The entire network can still be trained end-to-end by SGD with backpropagation, and can be easily implemented using common libraries (e.g., Caffe [19]) without modifying the solvers.

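The structure in the figure above is what the paper calls "shortcut connections" in a feedforward neural network. They add no extra parameters and no computational complexity, and the whole network can still be trained end-to-end by SGD with backpropagation.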

The dimensions of x and F must be equal in Eqn.(1). If this is not the case (e.g., when changing the input/output channels), we can perform a linear projection W_s by the shortcut connections to match the dimensions:
y = F(x, {W_i}) + W_s x.

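When x and F(x) have different dimensions, a linear projection is used to make them match. In practice this means using a 1×1 convolution to change the dimensions. For details on how 1×1 convolutions are used and why they work, see "What does a 1×1 convolution compute?" (「1×1卷積計算在做什麼」).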
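The projection shortcut can be sketched in pure Python as well. A 1×1 convolution applies the same small matrix to the channel vector at every pixel, so it changes the channel count without touching height or width. This sketch ignores stride and bias, uses a hard-coded stand-in for the residual branch output F(x), and the names (`conv1x1`, `add_maps`, `w_s`) are illustrative rather than taken from any library.

```python
from typing import List

Pixel = List[float]              # one value per channel
FeatureMap = List[List[Pixel]]   # height x width x channels


def conv1x1(fmap: FeatureMap, weights: List[List[float]]) -> FeatureMap:
    """Apply the same (out_channels x in_channels) matrix at every pixel."""
    return [
        [[sum(w * c for w, c in zip(row, pixel)) for row in weights] for pixel in line]
        for line in fmap
    ]


def add_maps(a: FeatureMap, b: FeatureMap) -> FeatureMap:
    """Element-wise sum of two feature maps with identical shapes."""
    return [
        [[p + q for p, q in zip(pa, pb)] for pa, pb in zip(la, lb)]
        for la, lb in zip(a, b)
    ]


# x has 2 channels, while the stand-in residual output F(x) has 3 channels.
x = [[[1.0, 2.0], [3.0, 4.0]]]               # 1 x 2 x 2
fx = [[[0.5, 0.5, 0.5], [1.0, 1.0, 1.0]]]    # 1 x 2 x 3
w_s = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # projects 2 -> 3 channels

shortcut = conv1x1(x, w_s)   # 1 x 2 x 3, now the same shape as F(x)
y = add_maps(fx, shortcut)   # y = F(x) + W_s x
```

Without the projection, the addition would not be defined, since a 2-channel pixel cannot be added to a 3-channel one.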

Architecture

It is worth noticing that our model has fewer filters and lower complexity than VGG nets [41] (Fig. 3, left). Our 34-layer baseline has 3.6 billion FLOPs (multiply-adds), which is only 18% of VGG-19 (19.6 billion FLOPs).
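The 18% in the quote is easy to check from the two FLOP counts it gives. The constant names below are my own. The numbers are the ones quoted from the paper.

```python
RESNET34_FLOPS = 3.6e9   # multiply-adds for the 34-layer baseline, as quoted
VGG19_FLOPS = 19.6e9     # multiply-adds for VGG-19, as quoted

ratio = RESNET34_FLOPS / VGG19_FLOPS
print(f"The 34-layer model uses {ratio:.0%} of VGG-19's FLOPs")
```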

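The authors make a point of noting that even the 34-layer model needs less computation than VGG-19. The figure below shows the experimental architectures the authors propose for the different ResNet variants.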