Paper Introduction — YOLOv3: An Incremental Improvement

Watson Wang
3 min read · May 13, 2022

1. Introduction

Sometimes you just kinda phone it in for a year, you know? I didn’t do a whole lot of research this year. Spent a lot of time on Twitter. Played around with GANs a little. I had a little momentum left over from last year [12] [1]; I managed to make some improvements to YOLO. But, honestly, nothing like super interesting, just a bunch of small changes that make it better. I also helped out with other people’s research a little.

Right in the introduction the author admits he only made a few small improvements, because he spent too much time on Twitter…, played around with GANs a little, and helped out with other people's research. So this reads more like a tech report than a full paper.

2. The Deal

2.1. Bounding Box Prediction

The section opens by noting that YOLOv3 is still significantly faster than other detection methods. As in YOLO9000, YOLOv3 uses dimension clusters as its anchor box priors, and prediction works the same way: the network predicts the offsets tx, ty, tw, th (the x, y offsets are squashed into 0–1 with a sigmoid), which are then converted back into box coordinates, as shown below.
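The conversion, as given in the paper, where (cx, cy) is the offset of the grid cell from the image's top-left corner and (pw, ph) are the width and height of the bounding-box prior:

$$
\begin{aligned}
b_x &= \sigma(t_x) + c_x \\
b_y &= \sigma(t_y) + c_y \\
b_w &= p_w \, e^{t_w} \\
b_h &= p_h \, e^{t_h}
\end{aligned}
$$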

Squared-error loss is used on these coordinate predictions during training, which keeps the gradient simple: it is just the ground-truth value minus the prediction, t̂* − t* (taking x as the example, t̂x is the value computed from the ground-truth box and tx is the network's prediction).

YOLOv3 predicts an objectness score for each box using logistic regression. The score should be 1 for the bounding-box prior that overlaps a ground-truth object more than any other prior; priors that are not the best but still overlap a ground truth by more than a 0.5 threshold have their predictions ignored.
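As a minimal sketch (my reconstruction, not the author's code) of that assignment rule, assuming we already have each prior's IoU with one ground-truth box:

```python
import numpy as np

def objectness_targets(ious):
    """Assign objectness targets per YOLOv3's rule.

    ious: IoU of each bounding-box prior with one ground-truth box.
    Returns 1 for the best prior, -1 (ignore) for non-best priors
    above the 0.5 threshold, and 0 (negative) otherwise.
    """
    ious = np.asarray(ious)
    targets = np.zeros(len(ious))
    targets[ious > 0.5] = -1        # overlaps the object but isn't best: ignored
    targets[int(np.argmax(ious))] = 1  # best prior is responsible for the object
    return targets
```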

2.2. Class Prediction

For class prediction the author drops the softmax, since it is unnecessary for good performance and cannot express multi-label data. Instead, each class gets an independent logistic classifier, trained with a binary cross-entropy loss.
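A small sketch of the idea (function name is mine): a per-class sigmoid plus binary cross-entropy, so one box can carry overlapping labels such as "woman" and "person", the case the paper mentions for Open Images:

```python
import numpy as np

def class_loss(logits, labels):
    """Independent logistic classifiers with binary cross-entropy.

    logits: (num_classes,) raw scores; labels: (num_classes,) 0/1 targets,
    where more than one entry may be 1 (multi-label).
    """
    p = 1.0 / (1.0 + np.exp(-logits))   # per-class sigmoid, not a softmax
    eps = 1e-7                          # numerical safety for the logs
    return -np.mean(labels * np.log(p + eps) +
                    (1 - labels) * np.log(1 - p + eps))
```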

2.3. Predictions Across Scales

YOLOv3 predicts boxes at 3 different scales, extracting features from each of these scales and predicting 3 boxes per scale. With 4 bounding-box offsets (tx, ty, tw, th), 1 objectness score, and 80 class predictions per box, the output tensor at each scale has shape N × N × [3 × (4 + 1 + 80)].
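To make the shape concrete, a quick sanity check, assuming COCO's 80 classes and a 416 × 416 input (which gives grid sizes of 13, 26, and 52 at the three scales):

```python
num_anchors, num_classes = 3, 80                 # 3 boxes per scale, COCO classes
depth = num_anchors * (4 + 1 + num_classes)      # 4 offsets + 1 objectness + 80 classes
for n in (13, 26, 52):                           # grid sizes for a 416x416 input
    print(f"{n}x{n}x{depth}")                    # -> 13x13x255, 26x26x255, 52x52x255
```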

The feature map from two layers back is upsampled by 2×, and a feature map from earlier in the network is concatenated with the upsampled one. This way the upsampled features contribute more semantic information, while the earlier feature map contributes finer-grained information.
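A minimal PyTorch sketch of this merge step (the channel counts here are illustrative, not the paper's exact numbers):

```python
import torch
import torch.nn as nn

deep = torch.randn(1, 256, 13, 13)      # semantically rich but coarse
early = torch.randn(1, 512, 26, 26)     # finer-grained, from an earlier layer

up = nn.Upsample(scale_factor=2, mode="nearest")(deep)  # (1, 256, 26, 26)
merged = torch.cat([up, early], dim=1)                   # (1, 768, 26, 26)
```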

As before, k-means clustering is run on the COCO dataset to choose the bounding-box priors. This yields nine clusters: (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198), (373×326).
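The paper doesn't show the clustering code, but a minimal sketch of k-means under the 1 − IoU distance introduced in YOLOv2 might look like this (function names are mine):

```python
import numpy as np

def iou_wh(box, centers):
    """IoU between one (w, h) box and each cluster center, with all boxes
    anchored at the origin -- only shape matters for anchor clustering."""
    inter = np.minimum(box[0], centers[:, 0]) * np.minimum(box[1], centers[:, 1])
    union = box[0] * box[1] + centers[:, 0] * centers[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=50, seed=0):
    """boxes: (N, 2) array of ground-truth box widths and heights."""
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each box to the nearest center under the 1 - IoU distance
        assign = np.array([np.argmax(iou_wh(b, centers)) for b in boxes])
        for j in range(k):
            if np.any(assign == j):
                centers[j] = boxes[assign == j].mean(axis=0)
    return centers
```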

2.4. Feature Extractor

For feature extraction the author uses a hybrid architecture, blending the network used in YOLOv2 (Darknet-19) with residual blocks. He clearly put some thought into naming it, too; quoting the paper:

Our network uses successive 3 × 3 and 1 × 1 convolutional layers but now has some shortcut connections as well and is significantly larger. It has 53 convolutional layers so we call it…. wait for it….. Darknet-53!

[Figure: Darknet-53 model architecture]

This combination makes Darknet-53 much more powerful than Darknet-19, while still being faster than ResNet-101 or ResNet-152.
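As a rough sketch of the repeating unit (my reconstruction, not the official Darknet config), the block pairs a 1 × 1 bottleneck with a 3 × 3 convolution and a shortcut connection:

```python
import torch
import torch.nn as nn

class DarknetResidual(nn.Module):
    """Darknet-53-style residual block: 1x1 bottleneck, 3x3 conv, shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels // 2, 1, bias=False),  # 1x1 squeeze
            nn.BatchNorm2d(channels // 2),
            nn.LeakyReLU(0.1),
            nn.Conv2d(channels // 2, channels, 3, padding=1, bias=False),  # 3x3 expand
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        return x + self.block(x)  # shortcut connection
```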

The author also notes that Darknet-53 achieves the highest measured floating-point operations per second, meaning it makes more efficient use of the GPU.

3. How We Do

In the past YOLO struggled with small objects, but we now see that trend reversed: with multi-scale predictions, YOLOv3 achieves relatively high AP_S performance. In exchange, its performance on medium and large objects is comparatively worse.
When accuracy is plotted against speed on the AP50 metric, YOLOv3 compares very favorably with the other detection methods.

4. Things We Tried That Didn’t Work

4–1. Anchor box x, y offset predictions: predicting the x, y offsets directly with the normal anchor box mechanism didn’t work well.

4–2. Linear x, y predictions instead of logistic: this reduced model stability and caused mAP to drop.

4–3. Focal loss: mAP dropped by about 2 points, presumably because YOLOv3 already addresses the problem focal loss is designed to solve.

4–4. Dual IOU thresholds and truth assignment: Faster R-CNN uses two IOU thresholds during training, where overlap above 0.7 is a positive example and below 0.3 is a negative. The author tried a similar strategy but found it gave no better results.

5. What This All Means

Uh… I’ll just paste the original text; the conclusion is too solid to paraphrase. To sum it up in one sentence: may there be world peace.

Well the other people heavily funding vision research are the military and they’ve never done anything horrible like killing lots of people with new technology.

Oh wait…..I have a lot of hope that most of the people using computer vision are just doing happy, good stuff with it, like counting the number of zebras in a national park [13], or tracking their cat as it wanders around their house [19]. But computer vision is already being put to questionable use and as researchers we have a responsibility to at least consider the harm our work might be doing and think of ways to mitigate it. We owe the world that much.

It’s been a while since I read a paper this entertaining. I love reading papers. Have a good weekend.

Original paper: https://arxiv.org/abs/1804.02767
