Paper Introduction — YOLOv3: An Incremental Improvement

Watson Wang
3 min read · May 13, 2022

1. Introduction

Sometimes you just kinda phone it in for a year, you know? I didn’t do a whole lot of research this year. Spent a lot of time on Twitter. Played around with GANs a little. I had a little momentum left over from last year [12] [1]; I managed to make some improvements to YOLO. But, honestly, nothing like super interesting, just a bunch of small changes that make it better. I also helped out with other people’s research a little.

Right in the introduction the author admits he only made a few small improvements, because he spent too much time on Twitter…, played around with GANs a little, and helped out with other people's research. So this reads more like a tech report than a full paper.

2. The Deal

2.1. Bounding Box Prediction

The section opens by noting that YOLOv3 is still significantly faster than other detection methods. As in YOLO9000, YOLOv3 uses dimension clusters as its anchor box priors, and prediction works the same way: the network predicts the offsets tx, ty, tw, th (the x, y offsets are squashed into 0–1 with a sigmoid), which are then converted back into box coordinates, as shown below.
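The conversion, as given in the paper, where (cx, cy) is the offset of the grid cell from the image's top-left corner and (pw, ph) are the width and height of the bounding-box prior:

$$
\begin{aligned}
b_x &= \sigma(t_x) + c_x \\
b_y &= \sigma(t_y) + c_y \\
b_w &= p_w \, e^{t_w} \\
b_h &= p_h \, e^{t_h}
\end{aligned}
$$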

Squared-error loss is used on these coordinate predictions during training, which keeps the gradient simple: it is just the ground-truth value minus the prediction, t̂* − t* (taking x as the example, t̂x is the value computed from the ground-truth box and tx is the network's prediction).

YOLOv3 predicts an objectness score for each box using logistic regression. The score should be 1 for the bounding-box prior that overlaps a ground-truth object more than any other prior; priors that are not the best but still overlap a ground truth by more than a 0.5 threshold have their predictions ignored.
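As a minimal sketch (my reconstruction, not the author's code) of that assignment rule, assuming we already have each prior's IoU with one ground-truth box:

```python
import numpy as np

def objectness_targets(ious):
    """Assign objectness targets per YOLOv3's rule.

    ious: IoU of each bounding-box prior with one ground-truth box.
    Returns 1 for the best prior, -1 (ignore) for non-best priors
    above the 0.5 threshold, and 0 (negative) otherwise.
    """
    ious = np.asarray(ious)
    targets = np.zeros(len(ious))
    targets[ious > 0.5] = -1        # overlaps the object but isn't best: ignored
    targets[int(np.argmax(ious))] = 1  # best prior is responsible for the object
    return targets
```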

2.2. Class Prediction

For class prediction the author drops the softmax, since it is unnecessary for good performance and cannot express multi-label data. Instead, each class gets an independent logistic classifier, trained with a binary cross-entropy loss.
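A small sketch of the idea (function name is mine): a per-class sigmoid plus binary cross-entropy, so one box can carry overlapping labels such as "woman" and "person", the case the paper mentions for Open Images:

```python
import numpy as np

def class_loss(logits, labels):
    """Independent logistic classifiers with binary cross-entropy.

    logits: (num_classes,) raw scores; labels: (num_classes,) 0/1 targets,
    where more than one entry may be 1 (multi-label).
    """
    p = 1.0 / (1.0 + np.exp(-logits))   # per-class sigmoid, not a softmax
    eps = 1e-7                          # numerical safety for the logs
    return -np.mean(labels * np.log(p + eps) +
                    (1 - labels) * np.log(1 - p + eps))
```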

2.3. Predictions Across Scales

YOLOv3 predicts boxes at 3 different scales, extracting features from each of these scales and predicting 3 boxes per scale. With 4 bounding-box offsets (tx, ty, tw, th), 1 objectness score, and 80 class predictions per box, the output tensor at each scale has shape N × N × [3 × (4 + 1 + 80)].
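To make the shape concrete, a quick sanity check, assuming COCO's 80 classes and a 416 × 416 input (which gives grid sizes of 13, 26, and 52 at the three scales):

```python
num_anchors, num_classes = 3, 80                 # 3 boxes per scale, COCO classes
depth = num_anchors * (4 + 1 + num_classes)      # 4 offsets + 1 objectness + 80 classes
for n in (13, 26, 52):                           # grid sizes for a 416x416 input
    print(f"{n}x{n}x{depth}")                    # -> 13x13x255, 26x26x255, 52x52x255
```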

The feature map from two layers back is upsampled by 2×, and a feature map from earlier in the network is concatenated with the upsampled one. This way the upsampled features contribute more semantic information, while the earlier feature map contributes finer-grained information.
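A minimal PyTorch sketch of this merge step (the channel counts here are illustrative, not the paper's exact numbers):

```python
import torch
import torch.nn as nn

deep = torch.randn(1, 256, 13, 13)      # semantically rich but coarse
early = torch.randn(1, 512, 26, 26)     # finer-grained, from an earlier layer

up = nn.Upsample(scale_factor=2, mode="nearest")(deep)  # (1, 256, 26, 26)
merged = torch.cat([up, early], dim=1)                   # (1, 768, 26, 26)
```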

As before, k-means clustering is run on the COCO dataset to choose the bounding-box priors. This yields nine clusters: (10×13), (16×30), (33×23), (30×61), (62×45), (59×119), (116×90), (156×198), (373×326).
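The paper doesn't show the clustering code, but a minimal sketch of k-means under the 1 − IoU distance introduced in YOLOv2 might look like this (function names are mine):

```python
import numpy as np

def iou_wh(box, centers):
    """IoU between one (w, h) box and each cluster center, with all boxes
    anchored at the origin -- only shape matters for anchor clustering."""
    inter = np.minimum(box[0], centers[:, 0]) * np.minimum(box[1], centers[:, 1])
    union = box[0] * box[1] + centers[:, 0] * centers[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=50, seed=0):
    """boxes: (N, 2) array of ground-truth box widths and heights."""
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # assign each box to the nearest center under the 1 - IoU distance
        assign = np.array([np.argmax(iou_wh(b, centers)) for b in boxes])
        for j in range(k):
            if np.any(assign == j):
                centers[j] = boxes[assign == j].mean(axis=0)
    return centers
```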

2.4. Feature Extractor

For feature extraction the author uses a hybrid architecture, blending the network used in YOLOv2 (Darknet-19) with residual blocks. He clearly put some thought into naming it, too; quoting the paper:

Our network uses successive 3 × 3 and 1 × 1 convolutional layers but now has some shortcut connections as well and is significantly larger. It has 53 convolutional layers so we call it…. wait for it….. Darknet-53!

[Figure: Darknet-53 model architecture]

This combination makes Darknet-53 much more powerful than Darknet-19, while still being faster than ResNet-101 or ResNet-152.
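As a rough sketch of the repeating unit (my reconstruction, not the official Darknet config), the block pairs a 1 × 1 bottleneck with a 3 × 3 convolution and a shortcut connection:

```python
import torch
import torch.nn as nn

class DarknetResidual(nn.Module):
    """Darknet-53-style residual block: 1x1 bottleneck, 3x3 conv, shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels // 2, 1, bias=False),  # 1x1 squeeze
            nn.BatchNorm2d(channels // 2),
            nn.LeakyReLU(0.1),
            nn.Conv2d(channels // 2, channels, 3, padding=1, bias=False),  # 3x3 expand
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        return x + self.block(x)  # shortcut connection
```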

The author also notes that Darknet-53 achieves the highest measured floating-point operations per second, meaning it makes more efficient use of the GPU.

3. How We Do

In the past YOLO struggled with small objects, but we now see that trend reversed: with multi-scale predictions, YOLOv3 achieves relatively high AP_S performance. In exchange, its performance on medium and large objects is comparatively worse.
When accuracy is plotted against speed on the AP50 metric, YOLOv3 compares very favorably with the other detection methods.

4. Things We Tried That Didn’t Work

4–1. Anchor box x, y offset predictions: predicting the x, y offsets directly with the normal anchor box mechanism didn’t work well.

4–2. Linear x, y predictions instead of logistic: this reduced model stability and caused mAP to drop.

4–3. Focal loss: mAP dropped by about 2 points, presumably because YOLOv3 already addresses the problem focal loss is designed to solve.

4–4. Dual IOU thresholds and truth assignment: Faster R-CNN uses two IOU thresholds during training, where overlap above 0.7 is a positive example and below 0.3 is a negative. The author tried a similar strategy but found it gave no better results.

5. What This All Means

Uh… I’ll just paste the original text; the conclusion is too solid to paraphrase. To sum it up in one sentence: may there be world peace.

Well the other people heavily funding vision research are the military and they’ve never done anything horrible like killing lots of people with new technology.

Oh wait…..I have a lot of hope that most of the people using computer vision are just doing happy, good stuff with it, like counting the number of zebras in a national park [13], or tracking their cat as it wanders around their house [19]. But computer vision is already being put to questionable use and as researchers we have a responsibility to at least consider the harm our work might be doing and think of ways to mitigate it. We owe the world that much.

It’s been a while since I read a paper this entertaining. I love reading papers. Have a good weekend.

Original paper: https://arxiv.org/abs/1804.02767
