  • SSD: Single Shot MultiBox Detector
    AI\ML\DL/Paper Review 2023. 8. 24. 23:30

    <Motivation>

    Two-stage detectors in the R-CNN family achieve high accuracy by feeding the model many different views of the image, such as region proposals.

    However, extracting and processing the region proposals takes a long time, so detection is slow.

    YOLO v1 processes the whole input image with a single unified network, so detection is very fast. However, it predicts only 2 bounding boxes per grid cell, which gives the model relatively few views and lowers accuracy.

    SSD is a model that eases this trade-off between accuracy and detection speed.

    As a 1-stage detector built on a single unified deep neural network that still uses a wide variety of views, SSD is both accurate and fast.


    Abstract

     

    SSD discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location (multiple feature maps).

     

    The sentence above sums up the SSD algorithm in one line.

    • SSD outputs default bounding boxes from several feature maps.
      (A default box is a concept similar to YOLO's anchor box.)
    • Each feature map generates default boxes with different aspect ratios and sizes, and the model uses them to predict the location and class of each bounding box. This lets it detect objects of varied sizes and shapes effectively.

     

    Default bounding boxes of different sizes and aspect ratios, generated per feature map. During training, the default boxes are matched to the ground truth boxes. The relatively large dog is matched on a later, smaller feature map, and the relatively small cat on an earlier, larger feature map.

    YOLO v1 outputs a single final feature map and predicts 2 bounding boxes for each of its grid cells, whereas SSD outputs several feature maps and predicts 6 or 4 bounding boxes per grid cell.

    In the experiments, analyzed in terms of both run time and accuracy on the PASCAL VOC, COCO, and ILSVRC datasets, SSD performed best compared with Faster R-CNN and YOLO.


    Model

    The model architecture in the paper consists of the following parts.

    1. Base convolution network

    2. Auxiliary convolution network

    The overall network uses a pretrained VGG16 as the base network and appends an auxiliary network after it.

     

    Base Convolution Network

    SSD uses a VGG-16 network pretrained on the ImageNet dataset as its base network.

    To connect it to the auxiliary convolution network described next, the fc layers near the end of VGG16 (FC6, FC7) are converted into conv layers. Removing the fc layers in this step has the added benefit of speeding up detection.

     

    Auxiliary Convolution Network

    The auxiliary network applies additional convolution layers that progressively shrink the feature map and reduce the number of channels.

    At the same time, bounding boxes are extracted from the intermediate feature maps along the way, which is one of SSD's key features.

    That is, it generates multiple bounding boxes with different aspect ratios and scales from feature maps of various sizes.

    All of these bounding box candidates are called 'default boxes'.

    According to the paper, the final number of bounding boxes is 8732 for a 300x300 input image. (See the model figure above.)

    There are 6 feature maps that generate BBoxes: 38x38, 19x19, 10x10, 5x5, 3x3, and 1x1.
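
    To make the idea of default boxes concrete, here is a minimal Python sketch that tiles one feature map with boxes. It assumes the construction described in the SSD paper: boxes centered on each cell, with width s * sqrt(a) and height s / sqrt(a) for scale s and aspect ratio a, plus one extra square box at scale sqrt(s * s_next). The function name `default_boxes_for_map` and the scale values 0.34 and 0.48 are illustrative assumptions, not values taken from this post.

```python
import math


def default_boxes_for_map(fmap_size, scale, next_scale, aspect_ratios):
    """Return default boxes as (cx, cy, w, h) in relative [0, 1] coordinates."""
    boxes = []
    for row in range(fmap_size):
        for col in range(fmap_size):
            # Center of the grid cell, normalized by the feature map size.
            cx = (col + 0.5) / fmap_size
            cy = (row + 0.5) / fmap_size
            for ratio in aspect_ratios:
                w = scale * math.sqrt(ratio)
                h = scale / math.sqrt(ratio)
                boxes.append((cx, cy, w, h))
            # One extra square box at an intermediate scale (assumed from the paper).
            extra = math.sqrt(scale * next_scale)
            boxes.append((cx, cy, extra, extra))
    return boxes


# Hypothetical settings for the 19x19 map: 5 aspect ratios + 1 extra box = 6 per cell.
boxes = default_boxes_for_map(
    19, scale=0.34, next_scale=0.48, aspect_ratios=(1.0, 2.0, 3.0, 0.5, 1 / 3)
)
print(len(boxes))  # 19 * 19 * 6 = 2166
```

    Feature maps that use 4 boxes per cell would simply pass fewer aspect ratios, for example (1.0, 2.0, 0.5).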

    A conv operation on each feature map yields the class scores and offsets of the bounding boxes we want to predict.

    • class score: the probability that each class is present in the default box
    • offset: adjustments to the x, y, w, h of the default box (4 values in total)

    The convolution here has a 3x3 kernel, and its number of output channels is the number of BBoxes to generate $\times$ (number of classes + number of offsets).

     

    So the filter applied to each feature map can be summarized as follows.

    $$ 3\times 3\times (\textrm{number of bounding boxes}\times (\textrm{class scores}+\textrm{offsets}))$$
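
    As a worked instance of the formula above, the sketch below computes the number of output channels of the 3x3 predictor convolution. The function name is hypothetical, and the 21 classes assume PASCAL VOC (20 object classes plus background).

```python
def predictor_out_channels(boxes_per_location, num_classes, num_offsets=4):
    """Output channels of the 3x3 predictor conv: boxes * (classes + offsets)."""
    return boxes_per_location * (num_classes + num_offsets)


# A feature map with 6 default boxes per location and 21 classes.
print(predictor_out_channels(6, 21))  # 6 * (21 + 4) = 150
# A feature map with 4 default boxes per location.
print(predictor_out_channels(4, 21))  # 4 * (21 + 4) = 100
```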

    At each feature map location, $k$ bounding boxes are generated,

    and each bounding box predicts softmax probabilities for $c$ classes plus 4 offset values (x, y, width, height) that adjust the position of its default bounding box.
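
    How the 4 offsets relate a default box to a target box can be sketched as follows. This assumes the center-size parameterization used in the SSD paper (center shifts scaled by the default box size, log-ratios for width and height). The function names are hypothetical, and the extra variance scaling found in many implementations is left out.

```python
import math


def encode_offsets(gt, default):
    """Offsets that move `default` onto `gt`; both boxes are (cx, cy, w, h)."""
    g_cx, g_cy, g_w, g_h = gt
    d_cx, d_cy, d_w, d_h = default
    return (
        (g_cx - d_cx) / d_w,   # center shift in units of the default width
        (g_cy - d_cy) / d_h,   # center shift in units of the default height
        math.log(g_w / d_w),   # log width ratio
        math.log(g_h / d_h),   # log height ratio
    )


def decode_offsets(offsets, default):
    """Inverse of encode_offsets: apply predicted offsets to a default box."""
    t_cx, t_cy, t_w, t_h = offsets
    d_cx, d_cy, d_w, d_h = default
    return (
        d_cx + t_cx * d_w,
        d_cy + t_cy * d_h,
        d_w * math.exp(t_w),
        d_h * math.exp(t_h),
    )


default_box = (0.5, 0.5, 0.2, 0.4)
target_box = (0.55, 0.45, 0.3, 0.3)
offsets = encode_offsets(target_box, default_box)
print(decode_offsets(offsets, default_box))  # recovers target_box up to rounding
```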


    Q. Then how many bounding boxes in total are generated from the 19x19x1024 feature map?

     

    A feature map with a 19x19 grid generates 6 bounding boxes per grid cell,

    so the 19x19x1024 feature map yields 19x19x6 (=2166) default boxes.

    (Each bounding box predicts 21 (for VOC2007) + 4 = 25 values, but 25 is just the number of parameters one bounding box carries; it has nothing to do with how many bounding boxes are generated.)

     

    Summing the bounding boxes predicted on each of the 6 feature maps (38x38, 19x19, 10x10, 5x5, 3x3, 1x1) therefore gives a total of 8732.

    You should now be able to see what the abstract meant by generating default bounding boxes with different ratios and scales at multiple feature map locations.
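
    The 8732 total can be checked with a few lines of Python. The per-location box counts (4, 6, 6, 6, 4, 4) are the SSD300 configuration from the paper; this post itself only states that each cell gets 6 or 4 boxes.

```python
# (feature map size, default boxes per location), per the SSD300 setup in the paper.
feature_maps = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

counts = [size * size * k for size, k in feature_maps]
total = sum(counts)

for (size, k), n in zip(feature_maps, counts):
    print(f"{size}x{size} map, {k} boxes per cell: {n}")
print("total:", total)  # 5776 + 2166 + 600 + 150 + 36 + 4 = 8732
```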


    NMS (Non-Maximum Suppression)

    NMS is applied to the boxes predicted from the default boxes of all the feature maps to produce the final result.
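
    A minimal greedy NMS sketch in plain Python is shown below. The corner box format (x1, y1, x2, y2), the function names, and the default IoU threshold of 0.45 are assumptions for illustration; in practice NMS is usually run per class on the decoded predictions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.45):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 with IoU of about 0.68
```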

     

     

     

    Matching strategy

    During training we need to determine which default boxes correspond to a ground truth detection and train the network accordingly. We begin by matching each ground truth box to the default box with the best jaccard overlap (=IoU). Unlike MultiBox, we then match default boxes to any ground truth with jaccard overlap higher than a threshold (0.5).
    • Ground truth boxes are matched to 'default boxes'.
    • Beyond each ground truth's best match, only pairs whose IoU is above 0.5 are matched.
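
    The two matching steps can be sketched in plain Python as below. The corner box format (x1, y1, x2, y2) and the function names are assumptions for illustration, and the sketch expects at least one default box.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_default_boxes(gt_boxes, default_boxes, threshold=0.5):
    """Map each default box index to a ground truth index, or None if unmatched."""
    matches = [None] * len(default_boxes)
    # Threshold rule: a default box takes the ground truth it overlaps most,
    # provided that overlap is higher than the threshold.
    for d, dbox in enumerate(default_boxes):
        best_gt, best_iou = None, threshold
        for g, gbox in enumerate(gt_boxes):
            overlap = iou(gbox, dbox)
            if overlap > best_iou:
                best_gt, best_iou = g, overlap
        matches[d] = best_gt
    # Best-overlap rule: every ground truth keeps its single best default box,
    # even when that overlap is below the threshold.
    for g, gbox in enumerate(gt_boxes):
        best_d = max(range(len(default_boxes)), key=lambda d: iou(gbox, default_boxes[d]))
        matches[best_d] = g
    return matches


gt = [(0, 0, 10, 10)]
defaults = [(0, 0, 10, 10), (0, 0, 10, 8), (5, 5, 15, 15), (20, 20, 30, 30)]
print(match_default_boxes(gt, defaults))  # [0, 0, None, None]
```

    Default boxes left as None would be treated as background (negatives) during training.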

     

    Training objective

    SSD's loss function is the sum of the confidence loss (conf), which is the classification loss, and the localization loss (loc), which regresses the position of the BBox.

    $$L(x,c,l,g)=\frac{1}{N}(L_{conf}(x,c)+\alpha L_{loc}(x,l,g))$$

    Here $N$ is the number of matched default boxes and $\alpha$ weights the localization term.
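
    A rough sketch of how the two terms combine is given below. It assumes, following the SSD paper, a softmax cross-entropy for the confidence term, a Smooth L1 penalty for the localization term, a normalizer equal to the number of matched default boxes, and a loss of 0 when nothing is matched. The function names and the default alpha of 1.0 are illustrative.

```python
import math


def smooth_l1(x):
    """Smooth L1 penalty for a single offset error."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def softmax_cross_entropy(logits, target):
    """Negative log softmax probability of the target class."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target]


def ssd_loss(conf_losses, loc_losses, num_matched, alpha=1.0):
    """Combine per-box confidence and localization losses as in the formula above."""
    if num_matched == 0:
        return 0.0
    return (sum(conf_losses) + alpha * sum(loc_losses)) / num_matched


# One matched box: class logits with target class 0, and four offset errors.
conf = [softmax_cross_entropy([2.0, 0.5, 0.1], 0)]
loc = [smooth_l1(v) for v in (0.2, -1.5, 0.0, 0.3)]
print(ssd_loss(conf, loc, num_matched=1))
```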

     

    (★ Work in progress)
