  • Inception Net (2014.09)
    AI\ML\DL/Paper Review 2023. 9. 17. 00:07

    *  *  *

    Inception Net is a network published by Google in the same year as VGG Net,
    and it won first place in the ILSVRC 2014 competition. It is also known as GoogLeNet.
    The name is said to be inspired by the paper "Network in Network" by Lin et al. and by the famous internet meme "we need to go deeper". In fact, the paper cites the internet meme below as its very first reference.

    A scene from the movie Inception, https://knowyourmeme.com/memes/we-need-to-go-deeper


    Inception Module
     

    The Inception Module is the building block that is repeated nine times inside Inception Net.

    Recall that VGGNet stacks two 3x3 convs to obtain the same receptive field as a single 5x5 conv.
    Inception Net instead passes the input through conv layers of several sizes (1x1, 3x3, 5x5) in parallel and concatenates the resulting feature maps (filter concatenation). The figure below shows the naive version of the Inception module performing filter concatenation.


    To stack feature maps along the depth (channel) axis, their spatial (rows x columns) sizes must be identical, so getting the sizes to match is important.
    The paper fixes the stride of every layer in the inception module to 1 and adjusts the padding so that the sizes line up.
    For example, a 3x3 conv with stride=1 needs padding=1 to preserve the spatial size of the output feature map.
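The padding rule above follows from the usual convolution output-size formula, out = (in + 2*padding - kernel) / stride + 1. Below is a minimal sketch of that arithmetic, assuming square kernels and odd kernel sizes; the function names and the 28x28 input size are illustrative choices, not something taken from the paper's code.

```python
def conv_output_size(in_size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution along one axis."""
    return (in_size + 2 * padding - kernel) // stride + 1


def same_padding(kernel: int) -> int:
    """Padding that preserves the spatial size for stride 1 and an odd kernel."""
    if kernel % 2 == 0:
        raise ValueError("this sketch only handles odd kernel sizes")
    return (kernel - 1) // 2


# With stride 1 and this padding, every branch keeps a 28x28 input at 28x28,
# so the branch outputs can be concatenated along the channel axis.
for k in (1, 3, 5):
    p = same_padding(k)
    out = conv_output_size(28, k, stride=1, padding=p)
    print(f"{k}x{k} conv: padding={p}, output={out}x{out}")
```

With stride 1 this gives padding 0, 1, and 2 for the 1x1, 3x3, and 5x5 branches respectively.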
     


    <Dimension reduction with 1x1 conv>


    Inception Net uses 1x1 convolutions to reduce the number of model parameters.
    For example, suppose we want to change the depth of a feature map from 192 to 128.
    Applying a 3x3 kernel directly gives a weight of shape [128, 192, 3, 3], which requires 192x128x3x3 (221,184) parameters (biases not counted).

    If the input first passes through a 1x1 filter before the 3x3 conv (192 -> 96 -> 128),
    the weight shapes are [96, 192, 1, 1] + [128, 96, 3, 3] = 129,024 parameters, so the count goes down.


    Placing the 1x1 conv after the 3x3 conv was also tried, but it saves fewer parameters, so the dimension-reducing 1x1 conv is placed before the larger filter.
    [96, 192, 1, 1] + [128, 96, 3, 3] < [96, 192, 3, 3] + [128, 96, 1, 1]
    (= 129,024 < 178,176)
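The three parameter counts above can be reproduced with a few lines of arithmetic. This is only a sketch: `conv_params` is a hypothetical helper that counts weights for a [out_ch, in_ch, k, k] kernel and ignores biases, as the numbers in the text do.

```python
def conv_params(in_ch: int, out_ch: int, kernel: int) -> int:
    """Weights in a conv layer of shape [out_ch, in_ch, kernel, kernel] (biases ignored)."""
    return out_ch * in_ch * kernel * kernel


direct = conv_params(192, 128, 3)                                 # 3x3 only
reduce_first = conv_params(192, 96, 1) + conv_params(96, 128, 3)  # 1x1, then 3x3
reduce_last = conv_params(192, 96, 3) + conv_params(96, 128, 1)   # 3x3, then 1x1

print(direct, reduce_first, reduce_last)  # 221184 129024 178176
```

The 3x3 kernel dominates the cost, so shrinking the channel count before it (192 -> 96) saves more than shrinking it afterwards.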
     
    In contrast, swapping the order of pooling and the 1x1 conv does not change the number of parameters, so max pooling is applied first.
     


     

    Overall Structure

    The overall structure is shown above, but the text is too small to read, so let's zoom in and go through it in order, starting from the left.
     


    input

    Notation
    - 3x3+1 means a kernel of size 3x3 with stride=1.

    - S: Same, meaning the padding is adjusted so that the spatial size stays the same after the layer (for stride 1)
    - V: Valid, meaning only valid positions are scanned => padding=0
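To make the notation concrete, here is a small sketch that parses a label and computes the output size along one axis. It assumes labels are written like `7x7+2(S)` (kernel, `+stride`, then `S` or `V` in parentheses) and assumes the TensorFlow-style convention that Same padding yields ceil(in / stride) while Valid yields floor((in - kernel) / stride) + 1. Both function names are hypothetical.

```python
import math
import re


def parse_layer(spec: str):
    """Parse a label such as '7x7+2(S)' into (kernel, stride, mode)."""
    m = re.fullmatch(r"(\d+)x(\d+)\+(\d+)\((S|V)\)", spec)
    if m is None or m.group(1) != m.group(2):
        raise ValueError(f"unsupported layer spec: {spec!r}")
    return int(m.group(1)), int(m.group(3)), m.group(4)


def output_size(in_size: int, spec: str) -> int:
    """Output size along one spatial axis under Same (S) or Valid (V) padding."""
    kernel, stride, mode = parse_layer(spec)
    if mode == "S":
        # Same: padding is chosen so that only the stride shrinks the map.
        return math.ceil(in_size / stride)
    # Valid: no padding, the kernel must fit entirely inside the input.
    return (in_size - kernel) // stride + 1


print(output_size(224, "7x7+2(S)"))  # 112
print(output_size(28, "3x3+1(S)"))   # 28
print(output_size(14, "5x5+3(V)"))   # 4
```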
     


    LocalRespNorm (Local Response Normalization)

    LRN is a normalization technique that dampens the influence of pixels (activations) with very large values.
    $$b^{i}_{x,y}=\frac{a^{i}_{x,y}}{\left(k+\alpha \sum_{j=\max(0,\,i-n/2)}^{\min(N-1,\,i+n/2)}(a^{j}_{x,y})^{2}\right)^{\beta }}$$
     

     
    Let's picture adjacent feature maps stacked along the depth axis as in the figure below.
     
     

     
    The i-th feature map has two neighbors on each side. The number of neighboring filters to consider is set to $n$, and $N$ is the total number of feature maps.
    The original pixel is $a^{i}_{x,y}$, and this value is divided by a combination of the pixels at the same spatial position
    in the adjacent feature maps ($={(k+\alpha \sum_{j=\max(0,i-n/2)}^{\min(N-1, i+n/2)}(a^{j}_{x,y})^{2})^{\beta }}$), which normalizes the activations so that a single very large pixel does not drown out the smaller ones.
    These days, however, this technique is rarely used and Batch normalization is used instead.
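The LRN formula can be written out directly for the activations at a single spatial position (x, y). The sketch below is plain Python rather than a framework layer. The default values n=5, k=2, alpha=1e-4, beta=0.75 are the ones reported in the AlexNet paper and are an assumption here, since this post does not state which values Inception Net uses.

```python
def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """Apply LRN across channels at one spatial position.

    a: list of activations a[i] for channels i = 0..N-1 at a fixed (x, y).
    """
    N = len(a)
    out = []
    for i in range(N):
        lo = max(0, i - n // 2)       # first neighboring channel
        hi = min(N - 1, i + n // 2)   # last neighboring channel
        sq_sum = sum(a[j] ** 2 for j in range(lo, hi + 1))
        out.append(a[i] / (k + alpha * sq_sum) ** beta)
    return out


# Exaggerated alpha so the damping is easy to see: the large activation in the
# middle inflates the denominator of its neighbors as well as its own.
acts = [1.0, 2.0, 100.0, 2.0, 1.0]
print(local_response_norm(acts, n=3, k=1.0, alpha=1.0, beta=0.75))
```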


     
    Inception Module (①, ②, ③, ⑤, ⑥, ⑧, ⑨)
     

     
    This is the section where the same layers as in the earlier figure, (b) Inception module with dimension reductions, are repeated.
    Every conv layer has stride 1, and the padding is adjusted to match the spatial sizes.
    1x1 convolutions are included here to reduce the number of parameters.
    This was explained above, so the details are omitted.
     
     

    Auxiliary classifiers (④, ⑦)

    Besides the main classifier, Inception Net has two auxiliary classifiers, which exist to counter the vanishing gradient problem.

    Auxiliary classifiers 1 and 2 also output softmax values, and their losses are added when computing the final loss.
    The total loss is defined as follows, with the auxiliary losses weighted (the paper uses 0.3).
    Loss = out_loss + 0.3*(aux1_loss + aux2_loss)
    However, the auxiliary classifiers exist only during training and are removed at test time.
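The weighted loss can be sketched for a single training example as follows. This is an illustration, not the paper's training code: it assumes each classifier head produces a list of logits and that each loss is a softmax cross-entropy. The names `cross_entropy` and `total_loss` are hypothetical.

```python
import math


def cross_entropy(logits, target: int) -> float:
    """Softmax cross-entropy for one example, using the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


def total_loss(main_logits, aux1_logits, aux2_logits, target: int,
               aux_weight: float = 0.3) -> float:
    """Training loss: main loss plus the down-weighted auxiliary losses."""
    out_loss = cross_entropy(main_logits, target)
    aux1_loss = cross_entropy(aux1_logits, target)
    aux2_loss = cross_entropy(aux2_logits, target)
    return out_loss + aux_weight * (aux1_loss + aux2_loss)


# Three-class toy example where the correct class is index 0.
print(total_loss([2.0, 0.5, 0.1], [1.0, 0.8, 0.2], [0.9, 0.9, 0.3], target=0))
```

At test time only the main head would be evaluated, so `total_loss` is never called there.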

     
    Main classifier

     
    Including the main classifier, the network produces three softmax outputs in total (softmax 0, softmax 1, softmax 2).


    The overall structure as a table
     

    The parts highlighted in yellow are the nine Inception Modules.
    As in figure (b), the feature maps from the 1x1 conv, 3x3 conv, 5x5 conv, and pool proj branches are concatenated.
    As inception (3a) shows, the 256 channels in the output size 28x28x256 are the sum of 64, 128, 32, and 32, which correspond to the 1x1 conv, 3x3 conv, 5x5 conv, and pool proj branches respectively.
    The same holds for the other inception modules.
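The channel bookkeeping is just a sum over branches, since concatenation along depth adds channel counts. In the sketch below the 3a row uses the numbers quoted above, while the 3b row is recalled from the paper's architecture table and should be checked against it before relying on it.

```python
# Output channels per branch, in the order (1x1, 3x3, 5x5, pool proj).
branches = {
    "inception (3a)": (64, 128, 32, 32),    # quoted in this post
    "inception (3b)": (128, 192, 96, 64),   # assumption: recalled from the paper's table
}


def concat_channels(branch_channels) -> int:
    """Depth of the concatenated output is the sum of the branch depths."""
    return sum(branch_channels)


for name, chans in branches.items():
    print(name, "->", concat_channels(chans))
```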

     

    Performance Comparison Table

    The table below shows the performance obtained by ensembling 7 models (6 of the same size as the model presented above + 1 slightly larger model), transforming each image into 144 variants, passing them through the models, and averaging the 7*144=1008 outputs for the final classification.

    The 6 models are said to be identical in everything from architecture to initial weights, differing only in the order in which they see the data.

    The 144 images are produced by applying various resize and crop transforms to the image. This is a form of TTA (test-time augmentation).

     
    Transform Techniques

    • Resize to 4 scales (so that the shorter side, height or width, is 256, 288, 320, or 352) (4 cases)
    • Crop the left, center, and right squares for landscape images / the top, center, and bottom squares for portrait images (3 cases)
    • Horizontal flip (2 cases)
    • Take five 224x224 crops (top-left, top-right, bottom-left, bottom-right, center), plus the uncropped square resized to 224x224, for a total of 5+1=6 cases
    • So 4*3*2*6 = 144 in total!
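The count of 144 can be checked by enumerating every combination of the four choices. This sketch only enumerates labels and does not touch any image. The label strings are illustrative.

```python
from itertools import product

scales = (256, 288, 320, 352)                      # shorter-side sizes
squares = ("left/top", "center", "right/bottom")   # square taken from the resized image
crops = ("top-left", "top-right", "bottom-left",
         "bottom-right", "center", "resize")       # five 224x224 crops + resized square
flips = (False, True)                              # original and mirrored

variants = list(product(scales, squares, crops, flips))
print(len(variants))  # 144

num_models = 7
print(num_models * len(variants))  # 1008 outputs averaged per image
```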

    Summary

    1. Uses filters of several sizes and concatenates the results
    2. Dimension reduction with 1x1 conv → fewer parameters
    3. Auxiliary classifiers mitigate vanishing gradients
    4. Took first place with more than 10x fewer parameters than VGGNet
    5. Also known as GoogLeNet
