Stochastic Gradient Descent
AI\ML\DL/Deep learning theory  2023. 5. 6. 22:17

Vanilla GD vs. SGD

Gradient descent (GD) considers the entire dataset, so each step moves in the exact steepest-descent direction toward the minimum. But precisely because every example must be processed for every update, it is very slow.

Stochastic gradient descent (SGD), unlike GD, does not feed in the whole dataset at once: it draws examples at random and builds the loss function from one example at a time.

Using SGD makes it possible to overcome both the computation-speed problem of GD and its tendency to get stuck in local minima.
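To make the difference concrete, here is a minimal sketch (my own illustration, not from the post) of both update rules on a one-dimensional least-squares problem; the data, learning rate, and step count are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 3.0 * X + rng.normal(scale=0.1, size=100)   # true slope is 3

def grad(w, xs, ys):
    # gradient of the MSE loss 0.5 * mean((w*x - y)^2) with respect to w
    return np.mean((w * xs - ys) * xs)

lr = 0.1
w_gd, w_sgd = 0.0, 0.0
for step in range(200):
    # vanilla GD: one update looks at all 100 examples
    w_gd -= lr * grad(w_gd, X, y)

    # SGD: one update looks at a single randomly drawn example,
    # so each step is far cheaper but its direction is noisy
    i = rng.integers(len(X))
    w_sgd -= lr * grad(w_sgd, X[i:i+1], y[i:i+1])

print(w_gd, w_sgd)   # both end up near 3; the SGD estimate wobbles more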

     

Characteristics of SGD

1. Draw examples one at a time, at random and without replacement, build the loss from that single example, and compute its gradient. (The randomness is where the name "stochastic" comes from; see the sampling sketch just after this list.)

    ์ฆ‰, ๋ฐ์ดํ„ฐ ํ•˜๋‚˜๋งŒ ๋ณด๊ณ  ๊ทธ๋ž˜๋””์–ธํŠธ๋ฅผ ๊ฒฐ์ •ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋น ๋ฅด๊ฒŒ ์œ„์น˜๋ฅผ ์—…๋ฐ์ดํŠธํ•œ๋‹ค. ๋”ฐ๋ผ์„œ gradient ๋ฐฉํ–ฅ์ด ํ•ญ์ƒ ์ผ์ •ํ•˜์ง€๋Š” ์•Š๋‹ค.)

2. GD always computes the gradient from a loss function built over the entire dataset, so it always heads in the same direction and risks getting trapped in a local minimum. SGD, by contrast, builds the loss from a single example, so the loss function (and hence the gradient) can change at every iteration, which gives it a chance to escape local minima.
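Here is a minimal sketch (my own illustration) of the sampling-without-replacement idea from point 1: shuffle the indices once per epoch and visit each example exactly once, updating from a single example each time.

import numpy as np

def sgd_epoch(w, X, y, lr, rng):
    order = rng.permutation(len(X))        # draw without replacement within the epoch
    for i in order:
        g = (w * X[i] - y[i]) * X[i]       # gradient of 0.5 * (w*x_i - y_i)^2
        w -= lr * g                        # update from this single example
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 3.0 * X + rng.normal(scale=0.1, size=100)
w = 0.0
for epoch in range(5):
    w = sgd_epoch(w, X, y, lr=0.05, rng=rng)
print(w)   # approaches the true slope of 3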

     

However, SGD looks at just one example out of possibly millions and tries to fit only that one, so it decides its direction (gradient) too hastily; the updates are noisy and the trajectory jitters.

This is the motivation for mini-batch SGD.

Mini-batch SGD draws two or more examples at a time to build the loss. For example, mini-batch size = 2 means examples are drawn at random, two at a time, without replacement. (If only one example remains at the last iteration, that single example forms the batch even though the batch size is 2, so that no example is left out.)
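A minimal sketch (my own illustration, with made-up sizes) of how the batches are formed: shuffle once per epoch, slice off batch_size indices at a time, and let the last slice be smaller so no example is skipped.

import numpy as np

def minibatches(n_examples, batch_size, rng):
    order = rng.permutation(n_examples)            # shuffle once per epoch
    for start in range(0, n_examples, batch_size):
        yield order[start:start + batch_size]      # last batch may be smaller

rng = np.random.default_rng(0)
for batch in minibatches(5, 2, rng):
    print(batch)   # three batches of sizes 2, 2, and 1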

There is research showing that pushing the batch size too high in mini-batch SGD hurts performance.

As the batch size grows, the behavior approaches that of full-batch GD, so the chance of settling into a poor local minimum increases.

    https://arxiv.org/abs/1706.02677

     

    Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour


    ๋”ฐ๋ผ์„œ ์ด ๋…ผ๋ฌธ์—์„œ๋Š” batch size๋ฅผ ํ‚ค์šฐ๋ ค๋ฉด learning rate๋„ ๊ฐ™์ด ํ‚ค์šฐ๊ณ , warmup๋„ ํ•ด์•ผ ๊ทธ๋‚˜๋งˆ ์ž‘์€ batch size์ผ๋•Œ์˜ ์„ฑ๋Šฅ์„ ์–ป์„ ์ˆ˜ ์žˆ๋‹ค๊ณ  ํ•œ๋‹ค. 

