  • ์ด์ง„๋ถ„๋ฅ˜์—์„œ Maximum Likelihood Estimation (MLE)
    AI\ML\DL/Deep learning theory 2023. 5. 10. 17:24
    ๋ฐ˜์‘ํ˜•

     

    [์ด์ง„ ๋ถ„๋ฅ˜๋ฌธ์ œ์—์„œ MLE ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ์ด์œ ]

Suppose a logistic regression model classifies dogs and cats by predicting a probability $q$ between 0 and 1.

If we want the model to predict "dog," the goal of training is to increase $q$, the model's output probability that the input is a dog.

Conversely, to predict "cat," the model should increase the probability of a cat, $1-q$.

     

The probability $q$ is computed by passing the input $\times$ weights through a sigmoid, so it can also be viewed as a function of the weights $w$.

To estimate the model's weights $w$, we use maximum likelihood estimation.

That is, MLE seeks the weights $w$ that maximize $q$ or $1-q$

(maximize $q$ for a dog, $1-q$ for a cat).

     

    ๊ฐ•์•„์ง€ ํ˜น์€ ๊ณ ์–‘์ด ์ž…๋ ฅ ์‚ฌ์ง„ $x_{i}$ ์— ๋Œ€ํ•ด ๋ชจ๋ธ์ด ์˜ˆ์ธกํ•œ ๋™๋ฌผ์— ๋Œ€ํ•œ ํ•ด๋‹น ํ™•๋ฅ ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋‹ค. 

    $$P(y|q=f_{w}(x_{i}))=p_{i}^{y_{i}}(1-p_{i})^{1-y_{i}} \\\\\ (y_{i}=0 \ or\ 1)$$

    ์œ„ ์‹์€ i ๋ฒˆ์งธ ์ž…๋ ฅ ๋ฐ์ดํ„ฐ $x_{i}$ ์— ๋Œ€ํ•œ ๋จธ์‹ ์˜ ์ถœ๋ ฅ๊ฐ’์ด $f_{w}(x_{i})$๋กœ ์ฃผ์–ด์ ธ ์žˆ์„ ๋•Œ์˜ ์ •๋‹ต $y_{i}$ ์˜ ๋ถ„ํฌ์— ๋Œ€ํ•œ ์‹์ด๋‹ค. 

    ๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€์˜ Y๊ฐ€ ์ด์ง„ ๋ถ„๋ฅ˜๋ฅผ ํ•˜๋ฏ€๋กœ ๋ฒ ๋ฅด๋ˆ„์ด ์‹œํ–‰์„ ๋”ฐ๋ฅธ๋‹ค.

    $$P(y|q=f_{w}(x_{i}))$$ ์ด๋ฅผ Likelihood๋กœ ์‚ผ๊ณ  ์ด ๊ฐ’์„ ํ‚ค์šฐ๋ฉด ๋ชจ๋ธ์ด ์˜ˆ์ธกํ•œ ๊ฐ’์˜ ํ™•๋ฅ  (either $q$ or $1-q$ ) ์„ ๋†’์ผ ์ˆ˜ ์žˆ๋‹ค. 
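The per-sample Bernoulli likelihood above can be sketched in a few lines of Python (the function name and the probability values are illustrative, not from the post):

```python
def bernoulli_likelihood(q: float, y: int) -> float:
    """Likelihood q^y * (1 - q)^(1 - y) of label y (0 or 1) under predicted probability q."""
    return q ** y * (1 - q) ** (1 - y)

# If the model predicts q = 0.9 for "dog" and the true label is 1 (dog),
# the likelihood is 0.9; with label 0 (cat), only 1 - q = 0.1 remains.
print(bernoulli_likelihood(0.9, 1))  # 0.9
print(bernoulli_likelihood(0.9, 0))  # ~0.1
```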

     

    ์ธ๊ณต์‹ ๊ฒฝ๋ง์—์„œ likelihood๋Š”
    w๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ y์˜ ํ™•๋ฅ  ๋ถ„ํฌ๋ฅผ w์˜ ํ•จ์ˆ˜๋กœ ๋ฐ”๋ผ๋ณธ ํ•จ์ˆ˜์ด๋‹ค.

     

    [์ง๊ด€์ ์ธ ์ดํ•ด]

    ์œ„์˜ Likelihood ์‹์—์„œ y=0 ์ผ๋•Œ, 1-q ๋งŒ ๋‚จ๋Š”๋‹ค.

    ๋ชจ๋ธ์˜ ์˜ˆ์ธก๊ฐ’ q๊ฐ€ ์ •๋‹ต 0๊ณผ ์œ ์‚ฌํ• ์ˆ˜๋ก likelihood ๊ฐ’์ด 1์— ๊ฐ€๊นŒ์›Œ์ง„๋‹ค

    ๋ฐ˜๋Œ€๋กœ q๊ฐ€ ์ •๋‹ต 0๊ณผ ์˜ˆ์ธก์ด ๋‹ค๋ฅธ ๊ฒฝ์šฐ likelihood๊ฐ€ 0์— ๊ฐ€๊นŒ์›Œ์ง„๋‹ค.

     

    y=1 ์ผ๋•Œ๋„ ๋งˆ์ฐฌ๊ฐ€์ง€๋‹ค.

    ์‹ค์ œ ์ •๋‹ต ๋ ˆ์ด๋ธ”์ด y=1 ์ผ๋•Œ, likelihood ๋Š” q๋งŒ ๋‚จ๋Š”๋‹ค.

    q๊ฐ€ ์ •๋‹ต 1์— ๊ฐ€๊นŒ์›Œ์งˆ ์ˆ˜๋ก likelihood ๊ฐ’์€ 1๊ณผ ๊ฐ€๊นŒ์›Œ์ง„๋‹ค.

    ๋ฐ˜๋Œ€๋กœ q๊ฐ€ ์ •๋‹ต 0๊ณผ ์˜ˆ์ธก์ด ๋‹ฌ๋ผ์ง€๋ฉด likelihood ๊ฐ€ 0์— ๊ฐ€๊นŒ์›Œ์ง„๋‹ค.

     

Since making the prediction match the fixed true label pushes the likelihood toward 1 (i.e., makes it larger),

training can proceed in the direction that maximizes the likelihood.

In other words, the likelihood measures prediction accuracy by comparing the true label with the predicted probability.


Maximizing the likelihood is equivalent to attaching a minus sign to it and minimizing.

This negative (log-)likelihood becomes binary cross-entropy, the loss function for binary classification.

First, the likelihood of $n$ input images is:

$$L=\prod_{i=1}^{n}p_{i}^{y_{i}}(1-p_{i})^{1-y_{i}}$$

     

For example, with 2 input images, the trials are independent, so the likelihood can be written as:

$$L=p_{1}^{y_{1}}(1-p_{1})^{1-y_{1}}\times p_{2}^{y_{2}}(1-p_{2})^{1-y_{2}}$$

Applying log scaling to turn the product into a sum gives the log-likelihood (LL).

Taking the log is valid because log is monotonically increasing: the function value grows as its argument grows, so maximizing LL also maximizes L.

$$LL=y_{1}\textrm{log}\,p_{1}+(1-y_{1})\textrm{log}(1-p_{1})+y_{2}\textrm{log}\,p_{2}+(1-y_{2})\textrm{log}(1-p_{2})$$
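A quick numerical check (with made-up probabilities) that taking the log turns the product of per-sample likelihoods into the sum of logs:

```python
import math

# Toy check: log of the product of Bernoulli likelihoods equals the
# sum of per-sample log-likelihoods (the probabilities below are made up).
p = [0.9, 0.2]   # predicted probabilities for two samples
y = [1, 0]       # their true labels

L = 1.0
for pi, yi in zip(p, y):
    L *= pi ** yi * (1 - pi) ** (1 - yi)   # L = 0.9 * 0.8

LL = sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi) for pi, yi in zip(p, y))

print(abs(math.log(L) - LL) < 1e-12)  # True
```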

     

    ์—ฌ๊ธฐ์— ๋งˆ์ด๋„ˆ์Šค๋ฅผ ๋ถ™์ธ ๊ฒƒ์ด Negative Log likelihood (NLL) ์ด ๋œ๋‹ค.

    $$NLL=-\left\{y_{1}\textrm{log}p_{1}+(1-y_{1})\textrm{log}(1-p_{1})+y_{2}\textrm{log}p_{2}+(1-y_{2})\textrm{log}(1-p_{2})) \right\}$$

     

Since the predicted probability is $p=\sigma (w^{T}x)$, the NLL can be expressed as a function of the weights $w$.

Because the NLL is a quantity to be minimized, it can serve as the loss function.

     

    ๊ฒฐ๋ก ์ ์œผ๋กœ ์ธ๊ณต ์‹ ๊ฒฝ๋ง์€ ํ•™์Šต์„ ํ†ตํ•ด ๊ฐ€์ค‘์น˜ w ์— ๋Œ€ํ•œ MLE ๋ฅผ ์ˆ˜ํ–‰ํ•œ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

    i ๋ฒˆ์งธ ๋ฐ์ดํ„ฐ $(x_{i},y_{i})$์— ๋Œ€ํ•˜์—ฌ ๋ชจ๋ธ์€ ๋‹ค์Œ Likelihood์„ ์ตœ๋Œ€ํ™”ํ•˜๋Š” ๊ฐ€์ค‘์น˜๋ฅผ ์ฐพ์•„๋‚˜๊ฐ„๋‹ค.

     


Cross-Entropy vs. Negative Log-Likelihood

In the discrete case, given two probability distributions $p$ and $q$, the cross-entropy is defined as:

$$H(p,q)=-\sum_{x\in \mathcal{X}}p(x)\textrm{log}\,q(x)$$

If we take $p$ to be the true label distribution and $q$ the model's predicted distribution, this matches the negative log-likelihood expression very closely.

This is why the loss function of logistic regression is also called binary cross-entropy.


BCE vs. MSE

What if we used MSE (mean squared error) instead of BCE as the loss function for binary classification?

One might consider applying the MSE used for linear regression

directly to logistic regression.

Comparing the two loss functions on logistic regression, BCE turns out to be better in two respects.

     

1) Sensitivity of the loss function

In the dog/cat classification problem, we said that for a dog image the model should increase the predicted probability $q$.

With MSE, this means minimizing $(q-1)^{2}$.

The corresponding NLL loss is $-\textrm{log}\,q$.

If the true label is 1 but the model wrongly predicts a probability of 0,

the MSE and NLL losses are $1$ and $\infty$, respectively.

That is, $-\textrm{log}\,q$ reacts far more sensitively to errors.

Therefore the NLL, being more sensitive to errors than MSE, is better suited for binary classification.

     

2) Convexity

The shape of the loss surface depends on which loss function is used.

A convex loss curve is more favorable for finding the global minimum than a non-convex one.

To keep the math simple, assume the network has a single weight $w$ and zero bias,

so the value entering the sigmoid is just $w$. For a sample with label $y=1$, the loss as a function of $w$ is:

• MSE: $(\frac{1}{1+e^{-w}}-1)^{2}$

• BCE: $-\textrm{log}\frac{1}{1+e^{-w}}$

Let us examine the shape of each curve.

The MSE curve has an inflection point and is non-convex.

In contrast, the BCE curve is convex.

Therefore BCE is better suited than MSE for finding the global minimum.

     
