  • Kullback-Leibler (KL) divergence
    Mathematics/Probability, Information theory 2023. 5. 30. 00:02

    KL Divergence

    The Kullback-Leibler divergence is a function used to quantify the difference between two probability distributions: for some ideal probability distribution, it measures the difference in information entropy that arises if sampling is described using another distribution that approximates it.

     

    The KL divergence is written $D_{\textrm{KL}}(P\parallel Q)$ and represents a statistical distance between two different probability distributions P and Q.

    The KL divergence is 0 if and only if the two distributions P and Q are exactly the same.

     

     

    Applying this to a supervised learning setting in machine learning,

    when Q is the model's predicted distribution and P is the label distribution, the KL divergence can be used as a metric to measure how much the two distributions differ.

    Logistic regression, a representative method for binary classification, likewise uses binary cross entropy as its cost function.

    To understand these ideas, we first need the concepts of information content and entropy.

    Information content (self-information)

    The information content of an event A that occurs with probability p is called its self-information (or surprisal).

    Shannon, the father of information theory, stated that the information content of any kind of data can be expressed in bits (0 or 1).

    The smaller the amount of information needed to encode data, the better. To minimize it, data with high probability should be encoded with less information (shorter codes) and data with low probability with more information (longer codes); this reduces the average amount of information used.

     

    In other words, the ideal minimum information content should decrease as the probability increases. The function $-\textrm{log}(x)$ is a representative function with this behavior, so it can be used for this purpose.

    The information content of a message m is defined as follows.

    $$I(m)=\textrm{log}(\frac{1}{p(m)})=-\textrm{log}(p(m))$$

    Here the base of the logarithm is 2, reflecting bits (0 or 1), so the unit of the information content $I(m)$ is the bit.

    For example, the information content of an event with probability $\frac{1}{8}$ is $\textrm{log}_2(8)=3$ bits.

    That is, 3 bits of information are needed to convey this message.
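
    As a quick numerical check of the definition above, here is a minimal Python sketch (the helper name self_information is just illustrative):

    ```python
    import math

    def self_information(p: float, base: float = 2.0) -> float:
        """Self-information I(m) = -log_base(p(m)); measured in bits when base = 2."""
        return -math.log(p, base)

    # An event with probability 1/8 carries log2(8) = 3 bits of information.
    print(self_information(1 / 8))  # 3.0
    ```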

     

    Entropy

    Entropy, also called 'Shannon entropy', is the average of the information content of each message over a set of messages M.

    Entropy is the minimum average amount of resources needed to represent the information.

    When the events in the message space M are discrete:

    $$H(M)=E[I(M)]=\sum_{m\in M}^{}p(m)I(m)=-\sum_{m\in M}^{}p(m)\textrm{log}p(m)$$

    When the events in the message space M are continuous:

    $$H(M)=E[I(M)]=\int p(x)I(x)\,dx=-\int p(x)\ln(p(x))\,dx$$

    The expectation is computed as an integral over all values of the variable x. Note that here the base of the logarithm changes from 2 to e (the natural logarithm), so the information is measured in nats rather than bits.

    Entropy is a function of the probabilities p(x). It expresses the average information content, i.e., the minimum number of bits needed on average to encode the data.
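
    The discrete formula can be checked with a short sketch (entropy is just an ad-hoc helper for illustration):

    ```python
    import math

    def entropy(probs, base: float = 2.0) -> float:
        """Shannon entropy H(M) = -sum_m p(m) * log_base(p(m)) for a discrete distribution."""
        return -sum(p * math.log(p, base) for p in probs if p > 0)

    # A fair coin needs 1 bit per outcome on average; a biased coin needs less.
    print(entropy([0.5, 0.5]))  # 1.0
    print(entropy([0.9, 0.1]))  # ~0.47
    ```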

     

    Cross Entropy and KL Divergence

     

    Suppose we do not know with what probabilities the source that generates the data actually produces it.

    What if the data are actually generated from a distribution p(x), but because we do not know p(x) (the ground truth) we instead use the information of a known distribution q(x) (our prediction)? Expressed as a formula:

    $$H(p,q)=-\sum_{x}^{}p(x)\textrm{log}(q(x))$$

    The expression above gives the average number of bits used when the data are encoded using the distribution q(x) even though they are actually sampled from the probability distribution p(x).

    This is called the cross entropy.

    A code constructed this way will, on average, require more information (a longer code length) than the entropy.

    That is, $H(p)\leq H(p,q)$.
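
    A minimal sketch that illustrates $H(p)\leq H(p,q)$ on a toy distribution (the helper names are illustrative):

    ```python
    import math

    def cross_entropy(p, q, base: float = 2.0) -> float:
        """Cross entropy H(p, q) = -sum_x p(x) * log_base(q(x)):
        average code length when data from p are encoded with a code designed for q."""
        return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

    p = [0.5, 0.25, 0.25]  # true distribution
    q = [0.25, 0.25, 0.5]  # mismatched encoding distribution
    print(cross_entropy(p, p))  # 1.5  -> equals H(p)
    print(cross_entropy(p, q))  # 1.75 -> H(p, q) >= H(p)
    ```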

     

    Taking logistic regression in machine learning as an example: we take the binary cross entropy as the cost function and solve the optimization problem of minimizing it, which in effect drives the model's predicted distribution (q) to match the label distribution shown by the data (p).
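
    For reference, a minimal sketch of the binary cross entropy loss (averaged over samples, using the natural log and an illustrative clipping constant to avoid log(0); the names are not from any particular library):

    ```python
    import math

    def binary_cross_entropy(y_true, y_pred, eps: float = 1e-12) -> float:
        """Mean of -( y*log(q) + (1-y)*log(1-q) ) over all samples."""
        total = 0.0
        for y, q in zip(y_true, y_pred):
            q = min(max(q, eps), 1 - eps)  # clip predictions away from 0 and 1
            total += -(y * math.log(q) + (1 - y) * math.log(1 - q))
        return total / len(y_true)

    # 0/1 labels (p) and sigmoid outputs of a hypothetical logistic model (q).
    labels = [1, 0, 1, 1]
    preds = [0.9, 0.2, 0.7, 0.95]
    print(binary_cross_entropy(labels, preds))  # small value: predictions close to labels
    ```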

     

    The KL divergence is the difference between the ideal information content (entropy) and the information content of our prediction (cross entropy); put simply, it measures 'how badly we predicted'.

    Since the cross entropy is always at least as large as the entropy, and equals it only when $p=q$, the quantity obtained by subtracting the entropy from the cross entropy can be used like a distance between the two distributions. This is called the Kullback-Leibler divergence, KL divergence, or relative entropy.

    <KL Divergence>

    $$D_{KL}(p\parallel q)=H(p,q)-H(p)=\sum_{x}\left(p(x)\,\textrm{log}\frac{1}{q(x)}-p(x)\,\textrm{log}\frac{1}{p(x)}\right)=\sum_{x}p(x)\,\textrm{log}\frac{p(x)}{q(x)}$$

     

    Because $D_{KL}(p\parallel q)\neq D_{KL}(q\parallel p)$, the KL divergence cannot be regarded as a true distance function.

    However, it grows larger the more the two distributions differ and is 0 only when they coincide, so it can be used as a distance-like metric.

    Since $H(p)$ does not depend on q, we can therefore see that the cross entropy minimization problem is ultimately equivalent to the KL divergence minimization problem.
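
    A closing sketch that numerically checks the identity $D_{KL}(p\parallel q)=H(p,q)-H(p)$ and the asymmetry (helper names are illustrative):

    ```python
    import math

    def entropy(p, base=2.0):
        return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

    def cross_entropy(p, q, base=2.0):
        return -sum(pi * math.log(qi, base) for pi, qi in zip(p, q) if pi > 0)

    def kl_divergence(p, q, base=2.0):
        """D_KL(p || q) = sum_x p(x) * log(p(x) / q(x))."""
        return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)

    p = [0.5, 0.25, 0.25]
    q = [0.7, 0.2, 0.1]
    print(kl_divergence(p, q))               # ~0.168
    print(cross_entropy(p, q) - entropy(p))  # ~0.168 -> D_KL = H(p,q) - H(p)
    print(kl_divergence(q, p))               # ~0.143 -> not symmetric
    ```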

     

     

     

    Sources:

    https://shurain.net/personal-perspective/information-theory/

    https://angeloyeo.github.io/2020/10/27/KL_divergence.html

    https://reniew.github.io/17/

     
