Backpropagation

AI\ML\DL/Deep learning theory · 2023. 5. 7. 19:38

In a deep neural network, we

1. compute the gradient, the partial derivative of the loss function with respect to each parameter (weight/bias), and
2. update the parameters toward the optimum with an optimizer such as SGD (stochastic gradient descent), as sketched below.
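
For step 2, here is a minimal NumPy sketch of one SGD update. The gradient is a random placeholder standing in for the result of step 1, and the learning rate `lr` is an assumed hyperparameter.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))        # a weight matrix
grad_W = rng.normal(size=(3, 4))   # placeholder for dL/dW from step 1
lr = 0.01                          # assumed learning rate

W -= lr * grad_W                   # step 2: move against the gradient
```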

To compute the gradient in step 1 for a weight deep inside the network, we have to apply the chain rule.

Since the chain rule passes derivatives backward through the network, multiplying them along the way, this method is called backpropagation.
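
For intuition, take the scalar case $d=wx+b$, $n=f(d)$, $L=(n-y)^{2}$ (a toy example, not the network below). Differentiating from front to back and multiplying gives

$$\frac{\partial L}{\partial w}=\frac{\partial d}{\partial w}\,\frac{\partial n}{\partial d}\,\frac{\partial L}{\partial n}=x\cdot f'(d)\cdot 2(n-y)$$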

    ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์‹ ๊ฒฝ๋ง์—์„œ Backpropagation์„ ํ†ตํ•ด ํ–‰๋ ฌ๋กœ ํ‘œํ˜„๋œ weight ์— ๋Œ€ํ•ด loss function ์„ ๋ฏธ๋ถ„ํ•ด๋ณด์ž.

    ๊ทธ๋ฆผ์˜ ๊ฐ ๋…ธ๋“œ์— ๋Œ€ํ•ด ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์‹์„ ์“ธ ์ˆ˜ ์žˆ๋‹ค.

$$\begin{align} &\textbf{d}_{1}=\textbf{n}_{0}\textbf{W}_{1}+\textbf{b}_{1}\\ &\textbf{n}_{1}=\textbf{f}_{1}(\textbf{d}_{1})\\ &\textbf{d}_{2}=\textbf{n}_{1}\textbf{W}_{2}+\textbf{b}_{2}\\ &\textbf{n}_{2}=\textbf{f}_{2}(\textbf{d}_{2}) \end{align}$$
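
A minimal NumPy sketch of this forward pass; the layer sizes (3 inputs, 4 hidden units, 2 outputs) and the tanh activations are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n0 = rng.normal(size=(1, 3))                               # input row vector
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=(1, 4))  # layer 1
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=(1, 2))  # layer 2
f1 = f2 = np.tanh                                          # assumed activations

d1 = n0 @ W1 + b1   # d1 = n0 W1 + b1
n1 = f1(d1)         # n1 = f1(d1)
d2 = n1 @ W2 + b2   # d2 = n1 W2 + b2
n2 = f2(d2)         # n2 = f2(d2) = [y1_hat, y2_hat]
```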

The loss function is $L=(\hat{y}_{1}-y_{1})^{2}+(\hat{y}_{2}-y_{2})^{2}$, where the network output is $\textbf{n}_{2}=\left[\begin{matrix} \hat{y}_{1} & \hat{y}_{2} \end{matrix}\right]$.

$$\begin{align*} L &=(\hat{y}_{1}-y_{1})^{2}+(\hat{y}_{2}-y_{2})^{2} \\ &=(\textbf{n}_{2}-\textbf{y})(\textbf{n}_{2}-\textbf{y})^{T} \\ &=\left(\left[\begin{matrix} \hat{y}_{1} & \hat{y}_{2} \end{matrix}\right]-\left[\begin{matrix} y_{1} & y_{2} \end{matrix}\right]\right)\left(\left[\begin{matrix} \hat{y}_{1} \\ \hat{y}_{2} \end{matrix}\right]-\left[\begin{matrix} y_{1} \\ y_{2} \end{matrix}\right]\right) \\ &=\left[\begin{matrix} \hat{y}_{1}-y_{1} & \hat{y}_{2}-y_{2} \end{matrix}\right]\left[\begin{matrix} \hat{y}_{1}-y_{1} \\ \hat{y}_{2}-y_{2} \end{matrix}\right] \\ &=(\hat{y}_{1}-y_{1})^{2}+(\hat{y}_{2}-y_{2})^{2} \end{align*}$$

That is, we have expressed the loss function in vector form: $L=(\textbf{n}_{2}-\textbf{y})(\textbf{n}_{2}-\textbf{y})^{T}$.
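
Continuing the sketch above with an assumed target row vector `y`, we can check numerically that the vector form equals the componentwise sum of squares.

```python
y = np.array([[0.5, -0.5]])             # assumed target row vector

L_vec = ((n2 - y) @ (n2 - y).T).item()  # (1x2)(2x1) -> scalar
L_sum = ((n2 - y) ** 2).sum()           # (y1_hat-y1)^2 + (y2_hat-y2)^2
assert np.isclose(L_vec, L_sum)
```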

To differentiate $L$ with respect to a weight matrix $\textbf{W}$, we will vectorize $\textbf{W}$.

$$\mathbf{w}_{1}=\textrm{vec}(\mathbf{W}_{1})$$

$$\mathbf{w}_{2}=\textrm{vec}(\mathbf{W}_{2})$$
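
In the running sketch, NumPy's `ravel()` can play the role of $\textrm{vec}$. Note that it flattens in row-major order; I assume that ordering here so that it matches the Kronecker-product Jacobians below.

```python
w1 = W1.ravel()   # vec(W1): row-major flattening, shape (12,)
w2 = W2.ravel()   # vec(W2): shape (8,)
```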

To take the partial derivative of $L$ with respect to the vector $\mathbf{w}$, we can use the chain rule (see item 5 of https://deep-learning-basics.tistory.com/1).

$$\frac{\partial L}{\partial \mathbf{w}^{T}_{2}}=\frac{\partial \textbf{d}_{2}}{\partial \mathbf{w}^{T}_{2}}\frac{\partial \textbf{n}_{2}}{\partial \mathbf{d}^{T}_{2}}\frac{\partial L}{\partial \mathbf{n}^{T}_{2}} \tag{1}$$

$$(\textbf{w}_{2}\to\textbf{d}_{2}\to \textbf{n}_{2}\to L)$$

$$\frac{\partial L}{\partial \mathbf{w}^{T}_{1}}=\frac{\partial \textbf{d}_{1}}{\partial \mathbf{w}^{T}_{1}}\frac{\partial \textbf{n}_{1}}{\partial \mathbf{d}^{T}_{1}}\frac{\partial \mathbf{d}_{2}}{\partial \mathbf{n}^{T}_{1}}\frac{\partial \textbf{n}_{2}}{\partial \mathbf{d}^{T}_{2}}\frac{\partial L}{\partial \mathbf{n}^{T}_{2}} \tag{2}$$

$$(\textbf{w}_{1}\to\textbf{d}_{1}\to \textbf{n}_{1}\to \textbf{d}_{2}\to \textbf{n}_{2}\to L)$$

We obtain each partial derivative by differentiating from the front of the chain to the back and multiplying the factors together.

    ์‹ (1) ๋ถ€ํ„ฐ ๋ฏธ๋ถ„๊ฐ’์„ ๊ตฌํ•ด๋ณด์ž.

$$\begin{align} &\frac{\partial \textbf{d}_{2}}{\partial \textbf{w}^{T}_{2}}=\textbf{n}_{1}^{T}\otimes\textbf{I} \\ &\frac{\partial \textbf{n}_{2}}{\partial \textbf{d}^{T}_{2}}=\textrm{diag}(\textbf{f}_{2}^{'}(\textbf{d}_{2})) \\ &\frac{\partial L}{\partial \textbf{n}^{T}_{2}}=2(\textbf{n}_{2}-\textbf{y})^{T} \end{align}$$
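
Under the row-major vec assumption above, the first factor can be checked against a finite-difference Jacobian in the running sketch; the other factors are written out for later use (for tanh, $f'(d)=1-\tanh^{2}(d)$).

```python
# Finite-difference Jacobian of d2 w.r.t. w2 (rows: w2, cols: d2).
eps = 1e-6
J_fd = np.zeros((W2.size, 2))
for i in range(W2.size):
    W2_eps = W2.copy()
    W2_eps.flat[i] += eps                     # perturb one entry of vec(W2)
    J_fd[i] = ((n1 @ W2_eps + b2 - d2) / eps).ravel()

J_kron = np.kron(n1.T, np.eye(2))             # d d2 / d w2^T = n1^T (x) I
assert np.allclose(J_fd, J_kron, atol=1e-4)

D2 = np.diag((1 - np.tanh(d2) ** 2).ravel())  # d n2 / d d2^T = diag(f2'(d2))
dL_dn2 = 2 * (n2 - y).T                       # d L / d n2^T, a column vector
```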

Evaluating the factors of equation (2) gives the following.

$$\begin{align} &\frac{\partial \textbf{d}_{1}}{\partial \mathbf{w}^{T}_{1}}=\textbf{n}_{0}^{T}\otimes\textbf{I} \\ &\frac{\partial \textbf{n}_{1}}{\partial \mathbf{d}^{T}_{1}}=\textrm{diag}(\textbf{f}_{1}^{'}(\textbf{d}_{1})) \\ &\frac{\partial \mathbf{d}_{2}}{\partial \mathbf{n}^{T}_{1}}=\textbf{W}_{2} \\ &\frac{\partial \textbf{n}_{2}}{\partial \textbf{d}^{T}_{2}}=\textrm{diag}(\textbf{f}_{2}^{'}(\textbf{d}_{2})) \\ &\frac{\partial L}{\partial \mathbf{n}^{T}_{2}}=2(\textbf{n}_{2}-\textbf{y})^{T} \end{align}$$
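
Multiplying the factors in order assembles the gradients of equations (1) and (2); continuing the sketch, the deeper gradient can be validated against finite differences.

```python
D1 = np.diag((1 - np.tanh(d1) ** 2).ravel())  # d n1 / d d1^T = diag(f1'(d1))

grad_w2 = np.kron(n1.T, np.eye(2)) @ D2 @ dL_dn2             # equation (1)
grad_w1 = np.kron(n0.T, np.eye(4)) @ D1 @ W2 @ D2 @ dL_dn2   # equation (2)

def loss(W1_, W2_):
    n2_ = np.tanh(np.tanh(n0 @ W1_ + b1) @ W2_ + b2)
    return ((n2_ - y) @ (n2_ - y).T).item()

g_fd = np.zeros(W1.size)
for i in range(W1.size):
    W1_eps = W1.copy()
    W1_eps.flat[i] += eps                     # perturb one entry of vec(W1)
    g_fd[i] = (loss(W1_eps, W2) - loss(W1, W2)) / eps

assert np.allclose(g_fd, grad_w1.ravel(), atol=1e-4)
```

Reshaped back to the shapes of $\textbf{W}_{1}$ and $\textbf{W}_{2}$, these gradient vectors are exactly what the SGD update at the top of the post consumes.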
