Batch Normalization

https://arxiv.org/pdf/1502.03167

 

 

Background

batch normalizaion ์€ 2015๋…„์— ์ œ์‹œ๋œ ICS(Internal Covariate Shift) ๋ฌธ์ œ๋ฅผ ์ค„์ผ ์ˆ˜ ์žˆ๋Š” ์•„์ด๋””์–ด์ž…๋‹ˆ๋‹ค. covariate shift ๋Š” ํ•™์Šต ๋•Œ ํ™œ์šฉํ•œ ๋ฐ์ดํ„ฐ๊ฐ€ ์‹ค์ œ ์ถ”๋ก ์— ์‚ฌ์šฉ๋˜๋Š” ๋ฐ์ดํ„ฐ๊ฐ„์˜ ๋ถ„ํฌ๊ฐ€ ๋‹ค๋ฅด๋ฉด ์ถ”๋ก  ์„ฑ๋Šฅ์— ์•…์˜ํ–ฅ์„ ๋ฏธ์น  ์ˆ˜ ์žˆ๋‹ค๋ผ๋Š” ์ฃผ์žฅ์ธ๋ฐ ์ด๊ฒŒ ์‹ ๊ฒฝ๋ง ๋‚ด๋ถ€์—์„œ๋„ ๋ฐœ์ƒํ•  ๊ฒƒ์ด๋‹ค ๋ผ๋Š” ์ฃผ์žฅ์„ ํ•˜๋ฉฐ ์ƒ๊ธด์šฉ์–ด๊ฐ€ Internal Covariate Shift ๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค. ์•„๋ž˜ ์‚ฌ์ง„์„ ๋ณด๋ฉด ์ง๊ด€์ ์œผ๋กœ ์ดํ•ด๊ฐ€ ๋  ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค. ์‹ ๊ฒฝ๋ง์„ ํ†ต๊ณผํ•˜๋ฉด์„œ ๋ฐ์ดํ„ฐ์˜ ๋ถ„ํฌ๊ฐ€ ๋‹ฌ๋ผ์ง€๋Š” ํ˜„์ƒ์ด ๋ฐœ์ƒํ•˜๋Š”๋ฐ

 

and the more layers it passes through, the more severe this becomes, so the odds of problems in training or inference naturally grow. Batch Normalization addresses the fact that each batch of training data has a different distribution by normalizing per batch, using the batch's mean and variance.

๋‚˜๋™๋นˆ๋‹˜์˜ ์˜์ƒ์„ ์ฐธ๊ณ ํ•˜์—ฌ ์•Œ๊ฒŒ ๋œ batch normalizaion๊ฐ€ ํ˜„์‹ค์—์„œ๋Š” ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์˜์กด๋„๋ฅผ ์ค„์˜€์œผ๋ฉฐ, ํ•™์Šต์†๋„๋ฅผ ํ–ฅ์ƒ์‹œํ‚ค๊ณ , ๋ชจ๋ธ์ด ์ผ๋ฐ˜์ ์œผ๋กœ ์ฆ‰, ํ•™์Šต๋ฐ์ดํ„ฐ์—๋งŒ ํƒœ์Šคํฌ๋ฅผ ์ž˜ ์ฒ˜๋ฆฌํ•˜๋„๋ก ํ•˜๋Š”๊ฒƒ์ด ์•„๋‹Œ ์‹ค์ œ ํ˜„์ƒ์„ ์ž˜ ๋ฐ˜์˜์‹œํ‚ค๊ฒŒ ๋œ ํšจ๊ณผ๊ฐ€ ์žˆ์—ˆ๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

๊ทธ๋Ÿฐ๋ฐ ๋…ผ๋ฌธ์—์„œ๋Š” ics ๋ฅผ ๊ฐ์†Œ์‹œํ‚จ๋‹ค๊ณ  ์ฃผ์žฅํ•˜์˜€์œผ๋‚˜ ์‹ค์ œ๋กœ ์ฆ๋ช…ํ•˜์ง€๋Š” ๋ชปํ–ˆ๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋ž˜์„œ ๊ทธ๊ฒƒ์„ ์ฆ๋ช…ํ•˜๊ธฐ ์œ„ํ•œ How Does Batch Normalization Help Optimization?  ๋ผ๋Š” ๋…ผ๋ฌธ์ด ๋‚˜์™”์Šต๋‹ˆ๋‹ค.

https://arxiv.org/pdf/1805.11604

 

 

First, it shows that in general, networks with Batch Norm applied saw their accuracy rise at a steep rate.

 

 

์šฐ์ธก์˜ ํžˆ์Šคํ† ๊ทธ๋žจ์„ ๋ณด๋ฉด ๊ฐ ๋ ˆ์ด์–ด์˜ ๋ถ„ํฌ๋ฅผ ๋‚˜ํƒ€๋‚ด๊ณ  ์žˆ๋Š”๋ฐ์š” ๊ฐ€์žฅ์šฐ์ธก์˜ Standard + Noisy BatchNorm ์—์„œ Layer3 ๋ถ€ํ„ฐ ๋ถ„ํฌ๊ฐ€ ๊ฐ‘์ž‘์Šค๋Ÿฝ๊ฒŒ ๋ณ€ํ•˜์—ฌ ICS๊ฐ€ ๋ฐœ์ƒํ•˜๊ณ  ์žˆ์Œ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ICS๊ฐ€ ๋ฐœ์ƒํ•˜๊ณ  ์žˆ์Œ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  ์™ผ์ชฝ ๊ทธ๋ž˜ํ”„๋ฅผ ๋ณด๋ฉด ํ•™์Šต์„ฑ๋Šฅ์ด ์šฐ์ˆ˜ํ•จ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ฆ‰ ์ž„์˜๋กœ Batch Norm Layer ์ดํ›„ ๋ฐ”๋กœ Noise ๋ฅผ ๋„ฃ์–ด covariate shift ๋ฅผ ๋ฐœ์ƒ์‹œ์ผฐ์„ ๋•Œ์—๋„ BatchNorm ์ด ํฌํ•จ๋œ ๋„คํŠธ์›Œํฌ๋Š” ์ผ๋ฐ˜์ ์ธ ๋„คํŠธ์›Œํฌ๋ณด๋‹ค ์„ฑ๋Šฅ์ด ์šฐ์ˆ˜ํ•จ์„ ๋ณด์˜€์Šต๋‹ˆ๋‹ค. ๊ทธ๋ž˜์„œ ์‹คํ—˜์ ์œผ๋กœ Batch Norm ์ด ICS ๋ฌธ์ œ๋ฅผ ํ•ด์†Œํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์ด์ „ ๋…ผ๋ฌธ์˜ ๋ฐ˜๋ฐ•์„ ํ•˜์˜€๊ณ , ์‹ฌ์ง€์–ด ICS๊ฐ€ ํฌ๊ฒŒ ๋ฐœ์ƒํ•จ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  Batch Norm ์ด ์žˆ์œผ๋ฉด ์„ฑ๋Šฅ์ด ์ข‹์•„์ง„๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค€ ์‚ฌ๋ก€๊ฐ€ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

ํ•ด๋‹น๋…ผ๋ฌธ์—์„œ ICS๋ฅผ ํŒŒ๋ผ๋ฏธํ„ฐ์˜ ๊ธฐ์šธ๊ธฐ ๊ณ„์‚ฐํ•˜์—ฌ ICS๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ–ˆ๋Š”๋ฐ, ํฌ์ŠคํŒ…์˜ ๋ชฉ์ ๋ณด๋‹ค ๋„ˆ๋ฌด ๋ฒ—์–ด๋‚˜๋Š”๊ฒƒ ๊ฐ™์•„ ๋‹ค๋ฃจ์ง€ ์•Š๊ฒ ์Šต๋‹ˆ๋‹ค. ๊ถ๊ธˆํ•˜์‹ ๋ถ„๊ป˜์„œ๋Š” ๋…ผ๋ฌธ์„ ์ฐธ๊ณ ํ•˜์‹œ๋ฉด ๋  ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค.

Then why is performance better even though ICS is not resolved? The paper explains that it is due to Batch Norm's smoothing effect.

 

It argues that the loss landscape is reshaped into a much more predictable form, which amplifies the effect of training.

 

 

Batch Normalization Layer

๋ฏธ๋‹ˆ๋ฐฐ์น˜์˜ ํ‰๊ท ๊ฐ’๊ณผ ๋ถ„์‚ฐ์„ ๊ตฌํ•ด์„œ normalizaion ์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๊ฐ๋งˆ์™€ ๋ฒ ํƒ€๋ฅผ ํ™œ์šฉํ•ด ์‹ค์ œ output ์„ ๋‚ด๋Š”๋ฐ์š”, ์—ฌ๊ธฐ์„œ ๊ฐ๋งˆ์™€ ๋ฒ ํƒ€๊ฐ€ ์‹ค์ œ ํ•™์Šต์— ํ™œ์šฉ๋˜๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ์ž…๋‹ˆ๋‹ค. ํ•™์Šต์ค‘์—๋Š” loss ๋ฅผ ์ตœ์†Œํ™” ํ•˜๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ๊ฐ๋งˆ์™€ ๋ฒ ํƒ€๋ฅผ ์ฐพ์•„๊ฐˆ ๊ฒƒ ์ž…๋‹ˆ๋‹ค.

์ •๊ทœํ™”์—์„œ ํ•™์Šต ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ์ด์œ ๋Š” ํ™œ์„ฑํ™” ํ•จ์ˆ˜์˜ ํŠน์ง•์— ์žˆ์Šต๋‹ˆ๋‹ค. sigmoid๋ฅผ ์˜ˆ์‹œ๋กœ ๋“ค๋ฉด ์–ด๋–ค ๊ตฌ๊ฐ„์—์„œ๋Š” ๋งค์šฐ ์„ ํ˜•์ ์œผ๋กœ ์ž‘๋™ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ํ‘œ์ค€์ •๊ทœ๋ถ„ํฌ๋กœ ์ •๊ทœํ™”ํ•œ 0๊ณผ 1์‚ฌ์ด์˜ ๊ฐ’์—์„œ ์„ ํ˜•์ ์œผ๋กœ ์ž‘๋™ํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ๊ทธ๋ž˜์„œ ๊ฐ๋งˆ์™€ ๋ฒ ํƒ€๋ฅผ ํ™œ์šฉํ•ด non-linearity ๋ฅผ ์ง€์ผœ์ฃผ๊ณ , ํ•ด๋‹น ์ •๊ทœํ™” ๋ ˆ์ด์–ด์˜ output ๋„ ์ ์ ˆํ•˜๊ฒŒ ๋‚ด๋ณด๋‚ผ ์ˆ˜ ์žˆ๊ฒŒ๋ฉ๋‹ˆ๋‹ค. ๊ฒฐ๋ก ์€ ๋ ˆ์ด์–ด์˜ ์ž…๋ ฅ์„ ์ •๊ทœํ™”ํ•  ๋•Œ๋Š” linearity ๋ฅผ ์ฃผ์˜ํ•ด์„œ ์ •๊ทœํ™” ํ•ด์•ผํ•œ๋‹ค๋Š” ์  ์ž…๋‹ˆ๋‹ค.

 

Batch Normalization Layer: Training vs. Inference

The batch normalization layer plays a different role in the network at training time and at inference time. During training, the gamma and beta parameters must be learned, but at inference time no learning is needed. The parameters are therefore frozen, and the output must come from the parameters as learned. In addition, at inference the layer normalizes with the running mean and variance accumulated during training, rather than with the current batch's statistics.

 

From step 7 onward, the BN layers that sat in the network in training mode are switched to inference mode (by freezing the parameters).
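In PyTorch, this mode switch is exactly what `train()` / `eval()` control; a small sketch (the tensor values are just the toy batch used below):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=2)
x = torch.tensor([[1., 2.], [2., 4.], [3., 6.]])

bn.train()        # training mode: normalize with the current batch statistics
y_train = bn(x)   # also updates bn.running_mean / bn.running_var

bn.eval()         # inference mode: batch statistics are no longer used
y_eval = bn(x)    # normalizes with the fixed running estimates instead

print(bn.running_mean, bn.running_var)
```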

Batch Normalization Data Flow

์ž…๋ ฅ ๋ฐ์ดํ„ฐ (X)

 

$$ X = \begin{bmatrix} [1,\ 2] \\ [2,\ 4] \\ [3,\ 6] \end{bmatrix} $$

The data comes in as a batch.

shape: (3, 2)

→ 3 samples, each a 2-dimensional vector


Passing through the Linear Layer

We set the weights and bias as follows:

$$ W = \begin{bmatrix} [1,\ 0] \\ [0,\ 1] \end{bmatrix}, \quad b = [0,\ 0] $$

That is, an identity linear layer with no effect.

$$ Z = XW + b = X $$

Result:

Z =
[
 [1, 2],
 [2, 4],
 [3, 6]
]

shape unchanged: (3, 2)


Batch Normalization

1๏ธโƒฃ Batch Mean (μ)

Per-feature mean:

$$ \mu = [(1+2+3)/3,\ (2+4+6)/3] = [2,\ 4] $$


2๏ธโƒฃ Batch Variance (σ²)

$$ \sigma^2 = [((1-2)^2+(2-2)^2+(3-2)^2)/3,\ ((2-4)^2+(4-4)^2+(6-4)^2)/3] = [2/3,\ 8/3] $$


3๏ธโƒฃ Normalize (xฬ‚)

$$ \hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \quad (\text{assuming } \epsilon \text{ is negligible}) $$

Per-sample computation

First sample

$$ [1,\ 2] \rightarrow [-1/\sqrt{2/3},\ -2/\sqrt{8/3}] \approx [-1.22,\ -1.22] $$

Second sample

$$ [2,\ 4] \rightarrow [0,\ 0] $$

Third sample

$$ [3,\ 6] \rightarrow [1.22,\ 1.22] $$

Result:

X_hat =
[
 [-1.22, -1.22],
 [ 0.00,  0.00],
 [ 1.22,  1.22]
]

Then the gamma and beta operations are applied to these values and the result is passed on through the layer. In this way, batch norm performs normalization by computing the per-feature mean and variance of the mini-batch and applying them to the original data.
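A quick NumPy check of this data flow (a sketch; epsilon is dropped, as assumed above):

```python
import numpy as np

# Reproduce the worked example: statistics are per feature, over the batch.
X = np.array([[1., 2.], [2., 4.], [3., 6.]])
mu = X.mean(axis=0)              # [2. 4.]
var = X.var(axis=0)              # [2/3, 8/3]
X_hat = (X - mu) / np.sqrt(var)  # epsilon ignored, as assumed above
print(X_hat.round(2))
# [[-1.22 -1.22]
#  [ 0.    0.  ]
#  [ 1.22  1.22]]
```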

Layer Normalization

https://arxiv.org/pdf/1607.06450

Layer Normalization ์€ Batch Norm ์ด RNN ์— ์ ์šฉํ•˜๊ธฐ ์–ด๋ ค์šด ๋ฌธ์ œ์ ์„ ํ•ด์†Œํ•˜๊ธฐ ์œ„ํ•ด ์ œ์‹œ๋œ ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. RNN์€ ์‹œ๊ฐ„๋‹จ์œ„๋กœ ๊ณ„์‚ฐ์„ ํ•ฉ๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ๋ฏธ๋‹ˆ๋ฐฐ์น˜์˜ ๊ฐ ํ”ผ์ณ๋งˆ๋‹ค ํ†ต๊ณ„๋ฅผ ์ด์šฉํ•ด ์ •๊ทœํ™”ํ•˜๋Š” BN ์˜ ๊ฒฝ์šฐ์—๋Š” ํ•ด๋‹น ์ŠคํŠธ๋ฆผ์˜ ๋งฅ๋ฝ์„ ๋ฐ˜์˜ํ•˜์ง€ ๋ชปํ•ฉ๋‹ˆ๋‹ค.

๊ฐ€์žฅ ํฐ ๋ฌธ์ œ๋Š” RNN ์ด๋‚˜ NLP, ํ˜น์€ ์Œ์„ฑ๋ฐ์ดํ„ฐ์˜ ๊ฒฝ์šฐ๋Š” ๋ฐฐ์น˜๋งˆ๋‹ค ๊ธธ์ด๊ฐ€ ๋‹ค๋ฆ…๋‹ˆ๋‹ค.

Sample 1: "๋‚˜๋Š” ๋ฐฅ์„ ๋จน์—ˆ๋‹ค" (I ate a meal)                      (length 4)
Sample 2: "์˜ค๋Š˜" (Today)                                        (length 1)
Sample 3: "์–ด์ œ ๋น„๊ฐ€ ์™€์„œ ์šฐ์‚ฐ์„ ์ผ๋‹ค" (It rained yesterday so I used an umbrella) (length 6)

์ด๊ฒƒ์„ BN ์„ ํ™œ์šฉํ•œ Layer output ์„ ์‚ฌ์šฉํ•œ๋‹ค๋ฉด ์ƒ˜ํ”Œ2 ์˜ 2,3 ์ƒ˜ํ”Œ1์˜ 3,4 ๊ฐ€ 0์ด ๋ฉ๋‹ˆ๋‹ค. ๊ทธ๋ ‡๊ธฐ ๋•Œ๋ฌธ์— ๋ฐ์ดํ„ฐ์˜ ์˜๋ฏธ๋ฅผ ์ถฉ๋ถ„ํžˆ ๋ฐ˜์˜ํ•˜์ง€ ๋ชปํ•˜๋Š” ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค. ์ด ๋ฌธ์ œ๋Š” ์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ์—๋„ ๊ทธ๋Œ€๋กœ ์ ์šฉ๋ฉ๋‹ˆ๋‹ค. ์ด๋ฏธ์ง€๋‚˜ ์„ฑ์ ํ†ต๊ณ„(๊ตญ์–ด๋Š” ๊ตญ์–ด๋ผ๋ฆฌ, ์ˆ˜ํ•™์€ ์ˆ˜ํ•™๋ผ๋ฆฌ) ์™€ ๊ฐ™์€ ๋ฐ์ดํ„ฐ๊ฐ€ ์•„๋‹ˆ๋ผ ํ”ผ์ณํ•˜๋‚˜๊ฐ€ ๋‹ค๋ฅธ ํ”ผ์ณ๋‚˜ ๋ฐ์ดํ„ฐ์—๋„ ์˜ํ–ฅ์„ ์ฃผ๋Š”๊ฒฝ์šฐ๋Š” Batch ์‚ฌ์ด์ฆˆ์— ์˜ํ–ฅ์„ ๋ฐ›์ง€ ์•Š๊ณ  ๋ฐ์ดํ„ฐ์˜ ์˜๋ฏธ๋ฅผ ์ž˜ ๋ฐ˜์˜ํ•  ์ˆ˜ ์žˆ๋Š” LN ์ด ์„ฑ๋Šฅ์ด ์ข‹๋‹ค๊ณ  ์ฃผ์žฅํ•ฉ๋‹ˆ๋‹ค.

 

BN ๊ณผ์˜ ์ฐจ์ด์ 

Batch Normalization์€ ๋ฏธ๋‹ˆ๋ฐฐ์น˜ ๋‹จ์œ„๋กœ ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ์„ ๊ณ„์‚ฐํ•˜์—ฌ ์ •๊ทœํ™”๋ฅผ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค. ๋ฐ˜๋ฉด **Layer Normalization(LN)**์€ ์ด๋ฆ„ ๊ทธ๋Œ€๋กœ ๋ ˆ์ด์–ด ๋‹จ์œ„, ์ •ํ™•ํžˆ๋Š” ํ•˜๋‚˜์˜ ์ƒ˜ํ”Œ ๋‚ด๋ถ€ feature๋“ค์— ๋Œ€ํ•ด์„œ๋งŒ ์ •๊ทœํ™”๋ฅผ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค. ์ฆ‰, ์ •๊ทœํ™”์˜ ๊ธฐ์ค€์ด ์™„์ „ํžˆ ๋‹ค๋ฆ…๋‹ˆ๋‹ค.

  • Batch Normalization
    • Axis for mean/variance: the batch direction
    • Uses several samples that share the same feature together
  • Layer Normalization
    • Axis for mean/variance: the feature direction
    • Computed within a single sample only (illustrated below)

For a single sample x = [x₁, x₂, ..., x_d]:

$$ \mu = \frac{1}{d} \sum_{i=1}^{d} x_i $$

$$ \sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2 $$

$$ \hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} $$

Then, just as in Batch Normalization, the scale and shift parameters are applied:

$$ y_i = \gamma_i \hat{x}_i + \beta_i $$

The important point here is that γ and β exist only along the feature dimension and are independent of the batch size.

์œ„์˜ ์ˆ˜์‹๋Œ€๋กœ ๊ฐ™์€ ์ƒ˜ํ”Œ์„ ๊ฐ€์ง€๊ณ  ๋ ˆ์ด์–ด๋ฅผ ํ†ต๊ณผํ•˜๋Š” ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

Layer Normalization Data Flow

์ž…๋ ฅ ๋ฐ์ดํ„ฐ (X)

$$ X = \begin{bmatrix} [1,\ 2] \\ [2,\ 4] \\ [3,\ 6] \end{bmatrix} $$

shape: (3, 2)

→ 3 samples, each a 2-dimensional vector


Passing through the Linear Layer

The weights and bias are set the same as before.

$$ Z = X $$


Layer Normalization ์ ์šฉ

Layer Normalization์€ ๊ฐ ์ƒ˜ํ”Œ๋งˆ๋‹ค ๋…๋ฆฝ์ ์œผ๋กœ ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.

First sample [1, 2]

$$ \mu = (1 + 2) / 2 = 1.5 $$

$$ \sigma^2 = ((1 - 1.5)^2 + (2 - 1.5)^2) / 2 = 0.25 $$

Normalized result:

$$ [1, 2] \rightarrow [-1, 1] $$


Second sample [2, 4]

$$ \mu = 3,\quad \sigma^2 = 1 $$

Normalized result:

$$ [2, 4] \rightarrow [-1, 1] $$


Third sample [3, 6]

$$ \mu = 4.5,\quad \sigma^2 = 2.25 $$

Normalized result:

$$ [3, 6] \rightarrow [-1, 1] $$


Layer Normalization result

X_hat =
[
 [-1,  1],
 [-1,  1],
 [-1,  1]
]
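The same check in NumPy (a sketch; epsilon is dropped as before):

```python
import numpy as np

X = np.array([[1., 2.], [2., 4.], [3., 6.]])

# LN: statistics are taken within each sample (axis=1), not across the batch.
mu = X.mean(axis=1, keepdims=True)   # [[1.5], [3. ], [4.5]]
var = X.var(axis=1, keepdims=True)   # [[0.25], [1.  ], [2.25]]
X_hat = (X - mu) / np.sqrt(var)      # epsilon ignored again
print(X_hat)                         # every row becomes [-1.  1.]

# gamma and beta would each have shape (2,): per feature, independent of
# how many samples are in the batch.
```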

Why Layer Normalization Suits the Transformer Better Than Batch Normalization

1. Variable Sequence Lengths and Masking

Transformer์˜ Self-Attention์€ ๊ฐ€๋ณ€ ๊ธธ์ด ์‹œํ€€์Šค๋ฅผ ์ฒ˜๋ฆฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ž…๋ ฅํ˜•ํƒœ๋Š” ๊ฐ ๋ฌธ์žฅ๋งˆ๋‹ค ๊ธธ์ด๊ฐ€ ๋‹ค๋ฅด๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค. ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์งง์€ ๋ฌธ์žฅ์—๋Š” padding์„ ์ถ”๊ฐ€ํ•˜ attention mask๋ฅผ ์‚ฌ์šฉํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

Batch Normalization์„ ์ด๋Ÿฌํ•œ ๊ตฌ์กฐ์— ์ ์šฉํ•˜๋ฉด ์‹ฌ๊ฐํ•œ ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค. BN์€ ๋ฐฐ์น˜์™€ ์‹œํ€€์Šค ์ฐจ์› ์ „์ฒด์— ๊ฑธ์ณ ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ์„ ๊ณ„์‚ฐํ•˜๋Š”๋ฐ ์œ„์—์„œ ๋ดค๋˜ ๊ฒƒ ์ฒ˜๋Ÿผ ์˜๋ฏธ ์—†๋Š” padding ํ† ํฐ์˜ 0 ๋ฒกํ„ฐ๊ฐ€ ํ†ต๊ณ„์— ํฌํ•จ๋ฉ๋‹ˆ๋‹ค. ๊ฒฐ๊ณผ์ ์œผ๋กœ ๋ฌธ์žฅ ๊ธธ์ด์— ๋”ฐ๋ผ ์ •๊ทœํ™” ํ†ต๊ณ„๊ฐ€ ์™œ๊ณก๋˜๊ณ , ๊ฐ™์€ ๋‚ด์šฉ์˜ ๋ฌธ์žฅ์ด๋ผ๋„ padding์˜ ์–‘์— ๋”ฐ๋ผ ๋‹ค๋ฅด๊ฒŒ ์ •๊ทœํ™”๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

 

Layer Normalization, in contrast, normalizes only over each token's feature dimension. Since the mean and variance are computed within a single token, padding tokens and sequence length have no effect at all on the normalization statistics. Each token is normalized independently, so the meaning of the data is faithfully reflected and normalization stays consistent regardless of batch or sequence structure.

2. Autoregressive Decoding and Batch-Size Mismatch

At inference time, the Transformer decoder operates autoregressively so that it cannot look at future information: it generates the next token one at a time, conditioned on the tokens generated so far. In this setting the batch size is usually 1, which, as the Layer Normalization paper showed, is fatal for Batch Normalization.

Layer Normalization์€ ๋ฐฐ์น˜ ํฌ๊ธฐ์™€ ๋ฌด๊ด€ํ•˜๊ฒŒ ์•ˆ์ •์ ์œผ๋กœ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค. ๋ฐฐ์น˜ ํฌ๊ธฐ๊ฐ€ 1์ด๋“  32๋“  ์ •๊ทœํ™” ๊ฒฐ๊ณผ๋Š” ์ผ๊ด€๋˜๋ฉฐ, ํ•™์Šต ์‹œ ๊ด€์ฐฐํ•œ ์„ฑ๋Šฅ์ด ์ถ”๋ก  ์‹œ์—๋„ ๊ทธ๋Œ€๋กœ ์œ ์ง€๋ฉ๋‹ˆ๋‹ค. ์ด๋Š” Transformer Decoder์˜ ์ƒ์„ฑ ํ’ˆ์งˆ์— ๊ฒฐ์ •์ ์œผ๋กœ ์ค‘์š”ํ•œ ํŠน์„ฑ์ž…๋‹ˆ๋‹ค.

3. Structural Mismatch with Residual Connections

Transformer์˜ ๊ฐ ๋ธ”๋ก์€ residual connection์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค: y = x + Sublayer(LN(x)). ์ด ๊ตฌ์กฐ๊ฐ€ ์ค‘์š”ํ•œ ์ด์œ ๋Š” gradient์˜ ํ๋ฆ„ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค. ์—ญ์ „ํŒŒ ์‹œ ∂y/∂x = 1 + ∂Sublayer/∂x ๊ฐ€ ๋˜์–ด, gradient๊ฐ€ ํ•ญ์ƒ ์ง์ ‘ ํ๋ฅผ ์ˆ˜ ์žˆ๋Š” ๊ฒฝ๋กœ(identity mapping)๊ฐ€ ๋ณด์žฅ๋ฉ๋‹ˆ๋‹ค. ์ด๋Š” ๊นŠ์€ ๋„คํŠธ์›Œํฌ์—์„œ gradient vanishing ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๋Š” ํ•ต์‹ฌ ๋ฉ”์ปค๋‹ˆ์ฆ˜์ž…๋‹ˆ๋‹ค.

If Batch Normalization were used on the residual path, its output would depend on batch statistics, injecting batch-dependent noise into the residual path. This destabilizes the gradient flow and, especially in deep Transformers, can cause gradients to explode or vanish. Indeed, the Post-LN Transformer (which applies LN after the residual addition) is known to become harder to train as layers get deeper, while the Pre-LN Transformer (which applies LN before the residual) trains more stably. BN fundamentally conflicts with these properties of the residual connection.
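A minimal sketch of a Pre-LN block (the dimensions, head count, and feed-forward shape are illustrative choices of mine, not prescribed by the papers cited):

```python
import torch.nn as nn

class PreLNBlock(nn.Module):
    """A Pre-LN Transformer block: y = x + Sublayer(LN(x))."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                nn.GELU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        # LN sits before each sublayer; the residual path itself stays an
        # identity, so gradients always have a direct route through it.
        h = self.ln1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.ff(self.ln2(x))
        return x
```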

Layer Normalization์€ ๊ฐ ์ƒ˜ํ”Œ์„ ๋…๋ฆฝ์ ์œผ๋กœ ์ •๊ทœํ™”ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋ฐฐ์น˜์— ์˜์กดํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ residual path์˜ gradient flow๋ฅผ ๋ฐฉํ•ดํ•˜์ง€ ์•Š์œผ๋ฉฐ, ์ˆ˜์‹ญ ๊ฐœ์˜ ๋ ˆ์ด์–ด๋กœ ์ด๋ฃจ์–ด์ง„ ๊นŠ์€ Transformer์—์„œ๋„ ์•ˆ์ •์ ์ธ ํ•™์Šต์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๊ตฌ์กฐ์  ์กฐํ™”๊ฐ€ Transformer๊ฐ€ Layer Normalization์„ ์‚ฌ์šฉํ•˜๋Š” ๋˜ ๋‹ค๋ฅธ ์ค‘์š”ํ•œ ์ด์œ ์ž…๋‹ˆ๋‹ค.
