
Background


While running a vLLM server, I heard that an RCE vulnerability affects versions below 0.14.0, so I patched. Along the way I came across a write-up about an earlier vulnerability where RCE could happen through model loading, and I dug into how that is even possible. It turns out that in some deployment formats and frameworks, loading a model pulls in not only the weights but Python code logic as well, so I am writing this post to organize what I learned.

 

 

CVE-2025-66448: vLLM Config Trust Bypass RCE | Miggo (www.miggo.io)

The vulnerability lies in the __init__ method of the Nemotron_Nano_VL_Config class, located in the now-removed file vllm/transformers_utils/configs/nemotron_vl.py. The commit ffb08379d8870a1a81ba82b72797f196838d0c86 addresses the vulnerability by completel…

 

๋ชจ๋ธ ๋ฐฐํฌ ํฌ๋งท

When you develop AI models, the point where more problems occur than in training itself is deployment. A trained model is not simple code: it carries weight data ranging from hundreds of MB to tens of GB along with its execution structure. The question of what form to store and deliver the model in is the starting point of model deployment formats.

์ดˆ๊ธฐ์—๋Š” ํ•™์Šตํ•œ ํ”„๋ ˆ์ž„์›Œํฌ ๋‚ด๋ถ€์—์„œ๋งŒ ๋ชจ๋ธ์„ ์‚ฌ์šฉํ–ˆ๊ธฐ ๋•Œ๋ฌธ์—, ๋‹จ์ˆœํžˆ ๋ฉ”๋ชจ๋ฆฌ ๊ฐ์ฒด๋ฅผ ๊ทธ๋Œ€๋กœ ์ง๋ ฌํ™”ํ•˜๋Š” ๋ฐฉ์‹์ด ์‚ฌ์šฉ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ ๋ชจ๋ธ์ด ์ปค์ง€๊ณ , ํ˜‘์—…๊ณผ ์™ธ๋ถ€ ๊ณต์œ ๊ฐ€ ๋Š˜์–ด๋‚˜๋ฉด์„œ ์ž์—ฐ์Šค๋Ÿฌ์šด ์š”๊ตฌ์‚ฌํ•ญ์ด ๋“ฑ์žฅํ–ˆ์Šต๋‹ˆ๋‹ค. ๊ฐ€์žฅ ํฐ ๊ฒƒ์€ ๋‹ค๋ฅธ ํ™˜๊ฒฝ์—์„œ๋„ ๋™์ผํ•˜๊ฒŒ ๋ชจ๋ธ์„ ๋กœ๋“œํ•  ์ˆ˜ ์žˆ์–ด์•ผ ํ•œ๋‹ค๋Š” ๊ฒƒ์ธ๋ฐ์š”, ๋ชจ๋ธ์„ ๋งŒ๋“ค๊ณ  ํ•™์Šต์‹œํ‚ค๋Š” ๊ฒƒ์€ ์ „์ฒด ํŒŒ์ดํ”„๋ผ์ธ์„ ๊ตฌ์„ฑํ•˜์ง€ ์•Š๋Š” ํ•œ ๊ทธ๋‹ค์ง€ ๋ฌธ์ œ๊ฐ€ ๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค๋งŒ, ์ถ”๋ก ์„ ํ•  ๋•Œ์—๋Š” ์ด์‹์„ฑ์ด ์ค‘์š”ํ•˜๊ฒŒ ์—ฌ๊ฒจ์กŒ์Šต๋‹ˆ๋‹ค. ๊ทธ๋ž˜์„œ ๋ชจ๋ธ ํŒŒ์ผ๋งŒ export ํ•˜๊ฒŒ ๋˜์—ˆ๊ณ , ์ด๋Ÿฐ ์š”๊ตฌ์‚ฌํ•ญ๋“ค์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด์„œ ์—ฌ๋Ÿฌ๊ฐ€์ง€ ๋ชจ๋ธ ๋ฐฐํฌ ํฌ๋งท์ด ๋“ฑ์žฅํ•˜๊ฒŒ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

PyTorch .pt / .pth

Pytorch ์˜ ๋ชจ๋ธ ์ €์žฅ ๋ฐฉ์‹์€ Python ๊ฐ์ฒด๋ฅผ ๊ทธ๋Œ€๋กœ ๋ฐ์ดํ„ฐ๋กœ ๋งŒ๋“œ๋Š” ๊ฒƒ์ธ๋ฐ ์ด๊ฒƒ์„ ์ง๋ ฌํ™”๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค. ์ด ํฌ๋งท๋„ ๋‹ค๋ฅธ ํฌ๋งท๋“ค๊ณผ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ๋ชจ๋ธ ์žฌํ˜„์„ฑ์˜ ์š”๊ตฌ์‚ฌํ•ญ์„ ํ•ด๊ฒฐํ–ˆ๊ธฐ ๋•Œ๋ฌธ์— Research Level ์—์„œ๋Š” ํŽธํ•˜๊ฒŒ ์‚ฌ์šฉ๋  ์ˆ˜ ์žˆ์ง€๋งŒ, ๋‚ด๋ถ€์ ์œผ๋กœ pickle ์„ ์‚ฌ์šฉํ•˜๊ณ , ์ฝ”๋“œ๋‚˜ ๋ฐ์ดํ„ฐ ์ž์ฒด๋ฅผ ๋ชจ๋‘ ์ง๋ ฌํ™” ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ํ•ด๋‹น ๊ฐ์ฒด๋ฅผ ๋กœ๋“œํ•˜๋Š” ๊ฒฝ์šฐ RCE๊ฐ€ ๊ฐ€๋Šฅํ•˜๋‹ค๋Š” ์น˜๋ช…์ ์ธ ๋ฌธ์ œ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.

python ๊ณต์‹œ๋ฌธ์„œ์—์„œ pickle ์€ ์ง๋ ฌํ™”์™€ ์—ญ์ง๋ ฌํ™”๋ฅผ ์œ„ํ•œ ๋ชจ๋“ˆ์ด๋ผ๊ณ  ๋‚˜์™€์žˆ์Šต๋‹ˆ๋‹ค. ๋‹ค๋ฅธ ์˜ˆ์‹œ๋กœ ์‚ฌ์šฉ๋˜๋Š” ๊ฒƒ๋“ค๋„ ๋‚˜์ค‘์— ํ•œ๋ฒˆ ์ฐพ์•„๋ณผ๋ฒ• ํ•œ ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค.

pickle — Python object serialization

So in deployment environments it is best to avoid PyTorch's pickle-based model format.

PyTorch serializes when saving a model's structure as below; you can also save only the parameters.

import torch

# serialize the whole model object (pickle-based; unsafe with untrusted files)
torch.save(model, 'model.pth')
model = torch.load('model.pth')

# serialize only the model parameters (state_dict)
torch.save(model.state_dict(), 'model.pth')
model.load_state_dict(torch.load('model.pth'))

๋ฐœ์ƒ๊ฐ€๋Šฅํ•œ ์ทจ์•ฝ์ 

import torch.nn as nn
import torch.nn.functional as F

# Define model
class TheModelClass(nn.Module):
    def __init__(self):
        super(TheModelClass, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

# Initialize model
model = TheModelClass()

If you have a model like the one above, TheModelClass is serialized at the moment torch.save is called. So if the functions inside the class contain code with some other purpose, it runs as-is at the moment torch.load() is called. This is the problem with calling save on the full model instead of on model.state_dict(). That is why PyTorch's recommendation is to save only the parameters via torch.save(model.state_dict(), 'model.pth').
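The danger above can be demonstrated with plain pickle, no PyTorch needed. This is a minimal sketch of the mechanism, with eval standing in for a real attacker payload such as os.system:

```python
import pickle

# Minimal sketch (not vLLM's actual exploit) of the problem described above:
# unpickling invokes whatever callable __reduce__ returns, so torch.load on an
# untrusted full-model .pth file is equivalent to running the attacker's code.
class Malicious:
    def __reduce__(self):
        # a real payload would be (os.system, ("curl ... | sh",)); eval is
        # used here only to show that an arbitrary callable runs on load
        return (eval, ("6 * 7",))

blob = pickle.dumps(Malicious())
result = pickle.loads(blob)  # the callable fires here, before any model is used
print(result)                # → 42
```

Note the callable runs during deserialization itself; there is no point at which a loader could inspect the "model" first.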

Hugging Face .safetensors

safetensors is a format for saving and loading weights quickly. It does not contain the vulnerability surface seen with other formats — in particular the Python object storage and executable structures that come with pickle. A safetensors file consists of a header and a data block.

ํ—ค๋”๋Š” JSON ํ˜•์‹์œผ๋กœ ๋œ ํ…์„œ๋“ค์˜ ๋ฉ”ํƒ€๋ฐ์ดํ„ฐ์ด๊ณ , ๋ฐ์ดํ„ฐ๋ธ”๋ก์€ weight๋“ค์ด ์กด์žฌํ•˜๋Š” ๋ฐ”์ด๋„ˆ๋ฆฌ ํ˜•ํƒœ์ž…๋‹ˆ๋‹ค. ์‹ค์ œ๋กœ safetensors ๋ฅผ ์—ด์–ด์„œ ํ™•์ธํ•ด๋ณผ ์ˆ˜ ์žˆ๋Š”๋ฐ์š”

https://huggingface.co/Qwen/Qwen3-ASR-1.7B/tree/main

 


 

The second safetensors shard there is fairly small among the ones I found, so it is a good one to test with.

from safetensors import safe_open

# path to a downloaded .safetensors shard (placeholder filename)
safetensors_file = "model.safetensors"
with safe_open(safetensors_file, framework="pt") as f:
  tensor_name = f.keys()
  print(f"tensor list {tensor_name}")

  for key in tensor_name:
    tensor = f.get_tensor(key)
    print(f"tensor name {key} dtype : {tensor.dtype}")
    print(f"tensor name {key} shape : {tensor.shape}")
tensor list ['thinker.model.layers.5.mlp.gate_proj.weight', 'thinker.model.layers.5.mlp.up_proj.weight', 'thinker.model.layers.5.post_attention_layernorm.weight', 'thinker.model.layers.5.self_attn.k_norm.weight', 'thinker.model.layers.5.self_attn.k_proj.weight', 'thinker.model.layers.5.self_attn.o_proj.weight', 'thinker.model.layers.5.self_attn.q_norm.weight', 'thinker.model.layers.5.self_attn.q_proj.weight', 'thinker.model.layers.5.self_attn.v_proj.weight', 'thinker.model.layers.6.input_layernorm.weight', 'thinker.model.layers.6.mlp.down_proj.weight', 'thinker.model.layers.6.mlp.gate_proj.weight', 'thinker.model.layers.6.mlp.up_proj.weight', 'thinker.model.layers.6.post_attention_layernorm.weight', 'thinker.model.layers.6.self_attn.k_norm.weight', 'thinker.model.layers.6.self_attn.k_proj.weight', 'thinker.model.layers.6.self_attn.o_proj.weight', 'thinker.model.layers.6.self_attn.q_norm.weight', 'thinker.model.layers.6.self_attn.q_proj.weight', 'thinker.model.layers.6.self_attn.v_proj.weight', 'thinker.model.layers.7.input_layernorm.weight', 'thinker.model.layers.7.mlp.down_proj.weight', 'thinker.model.layers.7.mlp.gate_proj.weight', 'thinker.model.layers.7.mlp.up_proj.weight', 'thinker.model.layers.7.post_attention_layernorm.weight', 'thinker.model.layers.7.self_attn.k_norm.weight', 'thinker.model.layers.7.self_attn.k_proj.weight', 'thinker.model.layers.7.self_attn.o_proj.weight', 'thinker.model.layers.7.self_attn.q_norm.weight', 'thinker.model.layers.7.self_attn.q_proj.weight', 'thinker.model.layers.7.self_attn.v_proj.weight', 'thinker.model.layers.8.input_layernorm.weight', 'thinker.model.layers.8.mlp.down_proj.weight', 'thinker.model.layers.8.mlp.gate_proj.weight', 'thinker.model.layers.8.mlp.up_proj.weight', 'thinker.model.layers.8.post_attention_layernorm.weight', 'thinker.model.layers.8.self_attn.k_norm.weight', 'thinker.model.layers.8.self_attn.k_proj.weight', 'thinker.model.layers.8.self_attn.o_proj.weight', 
'thinker.model.layers.8.self_attn.q_norm.weight', 'thinker.model.layers.8.self_attn.q_proj.weight', 'thinker.model.layers.8.self_attn.v_proj.weight', 'thinker.model.layers.9.input_layernorm.weight', 'thinker.model.layers.9.mlp.down_proj.weight', 'thinker.model.layers.9.mlp.gate_proj.weight', 'thinker.model.layers.9.mlp.up_proj.weight', 'thinker.model.layers.9.post_attention_layernorm.weight', 'thinker.model.layers.9.self_attn.k_norm.weight', 'thinker.model.layers.9.self_attn.k_proj.weight', 'thinker.model.layers.9.self_attn.o_proj.weight', 'thinker.model.layers.9.self_attn.q_norm.weight', 'thinker.model.layers.9.self_attn.q_proj.weight', 'thinker.model.layers.9.self_attn.v_proj.weight', 'thinker.model.norm.weight']
tensor name thinker.model.layers.5.mlp.gate_proj.weight ์˜ ๋ฐ์ดํ„ฐํƒ€์ž… : torch.bfloat16
tensor name thinker.model.layers.5.mlp.gate_proj.weight ์˜ shape : torch.Size([6144, 2048])
tensor name thinker.model.layers.5.mlp.up_proj.weight ์˜ ๋ฐ์ดํ„ฐํƒ€์ž… : torch.bfloat16
tensor name thinker.model.layers.5.mlp.up_proj.weight ์˜ shape : torch.Size([6144, 2048])
tensor name thinker.model.layers.5.post_attention_layernorm.weight ์˜ ๋ฐ์ดํ„ฐํƒ€์ž… : torch.bfloat16
tensor name thinker.model.layers.5.post_attention_layernorm.weight ์˜ shape : torch.Size([2048])
tensor name thinker.model.layers.5.self_attn.k_norm.weight ์˜ ๋ฐ์ดํ„ฐํƒ€์ž… : torch.bfloat16
tensor name thinker.model.layers.5.self_attn.k_norm.weight ์˜ shape : torch.Size([128])
tensor name thinker.model.layers.5.self_attn.k_proj.weight ์˜ ๋ฐ์ดํ„ฐํƒ€์ž… : torch.bfloat16
tensor name thinker.model.layers.5.self_attn.k_proj.weight ์˜ shape : torch.Size([1024, 2048])
tensor name thinker.model.layers.5.self_attn.o_proj.weight ์˜ ๋ฐ์ดํ„ฐํƒ€์ž… : torch.bfloat16
tensor name thinker.model.layers.5.self_attn.o_proj.weight ์˜ shape : torch.Size([2048, 2048])
tensor name thinker.model.layers.5.self_attn.q_norm.weight ์˜ ๋ฐ์ดํ„ฐํƒ€์ž… : torch.bfloat16
tensor name thinker.model.layers.5.self_attn.q_norm.weight ์˜ shape : torch.Size([128])
tensor name thinker.model.layers.5.self_attn.q_proj.weight ์˜ ๋ฐ์ดํ„ฐํƒ€์ž… : torch.bfloat16
tensor name thinker.model.layers.5.self_attn.q_proj.weight ์˜ shape : torch.Size([2048, 2048])
tensor name thinker.model.layers.5.self_attn.v_proj.weight ์˜ ๋ฐ์ดํ„ฐํƒ€์ž… : torch.bfloat16
tensor name thinker.model.layers.5.self_attn.v_proj.weight ์˜ shape : torch.Size([1024, 2048])
tensor name thinker.model.layers.6.input_layernorm.weight ์˜ ๋ฐ์ดํ„ฐํƒ€์ž… : torch.bfloat16
tensor name thinker.model.layers.6.input_layernorm.weight ์˜ shape : torch.Size([2048])
tensor name thinker.model.layers.6.mlp.down_proj.weight ์˜ ๋ฐ์ดํ„ฐํƒ€์ž… : torch.bfloat16
tensor name thinker.model.layers.6.mlp.down_proj.weight ์˜ shape : torch.Size([2048, 6144])
tensor name thinker.model.layers.6.mlp.gate_proj.weight ์˜ ๋ฐ์ดํ„ฐํƒ€์ž… : torch.bfloat16
tensor name thinker.model.layers.6.mlp.gate_proj.weight ์˜ shape : torch.Size([6144, 2048])
tensor name thinker.model.layers.6.mlp.up_proj.weight ์˜ ๋ฐ์ดํ„ฐํƒ€์ž… : torch.bfloat16
tensor name thinker.model.layers.6.mlp.up_proj.weight ์˜ shape : torch.Size([6144, 2048])
tensor name thinker.model.layers.6.post_attention_layernorm.weight ์˜ ๋ฐ์ดํ„ฐํƒ€์ž… : torch.bfloat16

You can see that it holds only data about the weights. Since safetensors was designed to block RCE at the source, no vulnerabilities have been found in the safetensors model format itself.
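The header-plus-data-block layout above can be reproduced by hand to show why there is nothing for a loader to execute. This is a sketch following the public safetensors spec; the single-tensor file below is made up for illustration:

```python
import json
import struct

# Hand-built miniature of the safetensors layout: an 8-byte little-endian
# header length, a JSON header mapping tensor names to dtype/shape/offsets,
# then the raw tensor bytes (offsets are relative to the end of the header).
header = {"weight": {"dtype": "F32", "shape": [2, 2], "data_offsets": [0, 16]}}
header_bytes = json.dumps(header).encode("utf-8")
data = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)  # four float32 values = 16 bytes
blob = struct.pack("<Q", len(header_bytes)) + header_bytes + data

# Parsing it back is pure data handling: length, JSON, byte slices.
# There is no code object anywhere for a loader to execute.
(hlen,) = struct.unpack("<Q", blob[:8])
meta = json.loads(blob[8:8 + hlen])
start, end = meta["weight"]["data_offsets"]
values = struct.unpack("<4f", blob[8 + hlen + start:8 + hlen + end])
print(meta["weight"]["shape"], values)  # [2, 2] (1.0, 2.0, 3.0, 4.0)
```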

Microsoft ONNX (Open Neural Network Exchange)

ONNX is an open-source format designed to unify models across the many machine learning frameworks. With ONNX, developers can build in PyTorch, TensorFlow, or another framework and still easily convert the model for use in a different one. It too was developed in the spirit of making deployment smooth.

import torch
import torchvision.models as models
import onnx

# load a pretrained PyTorch model
model = models.resnet18(pretrained=True)
model.eval()

# create dummy input data
x = torch.randn(1, 3, 224, 224, requires_grad=True)

# export the model to ONNX format
torch.onnx.export(model,               # model to run
                  x,                   # model input (a tuple works for multiple inputs)
                  "resnet18.onnx",     # name of the saved model file
                  export_params=True,  # store the trained weights inside the model file
                  opset_version=10,    # ONNX opset version to target
                  do_constant_folding=True,  # optimization: perform constant folding
                  input_names=['input'],     # names for the model inputs
                  output_names=['output'],   # names for the model outputs
                  dynamic_axes={'input': {0: 'batch_size'},    # input axis that varies with batch size
                                'output': {0: 'batch_size'}})  # output axis that varies with batch size

Possible ONNX vulnerabilities

Until recently, among the vulnerabilities reported around ONNX there were almost none in ONNX itself, and no RCE at all; even those were, strictly speaking, C/C++ runtime-class issues. The recently published Path Traversal vulnerability is likewise said to be a flaw not in the ONNX format but in a library implementation that processes ONNX models.
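The class of bug behind that CVE can be sketched in a few lines. Note this is an assumed, generic shape of a path-traversal flaw, not the actual onnx source code:

```python
import os

# Sketch of the path-traversal bug class behind CVE-2025-51480 (assumed shape):
# an attacker-controlled location string is joined to a base directory without
# checking that the resolved result stays inside it.
def unsafe_resolve(base_dir, location):
    return os.path.normpath(os.path.join(base_dir, location))

def safe_resolve(base_dir, location):
    target = os.path.normpath(os.path.join(base_dir, location))
    # reject any result that escapes the base directory
    if not os.path.abspath(target).startswith(os.path.abspath(base_dir) + os.sep):
        raise ValueError(f"traversal attempt: {location!r}")
    return target

print(unsafe_resolve("/models", "../../etc/passwd"))  # escapes to /etc/passwd
```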

 

 

ONNX Path Traversal Vulnerability Exploited | Matt T. on LinkedIn (www.linkedin.com)

CVE-2025-51480: Path Traversal vulnerability in onnx.external_data_helper.save_external_data in ONNX 1.17.0 allows attackers to overwrite arbitrary files by supplying crafted external_data.location paths containing traversal sequences, bypassing intended di…

 

GGUF / GGML

GGML (Georgi Gerganov Machine Learning Format)

GGML์€ Georgi Gerganov๊ฐ€ ๊ฐœ๋ฐœํ•œ ๊ฒฝ๋Ÿ‰ ๋จธ์‹ ๋Ÿฌ๋‹ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋กœ, ๋Œ€๊ทœ๋ชจ ์–ธ์–ด ๋ชจ๋ธ์„ ํฌํ•จํ•œ ์‹ ๊ฒฝ๋ง ๋ชจ๋ธ์„ CPU ํ™˜๊ฒฝ์—์„œ ํšจ์œจ์ ์œผ๋กœ ์ถ”๋ก ํ•˜๊ธฐ ์œ„ํ•ด ์„ค๊ณ„๋œ C/C++ ๊ธฐ๋ฐ˜ ํ”„๋กœ์ ํŠธ์ž…๋‹ˆ๋‹ค. Hugging Face์˜ ์†Œ๊ฐœ ๊ธ€์—์„œ๋„ ๊ฐ•์กฐํ•˜๋“ฏ, GGML์€ ๊ธฐ์กด ๋”ฅ๋Ÿฌ๋‹ ํ”„๋ ˆ์ž„์›Œํฌ๊ฐ€ ๊ฐ–๋Š” ๋ณต์žก์„ฑ๊ณผ ๋ฌด๊ฑฐ์šด ์˜์กด์„ฑ์„ ์ตœ์†Œํ™”ํ•˜๋Š” ๊ฒƒ์„ ๋ชฉํ‘œ๋กœ ๋งŒ๋“ค์–ด์กŒ์Šต๋‹ˆ๋‹ค.

์ผ๋ฐ˜์ ์ธ ๋จธ์‹ ๋Ÿฌ๋‹ ํ”„๋ ˆ์ž„์›Œํฌ์ธ PyTorch๋‚˜ TensorFlow๋Š” ๋งค์šฐ ๊ฐ•๋ ฅํ•˜์ง€๋งŒ, ๋Œ€๊ทœ๋ชจ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ์˜์กด์„ฑ๊ณผ ๋ณต์žกํ•œ ๋นŒ๋“œ ํ™˜๊ฒฝ์„ ์š”๊ตฌํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” ์„œ๋ฒ„ ํ™˜๊ฒฝ์—์„œ๋Š” ๋ฌธ์ œ๊ฐ€ ๋˜์ง€ ์•Š์„ ์ˆ˜ ์žˆ์ง€๋งŒ, ๊ฐœ์ธ PC๋‚˜ ๋‚ด๋ถ€๋ง, ์˜คํ”„๋ผ์ธ ํ™˜๊ฒฝ, ํ˜น์€ ๋ฆฌ์†Œ์Šค๊ฐ€ ์ œํ•œ๋œ ์‹œ์Šคํ…œ์—์„œ๋Š” ๋ถ€๋‹ด์œผ๋กœ ์ž‘์šฉํ•ฉ๋‹ˆ๋‹ค. GGML์€ ์ด๋Ÿฌํ•œ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์™ธ๋ถ€ ์˜์กด์„ฑ์„ ๊ฑฐ์˜ ๊ฐ–์ง€ ์•Š๋Š” ๊ตฌ์กฐ, ๊ทธ๋ฆฌ๊ณ  ๋‹จ์ˆœํ•œ C ์ฝ”๋“œ ๊ธฐ๋ฐ˜ ๊ตฌํ˜„์„ ์„ ํƒํ–ˆ์Šต๋‹ˆ๋‹ค.

GGML์˜ ํ•ต์‹ฌ ์ฒ ํ•™์€ “์ž‘๊ณ , ๋‹จ์ˆœํ•˜๋ฉฐ, ์˜ˆ์ธก ๊ฐ€๋Šฅํ•œ ์‹คํ–‰”์ž…๋‹ˆ๋‹ค. ์‹ค์ œ๋กœ GGML์€ ๋ช‡ ๊ฐœ์˜ ์†Œ์Šค ํŒŒ์ผ๋งŒ์œผ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์œผ๋ฉฐ, ์ปดํŒŒ์ผ๋œ ๋ฐ”์ด๋„ˆ๋ฆฌ ํฌ๊ธฐ ์—ญ์‹œ ๋งค์šฐ ์ž‘์Šต๋‹ˆ๋‹ค. ๋ณ„๋„์˜ Python ๋Ÿฐํƒ€์ž„์ด๋‚˜ ๋Œ€ํ˜• ํ”„๋ ˆ์ž„์›Œํฌ ์—†์ด๋„ ๋ชจ๋ธ์„ ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์—, ํ™˜๊ฒฝ ์ด์‹์„ฑ์ด ๋งค์šฐ ๋›ฐ์–ด๋‚ฉ๋‹ˆ๋‹ค. Linux, macOS, Windows๋Š” ๋ฌผ๋ก ์ด๊ณ  ARM ์•„ํ‚คํ…์ฒ˜๋‚˜ Apple Silicon ํ™˜๊ฒฝ์—์„œ๋„ ๋น„๊ต์  ์‰ฝ๊ฒŒ ๋นŒ๋“œํ•˜๊ณ  ์‹คํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๋˜ ํ•˜๋‚˜์˜ ์ค‘์š”ํ•œ ํŠน์ง•์€ ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ฑ์ž…๋‹ˆ๋‹ค. GGML์€ ํ…์„œ ํ‘œํ˜„๊ณผ ์—ฐ์‚ฐ์—์„œ ๋ถˆํ•„์š”ํ•œ ์˜ค๋ฒ„ํ—ค๋“œ๋ฅผ ์ œ๊ฑฐํ•˜๊ณ , CPU ์บ์‹œ ์นœํ™”์ ์ธ ๋ฉ”๋ชจ๋ฆฌ ๋ ˆ์ด์•„์›ƒ์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ํŠนํžˆ GGML์ด ๋„๋ฆฌ ์ฃผ๋ชฉ๋ฐ›๊ฒŒ ๋œ ์ด์œ  ์ค‘ ํ•˜๋‚˜๋Š” ๊ฐ•๋ ฅํ•œ ์–‘์žํ™”(quantization) ์ง€์›์ž…๋‹ˆ๋‹ค. float32 ๊ธฐ๋ฐ˜ ๋ชจ๋ธ์„ int8, int5, int4 ์ˆ˜์ค€์œผ๋กœ ์••์ถ•ํ•ด ๋ฉ”๋ชจ๋ฆฌ ์‚ฌ์šฉ๋Ÿ‰์„ ํฌ๊ฒŒ ์ค„์ด๋ฉด์„œ๋„, ์ถ”๋ก  ์„ฑ๋Šฅ์„ ์‹ค์šฉ์ ์ธ ์ˆ˜์ค€์œผ๋กœ ์œ ์ง€ํ•  ์ˆ˜ ์žˆ๋„๋ก ์„ค๊ณ„๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

Thanks to these traits, GGML is used as an inference-focused library rather than for training. The goal is to run already-trained models fast with as few resources as possible, and projects like llama.cpp, whisper.cpp, GPT4All, LM Studio, and Ollama use GGML as their low-level compute engine. In that role GGML is less a simple model format than a low-level runtime responsible for model execution.

๊ตฌ์กฐ์ ์œผ๋กœ ๋ณด๋ฉด GGML์€ ๋‚ด๋ถ€์— ํ…์„œ์™€ ์—ฐ์‚ฐ ๊ทธ๋ž˜ํ”„๋ฅผ ๊ด€๋ฆฌํ•˜๋Š” context๋ฅผ ๋‘๊ณ , ์—ฐ์‚ฐ ๊ทธ๋ž˜ํ”„๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๊ณ„์‚ฐ์„ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ CPU, CUDA, Metal ๋“ฑ ๋‹ค์–‘ํ•œ ๋ฐฑ์—”๋“œ๋ฅผ ์ง€์›ํ•  ์ˆ˜ ์žˆ๋„๋ก ์„ค๊ณ„๋˜์–ด ์žˆ์œผ๋ฉฐ, ๋ฐฑ์—”๋“œ๋ณ„๋กœ ๋ฉ”๋ชจ๋ฆฌ ํ• ๋‹น๊ณผ ์—ฐ์‚ฐ ์Šค์ผ€์ค„๋ง์„ ๋ถ„๋ฆฌํ•ด ๊ด€๋ฆฌํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๊ตฌ์กฐ ๋•๋ถ„์— ๊ฐ€๋ณ์ง€๋งŒ ๋‹จ์ˆœํ•œ ์ˆ˜์ค€์„ ๋„˜๋Š” ์œ ์—ฐ์„ฑ์„ ํ™•๋ณดํ•  ์ˆ˜ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค.

GGML also has limits alongside these strengths. As a C/C++ library it is harder to use, which can be a barrier to entry for users accustomed to Python-based frameworks. Its model metadata representation is limited, and managing extra information like tokenizers, special tokens, or RoPE settings alongside the weights was inconvenient. These limits became more and more of a problem as models grew complex.

Against this backdrop, GGML gradually evolved into GGUF (GGML Unified Format). GGUF is a format designed to keep GGML's philosophy while carrying the metadata needed to run a model in a clearer and more extensible way. The llama.cpp ecosystem now recommends GGUF over GGML, and GGML is gradually moving into the position of a legacy format.

์ •๋ฆฌํ•˜์ž๋ฉด, GGML์€ “๋ชจ๋ธ์„ ์•ˆ์ „ํ•˜๊ฒŒ ์ €์žฅํ•œ๋‹ค”๋Š” ๋ฐฐํฌ ํฌ๋งท์˜ ๊ฐœ๋…๋ณด๋‹ค๋Š”, “๋ชจ๋ธ์„ ๊ฐ€๋ณ๊ณ  ํšจ์œจ์ ์œผ๋กœ ์‹คํ–‰ํ•œ๋‹ค”๋Š” ๋ชฉ์ ์— ์ถฉ์‹คํ•œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์ž…๋‹ˆ๋‹ค. Python ๊ฐ์ฒด ์ง๋ ฌํ™”๋‚˜ ์‹คํ–‰ ๊ฐ€๋Šฅํ•œ ์ฝ”๋“œ ๋กœ๋”ฉ๊ณผ๋Š” ๊ฑฐ๋ฆฌ๊ฐ€ ๋ฉ€๊ธฐ ๋•Œ๋ฌธ์—, ๊ตฌ์กฐ์ ์œผ๋กœ RCE์™€ ๊ฐ™์€ ์ทจ์•ฝ์ ๊ณผ๋„ ๋ฌด๊ด€ํ•œ ํŽธ์ž…๋‹ˆ๋‹ค. ๋‹ค๋งŒ ๋‹ค๋ฅธ ๋ชจ๋“  ์‹คํ–‰ ์—”์ง„๊ณผ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ, ์ตœ์ข…์ ์ธ ์•ˆ์ •์„ฑ๊ณผ ๋ณด์•ˆ์„ฑ์€ ๋Ÿฐํƒ€์ž„ ๊ตฌํ˜„๊ณผ ์šด์˜ ๋ฐฉ์‹์— ์˜ํ•ด ๊ฒฐ์ •๋œ๋‹ค๋Š” ์ ์€ ๋™์ผํ•˜๊ฒŒ ์ ์šฉ๋ฉ๋‹ˆ๋‹ค.

GGUF (GGML Unified Format)

GGUF is an improved format built on GGML. As the name suggests, it aims for a "Unified" form, carrying more metadata and better extensibility. It even defines a naming convention,

<BaseName><SizeLabel><FineTune><Version><Encoding><Type><Shard>.gguf, and the file structure was improved so that more metadata can be included.

There is far too much to say about GGUF, so I will cover it separately. The takeaway: GGML is a deployment format specialized for serving transformer models, and GGUF is the format that refines its manageability.

Vulnerability potential in GGML / GGUF

GGML ์ด๋‚˜ GGUF ๋‘˜๋‹ค Python ๊ฐ์ฒด๋ฅผ ํฌํ•จํ•˜์ง€ ์•Š๊ณ  ๊ฐ™์€ ์˜๋ฏธ๋กœ pickle ์ด๋‚˜ ์–ด๋–ค ์Šคํฌ๋ฆฝํŠธ๋ฅผ ํฌํ•จํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ๊ทธ๋ž˜์„œ ๋ชจ๋ธ ์ž์ฒด๊ฐ€ ์ฝ”๋“œ๋ฅผ ์‹คํ–‰์‹œํ‚จ๋‹ค๋˜์ง€์˜ ์ทจ์•ฝ์ ์€ ๋ฐœ์ƒํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ์•Œ์•„๋ณด๋‹ค ๋ณด๋‹ˆ ์ •๋ง ๋„ˆ๋ฌด ๋งŽ์€ ํ”„๋ ˆ์ž„์›Œํฌ๋“ค์ด ์žˆ๋”๋ผ๊ตฌ์š”, ๊ทธ๋ž˜์„œ GPT ์—๊ฒŒ ์ •๋ฆฌ๋ฅผ ์ข€ ํ•ด๋‹ฌ๋ผ ํ–ˆ๋”๋‹ˆ ์–ด๋””์„œ ์‚ฌ์šฉํ•˜๊ณ  ์žˆ๋Š”์ง€๋„ ๋ชจ๋ฅด๋Š” ๋…€์„๋“ค๊นŒ์ง€ ๊ฐ€์ ธ๋‹ค ์ •๋ฆฌ๋ฅผ ํ–ˆ๋„ค์š”,

| Format | Main use | Contents | Code execution | Security risk | Pros | Cons | Recommendation |
|---|---|---|---|---|---|---|---|
| safetensors | HF, internal nets, secure environments | pure tensor weights | ❌ none | ⭐ very low | no pickle, fast mmap, safe | weights only | ✅ strongly recommended |
| PyTorch .pt / .pth | research/development | Python objects + weights | 🔥 possible | 🔥🔥🔥 | flexible saving | pickle-based RCE | ❌ do not deploy |
| HF .bin (pytorch_model.bin) | older HF versions | pickled weights | 🔥 possible | 🔥🔥🔥 | compatibility | effectively .pt | ❌ |
| ONNX .onnx | inference/serving | static graph + weights | ❌ | ⭐ low | framework-independent, fast | limited dynamic structure | ✅ for inference |
| TorchScript .ts / .pt | PyTorch serving | IR graph + weights | ⚠️ limited | ⚠️ medium | Python removed | hard to debug | ⚠️ limited |
| TensorFlow SavedModel | TF serving | graph + weights | ❌ | ⭐ low | best fit for TF Serving | TF lock-in | ⚠️ |
| HDF5 .h5 | Keras | weights + structure | ❌ | ⭐ low | simple | limits for large models | ⚠️ |
| GGUF / GGML | llama.cpp | quantized weights | ❌ | ⭐ low | CPU friendly | no training | ✅ local use |
| MLflow model | MLOps | model + metadata + code | 🔥 possible | 🔥🔥 | easy management | includes code | ⚠️ verify first |
| Triton model repo | NVIDIA Triton | model + config | ❌ | ⭐ low | high-performance serving | complex configuration | ✅ |
| Docker image | deployment | model + code + OS | 🔥🔥🔥 | 🔥🔥🔥 | reproducibility | large attack surface | ⚠️ internal verification |
| HF repo (full) | sharing | weights + Python | 🔥🔥🔥 | 🔥🔥🔥 | convenience | trust_remote_code | ❌ unverified |
| LoRA / Adapter | fine-tuning | weight deltas | ❌ | ⭐ low | lightweight | needs a base model | ✅ |

๊ทธ๋ž˜์„œ ๊ฒฐ๋ก ์€ ๋ชจ๋ธ์€ ์—ฌ๋Ÿฌ ์š”๊ตฌ์‚ฌํ•ญ๋“ค์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด์„œ ํ†ตํ•ฉ๋œ ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ์‚ฌ์šฉํ–ˆ๊ณ , ๊ทธ๊ณณ์—์„œ ๋ฐœ์ƒํ•˜๋Š” ์ทจ์•ฝ์ ์€ ๋Œ€์ฒด๋กœ pickle ์˜ ์ง๋ ฌํ™”๋ฅผ ์‚ฌ์šฉํ•ด์„œ ๊ธฐ๋Œ€๋˜๋Š” ๋ฌธ์ œ์ ๋“ค์ด์˜€์Šต๋‹ˆ๋‹ค.

So as long as pickle serialization is avoided, fatal problems like RCE should not originate from the model itself. But vulnerabilities in model runtime frameworks are an entirely different area, so keep them in mind when using one.

If any of this is wrong, please let me know!


 

 

 

Batch Normalization

https://arxiv.org/pdf/1502.03167

 

 

Background

batch normalizaion ์€ 2015๋…„์— ์ œ์‹œ๋œ ICS(Internal Covariate Shift) ๋ฌธ์ œ๋ฅผ ์ค„์ผ ์ˆ˜ ์žˆ๋Š” ์•„์ด๋””์–ด์ž…๋‹ˆ๋‹ค. covariate shift ๋Š” ํ•™์Šต ๋•Œ ํ™œ์šฉํ•œ ๋ฐ์ดํ„ฐ๊ฐ€ ์‹ค์ œ ์ถ”๋ก ์— ์‚ฌ์šฉ๋˜๋Š” ๋ฐ์ดํ„ฐ๊ฐ„์˜ ๋ถ„ํฌ๊ฐ€ ๋‹ค๋ฅด๋ฉด ์ถ”๋ก  ์„ฑ๋Šฅ์— ์•…์˜ํ–ฅ์„ ๋ฏธ์น  ์ˆ˜ ์žˆ๋‹ค๋ผ๋Š” ์ฃผ์žฅ์ธ๋ฐ ์ด๊ฒŒ ์‹ ๊ฒฝ๋ง ๋‚ด๋ถ€์—์„œ๋„ ๋ฐœ์ƒํ•  ๊ฒƒ์ด๋‹ค ๋ผ๋Š” ์ฃผ์žฅ์„ ํ•˜๋ฉฐ ์ƒ๊ธด์šฉ์–ด๊ฐ€ Internal Covariate Shift ๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค. ์•„๋ž˜ ์‚ฌ์ง„์„ ๋ณด๋ฉด ์ง๊ด€์ ์œผ๋กœ ์ดํ•ด๊ฐ€ ๋  ๊ฒƒ ๊ฐ™์Šต๋‹ˆ๋‹ค. ์‹ ๊ฒฝ๋ง์„ ํ†ต๊ณผํ•˜๋ฉด์„œ ๋ฐ์ดํ„ฐ์˜ ๋ถ„ํฌ๊ฐ€ ๋‹ฌ๋ผ์ง€๋Š” ํ˜„์ƒ์ด ๋ฐœ์ƒํ•˜๋Š”๋ฐ

 

and the more layers the data passes through, the worse this gets, so naturally the probability of problems in training or inference performance rises. Batch normalization takes the fact that distributions differ across the training data and normalizes per batch, using each mini-batch's mean and variance.

๋‚˜๋™๋นˆ๋‹˜์˜ ์˜์ƒ์„ ์ฐธ๊ณ ํ•˜์—ฌ ์•Œ๊ฒŒ ๋œ batch normalizaion๊ฐ€ ํ˜„์‹ค์—์„œ๋Š” ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ์˜์กด๋„๋ฅผ ์ค„์˜€์œผ๋ฉฐ, ํ•™์Šต์†๋„๋ฅผ ํ–ฅ์ƒ์‹œํ‚ค๊ณ , ๋ชจ๋ธ์ด ์ผ๋ฐ˜์ ์œผ๋กœ ์ฆ‰, ํ•™์Šต๋ฐ์ดํ„ฐ์—๋งŒ ํƒœ์Šคํฌ๋ฅผ ์ž˜ ์ฒ˜๋ฆฌํ•˜๋„๋ก ํ•˜๋Š”๊ฒƒ์ด ์•„๋‹Œ ์‹ค์ œ ํ˜„์ƒ์„ ์ž˜ ๋ฐ˜์˜์‹œํ‚ค๊ฒŒ ๋œ ํšจ๊ณผ๊ฐ€ ์žˆ์—ˆ๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

๊ทธ๋Ÿฐ๋ฐ ๋…ผ๋ฌธ์—์„œ๋Š” ics ๋ฅผ ๊ฐ์†Œ์‹œํ‚จ๋‹ค๊ณ  ์ฃผ์žฅํ•˜์˜€์œผ๋‚˜ ์‹ค์ œ๋กœ ์ฆ๋ช…ํ•˜์ง€๋Š” ๋ชปํ–ˆ๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋ž˜์„œ ๊ทธ๊ฒƒ์„ ์ฆ๋ช…ํ•˜๊ธฐ ์œ„ํ•œ How Does Batch Normalization Help Optimization?  ๋ผ๋Š” ๋…ผ๋ฌธ์ด ๋‚˜์™”์Šต๋‹ˆ๋‹ค.

https://arxiv.org/pdf/1805.11604

 

 

์šฐ์„  ์ผ๋ฐ˜์ ์œผ๋กœ Batch Norm ์„ ์ ์šฉ์‹œํ‚จ ๋„คํŠธ์›Œํฌ๊ฐ€ Accuracy ๊ฐ€ ๊ฐ€ํŒŒ๋ฅธ ํญ์œผ๋กœ ์˜ฌ๋ผ๊ฐ”๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค.

 

 

์šฐ์ธก์˜ ํžˆ์Šคํ† ๊ทธ๋žจ์„ ๋ณด๋ฉด ๊ฐ ๋ ˆ์ด์–ด์˜ ๋ถ„ํฌ๋ฅผ ๋‚˜ํƒ€๋‚ด๊ณ  ์žˆ๋Š”๋ฐ์š” ๊ฐ€์žฅ์šฐ์ธก์˜ Standard + Noisy BatchNorm ์—์„œ Layer3 ๋ถ€ํ„ฐ ๋ถ„ํฌ๊ฐ€ ๊ฐ‘์ž‘์Šค๋Ÿฝ๊ฒŒ ๋ณ€ํ•˜์—ฌ ICS๊ฐ€ ๋ฐœ์ƒํ•˜๊ณ  ์žˆ์Œ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ICS๊ฐ€ ๋ฐœ์ƒํ•˜๊ณ  ์žˆ์Œ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  ์™ผ์ชฝ ๊ทธ๋ž˜ํ”„๋ฅผ ๋ณด๋ฉด ํ•™์Šต์„ฑ๋Šฅ์ด ์šฐ์ˆ˜ํ•จ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ฆ‰ ์ž„์˜๋กœ Batch Norm Layer ์ดํ›„ ๋ฐ”๋กœ Noise ๋ฅผ ๋„ฃ์–ด covariate shift ๋ฅผ ๋ฐœ์ƒ์‹œ์ผฐ์„ ๋•Œ์—๋„ BatchNorm ์ด ํฌํ•จ๋œ ๋„คํŠธ์›Œํฌ๋Š” ์ผ๋ฐ˜์ ์ธ ๋„คํŠธ์›Œํฌ๋ณด๋‹ค ์„ฑ๋Šฅ์ด ์šฐ์ˆ˜ํ•จ์„ ๋ณด์˜€์Šต๋‹ˆ๋‹ค. ๊ทธ๋ž˜์„œ ์‹คํ—˜์ ์œผ๋กœ Batch Norm ์ด ICS ๋ฌธ์ œ๋ฅผ ํ•ด์†Œํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์ด์ „ ๋…ผ๋ฌธ์˜ ๋ฐ˜๋ฐ•์„ ํ•˜์˜€๊ณ , ์‹ฌ์ง€์–ด ICS๊ฐ€ ํฌ๊ฒŒ ๋ฐœ์ƒํ•จ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  Batch Norm ์ด ์žˆ์œผ๋ฉด ์„ฑ๋Šฅ์ด ์ข‹์•„์ง„๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค€ ์‚ฌ๋ก€๊ฐ€ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

The paper also proposes a way to measure ICS by computing gradients of the parameters, but that strays too far from the purpose of this post, so I will not cover it. If you are curious, see the paper.

So why is performance good even though ICS is not resolved? The paper explains it by Batch Norm's smoothing effect.

 

It says the loss landscape forms within a far more predictable range, which amplifies the training effect.

 

 

Batch Normalization Layer

๋ฏธ๋‹ˆ๋ฐฐ์น˜์˜ ํ‰๊ท ๊ฐ’๊ณผ ๋ถ„์‚ฐ์„ ๊ตฌํ•ด์„œ normalizaion ์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๊ฐ๋งˆ์™€ ๋ฒ ํƒ€๋ฅผ ํ™œ์šฉํ•ด ์‹ค์ œ output ์„ ๋‚ด๋Š”๋ฐ์š”, ์—ฌ๊ธฐ์„œ ๊ฐ๋งˆ์™€ ๋ฒ ํƒ€๊ฐ€ ์‹ค์ œ ํ•™์Šต์— ํ™œ์šฉ๋˜๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ์ž…๋‹ˆ๋‹ค. ํ•™์Šต์ค‘์—๋Š” loss ๋ฅผ ์ตœ์†Œํ™” ํ•˜๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ๊ฐ๋งˆ์™€ ๋ฒ ํƒ€๋ฅผ ์ฐพ์•„๊ฐˆ ๊ฒƒ ์ž…๋‹ˆ๋‹ค.

์ •๊ทœํ™”์—์„œ ํ•™์Šต ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ์ด์œ ๋Š” ํ™œ์„ฑํ™” ํ•จ์ˆ˜์˜ ํŠน์ง•์— ์žˆ์Šต๋‹ˆ๋‹ค. sigmoid๋ฅผ ์˜ˆ์‹œ๋กœ ๋“ค๋ฉด ์–ด๋–ค ๊ตฌ๊ฐ„์—์„œ๋Š” ๋งค์šฐ ์„ ํ˜•์ ์œผ๋กœ ์ž‘๋™ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ํ‘œ์ค€์ •๊ทœ๋ถ„ํฌ๋กœ ์ •๊ทœํ™”ํ•œ 0๊ณผ 1์‚ฌ์ด์˜ ๊ฐ’์—์„œ ์„ ํ˜•์ ์œผ๋กœ ์ž‘๋™ํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. ๊ทธ๋ž˜์„œ ๊ฐ๋งˆ์™€ ๋ฒ ํƒ€๋ฅผ ํ™œ์šฉํ•ด non-linearity ๋ฅผ ์ง€์ผœ์ฃผ๊ณ , ํ•ด๋‹น ์ •๊ทœํ™” ๋ ˆ์ด์–ด์˜ output ๋„ ์ ์ ˆํ•˜๊ฒŒ ๋‚ด๋ณด๋‚ผ ์ˆ˜ ์žˆ๊ฒŒ๋ฉ๋‹ˆ๋‹ค. ๊ฒฐ๋ก ์€ ๋ ˆ์ด์–ด์˜ ์ž…๋ ฅ์„ ์ •๊ทœํ™”ํ•  ๋•Œ๋Š” linearity ๋ฅผ ์ฃผ์˜ํ•ด์„œ ์ •๊ทœํ™” ํ•ด์•ผํ•œ๋‹ค๋Š” ์  ์ž…๋‹ˆ๋‹ค.

 

Batch Normalization Layer: training vs. inference

A batch normalization layer plays different roles in the network during training and inference. During training, the gamma and beta parameters must be learned; at inference they are no longer updated. So these parameters are frozen, and the layer must produce outputs from the already-learned parameters — using the statistics accumulated during training rather than fresh batch statistics.

 

step 7 ์—์„œ๋ถ€ํ„ฐ๋Š” BN ์ด training ๋ชจ๋“œ๋กœ ๋„คํŠธ์›Œํฌ์— ์žˆ์—ˆ๋˜ ๊ฒƒ์„ inference ๋ชจ๋“œ๋กœ ๋ฐ”๊ฟ‰๋‹ˆ๋‹ค. ( ํŒŒ๋ผ๋ฏธํ„ฐ ๊ณ ์ •์„ ํ†ตํ•ด์„œ )

Batch Normalization Data Flow

์ž…๋ ฅ ๋ฐ์ดํ„ฐ (X)

 

$$ X = \begin{bmatrix} [1,\ 2] \\ [2,\ 4] \\ [3,\ 6] \end{bmatrix} $$

Data arriving as a batch

shape: (3, 2)

→ 3 samples, each a 2-dimensional vector


Through the Linear Layer

Set the weights and bias as:

$$ W = \begin{bmatrix} [1,\ 0] \\ [0,\ 1] \end{bmatrix}, \quad b = [0,\ 0] $$

That is, a linear layer that changes nothing:

$$ Z = XW + b = X $$

Result:

Z =
[
 [1, 2],
 [2, 4],
 [3, 6]
]

shape stays (3, 2)


Batch Normalization

1๏ธโƒฃ Batch Mean (μ)

feature๋ณ„ ํ‰๊ท :

$$ μ=[(1+2+3)/3, (2+4+6)/3]=[2, 4] $$


2๏ธโƒฃ Batch Variance (σ²)

$$ σ2=[((1−2)2+(2−2)2+(3−2)2)/3,((2−4)2+(4−4)2+(6−4)2)/3]=[2/3, 8/3] $$


3๏ธโƒฃ Normalize (xฬ‚)

$$ \hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} (ε ๋ฌด์‹œํ•œ๋‹ค๊ณ  ๊ฐ€์ •) $$

Per-sample computation

First sample

$$ [1,\ 2] \rightarrow [-1/\sqrt{2/3},\ -2/\sqrt{8/3}] \approx [-1.22,\ -1.22] $$

Second sample

$$ [2,\ 4] \rightarrow [0,\ 0] $$

Third sample

$$ [3,\ 6] \rightarrow [1.22,\ 1.22] $$

Result:

X_hat =
[
 [-1.22, -1.22],
 [ 0.00,  0.00],
 [ 1.22,  1.22]
]

Then gamma and beta are applied to these values as they pass through the layer. In this way, batch norm performs normalization by computing the per-feature mean and variance of the mini-batch and substituting them into the original data.
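The worked example above can be replayed in plain Python, assuming gamma = 1, beta = 0 and ignoring epsilon:

```python
import math

# Replay of the batch-norm example: mean and variance per feature over the
# batch, then normalize (gamma = 1, beta = 0, epsilon ignored as assumed).
X = [[1, 2], [2, 4], [3, 6]]
n, d = len(X), len(X[0])

mean = [sum(row[j] for row in X) / n for j in range(d)]                  # [2.0, 4.0]
var = [sum((row[j] - mean[j]) ** 2 for row in X) / n for j in range(d)]  # [2/3, 8/3]
X_hat = [[(row[j] - mean[j]) / math.sqrt(var[j]) for j in range(d)]
         for row in X]

print([[round(v, 2) for v in row] for row in X_hat])
# [[-1.22, -1.22], [0.0, 0.0], [1.22, 1.22]]
```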

Layer Normalization

arxiv.org

Layer Normalization ์€ Batch Norm ์ด RNN ์— ์ ์šฉํ•˜๊ธฐ ์–ด๋ ค์šด ๋ฌธ์ œ์ ์„ ํ•ด์†Œํ•˜๊ธฐ ์œ„ํ•ด ์ œ์‹œ๋œ ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. RNN์€ ์‹œ๊ฐ„๋‹จ์œ„๋กœ ๊ณ„์‚ฐ์„ ํ•ฉ๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ ๋ฏธ๋‹ˆ๋ฐฐ์น˜์˜ ๊ฐ ํ”ผ์ณ๋งˆ๋‹ค ํ†ต๊ณ„๋ฅผ ์ด์šฉํ•ด ์ •๊ทœํ™”ํ•˜๋Š” BN ์˜ ๊ฒฝ์šฐ์—๋Š” ํ•ด๋‹น ์ŠคํŠธ๋ฆผ์˜ ๋งฅ๋ฝ์„ ๋ฐ˜์˜ํ•˜์ง€ ๋ชปํ•ฉ๋‹ˆ๋‹ค.

๊ฐ€์žฅ ํฐ ๋ฌธ์ œ๋Š” RNN ์ด๋‚˜ NLP, ํ˜น์€ ์Œ์„ฑ๋ฐ์ดํ„ฐ์˜ ๊ฒฝ์šฐ๋Š” ๋ฐฐ์น˜๋งˆ๋‹ค ๊ธธ์ด๊ฐ€ ๋‹ค๋ฆ…๋‹ˆ๋‹ค.

Sample 1: "๋‚˜๋Š” ๋ฐฅ์„ ๋จน์—ˆ๋‹ค" ("I ate rice", length 4)
Sample 2: "์˜ค๋Š˜" ("today", length 1)
Sample 3: "์–ด์ œ ๋น„๊ฐ€ ์™€์„œ ์šฐ์‚ฐ์„ ์ผ๋‹ค" ("it rained yesterday so I used an umbrella", length 6)

์ด๊ฒƒ์„ BN ์„ ํ™œ์šฉํ•œ Layer output ์„ ์‚ฌ์šฉํ•œ๋‹ค๋ฉด ์ƒ˜ํ”Œ2 ์˜ 2,3 ์ƒ˜ํ”Œ1์˜ 3,4 ๊ฐ€ 0์ด ๋ฉ๋‹ˆ๋‹ค. ๊ทธ๋ ‡๊ธฐ ๋•Œ๋ฌธ์— ๋ฐ์ดํ„ฐ์˜ ์˜๋ฏธ๋ฅผ ์ถฉ๋ถ„ํžˆ ๋ฐ˜์˜ํ•˜์ง€ ๋ชปํ•˜๋Š” ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค. ์ด ๋ฌธ์ œ๋Š” ์‹œ๊ณ„์—ด ๋ฐ์ดํ„ฐ์—๋„ ๊ทธ๋Œ€๋กœ ์ ์šฉ๋ฉ๋‹ˆ๋‹ค. ์ด๋ฏธ์ง€๋‚˜ ์„ฑ์ ํ†ต๊ณ„(๊ตญ์–ด๋Š” ๊ตญ์–ด๋ผ๋ฆฌ, ์ˆ˜ํ•™์€ ์ˆ˜ํ•™๋ผ๋ฆฌ) ์™€ ๊ฐ™์€ ๋ฐ์ดํ„ฐ๊ฐ€ ์•„๋‹ˆ๋ผ ํ”ผ์ณํ•˜๋‚˜๊ฐ€ ๋‹ค๋ฅธ ํ”ผ์ณ๋‚˜ ๋ฐ์ดํ„ฐ์—๋„ ์˜ํ–ฅ์„ ์ฃผ๋Š”๊ฒฝ์šฐ๋Š” Batch ์‚ฌ์ด์ฆˆ์— ์˜ํ–ฅ์„ ๋ฐ›์ง€ ์•Š๊ณ  ๋ฐ์ดํ„ฐ์˜ ์˜๋ฏธ๋ฅผ ์ž˜ ๋ฐ˜์˜ํ•  ์ˆ˜ ์žˆ๋Š” LN ์ด ์„ฑ๋Šฅ์ด ์ข‹๋‹ค๊ณ  ์ฃผ์žฅํ•ฉ๋‹ˆ๋‹ค.

 

BN ๊ณผ์˜ ์ฐจ์ด์ 

Batch Normalization์€ ๋ฏธ๋‹ˆ๋ฐฐ์น˜ ๋‹จ์œ„๋กœ ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ์„ ๊ณ„์‚ฐํ•˜์—ฌ ์ •๊ทœํ™”๋ฅผ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค. ๋ฐ˜๋ฉด **Layer Normalization(LN)**์€ ์ด๋ฆ„ ๊ทธ๋Œ€๋กœ ๋ ˆ์ด์–ด ๋‹จ์œ„, ์ •ํ™•ํžˆ๋Š” ํ•˜๋‚˜์˜ ์ƒ˜ํ”Œ ๋‚ด๋ถ€ feature๋“ค์— ๋Œ€ํ•ด์„œ๋งŒ ์ •๊ทœํ™”๋ฅผ ์ˆ˜ํ–‰ํ•ฉ๋‹ˆ๋‹ค. ์ฆ‰, ์ •๊ทœํ™”์˜ ๊ธฐ์ค€์ด ์™„์ „ํžˆ ๋‹ค๋ฆ…๋‹ˆ๋‹ค.

  • Batch Normalization
    • Mean/variance computed along the batch axis
    • Uses many samples that share the same feature
  • Layer Normalization
    • Mean/variance computed along the feature axis
    • Computed within a single sample only

For a single sample x = [x₁, x₂, …, x_d]:

$$ \mu = \frac{1}{d} \sum_{i=1}^{d} x_i $$

$$ \sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2 $$

$$ \hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} $$

Then scale and shift parameters are applied, just as in Batch Normalization:

$$ y_i = \gamma_i \hat{x}_i + \beta_i $$

The important point here is that γ and β exist only over the feature dimension and are independent of the batch size.

์œ„์˜ ์ˆ˜์‹๋Œ€๋กœ ๊ฐ™์€ ์ƒ˜ํ”Œ์„ ๊ฐ€์ง€๊ณ  ๋ ˆ์ด์–ด๋ฅผ ํ†ต๊ณผํ•˜๋Š” ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

Layer Normalization Data Flow

์ž…๋ ฅ ๋ฐ์ดํ„ฐ (X)

$$ X = \begin{bmatrix} [1,\ 2] \\ [2,\ 4] \\ [3,\ 6] \end{bmatrix} $$

shape: (3, 2)

→ 3 samples, each a 2-dimensional vector


Through the Linear Layer

The weights and bias are set the same as before.

$$ Z = X $$


Layer Normalization ์ ์šฉ

Layer Normalization์€ ๊ฐ ์ƒ˜ํ”Œ๋งˆ๋‹ค ๋…๋ฆฝ์ ์œผ๋กœ ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ์„ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.

First sample [1, 2]

$$ \mu = (1 + 2) / 2 = 1.5 $$

$$ \sigma^2 = ((1 - 1.5)^2 + (2 - 1.5)^2) / 2 = 0.25 $$

Normalized result:

$$ [1, 2] \rightarrow [-1, 1] $$


Second sample [2, 4]

$$ \mu = 3,\quad \sigma^2 = 1 $$

Normalized result:

$$ [2, 4] \rightarrow [-1, 1] $$


Third sample [3, 6]

$$ \mu = 4.5,\quad \sigma^2 = 2.25 $$

Normalized result:

$$ [3, 6] \rightarrow [-1, 1] $$


Layer Normalization result

X_hat =
[
 [-1,  1],
 [-1,  1],
 [-1,  1]
]
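The per-sample computation above can be replayed the same way (epsilon again ignored):

```python
import math

# Replay of the layer-norm example: mean and variance inside each sample,
# independent of every other sample in the batch (epsilon ignored).
X = [[1, 2], [2, 4], [3, 6]]
X_hat = []
for row in X:
    mu = sum(row) / len(row)
    var = sum((x - mu) ** 2 for x in row) / len(row)
    X_hat.append([round((x - mu) / math.sqrt(var), 2) for x in row])

print(X_hat)  # [[-1.0, 1.0], [-1.0, 1.0], [-1.0, 1.0]]
```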

Why Layer Normalization Suits the Transformer Better than Batch Normalization

1. Variable sequence lengths and the masking problem

Transformer์˜ Self-Attention์€ ๊ฐ€๋ณ€ ๊ธธ์ด ์‹œํ€€์Šค๋ฅผ ์ฒ˜๋ฆฌํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ž…๋ ฅํ˜•ํƒœ๋Š” ๊ฐ ๋ฌธ์žฅ๋งˆ๋‹ค ๊ธธ์ด๊ฐ€ ๋‹ค๋ฅด๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค. ์ด๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ์งง์€ ๋ฌธ์žฅ์—๋Š” padding์„ ์ถ”๊ฐ€ํ•˜ attention mask๋ฅผ ์‚ฌ์šฉํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

Batch Normalization์„ ์ด๋Ÿฌํ•œ ๊ตฌ์กฐ์— ์ ์šฉํ•˜๋ฉด ์‹ฌ๊ฐํ•œ ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค. BN์€ ๋ฐฐ์น˜์™€ ์‹œํ€€์Šค ์ฐจ์› ์ „์ฒด์— ๊ฑธ์ณ ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ์„ ๊ณ„์‚ฐํ•˜๋Š”๋ฐ ์œ„์—์„œ ๋ดค๋˜ ๊ฒƒ ์ฒ˜๋Ÿผ ์˜๋ฏธ ์—†๋Š” padding ํ† ํฐ์˜ 0 ๋ฒกํ„ฐ๊ฐ€ ํ†ต๊ณ„์— ํฌํ•จ๋ฉ๋‹ˆ๋‹ค. ๊ฒฐ๊ณผ์ ์œผ๋กœ ๋ฌธ์žฅ ๊ธธ์ด์— ๋”ฐ๋ผ ์ •๊ทœํ™” ํ†ต๊ณ„๊ฐ€ ์™œ๊ณก๋˜๊ณ , ๊ฐ™์€ ๋‚ด์šฉ์˜ ๋ฌธ์žฅ์ด๋ผ๋„ padding์˜ ์–‘์— ๋”ฐ๋ผ ๋‹ค๋ฅด๊ฒŒ ์ •๊ทœํ™”๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

 

Layer Normalization, by contrast, normalizes only over each token's feature dimension. Because the mean and variance are computed inside a single token, padding tokens and sequence length have no effect at all on the normalization statistics. Each token is normalized independently, so the data's meaning is faithfully preserved and normalization is consistent regardless of batch or sequence structure.

2. Autoregressive decoding and batch-size mismatch

At inference, the Transformer decoder operates autoregressively so it cannot reference future information: it generates the next token one step at a time from the tokens produced so far. In this process the batch size is 1 in most cases, which, as the Layer Normalization paper showed, causes fatal problems for Batch Normalization.

Layer Normalization์€ ๋ฐฐ์น˜ ํฌ๊ธฐ์™€ ๋ฌด๊ด€ํ•˜๊ฒŒ ์•ˆ์ •์ ์œผ๋กœ ๋™์ž‘ํ•ฉ๋‹ˆ๋‹ค. ๋ฐฐ์น˜ ํฌ๊ธฐ๊ฐ€ 1์ด๋“  32๋“  ์ •๊ทœํ™” ๊ฒฐ๊ณผ๋Š” ์ผ๊ด€๋˜๋ฉฐ, ํ•™์Šต ์‹œ ๊ด€์ฐฐํ•œ ์„ฑ๋Šฅ์ด ์ถ”๋ก  ์‹œ์—๋„ ๊ทธ๋Œ€๋กœ ์œ ์ง€๋ฉ๋‹ˆ๋‹ค. ์ด๋Š” Transformer Decoder์˜ ์ƒ์„ฑ ํ’ˆ์งˆ์— ๊ฒฐ์ •์ ์œผ๋กœ ์ค‘์š”ํ•œ ํŠน์„ฑ์ž…๋‹ˆ๋‹ค.

3. Structural mismatch with residual connections

Transformer์˜ ๊ฐ ๋ธ”๋ก์€ residual connection์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค: y = x + Sublayer(LN(x)). ์ด ๊ตฌ์กฐ๊ฐ€ ์ค‘์š”ํ•œ ์ด์œ ๋Š” gradient์˜ ํ๋ฆ„ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค. ์—ญ์ „ํŒŒ ์‹œ ∂y/∂x = 1 + ∂Sublayer/∂x ๊ฐ€ ๋˜์–ด, gradient๊ฐ€ ํ•ญ์ƒ ์ง์ ‘ ํ๋ฅผ ์ˆ˜ ์žˆ๋Š” ๊ฒฝ๋กœ(identity mapping)๊ฐ€ ๋ณด์žฅ๋ฉ๋‹ˆ๋‹ค. ์ด๋Š” ๊นŠ์€ ๋„คํŠธ์›Œํฌ์—์„œ gradient vanishing ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๋Š” ํ•ต์‹ฌ ๋ฉ”์ปค๋‹ˆ์ฆ˜์ž…๋‹ˆ๋‹ค.

If Batch Normalization were used on the residual path, its output would depend on batch statistics, injecting batch-dependent noise into the residual path. That destabilizes gradient flow and, especially in deep Transformers, can cause gradient explosion or vanishing. Indeed, the Post-LN Transformer (LN applied after the residual) is known to become unstable to train as layers deepen, while the Pre-LN Transformer (LN applied before the residual) trains more stably. BN fundamentally conflicts with these properties of the residual connection.

Layer Normalization์€ ๊ฐ ์ƒ˜ํ”Œ์„ ๋…๋ฆฝ์ ์œผ๋กœ ์ •๊ทœํ™”ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋ฐฐ์น˜์— ์˜์กดํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ๋”ฐ๋ผ์„œ residual path์˜ gradient flow๋ฅผ ๋ฐฉํ•ดํ•˜์ง€ ์•Š์œผ๋ฉฐ, ์ˆ˜์‹ญ ๊ฐœ์˜ ๋ ˆ์ด์–ด๋กœ ์ด๋ฃจ์–ด์ง„ ๊นŠ์€ Transformer์—์„œ๋„ ์•ˆ์ •์ ์ธ ํ•™์Šต์ด ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค. ์ด๋Ÿฌํ•œ ๊ตฌ์กฐ์  ์กฐํ™”๊ฐ€ Transformer๊ฐ€ Layer Normalization์„ ์‚ฌ์šฉํ•˜๋Š” ๋˜ ๋‹ค๋ฅธ ์ค‘์š”ํ•œ ์ด์œ ์ž…๋‹ˆ๋‹ค.
