
Background

While developing an in-house LLM service, we ran into a problem where vLLM was not processing requests in parallel. The vLLM logs showed that requests were reaching the vLLM server and being processed one at a time. At first I assumed vLLM was failing to detect multiple GPUs and was therefore over-allocating VRAM, which prevented parallel processing.

However, vLLM had been launched with the multi-GPU option, and the logs confirmed that both GPUs were detected correctly. After more digging, I found that the problem was the openai library that FastAPI used to send requests to vLLM. In that library, the OpenAI client issues synchronous requests; to get asynchronous behavior you have to use AsyncOpenAI.

To write this down, this post covers the difference between sending requests with requests and with httpx, FastAPI's sync and async endpoints, and how parallelism and asynchrony differ.


1. How FastAPI handles sync and async endpoints

FastAPI behaves quite differently depending on whether an endpoint function is declared with def or async def.

1.1 Sync endpoints (def)

from fastapi import FastAPI
import time

app = FastAPI()

@app.get("/sync")
def sync_endpoint():
    time.sleep(5)
    return {"msg": "done"}

For a sync endpoint, FastAPI does not run the function on the event loop; it hands the call to a worker thread from an internal thread pool.

In other words, each request occupies one thread. This becomes a problem when the work involves long I/O waits, such as calls to an external API. The thread stays occupied until the response arrives, so the number of requests that can be handled at the same time is capped by the pool size and drops sharply once the threads are busy. From vLLM's point of view, the requests then appear to arrive sequentially, one at a time.
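The effect of threads being held by blocking calls can be reproduced without FastAPI. The sketch below is only an illustration built on the standard library's ThreadPoolExecutor; the names blocking_handler and serve_requests and the pool sizes are hypothetical, and FastAPI manages its own pool internally, so the numbers here are not its defaults.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_handler(delay: float) -> str:
    """Stand-in for a sync endpoint that waits on a slow upstream call."""
    time.sleep(delay)  # the worker thread is held for the whole wait
    return "done"


def serve_requests(num_requests: int, pool_size: int, delay: float) -> float:
    """Run num_requests blocking handlers on pool_size threads; return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        results = list(pool.map(blocking_handler, [delay] * num_requests))
    assert results == ["done"] * num_requests
    return time.perf_counter() - start


if __name__ == "__main__":
    # 4 requests of 0.2 s each: 2 threads need two rounds, 4 threads need one.
    print(f"pool_size=2: {serve_requests(4, 2, 0.2):.2f}s")
    print(f"pool_size=4: {serve_requests(4, 4, 0.2):.2f}s")
```

With fewer threads than pending requests, the extra requests simply wait in the queue, which is the same shape of slowdown described above.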


1.2 Async endpoints (async def)

from fastapi import FastAPI
import asyncio

app = FastAPI()

@app.get("/async")
async def async_endpoint():
    await asyncio.sleep(5)
    return {"msg": "done"}

An async endpoint runs on the event loop. While it waits on I/O, it returns control to the event loop so that other requests can be processed. The important caveat is that declaring a function with async def does not make it asynchronous by itself. Every I/O operation inside the endpoint has to be awaitable (non-blocking); otherwise a single blocking call stalls the whole event loop.

As explained further below, asynchronous execution is not the same as parallel execution. Async work is concurrent: tasks take turns, so they only appear to be processed at the same time.
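To make the "async def alone is not enough" point concrete, here is a minimal sketch that times the same set of coroutines with a blocking wait and with an awaitable wait. It uses only asyncio and time; the function names (blocking_handler, non_blocking_handler, timed_gather, compare) are hypothetical and stand in for endpoints, not for anything FastAPI provides.

```python
import asyncio
import time


async def blocking_handler(delay: float) -> str:
    time.sleep(delay)  # blocking call: the event loop cannot switch tasks here
    return "done"


async def non_blocking_handler(delay: float) -> str:
    await asyncio.sleep(delay)  # awaitable: control returns to the event loop
    return "done"


async def timed_gather(handler, num_requests: int, delay: float) -> float:
    """Run num_requests handlers concurrently and return elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(handler(delay) for _ in range(num_requests)))
    return time.perf_counter() - start


def compare(num_requests: int = 3, delay: float = 0.2):
    """Return (blocking_elapsed, non_blocking_elapsed) in seconds."""
    blocking = asyncio.run(timed_gather(blocking_handler, num_requests, delay))
    non_blocking = asyncio.run(timed_gather(non_blocking_handler, num_requests, delay))
    return blocking, non_blocking


if __name__ == "__main__":
    blocking, non_blocking = compare()
    print(f"blocking: {blocking:.2f}s, non-blocking: {non_blocking:.2f}s")
```

Both handlers are declared with async def, yet the blocking version takes roughly the sum of all delays while the awaitable version takes roughly one delay.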


2. Concurrency and parallelism in FastAPI

Reference: Concurrency and async / await - FastAPI (fastapi.tiangolo.com)

To understand this issue, the difference between concurrency and parallelism needs to be clearly distinguished.

2.1 Concurrency

Concurrency means handling multiple tasks by switching between them.

The tasks do not actually run at the same instant, but they appear to be processed simultaneously. FastAPI's async handling falls into this category.
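The "taking turns" behavior can be observed directly. The following sketch, with hypothetical names worker and run_interleaved, lets two coroutines record each step together with the ID of the thread they run on; it assumes nothing beyond the standard asyncio and threading modules.

```python
import asyncio
import threading


async def worker(name: str, steps: int, log: list) -> None:
    """Append one log entry per step, yielding to the event loop in between."""
    for step in range(steps):
        log.append((name, step, threading.get_ident()))
        await asyncio.sleep(0)  # give other tasks a turn


async def run_interleaved() -> list:
    """Run two workers concurrently and return the shared log."""
    log: list = []
    await asyncio.gather(worker("A", 2, log), worker("B", 2, log))
    return log


if __name__ == "__main__":
    for name, step, thread_id in asyncio.run(run_interleaved()):
        print(name, step, thread_id)
```

The steps of A and B alternate in the log, and every entry carries the same thread ID: the work is interleaved on one thread rather than executed simultaneously.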


2.2 Parallelism

Reference: concurrent.futures: Launching parallel tasks (docs.python.org)

Parallelism means actually executing multiple tasks at the same time.

The official FastAPI documentation illustrates the difference with a cute burger example:

1. Concurrency (image)

2. Parallelism (image)

See the FastAPI link above for the details.


3. Why the OpenAI library became the bottleneck

3.1 Using OpenAI (the sync SDK)

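from fastapi import FastAPI
from openai import OpenAI

app = FastAPI()

# Synchronous client pointed at the vLLM OpenAI-compatible server
client = OpenAI(
    base_url="http://vllm:8000/v1",
    api_key="EMPTY"
)

@app.post("/chat")
def chat():
    # Blocks the calling thread until vLLM returns the full response
    response = client.chat.completions.create(
        model="qwen",
        messages=[{"role": "user", "content": "hello"}]
    )
    return response.choices[0].message.content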

The OpenAI class works synchronously.

That is, it fully occupies a FastAPI worker thread until the response comes back. If the same blocking call is made inside an async def endpoint, it blocks the event loop itself, and no other request can make progress in the meantime.

This caused the following symptoms:

  • FastAPI requests were serialized
  • The vLLM server log showed requests arriving one at a time
  • No batching took place even though there was enough GPU capacity

At first, this was easy to mistake for a vLLM configuration problem.


3.2 Using AsyncOpenAI (the fix)

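from fastapi import FastAPI
from openai import AsyncOpenAI

app = FastAPI()

# Asynchronous client pointed at the vLLM OpenAI-compatible server
client = AsyncOpenAI(
    base_url="http://vllm:8000/v1",
    api_key="EMPTY"
)

@app.post("/chat")
async def chat():
    # The await releases the event loop while vLLM generates the response
    response = await client.chat.completions.create(
        model="qwen",
        messages=[{"role": "user", "content": "hello"}]
    )
    return response.choices[0].message.content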

Switching to AsyncOpenAI resolved the problem.

  • The FastAPI event loop is no longer blocked
  • Multiple requests reach vLLM at the same time
  • vLLM batching works as expected
  • Multi-GPU usage confirmed

In the end, the cause of what looked like a parallelism failure was not vLLM itself but the way requests were sent from FastAPI to vLLM.
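One way to see why vLLM logged requests one at a time is to count how many calls are in flight at the backend at once. The sketch below is a simulation under stated assumptions: FakeBackend is a hypothetical stand-in for vLLM, and the two handler coroutines stand in for async endpoints that use a blocking client and an awaitable client. No real FastAPI, openai, or vLLM call is involved.

```python
import asyncio
import time


class FakeBackend:
    """Hypothetical inference server that records its peak concurrency."""

    def __init__(self) -> None:
        self.in_flight = 0
        self.peak_in_flight = 0

    def _enter(self) -> None:
        self.in_flight += 1
        self.peak_in_flight = max(self.peak_in_flight, self.in_flight)

    def _exit(self) -> None:
        self.in_flight -= 1

    def generate_blocking(self, delay: float) -> str:
        """Mimics a sync client call: the caller is stuck until it returns."""
        self._enter()
        try:
            time.sleep(delay)
            return "done"
        finally:
            self._exit()

    async def generate_async(self, delay: float) -> str:
        """Mimics an async client call: the caller yields while waiting."""
        self._enter()
        try:
            await asyncio.sleep(delay)
            return "done"
        finally:
            self._exit()


async def handler_with_sync_client(backend: FakeBackend, delay: float) -> str:
    return backend.generate_blocking(delay)


async def handler_with_async_client(backend: FakeBackend, delay: float) -> str:
    return await backend.generate_async(delay)


async def peak_concurrency(handler, num_requests: int = 5, delay: float = 0.05) -> int:
    """Fire num_requests handlers at once and report the backend's peak load."""
    backend = FakeBackend()
    await asyncio.gather(*(handler(backend, delay) for _ in range(num_requests)))
    return backend.peak_in_flight


def measure():
    """Return (peak with blocking client, peak with awaitable client)."""
    sync_peak = asyncio.run(peak_concurrency(handler_with_sync_client))
    async_peak = asyncio.run(peak_concurrency(handler_with_async_client))
    return sync_peak, async_peak


if __name__ == "__main__":
    print(measure())
```

With the blocking client the backend never sees more than one request at a time, so there is nothing for it to batch; with the awaitable client all five requests are in flight together.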


4. requests vs. httpx

4.1 requests

import requests

def call_vllm(url: str, payload: dict) -> dict:
    # Synchronous call: the current thread waits here for the response
    r = requests.post(url, json=payload)
    return r.json()

  • Synchronous-only library
  • Blocks the event loop when called inside async def
  • Does not fit FastAPI's async model

4.2 httpx (recommended for async)

import httpx

async def call_vllm(url: str, payload: dict) -> dict:
    # The await hands control back to the event loop while waiting
    async with httpx.AsyncClient(timeout=60) as client:
        r = await client.post(url, json=payload)
        return r.json()

  • Supports asynchronous I/O
  • Provides connection pooling (the pool is only reused when the same client instance is shared across calls)
  • Fits very well with FastAPI

4.3 Wrong and right examples

❌ Wrong

@app.post("/bad")
async def bad():
    # requests.post blocks, so the event loop is stuck until it returns
    r = requests.post(url, json=payload)
    return r.json()

⭕ Right

@app.post("/good")
async def good():
    # Awaiting the httpx call lets the event loop serve other requests
    async with httpx.AsyncClient() as client:
        r = await client.post(url, json=payload)
        return r.json()
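If a blocking client cannot be replaced right away, one possible stopgap is to push the blocking call onto a worker thread so that the event loop stays free. The sketch below assumes Python 3.9+ for asyncio.to_thread; blocking_call is a hypothetical stand-in for something like requests.post, and handler_offloaded and run_many are illustrative names only.

```python
import asyncio
import time


def blocking_call(delay: float) -> str:
    """Stand-in for a sync client call that cannot be awaited."""
    time.sleep(delay)
    return "done"


async def handler_offloaded(delay: float) -> str:
    # asyncio.to_thread runs the blocking function in a worker thread,
    # so the event loop can keep serving other tasks while it waits.
    return await asyncio.to_thread(blocking_call, delay)


async def run_many(num_requests: int, delay: float) -> float:
    """Run num_requests offloaded handlers at once; return elapsed seconds."""
    start = time.perf_counter()
    results = await asyncio.gather(
        *(handler_offloaded(delay) for _ in range(num_requests))
    )
    assert results == ["done"] * num_requests
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"4 offloaded calls: {asyncio.run(run_many(4, 0.2)):.2f}s")
```

This keeps the endpoint responsive, but each in-flight call still holds a thread, so an async client such as httpx.AsyncClient or AsyncOpenAI remains the cleaner option.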

