LMDeploy Distserve #3304
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,98 @@ | ||
# LMDeploy-DistServe | ||
|
||
## Key Components | ||
|
||
1. **Router Service**: Coordinates between prefill/decode engines | ||
2. **Migration Manager**: Facilitates high-performance memory sharing | ||
|
||
## Installation | ||
|
||
``` | ||
# Inference Engine | ||
pip install "lmdeploy[all]>=0.7.0" | ||
|
||
# Transfer Engine | ||
# [Review comment] Just curious. Is it a private Python package for transferring? I couldn't find any information about this project. Thanks. |
||
pip install dlslime==0.0.1.post3 | ||
``` | ||
|
||
## Quick Start | ||
|
||
### 1. Configure Endpoints | ||
|
||
First deploy your prefill and decode engines. | ||
|
||
```shell | ||
# Prefill Engine | ||
CUDA_VISIBLE_DEVICES=0,1 lmdeploy serve api_server internlm/internlm2_5-7b-chat --server-port 23333 --role Prefill --tp 2 | ||
# Decode Engine | ||
CUDA_VISIBLE_DEVICES=2,3 lmdeploy serve api_server internlm/internlm2_5-7b-chat --server-port 23334 --role Decode --tp 2 | ||
``` | ||
|
||
### 2. Launch Router Service | ||
|
||
```shell | ||
lmdeploy serve proxy \ | ||
--server-name 0.0.0.0 \ | ||
--server-port 8000 \ | ||
--routing-strategy "min_expected_latency" \ | ||
--serving-strategy DistServe \ | ||
--log-level INFO | ||
``` | ||
|
||
## API Usage | ||
|
||
```shell | ||
# API Invoke | ||
curl -X POST "http://localhost:8000/v1/completions" \ | ||
-H "Content-Type: application/json" \ | ||
-d '{"model": "internlm/internlm2_5-7b-chat", "temperature":0, "prompt": "Shanghai is a city that ", "max_tokens": 16, "stream": false}' | ||
# Output | ||
{ | ||
"id":"2", | ||
"object":"text_completion", | ||
"created":1743662400, | ||
"model":"internlm/internlm2_5-7b-chat", | ||
"choices":[ | ||
{ | ||
"index":0, | ||
"text":" is very famous for its skyscrapers. It is also a city","logprobs":null,"finish_reason":"length" | ||
} | ||
], | ||
"usage": { | ||
"prompt_tokens":7,"total_tokens":23,"completion_tokens":16 | ||
} | ||
} | ||
``` | ||
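The same request can be issued from Python. The following is a minimal sketch using only the standard library; it assumes the proxy from the Quick Start is listening on `localhost:8000`, and the helper names `build_completion_request` and `complete` are illustrative, not part of LMDeploy.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 16) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "temperature": 0,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "stream": False,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url: str, model: str, prompt: str, max_tokens: int = 16) -> str:
    """Send the request to a running proxy and return the generated text."""
    request = build_completion_request(base_url, model, prompt, max_tokens)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # The response layout follows the sample output above.
    return body["choices"][0]["text"]
```

With the router running, `complete("http://localhost:8000", "internlm/internlm2_5-7b-chat", "Shanghai is a city that ")` should return the text of the first choice.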
|
||
## Troubleshooting | ||
|
||
### RDMA Connection Failed | ||
|
||
Make sure ibverbs is correctly installed: | ||
|
||
``` | ||
# on Ubuntu | ||
sudo apt install libibverbs-dev | ||
# on CentOS | ||
sudo yum install libibverbs-devel | ||
``` | ||
|
||
```bash | ||
ibstatus # Verify IB device status | ||
ibv_devinfo # Check device capabilities | ||
``` | ||
|
||
### Check GPUDirect RDMA | ||
|
||
Currently, lmdeploy-distserve uses GPUDirect RDMA to perform KV transfer. Make sure the GPUDirect RDMA driver is loaded into the kernel. | ||
|
||
```bash | ||
lsmod | grep nv_peer_mem | ||
# Module info is printed if the GPUDirect RDMA driver is loaded; empty output means it is not. | ||
``` | ||
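The same check can be scripted, for example as a pre-flight step before launching the engines. This is a rough sketch: the function names are hypothetical, and it assumes the driver appears in `lsmod` under either `nv_peer_mem` (the name used above) or `nvidia_peermem` (the name used by the module bundled with newer NVIDIA drivers).

```python
import subprocess
from typing import Optional

# Assumption: GPUDirect RDMA is provided by one of these kernel modules.
PEER_MEM_MODULES = ("nv_peer_mem", "nvidia_peermem")


def find_peer_mem_module(lsmod_output: str) -> Optional[str]:
    """Return the name of the loaded peer-memory module, or None if absent."""
    for line in lsmod_output.splitlines():
        fields = line.split()
        # The first column of lsmod output is the module name.
        if fields and fields[0] in PEER_MEM_MODULES:
            return fields[0]
    return None


def gpudirect_rdma_loaded() -> bool:
    """Run lsmod and report whether a peer-memory module is loaded."""
    output = subprocess.run(["lsmod"], capture_output=True, text=True, check=True).stdout
    return find_peer_mem_module(output) is not None
```

Keeping the parsing separate from the `lsmod` call makes the check easy to exercise on machines without RDMA hardware.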
|
||
### ConnectionPool Issue | ||
|
||
Currently, if the Proxy disconnects, the connection pool must be warmed up again. A future enhancement could involve: | ||
|
||
A dedicated connection pool management server (e.g., built on a Raft-based store such as etcd, as mentioned in Mooncake) to improve connection discovery and avoid repeated warmups. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
# Copyright (c) OpenMMLab. All rights reserved. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
# Copyright (c) OpenMMLab. All rights reserved. | ||
from lmdeploy.logger import get_logger | ||
|
||
logger = get_logger('lmdeploy') | ||
|
||
try: | ||
logger.debug('Registering DLSlime Backend') | ||
from .dlslime import DLSlimeBackend | ||
except ImportError: | ||
logger.warning('Disable DLSlime Backend') | ||
|
||
try: | ||
logger.debug('Registering Mooncake Backend') | ||
from .mooncake import MooncakeBackend | ||
except ImportError: | ||
logger.warning('Disable Mooncake Backend') | ||
|
||
try: | ||
logger.debug('Registering InfiniStore Backend') | ||
from .infinistore import InfiniStoreBackend | ||
except ImportError: | ||
logger.warning('Disable InfiniStore Backend') | ||
|
||
__all__ = ['DLSlimeBackend', 'MooncakeBackend', 'InfiniStoreBackend'] |
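One thing to watch with this pattern: `__all__` names all three backends even when an import failed, so `from ... import *` would raise `AttributeError` for the missing ones. A possible alternative, sketched below under the assumption that each backend is a single attribute of a single module, is to build the export list only from imports that succeeded. `import_optional` is a hypothetical helper, and the demo uses `json.JSONDecoder` as a stand-in for a backend class so the snippet runs anywhere.

```python
import importlib
import logging
from typing import Dict, List, Tuple

logger = logging.getLogger("lmdeploy")


def import_optional(candidates: List[Tuple[str, str]]) -> Dict[str, object]:
    """Import each (module, attribute) pair, skipping those that fail."""
    loaded = {}
    for module_name, attr_name in candidates:
        try:
            module = importlib.import_module(module_name)
            loaded[attr_name] = getattr(module, attr_name)
        except (ImportError, AttributeError):
            logger.warning("Disable %s", attr_name)
    return loaded


# Stand-in candidates: one importable, one deliberately missing.
backends = import_optional([
    ("json", "JSONDecoder"),
    ("no_such_module_xyz", "MissingBackend"),
])
# Only names that actually imported are exported.
exported = sorted(backends)
```

In a package `__init__.py`, the result could feed `globals().update(backends)` and `__all__ = list(backends)`.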
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
# Copyright (c) OpenMMLab. All rights reserved. | ||
from mmengine.registry import Registry | ||
|
||
MIGRATION_BACKENDS = Registry('migration_backend', locations=['lmdeploy.disagg.backend.backend']) |
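For readers unfamiliar with the registry pattern, the sketch below shows the idea with a small pure-Python stand-in. `SimpleRegistry` is not the `mmengine` `Registry` class and omits most of its features; it only illustrates how a decorator can map a string name to a backend class so the backend can be selected by name at runtime. `FakeDLSlimeBackend` is a placeholder class.

```python
from typing import Callable, Dict, Type


class SimpleRegistry:
    """A tiny name -> class registry (illustrative stand-in only)."""

    def __init__(self, name: str):
        self.name = name
        self._modules: Dict[str, Type] = {}

    def register_module(self, name: str) -> Callable[[Type], Type]:
        """Return a class decorator that records the class under `name`."""

        def decorator(cls: Type) -> Type:
            if name in self._modules:
                raise KeyError(f"{name} is already registered in {self.name}")
            self._modules[name] = cls
            return cls

        return decorator

    def get(self, name: str) -> Type:
        """Look up a previously registered class."""
        return self._modules[name]


BACKENDS = SimpleRegistry("migration_backend")


@BACKENDS.register_module("DLSlime")
class FakeDLSlimeBackend:
    """Placeholder for a real migration backend."""


# Select and instantiate a backend by its registered name.
backend = BACKENDS.get("DLSlime")()
```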
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,37 @@ | ||
# Copyright (c) OpenMMLab. All rights reserved. | ||
from abc import abstractmethod | ||
|
||
from lmdeploy.disagg.config import MigrationProtocol | ||
from lmdeploy.disagg.messages import DistServeRegisterMRMessage, MigrationAssignment | ||
from lmdeploy.disagg.request import DistServeConnectionRequest, DistServeInitRequest | ||
|
||
|
||
class MigrationBackendImpl: | ||
|
||
@abstractmethod | ||
def p2p_initialize(self, init_request: DistServeInitRequest): | ||
raise NotImplementedError | ||
|
||
@abstractmethod | ||
def register_memory_region(self, register_mr_request: DistServeRegisterMRMessage): | ||
raise NotImplementedError | ||
|
||
@abstractmethod | ||
def endpoint_info(self, remote_engine_id: int, protocol: MigrationProtocol): | ||
raise NotImplementedError | ||
|
||
@abstractmethod | ||
def p2p_connect(self, conn_req: DistServeConnectionRequest): | ||
raise NotImplementedError | ||
|
||
@abstractmethod | ||
async def p2p_migrate(self, assignment: MigrationAssignment): | ||
raise NotImplementedError | ||
|
||
@abstractmethod | ||
async def store(self, assignment: MigrationAssignment): | ||
raise NotImplementedError | ||
|
||
@abstractmethod | ||
async def load(self, assignment: MigrationAssignment): | ||
raise NotImplementedError |
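A note on the base class: `@abstractmethod` is only enforced when the class uses `ABCMeta`, typically by inheriting from `abc.ABC`. `MigrationBackendImpl` as written has no such base, so a subclass that forgets a method can still be instantiated. The sketch below demonstrates the difference with simplified stand-in classes; whether the base class should inherit from `ABC` is a design choice for the authors.

```python
from abc import ABC, abstractmethod


class WithoutABC:
    """Mirrors the current base class: decorator present, no ABC base."""

    @abstractmethod
    def p2p_connect(self, conn_req):
        raise NotImplementedError


class WithABC(ABC):
    """Same interface, but abstract methods are enforced at instantiation."""

    @abstractmethod
    def p2p_connect(self, conn_req):
        raise NotImplementedError


class Incomplete(WithABC):
    """A subclass that forgets to implement p2p_connect."""


# Without ABC, instantiation succeeds despite the abstract method.
plain_instance = WithoutABC()

# With ABC, instantiating an incomplete subclass raises TypeError.
try:
    Incomplete()
    abc_rejected = False
except TypeError:
    abc_rejected = True
```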
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,77 @@ | ||
# Copyright (c) OpenMMLab. All rights reserved. | ||
from typing import Dict | ||
|
||
from dlslime import RDMAEndpoint, available_nic | ||
|
||
from lmdeploy.disagg.backend.backend import MIGRATION_BACKENDS | ||
from lmdeploy.disagg.backend.base import MigrationBackendImpl | ||
from lmdeploy.disagg.config import DistServeEngineConfig, MigrationBackend, MigrationProtocol | ||
from lmdeploy.disagg.messages import DistServeRegisterMRMessage, MigrationAssignment | ||
from lmdeploy.disagg.request import DistServeConnectionRequest, DistServeInitRequest | ||
from lmdeploy.logger import get_logger | ||
|
||
logger = get_logger('lmdeploy') | ||
|
||
|
||
class DLSlimeMigrationManagement: | ||
|
||
def __init__(self, init_request: DistServeInitRequest): | ||
self.rank = init_request.rank | ||
self.local_engine_config: DistServeEngineConfig = init_request.local_engine_config | ||
self.remote_engine_config: DistServeEngineConfig = init_request.remote_engine_config | ||
self.endpoint: Dict[MigrationProtocol, RDMAEndpoint] = { | ||
MigrationProtocol.TCP: None, | ||
MigrationProtocol.RDMA: None, | ||
MigrationProtocol.NVLINK: None, | ||
} | ||
if init_request.rdma_config: | ||
nics = self.local_engine_config.available_nics or available_nic() | ||
device_name = nics[self.rank % len(nics)] | ||
logger.info(f'use device {device_name} for kv migration') | ||
self.endpoint[MigrationProtocol.RDMA] = RDMAEndpoint(device_name=device_name, | ||
ib_port=1, | ||
link_type=init_request.rdma_config.link_type.name) | ||
if init_request.nvlink_init_request: | ||
raise NotImplementedError | ||
if init_request.tcp_init_request: | ||
raise NotImplementedError | ||
|
||
def register_memory_region(self, register_mr_request: DistServeRegisterMRMessage): | ||
self.endpoint[register_mr_request.protocol].register_memory_region(register_mr_request.mr_key, | ||
register_mr_request.addr, | ||
register_mr_request.length) | ||
|
||
def connect_to(self, connect_request: DistServeConnectionRequest): | ||
self.endpoint[connect_request.protocol].connect_to(connect_request.remote_endpoint_info) | ||
|
||
async def p2p_migrate(self, assignment: MigrationAssignment): | ||
await self.endpoint[assignment.protocol].read_batch_async(assignment.mr_key, assignment.target_offset, | ||
assignment.source_offset, assignment.length) | ||
|
||
|
||
@MIGRATION_BACKENDS.register_module(MigrationBackend.DLSlime.name) | ||
class DLSlimeBackend(MigrationBackendImpl): | ||
|
||
def __init__(self): | ||
self.links: Dict[int, DLSlimeMigrationManagement] = {} | ||
|
||
def p2p_initialize(self, init_request: DistServeInitRequest): | ||
self.links[init_request.remote_engine_id] = DLSlimeMigrationManagement(init_request) | ||
|
||
def register_memory_region(self, register_mr_request: DistServeRegisterMRMessage): | ||
self.links[register_mr_request.remote_engine_id].register_memory_region(register_mr_request) | ||
|
||
def endpoint_info(self, remote_engine_id: int, protocol: MigrationProtocol): | ||
return self.links[remote_engine_id].endpoint[protocol].local_endpoint_info | ||
|
||
def p2p_connect(self, conn_req: DistServeConnectionRequest): | ||
self.links[conn_req.remote_engine_id].connect_to(conn_req) | ||
|
||
async def p2p_migrate(self, assignment: MigrationAssignment): | ||
await self.links[assignment.remote_engine_id].p2p_migrate(assignment) | ||
|
||
async def store(self, assignment: MigrationAssignment): | ||
raise NotImplementedError | ||
|
||
async def load(self, assignment: MigrationAssignment): | ||
raise NotImplementedError |
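The NIC selection in `DLSlimeMigrationManagement.__init__` is a round-robin over the available devices: each rank takes `nics[rank % len(nics)]`. A standalone sketch of that rule follows; the function name `pick_nic` and the device names `mlx5_0` and `mlx5_1` are made up for illustration, and the empty-list check is an addition not present in the diff.

```python
from typing import List


def pick_nic(rank: int, nics: List[str]) -> str:
    """Assign a NIC to a rank by cycling through the available devices."""
    if not nics:
        # Added guard: the modulo below would otherwise divide by zero.
        raise ValueError("no RDMA NICs available")
    return nics[rank % len(nics)]


# Four tensor-parallel ranks sharing two hypothetical NICs.
assignments = {rank: pick_nic(rank, ["mlx5_0", "mlx5_1"]) for rank in range(4)}
```

With two NICs and four ranks, ranks 0 and 2 share the first device and ranks 1 and 3 share the second.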
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
# Copyright (c) OpenMMLab. All rights reserved. | ||
from lmdeploy.disagg.backend.backend import MIGRATION_BACKENDS | ||
from lmdeploy.disagg.backend.base import MigrationBackendImpl | ||
from lmdeploy.disagg.config import MigrationBackend, MigrationProtocol | ||
from lmdeploy.disagg.messages import DistServeRegisterMRMessage, MigrationAssignment | ||
from lmdeploy.disagg.request import DistServeConnectionRequest, DistServeInitRequest | ||
|
||
|
||
@MIGRATION_BACKENDS.register_module(MigrationBackend.InfiniStore.name) | ||
class InfiniStoreBackend(MigrationBackendImpl): | ||
|
||
def p2p_initialize(self, init_request: DistServeInitRequest): | ||
Review comment: if this backend is not supported, we can remove it from choices in cli arguments |
||
raise NotImplementedError | ||
|
||
def register_memory_region(self, register_mr_request: DistServeRegisterMRMessage): | ||
raise NotImplementedError | ||
|
||
def endpoint_info(self, remote_engine_id: int, protocol: MigrationProtocol): | ||
raise NotImplementedError | ||
|
||
def p2p_connect(self, conn_req: DistServeConnectionRequest): | ||
raise NotImplementedError | ||
|
||
async def p2p_migrate(self, assignment: MigrationAssignment): | ||
raise NotImplementedError | ||
|
||
async def store(self, assignment: MigrationAssignment): | ||
raise NotImplementedError | ||
|
||
async def load(self, assignment: MigrationAssignment): | ||
raise NotImplementedError |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,31 @@ | ||
# Copyright (c) OpenMMLab. All rights reserved. | ||
from lmdeploy.disagg.backend.backend import MIGRATION_BACKENDS | ||
from lmdeploy.disagg.backend.base import MigrationBackendImpl | ||
from lmdeploy.disagg.config import MigrationBackend, MigrationProtocol | ||
from lmdeploy.disagg.messages import DistServeRegisterMRMessage, MigrationAssignment | ||
from lmdeploy.disagg.request import DistServeConnectionRequest, DistServeInitRequest | ||
|
||
|
||
@MIGRATION_BACKENDS.register_module(MigrationBackend.Mooncake.name) | ||
class MooncakeBackend(MigrationBackendImpl): | ||
|
||
def p2p_initialize(self, init_request: DistServeInitRequest): | ||
Review comment: if this backend is not supported, we can remove it from choices in cli arguments
Reply: @RunningLeon Let's keep it. @JimyMa will work with mooncake team to support it.
Reply: lgtm |
||
raise NotImplementedError | ||
|
||
def register_memory_region(self, register_mr_request: DistServeRegisterMRMessage): | ||
raise NotImplementedError | ||
|
||
def endpoint_info(self, remote_engine_id: int, protocol: MigrationProtocol): | ||
raise NotImplementedError | ||
|
||
def p2p_connect(self, conn_req: DistServeConnectionRequest): | ||
raise NotImplementedError | ||
|
||
async def p2p_migrate(self, assignment: MigrationAssignment): | ||
raise NotImplementedError | ||
|
||
async def store(self, assignment: MigrationAssignment): | ||
raise NotImplementedError | ||
|
||
async def load(self, assignment: MigrationAssignment): | ||
raise NotImplementedError |
Do we support all of them?