Commit beaba6a

fix img format
1 parent 433d7e2 commit beaba6a

File tree

1 file changed: +8 -4 lines
  • src/assets/publications/yao2024deft


src/assets/publications/yao2024deft/deft.md

+8 -4
```diff
@@ -1,6 +1,7 @@
 
 <div align="center">
-<img src="./deft.jpeg" alt="logo" width="200"></img>
+<img src="./deft.jpeg" width="200"
+/>
 </div>
 
 --------------------------------------------------------------------------------
@@ -14,6 +15,9 @@ We propose DeFT, an IO-aware attention algorithm for efficient tree-structured i
 Large language models (LLMs) are increasingly employed for complex tasks that process multiple generation calls in a tree structure with shared prefixes of tokens, including few-shot prompting, multi-step reasoning, and speculative decoding. However, existing inference systems for tree-based applications are inefficient due to improper partitioning of queries and KV cache during attention calculation. This leads to two main issues: (1) a lack of memory access (IO) reuse for the KV cache of shared prefixes, and (2) poor load balancing. As a result, there is redundant KV cache IO between GPU global memory and shared memory, along with low GPU utilization. To address these challenges, we propose DeFT (Decoding with Flash Tree-Attention), a hardware-efficient attention algorithm with prefix-aware and load-balanced KV cache partitions. DeFT reduces the number of read/write operations on the KV cache during attention calculation through KV-Guided Grouping, a method that avoids repeatedly loading the KV cache of shared prefixes in attention computation. Additionally, we propose Flattened Tree KV Splitting, a mechanism that ensures even distribution of the KV cache across partitions with little computation redundancy, enhancing GPU utilization during attention computations. By reducing 73-99% of KV cache IO and nearly 100% of the IO for partial results during attention calculation, DeFT achieves up to 2.23x/3.59x speedup in end-to-end/attention latency across three practical tree-based workloads compared to state-of-the-art attention algorithms.
 
 ## DeFT Overview
-<div align="center">
-<img src="./DeFT_overview.jpg" alt="overview" width="95%"></img>
-</div>
+
+
+
+<img src="./DeFT_overview.jpg" style="zoom:50%;"
+/>
+
```
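The abstract quoted in the diff above hinges on two mechanisms: KV-Guided Grouping, which partitions attention work by KV segment so that a shared prefix is loaded once for all queries that share it, and an exact merge of per-segment partial results. Below is a minimal NumPy sketch of that idea under stated assumptions: the two-level tree, the segment sizes, and the helper names (`partial_attn`, `merge`) are illustrative inventions, not DeFT's actual GPU kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# Toy decoding tree (illustrative sizes): one KV segment for the shared
# prefix, one small KV segment per branch, one decoding query per branch.
prefix_K = rng.standard_normal((128, d))
prefix_V = rng.standard_normal((128, d))
branch_KV = [(rng.standard_normal((16, d)), rng.standard_normal((16, d)))
             for _ in range(3)]
Q = rng.standard_normal((3, d))

def partial_attn(q, K, V):
    """Partial attention over one KV segment, kept in log-sum-exp form."""
    s = K @ q / np.sqrt(d)                  # scores against this segment
    m = s.max()
    e = np.exp(s - m)
    return m, e.sum(), e @ V                # (max, denominator, unnormalized out)

def merge(a, b):
    """Flash-style exact combination of two partials via max rescaling."""
    m = max(a[0], b[0])
    ca, cb = np.exp(a[0] - m), np.exp(b[0] - m)
    return m, ca * a[1] + cb * b[1], ca * a[2] + cb * b[2]

# KV-guided grouping, as the abstract describes it: all queries visit the
# shared prefix in one grouped pass, so its K/V are traversed once rather
# than once per branch.
Sp = Q @ prefix_K.T / np.sqrt(d)            # (3, 128) scores vs. the prefix
mp = Sp.max(axis=1)
ep = np.exp(Sp - mp[:, None])
prefix_partials = [(mp[i], ep[i].sum(), ep[i] @ prefix_V)
                   for i in range(len(Q))]

# Each branch then merges its prefix partial with its own suffix partial.
outputs = []
for pp, q, (Kb, Vb) in zip(prefix_partials, Q, branch_KV):
    m, z, o = merge(pp, partial_attn(q, Kb, Vb))
    outputs.append(o / z)                   # exact softmax over prefix + branch KV
```

The log-sum-exp merge is what keeps such split-then-merge schemes exact; per the abstract, Flattened Tree KV Splitting would further chunk these segments into evenly sized partitions to balance load across GPU compute units.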