
[DAG] shouldReduceLoadWidth - hasOneUse should check just the loaded value - not the chain #128167


Merged
merged 1 commit into llvm:main on Feb 24, 2025

Conversation

RKSimon (Collaborator) commented Feb 21, 2025:

[DAG] shouldReduceLoadWidth - hasOneUse check should just be for the loaded value, not the chain etc.
The hasOneUse check was failing in any case where the load was part of a chain - we should only be checking if the loaded value has one use, and any updates to the chain should be handled by the fold calling shouldReduceLoadWidth.
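
To make the distinction concrete, here is a minimal sketch (mine, not part of the patch), using the usual SelectionDAG convention that a load's result 0 is the loaded value and result 1 is the chain:

    // SDNode::hasOneUse() counts uses of *any* result, so a load whose output
    // chain is consumed (e.g. by a TokenFactor or a dependent store) fails the
    // check even when the loaded value itself has exactly one use:
    if (NewVT.isVector() && !Load->hasOneUse())            // old check
      return false;
    // SDNode::hasNUsesOfValue(NUses, Value) counts uses of a single result;
    // (1, 0) asks whether result 0 - the loaded value - has exactly one use:
    if (NewVT.isVector() && !Load->hasNUsesOfValue(1, 0))  // new check
      return false;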

I've updated the x86 implementation to match, although it has no effect here yet (I'm still looking at how to improve the x86 implementation) as the inner for loop was discarding chain uses anyway.

By using SDValue::hasOneUse instead, this patch exposes a missing dependency on the LLVMSelectionDAG library in a lot of tools + unittests, which resulted in having to make SDNode::hasNUsesOfValue inline.

Noticed while fighting the x86 regressions in #122671

llvmbot (Member) commented Feb 21, 2025:

@llvm/pr-subscribers-clang
@llvm/pr-subscribers-llvm-selectiondag
@llvm/pr-subscribers-llvm-ir
@llvm/pr-subscribers-backend-arm
@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-backend-amdgpu

Author: Simon Pilgrim (RKSimon)

Changes

The hasOneUse check was failing in any case where the load was part of a chain - we should only be checking if the loaded value has one use, and any updates to the chain should be handled by the fold calling shouldReduceLoadWidth.

I've updated the x86 implementation to match, although it has no effect here yet (I'm still looking at how to improve the x86 implementation) as the inner for loop was discarding chain uses anyway.

By using hasNUsesOfValue instead, this patch exposes a missing dependency on the LLVMSelectionDAG library in a lot of tools + unittests; we can either update the CMakeLists.txt dependencies or make SDNode::hasNUsesOfValue inline - no strong opinions on this, tbh.
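
For concreteness, the inline option would amount to moving the definition into the header, roughly like this (a hypothetical sketch based on the existing use-iteration pattern, not the patch itself):

    // SelectionDAGNodes.h - defining the body in the header means callers no
    // longer reference an out-of-line symbol from LLVMSelectionDAG:
    inline bool SDNode::hasNUsesOfValue(unsigned NUses, unsigned Value) const {
      assert(Value < getNumValues() && "Bad value!");
      for (SDUse &U : uses()) {
        if (U.getResNo() == Value) {
          if (NUses == 0)
            return false;
          --NUses;
        }
      }
      // True only if we saw exactly NUses uses of the requested result.
      return NUses == 0;
    }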

Noticed while fighting the x86 regressions in #122671


Patch is 48.69 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/128167.diff

18 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/TargetLowering.h (+1-1)
  • (modified) llvm/lib/Target/X86/X86ISelLowering.cpp (+2-1)
  • (modified) llvm/test/CodeGen/AArch64/merge-store.ll (+2-2)
  • (modified) llvm/test/CodeGen/AArch64/sme-framelower-use-bp.ll (+25-26)
  • (modified) llvm/test/CodeGen/AArch64/sve-fixed-length-extract-subvector.ll (+81-153)
  • (modified) llvm/test/CodeGen/AMDGPU/kernel-argument-dag-lowering.ll (+2-2)
  • (modified) llvm/test/CodeGen/ARM/vpadd.ll (+1-1)
  • (modified) llvm/test/CodeGen/RISCV/rvv/extractelt-fp.ll (+2-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-extract-subvector.ll (+100-198)
  • (modified) llvm/tools/llvm-extract/CMakeLists.txt (+1)
  • (modified) llvm/tools/llvm-reduce/CMakeLists.txt (+1)
  • (modified) llvm/unittests/Analysis/CMakeLists.txt (+1)
  • (modified) llvm/unittests/Frontend/CMakeLists.txt (+1)
  • (modified) llvm/unittests/IR/CMakeLists.txt (+1)
  • (modified) llvm/unittests/Transforms/Coroutines/CMakeLists.txt (+1)
  • (modified) llvm/unittests/Transforms/Instrumentation/CMakeLists.txt (+1)
  • (modified) llvm/unittests/Transforms/Scalar/CMakeLists.txt (+1)
  • (modified) llvm/unittests/Transforms/Utils/CMakeLists.txt (+1)
diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h
index a4c3d042fe3a4..c77f8fb11f2b8 100644
--- a/llvm/include/llvm/CodeGen/TargetLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetLowering.h
@@ -1817,7 +1817,7 @@ class TargetLoweringBase {
                                      EVT NewVT) const {
     // By default, assume that it is cheaper to extract a subvector from a wide
     // vector load rather than creating multiple narrow vector loads.
-    if (NewVT.isVector() && !Load->hasOneUse())
+    if (NewVT.isVector() && !Load->hasNUsesOfValue(1, 0))
       return false;
 
     return true;
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 683c8c3bdf96d..68ea9f6828e50 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -3260,7 +3260,8 @@ bool X86TargetLowering::shouldReduceLoadWidth(SDNode *Load,
   // those uses are extracted directly into a store, then the extract + store
   // can be store-folded. Therefore, it's probably not worth splitting the load.
   EVT VT = Load->getValueType(0);
-  if ((VT.is256BitVector() || VT.is512BitVector()) && !Load->hasOneUse()) {
+  if ((VT.is256BitVector() || VT.is512BitVector()) &&
+      !Load->hasNUsesOfValue(1, 0)) {
     for (SDUse &Use : Load->uses()) {
       // Skip uses of the chain value. Result 0 of the node is the load value.
       if (Use.getResNo() != 0)
diff --git a/llvm/test/CodeGen/AArch64/merge-store.ll b/llvm/test/CodeGen/AArch64/merge-store.ll
index 6653984562ae6..74e3a6d27d3e0 100644
--- a/llvm/test/CodeGen/AArch64/merge-store.ll
+++ b/llvm/test/CodeGen/AArch64/merge-store.ll
@@ -11,14 +11,14 @@ define void @blam() {
 ; SPLITTING-NEXT:    adrp x8, g1
 ; SPLITTING-NEXT:    add x8, x8, :lo12:g1
 ; SPLITTING-NEXT:    adrp x9, g0
-; SPLITTING-NEXT:    ldr q0, [x9, :lo12:g0]
+; SPLITTING-NEXT:    ldr d0, [x9, :lo12:g0]
 ; SPLITTING-NEXT:    str d0, [x8]
 ; SPLITTING-NEXT:    ret
 ;
 ; MISALIGNED-LABEL: blam:
 ; MISALIGNED:       // %bb.0:
 ; MISALIGNED-NEXT:    adrp x8, g0
-; MISALIGNED-NEXT:    ldr q0, [x8, :lo12:g0]
+; MISALIGNED-NEXT:    ldr d0, [x8, :lo12:g0]
 ; MISALIGNED-NEXT:    adrp x8, g1
 ; MISALIGNED-NEXT:    add x8, x8, :lo12:g1
 ; MISALIGNED-NEXT:    str d0, [x8]
diff --git a/llvm/test/CodeGen/AArch64/sme-framelower-use-bp.ll b/llvm/test/CodeGen/AArch64/sme-framelower-use-bp.ll
index 8d028c11b4a6b..15bf6a45f7541 100644
--- a/llvm/test/CodeGen/AArch64/sme-framelower-use-bp.ll
+++ b/llvm/test/CodeGen/AArch64/sme-framelower-use-bp.ll
@@ -531,18 +531,18 @@ define void @quux() #1 {
 ; CHECK-NEXT:    ldr x18, [x19, #80] // 8-byte Folded Reload
 ; CHECK-NEXT:    ldr x0, [x19, #72] // 8-byte Folded Reload
 ; CHECK-NEXT:    ldr x1, [x19, #64] // 8-byte Folded Reload
+; CHECK-NEXT:    ldr x15, [x19, #224] // 8-byte Folded Reload
 ; CHECK-NEXT:    ldr x2, [x19, #216] // 8-byte Folded Reload
 ; CHECK-NEXT:    ldr x3, [x19, #120] // 8-byte Folded Reload
 ; CHECK-NEXT:    ldr x4, [x19, #112] // 8-byte Folded Reload
 ; CHECK-NEXT:    ldr x5, [x19, #104] // 8-byte Folded Reload
 ; CHECK-NEXT:    ldr x6, [x19, #96] // 8-byte Folded Reload
-; CHECK-NEXT:    ldr x7, [x19, #224] // 8-byte Folded Reload
-; CHECK-NEXT:    ldr x20, [x19, #152] // 8-byte Folded Reload
-; CHECK-NEXT:    ldr x21, [x19, #144] // 8-byte Folded Reload
-; CHECK-NEXT:    ldr x22, [x19, #136] // 8-byte Folded Reload
-; CHECK-NEXT:    ldr x23, [x19, #128] // 8-byte Folded Reload
-; CHECK-NEXT:    ldr x16, [x19, #200] // 8-byte Folded Reload
-; CHECK-NEXT:    ldr x15, [x19, #208] // 8-byte Folded Reload
+; CHECK-NEXT:    ldr x16, [x19, #152] // 8-byte Folded Reload
+; CHECK-NEXT:    ldr x7, [x19, #144] // 8-byte Folded Reload
+; CHECK-NEXT:    ldr x20, [x19, #136] // 8-byte Folded Reload
+; CHECK-NEXT:    ldr x21, [x19, #128] // 8-byte Folded Reload
+; CHECK-NEXT:    ldr x23, [x19, #200] // 8-byte Folded Reload
+; CHECK-NEXT:    ldr x22, [x19, #208] // 8-byte Folded Reload
 ; CHECK-NEXT:    ldr x24, [x19, #192] // 8-byte Folded Reload
 ; CHECK-NEXT:    ldr x26, [x19, #176] // 8-byte Folded Reload
 ; CHECK-NEXT:    ldr x25, [x19, #184] // 8-byte Folded Reload
@@ -562,36 +562,34 @@ define void @quux() #1 {
 ; CHECK-NEXT:    add x25, x25, x27, lsl #2
 ; CHECK-NEXT:    str x25, [x26]
 ; CHECK-NEXT:    ldr p0, [x24]
-; CHECK-NEXT:    ldr x24, [x16]
+; CHECK-NEXT:    ldr x24, [x23]
 ; CHECK-NEXT:    mov p8.b, p0.b
 ; CHECK-NEXT:    ld1w { z16.s, z24.s }, pn8/z, [x24]
 ; CHECK-NEXT:    mov z0.d, z16.d
 ; CHECK-NEXT:    mov z1.d, z24.d
 ; CHECK-NEXT:    st1w { z1.s }, p2, [x13, #1, mul vl]
 ; CHECK-NEXT:    st1w { z0.s }, p2, [x13]
-; CHECK-NEXT:    ldr x24, [x15]
-; CHECK-NEXT:    ldr x15, [x16]
-; CHECK-NEXT:    add x15, x15, x24, lsl #2
-; CHECK-NEXT:    str x15, [x16]
-; CHECK-NEXT:    mov x16, x2
-; CHECK-NEXT:    incd x16
+; CHECK-NEXT:    ldr x24, [x22]
+; CHECK-NEXT:    ldr x22, [x23]
+; CHECK-NEXT:    add x22, x22, x24, lsl #2
+; CHECK-NEXT:    str x22, [x23]
 ; CHECK-NEXT:    ldr p1, [x2]
-; CHECK-NEXT:    mov x15, x7
-; CHECK-NEXT:    incd x15
-; CHECK-NEXT:    ldr p0, [x7]
+; CHECK-NEXT:    ldr p0, [x15]
 ; CHECK-NEXT:    ld1w { z1.s }, p2/z, [x14]
 ; CHECK-NEXT:    ld1w { z0.s }, p2/z, [x13]
-; CHECK-NEXT:    str p1, [x23]
-; CHECK-NEXT:    str p0, [x22]
-; CHECK-NEXT:    st1w { z1.s }, p2, [x21]
-; CHECK-NEXT:    st1w { z0.s }, p2, [x20]
-; CHECK-NEXT:    ldr p0, [x23]
-; CHECK-NEXT:    ldr p1, [x22]
-; CHECK-NEXT:    ld1w { z0.s }, p2/z, [x21]
-; CHECK-NEXT:    ld1w { z1.s }, p2/z, [x20]
+; CHECK-NEXT:    str p1, [x21]
+; CHECK-NEXT:    str p0, [x20]
+; CHECK-NEXT:    st1w { z1.s }, p2, [x7]
+; CHECK-NEXT:    st1w { z0.s }, p2, [x16]
+; CHECK-NEXT:    ldr p0, [x21]
+; CHECK-NEXT:    ldr p1, [x20]
+; CHECK-NEXT:    ld1w { z0.s }, p2/z, [x7]
+; CHECK-NEXT:    ld1w { z1.s }, p2/z, [x16]
 ; CHECK-NEXT:    fmopa za0.s, p0/m, p1/m, z0.s, z1.s
+; CHECK-NEXT:    mov x16, x2
+; CHECK-NEXT:    incd x16
 ; CHECK-NEXT:    ldr p1, [x16]
-; CHECK-NEXT:    ldr p0, [x7]
+; CHECK-NEXT:    ldr p0, [x15]
 ; CHECK-NEXT:    ld1w { z1.s }, p2/z, [x14, #1, mul vl]
 ; CHECK-NEXT:    ld1w { z0.s }, p2/z, [x13]
 ; CHECK-NEXT:    str p1, [x6]
@@ -604,6 +602,7 @@ define void @quux() #1 {
 ; CHECK-NEXT:    ld1w { z1.s }, p2/z, [x3]
 ; CHECK-NEXT:    fmopa za1.s, p0/m, p1/m, z0.s, z1.s
 ; CHECK-NEXT:    ldr p1, [x2]
+; CHECK-NEXT:    incd x15
 ; CHECK-NEXT:    ldr p0, [x15]
 ; CHECK-NEXT:    ld1w { z1.s }, p2/z, [x14]
 ; CHECK-NEXT:    ld1w { z0.s }, p2/z, [x13, #1, mul vl]
diff --git a/llvm/test/CodeGen/AArch64/sve-fixed-length-extract-subvector.ll b/llvm/test/CodeGen/AArch64/sve-fixed-length-extract-subvector.ll
index 0f8f4a6843eae..8fac0e1067684 100644
--- a/llvm/test/CodeGen/AArch64/sve-fixed-length-extract-subvector.ll
+++ b/llvm/test/CodeGen/AArch64/sve-fixed-length-extract-subvector.ll
@@ -31,9 +31,7 @@ define <8 x i8> @extract_subvector_v16i8(<16 x i8> %op) vscale_range(2,0) #0 {
 define void @extract_subvector_v32i8(ptr %a, ptr %b) vscale_range(2,0) #0 {
 ; CHECK-LABEL: extract_subvector_v32i8:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ptrue p0.b, vl32
-; CHECK-NEXT:    ld1b { z0.b }, p0/z, [x0]
-; CHECK-NEXT:    ext z0.b, z0.b, z0.b, #16
+; CHECK-NEXT:    ldr q0, [x0, #16]
 ; CHECK-NEXT:    str q0, [x1]
 ; CHECK-NEXT:    ret
   %op = load <32 x i8>, ptr %a
@@ -43,22 +41,13 @@ define void @extract_subvector_v32i8(ptr %a, ptr %b) vscale_range(2,0) #0 {
 }
 
 define void @extract_subvector_v64i8(ptr %a, ptr %b) #0 {
-; VBITS_GE_256-LABEL: extract_subvector_v64i8:
-; VBITS_GE_256:       // %bb.0:
-; VBITS_GE_256-NEXT:    ptrue p0.b, vl32
-; VBITS_GE_256-NEXT:    mov w8, #32 // =0x20
-; VBITS_GE_256-NEXT:    ld1b { z0.b }, p0/z, [x0, x8]
-; VBITS_GE_256-NEXT:    st1b { z0.b }, p0, [x1]
-; VBITS_GE_256-NEXT:    ret
-;
-; VBITS_GE_512-LABEL: extract_subvector_v64i8:
-; VBITS_GE_512:       // %bb.0:
-; VBITS_GE_512-NEXT:    ptrue p0.b, vl64
-; VBITS_GE_512-NEXT:    ld1b { z0.b }, p0/z, [x0]
-; VBITS_GE_512-NEXT:    ptrue p0.b, vl32
-; VBITS_GE_512-NEXT:    ext z0.b, z0.b, z0.b, #32
-; VBITS_GE_512-NEXT:    st1b { z0.b }, p0, [x1]
-; VBITS_GE_512-NEXT:    ret
+; CHECK-LABEL: extract_subvector_v64i8:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.b, vl32
+; CHECK-NEXT:    mov w8, #32 // =0x20
+; CHECK-NEXT:    ld1b { z0.b }, p0/z, [x0, x8]
+; CHECK-NEXT:    st1b { z0.b }, p0, [x1]
+; CHECK-NEXT:    ret
   %op = load <64 x i8>, ptr %a
   %ret = call <32 x i8> @llvm.vector.extract.v32i8.v64i8(<64 x i8> %op, i64 32)
   store <32 x i8> %ret, ptr %b
@@ -68,10 +57,9 @@ define void @extract_subvector_v64i8(ptr %a, ptr %b) #0 {
 define void @extract_subvector_v128i8(ptr %a, ptr %b) vscale_range(8,0) #0 {
 ; CHECK-LABEL: extract_subvector_v128i8:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ptrue p0.b, vl128
-; CHECK-NEXT:    ld1b { z0.b }, p0/z, [x0]
 ; CHECK-NEXT:    ptrue p0.b, vl64
-; CHECK-NEXT:    ext z0.b, z0.b, z0.b, #64
+; CHECK-NEXT:    mov w8, #64 // =0x40
+; CHECK-NEXT:    ld1b { z0.b }, p0/z, [x0, x8]
 ; CHECK-NEXT:    st1b { z0.b }, p0, [x1]
 ; CHECK-NEXT:    ret
   %op = load <128 x i8>, ptr %a
@@ -83,10 +71,9 @@ define void @extract_subvector_v128i8(ptr %a, ptr %b) vscale_range(8,0) #0 {
 define void @extract_subvector_v256i8(ptr %a, ptr %b) vscale_range(16,0) #0 {
 ; CHECK-LABEL: extract_subvector_v256i8:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ptrue p0.b, vl256
-; CHECK-NEXT:    ld1b { z0.b }, p0/z, [x0]
 ; CHECK-NEXT:    ptrue p0.b, vl128
-; CHECK-NEXT:    ext z0.b, z0.b, z0.b, #128
+; CHECK-NEXT:    mov w8, #128 // =0x80
+; CHECK-NEXT:    ld1b { z0.b }, p0/z, [x0, x8]
 ; CHECK-NEXT:    st1b { z0.b }, p0, [x1]
 ; CHECK-NEXT:    ret
   %op = load <256 x i8>, ptr %a
@@ -123,9 +110,7 @@ define <4 x i16> @extract_subvector_v8i16(<8 x i16> %op) vscale_range(2,0) #0 {
 define void @extract_subvector_v16i16(ptr %a, ptr %b) vscale_range(2,0) #0 {
 ; CHECK-LABEL: extract_subvector_v16i16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ptrue p0.h, vl16
-; CHECK-NEXT:    ld1h { z0.h }, p0/z, [x0]
-; CHECK-NEXT:    ext z0.b, z0.b, z0.b, #16
+; CHECK-NEXT:    ldr q0, [x0, #16]
 ; CHECK-NEXT:    str q0, [x1]
 ; CHECK-NEXT:    ret
   %op = load <16 x i16>, ptr %a
@@ -135,22 +120,13 @@ define void @extract_subvector_v16i16(ptr %a, ptr %b) vscale_range(2,0) #0 {
 }
 
 define void @extract_subvector_v32i16(ptr %a, ptr %b) #0 {
-; VBITS_GE_256-LABEL: extract_subvector_v32i16:
-; VBITS_GE_256:       // %bb.0:
-; VBITS_GE_256-NEXT:    ptrue p0.h, vl16
-; VBITS_GE_256-NEXT:    mov x8, #16 // =0x10
-; VBITS_GE_256-NEXT:    ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
-; VBITS_GE_256-NEXT:    st1h { z0.h }, p0, [x1]
-; VBITS_GE_256-NEXT:    ret
-;
-; VBITS_GE_512-LABEL: extract_subvector_v32i16:
-; VBITS_GE_512:       // %bb.0:
-; VBITS_GE_512-NEXT:    ptrue p0.h, vl32
-; VBITS_GE_512-NEXT:    ld1h { z0.h }, p0/z, [x0]
-; VBITS_GE_512-NEXT:    ptrue p0.h, vl16
-; VBITS_GE_512-NEXT:    ext z0.b, z0.b, z0.b, #32
-; VBITS_GE_512-NEXT:    st1h { z0.h }, p0, [x1]
-; VBITS_GE_512-NEXT:    ret
+; CHECK-LABEL: extract_subvector_v32i16:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.h, vl16
+; CHECK-NEXT:    mov x8, #16 // =0x10
+; CHECK-NEXT:    ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
+; CHECK-NEXT:    st1h { z0.h }, p0, [x1]
+; CHECK-NEXT:    ret
   %op = load <32 x i16>, ptr %a
   %ret = call <16 x i16> @llvm.vector.extract.v16i16.v32i16(<32 x i16> %op, i64 16)
   store <16 x i16> %ret, ptr %b
@@ -160,10 +136,9 @@ define void @extract_subvector_v32i16(ptr %a, ptr %b) #0 {
 define void @extract_subvector_v64i16(ptr %a, ptr %b) vscale_range(8,0) #0 {
 ; CHECK-LABEL: extract_subvector_v64i16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ptrue p0.h, vl64
-; CHECK-NEXT:    ld1h { z0.h }, p0/z, [x0]
 ; CHECK-NEXT:    ptrue p0.h, vl32
-; CHECK-NEXT:    ext z0.b, z0.b, z0.b, #64
+; CHECK-NEXT:    mov x8, #32 // =0x20
+; CHECK-NEXT:    ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
 ; CHECK-NEXT:    st1h { z0.h }, p0, [x1]
 ; CHECK-NEXT:    ret
   %op = load <64 x i16>, ptr %a
@@ -175,10 +150,9 @@ define void @extract_subvector_v64i16(ptr %a, ptr %b) vscale_range(8,0) #0 {
 define void @extract_subvector_v128i16(ptr %a, ptr %b) vscale_range(16,0) #0 {
 ; CHECK-LABEL: extract_subvector_v128i16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ptrue p0.h, vl128
-; CHECK-NEXT:    ld1h { z0.h }, p0/z, [x0]
 ; CHECK-NEXT:    ptrue p0.h, vl64
-; CHECK-NEXT:    ext z0.b, z0.b, z0.b, #128
+; CHECK-NEXT:    mov x8, #64 // =0x40
+; CHECK-NEXT:    ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
 ; CHECK-NEXT:    st1h { z0.h }, p0, [x1]
 ; CHECK-NEXT:    ret
   %op = load <128 x i16>, ptr %a
@@ -214,9 +188,7 @@ define <2 x i32> @extract_subvector_v4i32(<4 x i32> %op) vscale_range(2,0) #0 {
 define void @extract_subvector_v8i32(ptr %a, ptr %b) vscale_range(2,0) #0 {
 ; CHECK-LABEL: extract_subvector_v8i32:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ptrue p0.s, vl8
-; CHECK-NEXT:    ld1w { z0.s }, p0/z, [x0]
-; CHECK-NEXT:    ext z0.b, z0.b, z0.b, #16
+; CHECK-NEXT:    ldr q0, [x0, #16]
 ; CHECK-NEXT:    str q0, [x1]
 ; CHECK-NEXT:    ret
   %op = load <8 x i32>, ptr %a
@@ -226,22 +198,13 @@ define void @extract_subvector_v8i32(ptr %a, ptr %b) vscale_range(2,0) #0 {
 }
 
 define void @extract_subvector_v16i32(ptr %a, ptr %b) #0 {
-; VBITS_GE_256-LABEL: extract_subvector_v16i32:
-; VBITS_GE_256:       // %bb.0:
-; VBITS_GE_256-NEXT:    ptrue p0.s, vl8
-; VBITS_GE_256-NEXT:    mov x8, #8 // =0x8
-; VBITS_GE_256-NEXT:    ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
-; VBITS_GE_256-NEXT:    st1w { z0.s }, p0, [x1]
-; VBITS_GE_256-NEXT:    ret
-;
-; VBITS_GE_512-LABEL: extract_subvector_v16i32:
-; VBITS_GE_512:       // %bb.0:
-; VBITS_GE_512-NEXT:    ptrue p0.s, vl16
-; VBITS_GE_512-NEXT:    ld1w { z0.s }, p0/z, [x0]
-; VBITS_GE_512-NEXT:    ptrue p0.s, vl8
-; VBITS_GE_512-NEXT:    ext z0.b, z0.b, z0.b, #32
-; VBITS_GE_512-NEXT:    st1w { z0.s }, p0, [x1]
-; VBITS_GE_512-NEXT:    ret
+; CHECK-LABEL: extract_subvector_v16i32:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.s, vl8
+; CHECK-NEXT:    mov x8, #8 // =0x8
+; CHECK-NEXT:    ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
+; CHECK-NEXT:    st1w { z0.s }, p0, [x1]
+; CHECK-NEXT:    ret
   %op = load <16 x i32>, ptr %a
   %ret = call <8 x i32> @llvm.vector.extract.v8i32.v16i32(<16 x i32> %op, i64 8)
   store <8 x i32> %ret, ptr %b
@@ -251,10 +214,9 @@ define void @extract_subvector_v16i32(ptr %a, ptr %b) #0 {
 define void @extract_subvector_v32i32(ptr %a, ptr %b) vscale_range(8,0) #0 {
 ; CHECK-LABEL: extract_subvector_v32i32:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ptrue p0.s, vl32
-; CHECK-NEXT:    ld1w { z0.s }, p0/z, [x0]
 ; CHECK-NEXT:    ptrue p0.s, vl16
-; CHECK-NEXT:    ext z0.b, z0.b, z0.b, #64
+; CHECK-NEXT:    mov x8, #16 // =0x10
+; CHECK-NEXT:    ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
 ; CHECK-NEXT:    st1w { z0.s }, p0, [x1]
 ; CHECK-NEXT:    ret
   %op = load <32 x i32>, ptr %a
@@ -266,10 +228,9 @@ define void @extract_subvector_v32i32(ptr %a, ptr %b) vscale_range(8,0) #0 {
 define void @extract_subvector_v64i32(ptr %a, ptr %b) vscale_range(16,0) #0 {
 ; CHECK-LABEL: extract_subvector_v64i32:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ptrue p0.s, vl64
-; CHECK-NEXT:    ld1w { z0.s }, p0/z, [x0]
 ; CHECK-NEXT:    ptrue p0.s, vl32
-; CHECK-NEXT:    ext z0.b, z0.b, z0.b, #128
+; CHECK-NEXT:    mov x8, #32 // =0x20
+; CHECK-NEXT:    ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
 ; CHECK-NEXT:    st1w { z0.s }, p0, [x1]
 ; CHECK-NEXT:    ret
   %op = load <64 x i32>, ptr %a
@@ -294,9 +255,7 @@ define <1 x i64> @extract_subvector_v2i64(<2 x i64> %op) vscale_range(2,0) #0 {
 define void @extract_subvector_v4i64(ptr %a, ptr %b) vscale_range(2,0) #0 {
 ; CHECK-LABEL: extract_subvector_v4i64:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ptrue p0.d, vl4
-; CHECK-NEXT:    ld1d { z0.d }, p0/z, [x0]
-; CHECK-NEXT:    ext z0.b, z0.b, z0.b, #16
+; CHECK-NEXT:    ldr q0, [x0, #16]
 ; CHECK-NEXT:    str q0, [x1]
 ; CHECK-NEXT:    ret
   %op = load <4 x i64>, ptr %a
@@ -331,6 +290,14 @@ define void @extract_subvector_v16i64(ptr %a, ptr %b) #0 {
 ; VBITS_GE_256-NEXT:    st1d { z0.d }, p0, [x1, x8, lsl #3]
 ; VBITS_GE_256-NEXT:    st1d { z1.d }, p0, [x1]
 ; VBITS_GE_256-NEXT:    ret
+;
+; VBITS_GE_512-LABEL: extract_subvector_v16i64:
+; VBITS_GE_512:       // %bb.0:
+; VBITS_GE_512-NEXT:    ptrue p0.d, vl8
+; VBITS_GE_512-NEXT:    mov x8, #8 // =0x8
+; VBITS_GE_512-NEXT:    ld1d { z0.d }, p0/z, [x0, x8, lsl #3]
+; VBITS_GE_512-NEXT:    st1d { z0.d }, p0, [x1]
+; VBITS_GE_512-NEXT:    ret
   %op = load <16 x i64>, ptr %a
   %ret = call <8 x i64> @llvm.vector.extract.v8i64.v16i64(<16 x i64> %op, i64 8)
   store <8 x i64> %ret, ptr %b
@@ -378,9 +345,7 @@ define <4 x half> @extract_subvector_v8f16(<8 x half> %op) vscale_range(2,0) #0
 define void @extract_subvector_v16f16(ptr %a, ptr %b) vscale_range(2,0) #0 {
 ; CHECK-LABEL: extract_subvector_v16f16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ptrue p0.h, vl16
-; CHECK-NEXT:    ld1h { z0.h }, p0/z, [x0]
-; CHECK-NEXT:    ext z0.b, z0.b, z0.b, #16
+; CHECK-NEXT:    ldr q0, [x0, #16]
 ; CHECK-NEXT:    str q0, [x1]
 ; CHECK-NEXT:    ret
   %op = load <16 x half>, ptr %a
@@ -390,22 +355,13 @@ define void @extract_subvector_v16f16(ptr %a, ptr %b) vscale_range(2,0) #0 {
 }
 
 define void @extract_subvector_v32f16(ptr %a, ptr %b) #0 {
-; VBITS_GE_256-LABEL: extract_subvector_v32f16:
-; VBITS_GE_256:       // %bb.0:
-; VBITS_GE_256-NEXT:    ptrue p0.h, vl16
-; VBITS_GE_256-NEXT:    mov x8, #16 // =0x10
-; VBITS_GE_256-NEXT:    ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
-; VBITS_GE_256-NEXT:    st1h { z0.h }, p0, [x1]
-; VBITS_GE_256-NEXT:    ret
-;
-; VBITS_GE_512-LABEL: extract_subvector_v32f16:
-; VBITS_GE_512:       // %bb.0:
-; VBITS_GE_512-NEXT:    ptrue p0.h, vl32
-; VBITS_GE_512-NEXT:    ld1h { z0.h }, p0/z, [x0]
-; VBITS_GE_512-NEXT:    ptrue p0.h, vl16
-; VBITS_GE_512-NEXT:    ext z0.b, z0.b, z0.b, #32
-; VBITS_GE_512-NEXT:    st1h { z0.h }, p0, [x1]
-; VBITS_GE_512-NEXT:    ret
+; CHECK-LABEL: extract_subvector_v32f16:
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    ptrue p0.h, vl16
+; CHECK-NEXT:    mov x8, #16 // =0x10
+; CHECK-NEXT:    ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
+; CHECK-NEXT:    st1h { z0.h }, p0, [x1]
+; CHECK-NEXT:    ret
   %op = load <32 x half>, ptr %a
   %ret = call <16 x half> @llvm.vector.extract.v16f16.v32f16(<32 x half> %op, i64 16)
   store <16 x half> %ret, ptr %b
@@ -415,10 +371,9 @@ define void @extract_subvector_v32f16(ptr %a, ptr %b) #0 {
 define void @extract_subvector_v64f16(ptr %a, ptr %b) vscale_range(8,0) #0 {
 ; CHECK-LABEL: extract_subvector_v64f16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ptrue p0.h, vl64
-; CHECK-NEXT:    ld1h { z0.h }, p0/z, [x0]
 ; CHECK-NEXT:    ptrue p0.h, vl32
-; CHECK-NEXT:    ext z0.b, z0.b, z0.b, #64
+; CHECK-NEXT:    mov x8, #32 // =0x20
+; CHECK-NEXT:    ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
 ; CHECK-NEXT:    st1h { z0.h }, p0, [x1]
 ; CHECK-NEXT:    ret
   %op = load <64 x half>, ptr %a
@@ -430,10 +385,9 @@ define void @extract_subvector_v64f16(ptr %a, ptr %b) vscale_range(8,0) #0 {
 define void @extract_subvector_v128f16(ptr %a, ptr %b) vscale_range(16,0) #0 {
 ; CHECK-LABEL: extract_subvector_v128f16:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ptrue p0.h, vl128
-; CHECK-NEXT:    ld1h { z0.h }, p0/z, [x0]
 ; CHECK-NEXT:    ptrue p0.h, vl64
-; CHECK-NEXT:    ext z0.b, z0.b, z0.b, #128
+; CHECK-NEXT:    mov x8, #64 // =0x40
+; CHECK-NEXT:    ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
 ; CHECK-NEXT:    st1h { z0.h }, p0, [x1]
 ; CHECK-NEXT:    ret
   %op = load <128 x half>, ptr %a
@@ -469,9 +423,7 @@ define <2 x float> @extract_subvector_v4f32(<4 x float> %op) vscale_range(2,0) #
 define void @extract_subvector_v8f32(ptr %a, ptr %b) vscale_range(2,0) #0 {
 ; CHECK-LABEL: extract_subvector_v8f32:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    ptrue p0.s, vl8
-; CHECK-NEXT:    ld1w { z0.s }, p0/z, [x0]
-; CHECK-NEXT:    ext z0.b, z0.b, z0.b, #16
+; CHECK-NEXT:    ldr q0, [x0, #16]
 ; CHECK-NEXT:    str q0, [x1]
 ; CHECK-NEXT:    ret
   %op = load <8 x float>, ptr %a
@@ -481,22 +433,13 @@ define void @extract_subvector_v8f32(ptr %a, ptr %b) vscale_range(2,0) #0 {
 }
 
 define void @extract_subvector_v16f32(ptr %a, ptr %b) #0 {
-; VBITS_GE_256-LABEL: extract_subvector_v16f32:
-; VBITS_GE_256:       // %bb.0:
-; VBITS_GE_256-NEXT:    ptrue p0.s, vl8
-; VBITS_GE_256-NEXT:    mov x8, #8 // =0x8
-; VBITS_GE_256-NEXT:    ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
-; VBITS_GE_256-NEXT:  ...
[truncated]

RKSimon changed the title from "[DAG] shouldReduceLoadWidth - hasOneUse check for just be for the loaded value, not the chain etc." to "[DAG] shouldReduceLoadWidth - hasOneUse should check just the loaded value - not the chain" on Feb 21, 2025
@@ -14,6 +14,7 @@ set(LLVM_LINK_COMPONENTS
MC
MIRParser
Passes
SelectionDAG
Contributor:

This one seems suspicious; does it just get pulled in with the rest of CodeGen?

RKSimon (Collaborator, Author):

I might have added an unnecessary one - I was just faced with a wave of build failures and updated them all.

RKSimon (Collaborator, Author):

Ah, SelectionDAG depends on CodeGen - not the other way around - so we could drop CodeGen/CodeGenTypes and make them implicit, but I'd probably keep them, tbh.

arsenm added the llvm:SelectionDAG label on Feb 21, 2025
RKSimon (Collaborator, Author) commented Feb 21, 2025:

@arsenm any preference for updating CMakeLists or making it inline?

arsenm (Contributor) commented Feb 21, 2025:

> @arsenm any preference for updating CMakeLists or making it inline?

I don't exactly follow how only this one case needs to be inline to fix the failures. Is it an all-or-nothing problem?

RKSimon (Collaborator, Author) commented Feb 21, 2025:

> @arsenm any preference for updating CMakeLists or making it inline?

> I don't exactly follow how only this one case needs to be inline to fix the failures. Is it an all-or-nothing problem?

It's due to shouldReduceLoadWidth being virtual - if we inline SDNode::hasNUsesOfValue then the explicit dependency is hidden again. I'll stick with the CMakeLists approach, as it should stop this being a problem again in the future.
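
A sketch of why the virtual default exposes the dependency (commentary mine; assuming the usual shouldReduceLoadWidth signature with an ISD::LoadExtType parameter):

    // TargetLowering.h - the default body lives in the header, and because
    // the method is virtual it is emitted in every translation unit that
    // instantiates a TargetLoweringBase vtable:
    virtual bool shouldReduceLoadWidth(SDNode *Load, ISD::LoadExtType ExtTy,
                                       EVT NewVT) const {
      if (NewVT.isVector() && !Load->hasNUsesOfValue(1, 0))
        return false; // hasNUsesOfValue is defined in LLVMSelectionDAG
      return true;
    }
    // Each such unit therefore carries a symbol reference that only resolves
    // if the tool lists SelectionDAG in its LLVM_LINK_COMPONENTS; an inline
    // hasNUsesOfValue would remove the reference and hide the dependency
    // again.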

llvmbot added the clang label on Feb 21, 2025
@@ -1817,7 +1817,7 @@ class TargetLoweringBase {
EVT NewVT) const {
topperc (Collaborator) commented Feb 22, 2025:

Why not move this definition to TargetLowering.cpp? Do we need the body inlined for some reason?

RKSimon (Collaborator, Author):

There's an implicit dependency that I'm still trying to track down.

RKSimon force-pushed the dag-reduce-load-width-oneuse branch from 4d1a58e to c9d8251 on February 24, 2025 at 10:08
…ded value, not the chain etc.

The hasOneUse check was failing in any case where the load was part of a chain - we should only be checking if the loaded value has one use, and any updates to the chain should be handled by the fold calling shouldReduceLoadWidth.

I've updated the x86 implementation to match, although it has no effect here yet (I'm still looking at how to improve the x86 implementation) as the inner for loop was discarding chain uses anyway.

By using SDValue::hasOneUse instead, this patch exposes a missing dependency on the LLVMSelectionDAG library in a lot of tools + unittests, which resulted in having to make SDNode::hasNUsesOfValue inline.

Noticed while fighting the x86 regressions in llvm#122671
RKSimon force-pushed the dag-reduce-load-width-oneuse branch from c9d8251 to 2cf9ca4 on February 24, 2025 at 10:10
RKSimon merged commit 7de6492 into llvm:main on Feb 24, 2025
9 of 11 checks passed
RKSimon deleted the dag-reduce-load-width-oneuse branch on February 24, 2025 at 11:09