[DAG] shouldReduceLoadWidth - hasOneUse should check just the loaded value - not the chain #128167
Conversation
@llvm/pr-subscribers-clang @llvm/pr-subscribers-backend-amdgpu

Author: Simon Pilgrim (RKSimon)

Changes

The hasOneUse check was failing in any case where the load was part of a chain - we should only be checking if the loaded value has one use, and any updates to the chain should be handled by the fold calling shouldReduceLoadWidth.

I've updated the x86 implementation to match, although it has no effect here yet (I'm still looking at how to improve the x86 implementation) as the inner for loop was discarding chain uses anyway.

By using hasNUsesOfValue instead, this patch exposes a missing dependency on the LLVMSelectionDAG library in a lot of tools + unittests. We can either update the CMakeLists.txt dependencies or make SDNode::hasNUsesOfValue inline - no strong opinions on this tbh.

Noticed while fighting the x86 regressions in #122671

Patch is 48.69 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/128167.diff

18 Files Affected:
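For context, a minimal sketch of the distinction at the heart of this patch (the helper function below is illustrative, not part of the change): a load node produces two results, and `SDNode::hasOneUse()` counts uses of both, while `hasNUsesOfValue` counts uses of a single result.

```cpp
#include "llvm/CodeGen/SelectionDAGNodes.h"
using namespace llvm;

// A LoadSDNode yields two results: result 0 is the loaded value, result 1
// is the output chain. SDNode::hasOneUse() counts uses of either result,
// so a load whose chain feeds a later node (e.g. a store) fails the check
// even when the loaded value itself has a single consumer.
static bool loadedValueHasOneUse(SDNode *Load) {
  // Exactly one use of result number 0; chain uses are ignored.
  return Load->hasNUsesOfValue(1, /*Value=*/0);
}
```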
diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h
index a4c3d042fe3a4..c77f8fb11f2b8 100644
--- a/llvm/include/llvm/CodeGen/TargetLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetLowering.h
@@ -1817,7 +1817,7 @@ class TargetLoweringBase {
EVT NewVT) const {
// By default, assume that it is cheaper to extract a subvector from a wide
// vector load rather than creating multiple narrow vector loads.
- if (NewVT.isVector() && !Load->hasOneUse())
+ if (NewVT.isVector() && !Load->hasNUsesOfValue(1, 0))
return false;
return true;
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 683c8c3bdf96d..68ea9f6828e50 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -3260,7 +3260,8 @@ bool X86TargetLowering::shouldReduceLoadWidth(SDNode *Load,
// those uses are extracted directly into a store, then the extract + store
// can be store-folded. Therefore, it's probably not worth splitting the load.
EVT VT = Load->getValueType(0);
- if ((VT.is256BitVector() || VT.is512BitVector()) && !Load->hasOneUse()) {
+ if ((VT.is256BitVector() || VT.is512BitVector()) &&
+ !Load->hasNUsesOfValue(1, 0)) {
for (SDUse &Use : Load->uses()) {
// Skip uses of the chain value. Result 0 of the node is the load value.
if (Use.getResNo() != 0)
diff --git a/llvm/test/CodeGen/AArch64/merge-store.ll b/llvm/test/CodeGen/AArch64/merge-store.ll
index 6653984562ae6..74e3a6d27d3e0 100644
--- a/llvm/test/CodeGen/AArch64/merge-store.ll
+++ b/llvm/test/CodeGen/AArch64/merge-store.ll
@@ -11,14 +11,14 @@ define void @blam() {
; SPLITTING-NEXT: adrp x8, g1
; SPLITTING-NEXT: add x8, x8, :lo12:g1
; SPLITTING-NEXT: adrp x9, g0
-; SPLITTING-NEXT: ldr q0, [x9, :lo12:g0]
+; SPLITTING-NEXT: ldr d0, [x9, :lo12:g0]
; SPLITTING-NEXT: str d0, [x8]
; SPLITTING-NEXT: ret
;
; MISALIGNED-LABEL: blam:
; MISALIGNED: // %bb.0:
; MISALIGNED-NEXT: adrp x8, g0
-; MISALIGNED-NEXT: ldr q0, [x8, :lo12:g0]
+; MISALIGNED-NEXT: ldr d0, [x8, :lo12:g0]
; MISALIGNED-NEXT: adrp x8, g1
; MISALIGNED-NEXT: add x8, x8, :lo12:g1
; MISALIGNED-NEXT: str d0, [x8]
diff --git a/llvm/test/CodeGen/AArch64/sme-framelower-use-bp.ll b/llvm/test/CodeGen/AArch64/sme-framelower-use-bp.ll
index 8d028c11b4a6b..15bf6a45f7541 100644
--- a/llvm/test/CodeGen/AArch64/sme-framelower-use-bp.ll
+++ b/llvm/test/CodeGen/AArch64/sme-framelower-use-bp.ll
@@ -531,18 +531,18 @@ define void @quux() #1 {
; CHECK-NEXT: ldr x18, [x19, #80] // 8-byte Folded Reload
; CHECK-NEXT: ldr x0, [x19, #72] // 8-byte Folded Reload
; CHECK-NEXT: ldr x1, [x19, #64] // 8-byte Folded Reload
+; CHECK-NEXT: ldr x15, [x19, #224] // 8-byte Folded Reload
; CHECK-NEXT: ldr x2, [x19, #216] // 8-byte Folded Reload
; CHECK-NEXT: ldr x3, [x19, #120] // 8-byte Folded Reload
; CHECK-NEXT: ldr x4, [x19, #112] // 8-byte Folded Reload
; CHECK-NEXT: ldr x5, [x19, #104] // 8-byte Folded Reload
; CHECK-NEXT: ldr x6, [x19, #96] // 8-byte Folded Reload
-; CHECK-NEXT: ldr x7, [x19, #224] // 8-byte Folded Reload
-; CHECK-NEXT: ldr x20, [x19, #152] // 8-byte Folded Reload
-; CHECK-NEXT: ldr x21, [x19, #144] // 8-byte Folded Reload
-; CHECK-NEXT: ldr x22, [x19, #136] // 8-byte Folded Reload
-; CHECK-NEXT: ldr x23, [x19, #128] // 8-byte Folded Reload
-; CHECK-NEXT: ldr x16, [x19, #200] // 8-byte Folded Reload
-; CHECK-NEXT: ldr x15, [x19, #208] // 8-byte Folded Reload
+; CHECK-NEXT: ldr x16, [x19, #152] // 8-byte Folded Reload
+; CHECK-NEXT: ldr x7, [x19, #144] // 8-byte Folded Reload
+; CHECK-NEXT: ldr x20, [x19, #136] // 8-byte Folded Reload
+; CHECK-NEXT: ldr x21, [x19, #128] // 8-byte Folded Reload
+; CHECK-NEXT: ldr x23, [x19, #200] // 8-byte Folded Reload
+; CHECK-NEXT: ldr x22, [x19, #208] // 8-byte Folded Reload
; CHECK-NEXT: ldr x24, [x19, #192] // 8-byte Folded Reload
; CHECK-NEXT: ldr x26, [x19, #176] // 8-byte Folded Reload
; CHECK-NEXT: ldr x25, [x19, #184] // 8-byte Folded Reload
@@ -562,36 +562,34 @@ define void @quux() #1 {
; CHECK-NEXT: add x25, x25, x27, lsl #2
; CHECK-NEXT: str x25, [x26]
; CHECK-NEXT: ldr p0, [x24]
-; CHECK-NEXT: ldr x24, [x16]
+; CHECK-NEXT: ldr x24, [x23]
; CHECK-NEXT: mov p8.b, p0.b
; CHECK-NEXT: ld1w { z16.s, z24.s }, pn8/z, [x24]
; CHECK-NEXT: mov z0.d, z16.d
; CHECK-NEXT: mov z1.d, z24.d
; CHECK-NEXT: st1w { z1.s }, p2, [x13, #1, mul vl]
; CHECK-NEXT: st1w { z0.s }, p2, [x13]
-; CHECK-NEXT: ldr x24, [x15]
-; CHECK-NEXT: ldr x15, [x16]
-; CHECK-NEXT: add x15, x15, x24, lsl #2
-; CHECK-NEXT: str x15, [x16]
-; CHECK-NEXT: mov x16, x2
-; CHECK-NEXT: incd x16
+; CHECK-NEXT: ldr x24, [x22]
+; CHECK-NEXT: ldr x22, [x23]
+; CHECK-NEXT: add x22, x22, x24, lsl #2
+; CHECK-NEXT: str x22, [x23]
; CHECK-NEXT: ldr p1, [x2]
-; CHECK-NEXT: mov x15, x7
-; CHECK-NEXT: incd x15
-; CHECK-NEXT: ldr p0, [x7]
+; CHECK-NEXT: ldr p0, [x15]
; CHECK-NEXT: ld1w { z1.s }, p2/z, [x14]
; CHECK-NEXT: ld1w { z0.s }, p2/z, [x13]
-; CHECK-NEXT: str p1, [x23]
-; CHECK-NEXT: str p0, [x22]
-; CHECK-NEXT: st1w { z1.s }, p2, [x21]
-; CHECK-NEXT: st1w { z0.s }, p2, [x20]
-; CHECK-NEXT: ldr p0, [x23]
-; CHECK-NEXT: ldr p1, [x22]
-; CHECK-NEXT: ld1w { z0.s }, p2/z, [x21]
-; CHECK-NEXT: ld1w { z1.s }, p2/z, [x20]
+; CHECK-NEXT: str p1, [x21]
+; CHECK-NEXT: str p0, [x20]
+; CHECK-NEXT: st1w { z1.s }, p2, [x7]
+; CHECK-NEXT: st1w { z0.s }, p2, [x16]
+; CHECK-NEXT: ldr p0, [x21]
+; CHECK-NEXT: ldr p1, [x20]
+; CHECK-NEXT: ld1w { z0.s }, p2/z, [x7]
+; CHECK-NEXT: ld1w { z1.s }, p2/z, [x16]
; CHECK-NEXT: fmopa za0.s, p0/m, p1/m, z0.s, z1.s
+; CHECK-NEXT: mov x16, x2
+; CHECK-NEXT: incd x16
; CHECK-NEXT: ldr p1, [x16]
-; CHECK-NEXT: ldr p0, [x7]
+; CHECK-NEXT: ldr p0, [x15]
; CHECK-NEXT: ld1w { z1.s }, p2/z, [x14, #1, mul vl]
; CHECK-NEXT: ld1w { z0.s }, p2/z, [x13]
; CHECK-NEXT: str p1, [x6]
@@ -604,6 +602,7 @@ define void @quux() #1 {
; CHECK-NEXT: ld1w { z1.s }, p2/z, [x3]
; CHECK-NEXT: fmopa za1.s, p0/m, p1/m, z0.s, z1.s
; CHECK-NEXT: ldr p1, [x2]
+; CHECK-NEXT: incd x15
; CHECK-NEXT: ldr p0, [x15]
; CHECK-NEXT: ld1w { z1.s }, p2/z, [x14]
; CHECK-NEXT: ld1w { z0.s }, p2/z, [x13, #1, mul vl]
diff --git a/llvm/test/CodeGen/AArch64/sve-fixed-length-extract-subvector.ll b/llvm/test/CodeGen/AArch64/sve-fixed-length-extract-subvector.ll
index 0f8f4a6843eae..8fac0e1067684 100644
--- a/llvm/test/CodeGen/AArch64/sve-fixed-length-extract-subvector.ll
+++ b/llvm/test/CodeGen/AArch64/sve-fixed-length-extract-subvector.ll
@@ -31,9 +31,7 @@ define <8 x i8> @extract_subvector_v16i8(<16 x i8> %op) vscale_range(2,0) #0 {
define void @extract_subvector_v32i8(ptr %a, ptr %b) vscale_range(2,0) #0 {
; CHECK-LABEL: extract_subvector_v32i8:
; CHECK: // %bb.0:
-; CHECK-NEXT: ptrue p0.b, vl32
-; CHECK-NEXT: ld1b { z0.b }, p0/z, [x0]
-; CHECK-NEXT: ext z0.b, z0.b, z0.b, #16
+; CHECK-NEXT: ldr q0, [x0, #16]
; CHECK-NEXT: str q0, [x1]
; CHECK-NEXT: ret
%op = load <32 x i8>, ptr %a
@@ -43,22 +41,13 @@ define void @extract_subvector_v32i8(ptr %a, ptr %b) vscale_range(2,0) #0 {
}
define void @extract_subvector_v64i8(ptr %a, ptr %b) #0 {
-; VBITS_GE_256-LABEL: extract_subvector_v64i8:
-; VBITS_GE_256: // %bb.0:
-; VBITS_GE_256-NEXT: ptrue p0.b, vl32
-; VBITS_GE_256-NEXT: mov w8, #32 // =0x20
-; VBITS_GE_256-NEXT: ld1b { z0.b }, p0/z, [x0, x8]
-; VBITS_GE_256-NEXT: st1b { z0.b }, p0, [x1]
-; VBITS_GE_256-NEXT: ret
-;
-; VBITS_GE_512-LABEL: extract_subvector_v64i8:
-; VBITS_GE_512: // %bb.0:
-; VBITS_GE_512-NEXT: ptrue p0.b, vl64
-; VBITS_GE_512-NEXT: ld1b { z0.b }, p0/z, [x0]
-; VBITS_GE_512-NEXT: ptrue p0.b, vl32
-; VBITS_GE_512-NEXT: ext z0.b, z0.b, z0.b, #32
-; VBITS_GE_512-NEXT: st1b { z0.b }, p0, [x1]
-; VBITS_GE_512-NEXT: ret
+; CHECK-LABEL: extract_subvector_v64i8:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ptrue p0.b, vl32
+; CHECK-NEXT: mov w8, #32 // =0x20
+; CHECK-NEXT: ld1b { z0.b }, p0/z, [x0, x8]
+; CHECK-NEXT: st1b { z0.b }, p0, [x1]
+; CHECK-NEXT: ret
%op = load <64 x i8>, ptr %a
%ret = call <32 x i8> @llvm.vector.extract.v32i8.v64i8(<64 x i8> %op, i64 32)
store <32 x i8> %ret, ptr %b
@@ -68,10 +57,9 @@ define void @extract_subvector_v64i8(ptr %a, ptr %b) #0 {
define void @extract_subvector_v128i8(ptr %a, ptr %b) vscale_range(8,0) #0 {
; CHECK-LABEL: extract_subvector_v128i8:
; CHECK: // %bb.0:
-; CHECK-NEXT: ptrue p0.b, vl128
-; CHECK-NEXT: ld1b { z0.b }, p0/z, [x0]
; CHECK-NEXT: ptrue p0.b, vl64
-; CHECK-NEXT: ext z0.b, z0.b, z0.b, #64
+; CHECK-NEXT: mov w8, #64 // =0x40
+; CHECK-NEXT: ld1b { z0.b }, p0/z, [x0, x8]
; CHECK-NEXT: st1b { z0.b }, p0, [x1]
; CHECK-NEXT: ret
%op = load <128 x i8>, ptr %a
@@ -83,10 +71,9 @@ define void @extract_subvector_v128i8(ptr %a, ptr %b) vscale_range(8,0) #0 {
define void @extract_subvector_v256i8(ptr %a, ptr %b) vscale_range(16,0) #0 {
; CHECK-LABEL: extract_subvector_v256i8:
; CHECK: // %bb.0:
-; CHECK-NEXT: ptrue p0.b, vl256
-; CHECK-NEXT: ld1b { z0.b }, p0/z, [x0]
; CHECK-NEXT: ptrue p0.b, vl128
-; CHECK-NEXT: ext z0.b, z0.b, z0.b, #128
+; CHECK-NEXT: mov w8, #128 // =0x80
+; CHECK-NEXT: ld1b { z0.b }, p0/z, [x0, x8]
; CHECK-NEXT: st1b { z0.b }, p0, [x1]
; CHECK-NEXT: ret
%op = load <256 x i8>, ptr %a
@@ -123,9 +110,7 @@ define <4 x i16> @extract_subvector_v8i16(<8 x i16> %op) vscale_range(2,0) #0 {
define void @extract_subvector_v16i16(ptr %a, ptr %b) vscale_range(2,0) #0 {
; CHECK-LABEL: extract_subvector_v16i16:
; CHECK: // %bb.0:
-; CHECK-NEXT: ptrue p0.h, vl16
-; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0]
-; CHECK-NEXT: ext z0.b, z0.b, z0.b, #16
+; CHECK-NEXT: ldr q0, [x0, #16]
; CHECK-NEXT: str q0, [x1]
; CHECK-NEXT: ret
%op = load <16 x i16>, ptr %a
@@ -135,22 +120,13 @@ define void @extract_subvector_v16i16(ptr %a, ptr %b) vscale_range(2,0) #0 {
}
define void @extract_subvector_v32i16(ptr %a, ptr %b) #0 {
-; VBITS_GE_256-LABEL: extract_subvector_v32i16:
-; VBITS_GE_256: // %bb.0:
-; VBITS_GE_256-NEXT: ptrue p0.h, vl16
-; VBITS_GE_256-NEXT: mov x8, #16 // =0x10
-; VBITS_GE_256-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
-; VBITS_GE_256-NEXT: st1h { z0.h }, p0, [x1]
-; VBITS_GE_256-NEXT: ret
-;
-; VBITS_GE_512-LABEL: extract_subvector_v32i16:
-; VBITS_GE_512: // %bb.0:
-; VBITS_GE_512-NEXT: ptrue p0.h, vl32
-; VBITS_GE_512-NEXT: ld1h { z0.h }, p0/z, [x0]
-; VBITS_GE_512-NEXT: ptrue p0.h, vl16
-; VBITS_GE_512-NEXT: ext z0.b, z0.b, z0.b, #32
-; VBITS_GE_512-NEXT: st1h { z0.h }, p0, [x1]
-; VBITS_GE_512-NEXT: ret
+; CHECK-LABEL: extract_subvector_v32i16:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ptrue p0.h, vl16
+; CHECK-NEXT: mov x8, #16 // =0x10
+; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
+; CHECK-NEXT: st1h { z0.h }, p0, [x1]
+; CHECK-NEXT: ret
%op = load <32 x i16>, ptr %a
%ret = call <16 x i16> @llvm.vector.extract.v16i16.v32i16(<32 x i16> %op, i64 16)
store <16 x i16> %ret, ptr %b
@@ -160,10 +136,9 @@ define void @extract_subvector_v32i16(ptr %a, ptr %b) #0 {
define void @extract_subvector_v64i16(ptr %a, ptr %b) vscale_range(8,0) #0 {
; CHECK-LABEL: extract_subvector_v64i16:
; CHECK: // %bb.0:
-; CHECK-NEXT: ptrue p0.h, vl64
-; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0]
; CHECK-NEXT: ptrue p0.h, vl32
-; CHECK-NEXT: ext z0.b, z0.b, z0.b, #64
+; CHECK-NEXT: mov x8, #32 // =0x20
+; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
; CHECK-NEXT: st1h { z0.h }, p0, [x1]
; CHECK-NEXT: ret
%op = load <64 x i16>, ptr %a
@@ -175,10 +150,9 @@ define void @extract_subvector_v64i16(ptr %a, ptr %b) vscale_range(8,0) #0 {
define void @extract_subvector_v128i16(ptr %a, ptr %b) vscale_range(16,0) #0 {
; CHECK-LABEL: extract_subvector_v128i16:
; CHECK: // %bb.0:
-; CHECK-NEXT: ptrue p0.h, vl128
-; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0]
; CHECK-NEXT: ptrue p0.h, vl64
-; CHECK-NEXT: ext z0.b, z0.b, z0.b, #128
+; CHECK-NEXT: mov x8, #64 // =0x40
+; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
; CHECK-NEXT: st1h { z0.h }, p0, [x1]
; CHECK-NEXT: ret
%op = load <128 x i16>, ptr %a
@@ -214,9 +188,7 @@ define <2 x i32> @extract_subvector_v4i32(<4 x i32> %op) vscale_range(2,0) #0 {
define void @extract_subvector_v8i32(ptr %a, ptr %b) vscale_range(2,0) #0 {
; CHECK-LABEL: extract_subvector_v8i32:
; CHECK: // %bb.0:
-; CHECK-NEXT: ptrue p0.s, vl8
-; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]
-; CHECK-NEXT: ext z0.b, z0.b, z0.b, #16
+; CHECK-NEXT: ldr q0, [x0, #16]
; CHECK-NEXT: str q0, [x1]
; CHECK-NEXT: ret
%op = load <8 x i32>, ptr %a
@@ -226,22 +198,13 @@ define void @extract_subvector_v8i32(ptr %a, ptr %b) vscale_range(2,0) #0 {
}
define void @extract_subvector_v16i32(ptr %a, ptr %b) #0 {
-; VBITS_GE_256-LABEL: extract_subvector_v16i32:
-; VBITS_GE_256: // %bb.0:
-; VBITS_GE_256-NEXT: ptrue p0.s, vl8
-; VBITS_GE_256-NEXT: mov x8, #8 // =0x8
-; VBITS_GE_256-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
-; VBITS_GE_256-NEXT: st1w { z0.s }, p0, [x1]
-; VBITS_GE_256-NEXT: ret
-;
-; VBITS_GE_512-LABEL: extract_subvector_v16i32:
-; VBITS_GE_512: // %bb.0:
-; VBITS_GE_512-NEXT: ptrue p0.s, vl16
-; VBITS_GE_512-NEXT: ld1w { z0.s }, p0/z, [x0]
-; VBITS_GE_512-NEXT: ptrue p0.s, vl8
-; VBITS_GE_512-NEXT: ext z0.b, z0.b, z0.b, #32
-; VBITS_GE_512-NEXT: st1w { z0.s }, p0, [x1]
-; VBITS_GE_512-NEXT: ret
+; CHECK-LABEL: extract_subvector_v16i32:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ptrue p0.s, vl8
+; CHECK-NEXT: mov x8, #8 // =0x8
+; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
+; CHECK-NEXT: st1w { z0.s }, p0, [x1]
+; CHECK-NEXT: ret
%op = load <16 x i32>, ptr %a
%ret = call <8 x i32> @llvm.vector.extract.v8i32.v16i32(<16 x i32> %op, i64 8)
store <8 x i32> %ret, ptr %b
@@ -251,10 +214,9 @@ define void @extract_subvector_v16i32(ptr %a, ptr %b) #0 {
define void @extract_subvector_v32i32(ptr %a, ptr %b) vscale_range(8,0) #0 {
; CHECK-LABEL: extract_subvector_v32i32:
; CHECK: // %bb.0:
-; CHECK-NEXT: ptrue p0.s, vl32
-; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]
; CHECK-NEXT: ptrue p0.s, vl16
-; CHECK-NEXT: ext z0.b, z0.b, z0.b, #64
+; CHECK-NEXT: mov x8, #16 // =0x10
+; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
; CHECK-NEXT: st1w { z0.s }, p0, [x1]
; CHECK-NEXT: ret
%op = load <32 x i32>, ptr %a
@@ -266,10 +228,9 @@ define void @extract_subvector_v32i32(ptr %a, ptr %b) vscale_range(8,0) #0 {
define void @extract_subvector_v64i32(ptr %a, ptr %b) vscale_range(16,0) #0 {
; CHECK-LABEL: extract_subvector_v64i32:
; CHECK: // %bb.0:
-; CHECK-NEXT: ptrue p0.s, vl64
-; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]
; CHECK-NEXT: ptrue p0.s, vl32
-; CHECK-NEXT: ext z0.b, z0.b, z0.b, #128
+; CHECK-NEXT: mov x8, #32 // =0x20
+; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
; CHECK-NEXT: st1w { z0.s }, p0, [x1]
; CHECK-NEXT: ret
%op = load <64 x i32>, ptr %a
@@ -294,9 +255,7 @@ define <1 x i64> @extract_subvector_v2i64(<2 x i64> %op) vscale_range(2,0) #0 {
define void @extract_subvector_v4i64(ptr %a, ptr %b) vscale_range(2,0) #0 {
; CHECK-LABEL: extract_subvector_v4i64:
; CHECK: // %bb.0:
-; CHECK-NEXT: ptrue p0.d, vl4
-; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0]
-; CHECK-NEXT: ext z0.b, z0.b, z0.b, #16
+; CHECK-NEXT: ldr q0, [x0, #16]
; CHECK-NEXT: str q0, [x1]
; CHECK-NEXT: ret
%op = load <4 x i64>, ptr %a
@@ -331,6 +290,14 @@ define void @extract_subvector_v16i64(ptr %a, ptr %b) #0 {
; VBITS_GE_256-NEXT: st1d { z0.d }, p0, [x1, x8, lsl #3]
; VBITS_GE_256-NEXT: st1d { z1.d }, p0, [x1]
; VBITS_GE_256-NEXT: ret
+;
+; VBITS_GE_512-LABEL: extract_subvector_v16i64:
+; VBITS_GE_512: // %bb.0:
+; VBITS_GE_512-NEXT: ptrue p0.d, vl8
+; VBITS_GE_512-NEXT: mov x8, #8 // =0x8
+; VBITS_GE_512-NEXT: ld1d { z0.d }, p0/z, [x0, x8, lsl #3]
+; VBITS_GE_512-NEXT: st1d { z0.d }, p0, [x1]
+; VBITS_GE_512-NEXT: ret
%op = load <16 x i64>, ptr %a
%ret = call <8 x i64> @llvm.vector.extract.v8i64.v16i64(<16 x i64> %op, i64 8)
store <8 x i64> %ret, ptr %b
@@ -378,9 +345,7 @@ define <4 x half> @extract_subvector_v8f16(<8 x half> %op) vscale_range(2,0) #0
define void @extract_subvector_v16f16(ptr %a, ptr %b) vscale_range(2,0) #0 {
; CHECK-LABEL: extract_subvector_v16f16:
; CHECK: // %bb.0:
-; CHECK-NEXT: ptrue p0.h, vl16
-; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0]
-; CHECK-NEXT: ext z0.b, z0.b, z0.b, #16
+; CHECK-NEXT: ldr q0, [x0, #16]
; CHECK-NEXT: str q0, [x1]
; CHECK-NEXT: ret
%op = load <16 x half>, ptr %a
@@ -390,22 +355,13 @@ define void @extract_subvector_v16f16(ptr %a, ptr %b) vscale_range(2,0) #0 {
}
define void @extract_subvector_v32f16(ptr %a, ptr %b) #0 {
-; VBITS_GE_256-LABEL: extract_subvector_v32f16:
-; VBITS_GE_256: // %bb.0:
-; VBITS_GE_256-NEXT: ptrue p0.h, vl16
-; VBITS_GE_256-NEXT: mov x8, #16 // =0x10
-; VBITS_GE_256-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
-; VBITS_GE_256-NEXT: st1h { z0.h }, p0, [x1]
-; VBITS_GE_256-NEXT: ret
-;
-; VBITS_GE_512-LABEL: extract_subvector_v32f16:
-; VBITS_GE_512: // %bb.0:
-; VBITS_GE_512-NEXT: ptrue p0.h, vl32
-; VBITS_GE_512-NEXT: ld1h { z0.h }, p0/z, [x0]
-; VBITS_GE_512-NEXT: ptrue p0.h, vl16
-; VBITS_GE_512-NEXT: ext z0.b, z0.b, z0.b, #32
-; VBITS_GE_512-NEXT: st1h { z0.h }, p0, [x1]
-; VBITS_GE_512-NEXT: ret
+; CHECK-LABEL: extract_subvector_v32f16:
+; CHECK: // %bb.0:
+; CHECK-NEXT: ptrue p0.h, vl16
+; CHECK-NEXT: mov x8, #16 // =0x10
+; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
+; CHECK-NEXT: st1h { z0.h }, p0, [x1]
+; CHECK-NEXT: ret
%op = load <32 x half>, ptr %a
%ret = call <16 x half> @llvm.vector.extract.v16f16.v32f16(<32 x half> %op, i64 16)
store <16 x half> %ret, ptr %b
@@ -415,10 +371,9 @@ define void @extract_subvector_v32f16(ptr %a, ptr %b) #0 {
define void @extract_subvector_v64f16(ptr %a, ptr %b) vscale_range(8,0) #0 {
; CHECK-LABEL: extract_subvector_v64f16:
; CHECK: // %bb.0:
-; CHECK-NEXT: ptrue p0.h, vl64
-; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0]
; CHECK-NEXT: ptrue p0.h, vl32
-; CHECK-NEXT: ext z0.b, z0.b, z0.b, #64
+; CHECK-NEXT: mov x8, #32 // =0x20
+; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
; CHECK-NEXT: st1h { z0.h }, p0, [x1]
; CHECK-NEXT: ret
%op = load <64 x half>, ptr %a
@@ -430,10 +385,9 @@ define void @extract_subvector_v64f16(ptr %a, ptr %b) vscale_range(8,0) #0 {
define void @extract_subvector_v128f16(ptr %a, ptr %b) vscale_range(16,0) #0 {
; CHECK-LABEL: extract_subvector_v128f16:
; CHECK: // %bb.0:
-; CHECK-NEXT: ptrue p0.h, vl128
-; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0]
; CHECK-NEXT: ptrue p0.h, vl64
-; CHECK-NEXT: ext z0.b, z0.b, z0.b, #128
+; CHECK-NEXT: mov x8, #64 // =0x40
+; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
; CHECK-NEXT: st1h { z0.h }, p0, [x1]
; CHECK-NEXT: ret
%op = load <128 x half>, ptr %a
@@ -469,9 +423,7 @@ define <2 x float> @extract_subvector_v4f32(<4 x float> %op) vscale_range(2,0) #
define void @extract_subvector_v8f32(ptr %a, ptr %b) vscale_range(2,0) #0 {
; CHECK-LABEL: extract_subvector_v8f32:
; CHECK: // %bb.0:
-; CHECK-NEXT: ptrue p0.s, vl8
-; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]
-; CHECK-NEXT: ext z0.b, z0.b, z0.b, #16
+; CHECK-NEXT: ldr q0, [x0, #16]
; CHECK-NEXT: str q0, [x1]
; CHECK-NEXT: ret
%op = load <8 x float>, ptr %a
@@ -481,22 +433,13 @@ define void @extract_subvector_v8f32(ptr %a, ptr %b) vscale_range(2,0) #0 {
}
define void @extract_subvector_v16f32(ptr %a, ptr %b) #0 {
-; VBITS_GE_256-LABEL: extract_subvector_v16f32:
-; VBITS_GE_256: // %bb.0:
-; VBITS_GE_256-NEXT: ptrue p0.s, vl8
-; VBITS_GE_256-NEXT: mov x8, #8 // =0x8
-; VBITS_GE_256-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
-; VBITS_GE_256-NEXT: ...
[truncated]
@@ -14,6 +14,7 @@ set(LLVM_LINK_COMPONENTS
   MC
   MIRParser
   Passes
+  SelectionDAG
This one seems suspicious, does it just get pulled in with the rest of codegen?
I might have added an unnecessary one - I was just faced with a wave of build failures and updated them all.
Ah, SelectionDAG depends on CodeGen - not the other way around - so we could drop CodeGen/CodeGenTypes and make them implicit, but I'd probably keep them tbh
@arsenm any preference for updating CMakeLists or making it inline?
I don't exactly follow how only this one case needs to be inline to fix the failures. It's an all or nothing problem?
It's due to shouldReduceLoadWidth being virtual - if we inline SDNode::hasNUsesOfValue then the explicit dependency is hidden again - I'll stick with the CMakeLists approach as it should stop this being a problem again in the future.
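For reference, a rough sketch of what the inline alternative discussed here might have looked like (the body is illustrative; the real definition lives in SelectionDAG.cpp):

```cpp
// Hypothetical inline version in SelectionDAGNodes.h - keeping the body in
// the header would hide the LLVMSelectionDAG link dependency again, which
// is why the CMakeLists route was chosen instead.
inline bool SDNode::hasNUsesOfValue(unsigned NUses, unsigned Value) const {
  assert(Value < getNumValues() && "Bad value!");
  // Count uses of the requested result number only.
  for (const SDUse &U : uses()) {
    if (U.getResNo() != Value)
      continue;
    if (NUses == 0)
      return false; // More uses than requested.
    --NUses;
  }
  return NUses == 0; // True iff exactly NUses uses were seen.
}
```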
@@ -1817,7 +1817,7 @@ class TargetLoweringBase {
EVT NewVT) const {
Why not move this definition to TargetLowering.cpp? Do we need the body inlined for some reason?
There's an implicit dependency that I'm still trying to track down.
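For illustration, the out-of-line alternative raised above would look roughly like this - a sketch only; the parameter names other than NewVT are assumptions, not taken from the header diff shown earlier:

```cpp
// TargetLowering.cpp - moving the body here would confine the SDNode use
// to a file that already links SelectionDAG, at the cost of an out-of-line
// call for every target that uses the default policy.
bool TargetLoweringBase::shouldReduceLoadWidth(SDNode *Load,
                                               ISD::LoadExtType ExtTy,
                                               EVT NewVT) const {
  // Same default as the inline version in the diff above: splitting a wide
  // vector load is only worthwhile when the loaded value has a single use.
  if (NewVT.isVector() && !Load->hasNUsesOfValue(1, 0))
    return false;
  return true;
}
```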
Force-pushed from 4d1a58e to c9d8251
…ded value, not the chain etc.

The hasOneUse check was failing in any case where the load was part of a chain - we should only be checking if the loaded value has one use, and any updates to the chain should be handled by the fold calling shouldReduceLoadWidth.

I've updated the x86 implementation to match, although it has no effect here yet (I'm still looking at how to improve the x86 implementation) as the inner for loop was discarding chain uses anyway.

By using SDValue::hasOneUse instead this patch exposes a missing dependency on the LLVMSelectionDAG library in a lot of tools + unittests, which resulted in having to make SDNode::hasNUsesOfValue inline.

Noticed while fighting the x86 regressions in llvm#122671
Force-pushed from c9d8251 to 2cf9ca4
Compare
[DAG] shouldReduceLoadWidth - hasOneUse check should just be for the loaded value, not the chain etc.
The hasOneUse check was failing in any case where the load was part of a chain - we should only be checking if the loaded value has one use, and any updates to the chain should be handled by the fold calling shouldReduceLoadWidth.
I've updated the x86 implementation to match, although it has no effect here yet (I'm still looking at how to improve the x86 implementation) as the inner for loop was discarding chain uses anyway.
By using SDValue::hasOneUse instead this patch exposes a missing dependency on the LLVMSelectionDAG library in a lot of tools + unittests, which resulted in having to make SDNode::hasNUsesOfValue inline.
Noticed while fighting the x86 regressions in #122671