[X86][BreakFalseDeps] Using reverse order for undef register selection #137569
Conversation
BreakFalseDeps picks the best register for undef operands when instructions have a false dependency. The problem is that when the instruction is close to the beginning of the function, ReachingDefAnalysis is overly optimistic about which registers are unused, which can result in a collision with registers just defined in the caller. This patch changes undef register selection to use the reverse allocation order, which reduces the probability of register collisions between caller and callee. It brings an improvement in some of our internal benchmarks with a negligible effect on other benchmarks.
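As a rough sketch of the idea (standalone, simplified code, not the patch itself; the free function and std::vector types here are illustrative assumptions), reversing the raw allocation order means the scan for an undef-operand register starts from the high end of the class, e.g. XMM15 before XMM0:

// Minimal sketch: produce the order in which BreakFalseDeps scans for a
// register to assign to an undef operand. With Reverse=true the high
// registers come first; they are less likely to have just been written by
// the caller, so the false dependency is avoided more often.
#include <algorithm>
#include <cstdint>
#include <vector>

using MCPhysReg = uint16_t;

std::vector<MCPhysReg> allocationOrder(std::vector<MCPhysReg> RawOrder,
                                       bool Reverse) {
  if (Reverse)
    std::reverse(RawOrder.begin(), RawOrder.end()); // e.g. XMM15 ... XMM0
  return RawOrder;
}

The actual patch instead threads a Rev flag through RegisterClassInfo::runOnMachineFunction and reuses the TableGen AltOrders mechanism for the X86 XMM/YMM classes, as shown in the diff below.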
@llvm/pr-subscribers-tablegen @llvm/pr-subscribers-backend-x86 Author: Phoebe Wang (phoebewang) Changes: BreakFalseDeps picks the best register for undef operands when instructions have a false dependency. The problem is that when the instruction is close to the beginning of the function, ReachingDefAnalysis is overly optimistic about which registers are unused, which can result in a collision with registers just defined in the caller. This patch changes undef register selection to use the reverse allocation order, which reduces the probability of register collisions between caller and callee. It brings an improvement in some of our internal benchmarks with a negligible effect on other benchmarks. Patch is 253.54 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/137569.diff 42 Files Affected:
diff --git a/llvm/include/llvm/CodeGen/RegisterClassInfo.h b/llvm/include/llvm/CodeGen/RegisterClassInfo.h
index 99beae761c40b..f65f54cbd6982 100644
--- a/llvm/include/llvm/CodeGen/RegisterClassInfo.h
+++ b/llvm/include/llvm/CodeGen/RegisterClassInfo.h
@@ -49,6 +49,8 @@ class RegisterClassInfo {
// entry is valid when its tag matches.
unsigned Tag = 0;
+ bool Reverse = false;
+
const MachineFunction *MF = nullptr;
const TargetRegisterInfo *TRI = nullptr;
@@ -87,7 +89,7 @@ class RegisterClassInfo {
/// runOnFunction - Prepare to answer questions about MF. This must be called
/// before any other methods are used.
- void runOnMachineFunction(const MachineFunction &MF);
+ void runOnMachineFunction(const MachineFunction &MF, bool Rev = false);
/// getNumAllocatableRegs - Returns the number of actually allocatable
/// registers in RC in the current function.
diff --git a/llvm/include/llvm/CodeGen/TargetRegisterInfo.h b/llvm/include/llvm/CodeGen/TargetRegisterInfo.h
index ab3eaa92548ca..af3250e3c2466 100644
--- a/llvm/include/llvm/CodeGen/TargetRegisterInfo.h
+++ b/llvm/include/llvm/CodeGen/TargetRegisterInfo.h
@@ -67,7 +67,7 @@ class TargetRegisterClass {
const bool CoveredBySubRegs;
const unsigned *SuperClasses;
const uint16_t SuperClassesSize;
- ArrayRef<MCPhysReg> (*OrderFunc)(const MachineFunction&);
+ ArrayRef<MCPhysReg> (*OrderFunc)(const MachineFunction &, bool Rev);
/// Return the register class ID number.
unsigned getID() const { return MC->getID(); }
@@ -198,8 +198,9 @@ class TargetRegisterClass {
/// other criteria.
///
/// By default, this method returns all registers in the class.
- ArrayRef<MCPhysReg> getRawAllocationOrder(const MachineFunction &MF) const {
- return OrderFunc ? OrderFunc(MF) : getRegisters();
+ ArrayRef<MCPhysReg> getRawAllocationOrder(const MachineFunction &MF,
+ bool Rev = false) const {
+ return OrderFunc ? OrderFunc(MF, Rev) : getRegisters();
}
/// Returns the combination of all lane masks of register in this class.
diff --git a/llvm/include/llvm/Target/Target.td b/llvm/include/llvm/Target/Target.td
index e8b460aaf803b..ce9a2b2751968 100644
--- a/llvm/include/llvm/Target/Target.td
+++ b/llvm/include/llvm/Target/Target.td
@@ -314,7 +314,7 @@ class RegisterClass<string namespace, list<ValueType> regTypes, int alignment,
// to use in a given machine function. The code will be inserted in a
// function like this:
//
- // static inline unsigned f(const MachineFunction &MF) { ... }
+ // static inline unsigned f(const MachineFunction &MF, bool Rev) { ... }
//
// The function should return 0 to select the default order defined by
// MemberList, 1 to select the first AltOrders entry and so on.
diff --git a/llvm/lib/CodeGen/BreakFalseDeps.cpp b/llvm/lib/CodeGen/BreakFalseDeps.cpp
index 618e41894b29b..64da5d4890ee0 100644
--- a/llvm/lib/CodeGen/BreakFalseDeps.cpp
+++ b/llvm/lib/CodeGen/BreakFalseDeps.cpp
@@ -286,7 +286,7 @@ bool BreakFalseDeps::runOnMachineFunction(MachineFunction &mf) {
TRI = MF->getSubtarget().getRegisterInfo();
RDA = &getAnalysis<ReachingDefAnalysis>();
- RegClassInfo.runOnMachineFunction(mf);
+ RegClassInfo.runOnMachineFunction(mf, /*Rev=*/true);
LLVM_DEBUG(dbgs() << "********** BREAK FALSE DEPENDENCIES **********\n");
diff --git a/llvm/lib/CodeGen/RegisterClassInfo.cpp b/llvm/lib/CodeGen/RegisterClassInfo.cpp
index 40fc35a16335f..8ead83302c337 100644
--- a/llvm/lib/CodeGen/RegisterClassInfo.cpp
+++ b/llvm/lib/CodeGen/RegisterClassInfo.cpp
@@ -39,14 +39,16 @@ StressRA("stress-regalloc", cl::Hidden, cl::init(0), cl::value_desc("N"),
RegisterClassInfo::RegisterClassInfo() = default;
-void RegisterClassInfo::runOnMachineFunction(const MachineFunction &mf) {
+void RegisterClassInfo::runOnMachineFunction(const MachineFunction &mf,
+ bool Rev) {
bool Update = false;
MF = &mf;
auto &STI = MF->getSubtarget();
// Allocate new array the first time we see a new target.
- if (STI.getRegisterInfo() != TRI) {
+ if (STI.getRegisterInfo() != TRI || Reverse != Rev) {
+ Reverse = Rev;
TRI = STI.getRegisterInfo();
RegClass.reset(new RCInfo[TRI->getNumRegClasses()]);
Update = true;
@@ -142,7 +144,12 @@ void RegisterClassInfo::compute(const TargetRegisterClass *RC) const {
// FIXME: Once targets reserve registers instead of removing them from the
// allocation order, we can simply use begin/end here.
- ArrayRef<MCPhysReg> RawOrder = RC->getRawAllocationOrder(*MF);
+ ArrayRef<MCPhysReg> RawOrder = RC->getRawAllocationOrder(*MF, Reverse);
+ std::vector<MCPhysReg> ReverseOrder;
+ if (Reverse) {
+ llvm::append_range(ReverseOrder, reverse(RawOrder));
+ RawOrder = ArrayRef<MCPhysReg>(ReverseOrder);
+ }
for (unsigned PhysReg : RawOrder) {
// Remove reserved registers from the allocation order.
if (Reserved.test(PhysReg))
diff --git a/llvm/lib/Target/X86/X86RegisterInfo.td b/llvm/lib/Target/X86/X86RegisterInfo.td
index 48459b3aca508..8e8f76ee43410 100644
--- a/llvm/lib/Target/X86/X86RegisterInfo.td
+++ b/llvm/lib/Target/X86/X86RegisterInfo.td
@@ -802,17 +802,37 @@ def VR512_0_15 : RegisterClass<"X86", [v16f32, v8f64, v64i8, v32i16, v16i32, v8i
512, (sequence "ZMM%u", 0, 15)>;
// Scalar AVX-512 floating point registers.
-def FR32X : RegisterClass<"X86", [f32], 32, (sequence "XMM%u", 0, 31)>;
+def FR32X : RegisterClass<"X86", [f32], 32, (sequence "XMM%u", 0, 31)> {
+ let AltOrders = [(add (sequence "XMM%u", 16, 31), (sequence "XMM%u", 0, 15))];
+ let AltOrderSelect = [{
+ return Rev;
+ }];
+}
-def FR64X : RegisterClass<"X86", [f64], 64, (add FR32X)>;
+def FR64X : RegisterClass<"X86", [f64], 64, (add FR32X)> {
+ let AltOrders = [(add (sequence "XMM%u", 16, 31), (sequence "XMM%u", 0, 15))];
+ let AltOrderSelect = [{
+ return Rev;
+ }];
+}
def FR16X : RegisterClass<"X86", [f16], 16, (add FR32X)> {let Size = 32;}
// Extended VR128 and VR256 for AVX-512 instructions
def VR128X : RegisterClass<"X86", [v4f32, v2f64, v8f16, v8bf16, v16i8, v8i16, v4i32, v2i64, f128],
- 128, (add FR32X)>;
+ 128, (add FR32X)> {
+ let AltOrders = [(add (sequence "XMM%u", 16, 31), (sequence "XMM%u", 0, 15))];
+ let AltOrderSelect = [{
+ return Rev;
+ }];
+}
def VR256X : RegisterClass<"X86", [v8f32, v4f64, v16f16, v16bf16, v32i8, v16i16, v8i32, v4i64],
- 256, (sequence "YMM%u", 0, 31)>;
+ 256, (sequence "YMM%u", 0, 31)> {
+ let AltOrders = [(add (sequence "YMM%u", 16, 31), (sequence "YMM%u", 0, 15))];
+ let AltOrderSelect = [{
+ return Rev;
+ }];
+}
// Mask registers
def VK1 : RegisterClass<"X86", [v1i1], 16, (sequence "K%u", 0, 7)> {let Size = 16;}
diff --git a/llvm/test/CodeGen/X86/avx-cvt.ll b/llvm/test/CodeGen/X86/avx-cvt.ll
index 1bd25273ecd48..fb30044512fa5 100644
--- a/llvm/test/CodeGen/X86/avx-cvt.ll
+++ b/llvm/test/CodeGen/X86/avx-cvt.ll
@@ -108,7 +108,7 @@ define <2 x double> @fpext01(<2 x double> %a0, <4 x float> %a1) nounwind {
define double @funcA(ptr nocapture %e) nounwind uwtable readonly ssp {
; CHECK-LABEL: funcA:
; CHECK: # %bb.0:
-; CHECK-NEXT: vcvtsi2sdq (%rdi), %xmm0, %xmm0
+; CHECK-NEXT: vcvtsi2sdq (%rdi), %xmm15, %xmm0
; CHECK-NEXT: retq
%tmp1 = load i64, ptr %e, align 8
%conv = sitofp i64 %tmp1 to double
@@ -118,7 +118,7 @@ define double @funcA(ptr nocapture %e) nounwind uwtable readonly ssp {
define double @funcB(ptr nocapture %e) nounwind uwtable readonly ssp {
; CHECK-LABEL: funcB:
; CHECK: # %bb.0:
-; CHECK-NEXT: vcvtsi2sdl (%rdi), %xmm0, %xmm0
+; CHECK-NEXT: vcvtsi2sdl (%rdi), %xmm15, %xmm0
; CHECK-NEXT: retq
%tmp1 = load i32, ptr %e, align 4
%conv = sitofp i32 %tmp1 to double
@@ -128,7 +128,7 @@ define double @funcB(ptr nocapture %e) nounwind uwtable readonly ssp {
define float @funcC(ptr nocapture %e) nounwind uwtable readonly ssp {
; CHECK-LABEL: funcC:
; CHECK: # %bb.0:
-; CHECK-NEXT: vcvtsi2ssl (%rdi), %xmm0, %xmm0
+; CHECK-NEXT: vcvtsi2ssl (%rdi), %xmm15, %xmm0
; CHECK-NEXT: retq
%tmp1 = load i32, ptr %e, align 4
%conv = sitofp i32 %tmp1 to float
@@ -138,7 +138,7 @@ define float @funcC(ptr nocapture %e) nounwind uwtable readonly ssp {
define float @funcD(ptr nocapture %e) nounwind uwtable readonly ssp {
; CHECK-LABEL: funcD:
; CHECK: # %bb.0:
-; CHECK-NEXT: vcvtsi2ssq (%rdi), %xmm0, %xmm0
+; CHECK-NEXT: vcvtsi2ssq (%rdi), %xmm15, %xmm0
; CHECK-NEXT: retq
%tmp1 = load i64, ptr %e, align 8
%conv = sitofp i64 %tmp1 to float
@@ -183,7 +183,7 @@ declare float @llvm.floor.f32(float %p)
define float @floor_f32_load(ptr %aptr) optsize {
; CHECK-LABEL: floor_f32_load:
; CHECK: # %bb.0:
-; CHECK-NEXT: vroundss $9, (%rdi), %xmm0, %xmm0
+; CHECK-NEXT: vroundss $9, (%rdi), %xmm15, %xmm0
; CHECK-NEXT: retq
%a = load float, ptr %aptr
%res = call float @llvm.floor.f32(float %a)
@@ -193,7 +193,7 @@ define float @floor_f32_load(ptr %aptr) optsize {
define float @floor_f32_load_pgso(ptr %aptr) !prof !14 {
; CHECK-LABEL: floor_f32_load_pgso:
; CHECK: # %bb.0:
-; CHECK-NEXT: vroundss $9, (%rdi), %xmm0, %xmm0
+; CHECK-NEXT: vroundss $9, (%rdi), %xmm15, %xmm0
; CHECK-NEXT: retq
%a = load float, ptr %aptr
%res = call float @llvm.floor.f32(float %a)
@@ -203,7 +203,7 @@ define float @floor_f32_load_pgso(ptr %aptr) !prof !14 {
define double @nearbyint_f64_load(ptr %aptr) optsize {
; CHECK-LABEL: nearbyint_f64_load:
; CHECK: # %bb.0:
-; CHECK-NEXT: vroundsd $12, (%rdi), %xmm0, %xmm0
+; CHECK-NEXT: vroundsd $12, (%rdi), %xmm15, %xmm0
; CHECK-NEXT: retq
%a = load double, ptr %aptr
%res = call double @llvm.nearbyint.f64(double %a)
@@ -213,7 +213,7 @@ define double @nearbyint_f64_load(ptr %aptr) optsize {
define double @nearbyint_f64_load_pgso(ptr %aptr) !prof !14 {
; CHECK-LABEL: nearbyint_f64_load_pgso:
; CHECK: # %bb.0:
-; CHECK-NEXT: vroundsd $12, (%rdi), %xmm0, %xmm0
+; CHECK-NEXT: vroundsd $12, (%rdi), %xmm15, %xmm0
; CHECK-NEXT: retq
%a = load double, ptr %aptr
%res = call double @llvm.nearbyint.f64(double %a)
diff --git a/llvm/test/CodeGen/X86/avx512-cvt.ll b/llvm/test/CodeGen/X86/avx512-cvt.ll
index a78d97782e6a3..3dd7b571b9215 100644
--- a/llvm/test/CodeGen/X86/avx512-cvt.ll
+++ b/llvm/test/CodeGen/X86/avx512-cvt.ll
@@ -22,27 +22,27 @@ define <8 x double> @sltof864(<8 x i64> %a) {
; NODQ: # %bb.0:
; NODQ-NEXT: vextracti32x4 $3, %zmm0, %xmm1
; NODQ-NEXT: vpextrq $1, %xmm1, %rax
-; NODQ-NEXT: vcvtsi2sd %rax, %xmm2, %xmm2
+; NODQ-NEXT: vcvtsi2sd %rax, %xmm15, %xmm2
; NODQ-NEXT: vmovq %xmm1, %rax
-; NODQ-NEXT: vcvtsi2sd %rax, %xmm3, %xmm1
+; NODQ-NEXT: vcvtsi2sd %rax, %xmm15, %xmm1
; NODQ-NEXT: vunpcklpd {{.*#+}} xmm1 = xmm1[0],xmm2[0]
; NODQ-NEXT: vextracti32x4 $2, %zmm0, %xmm2
; NODQ-NEXT: vpextrq $1, %xmm2, %rax
-; NODQ-NEXT: vcvtsi2sd %rax, %xmm3, %xmm3
+; NODQ-NEXT: vcvtsi2sd %rax, %xmm15, %xmm3
; NODQ-NEXT: vmovq %xmm2, %rax
-; NODQ-NEXT: vcvtsi2sd %rax, %xmm4, %xmm2
+; NODQ-NEXT: vcvtsi2sd %rax, %xmm15, %xmm2
; NODQ-NEXT: vunpcklpd {{.*#+}} xmm2 = xmm2[0],xmm3[0]
; NODQ-NEXT: vinsertf128 $1, %xmm1, %ymm2, %ymm1
; NODQ-NEXT: vextracti128 $1, %ymm0, %xmm2
; NODQ-NEXT: vpextrq $1, %xmm2, %rax
-; NODQ-NEXT: vcvtsi2sd %rax, %xmm4, %xmm3
+; NODQ-NEXT: vcvtsi2sd %rax, %xmm15, %xmm3
; NODQ-NEXT: vmovq %xmm2, %rax
-; NODQ-NEXT: vcvtsi2sd %rax, %xmm4, %xmm2
+; NODQ-NEXT: vcvtsi2sd %rax, %xmm15, %xmm2
; NODQ-NEXT: vunpcklpd {{.*#+}} xmm2 = xmm2[0],xmm3[0]
; NODQ-NEXT: vpextrq $1, %xmm0, %rax
-; NODQ-NEXT: vcvtsi2sd %rax, %xmm4, %xmm3
+; NODQ-NEXT: vcvtsi2sd %rax, %xmm15, %xmm3
; NODQ-NEXT: vmovq %xmm0, %rax
-; NODQ-NEXT: vcvtsi2sd %rax, %xmm4, %xmm0
+; NODQ-NEXT: vcvtsi2sd %rax, %xmm15, %xmm0
; NODQ-NEXT: vunpcklpd {{.*#+}} xmm0 = xmm0[0],xmm3[0]
; NODQ-NEXT: vinsertf128 $1, %xmm2, %ymm0, %ymm0
; NODQ-NEXT: vinsertf64x4 $1, %ymm1, %zmm0, %zmm0
@@ -66,14 +66,14 @@ define <4 x double> @slto4f64(<4 x i64> %a) {
; NODQ: # %bb.0:
; NODQ-NEXT: vextracti128 $1, %ymm0, %xmm1
; NODQ-NEXT: vpextrq $1, %xmm1, %rax
-; NODQ-NEXT: vcvtsi2sd %rax, %xmm2, %xmm2
+; NODQ-NEXT: vcvtsi2sd %rax, %xmm15, %xmm2
; NODQ-NEXT: vmovq %xmm1, %rax
-; NODQ-NEXT: vcvtsi2sd %rax, %xmm3, %xmm1
+; NODQ-NEXT: vcvtsi2sd %rax, %xmm15, %xmm1
; NODQ-NEXT: vunpcklpd {{.*#+}} xmm1 = xmm1[0],xmm2[0]
; NODQ-NEXT: vpextrq $1, %xmm0, %rax
-; NODQ-NEXT: vcvtsi2sd %rax, %xmm3, %xmm2
+; NODQ-NEXT: vcvtsi2sd %rax, %xmm15, %xmm2
; NODQ-NEXT: vmovq %xmm0, %rax
-; NODQ-NEXT: vcvtsi2sd %rax, %xmm3, %xmm0
+; NODQ-NEXT: vcvtsi2sd %rax, %xmm15, %xmm0
; NODQ-NEXT: vunpcklpd {{.*#+}} xmm0 = xmm0[0],xmm2[0]
; NODQ-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
; NODQ-NEXT: retq
@@ -97,9 +97,9 @@ define <2 x double> @slto2f64(<2 x i64> %a) {
; NODQ-LABEL: slto2f64:
; NODQ: # %bb.0:
; NODQ-NEXT: vpextrq $1, %xmm0, %rax
-; NODQ-NEXT: vcvtsi2sd %rax, %xmm1, %xmm1
+; NODQ-NEXT: vcvtsi2sd %rax, %xmm15, %xmm1
; NODQ-NEXT: vmovq %xmm0, %rax
-; NODQ-NEXT: vcvtsi2sd %rax, %xmm2, %xmm0
+; NODQ-NEXT: vcvtsi2sd %rax, %xmm15, %xmm0
; NODQ-NEXT: vunpcklpd {{.*#+}} xmm0 = xmm0[0],xmm1[0]
; NODQ-NEXT: retq
;
@@ -123,9 +123,9 @@ define <2 x float> @sltof2f32(<2 x i64> %a) {
; NODQ-LABEL: sltof2f32:
; NODQ: # %bb.0:
; NODQ-NEXT: vpextrq $1, %xmm0, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm1, %xmm1
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm1
; NODQ-NEXT: vmovq %xmm0, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm2, %xmm0
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm0
; NODQ-NEXT: vinsertps {{.*#+}} xmm0 = xmm0[0],xmm1[0],zero,zero
; NODQ-NEXT: retq
;
@@ -148,12 +148,12 @@ define <2 x float> @sltof2f32(<2 x i64> %a) {
define <4 x float> @slto4f32_mem(ptr %a) {
; NODQ-LABEL: slto4f32_mem:
; NODQ: # %bb.0:
-; NODQ-NEXT: vcvtsi2ssq 8(%rdi), %xmm0, %xmm0
-; NODQ-NEXT: vcvtsi2ssq (%rdi), %xmm1, %xmm1
+; NODQ-NEXT: vcvtsi2ssq 8(%rdi), %xmm15, %xmm0
+; NODQ-NEXT: vcvtsi2ssq (%rdi), %xmm15, %xmm1
; NODQ-NEXT: vinsertps {{.*#+}} xmm0 = xmm1[0],xmm0[0],xmm1[2,3]
-; NODQ-NEXT: vcvtsi2ssq 16(%rdi), %xmm2, %xmm1
+; NODQ-NEXT: vcvtsi2ssq 16(%rdi), %xmm15, %xmm1
; NODQ-NEXT: vinsertps {{.*#+}} xmm0 = xmm0[0,1],xmm1[0],xmm0[3]
-; NODQ-NEXT: vcvtsi2ssq 24(%rdi), %xmm2, %xmm1
+; NODQ-NEXT: vcvtsi2ssq 24(%rdi), %xmm15, %xmm1
; NODQ-NEXT: vinsertps {{.*#+}} xmm0 = xmm0[0,1,2],xmm1[0]
; NODQ-NEXT: retq
;
@@ -246,16 +246,16 @@ define <4 x float> @slto4f32(<4 x i64> %a) {
; NODQ-LABEL: slto4f32:
; NODQ: # %bb.0:
; NODQ-NEXT: vpextrq $1, %xmm0, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm1, %xmm1
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm1
; NODQ-NEXT: vmovq %xmm0, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm2, %xmm2
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm2
; NODQ-NEXT: vinsertps {{.*#+}} xmm1 = xmm2[0],xmm1[0],xmm2[2,3]
; NODQ-NEXT: vextracti128 $1, %ymm0, %xmm0
; NODQ-NEXT: vmovq %xmm0, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm3, %xmm2
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm2
; NODQ-NEXT: vinsertps {{.*#+}} xmm1 = xmm1[0,1],xmm2[0],xmm1[3]
; NODQ-NEXT: vpextrq $1, %xmm0, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm3, %xmm0
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm0
; NODQ-NEXT: vinsertps {{.*#+}} xmm0 = xmm1[0,1,2],xmm0[0]
; NODQ-NEXT: vzeroupper
; NODQ-NEXT: retq
@@ -281,16 +281,16 @@ define <4 x float> @ulto4f32(<4 x i64> %a) {
; NODQ-LABEL: ulto4f32:
; NODQ: # %bb.0:
; NODQ-NEXT: vpextrq $1, %xmm0, %rax
-; NODQ-NEXT: vcvtusi2ss %rax, %xmm1, %xmm1
+; NODQ-NEXT: vcvtusi2ss %rax, %xmm15, %xmm1
; NODQ-NEXT: vmovq %xmm0, %rax
-; NODQ-NEXT: vcvtusi2ss %rax, %xmm2, %xmm2
+; NODQ-NEXT: vcvtusi2ss %rax, %xmm15, %xmm2
; NODQ-NEXT: vinsertps {{.*#+}} xmm1 = xmm2[0],xmm1[0],xmm2[2,3]
; NODQ-NEXT: vextracti128 $1, %ymm0, %xmm0
; NODQ-NEXT: vmovq %xmm0, %rax
-; NODQ-NEXT: vcvtusi2ss %rax, %xmm3, %xmm2
+; NODQ-NEXT: vcvtusi2ss %rax, %xmm15, %xmm2
; NODQ-NEXT: vinsertps {{.*#+}} xmm1 = xmm1[0,1],xmm2[0],xmm1[3]
; NODQ-NEXT: vpextrq $1, %xmm0, %rax
-; NODQ-NEXT: vcvtusi2ss %rax, %xmm3, %xmm0
+; NODQ-NEXT: vcvtusi2ss %rax, %xmm15, %xmm0
; NODQ-NEXT: vinsertps {{.*#+}} xmm0 = xmm1[0,1,2],xmm0[0]
; NODQ-NEXT: vzeroupper
; NODQ-NEXT: retq
@@ -316,16 +316,16 @@ define <4 x float> @ulto4f32_nneg(<4 x i64> %a) {
; NODQ-LABEL: ulto4f32_nneg:
; NODQ: # %bb.0:
; NODQ-NEXT: vpextrq $1, %xmm0, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm1, %xmm1
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm1
; NODQ-NEXT: vmovq %xmm0, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm2, %xmm2
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm2
; NODQ-NEXT: vinsertps {{.*#+}} xmm1 = xmm2[0],xmm1[0],xmm2[2,3]
; NODQ-NEXT: vextracti128 $1, %ymm0, %xmm0
; NODQ-NEXT: vmovq %xmm0, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm3, %xmm2
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm2
; NODQ-NEXT: vinsertps {{.*#+}} xmm1 = xmm1[0,1],xmm2[0],xmm1[3]
; NODQ-NEXT: vpextrq $1, %xmm0, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm3, %xmm0
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm0
; NODQ-NEXT: vinsertps {{.*#+}} xmm0 = xmm1[0,1,2],xmm0[0]
; NODQ-NEXT: vzeroupper
; NODQ-NEXT: retq
@@ -864,7 +864,7 @@ define <2 x double> @f32tof64_inreg(<2 x double> %a0, <4 x float> %a1) nounwind
define double @sltof64_load(ptr nocapture %e) {
; ALL-LABEL: sltof64_load:
; ALL: # %bb.0: # %entry
-; ALL-NEXT: vcvtsi2sdq (%rdi), %xmm0, %xmm0
+; ALL-NEXT: vcvtsi2sdq (%rdi), %xmm15, %xmm0
; ALL-NEXT: retq
entry:
%tmp1 = load i64, ptr %e, align 8
@@ -875,7 +875,7 @@ entry:
define double @sitof64_load(ptr %e) {
; ALL-LABEL: sitof64_load:
; ALL: # %bb.0: # %entry
-; ALL-NEXT: vcvtsi2sdl (%rdi), %xmm0, %xmm0
+; ALL-NEXT: vcvtsi2sdl (%rdi), %xmm15, %xmm0
; ALL-NEXT: retq
entry:
%tmp1 = load i32, ptr %e, align 4
@@ -886,7 +886,7 @@ entry:
define float @sitof32_load(ptr %e) {
; ALL-LABEL: sitof32_load:
; ALL: # %bb.0: # %entry
-; ALL-NEXT: vcvtsi2ssl (%rdi), %xmm0, %xmm0
+; ALL-NEXT: vcvtsi2ssl (%rdi), %xmm15, %xmm0
; ALL-NEXT: retq
entry:
%tmp1 = load i32, ptr %e, align 4
@@ -897,7 +897,7 @@ entry:
define float @sltof32_load(ptr %e) {
; ALL-LABEL: sltof32_load:
; ALL: # %bb.0: # %entry
-; ALL-NEXT: vcvtsi2ssq (%rdi), %xmm0, %xmm0
+; ALL-NEXT: vcvtsi2ssq (%rdi), %xmm15, %xmm0
; ALL-NEXT: retq
entry:
%tmp1 = load i64, ptr %e, align 8
@@ -990,28 +990,28 @@ define <8 x float> @slto8f32(<8 x i64> %a) {
; NODQ: # %bb.0:
; NODQ-NEXT: vextracti32x4 $2, %zmm0, %xmm1
; NODQ-NEXT: vpextrq $1, %xmm1, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm2, %xmm2
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm2
; NODQ-NEXT: vmovq %xmm1, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm3, %xmm1
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm1
; NODQ-NEXT: vinsertps {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[2,3]
; NODQ-NEXT: vextracti32x4 $3, %zmm0, %xmm2
; NODQ-NEXT: vmovq %xmm2, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm3, %xmm3
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm3
; NODQ-NEXT: vinsertps {{.*#+}} xmm1 = xmm1[0,1],xmm3[0],xmm1[3]
; NODQ-NEXT: vpextrq $1, %xmm2, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm4, %xmm2
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm2
; NODQ-NEXT: vinsertps {{.*#+}} xmm1 = xmm1[0,1,2],xmm2[0]
; NODQ-NEXT: vpextrq $1, %xmm0, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm4, %xmm2
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm2
; NODQ-NEXT: vmovq %xmm0, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm4, %xmm3
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm3
; NODQ-NEXT: vinsertps {{.*#+}} xmm2 = xmm3[0],xmm2[0],xmm3[2,3]
; NODQ-NEXT: vextracti128 $1, %ymm0, %xmm0
; NODQ-NEXT: vmovq %xmm0, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm4, %xmm3
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm3
; NODQ-NEXT: vinsertps {{.*#+}} xmm2 = xmm2[0,1],xmm3[0],xmm2[3]
; NODQ-NEXT: vpextrq $1, %xmm0, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm4, %xmm0
+; NODQ-NEXT: v...
[truncated]
ArrayRef<MCPhysReg> RawOrder = RC->getRawAllocationOrder(*MF, Reverse);
std::vector<MCPhysReg> ReverseOrder;
if (Reverse) {
  llvm::append_range(ReverseOrder, reverse(RawOrder));
  RawOrder = ArrayRef<MCPhysReg>(ReverseOrder);
}
There is already a mechanism for providing alternative allocation orders defined in TableGen; you shouldn't need to do this.
Yes, this is to imitate the alternative allocation order mechanism. Currently it's only controlled by target features. We want to control it through a pass argument too.
What's wrong with it being a target faster? Could also expand the alternative allocation order controls. This is hardcoding a single alternative choice and requires a runtime sort
The problem is not that some registers are faster; they are all the same.
The intention here is to alter the order for a specific pass. It doesn't solve the problem if we just reverse the register order for all passes.
s/faster/feature/
Then change the selection mechanism for the TableGen-generated order.
I don't see how a feature helps here. This is not a feature that applies to all passes; we just want BreakFalseDeps to use the reverse order.
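For reference, a self-contained mock of why a pass argument is enough (simplified stand-in types, not the real LLVM classes): only BreakFalseDeps passes Rev=true, and the defaulted parameter leaves every other caller unchanged.

// Mock of the patched interface (RegisterClassInfoMock/MachineFunctionMock
// are stand-ins, not LLVM types). Only the opting-in pass sees the reversed
// order; all existing call sites compile and behave as before.
#include <iostream>

struct MachineFunctionMock {};

class RegisterClassInfoMock {
  bool Reverse = false;
public:
  // Mirrors the patched signature: Rev defaults to false.
  void runOnMachineFunction(const MachineFunctionMock &, bool Rev = false) {
    Reverse = Rev; // the real code also invalidates its cached orders here
  }
  bool reversed() const { return Reverse; }
};

int main() {
  MachineFunctionMock MF;
  RegisterClassInfoMock RCI;
  RCI.runOnMachineFunction(MF, /*Rev=*/true); // BreakFalseDeps opts in
  std::cout << RCI.reversed() << '\n';        // 1
  RCI.runOnMachineFunction(MF);               // any other pass: default order
  std::cout << RCI.reversed() << '\n';        // 0
}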
bool Update = false;
MF = &mf;

auto &STI = MF->getSubtarget();

// Allocate new array the first time we see a new target.
if (STI.getRegisterInfo() != TRI) {
if (STI.getRegisterInfo() != TRI || Reverse != Rev) {
This TRI check looks broken; it shouldn't be necessary.
TRI is constant within the same Subtarget, but it can change when we compile functions with different target features, so we need to reset RegClass in those cases.
The analysis shouldn't survive in those cases?
My understanding is that RegClass survives longer than the analysis. Other passes like MachineSink, RegAllocBase, MachineCombiner, etc. all use it. The cached RegClass can then be shared among them within the same Subtarget?
@@ -108,7 +108,7 @@ define <2 x double> @fpext01(<2 x double> %a0, <4 x float> %a1) nounwind {
define double @funcA(ptr nocapture %e) nounwind uwtable readonly ssp {
; CHECK-LABEL: funcA:
; CHECK: # %bb.0:
; CHECK-NEXT: vcvtsi2sdq (%rdi), %xmm0, %xmm0
; CHECK-NEXT: vcvtsi2sdq (%rdi), %xmm15, %xmm0
Won't this cause code bloat by encouraging the use of the xmm8-15 registers?
Do you mean the 2-byte VEX prefix vs. the 3-byte one? The source operand is encoded in vvvv, so it won't affect the prefix size.
And is that true for all other cases as well? (Sorry, I'm playing catch-up and haven't gone through everything yet.)
I checked all affected tests; vcvt[u]si*, fpround/fpext, vrcpss, vrounds*, and vsqrts* all follow the same rule here.
@@ -3,19 +3,6 @@
; RUN: llc < %s -mtriple=x86_64-unknown -mattr=+avx | FileCheck %s |
Update checks and regenerate:
; RUN: llc < %s -mtriple=i686-unknown -mattr=+avx | FileCheck %s --check-prefix=X86
; RUN: llc < %s -mtriple=x86_64-unknown -mattr=+avx | FileCheck %s --check-prefix=X64
@@ -1,4 +1,5 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; Markup has been autogenerated by intel_update_markup.py ; INTEL |
huh?