[X86][BreakFalseDeps] Using reverse order for undef register selection #137569
Conversation
BreakFalseDeps picks the best register for undef operands when instructions have a false dependency. The problem is that when the instruction is close to the beginning of the function, ReachingDefAnalysis is overly optimistic about which registers are unused, which can result in a collision with registers just defined in the caller. This patch changes undef register selection to use the reverse allocation order, which reduces the probability of register collisions between caller and callee. It brings an improvement in some of our internal benchmarks with a negligible effect on other benchmarks.
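As a rough sketch of the idea (standalone, simplified code, not the patch itself; the free function and std::vector types here are illustrative assumptions), reversing the raw allocation order means the scan for an undef-operand register starts from the high end of the class, e.g. XMM15 before XMM0:

// Minimal sketch: produce the order in which BreakFalseDeps scans for a
// register to assign to an undef operand. With Reverse=true the high
// registers come first; they are less likely to have just been written by
// the caller, so the false dependency is avoided more often.
#include <algorithm>
#include <cstdint>
#include <vector>

using MCPhysReg = uint16_t;

std::vector<MCPhysReg> allocationOrder(std::vector<MCPhysReg> RawOrder,
                                       bool Reverse) {
  if (Reverse)
    std::reverse(RawOrder.begin(), RawOrder.end()); // e.g. XMM15 ... XMM0
  return RawOrder;
}

The actual patch instead threads a Rev flag through RegisterClassInfo::runOnMachineFunction and reuses the TableGen AltOrders mechanism for the X86 XMM/YMM classes, as shown in the diff below.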
@llvm/pr-subscribers-tablegen @llvm/pr-subscribers-backend-x86 Author: Phoebe Wang (phoebewang) Changes: BreakFalseDeps picks the best register for undef operands when instructions have a false dependency. The problem is that when the instruction is close to the beginning of the function, ReachingDefAnalysis is overly optimistic about which registers are unused, which can result in a collision with registers just defined in the caller. This patch changes undef register selection to use the reverse allocation order, which reduces the probability of register collisions between caller and callee. It brings an improvement in some of our internal benchmarks with a negligible effect on other benchmarks. Patch is 253.54 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/137569.diff 42 Files Affected:
diff --git a/llvm/include/llvm/CodeGen/RegisterClassInfo.h b/llvm/include/llvm/CodeGen/RegisterClassInfo.h
index 99beae761c40b..f65f54cbd6982 100644
--- a/llvm/include/llvm/CodeGen/RegisterClassInfo.h
+++ b/llvm/include/llvm/CodeGen/RegisterClassInfo.h
@@ -49,6 +49,8 @@ class RegisterClassInfo {
// entry is valid when its tag matches.
unsigned Tag = 0;
+ bool Reverse = false;
+
const MachineFunction *MF = nullptr;
const TargetRegisterInfo *TRI = nullptr;
@@ -87,7 +89,7 @@ class RegisterClassInfo {
/// runOnFunction - Prepare to answer questions about MF. This must be called
/// before any other methods are used.
- void runOnMachineFunction(const MachineFunction &MF);
+ void runOnMachineFunction(const MachineFunction &MF, bool Rev = false);
/// getNumAllocatableRegs - Returns the number of actually allocatable
/// registers in RC in the current function.
diff --git a/llvm/include/llvm/CodeGen/TargetRegisterInfo.h b/llvm/include/llvm/CodeGen/TargetRegisterInfo.h
index ab3eaa92548ca..af3250e3c2466 100644
--- a/llvm/include/llvm/CodeGen/TargetRegisterInfo.h
+++ b/llvm/include/llvm/CodeGen/TargetRegisterInfo.h
@@ -67,7 +67,7 @@ class TargetRegisterClass {
const bool CoveredBySubRegs;
const unsigned *SuperClasses;
const uint16_t SuperClassesSize;
- ArrayRef<MCPhysReg> (*OrderFunc)(const MachineFunction&);
+ ArrayRef<MCPhysReg> (*OrderFunc)(const MachineFunction &, bool Rev);
/// Return the register class ID number.
unsigned getID() const { return MC->getID(); }
@@ -198,8 +198,9 @@ class TargetRegisterClass {
/// other criteria.
///
/// By default, this method returns all registers in the class.
- ArrayRef<MCPhysReg> getRawAllocationOrder(const MachineFunction &MF) const {
- return OrderFunc ? OrderFunc(MF) : getRegisters();
+ ArrayRef<MCPhysReg> getRawAllocationOrder(const MachineFunction &MF,
+ bool Rev = false) const {
+ return OrderFunc ? OrderFunc(MF, Rev) : getRegisters();
}
/// Returns the combination of all lane masks of register in this class.
diff --git a/llvm/include/llvm/Target/Target.td b/llvm/include/llvm/Target/Target.td
index e8b460aaf803b..ce9a2b2751968 100644
--- a/llvm/include/llvm/Target/Target.td
+++ b/llvm/include/llvm/Target/Target.td
@@ -314,7 +314,7 @@ class RegisterClass<string namespace, list<ValueType> regTypes, int alignment,
// to use in a given machine function. The code will be inserted in a
// function like this:
//
- // static inline unsigned f(const MachineFunction &MF) { ... }
+ // static inline unsigned f(const MachineFunction &MF, bool Rev) { ... }
//
// The function should return 0 to select the default order defined by
// MemberList, 1 to select the first AltOrders entry and so on.
diff --git a/llvm/lib/CodeGen/BreakFalseDeps.cpp b/llvm/lib/CodeGen/BreakFalseDeps.cpp
index 618e41894b29b..64da5d4890ee0 100644
--- a/llvm/lib/CodeGen/BreakFalseDeps.cpp
+++ b/llvm/lib/CodeGen/BreakFalseDeps.cpp
@@ -286,7 +286,7 @@ bool BreakFalseDeps::runOnMachineFunction(MachineFunction &mf) {
TRI = MF->getSubtarget().getRegisterInfo();
RDA = &getAnalysis<ReachingDefAnalysis>();
- RegClassInfo.runOnMachineFunction(mf);
+ RegClassInfo.runOnMachineFunction(mf, /*Rev=*/true);
LLVM_DEBUG(dbgs() << "********** BREAK FALSE DEPENDENCIES **********\n");
diff --git a/llvm/lib/CodeGen/RegisterClassInfo.cpp b/llvm/lib/CodeGen/RegisterClassInfo.cpp
index 40fc35a16335f..8ead83302c337 100644
--- a/llvm/lib/CodeGen/RegisterClassInfo.cpp
+++ b/llvm/lib/CodeGen/RegisterClassInfo.cpp
@@ -39,14 +39,16 @@ StressRA("stress-regalloc", cl::Hidden, cl::init(0), cl::value_desc("N"),
RegisterClassInfo::RegisterClassInfo() = default;
-void RegisterClassInfo::runOnMachineFunction(const MachineFunction &mf) {
+void RegisterClassInfo::runOnMachineFunction(const MachineFunction &mf,
+ bool Rev) {
bool Update = false;
MF = &mf;
auto &STI = MF->getSubtarget();
// Allocate new array the first time we see a new target.
- if (STI.getRegisterInfo() != TRI) {
+ if (STI.getRegisterInfo() != TRI || Reverse != Rev) {
+ Reverse = Rev;
TRI = STI.getRegisterInfo();
RegClass.reset(new RCInfo[TRI->getNumRegClasses()]);
Update = true;
@@ -142,7 +144,12 @@ void RegisterClassInfo::compute(const TargetRegisterClass *RC) const {
// FIXME: Once targets reserve registers instead of removing them from the
// allocation order, we can simply use begin/end here.
- ArrayRef<MCPhysReg> RawOrder = RC->getRawAllocationOrder(*MF);
+ ArrayRef<MCPhysReg> RawOrder = RC->getRawAllocationOrder(*MF, Reverse);
+ std::vector<MCPhysReg> ReverseOrder;
+ if (Reverse) {
+ llvm::append_range(ReverseOrder, reverse(RawOrder));
+ RawOrder = ArrayRef<MCPhysReg>(ReverseOrder);
+ }
for (unsigned PhysReg : RawOrder) {
// Remove reserved registers from the allocation order.
if (Reserved.test(PhysReg))
diff --git a/llvm/lib/Target/X86/X86RegisterInfo.td b/llvm/lib/Target/X86/X86RegisterInfo.td
index 48459b3aca508..8e8f76ee43410 100644
--- a/llvm/lib/Target/X86/X86RegisterInfo.td
+++ b/llvm/lib/Target/X86/X86RegisterInfo.td
@@ -802,17 +802,37 @@ def VR512_0_15 : RegisterClass<"X86", [v16f32, v8f64, v64i8, v32i16, v16i32, v8i
512, (sequence "ZMM%u", 0, 15)>;
// Scalar AVX-512 floating point registers.
-def FR32X : RegisterClass<"X86", [f32], 32, (sequence "XMM%u", 0, 31)>;
+def FR32X : RegisterClass<"X86", [f32], 32, (sequence "XMM%u", 0, 31)> {
+ let AltOrders = [(add (sequence "XMM%u", 16, 31), (sequence "XMM%u", 0, 15))];
+ let AltOrderSelect = [{
+ return Rev;
+ }];
+}
-def FR64X : RegisterClass<"X86", [f64], 64, (add FR32X)>;
+def FR64X : RegisterClass<"X86", [f64], 64, (add FR32X)> {
+ let AltOrders = [(add (sequence "XMM%u", 16, 31), (sequence "XMM%u", 0, 15))];
+ let AltOrderSelect = [{
+ return Rev;
+ }];
+}
def FR16X : RegisterClass<"X86", [f16], 16, (add FR32X)> {let Size = 32;}
// Extended VR128 and VR256 for AVX-512 instructions
def VR128X : RegisterClass<"X86", [v4f32, v2f64, v8f16, v8bf16, v16i8, v8i16, v4i32, v2i64, f128],
- 128, (add FR32X)>;
+ 128, (add FR32X)> {
+ let AltOrders = [(add (sequence "XMM%u", 16, 31), (sequence "XMM%u", 0, 15))];
+ let AltOrderSelect = [{
+ return Rev;
+ }];
+}
def VR256X : RegisterClass<"X86", [v8f32, v4f64, v16f16, v16bf16, v32i8, v16i16, v8i32, v4i64],
- 256, (sequence "YMM%u", 0, 31)>;
+ 256, (sequence "YMM%u", 0, 31)> {
+ let AltOrders = [(add (sequence "YMM%u", 16, 31), (sequence "YMM%u", 0, 15))];
+ let AltOrderSelect = [{
+ return Rev;
+ }];
+}
// Mask registers
def VK1 : RegisterClass<"X86", [v1i1], 16, (sequence "K%u", 0, 7)> {let Size = 16;}
diff --git a/llvm/test/CodeGen/X86/avx-cvt.ll b/llvm/test/CodeGen/X86/avx-cvt.ll
index 1bd25273ecd48..fb30044512fa5 100644
--- a/llvm/test/CodeGen/X86/avx-cvt.ll
+++ b/llvm/test/CodeGen/X86/avx-cvt.ll
@@ -108,7 +108,7 @@ define <2 x double> @fpext01(<2 x double> %a0, <4 x float> %a1) nounwind {
define double @funcA(ptr nocapture %e) nounwind uwtable readonly ssp {
; CHECK-LABEL: funcA:
; CHECK: # %bb.0:
-; CHECK-NEXT: vcvtsi2sdq (%rdi), %xmm0, %xmm0
+; CHECK-NEXT: vcvtsi2sdq (%rdi), %xmm15, %xmm0
; CHECK-NEXT: retq
%tmp1 = load i64, ptr %e, align 8
%conv = sitofp i64 %tmp1 to double
@@ -118,7 +118,7 @@ define double @funcA(ptr nocapture %e) nounwind uwtable readonly ssp {
define double @funcB(ptr nocapture %e) nounwind uwtable readonly ssp {
; CHECK-LABEL: funcB:
; CHECK: # %bb.0:
-; CHECK-NEXT: vcvtsi2sdl (%rdi), %xmm0, %xmm0
+; CHECK-NEXT: vcvtsi2sdl (%rdi), %xmm15, %xmm0
; CHECK-NEXT: retq
%tmp1 = load i32, ptr %e, align 4
%conv = sitofp i32 %tmp1 to double
@@ -128,7 +128,7 @@ define double @funcB(ptr nocapture %e) nounwind uwtable readonly ssp {
define float @funcC(ptr nocapture %e) nounwind uwtable readonly ssp {
; CHECK-LABEL: funcC:
; CHECK: # %bb.0:
-; CHECK-NEXT: vcvtsi2ssl (%rdi), %xmm0, %xmm0
+; CHECK-NEXT: vcvtsi2ssl (%rdi), %xmm15, %xmm0
; CHECK-NEXT: retq
%tmp1 = load i32, ptr %e, align 4
%conv = sitofp i32 %tmp1 to float
@@ -138,7 +138,7 @@ define float @funcC(ptr nocapture %e) nounwind uwtable readonly ssp {
define float @funcD(ptr nocapture %e) nounwind uwtable readonly ssp {
; CHECK-LABEL: funcD:
; CHECK: # %bb.0:
-; CHECK-NEXT: vcvtsi2ssq (%rdi), %xmm0, %xmm0
+; CHECK-NEXT: vcvtsi2ssq (%rdi), %xmm15, %xmm0
; CHECK-NEXT: retq
%tmp1 = load i64, ptr %e, align 8
%conv = sitofp i64 %tmp1 to float
@@ -183,7 +183,7 @@ declare float @llvm.floor.f32(float %p)
define float @floor_f32_load(ptr %aptr) optsize {
; CHECK-LABEL: floor_f32_load:
; CHECK: # %bb.0:
-; CHECK-NEXT: vroundss $9, (%rdi), %xmm0, %xmm0
+; CHECK-NEXT: vroundss $9, (%rdi), %xmm15, %xmm0
; CHECK-NEXT: retq
%a = load float, ptr %aptr
%res = call float @llvm.floor.f32(float %a)
@@ -193,7 +193,7 @@ define float @floor_f32_load(ptr %aptr) optsize {
define float @floor_f32_load_pgso(ptr %aptr) !prof !14 {
; CHECK-LABEL: floor_f32_load_pgso:
; CHECK: # %bb.0:
-; CHECK-NEXT: vroundss $9, (%rdi), %xmm0, %xmm0
+; CHECK-NEXT: vroundss $9, (%rdi), %xmm15, %xmm0
; CHECK-NEXT: retq
%a = load float, ptr %aptr
%res = call float @llvm.floor.f32(float %a)
@@ -203,7 +203,7 @@ define float @floor_f32_load_pgso(ptr %aptr) !prof !14 {
define double @nearbyint_f64_load(ptr %aptr) optsize {
; CHECK-LABEL: nearbyint_f64_load:
; CHECK: # %bb.0:
-; CHECK-NEXT: vroundsd $12, (%rdi), %xmm0, %xmm0
+; CHECK-NEXT: vroundsd $12, (%rdi), %xmm15, %xmm0
; CHECK-NEXT: retq
%a = load double, ptr %aptr
%res = call double @llvm.nearbyint.f64(double %a)
@@ -213,7 +213,7 @@ define double @nearbyint_f64_load(ptr %aptr) optsize {
define double @nearbyint_f64_load_pgso(ptr %aptr) !prof !14 {
; CHECK-LABEL: nearbyint_f64_load_pgso:
; CHECK: # %bb.0:
-; CHECK-NEXT: vroundsd $12, (%rdi), %xmm0, %xmm0
+; CHECK-NEXT: vroundsd $12, (%rdi), %xmm15, %xmm0
; CHECK-NEXT: retq
%a = load double, ptr %aptr
%res = call double @llvm.nearbyint.f64(double %a)
diff --git a/llvm/test/CodeGen/X86/avx512-cvt.ll b/llvm/test/CodeGen/X86/avx512-cvt.ll
index a78d97782e6a3..3dd7b571b9215 100644
--- a/llvm/test/CodeGen/X86/avx512-cvt.ll
+++ b/llvm/test/CodeGen/X86/avx512-cvt.ll
@@ -22,27 +22,27 @@ define <8 x double> @sltof864(<8 x i64> %a) {
; NODQ: # %bb.0:
; NODQ-NEXT: vextracti32x4 $3, %zmm0, %xmm1
; NODQ-NEXT: vpextrq $1, %xmm1, %rax
-; NODQ-NEXT: vcvtsi2sd %rax, %xmm2, %xmm2
+; NODQ-NEXT: vcvtsi2sd %rax, %xmm15, %xmm2
; NODQ-NEXT: vmovq %xmm1, %rax
-; NODQ-NEXT: vcvtsi2sd %rax, %xmm3, %xmm1
+; NODQ-NEXT: vcvtsi2sd %rax, %xmm15, %xmm1
; NODQ-NEXT: vunpcklpd {{.*#+}} xmm1 = xmm1[0],xmm2[0]
; NODQ-NEXT: vextracti32x4 $2, %zmm0, %xmm2
; NODQ-NEXT: vpextrq $1, %xmm2, %rax
-; NODQ-NEXT: vcvtsi2sd %rax, %xmm3, %xmm3
+; NODQ-NEXT: vcvtsi2sd %rax, %xmm15, %xmm3
; NODQ-NEXT: vmovq %xmm2, %rax
-; NODQ-NEXT: vcvtsi2sd %rax, %xmm4, %xmm2
+; NODQ-NEXT: vcvtsi2sd %rax, %xmm15, %xmm2
; NODQ-NEXT: vunpcklpd {{.*#+}} xmm2 = xmm2[0],xmm3[0]
; NODQ-NEXT: vinsertf128 $1, %xmm1, %ymm2, %ymm1
; NODQ-NEXT: vextracti128 $1, %ymm0, %xmm2
; NODQ-NEXT: vpextrq $1, %xmm2, %rax
-; NODQ-NEXT: vcvtsi2sd %rax, %xmm4, %xmm3
+; NODQ-NEXT: vcvtsi2sd %rax, %xmm15, %xmm3
; NODQ-NEXT: vmovq %xmm2, %rax
-; NODQ-NEXT: vcvtsi2sd %rax, %xmm4, %xmm2
+; NODQ-NEXT: vcvtsi2sd %rax, %xmm15, %xmm2
; NODQ-NEXT: vunpcklpd {{.*#+}} xmm2 = xmm2[0],xmm3[0]
; NODQ-NEXT: vpextrq $1, %xmm0, %rax
-; NODQ-NEXT: vcvtsi2sd %rax, %xmm4, %xmm3
+; NODQ-NEXT: vcvtsi2sd %rax, %xmm15, %xmm3
; NODQ-NEXT: vmovq %xmm0, %rax
-; NODQ-NEXT: vcvtsi2sd %rax, %xmm4, %xmm0
+; NODQ-NEXT: vcvtsi2sd %rax, %xmm15, %xmm0
; NODQ-NEXT: vunpcklpd {{.*#+}} xmm0 = xmm0[0],xmm3[0]
; NODQ-NEXT: vinsertf128 $1, %xmm2, %ymm0, %ymm0
; NODQ-NEXT: vinsertf64x4 $1, %ymm1, %zmm0, %zmm0
@@ -66,14 +66,14 @@ define <4 x double> @slto4f64(<4 x i64> %a) {
; NODQ: # %bb.0:
; NODQ-NEXT: vextracti128 $1, %ymm0, %xmm1
; NODQ-NEXT: vpextrq $1, %xmm1, %rax
-; NODQ-NEXT: vcvtsi2sd %rax, %xmm2, %xmm2
+; NODQ-NEXT: vcvtsi2sd %rax, %xmm15, %xmm2
; NODQ-NEXT: vmovq %xmm1, %rax
-; NODQ-NEXT: vcvtsi2sd %rax, %xmm3, %xmm1
+; NODQ-NEXT: vcvtsi2sd %rax, %xmm15, %xmm1
; NODQ-NEXT: vunpcklpd {{.*#+}} xmm1 = xmm1[0],xmm2[0]
; NODQ-NEXT: vpextrq $1, %xmm0, %rax
-; NODQ-NEXT: vcvtsi2sd %rax, %xmm3, %xmm2
+; NODQ-NEXT: vcvtsi2sd %rax, %xmm15, %xmm2
; NODQ-NEXT: vmovq %xmm0, %rax
-; NODQ-NEXT: vcvtsi2sd %rax, %xmm3, %xmm0
+; NODQ-NEXT: vcvtsi2sd %rax, %xmm15, %xmm0
; NODQ-NEXT: vunpcklpd {{.*#+}} xmm0 = xmm0[0],xmm2[0]
; NODQ-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
; NODQ-NEXT: retq
@@ -97,9 +97,9 @@ define <2 x double> @slto2f64(<2 x i64> %a) {
; NODQ-LABEL: slto2f64:
; NODQ: # %bb.0:
; NODQ-NEXT: vpextrq $1, %xmm0, %rax
-; NODQ-NEXT: vcvtsi2sd %rax, %xmm1, %xmm1
+; NODQ-NEXT: vcvtsi2sd %rax, %xmm15, %xmm1
; NODQ-NEXT: vmovq %xmm0, %rax
-; NODQ-NEXT: vcvtsi2sd %rax, %xmm2, %xmm0
+; NODQ-NEXT: vcvtsi2sd %rax, %xmm15, %xmm0
; NODQ-NEXT: vunpcklpd {{.*#+}} xmm0 = xmm0[0],xmm1[0]
; NODQ-NEXT: retq
;
@@ -123,9 +123,9 @@ define <2 x float> @sltof2f32(<2 x i64> %a) {
; NODQ-LABEL: sltof2f32:
; NODQ: # %bb.0:
; NODQ-NEXT: vpextrq $1, %xmm0, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm1, %xmm1
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm1
; NODQ-NEXT: vmovq %xmm0, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm2, %xmm0
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm0
; NODQ-NEXT: vinsertps {{.*#+}} xmm0 = xmm0[0],xmm1[0],zero,zero
; NODQ-NEXT: retq
;
@@ -148,12 +148,12 @@ define <2 x float> @sltof2f32(<2 x i64> %a) {
define <4 x float> @slto4f32_mem(ptr %a) {
; NODQ-LABEL: slto4f32_mem:
; NODQ: # %bb.0:
-; NODQ-NEXT: vcvtsi2ssq 8(%rdi), %xmm0, %xmm0
-; NODQ-NEXT: vcvtsi2ssq (%rdi), %xmm1, %xmm1
+; NODQ-NEXT: vcvtsi2ssq 8(%rdi), %xmm15, %xmm0
+; NODQ-NEXT: vcvtsi2ssq (%rdi), %xmm15, %xmm1
; NODQ-NEXT: vinsertps {{.*#+}} xmm0 = xmm1[0],xmm0[0],xmm1[2,3]
-; NODQ-NEXT: vcvtsi2ssq 16(%rdi), %xmm2, %xmm1
+; NODQ-NEXT: vcvtsi2ssq 16(%rdi), %xmm15, %xmm1
; NODQ-NEXT: vinsertps {{.*#+}} xmm0 = xmm0[0,1],xmm1[0],xmm0[3]
-; NODQ-NEXT: vcvtsi2ssq 24(%rdi), %xmm2, %xmm1
+; NODQ-NEXT: vcvtsi2ssq 24(%rdi), %xmm15, %xmm1
; NODQ-NEXT: vinsertps {{.*#+}} xmm0 = xmm0[0,1,2],xmm1[0]
; NODQ-NEXT: retq
;
@@ -246,16 +246,16 @@ define <4 x float> @slto4f32(<4 x i64> %a) {
; NODQ-LABEL: slto4f32:
; NODQ: # %bb.0:
; NODQ-NEXT: vpextrq $1, %xmm0, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm1, %xmm1
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm1
; NODQ-NEXT: vmovq %xmm0, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm2, %xmm2
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm2
; NODQ-NEXT: vinsertps {{.*#+}} xmm1 = xmm2[0],xmm1[0],xmm2[2,3]
; NODQ-NEXT: vextracti128 $1, %ymm0, %xmm0
; NODQ-NEXT: vmovq %xmm0, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm3, %xmm2
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm2
; NODQ-NEXT: vinsertps {{.*#+}} xmm1 = xmm1[0,1],xmm2[0],xmm1[3]
; NODQ-NEXT: vpextrq $1, %xmm0, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm3, %xmm0
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm0
; NODQ-NEXT: vinsertps {{.*#+}} xmm0 = xmm1[0,1,2],xmm0[0]
; NODQ-NEXT: vzeroupper
; NODQ-NEXT: retq
@@ -281,16 +281,16 @@ define <4 x float> @ulto4f32(<4 x i64> %a) {
; NODQ-LABEL: ulto4f32:
; NODQ: # %bb.0:
; NODQ-NEXT: vpextrq $1, %xmm0, %rax
-; NODQ-NEXT: vcvtusi2ss %rax, %xmm1, %xmm1
+; NODQ-NEXT: vcvtusi2ss %rax, %xmm15, %xmm1
; NODQ-NEXT: vmovq %xmm0, %rax
-; NODQ-NEXT: vcvtusi2ss %rax, %xmm2, %xmm2
+; NODQ-NEXT: vcvtusi2ss %rax, %xmm15, %xmm2
; NODQ-NEXT: vinsertps {{.*#+}} xmm1 = xmm2[0],xmm1[0],xmm2[2,3]
; NODQ-NEXT: vextracti128 $1, %ymm0, %xmm0
; NODQ-NEXT: vmovq %xmm0, %rax
-; NODQ-NEXT: vcvtusi2ss %rax, %xmm3, %xmm2
+; NODQ-NEXT: vcvtusi2ss %rax, %xmm15, %xmm2
; NODQ-NEXT: vinsertps {{.*#+}} xmm1 = xmm1[0,1],xmm2[0],xmm1[3]
; NODQ-NEXT: vpextrq $1, %xmm0, %rax
-; NODQ-NEXT: vcvtusi2ss %rax, %xmm3, %xmm0
+; NODQ-NEXT: vcvtusi2ss %rax, %xmm15, %xmm0
; NODQ-NEXT: vinsertps {{.*#+}} xmm0 = xmm1[0,1,2],xmm0[0]
; NODQ-NEXT: vzeroupper
; NODQ-NEXT: retq
@@ -316,16 +316,16 @@ define <4 x float> @ulto4f32_nneg(<4 x i64> %a) {
; NODQ-LABEL: ulto4f32_nneg:
; NODQ: # %bb.0:
; NODQ-NEXT: vpextrq $1, %xmm0, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm1, %xmm1
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm1
; NODQ-NEXT: vmovq %xmm0, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm2, %xmm2
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm2
; NODQ-NEXT: vinsertps {{.*#+}} xmm1 = xmm2[0],xmm1[0],xmm2[2,3]
; NODQ-NEXT: vextracti128 $1, %ymm0, %xmm0
; NODQ-NEXT: vmovq %xmm0, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm3, %xmm2
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm2
; NODQ-NEXT: vinsertps {{.*#+}} xmm1 = xmm1[0,1],xmm2[0],xmm1[3]
; NODQ-NEXT: vpextrq $1, %xmm0, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm3, %xmm0
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm0
; NODQ-NEXT: vinsertps {{.*#+}} xmm0 = xmm1[0,1,2],xmm0[0]
; NODQ-NEXT: vzeroupper
; NODQ-NEXT: retq
@@ -864,7 +864,7 @@ define <2 x double> @f32tof64_inreg(<2 x double> %a0, <4 x float> %a1) nounwind
define double @sltof64_load(ptr nocapture %e) {
; ALL-LABEL: sltof64_load:
; ALL: # %bb.0: # %entry
-; ALL-NEXT: vcvtsi2sdq (%rdi), %xmm0, %xmm0
+; ALL-NEXT: vcvtsi2sdq (%rdi), %xmm15, %xmm0
; ALL-NEXT: retq
entry:
%tmp1 = load i64, ptr %e, align 8
@@ -875,7 +875,7 @@ entry:
define double @sitof64_load(ptr %e) {
; ALL-LABEL: sitof64_load:
; ALL: # %bb.0: # %entry
-; ALL-NEXT: vcvtsi2sdl (%rdi), %xmm0, %xmm0
+; ALL-NEXT: vcvtsi2sdl (%rdi), %xmm15, %xmm0
; ALL-NEXT: retq
entry:
%tmp1 = load i32, ptr %e, align 4
@@ -886,7 +886,7 @@ entry:
define float @sitof32_load(ptr %e) {
; ALL-LABEL: sitof32_load:
; ALL: # %bb.0: # %entry
-; ALL-NEXT: vcvtsi2ssl (%rdi), %xmm0, %xmm0
+; ALL-NEXT: vcvtsi2ssl (%rdi), %xmm15, %xmm0
; ALL-NEXT: retq
entry:
%tmp1 = load i32, ptr %e, align 4
@@ -897,7 +897,7 @@ entry:
define float @sltof32_load(ptr %e) {
; ALL-LABEL: sltof32_load:
; ALL: # %bb.0: # %entry
-; ALL-NEXT: vcvtsi2ssq (%rdi), %xmm0, %xmm0
+; ALL-NEXT: vcvtsi2ssq (%rdi), %xmm15, %xmm0
; ALL-NEXT: retq
entry:
%tmp1 = load i64, ptr %e, align 8
@@ -990,28 +990,28 @@ define <8 x float> @slto8f32(<8 x i64> %a) {
; NODQ: # %bb.0:
; NODQ-NEXT: vextracti32x4 $2, %zmm0, %xmm1
; NODQ-NEXT: vpextrq $1, %xmm1, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm2, %xmm2
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm2
; NODQ-NEXT: vmovq %xmm1, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm3, %xmm1
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm1
; NODQ-NEXT: vinsertps {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[2,3]
; NODQ-NEXT: vextracti32x4 $3, %zmm0, %xmm2
; NODQ-NEXT: vmovq %xmm2, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm3, %xmm3
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm3
; NODQ-NEXT: vinsertps {{.*#+}} xmm1 = xmm1[0,1],xmm3[0],xmm1[3]
; NODQ-NEXT: vpextrq $1, %xmm2, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm4, %xmm2
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm2
; NODQ-NEXT: vinsertps {{.*#+}} xmm1 = xmm1[0,1,2],xmm2[0]
; NODQ-NEXT: vpextrq $1, %xmm0, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm4, %xmm2
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm2
; NODQ-NEXT: vmovq %xmm0, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm4, %xmm3
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm3
; NODQ-NEXT: vinsertps {{.*#+}} xmm2 = xmm3[0],xmm2[0],xmm3[2,3]
; NODQ-NEXT: vextracti128 $1, %ymm0, %xmm0
; NODQ-NEXT: vmovq %xmm0, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm4, %xmm3
+; NODQ-NEXT: vcvtsi2ss %rax, %xmm15, %xmm3
; NODQ-NEXT: vinsertps {{.*#+}} xmm2 = xmm2[0,1],xmm3[0],xmm2[3]
; NODQ-NEXT: vpextrq $1, %xmm0, %rax
-; NODQ-NEXT: vcvtsi2ss %rax, %xmm4, %xmm0
+; NODQ-NEXT: v...
[truncated]
ArrayRef<MCPhysReg> RawOrder = RC->getRawAllocationOrder(*MF, Reverse);
std::vector<MCPhysReg> ReverseOrder;
if (Reverse) {
  llvm::append_range(ReverseOrder, reverse(RawOrder));
  RawOrder = ArrayRef<MCPhysReg>(ReverseOrder);
}
There is already a mechanism for providing alternative allocation orders defined in TableGen; you shouldn't need to do this.
Yes, this is to imitate the alternative allocation order mechanism. Currently it's only controlled by target features. We want to control it through a pass argument too.
What's wrong with it being a target faster? Could also expand the alternative allocation order controls. This is hardcoding a single alternative choice and requires a runtime sort
The problem is not that some registers are faster; they are all the same.
The intention here is to alter the order for a specific pass. It doesn't solve the problem if we just reverse the register order for all passes.
s/faster/feature/
Then change the selection mechanism for the TableGen-generated order.
I don't see how a feature helps here. This is not a feature that applies to all passes; we just want BreakFalseDeps to use the reverse order.
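For reference, a self-contained mock of why a pass argument is enough (simplified stand-in types, not the real LLVM classes): only BreakFalseDeps passes Rev=true, and the defaulted parameter leaves every other caller unchanged.

// Mock of the patched interface (RegisterClassInfoMock/MachineFunctionMock
// are stand-ins, not LLVM types). Only the opting-in pass sees the reversed
// order; all existing call sites compile and behave as before.
#include <iostream>

struct MachineFunctionMock {};

class RegisterClassInfoMock {
  bool Reverse = false;
public:
  // Mirrors the patched signature: Rev defaults to false.
  void runOnMachineFunction(const MachineFunctionMock &, bool Rev = false) {
    Reverse = Rev; // the real code also invalidates its cached orders here
  }
  bool reversed() const { return Reverse; }
};

int main() {
  MachineFunctionMock MF;
  RegisterClassInfoMock RCI;
  RCI.runOnMachineFunction(MF, /*Rev=*/true); // BreakFalseDeps opts in
  std::cout << RCI.reversed() << '\n';        // 1
  RCI.runOnMachineFunction(MF);               // any other pass: default order
  std::cout << RCI.reversed() << '\n';        // 0
}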
bool Update = false;
MF = &mf;

auto &STI = MF->getSubtarget();

// Allocate new array the first time we see a new target.
if (STI.getRegisterInfo() != TRI) {
if (STI.getRegisterInfo() != TRI || Reverse != Rev) {
This TRI check looks broken; it shouldn't be necessary.
TRI is constant within the same Subtarget, but it can change when we compile functions with different target features, so we need to reset RegClass in those cases.
The analysis shouldn't survive in those cases?
My understanding is that RegClass survives longer than the analysis. Other passes like MachineSink, RegAllocBase, MachineCombiner, etc. all use it. The cached RegClass can then be shared among them within the same Subtarget?
@@ -108,7 +108,7 @@ define <2 x double> @fpext01(<2 x double> %a0, <4 x float> %a1) nounwind {
define double @funcA(ptr nocapture %e) nounwind uwtable readonly ssp {
; CHECK-LABEL: funcA:
; CHECK: # %bb.0:
; CHECK-NEXT: vcvtsi2sdq (%rdi), %xmm0, %xmm0
; CHECK-NEXT: vcvtsi2sdq (%rdi), %xmm15, %xmm0
Won't this cause code bloat by encouraging the use of the xmm8-15 registers?
Do you mean the 2-byte VEX prefix vs. the 3-byte one? The source operand is encoded in vvvv, so it won't affect the prefix size.
And is that true for all other cases as well? (Sorry, I'm playing catch-up and haven't gone through everything yet.)
I checked all affected tests; vcvt[u]si*, fpround/fpext, vrcpss, vrounds*, and vsqrts* all follow the same rule here.
@@ -3,19 +3,6 @@
; RUN: llc < %s -mtriple=x86_64-unknown -mattr=+avx | FileCheck %s |
Update checks and regenerate:
; RUN: llc < %s -mtriple=i686-unknown -mattr=+avx | FileCheck %s --check-prefix=X86
; RUN: llc < %s -mtriple=x86_64-unknown -mattr=+avx | FileCheck %s --check-prefix=X64
@@ -1,4 +1,5 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; Markup has been autogenerated by intel_update_markup.py ; INTEL |
huh?