[AArch64] fix trampoline implementation: use X15 #126743

vtjnash · 2025-02-11T15:31:41Z

AAPCS64 reserves any of X9-X15 for a compiler to use for this purpose, and says not to use X16 or X18, as GCC (and the previous implementation) chose to do. The platform ABI allows the kernel to use X18 in some circumstances, so it is generally an unwise choice. Simply choosing a different register fixes the problem of this being broken on any platform that actually follows the platform ABI (which is all of them except EABI, if I am reading this Linux kernel bug correctly: https://lkml2.uits.iu.edu/hypermail/linux/kernel/2001.2/01502.html). As a side benefit, this also generates slightly better code and avoids needing compiler-rt to be present. I did that by following the XCore implementation instead of PPC (although in hindsight, following RISCV might have been slightly more readable). It has been known for many years that X18 is wrong to use for this purpose (e.g. https://www.mail-archive.com/gcc@gcc.gnu.org/msg76934.html), and also that switching to one of the correct registers is not an ABI break, since this only appears inside of a translation unit. Some of the other temporary registers (e.g. X9) are already reserved inside LLVM for internal use as a generic temporary register in the prologue before saving registers, while X15 was already used in rare cases as a scratch register in the prologue as well, so it seemed the most logical choice here.

@ceseo Are there any tests you'd recommend I run to show that this works correctly?

@vchuravy @llvm/issue-subscribers-julialang

AAPCS64 reserves any of X9-X15 for this purpose, and says not to use X16 or X18 as GCC did. Simply choosing a different register fixes the problem of this being broken on any platform that actually follows the platform ABI. As a side benefit, this also generates slightly better code by following the XCore implementation instead of PPC (although following RISCV might have been slightly more readable in hindsight).

llvmbot · 2025-02-11T15:32:18Z

@llvm/pr-subscribers-llvm-ir
@llvm/pr-subscribers-flang-codegen
@llvm/pr-subscribers-flang-fir-hlfir

@llvm/pr-subscribers-backend-aarch64

Author: Jameson Nash (vtjnash)

Changes

Patch is 43.85 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/126743.diff

13 Files Affected:

(modified) compiler-rt/lib/builtins/README.txt (-5)
(modified) compiler-rt/lib/builtins/trampoline_setup.c (-42)
(modified) compiler-rt/test/builtins/Unit/trampoline_setup_test.c (+1-1)
(modified) flang/lib/Optimizer/CodeGen/BoxedProcedure.cpp (+2-2)
(modified) llvm/lib/Target/AArch64/AArch64CallingConvention.td (+23-13)
(modified) llvm/lib/Target/AArch64/AArch64FrameLowering.cpp (+26)
(modified) llvm/lib/Target/AArch64/AArch64ISelLowering.cpp (+46-39)
(modified) llvm/test/CodeGen/AArch64/nest-register.ll (+10-6)
(modified) llvm/test/CodeGen/AArch64/preserve_nonecc_call.ll (+64-52)
(modified) llvm/test/CodeGen/AArch64/statepoint-call-lowering.ll (+1-1)
(modified) llvm/test/CodeGen/AArch64/trampoline.ll (+245-12)
(modified) llvm/test/CodeGen/AArch64/win64cc-x18.ll (+9-18)
(modified) llvm/test/CodeGen/AArch64/zero-call-used-regs.ll (+8-8)

diff --git a/compiler-rt/lib/builtins/README.txt b/compiler-rt/lib/builtins/README.txt
index 19f26c92a0f94f..2d213d95f333af 100644
--- a/compiler-rt/lib/builtins/README.txt
+++ b/compiler-rt/lib/builtins/README.txt
@@ -272,11 +272,6 @@ switch32
 switch8
 switchu8
 
-// This function generates a custom trampoline function with the specific
-// realFunc and localsPtr values.
-void __trampoline_setup(uint32_t* trampOnStack, int trampSizeAllocated,
-                        const void* realFunc, void* localsPtr);
-
 // There is no C interface to the *_vfp_d8_d15_regs functions.  There are
 // called in the prolog and epilog of Thumb1 functions.  When the C++ ABI use
 // SJLJ for exceptions, each function with a catch clause or destructors needs
diff --git a/compiler-rt/lib/builtins/trampoline_setup.c b/compiler-rt/lib/builtins/trampoline_setup.c
index 830e25e4c0303a..844eb279441428 100644
--- a/compiler-rt/lib/builtins/trampoline_setup.c
+++ b/compiler-rt/lib/builtins/trampoline_setup.c
@@ -41,45 +41,3 @@ COMPILER_RT_ABI void __trampoline_setup(uint32_t *trampOnStack,
   __clear_cache(trampOnStack, &trampOnStack[10]);
 }
 #endif // __powerpc__ && !defined(__powerpc64__)
-
-// The AArch64 compiler generates calls to __trampoline_setup() when creating
-// trampoline functions on the stack for use with nested functions.
-// This function creates a custom 36-byte trampoline function on the stack
-// which loads x18 with a pointer to the outer function's locals
-// and then jumps to the target nested function.
-// Note: x18 is a reserved platform register on Windows and macOS.
-
-#if defined(__aarch64__) && defined(__ELF__)
-COMPILER_RT_ABI void __trampoline_setup(uint32_t *trampOnStack,
-                                        int trampSizeAllocated,
-                                        const void *realFunc, void *localsPtr) {
-  // This should never happen, but if compiler did not allocate
-  // enough space on stack for the trampoline, abort.
-  if (trampSizeAllocated < 36)
-    compilerrt_abort();
-
-  // create trampoline
-  // Load realFunc into x17. mov/movk 16 bits at a time.
-  trampOnStack[0] =
-      0xd2800000u | ((((uint64_t)realFunc >> 0) & 0xffffu) << 5) | 0x11;
-  trampOnStack[1] =
-      0xf2a00000u | ((((uint64_t)realFunc >> 16) & 0xffffu) << 5) | 0x11;
-  trampOnStack[2] =
-      0xf2c00000u | ((((uint64_t)realFunc >> 32) & 0xffffu) << 5) | 0x11;
-  trampOnStack[3] =
-      0xf2e00000u | ((((uint64_t)realFunc >> 48) & 0xffffu) << 5) | 0x11;
-  // Load localsPtr into x18
-  trampOnStack[4] =
-      0xd2800000u | ((((uint64_t)localsPtr >> 0) & 0xffffu) << 5) | 0x12;
-  trampOnStack[5] =
-      0xf2a00000u | ((((uint64_t)localsPtr >> 16) & 0xffffu) << 5) | 0x12;
-  trampOnStack[6] =
-      0xf2c00000u | ((((uint64_t)localsPtr >> 32) & 0xffffu) << 5) | 0x12;
-  trampOnStack[7] =
-      0xf2e00000u | ((((uint64_t)localsPtr >> 48) & 0xffffu) << 5) | 0x12;
-  trampOnStack[8] = 0xd61f0220; // br x17
-
-  // Clear instruction cache.
-  __clear_cache(trampOnStack, &trampOnStack[9]);
-}
-#endif // defined(__aarch64__) && !defined(__APPLE__) && !defined(_WIN64)
diff --git a/compiler-rt/test/builtins/Unit/trampoline_setup_test.c b/compiler-rt/test/builtins/Unit/trampoline_setup_test.c
index d51d35acaa02f1..da115fe7642718 100644
--- a/compiler-rt/test/builtins/Unit/trampoline_setup_test.c
+++ b/compiler-rt/test/builtins/Unit/trampoline_setup_test.c
@@ -7,7 +7,7 @@
 
 /*
  * Tests nested functions
- * The ppc and aarch64 compilers generates a call to __trampoline_setup
+ * The ppc compiler generates a call to __trampoline_setup
  * The i386 and x86_64 compilers generate a call to ___enable_execute_stack
  */
 
diff --git a/flang/lib/Optimizer/CodeGen/BoxedProcedure.cpp b/flang/lib/Optimizer/CodeGen/BoxedProcedure.cpp
index 26f4aee21d8bda..f402404121da08 100644
--- a/flang/lib/Optimizer/CodeGen/BoxedProcedure.cpp
+++ b/flang/lib/Optimizer/CodeGen/BoxedProcedure.cpp
@@ -274,10 +274,10 @@ class BoxedProcedurePass
             auto loc = embox.getLoc();
             mlir::Type i8Ty = builder.getI8Type();
             mlir::Type i8Ptr = builder.getRefType(i8Ty);
-            // For AArch64, PPC32 and PPC64, the thunk is populated by a call to
+            // For PPC32 and PPC64, the thunk is populated by a call to
             // __trampoline_setup, which is defined in
             // compiler-rt/lib/builtins/trampoline_setup.c and requires the
-            // thunk size greater than 32 bytes.  For RISCV and x86_64, the
+            // thunk size greater than 32 bytes.  For Aarch64, RISCV and x86_64, the
             // thunk setup doesn't go through __trampoline_setup and fits in 32
             // bytes.
             fir::SequenceType::Extent thunkSize = triple.getTrampolineSize();
diff --git a/llvm/lib/Target/AArch64/AArch64CallingConvention.td b/llvm/lib/Target/AArch64/AArch64CallingConvention.td
index 7cca6d9bc6b9c3..8355463dea94ea 100644
--- a/llvm/lib/Target/AArch64/AArch64CallingConvention.td
+++ b/llvm/lib/Target/AArch64/AArch64CallingConvention.td
@@ -28,6 +28,12 @@ class CCIfSubtarget<string F, CCAction A>
 //===----------------------------------------------------------------------===//
 
 defvar AArch64_Common = [
+  // The 'nest' parameter, if any, is passed in X15.
+  // The previous register used here (X18) is also defined to be unavailable
+  // for this purpose, while all of X9-X15 were defined to be free for LLVM to
+  // use for this, so use X15 (which LLVM often already clobbers anyways).
+  CCIfNest<CCAssignToReg<[X15]>>,
+
   CCIfType<[iPTR], CCBitConvertToType<i64>>,
   CCIfType<[v2f32], CCBitConvertToType<v2i32>>,
   CCIfType<[v2f64, v4f32], CCBitConvertToType<v2i64>>,
@@ -117,16 +123,12 @@ defvar AArch64_Common = [
 ];
 
 let Entry = 1 in
-def CC_AArch64_AAPCS : CallingConv<!listconcat(
-  // The 'nest' parameter, if any, is passed in X18.
-  // Darwin and Windows use X18 as the platform register and hence 'nest' isn't
-  // currently supported there.
-  [CCIfNest<CCAssignToReg<[X18]>>],
-  AArch64_Common
-)>;
+def CC_AArch64_AAPCS : CallingConv<AArch64_Common>;
 
 let Entry = 1 in
 def RetCC_AArch64_AAPCS : CallingConv<[
+  CCIfNest<CCAssignToReg<[X15]>>,
+
   CCIfType<[iPTR], CCBitConvertToType<i64>>,
   CCIfType<[v2f32], CCBitConvertToType<v2i32>>,
   CCIfType<[v2f64, v4f32], CCBitConvertToType<v2i64>>,
@@ -177,6 +179,8 @@ def CC_AArch64_Win64_VarArg : CallingConv<[
 // a stack layout compatible with the x64 calling convention.
 let Entry = 1 in
 def CC_AArch64_Arm64EC_VarArg : CallingConv<[
+  CCIfNest<CCAssignToReg<[X15]>>,
+
   // Convert small floating-point values to integer.
   CCIfType<[f16, bf16], CCBitConvertToType<i16>>,
   CCIfType<[f32], CCBitConvertToType<i32>>,
@@ -295,6 +299,8 @@ def CC_AArch64_Arm64EC_Thunk_Native : CallingConv<[
 
 let Entry = 1 in
 def RetCC_AArch64_Arm64EC_Thunk : CallingConv<[
+  CCIfNest<CCAssignToReg<[X15]>>,
+
   // The X86-Win64 calling convention always returns __m64 values in RAX.
   CCIfType<[x86mmx], CCBitConvertToType<i64>>,
 
@@ -353,6 +359,8 @@ def RetCC_AArch64_Arm64EC_CFGuard_Check : CallingConv<[
 //     + Stack slots are sized as needed rather than being at least 64-bit.
 let Entry = 1 in
 def CC_AArch64_DarwinPCS : CallingConv<[
+  CCIfNest<CCAssignToReg<[X15]>>,
+
   CCIfType<[iPTR], CCBitConvertToType<i64>>,
   CCIfType<[v2f32], CCBitConvertToType<v2i32>>,
   CCIfType<[v2f64, v4f32, f128], CCBitConvertToType<v2i64>>,
@@ -427,6 +435,8 @@ def CC_AArch64_DarwinPCS : CallingConv<[
 
 let Entry = 1 in
 def CC_AArch64_DarwinPCS_VarArg : CallingConv<[
+  CCIfNest<CCAssignToReg<[X15]>>,
+
   CCIfType<[iPTR], CCBitConvertToType<i64>>,
   CCIfType<[v2f32], CCBitConvertToType<v2i32>>,
   CCIfType<[v2f64, v4f32, f128], CCBitConvertToType<v2i64>>,
@@ -450,6 +460,8 @@ def CC_AArch64_DarwinPCS_VarArg : CallingConv<[
 // same as the normal Darwin VarArgs handling.
 let Entry = 1 in
 def CC_AArch64_DarwinPCS_ILP32_VarArg : CallingConv<[
+  CCIfNest<CCAssignToReg<[X15]>>,
+
   CCIfType<[v2f32], CCBitConvertToType<v2i32>>,
   CCIfType<[v2f64, v4f32, f128], CCBitConvertToType<v2i64>>,
 
@@ -494,6 +506,8 @@ def CC_AArch64_DarwinPCS_ILP32_VarArg : CallingConv<[
 
 let Entry = 1 in
 def CC_AArch64_GHC : CallingConv<[
+  CCIfNest<CCAssignToReg<[X15]>>,
+
   CCIfType<[iPTR], CCBitConvertToType<i64>>,
 
   // Handle all vector types as either f64 or v2f64.
@@ -523,6 +537,7 @@ def CC_AArch64_Preserve_None : CallingConv<[
   // We can pass arguments in all general registers, except:
   // - X8, used for sret
   // - X16/X17, used by the linker as IP0/IP1
+  // - X15, the nest register and used by Windows for stack allocation
   // - X18, the platform register
   // - X19, the base pointer
   // - X29, the frame pointer
@@ -533,6 +548,7 @@ def CC_AArch64_Preserve_None : CallingConv<[
   // normal functions without saving and reloading arguments.
   // X9 is assigned last as it is used in FrameLowering as the first
   // choice for a scratch register.
+  CCIfNest<CCAssignToReg<[X15]>>,
   CCIfType<[i32], CCAssignToReg<[W20, W21, W22, W23,
                                  W24, W25, W26, W27, W28,
                                  W0, W1, W2, W3, W4, W5,
@@ -544,12 +560,6 @@ def CC_AArch64_Preserve_None : CallingConv<[
                                  X6, X7, X10, X11,
                                  X12, X13, X14, X9]>>,
 
-  // Windows uses X15 for stack allocation
-  CCIf<"!State.getMachineFunction().getSubtarget<AArch64Subtarget>().isTargetWindows()",
-    CCIfType<[i32], CCAssignToReg<[W15]>>>,
-  CCIf<"!State.getMachineFunction().getSubtarget<AArch64Subtarget>().isTargetWindows()",
-    CCIfType<[i64], CCAssignToReg<[X15]>>>,
-
   CCDelegateTo<CC_AArch64_AAPCS>
 ]>;
 
diff --git a/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp b/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
index d3abd79b85a75f..ced3ff7b742ad1 100644
--- a/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
@@ -2044,6 +2044,25 @@ void AArch64FrameLowering::emitPrologue(MachineFunction &MF,
           : 0;
 
   if (windowsRequiresStackProbe(MF, NumBytes + RealignmentPadding)) {
+    // Find an available register to store value of VG to.
+    unsigned X15Scratch = AArch64::NoRegister;
+    if (LiveRegs.contains(AArch64::X15)) {
+        // if (llvm::any_of(
+        //         MBB.liveins(),
+        //         [&STI](const MachineBasicBlock::RegisterMaskPair &LiveIn) {
+        //           return STI.getRegisterInfo()->isSuperOrSubRegisterEq(
+        //               AArch64::X15, LiveIn.PhysReg);
+        //         }))
+      X15Scratch = findScratchNonCalleeSaveRegister(&MBB);
+      assert(X15Scratch != AArch64::NoRegister);
+      LiveRegs.removeReg(AArch64::X15); // ignore X15 since we restore it
+      BuildMI(MBB, MBBI, DL, TII->get(AArch64::ORRXrr), X15Scratch)
+          .addReg(AArch64::XZR)
+          .addReg(AArch64::X15, RegState::Undef)
+          .addReg(AArch64::X15, RegState::Implicit)
+          .setMIFlag(MachineInstr::FrameSetup);
+    }
+
     uint64_t NumWords = (NumBytes + RealignmentPadding) >> 4;
     if (NeedsWinCFI) {
       HasWinCFI = true;
@@ -2166,6 +2185,13 @@ void AArch64FrameLowering::emitPrologue(MachineFunction &MF,
       // we've set a frame pointer and already finished the SEH prologue.
       assert(!NeedsWinCFI);
     }
+    if (X15Scratch != AArch64::NoRegister) {
+      BuildMI(MBB, MBBI, DL, TII->get(AArch64::ORRXrr), AArch64::X15)
+          .addReg(AArch64::XZR)
+          .addReg(X15Scratch, RegState::Undef)
+          .addReg(X15Scratch, RegState::Implicit)
+          .setMIFlag(MachineInstr::FrameSetup);
+    }
   }
 
   StackOffset SVECalleeSavesSize = {}, SVELocalsSize = SVEStackSize;
diff --git a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
index 0d1608a97bfd30..1404077446420d 100644
--- a/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
+++ b/llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
@@ -7290,59 +7290,66 @@ static SDValue LowerFLDEXP(SDValue Op, SelectionDAG &DAG) {
 
 SDValue AArch64TargetLowering::LowerADJUST_TRAMPOLINE(SDValue Op,
                                                       SelectionDAG &DAG) const {
-  // Note: x18 cannot be used for the Nest parameter on Windows and macOS.
-  if (Subtarget->isTargetDarwin() || Subtarget->isTargetWindows())
-    report_fatal_error(
-        "ADJUST_TRAMPOLINE operation is only supported on Linux.");
-
   return Op.getOperand(0);
 }
 
 SDValue AArch64TargetLowering::LowerINIT_TRAMPOLINE(SDValue Op,
                                                     SelectionDAG &DAG) const {
-
-  // Note: x18 cannot be used for the Nest parameter on Windows and macOS.
-  if (Subtarget->isTargetDarwin() || Subtarget->isTargetWindows())
-    report_fatal_error("INIT_TRAMPOLINE operation is only supported on Linux.");
-
   SDValue Chain = Op.getOperand(0);
-  SDValue Trmp = Op.getOperand(1); // trampoline
+  SDValue Trmp = Op.getOperand(1); // trampoline, 36 bytes
   SDValue FPtr = Op.getOperand(2); // nested function
   SDValue Nest = Op.getOperand(3); // 'nest' parameter value
-  SDLoc dl(Op);
 
-  EVT PtrVT = getPointerTy(DAG.getDataLayout());
-  Type *IntPtrTy = DAG.getDataLayout().getIntPtrType(*DAG.getContext());
-
-  TargetLowering::ArgListTy Args;
-  TargetLowering::ArgListEntry Entry;
+  const Value *TrmpAddr = cast<SrcValueSDNode>(Op.getOperand(4))->getValue();
 
-  Entry.Ty = IntPtrTy;
-  Entry.Node = Trmp;
-  Args.push_back(Entry);
+  // ldr x15, .+16
+  // ldr x17, .+20
+  // br x17
+  // 0
+  // .nest: .qword nest
+  // .fptr: .qword fptr
+  SDValue OutChains[5];
 
-  if (auto *FI = dyn_cast<FrameIndexSDNode>(Trmp.getNode())) {
-    MachineFunction &MF = DAG.getMachineFunction();
-    MachineFrameInfo &MFI = MF.getFrameInfo();
-    Entry.Node =
-        DAG.getConstant(MFI.getObjectSize(FI->getIndex()), dl, MVT::i64);
-  } else
-    Entry.Node = DAG.getConstant(36, dl, MVT::i64);
+  const char X15 = 0x0f;
+  const char X17 = 0x11;
 
-  Args.push_back(Entry);
-  Entry.Node = FPtr;
-  Args.push_back(Entry);
-  Entry.Node = Nest;
-  Args.push_back(Entry);
+  SDValue Addr = Trmp;
 
-  // Lower to a call to __trampoline_setup(Trmp, TrampSize, FPtr, ctx_reg)
-  TargetLowering::CallLoweringInfo CLI(DAG);
-  CLI.setDebugLoc(dl).setChain(Chain).setLibCallee(
-      CallingConv::C, Type::getVoidTy(*DAG.getContext()),
-      DAG.getExternalSymbol("__trampoline_setup", PtrVT), std::move(Args));
+  SDLoc dl(Op);
+  OutChains[0] =
+      DAG.getStore(Chain, dl, DAG.getConstant(0x58000080u | X15, dl, MVT::i32), Addr,
+                   MachinePointerInfo(TrmpAddr));
 
-  std::pair<SDValue, SDValue> CallResult = LowerCallTo(CLI);
-  return CallResult.second;
+  Addr = DAG.getNode(ISD::ADD, dl, MVT::i64, Trmp,
+                     DAG.getConstant(4, dl, MVT::i64));
+  OutChains[1] =
+      DAG.getStore(Chain, dl, DAG.getConstant(0x580000b0u | X17, dl, MVT::i32), Addr,
+                   MachinePointerInfo(TrmpAddr, 4));
+
+  Addr = DAG.getNode(ISD::ADD, dl, MVT::i64, Trmp,
+                     DAG.getConstant(8, dl, MVT::i64));
+  OutChains[2] =
+      DAG.getStore(Chain, dl, DAG.getConstant(0xd61f0220u, dl, MVT::i32), Addr,
+                   MachinePointerInfo(TrmpAddr, 8));
+
+  Addr = DAG.getNode(ISD::ADD, dl, MVT::i64, Trmp,
+                     DAG.getConstant(16, dl, MVT::i64));
+  OutChains[3] =
+      DAG.getStore(Chain, dl, Nest, Addr, MachinePointerInfo(TrmpAddr, 16));
+
+  Addr = DAG.getNode(ISD::ADD, dl, MVT::i64, Trmp,
+                     DAG.getConstant(24, dl, MVT::i64));
+  OutChains[4] =
+      DAG.getStore(Chain, dl, FPtr, Addr, MachinePointerInfo(TrmpAddr, 24));
+
+  SDValue StoreToken = DAG.getNode(ISD::TokenFactor, dl, MVT::Other, OutChains);
+
+  SDValue EndOfTrmp = DAG.getNode(ISD::ADD, dl, MVT::i64, Trmp,
+                     DAG.getConstant(12, dl, MVT::i64));
+
+  // Call clear cache on the trampoline instructions.
+  return DAG.getNode(ISD::CLEAR_CACHE, dl, MVT::Other, StoreToken,
+                              Trmp, EndOfTrmp);
 }
 
 SDValue AArch64TargetLowering::LowerOperation(SDValue Op,
diff --git a/llvm/test/CodeGen/AArch64/nest-register.ll b/llvm/test/CodeGen/AArch64/nest-register.ll
index 1e1c1b044bab65..2e94dfba1fa523 100644
--- a/llvm/test/CodeGen/AArch64/nest-register.ll
+++ b/llvm/test/CodeGen/AArch64/nest-register.ll
@@ -1,3 +1,4 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
 ; RUN: llc -disable-post-ra -verify-machineinstrs < %s -mtriple=aarch64-none-linux-gnu | FileCheck %s
 
 ; Tests that the 'nest' parameter attribute causes the relevant parameter to be
@@ -5,18 +6,21 @@
 
 define ptr @nest_receiver(ptr nest %arg) nounwind {
 ; CHECK-LABEL: nest_receiver:
-; CHECK-NEXT: // %bb.0:
-; CHECK-NEXT: mov x0, x18
-; CHECK-NEXT: ret
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    mov x0, x15
+; CHECK-NEXT:    ret
 
   ret ptr %arg
 }
 
 define ptr @nest_caller(ptr %arg) nounwind {
 ; CHECK-LABEL: nest_caller:
-; CHECK: mov x18, x0
-; CHECK-NEXT: bl nest_receiver
-; CHECK: ret
+; CHECK:       // %bb.0:
+; CHECK-NEXT:    str x30, [sp, #-16]! // 8-byte Folded Spill
+; CHECK-NEXT:    mov x15, x0
+; CHECK-NEXT:    bl nest_receiver
+; CHECK-NEXT:    ldr x30, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT:    ret
 
   %result = call ptr @nest_receiver(ptr nest %arg)
   ret ptr %result
diff --git a/llvm/test/CodeGen/AArch64/preserve_nonecc_call.ll b/llvm/test/CodeGen/AArch64/preserve_nonecc_call.ll
index 9b9717c19321e7..e0d7b5abe7bea2 100644
--- a/llvm/test/CodeGen/AArch64/preserve_nonecc_call.ll
+++ b/llvm/test/CodeGen/AArch64/preserve_nonecc_call.ll
@@ -184,10 +184,11 @@ declare preserve_nonecc i64 @callee_with_many_param2(i64 %a1, i64 %a2, i64 %a3,
 define preserve_nonecc i64 @callee_with_many_param(i64 %a1, i64 %a2, i64 %a3, i64 %a4, i64 %a5, i64 %a6, i64 %a7, i64 %a8, i64 %a9, i64 %a10, i64 %a11, i64 %a12, i64 %a13, i64 %a14, i64 %a15, i64 %a16, i64 %a17, i64 %a18, i64 %a19, i64 %a20, i64 %a21, i64 %a22, i64 %a23, i64 %a24) {
 ; CHECK-LABEL: callee_with_many_param:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    str x30, [sp, #-16]! // 8-byte Folded Spill
-; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    sub sp, sp, #32
+; CHECK-NEXT:    str x30, [sp, #16] // 8-byte Folded Spill
+; CHECK-NEXT:    .cfi_def_cfa_offset 32
 ; CHECK-NEXT:    .cfi_offset w30, -16
-; CHECK-NEXT:    mov x8, x15
+; CHECK-NEXT:    ldr x8, [sp, #32]
 ; CHECK-NEXT:    mov x15, x20
 ; CHECK-NEXT:    mov x20, x21
 ; CHECK-NEXT:    mov x21, x22
@@ -212,17 +213,20 @@ define preserve_nonecc i64 @callee_with_many_param(i64 %a1, i64 %a2, i64 %a3, i6
 ; CHECK-NEXT:    mov x13, x14
 ; CHECK-NEXT:    mov x14, x9
 ; CHECK-NEXT:    mov x9, x8
+; CHECK-NEXT:    str x15, [sp]
 ; CHECK-NEXT:    bl callee_with_many_param2
-; CHECK-NEXT:    ldr x30, [sp], #16 // 8-byte Folded Reload
+; CHECK-NEXT:    ldr x30, [sp, #16] // 8-byte Folded Reload
+; CHECK-NEXT:    add sp, sp, #32
 ; CHECK-NEXT:    ret
 ;
 ; DARWIN-LABEL: callee_with_many_param:
 ; DARWIN:       ; %bb.0:
-; DARWIN-NEXT:    stp x29, x30, [sp, #-16]! ; 16-byte Folded Spill
-; DARWIN-NEXT:    .cfi_def_cfa_offset 16
+; DARWIN-NEXT:    sub sp, sp, #32
+; DARWIN-NEXT:    stp x29, x30, [sp, #16] ; 16-byte Folded Spill
+; DARWIN-NEXT:    .cfi_def_cfa_offset 32
 ; DARWIN-NEXT:    .cfi_offset w30, -8
 ; DARWIN-NEXT:    .cfi_offset w29, -16
-; DARWIN-NEXT:    mov x8, x15
+; DARWIN-NEXT:    ldr x8, [sp, #32]
 ; DARWIN-NEXT:    mov x15, x20
 ; DARWIN-NEXT:    mov x20, x21
 ; DARWIN-NEXT:    mov x21, x22
@@ -247,8 +251,10 @@ define preserve_nonecc i64 @callee_with_many_param(i64 %a1, i64 %a2, i64 %a3, i6
 ; DARWIN-NEXT:    mov x13, x14
 ; DARWIN-NEXT:    mov x14, x9
 ; DARWIN-NEXT:    mov x9, x8
+; DARWIN-NEXT:    str x15, [sp]
 ; DARWIN-NEXT:    bl _callee_with_many_param2
-; DARWIN-NEXT:    ldp x29, x30, [sp], #16 ; 16-byte Folded Reload
+; DARWIN-NEXT:    ldp x29, x30, [sp, #16] ; 16-byte Folded Reload
+; DARWIN-NEXT:    add sp, sp, #32
 ; DARWIN-NEXT:    ret
 ;
 ; WIN-LABEL: callee_with_many_param:
@@ -302,17 +308,18 @@ define preserve_nonecc i64 @callee_with_many_param(i64 %a1, i64 %a2, i64 %a3, i6
 define i64 @caller3() {
 ; CHECK-LABEL: caller3:
 ; CHECK:       // %bb.0:
-; CHECK-NEXT:    stp d15, d14, [sp, #-160]! // 16-byte Folded Spill
-; CHECK-NEXT:    stp d13, d12, [sp, #16] // 16-byte Folded Spill
-; CHECK-NEXT:    stp d11, d10, [sp, #32] // 16-byte Fold...
[truncated]

github-actions · 2025-02-11T15:35:18Z

✅ With the latest revision this PR passed the C/C++ code formatter.

vzakhari · 2025-02-12T00:40:04Z

flang/lib/Optimizer/CodeGen/BoxedProcedure.cpp

            // __trampoline_setup, which is defined in
            // compiler-rt/lib/builtins/trampoline_setup.c and requires the
-            // thunk size greater than 32 bytes.  For RISCV and x86_64, the
+            // thunk size greater than 32 bytes.  For Aarch64, RISCV and x86_64, the
            // thunk setup doesn't go through __trampoline_setup and fits in 32


Is the 32-byte comment correct for AArch64? Below, it appears to say 36 bytes.

The trampoline is now smaller than it was before (28 bytes used, with 4 padding bytes), so while 36 is still acceptable, 32 is correct.

vtjnash · 2025-02-12T15:44:45Z

This ended up growing too much, so it might need to be split: I kept finding more apparent calling convention mistakes, both in the previous trampoline implementation and in the preservemost definitions in the .td file. Those definitions conflicted with the requirements of AArch64FrameLowering.cpp, which assumes that X9 and X15 are free to use in the prologue (i.e. they must be caller-saved, except for preserve-none). It would be great if someone could confirm my reading of the code, however!

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

ceseo · 2025-02-20T21:24:13Z

@vtjnash I don't have any tests. This is mostly used by some Fortran HPC applications. Maybe somebody in the original issue can point you to a real-world program.

The idea was to use this implementation in compiler-rt as a workaround until we could solve the Fortran problem using procedure pointers, but I don't know if anyone is looking into this.

ceseo · 2025-02-28T14:56:54Z

@kiranchandramohan could you (or someone at Arm) please take a look at this?

efriedma-quic · 2025-02-28T20:06:56Z

llvm/docs/LangRef.rst

-target-specific knowledge. The ``func`` argument must hold a function.
+target-specific knowledge.
+
+The ``func`` argument must be a constant (potentially bitcasted) pointer to a


How are bitcasts relevant here?

Probably we should tighten the verifier check to just require a function. And if some target eventually needs trampolines where any of the operands is in a non-zero address-space, we can make the intrinsic overloaded.

Probably we also should tighten llvm::canReplaceOperandWithVariable.

And please land this separately.

It probably isn't of much relevance since opaque pointers. I think it is also a fairly silly restriction, since LLVM doesn't care at all about the source of the value here; it just needs to know which calling convention it has (to know which register to use). That info could probably just as easily be passed as a separate immarg value, since calling conventions are defined in the LangRef to be numeric.

That all makes sense.

efriedma-quic · 2025-02-28T20:10:51Z

llvm/test/CodeGen/AArch64/win64cc-x18.ll

@@ -1,35 +1,26 @@
-; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+;; Testing that nest uses x15 on all calling conventions (except Arm64EC)


What's the interaction with arm64ec?

The Arm64EC calling convention specifically says nest must be passed in X4, which this PR now implements correctly (consistent with AArch64CallingConvention.td).

Oh, you mean the CC_AArch64_Arm64EC_Thunk convention. Probably should make the comment specifically mention you're referring to thunks.

I guess to make trampolines work on arm64ec, the trampoline itself would actually have to be x86-64 code... which can't even access x15, so yeah, x4 makes sense. The thunk would then copy from x4 to x15. I don't expect you to implement this, but maybe stick a report_fatal_error somewhere if someone tries to use nest with arm64ec.

efriedma-quic · 2025-02-28T20:31:15Z

llvm/lib/Target/AArch64/AArch64CallingConvention.td

@@ -523,6 +537,7 @@ def CC_AArch64_Preserve_None : CallingConv<[
  // We can pass arguments in all general registers, except:
  // - X8, used for sret
  // - X16/X17, used by the linker as IP0/IP1
+  // - X15, the nest register and used by Windows for stack allocation


Instead of reducing the utility of preservenone/preservemost/etc., can we just forbid using "nest" arguments with them? I can't see why you'd want to use them together.

That sounds reasonable. If I read the code correctly, the comment about "X15 for stack allocation" is also not quite accurate: the code doesn't necessarily have to clobber X15, since it is implemented to use a temp register first (though such a register does need to be made available for it). Other comments here seem contradictory as well. X9 is assigned last "because it is needed as a scratch register", but either it is needed as a scratch register (in which case it should not have been allowed to be assigned), or it is not (in which case the comment is wrong, since it never gets used as one).

Right, the x15 thing isn't a fundamental limitation; it's just a bit complicated to generate the correct code, and the author of the preservenonecc code didn't want to try to implement it.

For the x9 thing, see #99434 .

efriedma-quic · 2025-02-28T20:45:53Z

llvm/lib/Target/AArch64/AArch64CallingConvention.td


 let Entry = 1 in
 def RetCC_AArch64_AAPCS : CallingConv<[
+  CCIfNest<CCAssignToReg<[X15]>>,


A return value can't be "nest"?

efriedma-quic

(I wrote these comments a while back, but forgot to actually post them.)

efriedma-quic · 2025-03-03T22:33:35Z

llvm/docs/LangRef.rst

-target-specific knowledge. The ``func`` argument must hold a function.
+target-specific knowledge.
+
+The ``func`` argument must be a constant (potentially bitcasted) pointer to a


That all makes sense.

efriedma-quic · 2025-03-03T22:43:37Z

llvm/lib/Target/AArch64/AArch64CallingConvention.td

@@ -523,6 +537,7 @@ def CC_AArch64_Preserve_None : CallingConv<[
  // We can pass arguments in all general registers, except:
  // - X8, used for sret
  // - X16/X17, used by the linker as IP0/IP1
+  // - X15, the nest register and used by Windows for stack allocation


Right, the x15 thing isn't a fundamental limitation; it's just a bit complicated to generate the correct code, and the author of the preservenonecc code didn't want to try to implement it.

For the x9 thing, see #99434 .

efriedma-quic · 2025-03-03T22:56:04Z

llvm/test/CodeGen/AArch64/win64cc-x18.ll

@@ -1,35 +1,26 @@
-; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
+;; Testing that nest uses x15 on all calling conventions (except Arm64EC)


Oh, you mean the CC_AArch64_Arm64EC_Thunk convention. Probably should make the comment specifically mention you're referring to thunks.

I guess to make trampolines work on arm64ec, the trampoline itself would actually have to be x86-64 code... which can't even access x15, so yeah, x4 makes sense. The thunk would then copy from x4 to x15. I don't expect you to implement this, but maybe stick a report_fatal_error somewhere if someone tries to use nest with arm64ec.
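The guard suggested above could look roughly like the following. This is a self-contained model, not code from the patch: `TargetKind`, `ArgInfo`, and `assignNestRegister` are hypothetical names invented for illustration, and the exception stands in for the `report_fatal_error` call that real LLVM code would presumably use.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for the real LLVM types; none of these names
// come from the patch.
enum class TargetKind { AArch64, Arm64EC };

struct ArgInfo {
  bool IsNest; // models the `nest` parameter attribute
};

// Returns the register a `nest` argument would be assigned to, or an
// empty string when the argument is not `nest`. Throws on Arm64EC to
// model "fail loudly instead of silently miscompiling".
inline std::string assignNestRegister(TargetKind Target, const ArgInfo &Arg) {
  if (!Arg.IsNest)
    return ""; // not a nest argument: nothing to assign here
  if (Target == TargetKind::Arm64EC)
    throw std::runtime_error("nest arguments are not supported on Arm64EC");
  return "X15"; // the register this PR chooses for nest
}
```

Calling `assignNestRegister(TargetKind::AArch64, ArgInfo{true})` yields `"X15"`, while the same call with `TargetKind::Arm64EC` throws. Where such a check would actually live in the AArch64 backend is left open here.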

llvmbot added the compiler-rt, backend:AArch64, compiler-rt:builtins, flang, flang:fir-hlfir, and flang:codegen labels Feb 11, 2025

vchuravy added the julialang label Feb 11, 2025

vzakhari reviewed Feb 12, 2025

vtjnash added 3 commits February 12, 2025 09:32

fixup! [AArch64] fix trampoline implementation: use X15

f6122fa

fix langref and Arm64EC CallingConvention

9aa0553

fix more langref mistakes with aarch64 calling convention

78745ed

The calling convention previously stated that X9 and X15 were callee-preserved, but the implementation in AArch64FrameLowering.cpp uses them as scratch registers and does not actually preserve them.

llvmbot added the llvm:ir label Feb 12, 2025

vzakhari reviewed Feb 12, 2025

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

fixup! [AArch64] fix trampoline implementation: use X15

3e53925

kiranchandramohan requested review from stuij, ostannard, davemgreen, nasherm, smithp35 and efriedma-quic February 28, 2025 15:05

efriedma-quic reviewed Feb 28, 2025

efriedma-quic reviewed Apr 16, 2025
