[IR] Add llvm.vector.[de]interleave{4,6,8} #139893

Open: wants to merge 6 commits into base: main

lukel97 (Contributor) commented May 14, 2025

This adds [de]interleave intrinsics for factors 4, 6 and 8, so that every interleaved memory operation supported by the in-tree targets can be represented by a single intrinsic.

For context, [de]interleaves of fixed-length vectors are represented by a series of shufflevectors. The intrinsics are needed for scalable vectors, and we don't currently scalably vectorize all possible factors of interleave groups supported by RISC-V/AArch64.
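
To make the fixed-length case concrete, here is a minimal standalone C++ sketch of the strided lane pattern that such a shufflevector mask follows. It is not LLVM code: `deinterleaveMask` and its parameters are hypothetical names chosen for illustration, and the sketch only assumes that field `i` of a factor-`F` deinterleave reads lanes `i, i+F, i+2F, ...` of the wide vector.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper, for illustration only: the lane indices that field
// `Field` of a factor-`Factor` deinterleave reads from the wide vector.
std::vector<int> deinterleaveMask(unsigned Factor, unsigned Field,
                                  unsigned LanesPerField) {
  assert(Field < Factor && "field index out of range");
  std::vector<int> Mask;
  // Consecutive result lanes are exactly `Factor` apart in the wide vector.
  for (unsigned Lane = 0; Lane < LanesPerField; ++Lane)
    Mask.push_back(static_cast<int>(Field + Lane * Factor));
  return Mask;
}
```

Under these assumptions, `deinterleaveMask(4, 1, 2)` gives the indices 1 and 5, i.e. the second of four `<2 x i32>` fields taken from an `<8 x i32>`. A constant stride like this is what makes the fixed-length masks comparatively easy to recognize.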

The underlying reason for this is that higher factors are currently represented by interleaving multiple interleaves, which made sense at the time of the discussion in #89018.
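
As an illustration of that recursive representation, the following standalone C++ sketch models a factor-4 deinterleave as a tree of factor-2 deinterleaves and compares it against a direct factor-4 reference. The function names and the `Vec` alias are hypothetical; this models only the lane semantics on plain `std::vector`s, not the IR or any pass.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<int>;

// Reference semantics of a factor-N deinterleave: result i collects lanes
// i, i + N, i + 2N, ... of the wide input.
std::vector<Vec> deinterleave(const Vec &Wide, std::size_t N) {
  assert(N != 0 && Wide.size() % N == 0 && "length must be a multiple of N");
  std::vector<Vec> Fields(N);
  for (std::size_t J = 0; J < Wide.size(); ++J)
    Fields[J % N].push_back(Wide[J]);
  return Fields;
}

// Factor 4 as a tree of factor-2 deinterleaves. Note the field order: the
// even half yields fields 0 and 2, the odd half yields fields 1 and 3.
std::vector<Vec> deinterleave4AsTree(const Vec &Wide) {
  const std::vector<Vec> Halves = deinterleave(Wide, 2);
  const std::vector<Vec> Even = deinterleave(Halves[0], 2);
  const std::vector<Vec> Odd = deinterleave(Halves[1], 2);
  return {Even[0], Odd[0], Even[1], Odd[1]};
}
```

The reordering in the final `return` is the kind of bookkeeping a pattern matcher has to undo when it walks such a tree, which a single factor-4 operation avoids.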

But after trying to integrate these for higher factors on RISC-V I think we should revisit this design choice:

  • Matching these in InterleavedAccessPass is non-trivial: we currently only support factors that are a power of 2, and detecting this requires a good chunk of code
  • The shufflevector masks used for [de]interleaves of fixed-length vectors are much easier to pattern match as they are strided patterns, but the intrinsics are much more complicated to match as the structure is a tree.
  • Unlike shufflevectors, there's no optimisation that happens on [de]interleave2 intrinsics
  • For non-power-of-2 factors e.g. 6, there are multiple possible ways a [de]interleave could be represented, see the discussion in [IA] Add support for [de]interleave{3,5,7} #139373
  • We already have intrinsics for 2,3,5 and 7, so by avoiding 4,6 and 8 we're not really saving much
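
The ambiguity for factor 6 can be sketched in standalone C++. The two helper functions below are hypothetical and only model lane semantics on `std::vector`s: one builds a factor-6 interleave as an interleave2 of two interleave3s, the other as an interleave3 of three interleave2s. Both produce the same lanes, so a matcher working on trees would have to recognize each shape.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<int>;

// Reference semantics of a factor-N interleave: lane J of input I lands at
// position J * N + I of the result. All inputs must have the same length.
Vec interleave(const std::vector<Vec> &Inputs) {
  assert(!Inputs.empty() && "need at least one input");
  const std::size_t N = Inputs.size(), Len = Inputs[0].size();
  Vec Result(N * Len);
  for (std::size_t I = 0; I < N; ++I) {
    assert(Inputs[I].size() == Len && "inputs must have equal length");
    for (std::size_t J = 0; J < Len; ++J)
      Result[J * N + I] = Inputs[I][J];
  }
  return Result;
}

// Shape 1: interleave2 of two interleave3s.
Vec interleave6As2x3(const Vec &A, const Vec &B, const Vec &C, const Vec &D,
                     const Vec &E, const Vec &F) {
  return interleave({interleave({A, C, E}), interleave({B, D, F})});
}

// Shape 2: interleave3 of three interleave2s.
Vec interleave6As3x2(const Vec &A, const Vec &B, const Vec &C, const Vec &D,
                     const Vec &E, const Vec &F) {
  return interleave({interleave({A, D}), interleave({B, E}), interleave({C, F})});
}
```

Note that the two shapes pair up the operands differently (`A, C, E` with `B, D, F` versus `A, D`, `B, E`, `C, F`), so even the leaf order differs between the equivalent trees.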

By representing these higher factors as interleaved interleaves, we can in theory support arbitrarily high interleave factors. However, I'm not sure this is actually needed in practice: SVE only has instructions for factors 2, 3 and 4, whilst RVV only supports up to factor 8.

This patch would make it much easier to support scalable interleaved accesses in the loop vectorizer for RISC-V for factors 3,5,6 and 7, as the loop vectorizer and InterleavedAccessPass wouldn't need to construct and match trees of interleaves.

For interleave factors above 8, for which there are no hardware memory operations to match in the InterleavedAccessPass, we can still keep the wide load + recursive interleaving in the loop vectorizer.

If people agree with the direction, I would post these patches to follow up:

  • Lower [de]interleave{3,4} on AArch64
  • Teach InterleavedAccessPass to recognize [de]interleave{4,6,8} (3,5,7 are handled by [IA] Add support for [de]interleave{3,5,7} #139373, this would be nearly identical)
  • Teach the loop vectorizer to emit a single [de]interleave intrinsic for factors 2 through 8. This would complete scalable vlsegN support on RISC-V
  • Remove the recursive [de]interleaving pattern matching from InterleavedAccessPass

llvmbot (Member) commented May 14, 2025

@llvm/pr-subscribers-llvm-selectiondag

Author: Luke Lau (lukel97)

Changes

This adds [de]interleave intrinsics for factors 4, 6 and 8, so that every interleaved memory operation supported by the in-tree targets can be represented by a single intrinsic.

For context, [de]interleaves of fixed-length vectors are represented by a series of shufflevectors. The intrinsics are needed for scalable vectors, and we don't currently scalably vectorize all possible factors of interleave groups supported by RISC-V/AArch64.

The underlying reason for this is that higher factors are currently represented by interleaving multiple interleaves, which made sense at the time of the discussion in #89018.

But after trying to integrate these for higher factors on RISC-V I think we should revisit this design choice:

  • Matching these in InterleavedAccessPass is non-trivial: we currently only support factors that are a power of 2, and detecting this requires a good chunk of code
  • The shufflevector masks used for [de]interleaves of fixed-length vectors are much easier to pattern match as they are strided patterns, but the intrinsics are much more complicated to match as the structure is a tree.
  • There's no optimisation that happens on [de]interleave2 intrinsics, so there's not much point to representing it in this form
  • For non-power-of-2 factors e.g. 6, there are multiple possible ways a [de]interleave could be represented, see the discussion in #139373
  • We already have intrinsics for 2,3,5 and 7, so by avoiding 4,6 and 8 we're not really saving much

By representing these higher factors as interleaved interleaves, we can in theory support arbitrarily high interleave factors. However, I'm not sure this is actually needed in practice: SVE only has instructions for factors 2, 3 and 4, whilst RVV only supports up to factor 8.

This patch would make it much easier to support scalable interleaved accesses in the loop vectorizer for RISC-V for factors 3,5,6 and 7, as the loop vectorizer and InterleavedAccessPass wouldn't need to construct and match trees of interleaves.

If people agree with the direction, I would post these patches to follow up:

  • Lower [de]interleave{3,4} on AArch64
  • Teach InterleavedAccessPass to recognize [de]interleave{4,6,8} (3,5,7 are handled by [IA] Add support for [de]interleave{3,5,7} #139373, this would be nearly identical)
  • Remove the recursive [de]interleaving from the loop vectorizer and instead emit a single intrinsic
  • Remove the recursive [de]interleaving pattern matching from InterleavedAccessPass

If we ever do want to end up supporting interleave factors higher than what the target natively has instructions for, we can then extend this infrastructure further. But I think it's more important that we have full support for the native capabilities first.


Patch is 777.14 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/139893.diff

9 Files Affected:

  • (modified) llvm/docs/LangRef.rst (+5-5)
  • (modified) llvm/include/llvm/IR/Intrinsics.h (+20-10)
  • (modified) llvm/include/llvm/IR/Intrinsics.td (+66)
  • (modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp (+18)
  • (modified) llvm/lib/IR/Intrinsics.cpp (+26-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-fixed.ll (+500-2)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vector-deinterleave.ll (+1620-133)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vector-interleave-fixed.ll (+761-26)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vector-interleave.ll (+9979-3877)
diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index 7296bb84b7d95..c0bc0a10ed537 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -20158,7 +20158,7 @@ Arguments:
 
 The argument to this intrinsic must be a vector.
 
-'``llvm.vector.deinterleave2/3/5/7``' Intrinsic
+'``llvm.vector.deinterleave2/3/4/5/6/7/8``' Intrinsic
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Syntax:
@@ -20176,8 +20176,8 @@ This is an overloaded intrinsic.
 Overview:
 """""""""
 
-The '``llvm.vector.deinterleave2/3/5/7``' intrinsics deinterleave adjacent lanes
-into 2, 3, 5, and 7 separate vectors, respectively, and return them as the
+The '``llvm.vector.deinterleave2/3/4/5/6/7/8``' intrinsics deinterleave adjacent lanes
+into 2 through to 8 separate vectors, respectively, and return them as the
 result.
 
 This intrinsic works for both fixed and scalable vectors. While this intrinsic
@@ -20199,7 +20199,7 @@ Arguments:
 The argument is a vector whose type corresponds to the logical concatenation of
 the aggregated result types.
 
-'``llvm.vector.interleave2/3/5/7``' Intrinsic
+'``llvm.vector.interleave2/3/4/5/6/7/8``' Intrinsic
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Syntax:
@@ -20217,7 +20217,7 @@ This is an overloaded intrinsic.
 Overview:
 """""""""
 
-The '``llvm.vector.interleave2/3/5/7``' intrinsic constructs a vector
+The '``llvm.vector.interleave2/3/4/5/6/7/8``' intrinsic constructs a vector
 by interleaving all the input vectors.
 
 This intrinsic works for both fixed and scalable vectors. While this intrinsic
diff --git a/llvm/include/llvm/IR/Intrinsics.h b/llvm/include/llvm/IR/Intrinsics.h
index 6fb1bf9359b9a..b64784909fc25 100644
--- a/llvm/include/llvm/IR/Intrinsics.h
+++ b/llvm/include/llvm/IR/Intrinsics.h
@@ -153,8 +153,11 @@ namespace Intrinsic {
       TruncArgument,
       HalfVecArgument,
       OneThirdVecArgument,
+      OneFourthVecArgument,
       OneFifthVecArgument,
+      OneSixthVecArgument,
       OneSeventhVecArgument,
+      OneEighthVecArgument,
       SameVecWidthArgument,
       VecOfAnyPtrsToElt,
       VecElementArgument,
@@ -167,8 +170,11 @@ namespace Intrinsic {
     } Kind;
 
     // These three have to be contiguous.
-    static_assert(OneFifthVecArgument == OneThirdVecArgument + 1 &&
-                  OneSeventhVecArgument == OneFifthVecArgument + 1);
+    static_assert(OneFourthVecArgument == OneThirdVecArgument + 1 &&
+                  OneFifthVecArgument == OneFourthVecArgument + 1 &&
+                  OneSixthVecArgument == OneFifthVecArgument + 1 &&
+                  OneSeventhVecArgument == OneSixthVecArgument + 1 &&
+                  OneEighthVecArgument == OneSeventhVecArgument + 1);
     union {
       unsigned Integer_Width;
       unsigned Float_Width;
@@ -188,19 +194,23 @@ namespace Intrinsic {
     unsigned getArgumentNumber() const {
       assert(Kind == Argument || Kind == ExtendArgument ||
              Kind == TruncArgument || Kind == HalfVecArgument ||
-             Kind == OneThirdVecArgument || Kind == OneFifthVecArgument ||
-             Kind == OneSeventhVecArgument || Kind == SameVecWidthArgument ||
-             Kind == VecElementArgument || Kind == Subdivide2Argument ||
-             Kind == Subdivide4Argument || Kind == VecOfBitcastsToInt);
+             Kind == OneThirdVecArgument || Kind == OneFourthVecArgument ||
+             Kind == OneFifthVecArgument || Kind == OneSixthVecArgument ||
+             Kind == OneSeventhVecArgument || Kind == OneEighthVecArgument ||
+             Kind == SameVecWidthArgument || Kind == VecElementArgument ||
+             Kind == Subdivide2Argument || Kind == Subdivide4Argument ||
+             Kind == VecOfBitcastsToInt);
       return Argument_Info >> 3;
     }
     ArgKind getArgumentKind() const {
       assert(Kind == Argument || Kind == ExtendArgument ||
              Kind == TruncArgument || Kind == HalfVecArgument ||
-             Kind == OneThirdVecArgument || Kind == OneFifthVecArgument ||
-             Kind == OneSeventhVecArgument || Kind == SameVecWidthArgument ||
-             Kind == VecElementArgument || Kind == Subdivide2Argument ||
-             Kind == Subdivide4Argument || Kind == VecOfBitcastsToInt);
+             Kind == OneThirdVecArgument || Kind == OneFourthVecArgument ||
+             Kind == OneFifthVecArgument || Kind == OneSixthVecArgument ||
+             Kind == OneSeventhVecArgument || Kind == OneEighthVecArgument ||
+             Kind == SameVecWidthArgument || Kind == VecElementArgument ||
+             Kind == Subdivide2Argument || Kind == Subdivide4Argument ||
+             Kind == VecOfBitcastsToInt);
       return (ArgKind)(Argument_Info & 7);
     }
 
diff --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td
index 8d26961eebbf3..3994a543f9dcf 100644
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -340,6 +340,9 @@ def IIT_ONE_FIFTH_VEC_ARG : IIT_Base<63>;
 def IIT_ONE_SEVENTH_VEC_ARG : IIT_Base<64>;
 def IIT_V2048: IIT_Vec<2048, 65>;
 def IIT_V4096: IIT_Vec<4096, 66>;
+def IIT_ONE_FOURTH_VEC_ARG : IIT_Base<67>;
+def IIT_ONE_SIXTH_VEC_ARG : IIT_Base<68>;
+def IIT_ONE_EIGHTH_VEC_ARG : IIT_Base<69>;
 }
 
 defvar IIT_all_FixedTypes = !filter(iit, IIT_all,
@@ -483,12 +486,21 @@ class LLVMHalfElementsVectorType<int num>
 class LLVMOneThirdElementsVectorType<int num>
   : LLVMMatchType<num, IIT_ONE_THIRD_VEC_ARG>;
 
+class LLVMOneFourthElementsVectorType<int num>
+  : LLVMMatchType<num, IIT_ONE_FOURTH_VEC_ARG>;
+
 class LLVMOneFifthElementsVectorType<int num>
   : LLVMMatchType<num, IIT_ONE_FIFTH_VEC_ARG>;
 
+class LLVMOneSixthElementsVectorType<int num>
+  : LLVMMatchType<num, IIT_ONE_SIXTH_VEC_ARG>;
+
 class LLVMOneSeventhElementsVectorType<int num>
   : LLVMMatchType<num, IIT_ONE_SEVENTH_VEC_ARG>;
 
+class LLVMOneEighthElementsVectorType<int num>
+  : LLVMMatchType<num, IIT_ONE_EIGHTH_VEC_ARG>;
+
 // Match the type of another intrinsic parameter that is expected to be a
 // vector type (i.e. <N x iM>) but with each element subdivided to
 // form a vector with more elements that are smaller than the original.
@@ -2776,6 +2788,20 @@ def int_vector_deinterleave3 : DefaultAttrsIntrinsic<[LLVMOneThirdElementsVector
                                                      [llvm_anyvector_ty],
                                                      [IntrNoMem]>;
 
+def int_vector_interleave4   : DefaultAttrsIntrinsic<[llvm_anyvector_ty],
+                                                     [LLVMOneFourthElementsVectorType<0>,
+                                                      LLVMOneFourthElementsVectorType<0>,
+                                                      LLVMOneFourthElementsVectorType<0>,
+                                                      LLVMOneFourthElementsVectorType<0>],
+                                                     [IntrNoMem]>;
+
+def int_vector_deinterleave4 : DefaultAttrsIntrinsic<[LLVMOneFourthElementsVectorType<0>,
+                                                      LLVMOneFourthElementsVectorType<0>,
+                                                      LLVMOneFourthElementsVectorType<0>,
+                                                      LLVMOneFourthElementsVectorType<0>],
+                                                     [llvm_anyvector_ty],
+                                                     [IntrNoMem]>;
+
 def int_vector_interleave5   : DefaultAttrsIntrinsic<[llvm_anyvector_ty],
                                                      [LLVMOneFifthElementsVectorType<0>,
                                                       LLVMOneFifthElementsVectorType<0>,
@@ -2792,6 +2818,24 @@ def int_vector_deinterleave5 : DefaultAttrsIntrinsic<[LLVMOneFifthElementsVector
                                                      [llvm_anyvector_ty],
                                                      [IntrNoMem]>;
 
+def int_vector_interleave6   : DefaultAttrsIntrinsic<[llvm_anyvector_ty],
+                                                     [LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>],
+                                                     [IntrNoMem]>;
+
+def int_vector_deinterleave6 : DefaultAttrsIntrinsic<[LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>],
+                                                     [llvm_anyvector_ty],
+                                                     [IntrNoMem]>;
+
 def int_vector_interleave7   : DefaultAttrsIntrinsic<[llvm_anyvector_ty],
                                                      [LLVMOneSeventhElementsVectorType<0>,
                                                       LLVMOneSeventhElementsVectorType<0>,
@@ -2812,6 +2856,28 @@ def int_vector_deinterleave7 : DefaultAttrsIntrinsic<[LLVMOneSeventhElementsVect
                                                      [llvm_anyvector_ty],
                                                      [IntrNoMem]>;
 
+def int_vector_interleave8   : DefaultAttrsIntrinsic<[llvm_anyvector_ty],
+                                                     [LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>],
+                                                     [IntrNoMem]>;
+
+def int_vector_deinterleave8 : DefaultAttrsIntrinsic<[LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>],
+                                                     [llvm_anyvector_ty],
+                                                     [IntrNoMem]>;
+
 //===-------------- Intrinsics to perform partial reduction ---------------===//
 
 def int_experimental_vector_partial_reduce_add : DefaultAttrsIntrinsic<[LLVMMatchType<0>],
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
index 9d138d364bad7..10ee75a83a267 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -8181,24 +8181,42 @@ void SelectionDAGBuilder::visitIntrinsicCall(const CallInst &I,
   case Intrinsic::vector_interleave3:
     visitVectorInterleave(I, 3);
     return;
+  case Intrinsic::vector_interleave4:
+    visitVectorInterleave(I, 4);
+    return;
   case Intrinsic::vector_interleave5:
     visitVectorInterleave(I, 5);
     return;
+  case Intrinsic::vector_interleave6:
+    visitVectorInterleave(I, 6);
+    return;
   case Intrinsic::vector_interleave7:
     visitVectorInterleave(I, 7);
     return;
+  case Intrinsic::vector_interleave8:
+    visitVectorInterleave(I, 8);
+    return;
   case Intrinsic::vector_deinterleave2:
     visitVectorDeinterleave(I, 2);
     return;
   case Intrinsic::vector_deinterleave3:
     visitVectorDeinterleave(I, 3);
     return;
+  case Intrinsic::vector_deinterleave4:
+    visitVectorDeinterleave(I, 4);
+    return;
   case Intrinsic::vector_deinterleave5:
     visitVectorDeinterleave(I, 5);
     return;
+  case Intrinsic::vector_deinterleave6:
+    visitVectorDeinterleave(I, 6);
+    return;
   case Intrinsic::vector_deinterleave7:
     visitVectorDeinterleave(I, 7);
     return;
+  case Intrinsic::vector_deinterleave8:
+    visitVectorDeinterleave(I, 8);
+    return;
   case Intrinsic::experimental_vector_compress:
     setValue(&I, DAG.getNode(ISD::VECTOR_COMPRESS, sdl,
                              getValue(I.getArgOperand(0)).getValueType(),
diff --git a/llvm/lib/IR/Intrinsics.cpp b/llvm/lib/IR/Intrinsics.cpp
index dabb5fe006b3c..28f7523476774 100644
--- a/llvm/lib/IR/Intrinsics.cpp
+++ b/llvm/lib/IR/Intrinsics.cpp
@@ -378,18 +378,36 @@ DecodeIITType(unsigned &NextElt, ArrayRef<unsigned char> Infos,
         IITDescriptor::get(IITDescriptor::OneThirdVecArgument, ArgInfo));
     return;
   }
+  case IIT_ONE_FOURTH_VEC_ARG: {
+    unsigned ArgInfo = (NextElt == Infos.size() ? 0 : Infos[NextElt++]);
+    OutputTable.push_back(
+        IITDescriptor::get(IITDescriptor::OneFourthVecArgument, ArgInfo));
+    return;
+  }
   case IIT_ONE_FIFTH_VEC_ARG: {
     unsigned ArgInfo = (NextElt == Infos.size() ? 0 : Infos[NextElt++]);
     OutputTable.push_back(
         IITDescriptor::get(IITDescriptor::OneFifthVecArgument, ArgInfo));
     return;
   }
+  case IIT_ONE_SIXTH_VEC_ARG: {
+    unsigned ArgInfo = (NextElt == Infos.size() ? 0 : Infos[NextElt++]);
+    OutputTable.push_back(
+        IITDescriptor::get(IITDescriptor::OneSixthVecArgument, ArgInfo));
+    return;
+  }
   case IIT_ONE_SEVENTH_VEC_ARG: {
     unsigned ArgInfo = (NextElt == Infos.size() ? 0 : Infos[NextElt++]);
     OutputTable.push_back(
         IITDescriptor::get(IITDescriptor::OneSeventhVecArgument, ArgInfo));
     return;
   }
+  case IIT_ONE_EIGHTH_VEC_ARG: {
+    unsigned ArgInfo = (NextElt == Infos.size() ? 0 : Infos[NextElt++]);
+    OutputTable.push_back(
+        IITDescriptor::get(IITDescriptor::OneEighthVecArgument, ArgInfo));
+    return;
+  }
   case IIT_SAME_VEC_WIDTH_ARG: {
     unsigned ArgInfo = (NextElt == Infos.size() ? 0 : Infos[NextElt++]);
     OutputTable.push_back(
@@ -584,11 +602,14 @@ static Type *DecodeFixedType(ArrayRef<Intrinsic::IITDescriptor> &Infos,
     return VectorType::getHalfElementsVectorType(
         cast<VectorType>(Tys[D.getArgumentNumber()]));
   case IITDescriptor::OneThirdVecArgument:
+  case IITDescriptor::OneFourthVecArgument:
   case IITDescriptor::OneFifthVecArgument:
+  case IITDescriptor::OneSixthVecArgument:
   case IITDescriptor::OneSeventhVecArgument:
+  case IITDescriptor::OneEighthVecArgument:
     return VectorType::getOneNthElementsVectorType(
         cast<VectorType>(Tys[D.getArgumentNumber()]),
-        3 + (D.Kind - IITDescriptor::OneThirdVecArgument) * 2);
+        3 + (D.Kind - IITDescriptor::OneThirdVecArgument));
   case IITDescriptor::SameVecWidthArgument: {
     Type *EltTy = DecodeFixedType(Infos, Tys, Context);
     Type *Ty = Tys[D.getArgumentNumber()];
@@ -974,15 +995,18 @@ matchIntrinsicType(Type *Ty, ArrayRef<Intrinsic::IITDescriptor> &Infos,
            VectorType::getHalfElementsVectorType(
                cast<VectorType>(ArgTys[D.getArgumentNumber()])) != Ty;
   case IITDescriptor::OneThirdVecArgument:
+  case IITDescriptor::OneFourthVecArgument:
   case IITDescriptor::OneFifthVecArgument:
+  case IITDescriptor::OneSixthVecArgument:
   case IITDescriptor::OneSeventhVecArgument:
+  case IITDescriptor::OneEighthVecArgument:
     // If this is a forward reference, defer the check for later.
     if (D.getArgumentNumber() >= ArgTys.size())
       return IsDeferredCheck || DeferCheck(Ty);
     return !isa<VectorType>(ArgTys[D.getArgumentNumber()]) ||
            VectorType::getOneNthElementsVectorType(
                cast<VectorType>(ArgTys[D.getArgumentNumber()]),
-               3 + (D.Kind - IITDescriptor::OneThirdVecArgument) * 2) != Ty;
+               3 + (D.Kind - IITDescriptor::OneThirdVecArgument)) != Ty;
   case IITDescriptor::SameVecWidthArgument: {
     if (D.getArgumentNumber() >= ArgTys.size()) {
       // Defer check and subsequent check for the vector element type.
diff --git a/llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-fixed.ll b/llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-fixed.ll
index f6b5a35aa06d6..a3ad0b26efd4d 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-fixed.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-fixed.ll
@@ -223,6 +223,41 @@ define {<2 x i32>, <2 x i32>, <2 x i32>} @vector_deinterleave3_v2i32_v6i32(<6 x
 	   ret {<2 x i32>, <2 x i32>, <2 x i32>} %res
 }
 
+define {<2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>} @vector_deinterleave4_v2i32_v8i32(<8 x i32> %v) {
+; CHECK-LABEL: vector_deinterleave4_v2i32_v8i32:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi sp, sp, -16
+; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 1
+; CHECK-NEXT:    sub sp, sp, a0
+; CHECK-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x02, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 2 * vlenb
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    vsetivli zero, 2, e32, m2, ta, ma
+; CHECK-NEXT:    vslidedown.vi v10, v8, 6
+; CHECK-NEXT:    vslidedown.vi v12, v8, 4
+; CHECK-NEXT:    vsetivli zero, 2, e32, m1, ta, ma
+; CHECK-NEXT:    vslidedown.vi v9, v8, 2
+; CHECK-NEXT:    srli a0, a0, 3
+; CHECK-NEXT:    add a1, a0, a0
+; CHECK-NEXT:    vsetvli zero, a1, e32, m1, ta, ma
+; CHECK-NEXT:    vslideup.vx v12, v10, a0
+; CHECK-NEXT:    vslideup.vx v8, v9, a0
+; CHECK-NEXT:    addi a0, sp, 16
+; CHECK-NEXT:    vmv.v.v v9, v12
+; CHECK-NEXT:    vs2r.v v8, (a0)
+; CHECK-NEXT:    vsetvli a1, zero, e32, mf2, ta, ma
+; CHECK-NEXT:    vlseg4e32.v v8, (a0)
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 1
+; CHECK-NEXT:    add sp, sp, a0
+; CHECK-NEXT:    .cfi_def_cfa sp, 16
+; CHECK-NEXT:    addi sp, sp, 16
+; CHECK-NEXT:    .cfi_def_cfa_offset 0
+; CHECK-NEXT:    ret
+	   %res = call {<2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>} @llvm.vector.deinterleave4.v8i32(<8 x i32> %v)
+	   ret {<2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>} %res
+}
 
 define {<2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>} @vector_deinterleave5_v2i16_v10i16(<10 x i16> %v) {
 ; CHECK-LABEL: vector_deinterleave5_v2i16_v10i16:
@@ -265,6 +300,49 @@ define {<2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>} @vector_deinterle
 	   ret {<2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>} %res
 }
 
+define {<2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>} @vector_deinterleave6_v2i16_v12i16(<12 x i16> %v) {
+; CHECK-LABEL: vector_deinterleave6_v2i16_v12i16:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi sp, sp, -16
+; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 1
+; CHECK-NEXT:    sub sp, sp, a0
+; CHECK-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x02, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 2 * vlenb
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    vsetivli zero, 2, e16, m1, ta, ma
+; CHECK-NEXT:    vslidedown.vi v14, v8, 6
+; CHECK-NEXT:    vslidedown.vi v15, v8, 4
+; CHECK-NEXT:    vslidedown.vi v16, v8, 2
+; CHECK-NEXT:    vsetivli zero, 2, e16, m2, ta, ma
+; CHECK-NEXT:    vslidedown.vi v10, v8, 10
+; CHECK-NEXT:    vslidedown.vi v12, v8, 8
+; CHECK-NEXT:    srli a1, a0, 3
+; CHECK-NEXT:    srli a0, a0, 2
+; CHECK-NEXT:    add a2, a1, a1
+; CHECK-NEXT:    add a3, a0, a0
+; CHECK-NEXT:    vsetvli zero, a2, ...
[truncated]
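
One detail of the Intrinsics.cpp hunks above is the change from `3 + (Kind - OneThirdVecArgument) * 2` to `3 + (Kind - OneThirdVecArgument)`: with only 3, 5 and 7 the kinds stepped by two, whereas now the kinds for 3 through 8 are consecutive. The standalone C++ sketch below mimics that arithmetic. The enum is a hypothetical stand-in for a slice of `IITDescriptor`'s kinds (the real enum has more members and different values), and `divisorForKind` is an illustrative name, not an LLVM function.

```cpp
#include <cassert>

// Hypothetical stand-in for the relevant slice of the descriptor kinds.
enum ArgKind {
  HalfVecArgument,
  OneThirdVecArgument,
  OneFourthVecArgument,
  OneFifthVecArgument,
  OneSixthVecArgument,
  OneSeventhVecArgument,
  OneEighthVecArgument,
  SameVecWidthArgument
};

// The offset arithmetic below relies on the OneNth kinds being contiguous.
static_assert(OneEighthVecArgument == OneThirdVecArgument + 5,
              "OneThird..OneEighth must be contiguous");

// Divisor applied to the element count for a OneNth kind (3 through 8).
unsigned divisorForKind(ArgKind Kind) {
  assert(Kind >= OneThirdVecArgument && Kind <= OneEighthVecArgument &&
         "not a OneNth kind");
  return 3 + (Kind - OneThirdVecArgument);
}
```

This is also why the patch extends the `static_assert` in Intrinsics.h: inserting a new kind out of order would silently map a descriptor to the wrong divisor.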

llvmbot (Member) commented May 14, 2025

@llvm/pr-subscribers-llvm-ir

diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index 7296bb84b7d95..c0bc0a10ed537 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -20158,7 +20158,7 @@ Arguments:
 
 The argument to this intrinsic must be a vector.
 
-'``llvm.vector.deinterleave2/3/5/7``' Intrinsic
+'``llvm.vector.deinterleave2/3/4/5/6/7/8``' Intrinsic
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Syntax:
@@ -20176,8 +20176,8 @@ This is an overloaded intrinsic.
 Overview:
 """""""""
 
-The '``llvm.vector.deinterleave2/3/5/7``' intrinsics deinterleave adjacent lanes
-into 2, 3, 5, and 7 separate vectors, respectively, and return them as the
+The '``llvm.vector.deinterleave2/3/4/5/6/7/8``' intrinsics deinterleave adjacent lanes
+into 2 through to 8 separate vectors, respectively, and return them as the
 result.
 
 This intrinsic works for both fixed and scalable vectors. While this intrinsic
@@ -20199,7 +20199,7 @@ Arguments:
 The argument is a vector whose type corresponds to the logical concatenation of
 the aggregated result types.
 
-'``llvm.vector.interleave2/3/5/7``' Intrinsic
+'``llvm.vector.interleave2/3/4/5/6/7/8``' Intrinsic
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Syntax:
@@ -20217,7 +20217,7 @@ This is an overloaded intrinsic.
 Overview:
 """""""""
 
-The '``llvm.vector.interleave2/3/5/7``' intrinsic constructs a vector
+The '``llvm.vector.interleave2/3/4/5/6/7/8``' intrinsic constructs a vector
 by interleaving all the input vectors.
 
 This intrinsic works for both fixed and scalable vectors. While this intrinsic
diff --git a/llvm/include/llvm/IR/Intrinsics.h b/llvm/include/llvm/IR/Intrinsics.h
index 6fb1bf9359b9a..b64784909fc25 100644
--- a/llvm/include/llvm/IR/Intrinsics.h
+++ b/llvm/include/llvm/IR/Intrinsics.h
@@ -153,8 +153,11 @@ namespace Intrinsic {
       TruncArgument,
       HalfVecArgument,
       OneThirdVecArgument,
+      OneFourthVecArgument,
       OneFifthVecArgument,
+      OneSixthVecArgument,
       OneSeventhVecArgument,
+      OneEighthVecArgument,
       SameVecWidthArgument,
       VecOfAnyPtrsToElt,
       VecElementArgument,
@@ -167,8 +170,11 @@ namespace Intrinsic {
     } Kind;
 
     // These three have to be contiguous.
-    static_assert(OneFifthVecArgument == OneThirdVecArgument + 1 &&
-                  OneSeventhVecArgument == OneFifthVecArgument + 1);
+    static_assert(OneFourthVecArgument == OneThirdVecArgument + 1 &&
+                  OneFifthVecArgument == OneFourthVecArgument + 1 &&
+                  OneSixthVecArgument == OneFifthVecArgument + 1 &&
+                  OneSeventhVecArgument == OneSixthVecArgument + 1 &&
+                  OneEighthVecArgument == OneSeventhVecArgument + 1);
     union {
       unsigned Integer_Width;
       unsigned Float_Width;
@@ -188,19 +194,23 @@ namespace Intrinsic {
     unsigned getArgumentNumber() const {
       assert(Kind == Argument || Kind == ExtendArgument ||
              Kind == TruncArgument || Kind == HalfVecArgument ||
-             Kind == OneThirdVecArgument || Kind == OneFifthVecArgument ||
-             Kind == OneSeventhVecArgument || Kind == SameVecWidthArgument ||
-             Kind == VecElementArgument || Kind == Subdivide2Argument ||
-             Kind == Subdivide4Argument || Kind == VecOfBitcastsToInt);
+             Kind == OneThirdVecArgument || Kind == OneFourthVecArgument ||
+             Kind == OneFifthVecArgument || Kind == OneSixthVecArgument ||
+             Kind == OneSeventhVecArgument || Kind == OneEighthVecArgument ||
+             Kind == SameVecWidthArgument || Kind == VecElementArgument ||
+             Kind == Subdivide2Argument || Kind == Subdivide4Argument ||
+             Kind == VecOfBitcastsToInt);
       return Argument_Info >> 3;
     }
     ArgKind getArgumentKind() const {
       assert(Kind == Argument || Kind == ExtendArgument ||
              Kind == TruncArgument || Kind == HalfVecArgument ||
-             Kind == OneThirdVecArgument || Kind == OneFifthVecArgument ||
-             Kind == OneSeventhVecArgument || Kind == SameVecWidthArgument ||
-             Kind == VecElementArgument || Kind == Subdivide2Argument ||
-             Kind == Subdivide4Argument || Kind == VecOfBitcastsToInt);
+             Kind == OneThirdVecArgument || Kind == OneFourthVecArgument ||
+             Kind == OneFifthVecArgument || Kind == OneSixthVecArgument ||
+             Kind == OneSeventhVecArgument || Kind == OneEighthVecArgument ||
+             Kind == SameVecWidthArgument || Kind == VecElementArgument ||
+             Kind == Subdivide2Argument || Kind == Subdivide4Argument ||
+             Kind == VecOfBitcastsToInt);
       return (ArgKind)(Argument_Info & 7);
     }
 
diff --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td
index 8d26961eebbf3..3994a543f9dcf 100644
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -340,6 +340,9 @@ def IIT_ONE_FIFTH_VEC_ARG : IIT_Base<63>;
 def IIT_ONE_SEVENTH_VEC_ARG : IIT_Base<64>;
 def IIT_V2048: IIT_Vec<2048, 65>;
 def IIT_V4096: IIT_Vec<4096, 66>;
+def IIT_ONE_FOURTH_VEC_ARG : IIT_Base<67>;
+def IIT_ONE_SIXTH_VEC_ARG : IIT_Base<68>;
+def IIT_ONE_EIGHTH_VEC_ARG : IIT_Base<69>;
 }
 
 defvar IIT_all_FixedTypes = !filter(iit, IIT_all,
@@ -483,12 +486,21 @@ class LLVMHalfElementsVectorType<int num>
 class LLVMOneThirdElementsVectorType<int num>
   : LLVMMatchType<num, IIT_ONE_THIRD_VEC_ARG>;
 
+class LLVMOneFourthElementsVectorType<int num>
+  : LLVMMatchType<num, IIT_ONE_FOURTH_VEC_ARG>;
+
 class LLVMOneFifthElementsVectorType<int num>
   : LLVMMatchType<num, IIT_ONE_FIFTH_VEC_ARG>;
 
+class LLVMOneSixthElementsVectorType<int num>
+  : LLVMMatchType<num, IIT_ONE_SIXTH_VEC_ARG>;
+
 class LLVMOneSeventhElementsVectorType<int num>
   : LLVMMatchType<num, IIT_ONE_SEVENTH_VEC_ARG>;
 
+class LLVMOneEighthElementsVectorType<int num>
+  : LLVMMatchType<num, IIT_ONE_EIGHTH_VEC_ARG>;
+
 // Match the type of another intrinsic parameter that is expected to be a
 // vector type (i.e. <N x iM>) but with each element subdivided to
 // form a vector with more elements that are smaller than the original.
@@ -2776,6 +2788,20 @@ def int_vector_deinterleave3 : DefaultAttrsIntrinsic<[LLVMOneThirdElementsVector
                                                      [llvm_anyvector_ty],
                                                      [IntrNoMem]>;
 
+def int_vector_interleave4   : DefaultAttrsIntrinsic<[llvm_anyvector_ty],
+                                                     [LLVMOneFourthElementsVectorType<0>,
+                                                      LLVMOneFourthElementsVectorType<0>,
+                                                      LLVMOneFourthElementsVectorType<0>,
+                                                      LLVMOneFourthElementsVectorType<0>],
+                                                     [IntrNoMem]>;
+
+def int_vector_deinterleave4 : DefaultAttrsIntrinsic<[LLVMOneFourthElementsVectorType<0>,
+                                                      LLVMOneFourthElementsVectorType<0>,
+                                                      LLVMOneFourthElementsVectorType<0>,
+                                                      LLVMOneFourthElementsVectorType<0>],
+                                                     [llvm_anyvector_ty],
+                                                     [IntrNoMem]>;
+
 def int_vector_interleave5   : DefaultAttrsIntrinsic<[llvm_anyvector_ty],
                                                      [LLVMOneFifthElementsVectorType<0>,
                                                       LLVMOneFifthElementsVectorType<0>,
@@ -2792,6 +2818,24 @@ def int_vector_deinterleave5 : DefaultAttrsIntrinsic<[LLVMOneFifthElementsVector
                                                      [llvm_anyvector_ty],
                                                      [IntrNoMem]>;
 
+def int_vector_interleave6   : DefaultAttrsIntrinsic<[llvm_anyvector_ty],
+                                                     [LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>],
+                                                     [IntrNoMem]>;
+
+def int_vector_deinterleave6 : DefaultAttrsIntrinsic<[LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>],
+                                                     [llvm_anyvector_ty],
+                                                     [IntrNoMem]>;
+
 def int_vector_interleave7   : DefaultAttrsIntrinsic<[llvm_anyvector_ty],
                                                      [LLVMOneSeventhElementsVectorType<0>,
                                                       LLVMOneSeventhElementsVectorType<0>,
@@ -2812,6 +2856,28 @@ def int_vector_deinterleave7 : DefaultAttrsIntrinsic<[LLVMOneSeventhElementsVect
                                                      [llvm_anyvector_ty],
                                                      [IntrNoMem]>;
 
+def int_vector_interleave8   : DefaultAttrsIntrinsic<[llvm_anyvector_ty],
+                                                     [LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>],
+                                                     [IntrNoMem]>;
+
+def int_vector_deinterleave8 : DefaultAttrsIntrinsic<[LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>],
+                                                     [llvm_anyvector_ty],
+                                                     [IntrNoMem]>;
+
 //===-------------- Intrinsics to perform partial reduction ---------------===//
 
 def int_experimental_vector_partial_reduce_add : DefaultAttrsIntrinsic<[LLVMMatchType<0>],
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
index 9d138d364bad7..10ee75a83a267 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -8181,24 +8181,42 @@ void SelectionDAGBuilder::visitIntrinsicCall(const CallInst &I,
   case Intrinsic::vector_interleave3:
     visitVectorInterleave(I, 3);
     return;
+  case Intrinsic::vector_interleave4:
+    visitVectorInterleave(I, 4);
+    return;
   case Intrinsic::vector_interleave5:
     visitVectorInterleave(I, 5);
     return;
+  case Intrinsic::vector_interleave6:
+    visitVectorInterleave(I, 6);
+    return;
   case Intrinsic::vector_interleave7:
     visitVectorInterleave(I, 7);
     return;
+  case Intrinsic::vector_interleave8:
+    visitVectorInterleave(I, 8);
+    return;
   case Intrinsic::vector_deinterleave2:
     visitVectorDeinterleave(I, 2);
     return;
   case Intrinsic::vector_deinterleave3:
     visitVectorDeinterleave(I, 3);
     return;
+  case Intrinsic::vector_deinterleave4:
+    visitVectorDeinterleave(I, 4);
+    return;
   case Intrinsic::vector_deinterleave5:
     visitVectorDeinterleave(I, 5);
     return;
+  case Intrinsic::vector_deinterleave6:
+    visitVectorDeinterleave(I, 6);
+    return;
   case Intrinsic::vector_deinterleave7:
     visitVectorDeinterleave(I, 7);
     return;
+  case Intrinsic::vector_deinterleave8:
+    visitVectorDeinterleave(I, 8);
+    return;
   case Intrinsic::experimental_vector_compress:
     setValue(&I, DAG.getNode(ISD::VECTOR_COMPRESS, sdl,
                              getValue(I.getArgOperand(0)).getValueType(),
diff --git a/llvm/lib/IR/Intrinsics.cpp b/llvm/lib/IR/Intrinsics.cpp
index dabb5fe006b3c..28f7523476774 100644
--- a/llvm/lib/IR/Intrinsics.cpp
+++ b/llvm/lib/IR/Intrinsics.cpp
@@ -378,18 +378,36 @@ DecodeIITType(unsigned &NextElt, ArrayRef<unsigned char> Infos,
         IITDescriptor::get(IITDescriptor::OneThirdVecArgument, ArgInfo));
     return;
   }
+  case IIT_ONE_FOURTH_VEC_ARG: {
+    unsigned ArgInfo = (NextElt == Infos.size() ? 0 : Infos[NextElt++]);
+    OutputTable.push_back(
+        IITDescriptor::get(IITDescriptor::OneFourthVecArgument, ArgInfo));
+    return;
+  }
   case IIT_ONE_FIFTH_VEC_ARG: {
     unsigned ArgInfo = (NextElt == Infos.size() ? 0 : Infos[NextElt++]);
     OutputTable.push_back(
         IITDescriptor::get(IITDescriptor::OneFifthVecArgument, ArgInfo));
     return;
   }
+  case IIT_ONE_SIXTH_VEC_ARG: {
+    unsigned ArgInfo = (NextElt == Infos.size() ? 0 : Infos[NextElt++]);
+    OutputTable.push_back(
+        IITDescriptor::get(IITDescriptor::OneSixthVecArgument, ArgInfo));
+    return;
+  }
   case IIT_ONE_SEVENTH_VEC_ARG: {
     unsigned ArgInfo = (NextElt == Infos.size() ? 0 : Infos[NextElt++]);
     OutputTable.push_back(
         IITDescriptor::get(IITDescriptor::OneSeventhVecArgument, ArgInfo));
     return;
   }
+  case IIT_ONE_EIGHTH_VEC_ARG: {
+    unsigned ArgInfo = (NextElt == Infos.size() ? 0 : Infos[NextElt++]);
+    OutputTable.push_back(
+        IITDescriptor::get(IITDescriptor::OneEighthVecArgument, ArgInfo));
+    return;
+  }
   case IIT_SAME_VEC_WIDTH_ARG: {
     unsigned ArgInfo = (NextElt == Infos.size() ? 0 : Infos[NextElt++]);
     OutputTable.push_back(
@@ -584,11 +602,14 @@ static Type *DecodeFixedType(ArrayRef<Intrinsic::IITDescriptor> &Infos,
     return VectorType::getHalfElementsVectorType(
         cast<VectorType>(Tys[D.getArgumentNumber()]));
   case IITDescriptor::OneThirdVecArgument:
+  case IITDescriptor::OneFourthVecArgument:
   case IITDescriptor::OneFifthVecArgument:
+  case IITDescriptor::OneSixthVecArgument:
   case IITDescriptor::OneSeventhVecArgument:
+  case IITDescriptor::OneEighthVecArgument:
     return VectorType::getOneNthElementsVectorType(
         cast<VectorType>(Tys[D.getArgumentNumber()]),
-        3 + (D.Kind - IITDescriptor::OneThirdVecArgument) * 2);
+        3 + (D.Kind - IITDescriptor::OneThirdVecArgument));
   case IITDescriptor::SameVecWidthArgument: {
     Type *EltTy = DecodeFixedType(Infos, Tys, Context);
     Type *Ty = Tys[D.getArgumentNumber()];
@@ -974,15 +995,18 @@ matchIntrinsicType(Type *Ty, ArrayRef<Intrinsic::IITDescriptor> &Infos,
            VectorType::getHalfElementsVectorType(
                cast<VectorType>(ArgTys[D.getArgumentNumber()])) != Ty;
   case IITDescriptor::OneThirdVecArgument:
+  case IITDescriptor::OneFourthVecArgument:
   case IITDescriptor::OneFifthVecArgument:
+  case IITDescriptor::OneSixthVecArgument:
   case IITDescriptor::OneSeventhVecArgument:
+  case IITDescriptor::OneEighthVecArgument:
     // If this is a forward reference, defer the check for later.
     if (D.getArgumentNumber() >= ArgTys.size())
       return IsDeferredCheck || DeferCheck(Ty);
     return !isa<VectorType>(ArgTys[D.getArgumentNumber()]) ||
            VectorType::getOneNthElementsVectorType(
                cast<VectorType>(ArgTys[D.getArgumentNumber()]),
-               3 + (D.Kind - IITDescriptor::OneThirdVecArgument) * 2) != Ty;
+               3 + (D.Kind - IITDescriptor::OneThirdVecArgument)) != Ty;
   case IITDescriptor::SameVecWidthArgument: {
     if (D.getArgumentNumber() >= ArgTys.size()) {
       // Defer check and subsequent check for the vector element type.
diff --git a/llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-fixed.ll b/llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-fixed.ll
index f6b5a35aa06d6..a3ad0b26efd4d 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-fixed.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-fixed.ll
@@ -223,6 +223,41 @@ define {<2 x i32>, <2 x i32>, <2 x i32>} @vector_deinterleave3_v2i32_v6i32(<6 x
 	   ret {<2 x i32>, <2 x i32>, <2 x i32>} %res
 }
 
+define {<2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>} @vector_deinterleave3_v2i32_v8i32(<8 x i32> %v) {
+; CHECK-LABEL: vector_deinterleave3_v2i32_v8i32:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi sp, sp, -16
+; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 1
+; CHECK-NEXT:    sub sp, sp, a0
+; CHECK-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x02, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 2 * vlenb
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    vsetivli zero, 2, e32, m2, ta, ma
+; CHECK-NEXT:    vslidedown.vi v10, v8, 6
+; CHECK-NEXT:    vslidedown.vi v12, v8, 4
+; CHECK-NEXT:    vsetivli zero, 2, e32, m1, ta, ma
+; CHECK-NEXT:    vslidedown.vi v9, v8, 2
+; CHECK-NEXT:    srli a0, a0, 3
+; CHECK-NEXT:    add a1, a0, a0
+; CHECK-NEXT:    vsetvli zero, a1, e32, m1, ta, ma
+; CHECK-NEXT:    vslideup.vx v12, v10, a0
+; CHECK-NEXT:    vslideup.vx v8, v9, a0
+; CHECK-NEXT:    addi a0, sp, 16
+; CHECK-NEXT:    vmv.v.v v9, v12
+; CHECK-NEXT:    vs2r.v v8, (a0)
+; CHECK-NEXT:    vsetvli a1, zero, e32, mf2, ta, ma
+; CHECK-NEXT:    vlseg4e32.v v8, (a0)
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 1
+; CHECK-NEXT:    add sp, sp, a0
+; CHECK-NEXT:    .cfi_def_cfa sp, 16
+; CHECK-NEXT:    addi sp, sp, 16
+; CHECK-NEXT:    .cfi_def_cfa_offset 0
+; CHECK-NEXT:    ret
+	   %res = call {<2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>} @llvm.vector.deinterleave4.v8i32(<8 x i32> %v)
+	   ret {<2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>} %res
+}
 
 define {<2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>} @vector_deinterleave5_v2i16_v10i16(<10 x i16> %v) {
 ; CHECK-LABEL: vector_deinterleave5_v2i16_v10i16:
@@ -265,6 +300,49 @@ define {<2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>} @vector_deinterle
 	   ret {<2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>} %res
 }
 
+define {<2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>} @vector_deinterleave6_v2i16_v12i16(<12 x i16> %v) {
+; CHECK-LABEL: vector_deinterleave6_v2i16_v12i16:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi sp, sp, -16
+; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 1
+; CHECK-NEXT:    sub sp, sp, a0
+; CHECK-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x02, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 2 * vlenb
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    vsetivli zero, 2, e16, m1, ta, ma
+; CHECK-NEXT:    vslidedown.vi v14, v8, 6
+; CHECK-NEXT:    vslidedown.vi v15, v8, 4
+; CHECK-NEXT:    vslidedown.vi v16, v8, 2
+; CHECK-NEXT:    vsetivli zero, 2, e16, m2, ta, ma
+; CHECK-NEXT:    vslidedown.vi v10, v8, 10
+; CHECK-NEXT:    vslidedown.vi v12, v8, 8
+; CHECK-NEXT:    srli a1, a0, 3
+; CHECK-NEXT:    srli a0, a0, 2
+; CHECK-NEXT:    add a2, a1, a1
+; CHECK-NEXT:    add a3, a0, a0
+; CHECK-NEXT:    vsetvli zero, a2, ...
[truncated]
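
To make the `Intrinsics.cpp` hunks above easier to follow: the old code computed the divisor as `3 + offset * 2` because only the 3, 5 and 7 kinds existed, while the patch makes all six kinds contiguous so that the divisor is simply `3 + offset`. Below is a minimal standalone sketch of that mapping. `ArgKind` and `divisorForKind` are hypothetical stand-ins for illustration, not the actual `IITDescriptor` definitions.

```cpp
#include <cassert>

// Hypothetical stand-in for a slice of the IITDescriptor kind enum; only the
// relative order of the OneNth entries matters for this sketch.
enum ArgKind {
  HalfVecArgument,
  OneThirdVecArgument,
  OneFourthVecArgument,
  OneFifthVecArgument,
  OneSixthVecArgument,
  OneSeventhVecArgument,
  OneEighthVecArgument,
  SameVecWidthArgument
};

// The arithmetic below is only valid while the six kinds stay contiguous.
static_assert(OneEighthVecArgument == OneThirdVecArgument + 5,
              "the six OneNth kinds must be contiguous");

// Maps a OneNth kind to its divisor (3 through 8).
constexpr unsigned divisorForKind(ArgKind K) {
  return 3 + (K - OneThirdVecArgument);
}
```

With this layout, `divisorForKind(OneSixthVecArgument)` evaluates to 6, which is the value passed on to `getOneNthElementsVectorType` in the patch.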

@@ -223,6 +223,41 @@ define {<2 x i32>, <2 x i32>, <2 x i32>} @vector_deinterleave3_v2i32_v6i32(<6 x
ret {<2 x i32>, <2 x i32>, <2 x i32>} %res
}

define {<2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>} @vector_deinterleave3_v2i32_v8i32(<8 x i32> %v) {
Member

I think we could use nounwind here and rest of the other functions.

@@ -223,6 +223,41 @@ define {<2 x i32>, <2 x i32>, <2 x i32>} @vector_deinterleave3_v2i32_v6i32(<6 x
ret {<2 x i32>, <2 x i32>, <2 x i32>} %res
}

define {<2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>} @vector_deinterleave3_v2i32_v8i32(<8 x i32> %v) {
Member

deinterleave4?

; RV32-NEXT: vs1r.v v9, (a0) # vscale x 8-byte Folded Spill
; RV32-NEXT: li a1, 3
; RV32-NEXT: mv a0, s0
; RV32-NEXT: call __mulsi3
Member

should we add M extension for RV32?

; RV64-NEXT: vs1r.v v9, (a0) # vscale x 8-byte Folded Spill
; RV64-NEXT: li a1, 3
; RV64-NEXT: mv a0, s0
; RV64-NEXT: call __muldi3
Member

ditto M extension

@efriedma-quic
Collaborator

> By representing these higher factors as interleaved-interleaves, we can in theory support arbitrarily high interleave factors. However I'm not sure this is actually needed in practice.

You can definitely end up with very large interleave factors in some cases; my team has internal testcases for stride 24. Granted, it's uncommon.

@@ -167,8 +170,11 @@ namespace Intrinsic {
} Kind;

// These three have to be contiguous.
Contributor

These six?

Kind == OneSeventhVecArgument || Kind == SameVecWidthArgument ||
Kind == VecElementArgument || Kind == Subdivide2Argument ||
Kind == Subdivide4Argument || Kind == VecOfBitcastsToInt);
Kind == OneThirdVecArgument || Kind == OneFourthVecArgument ||
Contributor

Use comparators </> since they are contiguous?
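
A rough sketch of what this suggestion could look like, assuming the kinds stay contiguous in the enum. The enum and `isOneNthVecArgument` are illustrative names, not code from the patch:

```cpp
#include <cassert>

// Hypothetical stand-in for the relevant slice of the kind enum.
enum ArgKind {
  HalfVecArgument,
  OneThirdVecArgument,
  OneFourthVecArgument,
  OneFifthVecArgument,
  OneSixthVecArgument,
  OneSeventhVecArgument,
  OneEighthVecArgument,
  SameVecWidthArgument
};

// A single range check replaces six equality comparisons. It relies on the
// OneThird..OneEighth kinds being declared back to back.
constexpr bool isOneNthVecArgument(ArgKind K) {
  return K >= OneThirdVecArgument && K <= OneEighthVecArgument;
}
```

The trade-off is that the check silently depends on enum order, so it would need to sit next to the contiguity `static_assert`.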

Kind == OneSeventhVecArgument || Kind == SameVecWidthArgument ||
Kind == VecElementArgument || Kind == Subdivide2Argument ||
Kind == Subdivide4Argument || Kind == VecOfBitcastsToInt);
Kind == OneThirdVecArgument || Kind == OneFourthVecArgument ||
Contributor

Ditto.

@lukel97
Contributor Author

lukel97 commented May 15, 2025

> > By representing these higher factors as interleaved-interleaves, we can in theory support arbitrarily high interleave factors. However I'm not sure this is actually needed in practice.
>
> You can definitely end up with very large interleave factors in some cases; my team has internal testcases for stride 24. Granted, it's uncommon.

From my understanding though, the loop vectorizer upstream today doesn't emit any scalable interleave group with a factor higher than 4 on AArch64 and 8 on RISC-V. This is from a quick grep of TLI.getMaxSupportedInterleaveFactor/getInterleavedMemoryOpCost and how BasicTTIImpl::getInterleavedMemoryOpCost returns invalid for scalable vector types. Do you have anything downstream that works around this limitation?

I should mention that the fixed-length VF VPlan should still be able to handle arbitrarily high factors, and hopefully in these cases the loop vectorizer will pick it based on the cost.

@efriedma-quic
Collaborator

It was implemented downstream in a completely separate vectorization framework. I'm only mentioning it because we don't want to block off the possibility of adding such support in the future.

@preames
Collaborator

preames commented May 16, 2025

I support this proposal. Note that this is largely a reversal of my original stance, but seeing all the complexity here, I think adding the explicit variants is probably the right call.

Another option we could explore is to split deinterleaveN into N calls to an intrinsic for the form "deinterleave(N, Vec)". This is a more direct mapping to what we do for the fixed vector shuffles today. This was discussed in the original threads, but I'm still (mildly) of the opinion we went the wrong direction here. I'm happy to defer to those actually working on this though.

We could also do interleave(N, concat_vector(...)) instead. This seems less clearly motivated, and I'd only bother if we were deciding to do the former.

Note that even if we want to pursue my alternative, I support this proposal as an intermediate step. Let's clean up the complexity we have, then possibly revisit.

@mshockwave
Member

> Another option we could explore is to split deinterleaveN into N calls to an intrinsic for the form "deinterleave(N, Vec)"

I guess you mean instead of

%d = deinterleave3(%v)
%s0 = extractvalue %d, 0
%s1 = extractvalue %d, 1
%s2 = extractvalue %d, 2

We're going to do something like

%s0 = deinterleave(0, %v)
%s1 = deinterleave(1, %v)
%s2 = deinterleave(2, %v)

I guess a potential problem might happen when we cannot turn this into segmented load/store. For instance, how should we codegen a single, lingering %s = deinterleave(X, %v)? We might be able to mitigate it by adding another argument indicating the total number of fields, like %s0 = deinterleave(0, 3, %v) for the first field when NF = 3.
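
To pin down the semantics being discussed, here is a small sketch that models the proposed per-field form on plain `std::vector`s. `deinterleaveField` is a hypothetical name, and the `(Field, NF, Vec)` operand order simply follows the `deinterleave(0, 3, %v)` example above; nothing here is an existing LLVM API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of `deinterleave(Field, NF, Vec)`: returns lanes
// Field, Field + NF, Field + 2 * NF, ... of Vec.
std::vector<int> deinterleaveField(std::size_t Field, std::size_t NF,
                                   const std::vector<int> &Vec) {
  assert(NF != 0 && Field < NF && Vec.size() % NF == 0);
  std::vector<int> Out;
  Out.reserve(Vec.size() / NF);
  for (std::size_t I = Field; I < Vec.size(); I += NF)
    Out.push_back(Vec[I]);
  return Out;
}
```

Because `NF` is an explicit operand, a single lingering call is fully specified on its own: it does not need its sibling calls to determine which lanes to extract.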

@preames
Collaborator

preames commented May 17, 2025

> > Another option we could explore is to split deinterleaveN into N calls to an intrinsic for the form
>
> I guess a potential problem might happen when we cannot turn this into segmented load/store. For instance, how should we codegen a single, lingering %s = deinterleave(X, %v)? We might be able to mitigate it by adding another argument indicating the total number of fields, like %s0 = deinterleave(0, 3, %v) for the first field when NF = 3.

Yeah, this was exactly what I had in mind. We have two constant integer operands which fully describe the shuffle being performed (e.g., deinterleave with stride 3 and offset 2, which is analogous to a shufflevector with 2, 5, 8, 11, ... as the mask).
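
As an illustration of the analogy, the equivalent fixed-length shuffle mask can be generated from just the stride and offset. This is a sketch only; `stridedMask` is a made-up helper name, not something that exists in LLVM.

```cpp
#include <cassert>
#include <vector>

// Builds the shufflevector-style mask selecting every Stride-th lane,
// starting at lane Offset, for NumElts result elements.
std::vector<int> stridedMask(int Stride, int Offset, int NumElts) {
  assert(Stride > 0 && Offset >= 0 && Offset < Stride && NumElts >= 0);
  std::vector<int> Mask;
  Mask.reserve(NumElts);
  for (int I = 0; I < NumElts; ++I)
    Mask.push_back(Offset + I * Stride);
  return Mask;
}
```

For example, `stridedMask(3, 2, 4)` yields `{2, 5, 8, 11}`, matching the mask quoted above.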

At least on RISC-V, this is actually a better mapping to the lowering (when we don't turn it into a segment load) than the current intrinsics with their tuple return. Each of the individual lanes becomes a vcompress or vrgather (or vnsrl if possible).

@lukel97
Contributor Author

lukel97 commented May 21, 2025

> It was implemented downstream in a completely separate vectorization framework. I'm only mentioning it because we don't want to block off the possibility of adding such support in the future.

Agreed it would be nice to keep the possibility. From my understanding, these higher factors only need the recursive interleaving support in the loop vectorizer, not in InterleavedAccessPass, because there are no hardware instructions for factors beyond 8 that we can currently map to. So could I suggest the following plan instead:

  • Teach the loop vectorizer to emit a single [de]interleave intrinsic for factors up to 8. Keep the recursive interleaving for powers of 2 beyond 8.
  • Remove the recursive [de]interleaving pattern matching from InterleavedAccessPass. Only match single intrinsics up to factor 8.

This way we would still be able to scalably vectorize e.g. factor 16, and can still remove the recursive interleaving code in InterleavedAccessPass.
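
As a sanity check on why recursive interleaving still covers the larger powers of 2, here is a small model on plain `std::vector`s. It assumes the recursive form splits the operands into even-numbered and odd-numbered halves, interleaves each half, and then applies an interleave2 to the two results. `interleaveN` and `interleaveRecursive` are illustrative names, not LLVM functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<int>;

// Reference semantics for a flat interleave: lane I of every operand, in
// operand order, for I = 0, 1, ...
Vec interleaveN(const std::vector<Vec> &Ins) {
  Vec Out;
  if (Ins.empty())
    return Out;
  for (std::size_t I = 0; I < Ins[0].size(); ++I)
    for (const Vec &In : Ins) {
      assert(In.size() == Ins[0].size());
      Out.push_back(In[I]);
    }
  return Out;
}

// Factor 2N built from two factor-N interleaves plus one interleave2:
// even-numbered operands go to one half, odd-numbered operands to the other.
Vec interleaveRecursive(const std::vector<Vec> &Ins) {
  assert(Ins.size() % 2 == 0);
  std::vector<Vec> Even, Odd;
  for (std::size_t I = 0; I < Ins.size(); ++I)
    (I % 2 == 0 ? Even : Odd).push_back(Ins[I]);
  return interleaveN({interleaveN(Even), interleaveN(Odd)});
}
```

Under that assumption, a factor-16 group would be an interleave2 of two single interleave8 intrinsics, so only one level of nesting would be needed beyond what the hardware supports directly.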

Member

@mshockwave mshockwave left a comment

LGTM

@lukel97
Contributor Author

lukel97 commented May 23, 2025

If there are no objections I'll merge this early next week, but I'm happy to hold off if people still want to discuss the direction. cc @efriedma-quic

OneSeventhVecArgument,
OneEighthVecArgument,
Contributor

Can we instead parameterize a single IIT descriptor with the divisor?

Labels
llvm:ir llvm:SelectionDAG SelectionDAGISel as well

7 participants