[IR] Add llvm.vector.[de]interleave{4,6,8} #139893

lukel97 · 2025-05-14T12:47:11Z

This adds [de]interleave intrinsics for factors 4, 6 and 8, so that every interleaved memory operation supported by the in-tree targets can be represented by a single intrinsic.

For context, [de]interleaves of fixed-length vectors are represented by a series of shufflevectors. The intrinsics are needed for scalable vectors, and we don't currently scalably vectorize all possible factors of interleave groups supported by RISC-V/AArch64.
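
To make the fixed-length case concrete, here is a minimal C++ sketch (not part of this patch) of how a factor-N deinterleave decomposes into N strided shuffles. The helpers `deinterleaveMask`, `shuffle` and `deinterleave` are hypothetical names that model vectors as `std::vector<int>`; they are not LLVM APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: the shuffle mask that selects field `Field` out of
// `Factor` interleaved fields, i.e. Field, Field + Factor, Field + 2 * Factor...
std::vector<int> deinterleaveMask(unsigned Factor, unsigned Field,
                                  unsigned NumElts) {
  assert(Factor >= 2 && Field < Factor && "field index out of range");
  std::vector<int> Mask;
  for (unsigned I = 0; I < NumElts; ++I)
    Mask.push_back(static_cast<int>(Field + I * Factor));
  return Mask;
}

// Model of a single-source shufflevector: Result[I] = Src[Mask[I]].
std::vector<int> shuffle(const std::vector<int> &Src,
                         const std::vector<int> &Mask) {
  std::vector<int> Result;
  for (int Idx : Mask) {
    assert(Idx >= 0 && static_cast<std::size_t>(Idx) < Src.size());
    Result.push_back(Src[Idx]);
  }
  return Result;
}

// Deinterleave a fixed-length vector into `Factor` fields, one shuffle each.
std::vector<std::vector<int>> deinterleave(const std::vector<int> &Wide,
                                           unsigned Factor) {
  assert(Factor >= 2 && Wide.size() % Factor == 0);
  unsigned NumElts = static_cast<unsigned>(Wide.size() / Factor);
  std::vector<std::vector<int>> Fields;
  for (unsigned Field = 0; Field < Factor; ++Field)
    Fields.push_back(shuffle(Wide, deinterleaveMask(Factor, Field, NumElts)));
  return Fields;
}
```

For a factor of 4 over eight elements, field 1 is selected by the mask <1, 5>. A mask like this needs a fixed element count, which is why scalable vectors need the intrinsics instead.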

The underlying reason for this is that higher factors are currently represented by interleaving multiple interleaves themselves, which made sense at the time in the discussion in #89018.
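
As an illustration of what interleaving multiple interleaves means, the sketch below models `llvm.vector.interleaveN` on plain `std::vector<int>` and builds a factor-4 interleave out of three factor-2 interleaves. The names `interleave` and `interleave4AsTree` are hypothetical, and the tree shape is simply one arrangement that yields the right lane order under this model; it is not copied from the vectorizer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<int>;

// Model of a single interleave of N operands:
// Result[I * N + K] = Ops[K][I].
Vec interleave(const std::vector<Vec> &Ops) {
  assert(!Ops.empty());
  Vec Result;
  for (std::size_t I = 0; I < Ops[0].size(); ++I)
    for (const Vec &Op : Ops) {
      assert(Op.size() == Ops[0].size() && "operands must have equal length");
      Result.push_back(Op[I]);
    }
  return Result;
}

// Factor 4 expressed as a tree of factor-2 interleaves. Note the operand
// order: A is paired with C and B with D so the final lanes come out as
// A0 B0 C0 D0 A1 B1 C1 D1 ...
Vec interleave4AsTree(const Vec &A, const Vec &B, const Vec &C, const Vec &D) {
  return interleave({interleave({A, C}), interleave({B, D})});
}
```

A pass that wants to recognise the factor-4 operation has to walk this tree and check the operand order, whereas a single factor-4 intrinsic can be matched directly.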

But after trying to integrate these for higher factors on RISC-V I think we should revisit this design choice:

Matching these in InterleavedAccessPass is non-trivial: we currently only support factors that are a power of 2, and detecting them requires a good chunk of code.
The shufflevector masks used for [de]interleaves of fixed-length vectors are much easier to pattern match because they are strided patterns; the intrinsics are much harder to match because the structure is a tree.
Unlike shufflevectors, no optimisation happens on [de]interleave2 intrinsics.
For non-power-of-2 factors, e.g. 6, there are multiple possible ways a [de]interleave could be represented; see the discussion in [IA] Add support for [de]interleave{3,5,7} #139373.
We already have intrinsics for 2, 3, 5 and 7, so by avoiding 4, 6 and 8 we're not really saving much.
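
The ambiguity for factor 6 can be shown with a small C++ model (an illustration only, with hypothetical function names and vectors modelled as `std::vector<int>`). Both decompositions below produce the same lanes as a direct factor-6 interleave, so a matcher for the nested form would have to handle each shape.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<int>;

// Model of a single interleave of N operands:
// Result[I * N + K] = Ops[K][I].
Vec interleave(const std::vector<Vec> &Ops) {
  assert(!Ops.empty());
  Vec Result;
  for (std::size_t I = 0; I < Ops[0].size(); ++I)
    for (const Vec &Op : Ops) {
      assert(Op.size() == Ops[0].size() && "operands must have equal length");
      Result.push_back(Op[I]);
    }
  return Result;
}

// Shape 1: an interleave2 whose operands are two interleave3s.
Vec interleave6AsTwoThrees(const std::vector<Vec> &F) {
  assert(F.size() == 6);
  return interleave({interleave({F[0], F[2], F[4]}),
                     interleave({F[1], F[3], F[5]})});
}

// Shape 2: an interleave3 whose operands are three interleave2s.
Vec interleave6AsThreeTwos(const std::vector<Vec> &F) {
  assert(F.size() == 6);
  return interleave({interleave({F[0], F[3]}), interleave({F[1], F[4]}),
                     interleave({F[2], F[5]})});
}
```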

By representing these higher factors as interleaved interleaves, we can in theory support arbitrarily high interleave factors. However, I'm not sure this is actually needed in practice: SVE only has instructions for factors 2, 3 and 4, whilst RVV only supports up to factor 8.

This patch would make it much easier to support scalable interleaved accesses in the loop vectorizer for RISC-V for factors 3,5,6 and 7, as the loop vectorizer and InterleavedAccessPass wouldn't need to construct and match trees of interleaves.

For interleave factors above 8, for which there are no hardware memory operations to match in the InterleavedAccessPass, we can still keep the wide load + recursive interleaving in the loop vectorizer.
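
The split described above could be sketched roughly as follows. This is a hypothetical decision helper, not code from the loop vectorizer: `InterleaveStrategy`, `chooseStrategy` and `intrinsicName` are invented names, and treating non-power-of-2 factors above 8 as not vectorized is an assumption based on the recursive scheme only handling powers of 2.

```cpp
#include <cassert>
#include <string>

// Hypothetical classification of how an interleave group could be emitted.
enum class InterleaveStrategy {
  SingleIntrinsic,      // one llvm.vector.[de]interleaveN call
  RecursiveInterleave2, // wide access plus a tree of [de]interleave2
  NotVectorized         // assumption: no scalable lowering available
};

bool isPowerOfTwo(unsigned X) { return X != 0 && (X & (X - 1)) == 0; }

InterleaveStrategy chooseStrategy(unsigned Factor) {
  assert(Factor >= 2 && "an interleave group has at least two members");
  if (Factor <= 8) // factors with a dedicated intrinsic after this patch
    return InterleaveStrategy::SingleIntrinsic;
  if (isPowerOfTwo(Factor)) // e.g. 16: keep the recursive scheme
    return InterleaveStrategy::RecursiveInterleave2;
  return InterleaveStrategy::NotVectorized;
}

// Base intrinsic name for a factor in [2, 8]; the overloaded type suffix
// (for example ".v8i32") is deliberately omitted.
std::string intrinsicName(unsigned Factor, bool IsDeinterleave) {
  assert(Factor >= 2 && Factor <= 8);
  return std::string("llvm.vector.") +
         (IsDeinterleave ? "deinterleave" : "interleave") +
         std::to_string(Factor);
}
```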

If people agree with the direction, I would post these patches to follow up:

Lower [de]interleave{3,4} on AArch64
Teach InterleavedAccessPass to recognize [de]interleave{4,6,8} (3,5,7 are handled by [IA] Add support for [de]interleave{3,5,7} #139373; this would be nearly identical)
Teach the loop vectorizer to emit a single [de]interleave intrinsic for factors 2 through 8. This would complete scalable vlsegN support on RISC-V
Remove the recursive [de]interleaving pattern matching from InterleavedAccessPass

llvmbot · 2025-05-14T12:47:50Z

@llvm/pr-subscribers-llvm-selectiondag

Author: Luke Lau (lukel97)

Changes

This adds [de]interleave intrinsics for factors 4, 6 and 8, so that every interleaved memory operation supported by the in-tree targets can be represented by a single intrinsic.

For context, [de]interleaves of fixed-length vectors are represented by a series of shufflevectors. The intrinsics are needed for scalable vectors, and we don't currently scalably vectorize all possible factors of interleave groups supported by RISC-V/AArch64.

The underlying reason for this is that higher factors are currently represented by interleaving multiple interleaves themselves, which made sense at the time in the discussion in #89018.

But after trying to integrate these for higher factors on RISC-V I think we should revisit this design choice:

Matching these in InterleavedAccessPass is non-trivial: we currently only support factors that are a power of 2, and detecting them requires a good chunk of code.
The shufflevector masks used for [de]interleaves of fixed-length vectors are much easier to pattern match because they are strided patterns; the intrinsics are much harder to match because the structure is a tree.
There's no optimisation that happens on [de]interleave2 intrinsics, so there's not much point to representing higher factors in this form.
For non-power-of-2 factors, e.g. 6, there are multiple possible ways a [de]interleave could be represented; see the discussion in #139373.
We already have intrinsics for 2, 3, 5 and 7, so by avoiding 4, 6 and 8 we're not really saving much.

By representing these higher factors as interleaved interleaves, we can in theory support arbitrarily high interleave factors. However, I'm not sure this is actually needed in practice: SVE only has instructions for factors 2, 3 and 4, whilst RVV only supports up to factor 8.

This patch would make it much easier to support scalable interleaved accesses in the loop vectorizer for RISC-V for factors 3,5,6 and 7, as the loop vectorizer and InterleavedAccessPass wouldn't need to construct and match trees of interleaves.

If people agree with the direction, I would post these patches to follow up:

Lower [de]interleave{3,4} on AArch64
Teach InterleavedAccessPass to recognize [de]interleave{4,6,8} (3,5,7 are handled by [IA] Add support for [de]interleave{3,5,7} #139373, this would be nearly identical)
Remove the recursive [de]interleaving from the loop vectorizer and instead emit a single intrinsic
Remove the recursive [de]interleaving pattern matching from InterleavedAccessPass

If we ever do want to end up supporting interleave factors higher than what the target natively has instructions for, we can then extend this infrastructure further. But I think it's more important that we have full support for the native capabilities first.

Patch is 777.14 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/139893.diff

9 Files Affected:

(modified) llvm/docs/LangRef.rst (+5-5)
(modified) llvm/include/llvm/IR/Intrinsics.h (+20-10)
(modified) llvm/include/llvm/IR/Intrinsics.td (+66)
(modified) llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp (+18)
(modified) llvm/lib/IR/Intrinsics.cpp (+26-2)
(modified) llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-fixed.ll (+500-2)
(modified) llvm/test/CodeGen/RISCV/rvv/vector-deinterleave.ll (+1620-133)
(modified) llvm/test/CodeGen/RISCV/rvv/vector-interleave-fixed.ll (+761-26)
(modified) llvm/test/CodeGen/RISCV/rvv/vector-interleave.ll (+9979-3877)

diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index 7296bb84b7d95..c0bc0a10ed537 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -20158,7 +20158,7 @@ Arguments:
 
 The argument to this intrinsic must be a vector.
 
-'``llvm.vector.deinterleave2/3/5/7``' Intrinsic
+'``llvm.vector.deinterleave2/3/4/5/6/7/8``' Intrinsic
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Syntax:
@@ -20176,8 +20176,8 @@ This is an overloaded intrinsic.
 Overview:
 """""""""
 
-The '``llvm.vector.deinterleave2/3/5/7``' intrinsics deinterleave adjacent lanes
-into 2, 3, 5, and 7 separate vectors, respectively, and return them as the
+The '``llvm.vector.deinterleave2/3/4/5/6/7/8``' intrinsics deinterleave adjacent lanes
+into 2 through to 8 separate vectors, respectively, and return them as the
 result.
 
 This intrinsic works for both fixed and scalable vectors. While this intrinsic
@@ -20199,7 +20199,7 @@ Arguments:
 The argument is a vector whose type corresponds to the logical concatenation of
 the aggregated result types.
 
-'``llvm.vector.interleave2/3/5/7``' Intrinsic
+'``llvm.vector.interleave2/3/4/5/6/7/8``' Intrinsic
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 Syntax:
@@ -20217,7 +20217,7 @@ This is an overloaded intrinsic.
 Overview:
 """""""""
 
-The '``llvm.vector.interleave2/3/5/7``' intrinsic constructs a vector
+The '``llvm.vector.interleave2/3/4/5/6/7/8``' intrinsic constructs a vector
 by interleaving all the input vectors.
 
 This intrinsic works for both fixed and scalable vectors. While this intrinsic
diff --git a/llvm/include/llvm/IR/Intrinsics.h b/llvm/include/llvm/IR/Intrinsics.h
index 6fb1bf9359b9a..b64784909fc25 100644
--- a/llvm/include/llvm/IR/Intrinsics.h
+++ b/llvm/include/llvm/IR/Intrinsics.h
@@ -153,8 +153,11 @@ namespace Intrinsic {
       TruncArgument,
       HalfVecArgument,
       OneThirdVecArgument,
+      OneFourthVecArgument,
       OneFifthVecArgument,
+      OneSixthVecArgument,
       OneSeventhVecArgument,
+      OneEighthVecArgument,
       SameVecWidthArgument,
       VecOfAnyPtrsToElt,
       VecElementArgument,
@@ -167,8 +170,11 @@ namespace Intrinsic {
     } Kind;
 
     // These three have to be contiguous.
-    static_assert(OneFifthVecArgument == OneThirdVecArgument + 1 &&
-                  OneSeventhVecArgument == OneFifthVecArgument + 1);
+    static_assert(OneFourthVecArgument == OneThirdVecArgument + 1 &&
+                  OneFifthVecArgument == OneFourthVecArgument + 1 &&
+                  OneSixthVecArgument == OneFifthVecArgument + 1 &&
+                  OneSeventhVecArgument == OneSixthVecArgument + 1 &&
+                  OneEighthVecArgument == OneSeventhVecArgument + 1);
     union {
       unsigned Integer_Width;
       unsigned Float_Width;
@@ -188,19 +194,23 @@ namespace Intrinsic {
     unsigned getArgumentNumber() const {
       assert(Kind == Argument || Kind == ExtendArgument ||
              Kind == TruncArgument || Kind == HalfVecArgument ||
-             Kind == OneThirdVecArgument || Kind == OneFifthVecArgument ||
-             Kind == OneSeventhVecArgument || Kind == SameVecWidthArgument ||
-             Kind == VecElementArgument || Kind == Subdivide2Argument ||
-             Kind == Subdivide4Argument || Kind == VecOfBitcastsToInt);
+             Kind == OneThirdVecArgument || Kind == OneFourthVecArgument ||
+             Kind == OneFifthVecArgument || Kind == OneSixthVecArgument ||
+             Kind == OneSeventhVecArgument || Kind == OneEighthVecArgument ||
+             Kind == SameVecWidthArgument || Kind == VecElementArgument ||
+             Kind == Subdivide2Argument || Kind == Subdivide4Argument ||
+             Kind == VecOfBitcastsToInt);
       return Argument_Info >> 3;
     }
     ArgKind getArgumentKind() const {
       assert(Kind == Argument || Kind == ExtendArgument ||
              Kind == TruncArgument || Kind == HalfVecArgument ||
-             Kind == OneThirdVecArgument || Kind == OneFifthVecArgument ||
-             Kind == OneSeventhVecArgument || Kind == SameVecWidthArgument ||
-             Kind == VecElementArgument || Kind == Subdivide2Argument ||
-             Kind == Subdivide4Argument || Kind == VecOfBitcastsToInt);
+             Kind == OneThirdVecArgument || Kind == OneFourthVecArgument ||
+             Kind == OneFifthVecArgument || Kind == OneSixthVecArgument ||
+             Kind == OneSeventhVecArgument || Kind == OneEighthVecArgument ||
+             Kind == SameVecWidthArgument || Kind == VecElementArgument ||
+             Kind == Subdivide2Argument || Kind == Subdivide4Argument ||
+             Kind == VecOfBitcastsToInt);
       return (ArgKind)(Argument_Info & 7);
     }
 
diff --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td
index 8d26961eebbf3..3994a543f9dcf 100644
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -340,6 +340,9 @@ def IIT_ONE_FIFTH_VEC_ARG : IIT_Base<63>;
 def IIT_ONE_SEVENTH_VEC_ARG : IIT_Base<64>;
 def IIT_V2048: IIT_Vec<2048, 65>;
 def IIT_V4096: IIT_Vec<4096, 66>;
+def IIT_ONE_FOURTH_VEC_ARG : IIT_Base<67>;
+def IIT_ONE_SIXTH_VEC_ARG : IIT_Base<68>;
+def IIT_ONE_EIGHTH_VEC_ARG : IIT_Base<69>;
 }
 
 defvar IIT_all_FixedTypes = !filter(iit, IIT_all,
@@ -483,12 +486,21 @@ class LLVMHalfElementsVectorType<int num>
 class LLVMOneThirdElementsVectorType<int num>
   : LLVMMatchType<num, IIT_ONE_THIRD_VEC_ARG>;
 
+class LLVMOneFourthElementsVectorType<int num>
+  : LLVMMatchType<num, IIT_ONE_FOURTH_VEC_ARG>;
+
 class LLVMOneFifthElementsVectorType<int num>
   : LLVMMatchType<num, IIT_ONE_FIFTH_VEC_ARG>;
 
+class LLVMOneSixthElementsVectorType<int num>
+  : LLVMMatchType<num, IIT_ONE_SIXTH_VEC_ARG>;
+
 class LLVMOneSeventhElementsVectorType<int num>
   : LLVMMatchType<num, IIT_ONE_SEVENTH_VEC_ARG>;
 
+class LLVMOneEighthElementsVectorType<int num>
+  : LLVMMatchType<num, IIT_ONE_EIGHTH_VEC_ARG>;
+
 // Match the type of another intrinsic parameter that is expected to be a
 // vector type (i.e. <N x iM>) but with each element subdivided to
 // form a vector with more elements that are smaller than the original.
@@ -2776,6 +2788,20 @@ def int_vector_deinterleave3 : DefaultAttrsIntrinsic<[LLVMOneThirdElementsVector
                                                      [llvm_anyvector_ty],
                                                      [IntrNoMem]>;
 
+def int_vector_interleave4   : DefaultAttrsIntrinsic<[llvm_anyvector_ty],
+                                                     [LLVMOneFourthElementsVectorType<0>,
+                                                      LLVMOneFourthElementsVectorType<0>,
+                                                      LLVMOneFourthElementsVectorType<0>,
+                                                      LLVMOneFourthElementsVectorType<0>],
+                                                     [IntrNoMem]>;
+
+def int_vector_deinterleave4 : DefaultAttrsIntrinsic<[LLVMOneFourthElementsVectorType<0>,
+                                                      LLVMOneFourthElementsVectorType<0>,
+                                                      LLVMOneFourthElementsVectorType<0>,
+                                                      LLVMOneFourthElementsVectorType<0>],
+                                                     [llvm_anyvector_ty],
+                                                     [IntrNoMem]>;
+
 def int_vector_interleave5   : DefaultAttrsIntrinsic<[llvm_anyvector_ty],
                                                      [LLVMOneFifthElementsVectorType<0>,
                                                       LLVMOneFifthElementsVectorType<0>,
@@ -2792,6 +2818,24 @@ def int_vector_deinterleave5 : DefaultAttrsIntrinsic<[LLVMOneFifthElementsVector
                                                      [llvm_anyvector_ty],
                                                      [IntrNoMem]>;
 
+def int_vector_interleave6   : DefaultAttrsIntrinsic<[llvm_anyvector_ty],
+                                                     [LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>],
+                                                     [IntrNoMem]>;
+
+def int_vector_deinterleave6 : DefaultAttrsIntrinsic<[LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>],
+                                                     [llvm_anyvector_ty],
+                                                     [IntrNoMem]>;
+
 def int_vector_interleave7   : DefaultAttrsIntrinsic<[llvm_anyvector_ty],
                                                      [LLVMOneSeventhElementsVectorType<0>,
                                                       LLVMOneSeventhElementsVectorType<0>,
@@ -2812,6 +2856,28 @@ def int_vector_deinterleave7 : DefaultAttrsIntrinsic<[LLVMOneSeventhElementsVect
                                                      [llvm_anyvector_ty],
                                                      [IntrNoMem]>;
 
+def int_vector_interleave8   : DefaultAttrsIntrinsic<[llvm_anyvector_ty],
+                                                     [LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>],
+                                                     [IntrNoMem]>;
+
+def int_vector_deinterleave8 : DefaultAttrsIntrinsic<[LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>],
+                                                     [llvm_anyvector_ty],
+                                                     [IntrNoMem]>;
+
 //===-------------- Intrinsics to perform partial reduction ---------------===//
 
 def int_experimental_vector_partial_reduce_add : DefaultAttrsIntrinsic<[LLVMMatchType<0>],
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
index 9d138d364bad7..10ee75a83a267 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -8181,24 +8181,42 @@ void SelectionDAGBuilder::visitIntrinsicCall(const CallInst &I,
   case Intrinsic::vector_interleave3:
     visitVectorInterleave(I, 3);
     return;
+  case Intrinsic::vector_interleave4:
+    visitVectorInterleave(I, 4);
+    return;
   case Intrinsic::vector_interleave5:
     visitVectorInterleave(I, 5);
     return;
+  case Intrinsic::vector_interleave6:
+    visitVectorInterleave(I, 6);
+    return;
   case Intrinsic::vector_interleave7:
     visitVectorInterleave(I, 7);
     return;
+  case Intrinsic::vector_interleave8:
+    visitVectorInterleave(I, 8);
+    return;
   case Intrinsic::vector_deinterleave2:
     visitVectorDeinterleave(I, 2);
     return;
   case Intrinsic::vector_deinterleave3:
     visitVectorDeinterleave(I, 3);
     return;
+  case Intrinsic::vector_deinterleave4:
+    visitVectorDeinterleave(I, 4);
+    return;
   case Intrinsic::vector_deinterleave5:
     visitVectorDeinterleave(I, 5);
     return;
+  case Intrinsic::vector_deinterleave6:
+    visitVectorDeinterleave(I, 6);
+    return;
   case Intrinsic::vector_deinterleave7:
     visitVectorDeinterleave(I, 7);
     return;
+  case Intrinsic::vector_deinterleave8:
+    visitVectorDeinterleave(I, 8);
+    return;
   case Intrinsic::experimental_vector_compress:
     setValue(&I, DAG.getNode(ISD::VECTOR_COMPRESS, sdl,
                              getValue(I.getArgOperand(0)).getValueType(),
diff --git a/llvm/lib/IR/Intrinsics.cpp b/llvm/lib/IR/Intrinsics.cpp
index dabb5fe006b3c..28f7523476774 100644
--- a/llvm/lib/IR/Intrinsics.cpp
+++ b/llvm/lib/IR/Intrinsics.cpp
@@ -378,18 +378,36 @@ DecodeIITType(unsigned &NextElt, ArrayRef<unsigned char> Infos,
         IITDescriptor::get(IITDescriptor::OneThirdVecArgument, ArgInfo));
     return;
   }
+  case IIT_ONE_FOURTH_VEC_ARG: {
+    unsigned ArgInfo = (NextElt == Infos.size() ? 0 : Infos[NextElt++]);
+    OutputTable.push_back(
+        IITDescriptor::get(IITDescriptor::OneFourthVecArgument, ArgInfo));
+    return;
+  }
   case IIT_ONE_FIFTH_VEC_ARG: {
     unsigned ArgInfo = (NextElt == Infos.size() ? 0 : Infos[NextElt++]);
     OutputTable.push_back(
         IITDescriptor::get(IITDescriptor::OneFifthVecArgument, ArgInfo));
     return;
   }
+  case IIT_ONE_SIXTH_VEC_ARG: {
+    unsigned ArgInfo = (NextElt == Infos.size() ? 0 : Infos[NextElt++]);
+    OutputTable.push_back(
+        IITDescriptor::get(IITDescriptor::OneSixthVecArgument, ArgInfo));
+    return;
+  }
   case IIT_ONE_SEVENTH_VEC_ARG: {
     unsigned ArgInfo = (NextElt == Infos.size() ? 0 : Infos[NextElt++]);
     OutputTable.push_back(
         IITDescriptor::get(IITDescriptor::OneSeventhVecArgument, ArgInfo));
     return;
   }
+  case IIT_ONE_EIGHTH_VEC_ARG: {
+    unsigned ArgInfo = (NextElt == Infos.size() ? 0 : Infos[NextElt++]);
+    OutputTable.push_back(
+        IITDescriptor::get(IITDescriptor::OneEighthVecArgument, ArgInfo));
+    return;
+  }
   case IIT_SAME_VEC_WIDTH_ARG: {
     unsigned ArgInfo = (NextElt == Infos.size() ? 0 : Infos[NextElt++]);
     OutputTable.push_back(
@@ -584,11 +602,14 @@ static Type *DecodeFixedType(ArrayRef<Intrinsic::IITDescriptor> &Infos,
     return VectorType::getHalfElementsVectorType(
         cast<VectorType>(Tys[D.getArgumentNumber()]));
   case IITDescriptor::OneThirdVecArgument:
+  case IITDescriptor::OneFourthVecArgument:
   case IITDescriptor::OneFifthVecArgument:
+  case IITDescriptor::OneSixthVecArgument:
   case IITDescriptor::OneSeventhVecArgument:
+  case IITDescriptor::OneEighthVecArgument:
     return VectorType::getOneNthElementsVectorType(
         cast<VectorType>(Tys[D.getArgumentNumber()]),
-        3 + (D.Kind - IITDescriptor::OneThirdVecArgument) * 2);
+        3 + (D.Kind - IITDescriptor::OneThirdVecArgument));
   case IITDescriptor::SameVecWidthArgument: {
     Type *EltTy = DecodeFixedType(Infos, Tys, Context);
     Type *Ty = Tys[D.getArgumentNumber()];
@@ -974,15 +995,18 @@ matchIntrinsicType(Type *Ty, ArrayRef<Intrinsic::IITDescriptor> &Infos,
            VectorType::getHalfElementsVectorType(
                cast<VectorType>(ArgTys[D.getArgumentNumber()])) != Ty;
   case IITDescriptor::OneThirdVecArgument:
+  case IITDescriptor::OneFourthVecArgument:
   case IITDescriptor::OneFifthVecArgument:
+  case IITDescriptor::OneSixthVecArgument:
   case IITDescriptor::OneSeventhVecArgument:
+  case IITDescriptor::OneEighthVecArgument:
     // If this is a forward reference, defer the check for later.
     if (D.getArgumentNumber() >= ArgTys.size())
       return IsDeferredCheck || DeferCheck(Ty);
     return !isa<VectorType>(ArgTys[D.getArgumentNumber()]) ||
            VectorType::getOneNthElementsVectorType(
                cast<VectorType>(ArgTys[D.getArgumentNumber()]),
-               3 + (D.Kind - IITDescriptor::OneThirdVecArgument) * 2) != Ty;
+               3 + (D.Kind - IITDescriptor::OneThirdVecArgument)) != Ty;
   case IITDescriptor::SameVecWidthArgument: {
     if (D.getArgumentNumber() >= ArgTys.size()) {
       // Defer check and subsequent check for the vector element type.
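
A note on the `* 2` removal in the two hunks above: with the one-Nth kinds now contiguous from one-third through one-eighth, the divisor is simply 3 plus the distance from `OneThirdVecArgument`. The standalone sketch below checks that arithmetic using a stand-in enum rather than the real `IITDescriptor` (only the one-Nth enumerators are mirrored, their numeric values are arbitrary, and `divisorForKind` and `oldDivisorForIndex` are hypothetical names).

```cpp
#include <cassert>

// Stand-in for the relevant slice of the descriptor kinds after this patch.
enum Kind : unsigned {
  OneThirdVecArgument,
  OneFourthVecArgument,
  OneFifthVecArgument,
  OneSixthVecArgument,
  OneSeventhVecArgument,
  OneEighthVecArgument,
};

// The mapping below relies on the six kinds being contiguous.
static_assert(OneEighthVecArgument == OneThirdVecArgument + 5,
              "one-Nth kinds must be contiguous");

// New mapping: consecutive kinds map to consecutive divisors 3, 4, ..., 8.
unsigned divisorForKind(Kind K) {
  assert(K >= OneThirdVecArgument && K <= OneEighthVecArgument);
  return 3 + (K - OneThirdVecArgument);
}

// Old mapping, when only third/fifth/seventh existed at offsets 0, 1, 2:
// consecutive kinds mapped to the odd divisors 3, 5, 7.
unsigned oldDivisorForIndex(unsigned Index) { return 3 + Index * 2; }
```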
diff --git a/llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-fixed.ll b/llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-fixed.ll
index f6b5a35aa06d6..a3ad0b26efd4d 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-fixed.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-fixed.ll
@@ -223,6 +223,41 @@ define {<2 x i32>, <2 x i32>, <2 x i32>} @vector_deinterleave3_v2i32_v6i32(<6 x
 	   ret {<2 x i32>, <2 x i32>, <2 x i32>} %res
 }
 
+define {<2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>} @vector_deinterleave3_v2i32_v8i32(<8 x i32> %v) {
+; CHECK-LABEL: vector_deinterleave3_v2i32_v8i32:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi sp, sp, -16
+; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 1
+; CHECK-NEXT:    sub sp, sp, a0
+; CHECK-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x02, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 2 * vlenb
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    vsetivli zero, 2, e32, m2, ta, ma
+; CHECK-NEXT:    vslidedown.vi v10, v8, 6
+; CHECK-NEXT:    vslidedown.vi v12, v8, 4
+; CHECK-NEXT:    vsetivli zero, 2, e32, m1, ta, ma
+; CHECK-NEXT:    vslidedown.vi v9, v8, 2
+; CHECK-NEXT:    srli a0, a0, 3
+; CHECK-NEXT:    add a1, a0, a0
+; CHECK-NEXT:    vsetvli zero, a1, e32, m1, ta, ma
+; CHECK-NEXT:    vslideup.vx v12, v10, a0
+; CHECK-NEXT:    vslideup.vx v8, v9, a0
+; CHECK-NEXT:    addi a0, sp, 16
+; CHECK-NEXT:    vmv.v.v v9, v12
+; CHECK-NEXT:    vs2r.v v8, (a0)
+; CHECK-NEXT:    vsetvli a1, zero, e32, mf2, ta, ma
+; CHECK-NEXT:    vlseg4e32.v v8, (a0)
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 1
+; CHECK-NEXT:    add sp, sp, a0
+; CHECK-NEXT:    .cfi_def_cfa sp, 16
+; CHECK-NEXT:    addi sp, sp, 16
+; CHECK-NEXT:    .cfi_def_cfa_offset 0
+; CHECK-NEXT:    ret
+	   %res = call {<2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>} @llvm.vector.deinterleave4.v8i32(<8 x i32> %v)
+	   ret {<2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>} %res
+}
 
 define {<2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>} @vector_deinterleave5_v2i16_v10i16(<10 x i16> %v) {
 ; CHECK-LABEL: vector_deinterleave5_v2i16_v10i16:
@@ -265,6 +300,49 @@ define {<2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>} @vector_deinterle
 	   ret {<2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>} %res
 }
 
+define {<2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>} @vector_deinterleave6_v2i16_v12i16(<12 x i16> %v) {
+; CHECK-LABEL: vector_deinterleave6_v2i16_v12i16:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi sp, sp, -16
+; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 1
+; CHECK-NEXT:    sub sp, sp, a0
+; CHECK-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x02, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 2 * vlenb
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    vsetivli zero, 2, e16, m1, ta, ma
+; CHECK-NEXT:    vslidedown.vi v14, v8, 6
+; CHECK-NEXT:    vslidedown.vi v15, v8, 4
+; CHECK-NEXT:    vslidedown.vi v16, v8, 2
+; CHECK-NEXT:    vsetivli zero, 2, e16, m2, ta, ma
+; CHECK-NEXT:    vslidedown.vi v10, v8, 10
+; CHECK-NEXT:    vslidedown.vi v12, v8, 8
+; CHECK-NEXT:    srli a1, a0, 3
+; CHECK-NEXT:    srli a0, a0, 2
+; CHECK-NEXT:    add a2, a1, a1
+; CHECK-NEXT:    add a3, a0, a0
+; CHECK-NEXT:    vsetvli zero, a2, ...
[truncated]

llvmbot · 2025-05-14T12:47:50Z

@llvm/pr-subscribers-llvm-ir

+             Kind == VecOfBitcastsToInt);
       return Argument_Info >> 3;
     }
     ArgKind getArgumentKind() const {
       assert(Kind == Argument || Kind == ExtendArgument ||
              Kind == TruncArgument || Kind == HalfVecArgument ||
-             Kind == OneThirdVecArgument || Kind == OneFifthVecArgument ||
-             Kind == OneSeventhVecArgument || Kind == SameVecWidthArgument ||
-             Kind == VecElementArgument || Kind == Subdivide2Argument ||
-             Kind == Subdivide4Argument || Kind == VecOfBitcastsToInt);
+             Kind == OneThirdVecArgument || Kind == OneFourthVecArgument ||
+             Kind == OneFifthVecArgument || Kind == OneSixthVecArgument ||
+             Kind == OneSeventhVecArgument || Kind == OneEighthVecArgument ||
+             Kind == SameVecWidthArgument || Kind == VecElementArgument ||
+             Kind == Subdivide2Argument || Kind == Subdivide4Argument ||
+             Kind == VecOfBitcastsToInt);
       return (ArgKind)(Argument_Info & 7);
     }
 
diff --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td
index 8d26961eebbf3..3994a543f9dcf 100644
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -340,6 +340,9 @@ def IIT_ONE_FIFTH_VEC_ARG : IIT_Base<63>;
 def IIT_ONE_SEVENTH_VEC_ARG : IIT_Base<64>;
 def IIT_V2048: IIT_Vec<2048, 65>;
 def IIT_V4096: IIT_Vec<4096, 66>;
+def IIT_ONE_FOURTH_VEC_ARG : IIT_Base<67>;
+def IIT_ONE_SIXTH_VEC_ARG : IIT_Base<68>;
+def IIT_ONE_EIGHTH_VEC_ARG : IIT_Base<69>;
 }
 
 defvar IIT_all_FixedTypes = !filter(iit, IIT_all,
@@ -483,12 +486,21 @@ class LLVMHalfElementsVectorType<int num>
 class LLVMOneThirdElementsVectorType<int num>
   : LLVMMatchType<num, IIT_ONE_THIRD_VEC_ARG>;
 
+class LLVMOneFourthElementsVectorType<int num>
+  : LLVMMatchType<num, IIT_ONE_FOURTH_VEC_ARG>;
+
 class LLVMOneFifthElementsVectorType<int num>
   : LLVMMatchType<num, IIT_ONE_FIFTH_VEC_ARG>;
 
+class LLVMOneSixthElementsVectorType<int num>
+  : LLVMMatchType<num, IIT_ONE_SIXTH_VEC_ARG>;
+
 class LLVMOneSeventhElementsVectorType<int num>
   : LLVMMatchType<num, IIT_ONE_SEVENTH_VEC_ARG>;
 
+class LLVMOneEighthElementsVectorType<int num>
+  : LLVMMatchType<num, IIT_ONE_EIGHTH_VEC_ARG>;
+
 // Match the type of another intrinsic parameter that is expected to be a
 // vector type (i.e. <N x iM>) but with each element subdivided to
 // form a vector with more elements that are smaller than the original.
@@ -2776,6 +2788,20 @@ def int_vector_deinterleave3 : DefaultAttrsIntrinsic<[LLVMOneThirdElementsVector
                                                      [llvm_anyvector_ty],
                                                      [IntrNoMem]>;
 
+def int_vector_interleave4   : DefaultAttrsIntrinsic<[llvm_anyvector_ty],
+                                                     [LLVMOneFourthElementsVectorType<0>,
+                                                      LLVMOneFourthElementsVectorType<0>,
+                                                      LLVMOneFourthElementsVectorType<0>,
+                                                      LLVMOneFourthElementsVectorType<0>],
+                                                     [IntrNoMem]>;
+
+def int_vector_deinterleave4 : DefaultAttrsIntrinsic<[LLVMOneFourthElementsVectorType<0>,
+                                                      LLVMOneFourthElementsVectorType<0>,
+                                                      LLVMOneFourthElementsVectorType<0>,
+                                                      LLVMOneFourthElementsVectorType<0>],
+                                                     [llvm_anyvector_ty],
+                                                     [IntrNoMem]>;
+
 def int_vector_interleave5   : DefaultAttrsIntrinsic<[llvm_anyvector_ty],
                                                      [LLVMOneFifthElementsVectorType<0>,
                                                       LLVMOneFifthElementsVectorType<0>,
@@ -2792,6 +2818,24 @@ def int_vector_deinterleave5 : DefaultAttrsIntrinsic<[LLVMOneFifthElementsVector
                                                      [llvm_anyvector_ty],
                                                      [IntrNoMem]>;
 
+def int_vector_interleave6   : DefaultAttrsIntrinsic<[llvm_anyvector_ty],
+                                                     [LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>],
+                                                     [IntrNoMem]>;
+
+def int_vector_deinterleave6 : DefaultAttrsIntrinsic<[LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>,
+                                                      LLVMOneSixthElementsVectorType<0>],
+                                                     [llvm_anyvector_ty],
+                                                     [IntrNoMem]>;
+
 def int_vector_interleave7   : DefaultAttrsIntrinsic<[llvm_anyvector_ty],
                                                      [LLVMOneSeventhElementsVectorType<0>,
                                                       LLVMOneSeventhElementsVectorType<0>,
@@ -2812,6 +2856,28 @@ def int_vector_deinterleave7 : DefaultAttrsIntrinsic<[LLVMOneSeventhElementsVect
                                                      [llvm_anyvector_ty],
                                                      [IntrNoMem]>;
 
+def int_vector_interleave8   : DefaultAttrsIntrinsic<[llvm_anyvector_ty],
+                                                     [LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>],
+                                                     [IntrNoMem]>;
+
+def int_vector_deinterleave8 : DefaultAttrsIntrinsic<[LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>,
+                                                      LLVMOneEighthElementsVectorType<0>],
+                                                     [llvm_anyvector_ty],
+                                                     [IntrNoMem]>;
+
 //===-------------- Intrinsics to perform partial reduction ---------------===//
 
 def int_experimental_vector_partial_reduce_add : DefaultAttrsIntrinsic<[LLVMMatchType<0>],
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
index 9d138d364bad7..10ee75a83a267 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -8181,24 +8181,42 @@ void SelectionDAGBuilder::visitIntrinsicCall(const CallInst &I,
   case Intrinsic::vector_interleave3:
     visitVectorInterleave(I, 3);
     return;
+  case Intrinsic::vector_interleave4:
+    visitVectorInterleave(I, 4);
+    return;
   case Intrinsic::vector_interleave5:
     visitVectorInterleave(I, 5);
     return;
+  case Intrinsic::vector_interleave6:
+    visitVectorInterleave(I, 6);
+    return;
   case Intrinsic::vector_interleave7:
     visitVectorInterleave(I, 7);
     return;
+  case Intrinsic::vector_interleave8:
+    visitVectorInterleave(I, 8);
+    return;
   case Intrinsic::vector_deinterleave2:
     visitVectorDeinterleave(I, 2);
     return;
   case Intrinsic::vector_deinterleave3:
     visitVectorDeinterleave(I, 3);
     return;
+  case Intrinsic::vector_deinterleave4:
+    visitVectorDeinterleave(I, 4);
+    return;
   case Intrinsic::vector_deinterleave5:
     visitVectorDeinterleave(I, 5);
     return;
+  case Intrinsic::vector_deinterleave6:
+    visitVectorDeinterleave(I, 6);
+    return;
   case Intrinsic::vector_deinterleave7:
     visitVectorDeinterleave(I, 7);
     return;
+  case Intrinsic::vector_deinterleave8:
+    visitVectorDeinterleave(I, 8);
+    return;
   case Intrinsic::experimental_vector_compress:
     setValue(&I, DAG.getNode(ISD::VECTOR_COMPRESS, sdl,
                              getValue(I.getArgOperand(0)).getValueType(),
diff --git a/llvm/lib/IR/Intrinsics.cpp b/llvm/lib/IR/Intrinsics.cpp
index dabb5fe006b3c..28f7523476774 100644
--- a/llvm/lib/IR/Intrinsics.cpp
+++ b/llvm/lib/IR/Intrinsics.cpp
@@ -378,18 +378,36 @@ DecodeIITType(unsigned &NextElt, ArrayRef<unsigned char> Infos,
         IITDescriptor::get(IITDescriptor::OneThirdVecArgument, ArgInfo));
     return;
   }
+  case IIT_ONE_FOURTH_VEC_ARG: {
+    unsigned ArgInfo = (NextElt == Infos.size() ? 0 : Infos[NextElt++]);
+    OutputTable.push_back(
+        IITDescriptor::get(IITDescriptor::OneFourthVecArgument, ArgInfo));
+    return;
+  }
   case IIT_ONE_FIFTH_VEC_ARG: {
     unsigned ArgInfo = (NextElt == Infos.size() ? 0 : Infos[NextElt++]);
     OutputTable.push_back(
         IITDescriptor::get(IITDescriptor::OneFifthVecArgument, ArgInfo));
     return;
   }
+  case IIT_ONE_SIXTH_VEC_ARG: {
+    unsigned ArgInfo = (NextElt == Infos.size() ? 0 : Infos[NextElt++]);
+    OutputTable.push_back(
+        IITDescriptor::get(IITDescriptor::OneSixthVecArgument, ArgInfo));
+    return;
+  }
   case IIT_ONE_SEVENTH_VEC_ARG: {
     unsigned ArgInfo = (NextElt == Infos.size() ? 0 : Infos[NextElt++]);
     OutputTable.push_back(
         IITDescriptor::get(IITDescriptor::OneSeventhVecArgument, ArgInfo));
     return;
   }
+  case IIT_ONE_EIGHTH_VEC_ARG: {
+    unsigned ArgInfo = (NextElt == Infos.size() ? 0 : Infos[NextElt++]);
+    OutputTable.push_back(
+        IITDescriptor::get(IITDescriptor::OneEighthVecArgument, ArgInfo));
+    return;
+  }
   case IIT_SAME_VEC_WIDTH_ARG: {
     unsigned ArgInfo = (NextElt == Infos.size() ? 0 : Infos[NextElt++]);
     OutputTable.push_back(
@@ -584,11 +602,14 @@ static Type *DecodeFixedType(ArrayRef<Intrinsic::IITDescriptor> &Infos,
     return VectorType::getHalfElementsVectorType(
         cast<VectorType>(Tys[D.getArgumentNumber()]));
   case IITDescriptor::OneThirdVecArgument:
+  case IITDescriptor::OneFourthVecArgument:
   case IITDescriptor::OneFifthVecArgument:
+  case IITDescriptor::OneSixthVecArgument:
   case IITDescriptor::OneSeventhVecArgument:
+  case IITDescriptor::OneEighthVecArgument:
     return VectorType::getOneNthElementsVectorType(
         cast<VectorType>(Tys[D.getArgumentNumber()]),
-        3 + (D.Kind - IITDescriptor::OneThirdVecArgument) * 2);
+        3 + (D.Kind - IITDescriptor::OneThirdVecArgument));
   case IITDescriptor::SameVecWidthArgument: {
     Type *EltTy = DecodeFixedType(Infos, Tys, Context);
     Type *Ty = Tys[D.getArgumentNumber()];
@@ -974,15 +995,18 @@ matchIntrinsicType(Type *Ty, ArrayRef<Intrinsic::IITDescriptor> &Infos,
            VectorType::getHalfElementsVectorType(
                cast<VectorType>(ArgTys[D.getArgumentNumber()])) != Ty;
   case IITDescriptor::OneThirdVecArgument:
+  case IITDescriptor::OneFourthVecArgument:
   case IITDescriptor::OneFifthVecArgument:
+  case IITDescriptor::OneSixthVecArgument:
   case IITDescriptor::OneSeventhVecArgument:
+  case IITDescriptor::OneEighthVecArgument:
     // If this is a forward reference, defer the check for later.
     if (D.getArgumentNumber() >= ArgTys.size())
       return IsDeferredCheck || DeferCheck(Ty);
     return !isa<VectorType>(ArgTys[D.getArgumentNumber()]) ||
            VectorType::getOneNthElementsVectorType(
                cast<VectorType>(ArgTys[D.getArgumentNumber()]),
-               3 + (D.Kind - IITDescriptor::OneThirdVecArgument) * 2) != Ty;
+               3 + (D.Kind - IITDescriptor::OneThirdVecArgument)) != Ty;
   case IITDescriptor::SameVecWidthArgument: {
     if (D.getArgumentNumber() >= ArgTys.size()) {
       // Defer check and subsequent check for the vector element type.
diff --git a/llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-fixed.ll b/llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-fixed.ll
index f6b5a35aa06d6..a3ad0b26efd4d 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-fixed.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-fixed.ll
@@ -223,6 +223,41 @@ define {<2 x i32>, <2 x i32>, <2 x i32>} @vector_deinterleave3_v2i32_v6i32(<6 x
 	   ret {<2 x i32>, <2 x i32>, <2 x i32>} %res
 }
 
+define {<2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>} @vector_deinterleave3_v2i32_v8i32(<8 x i32> %v) {
+; CHECK-LABEL: vector_deinterleave3_v2i32_v8i32:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi sp, sp, -16
+; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 1
+; CHECK-NEXT:    sub sp, sp, a0
+; CHECK-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x02, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 2 * vlenb
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    vsetivli zero, 2, e32, m2, ta, ma
+; CHECK-NEXT:    vslidedown.vi v10, v8, 6
+; CHECK-NEXT:    vslidedown.vi v12, v8, 4
+; CHECK-NEXT:    vsetivli zero, 2, e32, m1, ta, ma
+; CHECK-NEXT:    vslidedown.vi v9, v8, 2
+; CHECK-NEXT:    srli a0, a0, 3
+; CHECK-NEXT:    add a1, a0, a0
+; CHECK-NEXT:    vsetvli zero, a1, e32, m1, ta, ma
+; CHECK-NEXT:    vslideup.vx v12, v10, a0
+; CHECK-NEXT:    vslideup.vx v8, v9, a0
+; CHECK-NEXT:    addi a0, sp, 16
+; CHECK-NEXT:    vmv.v.v v9, v12
+; CHECK-NEXT:    vs2r.v v8, (a0)
+; CHECK-NEXT:    vsetvli a1, zero, e32, mf2, ta, ma
+; CHECK-NEXT:    vlseg4e32.v v8, (a0)
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 1
+; CHECK-NEXT:    add sp, sp, a0
+; CHECK-NEXT:    .cfi_def_cfa sp, 16
+; CHECK-NEXT:    addi sp, sp, 16
+; CHECK-NEXT:    .cfi_def_cfa_offset 0
+; CHECK-NEXT:    ret
+	   %res = call {<2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>} @llvm.vector.deinterleave4.v8i32(<8 x i32> %v)
+	   ret {<2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>} %res
+}
 
 define {<2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>} @vector_deinterleave5_v2i16_v10i16(<10 x i16> %v) {
 ; CHECK-LABEL: vector_deinterleave5_v2i16_v10i16:
@@ -265,6 +300,49 @@ define {<2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>} @vector_deinterle
 	   ret {<2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>} %res
 }
 
+define {<2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>} @vector_deinterleave6_v2i16_v12i16(<12 x i16> %v) {
+; CHECK-LABEL: vector_deinterleave6_v2i16_v12i16:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    addi sp, sp, -16
+; CHECK-NEXT:    .cfi_def_cfa_offset 16
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    slli a0, a0, 1
+; CHECK-NEXT:    sub sp, sp, a0
+; CHECK-NEXT:    .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x02, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 2 * vlenb
+; CHECK-NEXT:    csrr a0, vlenb
+; CHECK-NEXT:    vsetivli zero, 2, e16, m1, ta, ma
+; CHECK-NEXT:    vslidedown.vi v14, v8, 6
+; CHECK-NEXT:    vslidedown.vi v15, v8, 4
+; CHECK-NEXT:    vslidedown.vi v16, v8, 2
+; CHECK-NEXT:    vsetivli zero, 2, e16, m2, ta, ma
+; CHECK-NEXT:    vslidedown.vi v10, v8, 10
+; CHECK-NEXT:    vslidedown.vi v12, v8, 8
+; CHECK-NEXT:    srli a1, a0, 3
+; CHECK-NEXT:    srli a0, a0, 2
+; CHECK-NEXT:    add a2, a1, a1
+; CHECK-NEXT:    add a3, a0, a0
+; CHECK-NEXT:    vsetvli zero, a2, ...
[truncated]

mshockwave · 2025-05-14T20:52:12Z

llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-fixed.ll

@@ -223,6 +223,41 @@ define {<2 x i32>, <2 x i32>, <2 x i32>} @vector_deinterleave3_v2i32_v6i32(<6 x
 	   ret {<2 x i32>, <2 x i32>, <2 x i32>} %res
 }

+define {<2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>} @vector_deinterleave3_v2i32_v8i32(<8 x i32> %v) {


I think we could use nounwind here and in the rest of the functions.

mshockwave · 2025-05-14T20:52:36Z

llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-fixed.ll

@@ -223,6 +223,41 @@ define {<2 x i32>, <2 x i32>, <2 x i32>} @vector_deinterleave3_v2i32_v6i32(<6 x
 	   ret {<2 x i32>, <2 x i32>, <2 x i32>} %res
 }

+define {<2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>} @vector_deinterleave3_v2i32_v8i32(<8 x i32> %v) {


deinterleave4?

mshockwave · 2025-05-14T20:54:21Z

llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-fixed.ll

+; RV32-NEXT:    vs1r.v v9, (a0) # vscale x 8-byte Folded Spill
+; RV32-NEXT:    li a1, 3
+; RV32-NEXT:    mv a0, s0
+; RV32-NEXT:    call __mulsi3


should we add M extension for RV32?

mshockwave · 2025-05-14T20:54:57Z

llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-fixed.ll

+; RV64-NEXT:    vs1r.v v9, (a0) # vscale x 8-byte Folded Spill
+; RV64-NEXT:    li a1, 3
+; RV64-NEXT:    mv a0, s0
+; RV64-NEXT:    call __muldi3


ditto M extension

efriedma-quic · 2025-05-14T23:15:18Z

> By representing these higher factors are interleaved-interleaves, we can in theory support arbitrarily high interleave factors. However I'm not sure this is actually needed in practice.

You can definitely end up with very large interleave factors in some cases; my team has internal testcases for stride 24. Granted, it's uncommon.

wangpc-pp · 2025-05-15T04:09:14Z

llvm/include/llvm/IR/Intrinsics.h

@@ -167,8 +170,11 @@ namespace Intrinsic {
    } Kind;

    // These three have to be contiguous.


wangpc-pp · 2025-05-15T04:11:01Z

llvm/include/llvm/IR/Intrinsics.h

-             Kind == OneSeventhVecArgument || Kind == SameVecWidthArgument ||
-             Kind == VecElementArgument || Kind == Subdivide2Argument ||
-             Kind == Subdivide4Argument || Kind == VecOfBitcastsToInt);
+             Kind == OneThirdVecArgument || Kind == OneFourthVecArgument ||


Use comparators </> since they are contiguous?
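A minimal sketch of the range-check idea, assuming a trimmed-down stand-in for the descriptor kinds (the `Kind` enum, `isOneNthVecKind` and `getDivisor` below are hypothetical names, not the code in `Intrinsics.h`). Because the one-Nth kinds are contiguous, a pair of comparisons can replace the chain of equality tests, and the same property is what makes the `3 + (D.Kind - OneThirdVecArgument)` arithmetic in the patch valid:

```cpp
#include <cassert>

// Hypothetical, trimmed-down stand-in for the descriptor kinds; only the
// relative order of the OneNthVec entries mirrors the patch.
enum Kind {
  HalfVecArgument,
  OneThirdVecArgument,
  OneFourthVecArgument,
  OneFifthVecArgument,
  OneSixthVecArgument,
  OneSeventhVecArgument,
  OneEighthVecArgument,
  SameVecWidthArgument,
};

// Both helpers below rely on the six kinds being contiguous.
static_assert(OneEighthVecArgument == OneThirdVecArgument + 5,
              "OneNthVec kinds must be contiguous");

// Hypothetical helper: true for any one-Nth kind (N = 3..8).
constexpr bool isOneNthVecKind(Kind K) {
  return K >= OneThirdVecArgument && K <= OneEighthVecArgument;
}

// Mirrors the `3 + (D.Kind - OneThirdVecArgument)` expression in the patch.
constexpr unsigned getDivisor(Kind K) {
  return 3 + (K - OneThirdVecArgument);
}
```

The `static_assert` keeps the comparison form safe if someone later inserts an unrelated kind in the middle of the range.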

wangpc-pp · 2025-05-15T04:11:13Z

llvm/include/llvm/IR/Intrinsics.h

-             Kind == OneSeventhVecArgument || Kind == SameVecWidthArgument ||
-             Kind == VecElementArgument || Kind == Subdivide2Argument ||
-             Kind == Subdivide4Argument || Kind == VecOfBitcastsToInt);
+             Kind == OneThirdVecArgument || Kind == OneFourthVecArgument ||


llvm/include/llvm/IR/Intrinsics.td

lukel97 · 2025-05-15T09:59:37Z

> > By representing these higher factors are interleaved-interleaves, we can in theory support arbitrarily high interleave factors. However I'm not sure this is actually needed in practice.

> You can definitely end up with very large interleave factors in some cases; my team has internal testcases for stride 24. Granted, it's uncommon.

From my understanding, though, the loop vectorizer upstream today doesn't emit any scalable interleave group higher than 4 on AArch64 and 8 on RISC-V. This is from a quick grep of TLI.getMaxSupportedInterleaveFactor/getInterleavedMemoryOpCost and how BasicTTIImpl::getInterleavedMemoryOpCost returns invalid for scalable vector types. Do you have anything downstream that works around this limitation?

I should mention that the fixed-length VF VPlan should still be able to handle arbitrarily high factors, and hopefully in these cases the loop vectorizer will pick it based on the cost.

efriedma-quic · 2025-05-15T22:23:34Z

It was implemented downstream in a completely separate vectorization framework. I'm only mentioning it because we don't want to block off the possibility of adding such support in the future.

preames · 2025-05-16T17:53:00Z

I support this proposal. Note that's largely a reversal of my original stance on this, but seeing all the complexity here, I think adding the explicit variants is probably the right call.

Another option we could explore is to split deinterleaveN into N calls to an intrinsic of the form "deinterleave(N, Vec)". This is a more direct mapping to what we do for the fixed vector shuffles today. This was discussed in the original threads, but I'm still (mildly) of the opinion we went the wrong direction here. I'm happy to defer to those actually working on this though.

We could also do interleave(N, concat_vector(...)) instead. This seems less clearly motivated, and I'd only bother if we were deciding to do the former.

Note that even if we want to pursue my alternative, I support this proposal as an intermediate step. Let's clean up the complexity we have, then possibly revisit.

mshockwave · 2025-05-16T23:22:58Z

> Another option we could explore is to split deinterleaveN into N calls to an intrinsic for the form "deinterleave(N, Vec)"

I guess you mean instead of

%d = deinterleave3(%v)
%s0 = extractvalue %d, 0
%s1 = extractvalue %d, 1
%s2 = extractvalue %d, 2

We're going to do something like

%s0 = deinterleave(0, %v)
%s1 = deinterleave(1, %v)
%s2 = deinterleave(2, %v)

I guess a potential problem might happen when we cannot turn this into segmented load/store. For instance, how should we codegen a single, lingering %s = deinterleave(X, %v)? We might be able to mitigate it by adding another argument indicating the total number of fields, like %s0 = deinterleave(0, 3, %v) for the first field when NF = 3.

preames · 2025-05-17T17:27:21Z

> > Another option we could explore is to split deinterleaveN into N calls to an intrinsic for the form

> I guess a potential problem might happen when we cannot turn this into segmented load/store. For instance, how should we codegen a single, lingering %s = deinterleave(X, %v)? We might be able to mitigate it by adding another argument indicating the total number of fields, like %s0 = deinterleave(0, 3, %v) for the first field when NF = 3.

Yeah, this was exactly what I had in mind. We have two constant integer operands which fully describe the shuffle being performed. (e.g., deinterleave with stride 3 and offset 2, which is analogous to a shufflevector with 2, 5, 8, 11, ... as the mask)

At least on RISC-V, this is actually a better mapping to the lowering (when we don't turn it into a segment load) than the current intrinsics with their tuple return. Each of the individual lanes becomes a vcompress or vrgather (or vnsrl if possible).
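To make the stride/offset analogy concrete, here is a small sketch that computes the equivalent fixed-length shuffle mask for one field. `deinterleaveMask` is a hypothetical helper written for illustration; it is not an LLVM API, and it assumes the input length is a multiple of the stride:

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: the element indices picked out by field `Offset` of a
// factor-`Stride` deinterleave over an input of `NumElts` elements.
std::vector<int> deinterleaveMask(unsigned Offset, unsigned Stride,
                                  unsigned NumElts) {
  assert(Stride != 0 && Offset < Stride && NumElts % Stride == 0);
  std::vector<int> Mask;
  // Start at the field's offset and step by the interleave factor.
  for (unsigned I = Offset; I < NumElts; I += Stride)
    Mask.push_back(static_cast<int>(I));
  return Mask;
}
```

For a 12-element input, stride 3 and offset 2, this yields the indices 2, 5, 8, 11 mentioned above.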

lukel97 · 2025-05-21T11:30:50Z

> It was implemented downstream in a completely separate vectorization framework. I'm only mentioning it because we don't want to block off the possibility of adding such support in the future.

Agreed it would be nice to keep the possibility. From my understanding, these higher factors only need the recursive interleaving support in the loop vectorizer, not in InterleavedAccessPass, because there are no hardware instructions beyond factor 8 that we can currently map to. So could I suggest the following plan instead:

- Teach the loop vectorizer to emit a single [de]interleave intrinsic for factors up to 8. Keep the recursive interleaving for powers of 2 beyond 8.
- Remove the recursive [de]interleaving pattern matching from InterleavedAccessPass. Only match single intrinsics up to factor 8.

This way we would still be able to scalably vectorize e.g. factor 16, and can still remove the recursive interleaving code in InterleavedAccessPass.
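For readers unfamiliar with the recursive form, the sketch below models it on plain `std::vector<int>` values rather than IR. `interleaveN` is the reference semantics of a single factor-N interleave, and `interleaveTree` is one way to build the same result for power-of-2 factors out of factor-2 interleaves only (split operands into even and odd positions, recurse, then interleave the two halves). Both function names are hypothetical and nothing here is LLVM code:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<int>;

// Reference semantics of a single factor-N interleave:
// Out[I * N + J] = Ops[J][I].
Vec interleaveN(const std::vector<Vec> &Ops) {
  assert(!Ops.empty());
  const std::size_t N = Ops.size(), Len = Ops[0].size();
  Vec Out(N * Len);
  for (std::size_t J = 0; J < N; ++J) {
    assert(Ops[J].size() == Len);
    for (std::size_t I = 0; I < Len; ++I)
      Out[I * N + J] = Ops[J][I];
  }
  return Out;
}

// Tree form for power-of-2 factors, using only factor-2 interleaves:
// interleave the even-position and odd-position operands separately,
// then interleave those two results with each other.
Vec interleaveTree(const std::vector<Vec> &Ops) {
  if (Ops.size() == 1)
    return Ops[0];
  assert(Ops.size() % 2 == 0);
  std::vector<Vec> Even, Odd;
  for (std::size_t J = 0; J < Ops.size(); ++J)
    (J % 2 == 0 ? Even : Odd).push_back(Ops[J]);
  return interleaveN({interleaveTree(Even), interleaveTree(Odd)});
}
```

For four operands A, B, C, D the tree form is interleave2(interleave2(A, C), interleave2(B, D)), which is the kind of nested structure a matcher has to recognise when no single factor-4 intrinsic exists.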

mshockwave

LGTM

lukel97 · 2025-05-23T18:48:08Z

If there are no objections I'll merge this early next week, but I'm happy to hold off if people still want to discuss the direction. cc @efriedma-quic

nikic · 2025-05-24T08:49:45Z

llvm/include/llvm/IR/Intrinsics.h

      OneSeventhVecArgument,
+      OneEighthVecArgument,


Can we instead parameterize a single IIT descriptor with the divisor?
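One possible shape for such a parameterized descriptor is sketched below. This is purely illustrative: `OneNthVecDescriptor`, its bit layout and its accessors are assumptions made for this example and do not describe the existing `IITDescriptor` encoding. The point is only that the divisor becomes data carried by one kind, rather than being implied by a separate enum value per factor:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical descriptor: a single kind covers every divisor, with the
// divisor stored alongside the referenced argument number.
struct OneNthVecDescriptor {
  std::uint32_t Info; // low 16 bits: argument number, high 16 bits: divisor

  static OneNthVecDescriptor get(unsigned ArgNo, unsigned Divisor) {
    assert(ArgNo <= 0xFFFF && Divisor >= 2 && Divisor <= 0xFFFF);
    return {static_cast<std::uint32_t>((Divisor << 16) | ArgNo)};
  }

  unsigned getArgumentNumber() const { return Info & 0xFFFF; }
  unsigned getDivisor() const { return Info >> 16; }
};
```

With a layout like this, supporting a new factor would not require a new enum entry or a new case label, only a different divisor value.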

[IR] Add llvm.vector.(de)interleave4/6/8

915c27d

lukel97 requested review from mshockwave, hassnaaHamdi, davemgreen, paulwalker-arm and efriedma-quic May 14, 2025 12:47

llvmbot added the llvm:SelectionDAG and llvm:ir labels May 14, 2025

lukel97 mentioned this pull request May 14, 2025

[IA] Add support for [de]interleave{3,5,7} #139373

Merged

mshockwave reviewed May 14, 2025

wangpc-pp reviewed May 15, 2025

lukel97 added 5 commits May 21, 2025 12:38

Add nounwind to avoid cfi directives

777ccf8

Fix test name

d0166d9

Use +m

fdbcca4

Use >=/<= and update comment

7d93db6

Merge branch 'main' of github.com:llvm/llvm-project into interleave-i…

fa3ac23

…ntrinsics/468

mshockwave approved these changes May 21, 2025

nikic approved these changes May 24, 2025
