[IR] Add llvm.vector.[de]interleave{4,6,8} #139893
Conversation
@llvm/pr-subscribers-llvm-selectiondag @llvm/pr-subscribers-llvm-ir

Author: Luke Lau (lukel97)

Changes

This adds [de]interleave intrinsics for factors of 4, 6 and 8, so that every interleaved memory operation supported by the in-tree targets can be represented by a single intrinsic.

For context, [de]interleaves of fixed-length vectors are represented by a series of shufflevectors. The intrinsics are needed for scalable vectors, and we don't currently scalably vectorize all possible factors of interleave groups supported by RISC-V/AArch64.

The underlying reason for this is that higher factors are currently represented by interleaving multiple interleaves themselves, which made sense at the time in the discussion in #89018. But after trying to integrate these for higher factors on RISC-V I think we should revisit this design choice:

By representing these higher factors as interleaved-interleaves, we can in theory support arbitrarily high interleave factors. However I'm not sure this is actually needed in practice: SVE only has instructions for factors 2, 3 and 4, whilst RVV only supports up to factor 8.

This patch would make it much easier to support scalable interleaved accesses in the loop vectorizer for RISC-V for factors 3, 5, 6 and 7, as the loop vectorizer and InterleavedAccessPass wouldn't need to construct and match trees of interleaves.

If people agree with the direction, I would post these patches to follow up:

- vlsegN support on RISC-V
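The new intrinsics follow the same shape as the existing factor-3/5/7 ones. As a quick sketch (the scalable types here are illustrative; the fixed-length call is taken verbatim from the RISC-V test below):

```llvm
; Factor-4 interleave: four narrow vectors in, one wide vector out.
%v = call <vscale x 8 x i32> @llvm.vector.interleave4.nxv8i32(<vscale x 2 x i32> %a, <vscale x 2 x i32> %b, <vscale x 2 x i32> %c, <vscale x 2 x i32> %d)

; Factor-4 deinterleave: one wide vector in, a struct of four narrow vectors out.
%res = call {<2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>} @llvm.vector.deinterleave4.v8i32(<8 x i32> %w)
```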
If we ever do want to end up supporting interleave factors higher than what the target natively has instructions for, we can then extend this infrastructure further. But I think it's more important that we have full support for the native capabilities first.

Patch is 777.14 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/139893.diff

9 Files Affected:
diff --git a/llvm/docs/LangRef.rst b/llvm/docs/LangRef.rst
index 7296bb84b7d95..c0bc0a10ed537 100644
--- a/llvm/docs/LangRef.rst
+++ b/llvm/docs/LangRef.rst
@@ -20158,7 +20158,7 @@ Arguments:
The argument to this intrinsic must be a vector.
-'``llvm.vector.deinterleave2/3/5/7``' Intrinsic
+'``llvm.vector.deinterleave2/3/4/5/6/7/8``' Intrinsic
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Syntax:
@@ -20176,8 +20176,8 @@ This is an overloaded intrinsic.
Overview:
"""""""""
-The '``llvm.vector.deinterleave2/3/5/7``' intrinsics deinterleave adjacent lanes
-into 2, 3, 5, and 7 separate vectors, respectively, and return them as the
+The '``llvm.vector.deinterleave2/3/4/5/6/7/8``' intrinsics deinterleave adjacent lanes
+into 2 through to 8 separate vectors, respectively, and return them as the
result.
This intrinsic works for both fixed and scalable vectors. While this intrinsic
@@ -20199,7 +20199,7 @@ Arguments:
The argument is a vector whose type corresponds to the logical concatenation of
the aggregated result types.
-'``llvm.vector.interleave2/3/5/7``' Intrinsic
+'``llvm.vector.interleave2/3/4/5/6/7/8``' Intrinsic
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Syntax:
@@ -20217,7 +20217,7 @@ This is an overloaded intrinsic.
Overview:
"""""""""
-The '``llvm.vector.interleave2/3/5/7``' intrinsic constructs a vector
+The '``llvm.vector.interleave2/3/4/5/6/7/8``' intrinsic constructs a vector
by interleaving all the input vectors.
This intrinsic works for both fixed and scalable vectors. While this intrinsic
diff --git a/llvm/include/llvm/IR/Intrinsics.h b/llvm/include/llvm/IR/Intrinsics.h
index 6fb1bf9359b9a..b64784909fc25 100644
--- a/llvm/include/llvm/IR/Intrinsics.h
+++ b/llvm/include/llvm/IR/Intrinsics.h
@@ -153,8 +153,11 @@ namespace Intrinsic {
TruncArgument,
HalfVecArgument,
OneThirdVecArgument,
+ OneFourthVecArgument,
OneFifthVecArgument,
+ OneSixthVecArgument,
OneSeventhVecArgument,
+ OneEighthVecArgument,
SameVecWidthArgument,
VecOfAnyPtrsToElt,
VecElementArgument,
@@ -167,8 +170,11 @@ namespace Intrinsic {
} Kind;
// These three have to be contiguous.
- static_assert(OneFifthVecArgument == OneThirdVecArgument + 1 &&
- OneSeventhVecArgument == OneFifthVecArgument + 1);
+ static_assert(OneFourthVecArgument == OneThirdVecArgument + 1 &&
+ OneFifthVecArgument == OneFourthVecArgument + 1 &&
+ OneSixthVecArgument == OneFifthVecArgument + 1 &&
+ OneSeventhVecArgument == OneSixthVecArgument + 1 &&
+ OneEighthVecArgument == OneSeventhVecArgument + 1);
union {
unsigned Integer_Width;
unsigned Float_Width;
@@ -188,19 +194,23 @@ namespace Intrinsic {
unsigned getArgumentNumber() const {
assert(Kind == Argument || Kind == ExtendArgument ||
Kind == TruncArgument || Kind == HalfVecArgument ||
- Kind == OneThirdVecArgument || Kind == OneFifthVecArgument ||
- Kind == OneSeventhVecArgument || Kind == SameVecWidthArgument ||
- Kind == VecElementArgument || Kind == Subdivide2Argument ||
- Kind == Subdivide4Argument || Kind == VecOfBitcastsToInt);
+ Kind == OneThirdVecArgument || Kind == OneFourthVecArgument ||
+ Kind == OneFifthVecArgument || Kind == OneSixthVecArgument ||
+ Kind == OneSeventhVecArgument || Kind == OneEighthVecArgument ||
+ Kind == SameVecWidthArgument || Kind == VecElementArgument ||
+ Kind == Subdivide2Argument || Kind == Subdivide4Argument ||
+ Kind == VecOfBitcastsToInt);
return Argument_Info >> 3;
}
ArgKind getArgumentKind() const {
assert(Kind == Argument || Kind == ExtendArgument ||
Kind == TruncArgument || Kind == HalfVecArgument ||
- Kind == OneThirdVecArgument || Kind == OneFifthVecArgument ||
- Kind == OneSeventhVecArgument || Kind == SameVecWidthArgument ||
- Kind == VecElementArgument || Kind == Subdivide2Argument ||
- Kind == Subdivide4Argument || Kind == VecOfBitcastsToInt);
+ Kind == OneThirdVecArgument || Kind == OneFourthVecArgument ||
+ Kind == OneFifthVecArgument || Kind == OneSixthVecArgument ||
+ Kind == OneSeventhVecArgument || Kind == OneEighthVecArgument ||
+ Kind == SameVecWidthArgument || Kind == VecElementArgument ||
+ Kind == Subdivide2Argument || Kind == Subdivide4Argument ||
+ Kind == VecOfBitcastsToInt);
return (ArgKind)(Argument_Info & 7);
}
diff --git a/llvm/include/llvm/IR/Intrinsics.td b/llvm/include/llvm/IR/Intrinsics.td
index 8d26961eebbf3..3994a543f9dcf 100644
--- a/llvm/include/llvm/IR/Intrinsics.td
+++ b/llvm/include/llvm/IR/Intrinsics.td
@@ -340,6 +340,9 @@ def IIT_ONE_FIFTH_VEC_ARG : IIT_Base<63>;
def IIT_ONE_SEVENTH_VEC_ARG : IIT_Base<64>;
def IIT_V2048: IIT_Vec<2048, 65>;
def IIT_V4096: IIT_Vec<4096, 66>;
+def IIT_ONE_FOURTH_VEC_ARG : IIT_Base<67>;
+def IIT_ONE_SIXTH_VEC_ARG : IIT_Base<68>;
+def IIT_ONE_EIGHTH_VEC_ARG : IIT_Base<69>;
}
defvar IIT_all_FixedTypes = !filter(iit, IIT_all,
@@ -483,12 +486,21 @@ class LLVMHalfElementsVectorType<int num>
class LLVMOneThirdElementsVectorType<int num>
: LLVMMatchType<num, IIT_ONE_THIRD_VEC_ARG>;
+class LLVMOneFourthElementsVectorType<int num>
+ : LLVMMatchType<num, IIT_ONE_FOURTH_VEC_ARG>;
+
class LLVMOneFifthElementsVectorType<int num>
: LLVMMatchType<num, IIT_ONE_FIFTH_VEC_ARG>;
+class LLVMOneSixthElementsVectorType<int num>
+ : LLVMMatchType<num, IIT_ONE_SIXTH_VEC_ARG>;
+
class LLVMOneSeventhElementsVectorType<int num>
: LLVMMatchType<num, IIT_ONE_SEVENTH_VEC_ARG>;
+class LLVMOneEighthElementsVectorType<int num>
+ : LLVMMatchType<num, IIT_ONE_EIGHTH_VEC_ARG>;
+
// Match the type of another intrinsic parameter that is expected to be a
// vector type (i.e. <N x iM>) but with each element subdivided to
// form a vector with more elements that are smaller than the original.
@@ -2776,6 +2788,20 @@ def int_vector_deinterleave3 : DefaultAttrsIntrinsic<[LLVMOneThirdElementsVector
[llvm_anyvector_ty],
[IntrNoMem]>;
+def int_vector_interleave4 : DefaultAttrsIntrinsic<[llvm_anyvector_ty],
+ [LLVMOneFourthElementsVectorType<0>,
+ LLVMOneFourthElementsVectorType<0>,
+ LLVMOneFourthElementsVectorType<0>,
+ LLVMOneFourthElementsVectorType<0>],
+ [IntrNoMem]>;
+
+def int_vector_deinterleave4 : DefaultAttrsIntrinsic<[LLVMOneFourthElementsVectorType<0>,
+ LLVMOneFourthElementsVectorType<0>,
+ LLVMOneFourthElementsVectorType<0>,
+ LLVMOneFourthElementsVectorType<0>],
+ [llvm_anyvector_ty],
+ [IntrNoMem]>;
+
def int_vector_interleave5 : DefaultAttrsIntrinsic<[llvm_anyvector_ty],
[LLVMOneFifthElementsVectorType<0>,
LLVMOneFifthElementsVectorType<0>,
@@ -2792,6 +2818,24 @@ def int_vector_deinterleave5 : DefaultAttrsIntrinsic<[LLVMOneFifthElementsVector
[llvm_anyvector_ty],
[IntrNoMem]>;
+def int_vector_interleave6 : DefaultAttrsIntrinsic<[llvm_anyvector_ty],
+ [LLVMOneSixthElementsVectorType<0>,
+ LLVMOneSixthElementsVectorType<0>,
+ LLVMOneSixthElementsVectorType<0>,
+ LLVMOneSixthElementsVectorType<0>,
+ LLVMOneSixthElementsVectorType<0>,
+ LLVMOneSixthElementsVectorType<0>],
+ [IntrNoMem]>;
+
+def int_vector_deinterleave6 : DefaultAttrsIntrinsic<[LLVMOneSixthElementsVectorType<0>,
+ LLVMOneSixthElementsVectorType<0>,
+ LLVMOneSixthElementsVectorType<0>,
+ LLVMOneSixthElementsVectorType<0>,
+ LLVMOneSixthElementsVectorType<0>,
+ LLVMOneSixthElementsVectorType<0>],
+ [llvm_anyvector_ty],
+ [IntrNoMem]>;
+
def int_vector_interleave7 : DefaultAttrsIntrinsic<[llvm_anyvector_ty],
[LLVMOneSeventhElementsVectorType<0>,
LLVMOneSeventhElementsVectorType<0>,
@@ -2812,6 +2856,28 @@ def int_vector_deinterleave7 : DefaultAttrsIntrinsic<[LLVMOneSeventhElementsVect
[llvm_anyvector_ty],
[IntrNoMem]>;
+def int_vector_interleave8 : DefaultAttrsIntrinsic<[llvm_anyvector_ty],
+ [LLVMOneEighthElementsVectorType<0>,
+ LLVMOneEighthElementsVectorType<0>,
+ LLVMOneEighthElementsVectorType<0>,
+ LLVMOneEighthElementsVectorType<0>,
+ LLVMOneEighthElementsVectorType<0>,
+ LLVMOneEighthElementsVectorType<0>,
+ LLVMOneEighthElementsVectorType<0>,
+ LLVMOneEighthElementsVectorType<0>],
+ [IntrNoMem]>;
+
+def int_vector_deinterleave8 : DefaultAttrsIntrinsic<[LLVMOneEighthElementsVectorType<0>,
+ LLVMOneEighthElementsVectorType<0>,
+ LLVMOneEighthElementsVectorType<0>,
+ LLVMOneEighthElementsVectorType<0>,
+ LLVMOneEighthElementsVectorType<0>,
+ LLVMOneEighthElementsVectorType<0>,
+ LLVMOneEighthElementsVectorType<0>,
+ LLVMOneEighthElementsVectorType<0>],
+ [llvm_anyvector_ty],
+ [IntrNoMem]>;
+
//===-------------- Intrinsics to perform partial reduction ---------------===//
def int_experimental_vector_partial_reduce_add : DefaultAttrsIntrinsic<[LLVMMatchType<0>],
diff --git a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
index 9d138d364bad7..10ee75a83a267 100644
--- a/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
@@ -8181,24 +8181,42 @@ void SelectionDAGBuilder::visitIntrinsicCall(const CallInst &I,
case Intrinsic::vector_interleave3:
visitVectorInterleave(I, 3);
return;
+ case Intrinsic::vector_interleave4:
+ visitVectorInterleave(I, 4);
+ return;
case Intrinsic::vector_interleave5:
visitVectorInterleave(I, 5);
return;
+ case Intrinsic::vector_interleave6:
+ visitVectorInterleave(I, 6);
+ return;
case Intrinsic::vector_interleave7:
visitVectorInterleave(I, 7);
return;
+ case Intrinsic::vector_interleave8:
+ visitVectorInterleave(I, 8);
+ return;
case Intrinsic::vector_deinterleave2:
visitVectorDeinterleave(I, 2);
return;
case Intrinsic::vector_deinterleave3:
visitVectorDeinterleave(I, 3);
return;
+ case Intrinsic::vector_deinterleave4:
+ visitVectorDeinterleave(I, 4);
+ return;
case Intrinsic::vector_deinterleave5:
visitVectorDeinterleave(I, 5);
return;
+ case Intrinsic::vector_deinterleave6:
+ visitVectorDeinterleave(I, 6);
+ return;
case Intrinsic::vector_deinterleave7:
visitVectorDeinterleave(I, 7);
return;
+ case Intrinsic::vector_deinterleave8:
+ visitVectorDeinterleave(I, 8);
+ return;
case Intrinsic::experimental_vector_compress:
setValue(&I, DAG.getNode(ISD::VECTOR_COMPRESS, sdl,
getValue(I.getArgOperand(0)).getValueType(),
diff --git a/llvm/lib/IR/Intrinsics.cpp b/llvm/lib/IR/Intrinsics.cpp
index dabb5fe006b3c..28f7523476774 100644
--- a/llvm/lib/IR/Intrinsics.cpp
+++ b/llvm/lib/IR/Intrinsics.cpp
@@ -378,18 +378,36 @@ DecodeIITType(unsigned &NextElt, ArrayRef<unsigned char> Infos,
IITDescriptor::get(IITDescriptor::OneThirdVecArgument, ArgInfo));
return;
}
+ case IIT_ONE_FOURTH_VEC_ARG: {
+ unsigned ArgInfo = (NextElt == Infos.size() ? 0 : Infos[NextElt++]);
+ OutputTable.push_back(
+ IITDescriptor::get(IITDescriptor::OneFourthVecArgument, ArgInfo));
+ return;
+ }
case IIT_ONE_FIFTH_VEC_ARG: {
unsigned ArgInfo = (NextElt == Infos.size() ? 0 : Infos[NextElt++]);
OutputTable.push_back(
IITDescriptor::get(IITDescriptor::OneFifthVecArgument, ArgInfo));
return;
}
+ case IIT_ONE_SIXTH_VEC_ARG: {
+ unsigned ArgInfo = (NextElt == Infos.size() ? 0 : Infos[NextElt++]);
+ OutputTable.push_back(
+ IITDescriptor::get(IITDescriptor::OneSixthVecArgument, ArgInfo));
+ return;
+ }
case IIT_ONE_SEVENTH_VEC_ARG: {
unsigned ArgInfo = (NextElt == Infos.size() ? 0 : Infos[NextElt++]);
OutputTable.push_back(
IITDescriptor::get(IITDescriptor::OneSeventhVecArgument, ArgInfo));
return;
}
+ case IIT_ONE_EIGHTH_VEC_ARG: {
+ unsigned ArgInfo = (NextElt == Infos.size() ? 0 : Infos[NextElt++]);
+ OutputTable.push_back(
+ IITDescriptor::get(IITDescriptor::OneEighthVecArgument, ArgInfo));
+ return;
+ }
case IIT_SAME_VEC_WIDTH_ARG: {
unsigned ArgInfo = (NextElt == Infos.size() ? 0 : Infos[NextElt++]);
OutputTable.push_back(
@@ -584,11 +602,14 @@ static Type *DecodeFixedType(ArrayRef<Intrinsic::IITDescriptor> &Infos,
return VectorType::getHalfElementsVectorType(
cast<VectorType>(Tys[D.getArgumentNumber()]));
case IITDescriptor::OneThirdVecArgument:
+ case IITDescriptor::OneFourthVecArgument:
case IITDescriptor::OneFifthVecArgument:
+ case IITDescriptor::OneSixthVecArgument:
case IITDescriptor::OneSeventhVecArgument:
+ case IITDescriptor::OneEighthVecArgument:
return VectorType::getOneNthElementsVectorType(
cast<VectorType>(Tys[D.getArgumentNumber()]),
- 3 + (D.Kind - IITDescriptor::OneThirdVecArgument) * 2);
+ 3 + (D.Kind - IITDescriptor::OneThirdVecArgument));
case IITDescriptor::SameVecWidthArgument: {
Type *EltTy = DecodeFixedType(Infos, Tys, Context);
Type *Ty = Tys[D.getArgumentNumber()];
@@ -974,15 +995,18 @@ matchIntrinsicType(Type *Ty, ArrayRef<Intrinsic::IITDescriptor> &Infos,
VectorType::getHalfElementsVectorType(
cast<VectorType>(ArgTys[D.getArgumentNumber()])) != Ty;
case IITDescriptor::OneThirdVecArgument:
+ case IITDescriptor::OneFourthVecArgument:
case IITDescriptor::OneFifthVecArgument:
+ case IITDescriptor::OneSixthVecArgument:
case IITDescriptor::OneSeventhVecArgument:
+ case IITDescriptor::OneEighthVecArgument:
// If this is a forward reference, defer the check for later.
if (D.getArgumentNumber() >= ArgTys.size())
return IsDeferredCheck || DeferCheck(Ty);
return !isa<VectorType>(ArgTys[D.getArgumentNumber()]) ||
VectorType::getOneNthElementsVectorType(
cast<VectorType>(ArgTys[D.getArgumentNumber()]),
- 3 + (D.Kind - IITDescriptor::OneThirdVecArgument) * 2) != Ty;
+ 3 + (D.Kind - IITDescriptor::OneThirdVecArgument)) != Ty;
case IITDescriptor::SameVecWidthArgument: {
if (D.getArgumentNumber() >= ArgTys.size()) {
// Defer check and subsequent check for the vector element type.
diff --git a/llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-fixed.ll b/llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-fixed.ll
index f6b5a35aa06d6..a3ad0b26efd4d 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-fixed.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-fixed.ll
@@ -223,6 +223,41 @@ define {<2 x i32>, <2 x i32>, <2 x i32>} @vector_deinterleave3_v2i32_v6i32(<6 x
ret {<2 x i32>, <2 x i32>, <2 x i32>} %res
}
+define {<2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>} @vector_deinterleave3_v2i32_v8i32(<8 x i32> %v) {
+; CHECK-LABEL: vector_deinterleave3_v2i32_v8i32:
+; CHECK: # %bb.0:
+; CHECK-NEXT: addi sp, sp, -16
+; CHECK-NEXT: .cfi_def_cfa_offset 16
+; CHECK-NEXT: csrr a0, vlenb
+; CHECK-NEXT: slli a0, a0, 1
+; CHECK-NEXT: sub sp, sp, a0
+; CHECK-NEXT: .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x02, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 2 * vlenb
+; CHECK-NEXT: csrr a0, vlenb
+; CHECK-NEXT: vsetivli zero, 2, e32, m2, ta, ma
+; CHECK-NEXT: vslidedown.vi v10, v8, 6
+; CHECK-NEXT: vslidedown.vi v12, v8, 4
+; CHECK-NEXT: vsetivli zero, 2, e32, m1, ta, ma
+; CHECK-NEXT: vslidedown.vi v9, v8, 2
+; CHECK-NEXT: srli a0, a0, 3
+; CHECK-NEXT: add a1, a0, a0
+; CHECK-NEXT: vsetvli zero, a1, e32, m1, ta, ma
+; CHECK-NEXT: vslideup.vx v12, v10, a0
+; CHECK-NEXT: vslideup.vx v8, v9, a0
+; CHECK-NEXT: addi a0, sp, 16
+; CHECK-NEXT: vmv.v.v v9, v12
+; CHECK-NEXT: vs2r.v v8, (a0)
+; CHECK-NEXT: vsetvli a1, zero, e32, mf2, ta, ma
+; CHECK-NEXT: vlseg4e32.v v8, (a0)
+; CHECK-NEXT: csrr a0, vlenb
+; CHECK-NEXT: slli a0, a0, 1
+; CHECK-NEXT: add sp, sp, a0
+; CHECK-NEXT: .cfi_def_cfa sp, 16
+; CHECK-NEXT: addi sp, sp, 16
+; CHECK-NEXT: .cfi_def_cfa_offset 0
+; CHECK-NEXT: ret
+ %res = call {<2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>} @llvm.vector.deinterleave4.v8i32(<8 x i32> %v)
+ ret {<2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>} %res
+}
define {<2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>} @vector_deinterleave5_v2i16_v10i16(<10 x i16> %v) {
; CHECK-LABEL: vector_deinterleave5_v2i16_v10i16:
@@ -265,6 +300,49 @@ define {<2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>} @vector_deinterle
ret {<2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>} %res
}
+define {<2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>, <2 x i16>} @vector_deinterleave6_v2i16_v12i16(<12 x i16> %v) {
+; CHECK-LABEL: vector_deinterleave6_v2i16_v12i16:
+; CHECK: # %bb.0:
+; CHECK-NEXT: addi sp, sp, -16
+; CHECK-NEXT: .cfi_def_cfa_offset 16
+; CHECK-NEXT: csrr a0, vlenb
+; CHECK-NEXT: slli a0, a0, 1
+; CHECK-NEXT: sub sp, sp, a0
+; CHECK-NEXT: .cfi_escape 0x0f, 0x0d, 0x72, 0x00, 0x11, 0x10, 0x22, 0x11, 0x02, 0x92, 0xa2, 0x38, 0x00, 0x1e, 0x22 # sp + 16 + 2 * vlenb
+; CHECK-NEXT: csrr a0, vlenb
+; CHECK-NEXT: vsetivli zero, 2, e16, m1, ta, ma
+; CHECK-NEXT: vslidedown.vi v14, v8, 6
+; CHECK-NEXT: vslidedown.vi v15, v8, 4
+; CHECK-NEXT: vslidedown.vi v16, v8, 2
+; CHECK-NEXT: vsetivli zero, 2, e16, m2, ta, ma
+; CHECK-NEXT: vslidedown.vi v10, v8, 10
+; CHECK-NEXT: vslidedown.vi v12, v8, 8
+; CHECK-NEXT: srli a1, a0, 3
+; CHECK-NEXT: srli a0, a0, 2
+; CHECK-NEXT: add a2, a1, a1
+; CHECK-NEXT: add a3, a0, a0
+; CHECK-NEXT: vsetvli zero, a2, ...
[truncated]
> define {<2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>} @vector_deinterleave3_v2i32_v8i32(<8 x i32> %v) {

I think we could use nounwind here and in the rest of the other functions.
> define {<2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>} @vector_deinterleave3_v2i32_v8i32(<8 x i32> %v) {

deinterleave4? (the function name still says deinterleave3)
> ; RV32-NEXT: li a1, 3
> ; RV32-NEXT: call __mulsi3

should we add M extension for RV32?
> ; RV64-NEXT: li a1, 3
> ; RV64-NEXT: call __muldi3

ditto M extension
You can definitely end up with very large interleave factors in some cases; my team has internal testcases for stride 24. Granted, it's uncommon.
llvm/include/llvm/IR/Intrinsics.h (outdated)

> // These three have to be contiguous.

These six?
llvm/include/llvm/IR/Intrinsics.h (outdated)

> Kind == OneThirdVecArgument || Kind == OneFourthVecArgument ||

Use comparators `<`/`>` since they are contiguous?
llvm/include/llvm/IR/Intrinsics.h (outdated)

> Kind == OneThirdVecArgument || Kind == OneFourthVecArgument ||

Ditto.
From my understanding though, the loop vectorizer upstream today doesn't emit any scalable interleave group higher than 4 on AArch64 and 8 on RISC-V (this is from a quick grep). I should mention that the fixed-length VF VPlan should still be able to handle arbitrarily high factors, and hopefully in these cases the loop vectorizer will pick it based off the cost.
It was implemented downstream in a completely separate vectorization framework. I'm only mentioning it because we don't want to block off the possibility of adding such support in the future.
I support this proposal. Note that's largely a reversal of my original stance on this, but seeing all the complexity here, I think adding the explicit variants is probably the right call.

Another option we could explore is to split deinterleaveN into N calls to an intrinsic of the form "deinterleave(N, Vec)". This is a more direct mapping to what we do for the fixed vector shuffles today. This was discussed in the original threads, but I'm still (mildly) of the opinion we went the wrong direction here. I'm happy to defer to those actually working on this though.

We could also do interleave(N, concat_vector(...)) instead. This seems less clearly motivated, and I'd only bother if we were deciding to do the former.

Note that even if we want to pursue my alternative, I support this proposal as an intermediate step. Let's clean up the complexity we have, then possibly revisit.
I guess you mean instead of a single wide `llvm.vector.deinterleaveN` call, we're going to do something like N separate per-lane deinterleave calls, one per offset?

I guess a potential problem might happen when we cannot turn this into a segmented load/store. For instance, how should we codegen a single, lingering per-lane deinterleave?
Yeah, this was exactly what I had in mind. We have two constant integer operands which fully describe the shuffle being performed (e.g., deinterleave with stride 3 and offset 2, which is analogous to a shufflevector with 2, 5, 8, 11, ... as the mask).

At least on riscv, this is actually a better mapping to the lowering (when we don't turn it into a segment load) than the current intrinsics with their tuple return. Each of the individual lanes becomes a vcompress or vrgather (or vnsrl if possible).
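For fixed-length vectors the analogy above is directly expressible today; a small sketch (types are illustrative, and the per-lane intrinsic form itself is hypothetical, nothing in-tree):

```llvm
; "deinterleave with stride 3 and offset 2" on a fixed vector is just a
; strided shufflevector mask, exactly as described above:
%lane2 = shufflevector <12 x i32> %v, <12 x i32> poison, <4 x i32> <i32 2, i32 5, i32 8, i32 11>
; The proposed scalable form would be something like one call per lane,
; e.g. a hypothetical deinterleave(stride 3, offset 2, %vec), each of which
; RISC-V could lower to a vcompress/vrgather (or vnsrl if possible).
```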
Agreed it would be nice to keep the possibility. From my understanding, these higher factors only need the recursive interleaving support in the loop vectorizer, not in InterleavedAccessPass, because there are no hardware instructions beyond 8 that we can currently map to. So could I suggest the following plan instead: keep the single intrinsics for the factors targets can lower directly, and for factors above that keep the wide load + recursive interleaving in the loop vectorizer.

This way we would still be able to scalably vectorize e.g. factor 16, and can still remove the recursive interleaving code in InterleavedAccessPass.
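For instance, a hedged sketch of how factor 16 could still be composed from the widest native factor (operand names and types are illustrative):

```llvm
; Factor 16 built as interleave2 of two factor-8 interleaves: %even holds the
; even-indexed fields, %odd the odd-indexed ones, so the final interleave2
; restores the full f0, f1, f2, ... lane order.
%even = call <vscale x 8 x i8> @llvm.vector.interleave8.nxv8i8(<vscale x 1 x i8> %f0, <vscale x 1 x i8> %f2, <vscale x 1 x i8> %f4, <vscale x 1 x i8> %f6, <vscale x 1 x i8> %f8, <vscale x 1 x i8> %f10, <vscale x 1 x i8> %f12, <vscale x 1 x i8> %f14)
%odd  = call <vscale x 8 x i8> @llvm.vector.interleave8.nxv8i8(<vscale x 1 x i8> %f1, <vscale x 1 x i8> %f3, <vscale x 1 x i8> %f5, <vscale x 1 x i8> %f7, <vscale x 1 x i8> %f9, <vscale x 1 x i8> %f11, <vscale x 1 x i8> %f13, <vscale x 1 x i8> %f15)
%v16  = call <vscale x 16 x i8> @llvm.vector.interleave2.nxv16i8(<vscale x 8 x i8> %even, <vscale x 8 x i8> %odd)
```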
LGTM
If there are no objections I'll merge this early next week, but happy to hold on if people still want to discuss the direction. cc @efriedma-quic
> OneSeventhVecArgument,
> OneEighthVecArgument,

Can we instead parameterize a single IIT descriptor with the divisor?
This adds [de]interleave intrinsics for factors of 4,6,8, so that every interleaved memory operation supported by the in-tree targets can be represented by a single intrinsic.
For context, [de]interleaves of fixed-length vectors are represented by a series of shufflevectors. The intrinsics are needed for scalable vectors, and we don't currently scalably vectorize all possible factors of interleave groups supported by RISC-V/AArch64.
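For example, a factor-2 [de]interleave of fixed-length vectors is just a pair of strided shuffles (a sketch; types are illustrative):

```llvm
; interleave2 of two <4 x i32> vectors via a shufflevector
%il = shufflevector <4 x i32> %a, <4 x i32> %b, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>

; deinterleave2 back into the even and odd lanes
%even = shufflevector <8 x i32> %il, <8 x i32> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
%odd  = shufflevector <8 x i32> %il, <8 x i32> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
```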
The underlying reason for this is that higher factors are currently represented by interleaving multiple interleaves themselves, which made sense at the time in the discussion in #89018.
But after trying to integrate these for higher factors on RISC-V I think we should revisit this design choice:
By representing these higher factors as interleaved-interleaves, we can in theory support arbitrarily high interleave factors. However I'm not sure this is actually needed in practice: SVE only has instructions for factors 2, 3 and 4, whilst RVV only supports up to factor 8.
This patch would make it much easier to support scalable interleaved accesses in the loop vectorizer for RISC-V for factors 3,5,6 and 7, as the loop vectorizer and InterleavedAccessPass wouldn't need to construct and match trees of interleaves.
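Concretely, here is roughly what the two representations look like for a factor-4 interleave of scalable vectors (a sketch; value names and types are illustrative):

```llvm
; Current recursive representation: a tree of interleave2s.
; interleave4(a,b,c,d) == interleave2(interleave2(a,c), interleave2(b,d))
%ac   = call <vscale x 4 x i32> @llvm.vector.interleave2.nxv4i32(<vscale x 2 x i32> %a, <vscale x 2 x i32> %c)
%bd   = call <vscale x 4 x i32> @llvm.vector.interleave2.nxv4i32(<vscale x 2 x i32> %b, <vscale x 2 x i32> %d)
%tree = call <vscale x 8 x i32> @llvm.vector.interleave2.nxv8i32(<vscale x 4 x i32> %ac, <vscale x 4 x i32> %bd)

; With this patch: one intrinsic, nothing to construct or match recursively.
%flat = call <vscale x 8 x i32> @llvm.vector.interleave4.nxv8i32(<vscale x 2 x i32> %a, <vscale x 2 x i32> %b, <vscale x 2 x i32> %c, <vscale x 2 x i32> %d)
```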
For interleave factors above 8, for which there are no hardware memory operations to match in the InterleavedAccessPass, we can still keep the wide load + recursive interleaving in the loop vectorizer.
If people agree with the direction, I would post these patches to follow up:

- vlsegN support on RISC-V
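For reference, a sketch of the kind of IR the loop vectorizer would then emit for a factor-4 interleaved load, which InterleavedAccessPass could match and lower to a segmented access (e.g. vlseg4e32 on RISC-V or ld4 on AArch64); the types are illustrative:

```llvm
; Wide load of four interleaved fields, split back out by the new intrinsic.
%wide = load <vscale x 8 x i32>, ptr %p, align 4
%res  = call {<vscale x 2 x i32>, <vscale x 2 x i32>, <vscale x 2 x i32>, <vscale x 2 x i32>} @llvm.vector.deinterleave4.nxv8i32(<vscale x 8 x i32> %wide)
%f0   = extractvalue {<vscale x 2 x i32>, <vscale x 2 x i32>, <vscale x 2 x i32>, <vscale x 2 x i32>} %res, 0
```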