[InstCombine] Fold bitcast (extelt (bitcast X), Idx) into bitcast+shufflevector. #136998
Conversation
Fold sequences such as:

```llvm
%bc = bitcast <8 x i32> %v to <2 x i128>
%ext = extractelement <2 x i128> %bc, i64 0
%res = bitcast i128 %ext to <2 x i64>
```

Into:

```llvm
%bc = bitcast <8 x i32> %v to <4 x i64>
%res = shufflevector <4 x i64> %bc, <4 x i64> poison, <2 x i32> <i32 0, i32 1>
```

The motivation for this is a long-standing regression affecting SIMDe on AArch64 introduced indirectly by the AlwaysInliner (1a2e77c). Some reproducers:

* https://godbolt.org/z/53qx18s6M
* https://godbolt.org/z/o5e43h5M7
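As an illustration of how the extract index and the destination element width interact, here is a hypothetical variation (not one of the tests added by this patch; value names are made up). Extracting element 1 and bitcasting it to four narrower elements gives a shuffle mask starting at Index * NumElts = 1 * 4 = 4:

```llvm
; before
%bc1 = bitcast <16 x i16> %v to <2 x i128>
%ext = extractelement <2 x i128> %bc1, i64 1
%res = bitcast i128 %ext to <4 x i32>

; after: the upper half of %v is selected directly, no scalar i128 is formed
%bc  = bitcast <16 x i16> %v to <8 x i32>
%res = shufflevector <8 x i32> %bc, <8 x i32> poison, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
```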
@llvm/pr-subscribers-llvm-transforms

Author: Ricardo Jesus (rj-jesus)

Changes

Fold sequences such as:

```llvm
%bc = bitcast <8 x i32> %v to <2 x i128>
%ext = extractelement <2 x i128> %bc, i64 0
%res = bitcast i128 %ext to <2 x i64>
```

Into:

```llvm
%bc = bitcast <8 x i32> %v to <4 x i64>
%res = shufflevector <4 x i64> %bc, <4 x i64> poison, <2 x i32> <i32 0, i32 1>
```

The motivation for this is a long-standing regression affecting SIMDe on AArch64 introduced indirectly by the AlwaysInliner (1a2e77c).

This is an alternative approach to #135769 to fix the regressions above.

Full diff: https://github.com/llvm/llvm-project/pull/136998.diff

2 Files Affected:

* llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
* llvm/test/Transforms/InstCombine/bitcast.ll
```diff
diff --git a/llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp b/llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
index 1a95636f37ed7..d656dcc21ae1e 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
@@ -2380,6 +2380,51 @@ static Value *optimizeIntegerToVectorInsertions(BitCastInst &CI,
return Result;
}
+/// If the input is (extractelement (bitcast X), Idx) and the source and
+/// destination types are vectors, we are performing a vector extract from X. We
+/// can replace the extractelement+bitcast with a shufflevector, avoiding the
+/// final scalar->vector bitcast. This pattern is usually handled better by the
+/// backend.
+///
+/// Example:
+/// %bc = bitcast <8 x i32> %X to <2 x i128>
+/// %ext = extractelement <2 x i128> %bc, i64 1
+/// bitcast i128 %ext to <2 x i64>
+/// --->
+/// %bc = bitcast <8 x i32> %X to <4 x i64>
+/// shufflevector <4 x i64> %bc, <4 x i64> poison, <2 x i32> <i32 2, i32 3>
+static Instruction *foldBitCastExtElt(BitCastInst &BitCast,
+ InstCombiner::BuilderTy &Builder) {
+ Value *X;
+ uint64_t Index;
+ if (!match(
+ BitCast.getOperand(0),
+ m_OneUse(m_ExtractElt(m_BitCast(m_Value(X)), m_ConstantInt(Index)))))
+ return nullptr;
+
+ auto *SrcTy = dyn_cast<FixedVectorType>(X->getType());
+ auto *DstTy = dyn_cast<FixedVectorType>(BitCast.getType());
+ if (!SrcTy || !DstTy)
+ return nullptr;
+
+ // Check if the mask indices would overflow.
+ unsigned NumElts = DstTy->getNumElements();
+ if (Index > INT32_MAX || NumElts > INT32_MAX ||
+ (Index + 1) * NumElts - 1 > INT32_MAX)
+ return nullptr;
+
+ unsigned DstEltWidth = DstTy->getScalarSizeInBits();
+ unsigned SrcVecWidth = SrcTy->getPrimitiveSizeInBits();
+ assert((SrcVecWidth % DstEltWidth == 0) && "Invalid types.");
+ auto *NewVecTy =
+ FixedVectorType::get(DstTy->getElementType(), SrcVecWidth / DstEltWidth);
+ auto *NewBC = Builder.CreateBitCast(X, NewVecTy, "bc");
+
+ unsigned StartIdx = Index * NumElts;
+ auto Mask = llvm::to_vector<16>(llvm::seq<int>(StartIdx, StartIdx + NumElts));
+ return new ShuffleVectorInst(NewBC, Mask);
+}
+
/// Canonicalize scalar bitcasts of extracted elements into a bitcast of the
/// vector followed by extract element. The backend tends to handle bitcasts of
/// vectors better than bitcasts of scalars because vector registers are
@@ -2866,6 +2911,9 @@ Instruction *InstCombinerImpl::visitBitCast(BitCastInst &CI) {
if (Instruction *I = canonicalizeBitCastExtElt(CI, *this))
return I;
+ if (Instruction *I = foldBitCastExtElt(CI, Builder))
+ return I;
+
if (Instruction *I = foldBitCastBitwiseLogic(CI, Builder))
return I;
diff --git a/llvm/test/Transforms/InstCombine/bitcast.ll b/llvm/test/Transforms/InstCombine/bitcast.ll
index 37d41de3e9991..cade44412341d 100644
--- a/llvm/test/Transforms/InstCombine/bitcast.ll
+++ b/llvm/test/Transforms/InstCombine/bitcast.ll
@@ -480,6 +480,34 @@ define double @bitcast_extelt8(<1 x i64> %A) {
ret double %bc
}
+; Extract a subvector from a vector, extracted element wider than source.
+
+define <2 x i64> @bitcast_extelt9(<8 x i32> %A) {
+; CHECK-LABEL: @bitcast_extelt9(
+; CHECK-NEXT: [[BC:%.*]] = bitcast <8 x i32> [[A:%.*]] to <4 x i64>
+; CHECK-NEXT: [[BC2:%.*]] = shufflevector <4 x i64> [[BC]], <4 x i64> poison, <2 x i32> <i32 2, i32 3>
+; CHECK-NEXT: ret <2 x i64> [[BC2]]
+;
+ %bc1 = bitcast <8 x i32> %A to <2 x i128>
+ %ext = extractelement <2 x i128> %bc1, i64 1
+ %bc2 = bitcast i128 %ext to <2 x i64>
+ ret <2 x i64> %bc2
+}
+
+; Extract a subvector from a vector, extracted element narrower than source.
+
+define <2 x i8> @bitcast_extelt10(<8 x i32> %A) {
+; CHECK-LABEL: @bitcast_extelt10(
+; CHECK-NEXT: [[BC:%.*]] = bitcast <8 x i32> [[A:%.*]] to <32 x i8>
+; CHECK-NEXT: [[BC2:%.*]] = shufflevector <32 x i8> [[BC]], <32 x i8> poison, <2 x i32> <i32 6, i32 7>
+; CHECK-NEXT: ret <2 x i8> [[BC2]]
+;
+ %bc1 = bitcast <8 x i32> %A to <16 x i16>
+ %ext = extractelement <16 x i16> %bc1, i64 3
+ %bc2 = bitcast i16 %ext to <2 x i8>
+ ret <2 x i8> %bc2
+}
+
define <2 x i32> @test4(i32 %A, i32 %B){
; CHECK-LABEL: @test4(
; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x i32> poison, i32 [[A:%.*]], i64 0
```
Should this be in VectorCombine where we can make it cost-driven?

Yeah, that came up in one of the comments in #135769. Avoiding the vector extract and scalar->vector bitcast seemed generally useful, but I'm happy to move this to VectorCombine if you think that would be better. Shall I do so?

I think so - the extra bonus is you might be able to drop the one-use checks, assuming you can account for the old/new cases in the costs.

Sounds good, thanks for the suggestion! I'll move this then. :)

On second thought, I don't think moving this to VectorCombine will work, at least not from the point of view of addressing the regressions I mentioned above. After InstCombine folds shufflevector+bitcast into bitcast+extractelement, it may perform other folds before we get to VectorCombine that change the IR too much and prevent the fold bitcast+extelt+bitcast -> bitcast+shufflevector from happening (see here for an example based on the second reproducer above). If the fold proposed in this PR isn't suitable for InstCombine, maybe we should revisit #135769, unless you have another suggestion? What do you think @nikic and @RKSimon?
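A minimal sketch of the canonicalization being referred to (hypothetical IR for illustration only; the value names and types are made up, not taken from the reproducers):

```llvm
; shufflevector + bitcast, roughly as it might appear after inlining ...
%shuf = shufflevector <4 x i64> %v, <4 x i64> poison, <2 x i32> <i32 2, i32 3>
%int  = bitcast <2 x i64> %shuf to i128

; ... is folded by InstCombine into bitcast + extractelement, and subsequent
; folds can then reshape the surrounding IR before VectorCombine ever runs.
%bc  = bitcast <4 x i64> %v to <2 x i128>
%int = extractelement <2 x i128> %bc, i64 1
```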
ping |
I'm ok with that. Could you maybe add a test to that PR that wouldn't be covered by having this patch in VectorCombine?

Thanks, that's a good idea, I'll update the other PR shortly with a test.