Description
What version of gRPC-Java are you using?
1.44.0
What is your environment?
A client and server communicating over localhost.
Operating system: Mac OSX
JDK: 11.0.9
What did you expect to see?
Client in this branch not failing the majority of RPCs with DEADLINE_EXCEEDED when using Netty for transport.
What did you see instead?
The majority of calls fail with DEADLINE_EXCEEDED when using Netty, whereas using OkHttp they do not.
Steps to reproduce the bug
This (hacky) reproducer might indicate a potential problem where the periodic flushing of the WriteQueue doesn't happen frequently enough. The resulting latency leads to DEADLINE_EXCEEDED during startup. Switching to OkHttp for this particular case does not cause any RPCs to fail.
On the wire with Netty, multiple RPCs are batched and sent together, whereas with OkHttp they are dispatched immediately.
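For reference, a minimal sketch of the kind of client loop involved (the actual reproducer lives in the linked branch; the health service, port, deadline, and RPC count here are stand-ins, and a server is assumed to be running on localhost):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;

import com.google.common.util.concurrent.ListenableFuture;

import io.grpc.ManagedChannel;
import io.grpc.Status;
import io.grpc.health.v1.HealthCheckRequest;
import io.grpc.health.v1.HealthCheckResponse;
import io.grpc.health.v1.HealthGrpc;
import io.grpc.netty.NettyChannelBuilder;
// import io.grpc.okhttp.OkHttpChannelBuilder;  // swap in to compare transports

public class DeadlineRepro {
  public static void main(String[] args) throws InterruptedException {
    // Netty transport; OkHttpChannelBuilder.forAddress(...) can be dropped in for comparison.
    ManagedChannel channel = NettyChannelBuilder.forAddress("localhost", 50051)
        .usePlaintext()
        .build();
    HealthGrpc.HealthFutureStub stub = HealthGrpc.newFutureStub(channel);

    int rpcs = 1000;
    List<ListenableFuture<HealthCheckResponse>> futures = new ArrayList<>(rpcs);
    for (int i = 0; i < rpcs; i++) {
      // Many concurrent RPCs with a tight deadline, issued right after channel
      // creation to hit the cold-start path.
      futures.add(stub.withDeadlineAfter(100, TimeUnit.MILLISECONDS)
          .check(HealthCheckRequest.getDefaultInstance()));
    }

    int deadlineExceeded = 0;
    for (ListenableFuture<HealthCheckResponse> future : futures) {
      try {
        future.get();
      } catch (ExecutionException e) {
        if (Status.fromThrowable(e.getCause()).getCode() == Status.Code.DEADLINE_EXCEEDED) {
          deadlineExceeded++;
        }
      }
    }
    System.out.println(deadlineExceeded + "/" + rpcs + " RPCs failed with DEADLINE_EXCEEDED");
    channel.shutdownNow().awaitTermination(5, TimeUnit.SECONDS);
  }
}
```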
Here is a screenshot from PerfMark using Netty:
Is this a known issue, or am I doing something fundamentally wrong in the reproducer?
Activity
Title changed from "WriteQueue periodic flushing causes DEADLINE_EXCEEDED during startup?" to "netty: WriteQueue periodic flushing causes DEADLINE_EXCEEDED during startup?"
ejona86 commented on Feb 11, 2022
Thank you for the reproduction and good report. PerfMark is so helpful here.
Java is slow to start. I don't think people really understand how slow Java is when starting. It is slower than Python when starting, as it uses a less optimized interpreter.
We'd expect OkHttp to be faster when starting, at least for the simple reason of it having less code. OkHttp is essentially optimized for the cold-start case, as is common on Android. OkHttp has write-combining code similar to Netty's via the AsyncSink, so I don't think the delayed flushing itself is the problem.
The WriteQueue will run indefinitely as long as there are events. Changing the code to use batch sizes of 16 and return after "flush0" didn't substantially affect the number of RPCs completing within 100 ms. If I increase the deadline to 200 ms, then I see more varying results. So I think netty here is heavily over-saturated, and we'd expect the latency to increase horribly when that happens.
I tried returning earlier in WriteQueue (along with rescheduling) because I think the problem is that reads are blocked on the writes. It seems returning early from WriteQueue may not have much effect. Looking at Netty, it seems it will immediately run the new Runnable. It will let other scheduled runnables execute first, but it doesn't go back to select() until SingleThreadEventExecutor.runAllTasks() completes. This means that deadlines can be processed much more rapidly than reads.
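For concreteness, a rough conceptual sketch of that "drain a small batch, flush, reschedule" experiment (this is not the actual io.grpc.netty.WriteQueue; the class name, fields, and use of plain Runnables are simplified for illustration):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

import io.netty.channel.Channel;

// Illustrative only: the real io.grpc.netty.WriteQueue drains a much larger chunk
// per event-loop task and uses dedicated command objects rather than Runnables.
final class BatchedWriteQueue {
  private static final int BATCH_SIZE = 16;  // the experimental small batch size

  private final Channel channel;
  private final Queue<Runnable> queue = new ConcurrentLinkedQueue<>();
  private final AtomicBoolean scheduled = new AtomicBoolean();
  private final Runnable flushTask = this::flush0;

  BatchedWriteQueue(Channel channel) {
    this.channel = channel;
  }

  /** Called from application threads to enqueue a write command. */
  void enqueue(Runnable writeCommand) {
    queue.add(writeCommand);
    if (scheduled.compareAndSet(false, true)) {
      channel.eventLoop().execute(flushTask);
    }
  }

  /** Runs on the event loop: drain a small batch, flush once, then yield. */
  private void flush0() {
    int drained = 0;
    Runnable cmd;
    while (drained < BATCH_SIZE && (cmd = queue.poll()) != null) {
      cmd.run();       // performs channel.write(...) for one command
      drained++;
    }
    channel.flush();   // one flush for the whole batch
    scheduled.set(false);
    // Return to the event loop instead of draining everything in one task, so
    // reads and other tasks get a chance to interleave; reschedule the leftovers.
    if (!queue.isEmpty() && scheduled.compareAndSet(false, true)) {
      channel.eventLoop().execute(flushTask);
    }
  }
}
```

The intent of yielding after a small batch is to hand control back to the event loop so queued reads can interleave with the writes, which is exactly what the runAllTasks() behavior described above appears to prevent.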
OkHttp has the advantage for this scenario that it uses separate dedicated threads for reading and writing. That means that the responses aren't delayed waiting for the writing to complete. The difference in latency is huge: I see okhttp completing RPCs in 0-10 ms. Although there are other things going on here, because I also see many Netty RPCs completing in 300-400 ms when I set the deadline to 200 ms; that means there's thrashing due to too many threads or similar going on.
I think there should likely be some change based on this reproducer, but I suspect it can't be substantially fixed.
ejona86 commented on Feb 11, 2022
I was accidentally looking in the ioRatio=100 flow. In another flow it runs tasks until X amount of time has passed, where X is based on the time taken for select(). That's good, as it means this should be addressable to some degree, but I'm still unclear on why reads appear to be delayed.
tommyulfsparre commented on Feb 14, 2022
@ejona86 Thanks for looking into this. Feel free to correct the issue title.
For some context, which I forgot to add: we do "pre-warm" the JVM and the gRPC server/channels by issuing health check RPCs until those are served within a configurable latency threshold. This is done before marking the service as healthy/serving.
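Roughly, that pre-warm loop looks something like the sketch below (names, deadlines, and thresholds are illustrative, and it assumes the standard health service is registered on the server):

```java
import java.util.concurrent.TimeUnit;

import io.grpc.ManagedChannel;
import io.grpc.StatusRuntimeException;
import io.grpc.health.v1.HealthCheckRequest;
import io.grpc.health.v1.HealthGrpc;

final class Prewarmer {
  /**
   * Issues health-check RPCs over the given channel until one completes within
   * thresholdMillis, or until maxAttempts is reached. Returns true if the
   * threshold was met; only then do we mark the service as serving.
   */
  static boolean prewarm(ManagedChannel channel, long thresholdMillis, int maxAttempts) {
    HealthGrpc.HealthBlockingStub stub = HealthGrpc.newBlockingStub(channel);
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      long start = System.nanoTime();
      try {
        stub.withDeadlineAfter(thresholdMillis * 4, TimeUnit.MILLISECONDS)
            .check(HealthCheckRequest.getDefaultInstance());
      } catch (StatusRuntimeException e) {
        continue;  // treat failures (including DEADLINE_EXCEEDED) as "not warm yet"
      }
      long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
      if (elapsedMillis <= thresholdMillis) {
        return true;
      }
    }
    return false;
  }
}
```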
This mitigation has been partially successful in addressing the slow start, but we still see cases where, during deployment, services fail RPCs with DEADLINE_EXCEEDED. There are of course many factors outside of gRPC that can contribute to this, but I wanted to see if I could reproduce it in an environment with fewer moving parts. Not saying that this is the same issue, but it might be.
Title changed from "netty: WriteQueue periodic flushing causes DEADLINE_EXCEEDED during startup?" to "netty: Writes can starve reads, causing DEADLINE_EXCEEDED"
ejona86 commented on Mar 7, 2022
Thinking about this more, flow control should normally stall large data transfers and force a batch of reads. So this may be predominantly caused by HPACK/writing out headers.
tommyulfsparre commented on Apr 6, 2022
@ejona86 Do you think this issue should be raised/moved to the netty project instead?
I tried to reproduce this using apple/servicetalk and got similar behaviour but with a different error, which might point to something that needs to be addressed in netty (if possible), as you have already concluded.
ejona86 commented on Apr 6, 2022
It wouldn't be inappropriate to share with netty now. I had been wanting to understand why the Netty code isn't working as it would appear it should. Since you have the issue with servicetalk as well, then that means it is likely a netty issue, modulo there's just some config option we're not tweaking.