
netty: Writes can starve reads, causing DEADLINE_EXCEEDED #8912

Open
@tommyulfsparre

Description

@tommyulfsparre

What version of gRPC-Java are you using?

1.44.0

What is your environment?

A client and server communicating over localhost.

Operating system: Mac OSX
JDK: 11.0.9

What did you expect to see?

The client in this branch should not fail the majority of RPCs with DEADLINE_EXCEEDED when using the Netty transport.

What did you see instead?

The majority of calls fail with DEADLINE_EXCEEDED when using Netty, whereas with OkHttp they don't.

Steps to reproduce the bug

This (hacky) reproducer might indicate a potential problem where the periodic flushing of the WriteQueue doesn't happen frequently enough. This adds latency, leading to DEADLINE_EXCEEDED during startup. Switching to OkHttp for this particular case does not cause any RPCs to fail.

On the wire with Netty, multiple RPCs are batched and sent together, whereas with OkHttp they are dispatched immediately.
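For reference, here is a rough sketch of the kind of call pattern involved: many concurrent short-deadline RPCs issued right after channel creation. The port, RPC count, deadline, and use of the standard health service are placeholders and not the actual reproducer code; swapping the commented-out builder compares the two transports.

```java
import com.google.common.util.concurrent.ListenableFuture;
import io.grpc.ManagedChannel;
import io.grpc.health.v1.HealthCheckRequest;
import io.grpc.health.v1.HealthCheckResponse;
import io.grpc.health.v1.HealthGrpc;
import io.grpc.netty.NettyChannelBuilder;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class DeadlineRepro {
  public static void main(String[] args) throws Exception {
    // Netty transport (the failing case). To compare, swap this for:
    // io.grpc.okhttp.OkHttpChannelBuilder.forAddress("localhost", 50051).usePlaintext().build();
    ManagedChannel channel = NettyChannelBuilder.forAddress("localhost", 50051)
        .usePlaintext()
        .build();

    // Assumes the server registers the standard health service on this port.
    HealthGrpc.HealthFutureStub stub = HealthGrpc.newFutureStub(channel);
    List<ListenableFuture<HealthCheckResponse>> futures = new ArrayList<>();
    for (int i = 0; i < 500; i++) {
      // Many concurrent RPCs with a short deadline, right after startup.
      futures.add(stub.withDeadlineAfter(100, TimeUnit.MILLISECONDS)
          .check(HealthCheckRequest.getDefaultInstance()));
    }

    int failed = 0;
    for (ListenableFuture<HealthCheckResponse> f : futures) {
      try {
        f.get();
      } catch (Exception e) {
        failed++;  // typically DEADLINE_EXCEEDED on the Netty transport
      }
    }
    System.out.println("Failed RPCs: " + failed + " / " + futures.size());
    channel.shutdownNow();
  }
}
```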

Here is a screenshot from PerfMark using Netty:

[PerfMark trace screenshot]

Is this a known issue, or am I doing something fundamentally wrong in the reproducer?

Activity

Changed the title from "WriteQueue periodic flushing causes DEADLINE_EXCEEDED during startup?" to "netty: WriteQueue periodic flushing causes DEADLINE_EXCEEDED during startup?" on Feb 11, 2022
ejona86 (Member) commented on Feb 11, 2022

Thank you for the reproduction and good report. PerfMark is so helpful here.

Java is slow to start. I don't think people really understand how slow Java is when starting. It is slower than Python when starting, as it uses a less optimized interpreter.

We'd expect OkHttp to be faster when starting, at least for the simple reason of it having less code. OkHttp is essentially optimized for the cold-start case, as is common on Android. OkHttp has write-combining code similar to Netty's via the AsyncSink, so I don't think the delayed flushing itself is the problem.

The WriteQueue will run indefinitely as long as there are events. Changing the code to use a batch size of 16 and return after "flush0" didn't substantially affect the number of RPCs completing within 100 ms. If I increase the deadline to 200 ms, then I see more varying results. So I think Netty here is heavily over-saturated, and we'd expect the latency to increase horribly when that happens.
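For readers unfamiliar with the write-combining being discussed, here is a simplified, hypothetical sketch of the pattern: drain queued writes on the event loop, flush once per batch, then return and reschedule. This is not the actual grpc-netty WriteQueue; the class name, batch handling, and scheduling are illustrative only.

```java
import io.netty.channel.Channel;

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

/** Simplified write-combining: queue writes and flush once per batch. */
final class BatchingWriteQueue {
  private static final int BATCH_SIZE = 16;  // the batch size tried in the experiment above

  private final Channel channel;
  private final Queue<Object> queue = new ConcurrentLinkedQueue<>();
  private final AtomicBoolean scheduled = new AtomicBoolean();

  BatchingWriteQueue(Channel channel) {
    this.channel = channel;
  }

  /** May be called from any thread; the actual writing happens on the event loop. */
  void enqueue(Object msg) {
    queue.add(msg);
    if (scheduled.compareAndSet(false, true)) {
      channel.eventLoop().execute(this::drain);
    }
  }

  /**
   * Runs on the event loop: write up to BATCH_SIZE messages, then flush and
   * return. Rescheduling the remainder (instead of looping until the queue is
   * empty) is meant to give other event-loop work, such as reads, a chance to run.
   */
  private void drain() {
    int written = 0;
    Object msg;
    while (written < BATCH_SIZE && (msg = queue.poll()) != null) {
      channel.write(msg);
      written++;
    }
    if (written > 0) {
      channel.flush();
    }
    scheduled.set(false);
    if (!queue.isEmpty() && scheduled.compareAndSet(false, true)) {
      channel.eventLoop().execute(this::drain);
    }
  }
}
```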

I tried returning earlier in WriteQueue (along with rescheduling) because I think the problem is that reads are blocked on the writes. It seems returning early from WriteQueue may not have much effect. Looking at Netty, it seems it will immediately run the new Runnable. It will let other scheduled runnables execute first, but it doesn't go back to select() until SingleThreadEventExecutor.runAllTasks() completes. This means that deadlines can be processed much more rapidly than reads.

OkHttp has the advantage for this scenario that it uses separate dedicated threads for reading and writing. That means that responses aren't delayed waiting for the writing to complete. The difference in latency is huge: I see OkHttp completing RPCs in 0-10 ms. Although there are other things going on here, because I also see many Netty RPCs completing in 300-400 ms when I set the deadline to 200 ms; that means there's thrashing due to too many threads or something similar going on.

I think there should likely be some change based on this reproducer, but I suspect it can't be substantially fixed.

ejona86 (Member) commented on Feb 11, 2022

Looking at Netty it seems it will immediately run the new Runnable.

I was accidentally looking in the ioRatio=100 flow. In another flow it runs tasks until X amount of time has passed, where X is based on the time taken for select(). That's good, as it means this should be addressable to some degree, but I'm still unclear on why reads appear to be delayed.
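As a possible experiment (purely an assumption on my part, not something verified to help here), Netty's ioRatio controls how long the event loop spends running tasks relative to I/O, and grpc-java lets you supply your own event loop group to play with it:

```java
import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioSocketChannel;

public class IoRatioExperiment {
  public static void main(String[] args) {
    NioEventLoopGroup workerGroup = new NioEventLoopGroup(1);
    // ioRatio=50 means roughly equal time for I/O (select/read) and non-I/O
    // tasks; ioRatio=100 runs all pending tasks after each select pass.
    workerGroup.setIoRatio(50);

    ManagedChannel channel = NettyChannelBuilder.forAddress("localhost", 50051)
        .eventLoopGroup(workerGroup)          // custom group with tuned ioRatio
        .channelType(NioSocketChannel.class)  // must match the NIO group
        .usePlaintext()
        .build();
    // ... issue RPCs, then shut down the channel and the event loop group.
  }
}
```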

tommyulfsparre (Author) commented on Feb 14, 2022

@ejona86 Thanks for looking into this. Feel free to correct the issue title.

Java is slow to start. I don't think people really understand how slow Java is when starting

For some context, which I forgot to add: we "pre-warm" the JVM and the gRPC server/channels by issuing health check RPCs until they are served within a configurable latency threshold. This is done before marking the service as healthy/serving.
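Roughly, the pre-warming looks like this (a sketch with made-up deadlines and attempt counts, using the standard health service; not our production code):

```java
import io.grpc.ManagedChannel;
import io.grpc.StatusRuntimeException;
import io.grpc.health.v1.HealthCheckRequest;
import io.grpc.health.v1.HealthGrpc;

import java.util.concurrent.TimeUnit;

final class Prewarmer {
  /** Issues health checks until one completes under the latency threshold. */
  static void prewarm(ManagedChannel channel, long thresholdMillis, int maxAttempts) {
    HealthGrpc.HealthBlockingStub stub = HealthGrpc.newBlockingStub(channel);
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      long start = System.nanoTime();
      try {
        stub.withDeadlineAfter(1, TimeUnit.SECONDS)
            .check(HealthCheckRequest.getDefaultInstance());
      } catch (StatusRuntimeException e) {
        continue;  // ignore failures while the JVM and channel warm up
      }
      long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
      if (elapsedMillis <= thresholdMillis) {
        return;  // warm enough; mark the service as healthy/serving
      }
    }
  }
}
```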

This mitigation has been partially successful in addressing the slow start, but we still see cases during deployment where services fail RPCs with DEADLINE_EXCEEDED. There are of course many factors outside of gRPC that can contribute to this, but I wanted to see if I could reproduce it in an environment with fewer moving parts. I'm not saying this is the same issue, but it might be.

Changed the title from "netty: WriteQueue periodic flushing causes DEADLINE_EXCEEDED during startup?" to "netty: Writes can starve reads, causing DEADLINE_EXCEEDED" on Mar 7, 2022
ejona86 (Member) commented on Mar 7, 2022

Thinking about this more, flow control should normally stall large data transfers and force a batch of reads. So this may be predominantly caused by HPACK/writing out headers.

tommyulfsparre (Author) commented on Apr 6, 2022

@ejona86 Do you think this issue should be raised/moved to the netty project instead?

I tried to reproduce this using apple/servicetalk and got similar behaviour but with a different error, which might point to something that needs to be addressed (if possible) in Netty, as you have already concluded.

ejona86 (Member) commented on Apr 6, 2022

It wouldn't be inappropriate to share with Netty now. I had been wanting to understand why the Netty code isn't working as it would appear it should. Since you have the issue with servicetalk as well, that means it is likely a Netty issue, modulo there's just some config option we're not tweaking.

Added this to the Next milestone on Sep 27, 2022