Checkpoint-restore with WebFlux and Undertow does not work when graceful shutdown is enabled #43655

Open
@sdeleuze

Description

After updating https://github.com/spring-projects/spring-lifecycle-smoke-tests to run tests against Spring Boot 3.4.x, I noticed that framework:webflux-undertow:checkpointRestoreAppTest is broken with Boot 3.4.x while still green with Boot 3.3.x, even though both use the same Undertow version. It fails with the following error:

Error (criu/libnetlink.c:54): -95 reported by netlink: Operation not supported
Error (criu/net.c:3744): Unable to create a veth pair: -95

While discussing with @snicoll what could have caused that, he mentioned that Spring Boot 3.4.x enables graceful shutdown by default, so I tried server.shutdown=immediate and found that it fixes the test.
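
For reference, the workaround mentioned above amounts to a one-line setting. This is a minimal sketch that assumes the application is configured through an application.properties file; the file location is an assumption, and the property and value are the ones quoted above.

```properties
# Assumed location: src/main/resources/application.properties
# Opt out of graceful shutdown so the web server stops immediately.
server.shutdown=immediate
```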

Could the Spring Boot team see if we could avoid this regression and keep WebFlux + Undertow CRaC support working out of the box? I suspect that when graceful shutdown is enabled, it has not finished when the JVM checkpoint is invoked, leaving the socket in a bad state, hence the error above.

Activity

wilkinsona (Member) commented on Jan 3, 2025

This doesn't look like a regression to me as it also fails (although perhaps differently) with Boot 3.3.x when graceful shutdown is enabled:

> Task :framework:webflux-undertow:checkpointRestoreAppTest FAILED

WebfluxApplicationTests > stringResponseBody(WebTestClient) STANDARD_OUT
    09:43:18.371 [Test worker] ERROR org.springframework.test.web.reactive.server.ExchangeResult -- Request details for assertion failure:

    > GET http://localhost:38021
    > accept-encoding: [gzip]
    > user-agent: [ReactorNetty/1.1.25]
    > host: [localhost:38021]
    > accept: [*/*]
    > WebTestClient-Request-Id: [1]

    No content

    < 503 SERVICE_UNAVAILABLE Service Unavailable
    < Connection: [keep-alive]
    < Content-Length: [0]
    < Date: [Fri, 03 Jan 2025 09:43:18 GMT]

    0 bytes of content (unknown content-type).


WebfluxApplicationTests > stringResponseBody(WebTestClient) FAILED
    java.lang.AssertionError at WebfluxApplicationTests.java:18

WebfluxApplicationTests > resourceInStatic(WebTestClient) STANDARD_OUT
    09:43:18.401 [Test worker] ERROR org.springframework.test.web.reactive.server.ExchangeResult -- Request details for assertion failure:

    > GET http://localhost:38021/foo.html
    > accept-encoding: [gzip]
    > user-agent: [ReactorNetty/1.1.25]
    > host: [localhost:38021]
    > accept: [*/*]
    > WebTestClient-Request-Id: [2]

    No content

    < 503 SERVICE_UNAVAILABLE Service Unavailable
    < Connection: [keep-alive]
    < Content-Length: [0]
    < Date: [Fri, 03 Jan 2025 09:43:18 GMT]

    0 bytes of content (unknown content-type).
changed the title from "Regression on WebFlux + Undertow with Project CRaC" to "Checkpoint-restore with WebFlux and Undertow does not work when graceful shutdown is enabled" on Jan 3, 2025
added this to the 3.3.x milestone on Jan 3, 2025
wilkinsona (Member) commented on Jan 3, 2025

With Boot 3.4.1, I'm seeing the same behavior as Boot 3.3.x when graceful shutdown is enabled. The checkpoint works, the app starts successfully upon restore, and then rejects requests with a 503. This happens because Undertow's GracefulShutdownHandler is single-use. Once it has been shut down (as happens when taking the checkpoint), the shutdown bit is set in its state field. The bit isn't cleared upon restore, so the handler still believes that Undertow has been shut down. There's no API to clear it, so we may have to resort to reflection if this is something that we want to support. Alternatively, it might be possible to ignore the handler somehow when taking a checkpoint so that it isn't shut down.
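
To make the failure mode concrete, here is a toy model of a single-use shutdown gate. It is not Undertow's GracefulShutdownHandler: the class name, the bit layout, and the reset() method are all hypothetical and only illustrate why a shutdown bit that is never cleared leads to 503 responses after restore.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Toy model of a single-use graceful shutdown gate.
 * NOT Undertow's implementation; every name here is hypothetical.
 */
public class SingleUseShutdownGate {

    // Hypothetical layout: the highest bit of the state marks "shut down".
    private static final long SHUTDOWN_BIT = 1L << 63;

    private final AtomicLong state = new AtomicLong();

    /** Returns the status a request would get: 200 if accepted, 503 if rejected. */
    public int handleRequest() {
        return (state.get() & SHUTDOWN_BIT) != 0 ? 503 : 200;
    }

    /** Sets the shutdown bit, as would happen when the checkpoint is taken. */
    public void shutdown() {
        state.updateAndGet(s -> s | SHUTDOWN_BIT);
    }

    /** Hypothetical counterpart that clears the bit, which is what a restore would need. */
    public void reset() {
        state.updateAndGet(s -> s & ~SHUTDOWN_BIT);
    }

    public static void main(String[] args) {
        SingleUseShutdownGate gate = new SingleUseShutdownGate();
        System.out.println(gate.handleRequest()); // 200: running normally
        gate.shutdown();                          // checkpoint taken
        System.out.println(gate.handleRequest()); // 503: bit still set after restore
        gate.reset();                             // what a restore hook would have to do
        System.out.println(gate.handleRequest()); // 200: requests accepted again
    }
}
```

In this model the fix is trivial because reset() exists. The point of the comment above is that the real handler offers no equivalent, which is why reflection or bypassing the handler during checkpoint are the options on the table.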

sdeleuze (Contributor, Author) commented on Jan 6, 2025

For the automatic checkpoint/restore at startup use case, where -Dspring.context.checkpoint=onRefresh is set, graceful shutdown is IMO not needed (for any web server) since no request is expected to have been received. If you can disable it (for Undertow or all servers) for that use case specifically, that would make sense. Spring Boot can leverage DefaultLifecycleProcessor#CHECKPOINT_PROPERTY_NAME and DefaultLifecycleProcessor#ON_REFRESH_VALUE.
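
The decision described above could be sketched roughly as follows. This is a standalone illustration, not Spring Boot code: the class and method names are hypothetical, the two constants are local stand-ins whose values are assumed from the -Dspring.context.checkpoint=onRefresh flag quoted above, and the property is read with plain System.getProperty for simplicity.

```java
/**
 * Hypothetical sketch: pick the effective shutdown mode depending on
 * whether an automatic checkpoint on refresh was requested.
 */
public class CheckpointShutdownPolicy {

    // Local stand-ins for the DefaultLifecycleProcessor constants; the values
    // are assumed from the -Dspring.context.checkpoint=onRefresh flag.
    public static final String CHECKPOINT_PROPERTY_NAME = "spring.context.checkpoint";
    public static final String ON_REFRESH_VALUE = "onRefresh";

    /**
     * Returns "immediate" when the checkpoint is taken on refresh (no request
     * can be in flight yet); otherwise returns the configured mode unchanged.
     */
    public static String effectiveShutdown(String checkpointValue, String configuredShutdown) {
        if (ON_REFRESH_VALUE.equals(checkpointValue)) {
            return "immediate";
        }
        return configuredShutdown;
    }

    public static void main(String[] args) {
        // Null when the JVM was started without the checkpoint flag.
        String checkpoint = System.getProperty(CHECKPOINT_PROPERTY_NAME);
        System.out.println(effectiveShutdown(checkpoint, "graceful"));
    }
}
```

Run with -Dspring.context.checkpoint=onRefresh, the main method prints immediate; without the flag it prints graceful.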

For the on-demand checkpoint/restore of a running application, I think graceful shutdown makes more sense, so maybe I could create a related GracefulShutdownHandler feature request on the Undertow bug tracker, and for now we just document in https://github.com/spring-projects/spring-lifecycle-smoke-tests that people using Undertow + CRaC + on-demand checkpoint/restore should disable graceful shutdown?

          Checkpoint-restore with WebFlux and Undertow does not work when graceful shutdown is enabled · Issue #43655 · spring-projects/spring-boot