Skip to content

Implement concurrent formatting #6091

Open
@MarcusGrass

Description

@MarcusGrass

Hi,

I've been rummaging around this repo for a bit lately, getting a bit more familiar with the code by looking at some bugs.

What I really like working with is performance though and I noticed that even though files are formatting independently of each other (in some cases), it's all run single-threaded.

In most cases that's entirely reasonable since rustfmt is using rustc-types that are inherently not thread-safe, but in some cases it's possible to run each format file in a thread. So I tested that out.

The performance benefits are pretty staggering, of course, the measurements are not especially scientific, and it's only measured on my machine, but we're not talking about percentage points here but 3x - 10x for big projects on my machine.

Running rustfmt on all files in tokio takes about 400ms while running multithreaded rustfmt takes about 50ms.
Running rustfmt on all files an unpublished large codebase takes about 5000ms, while multithreaded is about 800ms.

Running cargo fmt --all on tokio is around 265ms, while running cargo fmt --all multithreaded is around 80ms.
Running cargo fmt --all on the same unpublished large codebase is around 1800ms, multithreaded it's 570ms.

On a smaller repo with just a 7 files, multithreaded was about 2x faster.

Downsides

Making rustfmt run concurrently will cause problems because of the good old C-ghost, shared globals, in this case it's stdout and stderr, any kind of interaction with those facilities will cause a mess.

Exemplified:
File A and File B is processed concurrently, in verbose.
There's a diff in File B, and an error in file A.
You could get some output lite:
"Using rustfmt config file rustfmt.cfg for filea.rs"
"Using rustfmt config file rustfmt.cfg for fileb.rs"
Diff at fileb.rs l 322
Error writing files {msg} (Error in file A, but you won't know that).

This is a non-trivial problem since stdout and stderr while they are internally synced for a given (e)println are not synced on the file-parsing level. And locking stdout and/or stderr until a file-pass finishes would nullify most of the benefits of multithreading completely.

Proposed solution

Reword formatting runs to take an output buffer and then print that chronologically based on which file is processed.

This preserves previous printing behaviour completely, but requires a pretty sweeping change where these buffers have to be passed down anywhere printing is wanted.

In practice there's only one part of the code where this becomes particularly problematic, and that's the diff-writer which uses term, which prints colored messages to the terminal.

There's another potential issue, which is preserving ordering between writes to stdout and stderr.
I.E. Buffers are passed down into the application, the code prints into those instead.
Now the code first prints with an eprintln then a println. The calling code gets the buffers back and is responsible for flushing them to stderr and stdout respectively, in which order should the calling code flush them?

This isn't necessarily a huge problem. Ideally code deep down the call-stack shouldn't be printing, and refactoring that may be the best thing to do for clarity anyway, but it's something to consider.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions