
[DTLTO][LLVM] Integrated Distributed ThinLTO (DTLTO) #127749

Open · wants to merge 27 commits into main

Conversation

bd1976bris (Collaborator):

This PR introduces initial support for Integrated Distributed ThinLTO (DTLTO) in LLVM.

DTLTO enables the distribution of backend ThinLTO compilations via external distribution systems, such as Incredibuild. Existing support for distributing ThinLTO compilations typically involves separate thin-link (--thinlto-index-only), backend compilation, and link steps coordinated by a modern build system, like Bazel. This "Bazel-style" distributed ThinLTO requires a modern build system as it must handle the dynamic dependencies specified in the summary index file shards. However, adopting a modern build system can be prohibitive for users with established build infrastructure. In contrast, DTLTO manages distribution within LLVM during the traditional link step. This approach means that DTLTO is usable with any build process that supports in-process ThinLTO.

This PR provides a minimal but functional implementation of DTLTO, which will be expanded in subsequent commits. It implements basic support for distributing backend ThinLTO compilation jobs via an external process that coordinates with a distribution system (such as Incredibuild). A JSON interface is used for communication with this external process. The JSON interface allows new distribution systems to be supported without modifying LLVM. Please see the documentation added in this PR: llvm/docs/dtlto.rst for more details.
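For a concrete sense of the interface, here is a sketch of a minimal serial distributor in the spirit of the in-tree local.py script. It is illustrative only: the "common"/"jobs"/"args" field names are assumptions based on the discussion in this thread, not a normative schema.

import json
import subprocess
import sys

if __name__ == "__main__":
    # The linker invokes the distributor with the JSON job-description
    # file as the final command-line argument.
    with open(sys.argv[-1]) as f:
        doc = json.load(f)
    # Run each backend compilation job locally, one at a time.
    for job in doc["jobs"]:
        subprocess.check_call(doc["common"]["args"] + job["args"])

A real distributor would instead forward these jobs to a distribution system (for example, Incredibuild) and wait for the output files to appear.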

Subsequent commits will add:

  • Support for the ThinLTO cache.
  • Support for more LTO configuration states, e.g., basic block sections.
  • Performance improvements, for example, faster temporary file removal.

For the design of the DTLTO feature, see the Integrated Distributed ThinLTO RFC: #126654.

llvmbot added labels: LTO (Link time optimization: regular/full LTO or ThinLTO), llvm:support, llvm:transforms — Feb 19, 2025

github-actions bot commented Feb 19, 2025

✅ With the latest revision this PR passed the Python code formatter.

github-actions bot commented Feb 19, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

This will be used for supplying a `-target=<triple>` option to clang for
DTLTO.

Note that if DTLTO is changed to use an optimization tool that does not
require an explicit triple to be passed, then the triple handling can be
removed entirely.
The DTLTO ThinLTO backend will override this function to perform
code generation.

The DTLTO ThinLTO backend will not do any codegen when invoked for each
task. Instead, it will generate the required information (e.g., the
summary index shard, import list, etc.) to allow for the codegen to be
performed externally. The backend's `wait` function will then invoke an
external distributor process to do backend compilations.
The DTLTO ThinLTO backend will need to customize the file paths used for
the summary index shard files. This is so that the names of these files
can include a unique ID so that multiple DTLTO links do not overwrite
each other's summary index shard files.

It also needs to be able to get the import files list for each job in
memory rather than writing it to a file. This is to allow the import
lists to be included in the JSON used to communicate with the
external distributor process.
The new setup() member is now called to allow the ThinLTO backends to
prepare for code generation.

For the DTLTO backend, this will be used to preallocate storage for the
information required to perform the backend compilation jobs.
Move some setup code and state to a base class.

This will allow the DTLTO ThinLTO backend to share this setup code and
state by also deriving from this base class.
Structural changes:
1. A new ThinLTO backend implementing DTLTO has been added.
2. Both the new backend and the InProcess backend derive from a common
   base class to share common setup code and state.
3. The target triple is now stored for the ThinLTO bitcode files.
4. A new setup() member is called, which the ThinLTO backends can use to
   prepare for code generation. For the DTLTO backend, this is used to
   pre-allocate storage for the information required to perform the
   backend compilation jobs.
5. The functions for emitting summary index shard and imports files have
   been altered to allow the caller to specify the filenames to write
   and to allow the list of imports to be stored in a container rather
   than written to a file.
Intentionally minimal for now. Additional state translation will be added
in future commits.
Replace the use of AddStream with AddBuffer in the DTLTO ThinLTO backend
for adding files to the link.

Unlike the InProcess ThinLTO backend, DTLTO runs the backend compilation
jobs by invoking an external process (currently clang). This writes the
output object file to disk. Therefore, DTLTO requires a performant way
of adding an existing file to the link. Note that the AddBuffer
mechanism is also used for adding a file to the link if there is a cache
hit.
Some of the code in LTO.h did not conform to the LLVM coding standard,
although it matched the style already used in that file. This caused the
automated code-formatting checks from GitHub to fail, which was confusing
on the PR. I decided to apply clang-format everywhere to prevent this,
even though the new code no longer matches the existing style.
bd1976bris (Collaborator, Author):

I force-pushed an update; the code is equivalent. Reasons for the force-push:

  1. To split up the change into multiple smaller steps. This should facilitate review.
  2. To apply clang-format to everything (I decided that getting the GitHub checks green was more important than maintaining the existing coding style).

@bd1976bris bd1976bris requested review from MaskRay and removed request for MaskRay February 19, 2025 22:45
1. Use "with" when opening files.
2. Add module docstrings.
bd1976bris (Collaborator, Author):

In offline review it was suggested that the distributor scripts could benefit from module docstrings and the preferred "with" statement idiom for opening files. I made those changes and also spotted some nits in the tests which I have corrected. I'm adding new commits rather than updating the original commits and force-pushing as I think that force-pushing might cause turbulence for reviewers. Please let me know if you would prefer force-pushing.

bd1976bris added a commit to bd1976bris/llvm-project that referenced this pull request Feb 25, 2025
- Remove linker version.

- Make AddBuffer a default parameter for LTO::run() to minimize changes
  at call sites.

- Format code with clang-format, even if it deviates slightly from the
  surrounding style. I decided that CI passing for the PRs was more
  important than matching the existing style.

- Order distributor command-line options correctly.

- Remove dead code in llvm-lto2.cpp.

- Add docstrings to Python distributors.

- Update the Python mock.py distributor to validate JSON.

- Organize LLVM DTLTO tests by moving them into a dedicated "dtlto"
  directory, tidying existing tests and adding new ones.

- Implement fix for more than one LTO partition (spotted by MaskRay).
- Replace `thinlto-remote-opt-tool` with `thinlto-remote-compiler`
- Make `thinlto-remote-compiler` a cl::opt
teresajohnson (Contributor) left a comment:

Sorry for the delay! I have some comments and questions below. Thanks for splitting out the LLVM portions.

@@ -0,0 +1,3 @@
Tests for DTLTO (integrated distributed ThinLTO) functionality.

These are integration tests as DTLTO invokes `clang` for code-generation.
Contributor:

Is the use of llvm-lto and opt here just to do testing before the clang and lld patches go in? If so, do we need cross project testing right now? Wondering if it would be better to just stick with the testing you have in this patch that is under llvm/test, and add the cross project testing using only external tools once those patches are in.

Collaborator (Author):

This test checks that the DTLTO code that creates Clang command lines generates the expected options and that those options are accepted by clang. I wrote it to use llvm-lto and opt so it could be part of this commit before the clang and lld patches go in. However, I think it is a nice principle to only use external tools in cross-project-tests. I have removed the test from this PR with the intention of adding it back later. Additionally if we can use your idea (#127749 (comment)) of storing the Clang -c step options and applying them to the remote backend compilations then this test will be rewritten to test that behavior instead.

contribute. Distributors can use this to label build jobs for informational
purposes.

- **Linker's version string**.
Contributor:

I thought I saw that this got removed

Collaborator (Author):

Apologies, I hadn't pulled in the documentation changes that were made on #126654. Fixed now.

the same information that is emitted into import files for ThinLTO.

- The **input files** required for each job.
- The per-job set of files required for backend compilation, such as bitcode
Contributor:

Does the list of bitcode files include the bitcode files being imported from? If so, is the imports list described earlier needed?

Collaborator (Author):

The JSON format has been simplified now. The imports field has been removed and there is now just an "inputs" field which contains all the inputs specific to a compilation job including the bitcode modules from which imports will be made. I have simplified the documentation to match. Thanks.

; RUN: rm -rf %t && split-file %s %t && cd %t

;; Generate bitcode files with a summary index.
; RUN: opt -thinlto-bc x86_64-unknown-linux-gnu.ll -o x86_64-unknown-linux-gnu.bc
Contributor:

Why not use clang with input .cc files instead of opt with input .ll files for the cross-project tests (related to my earlier comment about whether this should wait and just use external tools)?

Collaborator (Author):

Clang can be used here. Thanks. However, I have removed this test for now. See earlier comment: #127749 (comment)

invocation is contributing to, to aid the user in identifying them:

- **JSON Job Description File**:
- Format: `dtlto.<UID>.dist-file.json`
Contributor:

What is UID in this context? Should this be PID?

bd1976bris (Collaborator, Author) commented Mar 5, 2025:

I have changed this to be PID now. Note that I expect the naming strategy to change in the future as we get feedback from more distribution systems. I have some concerns that mixing the PID into the filenames might prevent the caching systems for some distribution systems from working effectively.

Collaborator (Author):

I have simplified the documentation now and this section has been removed. I don't want users relying on the naming scheme for temporary files which might be subject to change when I look at optimizing cache utilization.

RUN: not %{command} -save-temps 2>&1 \
RUN: | FileCheck %s --check-prefixes=ERR
RUN: ls | FileCheck %s --check-prefix=NOINDEXFILES
NOINDEXFILES-NOT: imports
Contributor:

Redundant with earlier line

Collaborator (Author):

Thanks. Removed the first check leaving only the -save-temps one.

# Check that imports files are created with -thinlto-emit-imports.
RUN: not %{command} -thinlto-emit-imports 2>&1 \
RUN: | FileCheck %s --check-prefixes=ERR
RUN: ls | FileCheck %s --check-prefix=INDEXFILES
Contributor:

Similar to above: suggest s/INDEX/IMPORT/.

Collaborator (Author):

Thanks. Fixed.

@@ -97,6 +97,12 @@ static cl::opt<bool>
"specified with -thinlto-emit-indexes or "
"-thinlto-distributed-indexes"));

static cl::opt<bool> DTLTO("dtlto", cl::desc("Perform DTLTO"));
Contributor:

Rather than requiring a separate bool flag, consider checking whether dtlto-distributor was specified?

Collaborator (Author):

Nice. Done.


if __name__ == "__main__":
    json_arg = sys.argv[-1]
    distributor_args = sys.argv[1:-1]
Contributor:

Suggest calling this input_file instead of distributor_args to be clearer.

Collaborator (Author):

Thanks. Done now.

// shard, import list, etc.) to allow for the codegen to be performed
// externally. This backend's `wait` function then invokes an external
// distributor process to do backend compilations.
class OutOfProcessThinBackend : public CGThinBackend {
Contributor:

In practice, how much does this share with the original InProcessThinBackend? I am wondering if the approach of having them derive from the same refactored CGThinBackend is useful.

Collaborator (Author):

Very little. Just some state and a few lines which handle CfiFunctionDefs. It's still nice to share these parts though. Should I remove CGThinBackend? What is your preference?

Contributor:

I was thinking about this: the backend thread handling is quite similar between this new backend and the write indexes backend (which invokes emitFiles), i.e., the latter could probably be refactored to use the new Job structure you have introduced here. It isn't clear to me which existing backend has more in common with the new one. Do you eventually plan to share the Cache from the in process backend?

Collaborator (Author):

> I was thinking about this: the backend thread handling is quite similar between this new backend and the write indexes backend

In the sense of using the Cache and also adding the generated object files to the link, it is similar to the in process backend. However, in the sense of primarily using emitFiles to generate the pre-job summary index files, it is similar to the write indexes backend.

> Do you eventually plan to share the Cache from the in process backend?

Eventually we will add support for the Cache that is used in the in process backend. We have found that, in some cases, this is more performant than the distribution system caching.

- Pull in documentation improvements.
cachemeifyoucan (Collaborator) left a comment:

I haven't read the bulk of the implementation yet, but I'll start with some comments.

Limitations
-----------

The current implementation of DTLTO has the following limitations:
Collaborator:

I feel like this should be a TODO list, since the items on it are all doable and thus not limitations by design.

"primary_input": ["dtlto.o"],
"summary_index": ["dtlto.1.51232.native.o.thinlto.bc"],
"primary_output": ["dtlto.1.51232.native.o"],
"imports": [],
Collaborator:

I think it would be helpful to show an example with more than one job, so there is an example of how cross-referencing works.

"-O3", "-fprofile-sample-use=my.profdata",
"-o", ["primary_output", 0],
"-c", "-x", "ir", ["primary_input", 0],
["summary_index", "-fthinlto-index=", 0],
Collaborator:

Is this a standard JSON format currently used by a distributor you are targeting? If you have an OpenAPI doc, it would be even better.

I feel the format is a bit weird. It is an array containing different types; the sub-arrays that provide indexes are not the same length, and the items at the same offset are not always the same thing. Doesn't that make a decoder harder to write?

Collaborator (Author):

> Is this a standard JSON format currently used by a distributor you are targeting? If you have an OpenAPI doc, it would be even better.

It's my own format, although it has been shown to have enough information to drive distribution via real-world distribution systems we have tested with Incredibuild and reclient+buildfarm (both of which are available to test for free).

I'll try creating an OpenAPI doc. My feeling is that the format is too simple to need something like this, though.

> I feel the format is a bit weird. It is an array containing different types; the sub-arrays that provide indexes are not the same length, and the items at the same offset are not always the same thing. Doesn't that make a decoder harder to write?

I don't think that decoding is particularly difficult (see, for example, local.py).

I'm not attached to the format of the JSON as long as it can do what we need. Here are the goals/constraints I came up with for a JSON interface:

  • Be as minimal as possible.
  • Don't use spaces in field names.
    • This allows for programmatic access, such as with member-access syntax, e.g., .summary_index.
  • Allow the codegen tool to be replaced without a schema update.
  • Don't depend on platform- or shell-specific shlexing of strings.
    • Think about handling quotes, spaces, etc.
  • Support filenames appearing inside arguments.
    • e.g., -fthinlto-index=summary.index.
  • Decouple LLVM from distributors to the largest degree possible.
    • This is to try to ensure that if the command line for the codegen tool changes, the distributor(s) shouldn't need updating.
  • Explicitly provide the primary input file, summary index files, and primary output file.
    • This is to help with supporting distribution systems such as FASTBuild, distcc, and Icecream, which lack sandboxing and might require the paths to be manipulated.

In the current format, the sub-arrays are used to represent references to per-job filenames so that a command line can be stored that doesn't contain filenames. This ensures that the contents of the command line remain transparent to the distributor, allowing another code-generation tool to replace Clang without updating the distributors. It also means that the distributor has access to the filenames (as they are stored separately) and can manipulate them, if required, without needing to change the command line.

Collaborator:

> I'll try creating an OpenAPI doc. My feeling is that the format is too simple to need something like this, though.

My comment about OpenAPI was that, if one already exists because the format exists, it would be good to link it. Otherwise, I don't think it is a high-priority task for now.

For the format, I am not too attached either, but here is how I think it can be improved:

  • For the mixed types, you have to parse by traversing JSON values and checking their types. You can't really write a data type in languages like C++ that maps directly to the JSON object, which is why I say the parsing is a bit hard.
  • Maybe a better example would help; I don't see what the last 0 in the array means. Is it ever non-zero? Looking at local.py only makes me more confused. An example with more than one input would definitely help here.
  • It is also unclear to me how to append different flags (for example, different targets, as shown in the code review) to different jobs. Can the target option be in the job section? Does the distributor treat all elements in the job lists as inputs whose paths it needs to fix?

If you asked me to design it, I might do something like:

  • The common section has an argument list that is entirely strings, with no input files. All input files and job-specific flags can be appended later, assuming order never matters and we don't rely on overwriting behavior.
  • The job section will contain a specific list of input and output files, whose paths the distributor can fix and whose movement it can handle. It will also come with a list of additionalOption objects that tell the distributor how to construct the remainder of the command line. The file references should just be indexes into the input/output lists.

Also, if equals-style flags like -fthinlto-index= are just a special case, we can always create an alias for that, if it isn't worth the effort to support generally.

Collaborator (Author):

> My comment about OpenAPI was that, if one already exists because the format exists, it would be good to link it. Otherwise, I don't think it is a high-priority task for now.

Thanks for clarifying!

> • For the mixed types, you have to parse by traversing JSON values and checking their types. You can't really write a data type in languages like C++ that maps directly to the JSON object, which is why I say the parsing is a bit hard.

I understand this point, but I feel that expanding the references in C++ code is not taxing. I also don't think that any referencing scheme would lend itself to mapping directly to C++ data types.

> • Maybe a better example would help; I don't see what the last 0 in the array means. Is it ever non-zero? Looking at local.py only makes me more confused. An example with more than one input would definitely help here.

It's just a flexible referencing scheme to try to minimise the likelihood of ever needing to change it.

The expansion rule is specified here: https://github.com/bd1976bris/llvm-project/blob/DTLTO_llvm_only/llvm/docs/DTLTO.rst#command-line-expansion-example.

For a reference like ["additional_inputs", "--", 2, "=", 0, ":", 1] and a corresponding per-job array like:

{
    ...
    "additional_inputs": ["first", "second", "third"]
}

Expansion yields: "--third=first:second"
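For illustration, here is a sketch of that expansion rule as a Python helper (a hypothetical function, assuming per-job entries are plain dicts decoded from the JSON): the first element of a reference names a per-job array; integer elements index into that array, and string elements are literal text.

def expand_reference(ref, job):
    # The first element selects the per-job array; the remaining
    # elements are literals (strings) or indexes into it (integers).
    values = job[ref[0]]
    return "".join(values[e] if isinstance(e, int) else e for e in ref[1:])

# expand_reference(["additional_inputs", "--", 2, "=", 0, ":", 1],
#                  {"additional_inputs": ["first", "second", "third"]})
# returns "--third=first:second"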

I don't have a real-world use case where this is important, though. For the cases I have currently, we only need to index into the first entry of the per-job arrays and to handle the special case of equals-style flags like -fthinlto-index=. I just came up with something that should be able to handle any conceivable option.

I will improve the documentation with a case that uses a non-zero index. That is, assuming we don't change the way references work (see the other parts of this comment).

> • It is also unclear to me how to append different flags (for example, different targets, as shown in the code review) to different jobs. Can the target option be in the job section? Does the distributor treat all elements in the job lists as inputs whose paths it needs to fix?

This isn't currently possible. If it were needed, my plan to support it was just to add an "args" array to the per-job entries, handled in the same manner as the current "args" array in the "common" object.

> If you asked me to design it, I might do something like:
>
> • The common section has an argument list that is entirely strings, with no input files. All input files and job-specific flags can be appended later, assuming order never matters and we don't rely on overwriting behavior.
> • The job section will contain a specific list of input and output files, whose paths the distributor can fix and whose movement it can handle. It will also come with a list of additionalOption objects that tell the distributor how to construct the remainder of the command line. The file references should just be indexes into the input/output lists.
>
> Also, if equals-style flags like -fthinlto-index= are just a special case, we can always create an alias for that, if it isn't worth the effort to support generally.

Looks great 👍 I will rework along these lines. However, before drawing too much of your attention to the JSON format, can I first ask whether you (and other reviewers) are happy with a JSON interface? There are potentially some advantages to a plugin interface. I made a comment on this here: #126654 (comment).

Collaborator (Author):

@cachemeifyoucan, I have explored several different JSON formats with different ways of referencing between filenames and the command line arguments. However, the ideas I tested made the JSON less understandable and the referencing less flexible than the initial JSON format discussed above.

I am now proposing a simpler format that eliminates references for filenames. Instead, it simply lists the input/output filenames and command-line arguments in arrays of strings. While this results in some string duplication, it only increases the JSON file size by 6% (compared to the initial scheme) in my tests on a Clang link. As well as simplicity and readability, an additional benefit is that consumers have less work to do when composing command lines.
With this format, modifying filenames programmatically would be more challenging for a consumer (distributor). We initially theorized this might be necessary for supporting certain distribution systems, but in my testing with SN-DBS, Incredibuild, and Reclient (see the description of #126654), it has not been required. I have no firm evidence that it will be a requirement in the future.
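For illustration, a job description in the simplified format might look something like the following (a sketch only: apart from the "inputs" field mentioned above, the exact field names and layout are assumptions based on this thread, not the final schema). A second job is included to show cross-referencing, with the first job's bitcode listed among the second job's inputs as an import source:

{
  "common": {
    "args": ["/usr/bin/clang", "--target=x86_64-unknown-linux-gnu", "-O2", "-c", "-x", "ir"]
  },
  "jobs": [
    {
      "args": ["t1.bc", "-fthinlto-index=t1.native.o.thinlto.bc", "-o", "t1.native.o"],
      "inputs": ["t1.bc", "t1.native.o.thinlto.bc"],
      "outputs": ["t1.native.o"]
    },
    {
      "args": ["t2.bc", "-fthinlto-index=t2.native.o.thinlto.bc", "-o", "t2.native.o"],
      "inputs": ["t2.bc", "t2.native.o.thinlto.bc", "t1.bc"],
      "outputs": ["t2.native.o"]
    }
  ]
}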

I have updated the implementation and documentation to use this simpler format. @teresajohnson and @MaskRay, I’d appreciate your input on this change. Of course, anyone is welcome to comment, but I’ve directly referenced those who have previously expressed opinions on the JSON format in this PR.

Collaborator:

LGTM. I don't have a strong opinion, and I do not intend to block the review on the JSON format design.

Collaborator (Author):

Thanks @cachemeifyoucan. Your comments on the JSON format were valuable and the new format makes this feature simpler and more approachable. Adding different options to different jobs is now possible as well.

@@ -426,6 +460,7 @@ class LTO {
// The bitcode modules to compile, if specified by the LTO Config.
std::optional<ModuleMapType> ModulesToCompile;
DenseMap<GlobalValue::GUID, StringRef> PrevailingModuleForGUID;
DenseMap<StringRef, std::string> ModuleTriples;
Collaborator:

I was really hoping we would have only one triple for the build, because what does it even mean to cross-import between different triples? It would be good to document why a per-module triple is needed to account for the versions field.

@@ -426,6 +460,7 @@ class LTO {
// The bitcode modules to compile, if specified by the LTO Config.
std::optional<ModuleMapType> ModulesToCompile;
DenseMap<GlobalValue::GUID, StringRef> PrevailingModuleForGUID;
DenseMap<StringRef, std::string> ModuleTriples;
Collaborator:

Also, we usually don't use DenseMap with StringRef keys; maybe StringMap is better.

"thinlto-distributor-arg",
cl::desc("Additional arguments to pass to the ThinLTO distributor"));

cl::opt<std::string> ThinLTORemoteCompiler(
Collaborator:

Are those cl::opt options for debugging, or the real options that users need to pass in? We should avoid -mllvm options in production usage.

Collaborator (Author):

> Are those cl::opt options for debugging, or the real options that users need to pass in? We should avoid -mllvm options in production usage.

The use of -mllvm options avoids the need to implement a similar set of options for DTLTO in each LLD port (and also in llvm-lto2). See: #126654 (comment). I don't have a strong opinion though. Perhaps @MaskRay would like to comment on this.

Contributor:

I think ultimately the goal for options in production usage should be lld options. I'm not sure whether it is more churn to implement them there from the get-go vs. starting with cl::opt and transitioning later. At the least there should be a TODO here to convert these options.

Collaborator (Author):

I'm happy to remove the cl::opt options and instead pass extra state when configuring the new backend. I will implement that in this PR.

Member:

I understand that -mllvm options are considered internal and that for production usage we should recommend proper driver and linker options. It's annoying, though, to update llvm-lto2 and every ELF port...

Collaborator (Author):

Using -mllvm options was a nice reduction in the number of changes required in different tools, which was helpful to reduce the size of the initial PR. However, I think there is agreement now that proper options are required. Note that we don't have to update all the LLD ports at once. We can add DTLTO support to the different LLD ports incrementally. Perhaps we can start with just the ELF LLD port (chosen as it is the LLD port I'm most familiar with)?


#--- t1.ll

target triple = "x86_64-unknown-linux-gnu"
Collaborator:

I prefer all the tests to be written with the .ll suffix, using bitcode comment style, rather than the .test suffix. That gets the correct syntax highlighting, so they are easier to read.

Collaborator (Author):

> I prefer all the tests to be written with the .ll suffix, using bitcode comment style, rather than the .test suffix. That gets the correct syntax highlighting, so they are easier to read.

Sounds good. Although I suppose an alternative point of view is that it's not correct, because these files are not .ll files. I don't mind doing this if you feel strongly about it.

Collaborator:

This is a .ll file, because every line of code that is not a RUN/CHECK line or a comment is LLVM IR assembly. Even when we use split-file, we still name the combined file with the same suffix.

Collaborator (Author):

Ok. I'll do this.

cachemeifyoucan (Collaborator) left a comment:

Generally the approach looks fine to me.

@@ -426,6 +460,7 @@ class LTO {
// The bitcode modules to compile, if specified by the LTO Config.
std::optional<ModuleMapType> ModulesToCompile;
DenseMap<GlobalValue::GUID, StringRef> PrevailingModuleForGUID;
DenseMap<StringRef, std::string> ModuleTriples;
Collaborator:

Single triple from linker sounds good to me.

// distributor program.
// TODO: If this strategy of deriving options proves insufficient, alternative
// approaches should be considered, such as:
// - A serialization/deserialization format for LTO configuration.
Collaborator:

I am not a fan of storing command-line options, especially if you want some bitcode compatibility (-cc1 flags have no compatibility guarantee). That means the object files and future static library support will require rev-locking, which might or might not be what you want.

I think most of the target options should be available in the bitcode as attributes, and for those that are not, we should add more of them into the bitcode. Reconstructing some options from the CodeGen options here is unfortunate, but I can accept that. Hacking other options into LTO code generation was always just a hack and was never officially supported.

inconvertibleErrorCode());
}

for (auto &Job : Jobs) {
Collaborator:

Maybe use the thread pool again here to read the files in parallel?

Also, why specifically pass the AddBuffer function all the way down here, since AddStream does pretty much the same thing?

Collaborator (Author):

> Maybe use the thread pool again here to read the files in parallel?

My preference (but please feel free to disagree) would be to keep the code simple for the initial commit and then add optimizations later. It will be worthwhile eventually to try to load and delete these temporary files in as optimal a manner as possible. The optimizations I have included in this PR are those that require some sort of structural change, which brings us neatly onto...

> Also, why specifically pass the AddBuffer function all the way down here, since AddStream does pretty much the same thing?

AddBuffer should be a more performant way to add a file from disk (versus AddStream). However, it's currently wrapped in the optional cache handling machinery and not exposed. Passing it through causes some turbulence, so I wanted to present this to reviewers to get comments. This hasn't worried any of the reviewers particularly - so I will revert to just using AddStream for the initial commit. The AddBuffer changes can be introduced later with performance analysis to justify them.

Collaborator (Author):

Thanks. Done now.

- Rename *.test LIT tests to *.ll, so that syntax highlighting works.

- Remove `dtlto` argument from llvm-lto2 and instead check whether
  `dtlto-distributor` was specified.

- Rename `distributor_args` to `input_file` in mock.py to be clearer.
The triple is selected within LTO rather than being selected in LLD and passed
into the DTLTO backend. I think this is a good choice because, due to the
changes for multiple symbol tables for ARM64X in COFF LLD (llvm@b068f2f),
it's not quite as trivial for LLD to choose a triple in some cases now. Note
that it's easily doable; I just think it's simpler to have the code in LLVM's
LTO.cpp on balance.

There are also some other minor improvements in this commit:

- Improved parameter/variable names for handling the reserved task range to
  make the logic easier to understand.

- Used correct comment style in *.ll tests.

- Use touch to create dummy files (makes it more obvious that the content is
  unimportant).
This simplifies the implementation. AddBuffer can be reintroduced
later if performance metrics justify modifying the LTO interface.
return Error::success();
}

namespace {
-class InProcessThinBackend : public ThinBackendProc {
+class CGThinBackend : public ThinBackendProc {
Member:

The base class needs a comment

CHECK-NEXT: 0
CHECK-NEXT: ]
CHECK: "--target=x86_64-unknown-linux-gnu"
CHECK: "-O2",
Member:

This looks like "arguments" (https://clang.llvm.org/docs/JSONCompilationDatabase.html). Every argument on a separate line? This might be too wasteful...

The "command" form, while it has some shell-escape complexity, is probably better in both compactness and debuggability.
I sometimes extract a command from CMake-generated compile_commands.json (which uses "command") and invoke it in the shell. The "arguments" form would be less convenient to invoke.

bd1976bris (Collaborator, Author) commented Mar 19, 2025:

> This looks like "arguments" (https://clang.llvm.org/docs/JSONCompilationDatabase.html). Every argument on a separate line? This might be too wasteful...

This test is checking pretty-printed JSON. The JSON generated for the distributors is more compact.

The "command" form, while it has some shell-escape complexity, is probably better in both compactness and debuggability.
I sometimes extract a command from CMake-generated compile_commands.json (which uses "command") and invoke it in the shell. The "arguments" form would be less convenient to invoke.

I understand the point about the advantages of the "command" form. However, to aid consumers of the JSON I strongly want to avoid the complexities of quoting for different potential shells on various platforms. I could add a utility script to consume the JSON and print the shell quoted commands to make it easy to run the individual commands?
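For example, a minimal version of such a utility could look like the following (a sketch, assuming the "common"/"jobs"/"args" shape discussed earlier in this thread):

import json
import shlex
import sys

# Print each backend compilation job as a single shell-quoted command
# line. Note that shlex.join produces POSIX-style quoting; a Windows
# shell would need different escaping.
with open(sys.argv[1]) as f:
    doc = json.load(f)
for job in doc["jobs"]:
    print(shlex.join(doc["common"]["args"] + job["args"]))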

romanova-ekaterina (Contributor):

As mentioned elsewhere on this thread, we have an alternative "no backend" design and implementation.

Please find it here: https://github.com/romanova-ekaterina/llvm-project/pull/new/kromanova/main/integrated-DTLTO-no-backend

Here's a summary:

  1. Out of process (DTLTO) backend is not used in this implementation.
  2. Both regular and thin archive members are supported in a way that's minimally invasive to LTO. This is in contrast to the approach whereby a new backend is introduced, since archive support for DTLTO cannot be implemented within a backend. Support for (non-thin) archives would require changes to the linker or LTO.cpp.
  3. Only COFF and ELF formats are supported in the initial commit. Support for all other formats is implemented, but it will be submitted later.
  4. Footprint on lld and llvm/lto is very small.
  5. Plugins (shared libraries) are used in this implementation rather than external processes/python scripts.
  6. Plugins allow direct calls and two-way communication between LLVM and the DBS, while external processes require parameter serialization and thus allow only one-way communication with the DBS. Plugins allow passing memory buffers to the DBS client and getting output back from the DBS in memory (which saves two file writes and two file reads). Note: we have a prototype implemented for the distcc distribution system that allows a direct memory-to-memory interface, which we can share on request. A direct memory-to-memory interface could not be implemented in a one-way external-process interface.
  7. Throughput is one of the two main factors that contribute to distribution build system efficiency; load balancing is the other. The plugin interface could be extended to integrate more closely with the DBS, allowing more advanced things like querying the internal state of the DBS to improve load balancing.

If there are aspects that people prefer about this approach, we can consider bringing them in to this PR. Or if the "no backend" design is a preferred starting point, we can turn it into a Pull Request and take across parts of this one.

The documentation has been updated for the new format and I have
addressed any outstanding documentation comments.
Users should not rely on the temporary file naming scheme which
may be subject to change.
bd1976bris (Collaborator, Author):

Thanks for the review comments so far. This PR has now been updated to use a simpler JSON format. I have had to rework the documentation for that simpler format, and in the process I have addressed all the documentation suggestions that were made either here or on the design PR. The documentation, and indeed the implementation, are much simplified, and I believe all outstanding review comments are now addressed. Please take another look.

MaskRay (Member) left a comment:

This looks reasonable to me. The LTO part might need @teresajohnson 's approval.

teresajohnson (Contributor):

> This looks reasonable to me. The LTO part might need @teresajohnson's approval.

I'm OOO but will wrap up the review ASAP, probably early next week.

kbelochapka (Collaborator):

> This looks reasonable to me. The LTO part might need @teresajohnson's approval.

@teresajohnson , @MaskRay , @cachemeifyoucan ,

Could you please look at the alternative implementation of DTLTO (with no DTLTO backend and with plugins) and compare it with the DTLTO implementation described in this PR (with out-of-process DTLTO backend and external processes/python scripts) before approving this PR?

Here is the link to this branch:
https://github.com/romanova-ekaterina/llvm-project/pull/new/kromanova/main/integrated-DTLTO-no-backend

It's a fully working DTLTO implementation, although it has been minimized to make it suitable for a code review. We spent considerable time working on this project, but we haven't received any comments/feedback about it. The link to this branch was posted as a comment in this PR, but it might have been lost among the hundreds of other comments.

Katya Romanova (@romanova-ekaterina) already described the main ideas behind the alternative implementation, and the differences between the two implementations, in one of the comments on this PR, but I will reiterate them here:

(1) Very little footprint on the lld and llvm sources.
(2) Easy to use; only one command-line option to control: –distribute=
(3) Both regular and thin archive members are fully supported. This contrasts with the DTLTO backend solution, since archive support for DTLTO cannot be implemented within an LTO backend.
(4) Only COFF and ELF formats are supported in the initial commit. Support for all other formats is implemented, but it will be submitted later.
(5) Plugins (shared libraries) are used in this implementation rather than external processes/Python scripts. A plugin that supports external scripts was implemented but was not included in the initial review. Plugins can be statically linked into an lld executable. We have several plugins implemented that support the most popular distributed build systems: DistCC, IceCream, FastBuild, Incredibuild, Re-Client, HomeCC.
(6) Plugins allow direct calls and two-way communication between LLVM and the DBS, while external processes require parameter serialization and allow only one-way communication. For example, one of the advantages of using plugins is the ability to pass memory buffers to the DBS client and get output back from the DBS in memory (which saves two file writes and two file reads). Note: we have a plugin prototype implemented for the DistCC distribution system that implements a direct memory-to-memory interface with the DistCC client (we can share it on request). A direct memory-to-memory interface could not be implemented in a one-way external-process interface.
(7) Throughput is one of the two main factors that contribute to distribution build system efficiency; load balancing is the other. The plugin interface could be integrated more closely with a DBS, allowing more advanced compilation-job scheduling based on the current distributed build system state, which in turn would improve load balancing.

teresajohnson (Contributor):

> Could you please look at the alternative implementation of DTLTO (with no DTLTO backend and with plugins) and compare it with the DTLTO implementation described in this PR (with out-of-process DTLTO backend and external processes/python scripts) before approving this PR? […]

This approach does simplify the lld and LTO side, but I'd like to get thoughts from @MaskRay, as this changes the interfacing with lld. The downside is that it then also requires support for a larger new DTLTO library (it looks like the original patch only contains a Local plugin; does each of the other distributed build systems need to replicate a lot of that functionality?).

tru (Collaborator) commented Apr 9, 2025:

I just wanted to add my comment here. I have been working with @bd1976bris to try out these changes with FastBuild for our internal build infrastructure and I am happy to report that I have a PoC working based on this and I was recently able to get distribution to work. I found the interface easy to work with and was able to get my PoC to work pretty quickly.

I am happy for this PR to move forward at this point. We probably need some changes in order to be more cache friendly, but I think that can come later.

romanova-ekaterina (Contributor) commented Apr 10, 2025:

Hi Teresa,

> This approach does simplify the lld and LTO side, but I'd like to get thoughts from @MaskRay, as this changes the interfacing with lld. The downside is that it then also requires support for a larger new DTLTO library (it looks like the original patch only contains a Local plugin; does each of the other distributed build systems need to replicate a lot of that functionality?).

No, replication of the functionality will not be needed for plugins in general. All common functionality will be in a DTLTO static library (DTLTO.lib), which will be statically linked with lld.

Plugins do not need to be linked with the DTLTO static library. Plugins could be implemented in any programming language that allows creating shared libraries.

romanova-ekaterina (Contributor):

> I just wanted to add my comment here. I have been working with @bd1976bris to try out these changes with FastBuild for our internal build infrastructure and I am happy to report that I have a PoC working based on this and I was recently able to get distribution to work. I found the interface easy to work with and was able to get my PoC to work pretty quickly.

One thing might be worth mentioning about the two different solutions. The first solution has a DTLTO backend, which generates a JSON file and passes it to an external process (e.g., a Python script). The second solution doesn't have a DTLTO backend; it's basically a tiny layer between the linker and a DBS-specific plugin, which calls the plugin's exported functions.

However, we could have a hybrid (mix-and-match) solution: e.g., a "no DTLTO backend" design that invokes an external process.

bd1976bris (Collaborator, Author):

> I just wanted to add my comment here. I have been working with @bd1976bris to try out these changes with FastBuild for our internal build infrastructure and I am happy to report that I have a PoC working based on this and I was recently able to get distribution to work. I found the interface easy to work with and was able to get my PoC to work pretty quickly.
>
> I am happy for this PR to move forward at this point. We probably need some changes in order to be more cache friendly, but I think that can come later.

While working with Tobias, I back-ported this PR to LLVM 19 and wired up the Clang and LLD side changes.

In case it helps anyone else, that can be found here: https://github.com/bd1976bris/llvm-project/commits/dtlto_llvm19/.

cachemeifyoucan (Collaborator):

I don't have a strong opinion on either implementation. If I have to pick one, I like this PR better. I feel like the alternative would require keeping more implementation details of the build systems in the LLVM tree. That is not a problem in itself, but I don't know how stable those APIs and implementation details are, so it might be a problem for integration.

On the other hand, the JSON interface in this PR is very simple and stable (maybe we should add a version field just to be safe). I don't think the LTO interface is bloated after the change, and it is not stable anyway. Thus I slightly prefer this patch.

romanova-ekaterina (Contributor):

> I don't have a strong opinion on either implementation. If I have to pick one, I like this PR better. I feel like the alternative would require keeping more implementation details of the build systems in the LLVM tree. That is not a problem in itself, but I don't know how stable those APIs and implementation details are, so it might be a problem for integration.
>
> On the other hand, the JSON interface in this PR is very simple and stable (maybe we should add a version field just to be safe). I don't think the LTO interface is bloated after the change, and it is not stable anyway. Thus I slightly prefer this patch.

Actually, neither of the solutions keeps implementation details of the distribution build systems in the LLVM tree.

The plugins or external processes will either be in a separate GitHub repository or will be placed in a repository for particular distribution build systems (e.g. the DTLTO-related plugin for distcc will be added to the distcc repository). The location will be discussed with upstream and decided.

Also, it's worth noting that both interfaces could be supported in DTLTO at the same time: the external-process interface for those who prefer simplicity, and the plugin interface for those who need performance. To understand the performance advantage that the plugin interface could provide, please refer to items (5) and (6) in the comment that kbelochapka provided:
#127749 (comment)

Or we could have a hybrid (mix-and-match) solution: e.g., a "no DTLTO backend" design (from the alternative solution) that invokes an external process (from the main solution), or vice versa.

The main decision we are looking for reviewers' input on is whether they prefer having the DTLTO code in a ThinLTO backend (main solution) or not in a backend (alternative solution). To understand the main differences, please refer to items (1), (2), and (3) in the comment that kbelochapka provided:
#127749 (comment)

bd1976bris (Collaborator, Author):

I would like to summarize the state of this PR as I understand it. Firstly, some reviewers have indicated that this PR is acceptable in its current state. However, there is an alternative proposal: https://github.com/romanova-ekaterina/llvm-project/pull/new/kromanova/main/integrated-DTLTO-no-backend. We somehow need to come to a conclusion, and I think an appeal to authority (fallacy or not) is the way :)

@MaskRay – can I please ask you to reply to @teresajohnson's comment: #127749 (comment). @tru has created a distributor that works with his organization's fork of FASTBuild, using the JSON interface. If you prefer the no-backend approach, then the JSON interface from this PR could be merged into that approach.

Thanks.

bd1976bris (Collaborator, Author):

ping.

9 participants