Description
Related August 29th CI event
On August 29th, around 3:59 Pacific Daylight Time, our CI started to fail because runners did not have enough storage available. It still merged a few PRs, but about 10 hours later it merged the final PR it would merge that day: 0d63418
It continued to fail for hours. The spotty CI passes were probably due to GitHub initiating a rollout to their fleet that took 12 hours to reach complete global saturation. With that rollout, GitHub reduced the storage actually offered to runners down to levels that closely reflect their service agreement. See actions/runner-images#10511 for more on that.
Eventually, I landed #129797, which seemed to get CI going again.
Do we take up too much space?
Our storage usage has grown over time, arguably to concerning levels. Yes, a lot of it compresses well for transfer, but I'm talking about peak storage occupancy here. And tarballs are not a format that is conducive to accessing individual files, so in practice the relevant data occupies hosts in its full, uncompressed glory anyway. We also generate quite a lot of build intermediates. Big ones. Some of this is unavoidable, but we should consider investigating ways to reduce the storage occupancy of the toolchain and its build intermediates.
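To make "peak occupancy" concrete, here is a minimal sketch (standard-library Rust only) that compares a dist tarball's compressed size against the total size of the tree it unpacks to. The tarball name and the unpack directory are placeholders I made up, not actual artifact names:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Recursively sum the sizes of all files under `path` (symlinks are not followed).
fn dir_size(path: &Path) -> io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(path)? {
        let entry = entry?;
        let meta = entry.metadata()?;
        total += if meta.is_dir() {
            dir_size(&entry.path())?
        } else {
            meta.len()
        };
    }
    Ok(total)
}

fn main() -> io::Result<()> {
    // Placeholder names: point these at a real dist tarball and the
    // directory it was extracted into.
    let tarball = Path::new("rust-nightly-x86_64-unknown-linux-gnu.tar.xz");
    let unpacked = Path::new("rust-nightly-x86_64-unknown-linux-gnu");

    let compressed = fs::metadata(tarball)?.len() as f64 / (1 << 20) as f64;
    let on_disk = dir_size(unpacked)? as f64 / (1 << 20) as f64;
    println!("compressed: {compressed:8.1} MiB");
    println!("on disk:    {on_disk:8.1} MiB ({:.1}x larger)", on_disk / compressed);
    Ok(())
}
```

The number a runner actually has to care about is the second one, plus whatever the tarball itself occupies while both exist on disk.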
Besides, we are having trouble keeping our storage usage under the amount available to CI, even if other events like this one aggravate the problem. Obviously, clearing CI storage space works as a dirty hack to get things running again, but changes that benefit the entire ecosystem are more desirable. However, note that a solution that reduces storage but significantly increases the number of filesystem accesses, especially during compiler or tool builds, is likely to make our CI problems worse due to this fun little issue:
I'm opening this issue as a question, effectively: We track how much time the compiler costs, but what about space? Where are we tracking things like total doc size (possibly broken down into libstd doc size and so on)? Are we aware of how much space is used by incremental compilation or other intermediates, and how that changes between versions? How about how many crater subjobs run out of space in each beta run? Where would someone find this information?
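To sketch one shape such tracking could take: the same recursive size helper as above, pointed at a few interesting subdirectories of a build tree, printing lines a CI job could append to a log and diff across runs or releases. The paths and target triple below are assumptions for illustration, not verified locations in the bootstrap layout:

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Recursively sum the sizes of all files under `path` (symlinks are not followed).
fn dir_size(path: &Path) -> io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(path)? {
        let entry = entry?;
        let meta = entry.metadata()?;
        total += if meta.is_dir() {
            dir_size(&entry.path())?
        } else {
            meta.len()
        };
    }
    Ok(total)
}

fn main() {
    // Illustrative categories only; the paths would need to match whatever
    // the build actually produces on the runner.
    let categories = [
        ("std docs", "build/x86_64-unknown-linux-gnu/doc/std"),
        ("all docs", "build/x86_64-unknown-linux-gnu/doc"),
        ("stage1 sysroot", "build/x86_64-unknown-linux-gnu/stage1"),
        ("stage1 rustc intermediates", "build/x86_64-unknown-linux-gnu/stage1-rustc"),
    ];
    for (label, path) in categories {
        match dir_size(Path::new(path)) {
            // CSV-ish output that could be archived per CI run and compared
            // between versions, much like the perf numbers we already track.
            Ok(bytes) => println!("{label},{} MiB", bytes / (1 << 20)),
            Err(err) => println!("{label},unavailable ({err})"),
        }
    }
}
```

Even a crude report like this, captured once per run, would tell us whether any of these categories is trending in the wrong direction.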