Skip to content

Check for large runners #15471

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

Check for large runners #15471

wants to merge 1 commit into from

Conversation

jsoref
Copy link
Contributor

@jsoref jsoref commented Jan 30, 2024

fixes #15464

Variable Runner to use if set to 1
use_large_linux_runners_for_speed ubuntu-latest-xl
use_large_mac_runners_for_speed macos-12-xl
use_large_win_runners_for_speed windows-latest-xl

For simplicity, I'm copy-pasting the entire block to each workflow that needs it, this means that updating the rules can be done w/ a search+replace instead of worrying about things getting out of sync. The cost of checking each of them is minimal (especially vs the cost of running on a slow runner or waiting forever for a runner that will never happen).

@jsoref jsoref requested review from a team as code owners January 30, 2024 14:06
@jsoref
Copy link
Contributor Author

jsoref commented Jan 30, 2024

Technically, we could do this:

diff --git a/.github/workflows/compile-queries.yml b/.github/workflows/compile-queries.yml
index 430807c95d..ddb3aa794f 100644
--- a/.github/workflows/compile-queries.yml
+++ b/.github/workflows/compile-queries.yml
@@ -12,19 +12,11 @@ jobs:
   choose-runner:
     runs-on: ubuntu-latest
     outputs:
-      mac-12: ${{ steps.results.outputs.MAC_12 }}
-      ubuntu-latest: ${{ steps.results.outputs.UBUNTU_LATEST }}
-      windows-latest: ${{ steps.results.outputs.WINDOWS_LATEST }}
+      mac-12: ${{ vars.use-large-mac-runners-for-speed && 'macos-12-xl' || 'macos-12' }}
+      ubuntu-latest: ${{ vars.use-large-linux-runners-for-speed && 'ubuntu-latest-xl' || 'ubuntu-latest' }}
+      windows-latest: ${{ vars.use-large-win-runners-for-speed && 'windows-latest-xl' || 'windows-latest' }}
     steps:
-      - id: results
-        run: |
-          echo "UBUNTU_LATEST=$UBUNTU_LATEST" >> $GITHUB_OUTPUT
-          echo "MAC_12=$MAC_12" >> $GITHUB_OUTPUT
-          echo "WINDOWS_LATEST=$WINDOWS_LATEST" >> $GITHUB_OUTPUT
-        env:
-          UBUNTU_LATEST: ${{ vars.use-large-linux-runners-for-speed && 'ubuntu-latest-xl' || 'ubuntu-latest' }}
-          MAC_12: ${{ vars.use-large-mac-runners-for-speed && 'macos-12-xl' || 'macos-12' }}
-          WINDOWS_LATEST: ${{ vars.use-large-win-runners-for-speed && 'windows-latest-xl' || 'windows-latest' }}
+      - run: true
 
   compile-queries:
     needs: choose-runner

The downside is that then which runner is selected is absolutely opaque because outputs aren't logged anywhere and nothing in a job run explicitly shows the viewer the runs-on: value.

If a non-standard runner is selected, there actually is some information indicating it, but for people who don't use non-standard runners it might not be obvious:
https://github.com/github/codeql/actions/runs/7712614249/job/21020545208#step:1:3)91dc77d9

Runner name: 'ubuntu-latest-xl_ee83'
Runner group name: 'larger-hosted-public-runners'

(We used to use self-hosted runners a lot not many months ago and I've apparently quickly forgotten these things,, and even though I interact with this and other repositories occasionally, I clearly haven't remembered this detail.)

@jsoref jsoref force-pushed the choose-runner branch 2 times, most recently from ab54657 to a666577 Compare January 31, 2024 10:35
@jsoref
Copy link
Contributor Author

jsoref commented Jan 31, 2024

@angelapwen I don't suppose you could get someone to look over this one as well?

With this (and the other), I have two failures, but they're things I can't see a way to address:
https://github.com/check-spelling-sandbox/codeql/actions/runs/7724725867/job/21057437573#step:14:173

In file included from external/binlog/include/binlog/PrettyPrinter.cpp:1:
In file included from external/binlog/include/binlog/PrettyPrinter.hpp:6:
In file included from external/binlog/include/binlog/Time.hpp:6:
/usr/lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/chrono:2320:48: error: call to consteval function 'std::chrono::hh_mm_ss::_S_fractional_width' is not a constant expression
        static constexpr unsigned fractional_width = {_S_fractional_width()};

https://github.com/check-spelling-sandbox/codeql/actions/runs/7724725863/job/21057437161

Run QL tests 46s
Run codeql test run --threads=0 --ram 50000 --search-path "/home/runner/work/codeql/codeql/ruby/extractor-pack" --check-databases --check-undefined-labels --check-unused-labels --check-repeated-labels --check-redefined-labels --check-use-before-definition --consistency-queries ql/consistency-queries ql/test --compilation-cache "/home/runner/work/_temp/compilation-dir"
Executing 1145 tests in 121 directories.
Extracting test database in /home/runner/work/codeql/codeql/ruby/ql/test/library-tests/frameworks/mysql2.
Extracting test database in /home/runner/work/codeql/codeql/ruby/ql/test/library-tests/frameworks/json.
Extracting test database in /home/runner/work/codeql/codeql/ruby/ql/test/library-tests/frameworks/graphql.
Extracting test database in /home/runner/work/codeql/codeql/ruby/ql/test/library-tests/frameworks/http_clients.
Compiling queries in /home/runner/work/codeql/codeql/ruby/ql/test/library-tests/frameworks/json.
Compiling queries in /home/runner/work/codeql/codeql/ruby/ql/test/library-tests/frameworks/mysql2.
Compiling queries in /home/runner/work/codeql/codeql/ruby/ql/test/library-tests/frameworks/http_clients.
Compiling queries in /home/runner/work/codeql/codeql/ruby/ql/test/library-tests/frameworks/graphql.
[1/1145 eval 605ms] PASSED /home/runner/work/codeql/codeql/ruby/ql/test/library-tests/frameworks/mysql2/DB-CHECK
[2/1145 eval 647ms] PASSED /home/runner/work/codeql/codeql/ruby/ql/test/library-tests/frameworks/json/DB-CHECK
[3/1145 eval 502ms] PASSED /home/runner/work/codeql/codeql/ruby/ql/test/library-tests/frameworks/graphql/DB-CHECK
[4/1145 eval 589ms] PASSED /home/runner/work/codeql/codeql/ruby/ql/test/library-tests/frameworks/http_clients/DB-CHECK
Killed
Error: The operation was canceled.

This workflow (Ruby: Run QL Tests) seems to consistently die. I don't understand why.

https://github.com/check-spelling-sandbox/codeql/actions/runs/7724725861

qltest (1/2)
The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
qltest (1/2)
The operation was canceled.
qltest (2/2)
The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
qltest (2/2)
The operation was canceled.

My guess is that this one involves exceeding some resource allocations and being killed by a manager. But I can't tell (it's consistent, I've tried rerunning it).

The only outstanding one is: compile-queries which has been running for 4 hours (it's possible it will not fit in the 6 hour window).

If at all possible, it'd be great if this could be run using a matrix to get some parallel runners -- forks aren't likely to pay for large runners, but they can generally use a matrix of additional runners (my org does hit limits from some really rude repositories, but usually I'm only dealing w/ one or two of those repositories at a time).

@aibaars
Copy link
Contributor

aibaars commented Jan 31, 2024

This workflow (Ruby: Run QL Tests) seems to consistently die. I don't understand why.

The --ram 50000 may be to blame ;-) A small runner does not have that much memory.

@jsoref
Copy link
Contributor Author

jsoref commented Jan 31, 2024

I've seen some workflows that actively shoot a bunch of ancillary services in order to free memory :).

Copy link
Contributor

@oscarsj oscarsj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution! We understand the problem and agree that the current experience might not be optimal. We appreciate your efforts to improve the developer experience for CodeQL contributors.
We aim to avoid adding complexity and maintenance to this repo as much as possible. Therefore, we suggest changing the approach so the original repo does not need to add any variables to continue working as usual. We feel that accepting this PR as-is will add a (admittedly small, but >0) maintenance burden to this repo, which might set a precedent that could create maintenance issues on our end in the long run.
We suggest to explore the possibility of approaching this in a way that the extra maintenance, if any, falls on the side of the forkers, not the original repo.

@jsoref jsoref requested a review from oscarsj February 7, 2024 17:05
@jsoref jsoref mentioned this pull request Feb 9, 2024
Copy link
Contributor

@oscarsj oscarsj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, thanks for your time reviewing this and addressing our concerns. This approach removes some the maintenance burden we want to avoid, that's great 👍🏾
Still, our internal tests are showing that actually this approach will not fix the DevExp for forkers, as using non-large runners to run those workflows result in OOM kills or timeouts most of the times. We appreciate it can have the opposite results: confusing forkers with ❌ tests that actually are not broken, just run on inadequate runners.
To move forward, we suggest to consider disabling entirely those workflows for non github owned forks, using something like if: ${{ github.repository_owner == 'github' }}. That would mitigate the original issue of leaving workflows pending in non-github forks, and as a secondary (but much sought) benefit, the code will be simpler.

@jsoref
Copy link
Contributor Author

jsoref commented Feb 9, 2024

I have a fix for the oom thing in #15564, and I haven't hit timeouts...

Variable | Runner to use if set to `1`
-|-
`use_large_linux_runners_for_speed` | `ubuntu-latest-xl`
`use_large_mac_runners_for_speed` | `macos-12-xl`
`use_large_win_runners_for_speed` | `windows-latest-xl`
@oscarsj
Copy link
Contributor

oscarsj commented Feb 12, 2024

I have a fix for the oom thing in #15564, and I haven't hit timeouts...

Right, there are not timeouts, that's incorrect on my part. Our tests show checks taking almost 3 hours to run with small runners, and some OOMs, not timeouts.
Let's see if how #15564 goes, but our appreciation at this point is that the complexity that PR and this one add to the CI system is high compared to the gains in devexp forkers will get by them, with slow tests and resource usage. We think it would be a simpler, more maintainable approach, to just disable those tests in external forks, and have them run on GitHub's infrastructure at PR time.

@smowton
Copy link
Contributor

smowton commented Feb 12, 2024

@oscarsj note the author has opened #15584 with an alternative solution

@jsoref
Copy link
Contributor Author

jsoref commented Feb 12, 2024

The resource usage is definitely a thing. The fact that personal repositories have 0 monitoring ability is also a thing (but GitHub should just fix that) -- I'm doing all the work for this PR in an org so that I can at least see things.

I personally like being able to test things in my forks before I make public facing PRs to upstream repositories -- I'm human, and I make mistakes, but I prefer to not show my mistakes too publicly if I can avoid doing so.

That said, it would be possible for someone like me to change the if: and the runs-on: fields together if they knew they wanted to test a change to one of these workflows.

(But definitely, having the auto-tune change integrated would make things much less frustrating, as that side is quite surprising and it was not particularly obvious how to fix it.)

@oscarsj
Copy link
Contributor

oscarsj commented Feb 12, 2024

The resource usage is definitely a thing. The fact that personal repositories have 0 monitoring ability is also a thing (but GitHub should just fix that) -- I'm doing all the work for this PR in an org so that I can at least see things.

I personally like being able to test things in my forks before I make public facing PRs to upstream repositories -- I'm human, and I make mistakes, but I prefer to not show my mistakes too publicly if I can avoid doing so.

That said, it would be possible for someone like me to change the if: and the runs-on: fields together if they knew they wanted to test a change to one of these workflows.

(But definitely, having the auto-tune change integrated would make things much less frustrating, as that side is quite surprising and it was not particularly obvious how to fix it.)

Agreed! I'd suggest then we close this PR in favor of #15584 - we can always revisit this decision in the future if any of the mentioned premises changes.

@jsoref jsoref closed this Feb 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Workflows get stuck on forks
4 participants