Check for large runners #15471

jsoref · 2024-01-30T14:06:04Z

Variable	Runner to use if set to `1`
`use_large_linux_runners_for_speed`	`ubuntu-latest-xl`
`use_large_mac_runners_for_speed`	`macos-12-xl`
`use_large_win_runners_for_speed`	`windows-latest-xl`

For simplicity, I'm copy-pasting the entire block to each workflow that needs it, this means that updating the rules can be done w/ a search+replace instead of worrying about things getting out of sync. The cost of checking each of them is minimal (especially vs the cost of running on a slow runner or waiting forever for a runner that will never happen).

jsoref · 2024-01-30T14:32:00Z

Technically, we could do this:

diff --git a/.github/workflows/compile-queries.yml b/.github/workflows/compile-queries.yml
index 430807c95d..ddb3aa794f 100644
--- a/.github/workflows/compile-queries.yml
+++ b/.github/workflows/compile-queries.yml
@@ -12,19 +12,11 @@ jobs:
   choose-runner:
     runs-on: ubuntu-latest
     outputs:
-      mac-12: ${{ steps.results.outputs.MAC_12 }}
-      ubuntu-latest: ${{ steps.results.outputs.UBUNTU_LATEST }}
-      windows-latest: ${{ steps.results.outputs.WINDOWS_LATEST }}
+      mac-12: ${{ vars.use-large-mac-runners-for-speed && 'macos-12-xl' || 'macos-12' }}
+      ubuntu-latest: ${{ vars.use-large-linux-runners-for-speed && 'ubuntu-latest-xl' || 'ubuntu-latest' }}
+      windows-latest: ${{ vars.use-large-win-runners-for-speed && 'windows-latest-xl' || 'windows-latest' }}
     steps:
-      - id: results
-        run: |
-          echo "UBUNTU_LATEST=$UBUNTU_LATEST" >> $GITHUB_OUTPUT
-          echo "MAC_12=$MAC_12" >> $GITHUB_OUTPUT
-          echo "WINDOWS_LATEST=$WINDOWS_LATEST" >> $GITHUB_OUTPUT
-        env:
-          UBUNTU_LATEST: ${{ vars.use-large-linux-runners-for-speed && 'ubuntu-latest-xl' || 'ubuntu-latest' }}
-          MAC_12: ${{ vars.use-large-mac-runners-for-speed && 'macos-12-xl' || 'macos-12' }}
-          WINDOWS_LATEST: ${{ vars.use-large-win-runners-for-speed && 'windows-latest-xl' || 'windows-latest' }}
+      - run: true
 
   compile-queries:
     needs: choose-runner

The downside is that then which runner is selected is absolutely opaque because outputs aren't logged anywhere and nothing in a job run explicitly shows the viewer the runs-on: value.

If a non-standard runner is selected, there actually is some information indicating it, but for people who don't use non-standard runners it might not be obvious:
https://github.com/github/codeql/actions/runs/7712614249/job/21020545208#step:1:3)91dc77d9

Runner name: 'ubuntu-latest-xl_ee83'
Runner group name: 'larger-hosted-public-runners'

(We used to use self-hosted runners a lot not many months ago and I've apparently quickly forgotten these things,, and even though I interact with this and other repositories occasionally, I clearly haven't remembered this detail.)

jsoref · 2024-01-31T14:53:52Z

@angelapwen I don't suppose you could get someone to look over this one as well?

With this (and the other), I have two failures, but they're things I can't see a way to address:
https://github.com/check-spelling-sandbox/codeql/actions/runs/7724725867/job/21057437573#step:14:173

In file included from external/binlog/include/binlog/PrettyPrinter.cpp:1:
In file included from external/binlog/include/binlog/PrettyPrinter.hpp:6:
In file included from external/binlog/include/binlog/Time.hpp:6:
/usr/lib/gcc/x86_64-linux-gnu/13/../../../../include/c++/13/chrono:2320:48: error: call to consteval function 'std::chrono::hh_mm_ss::_S_fractional_width' is not a constant expression
        static constexpr unsigned fractional_width = {_S_fractional_width()};

https://github.com/check-spelling-sandbox/codeql/actions/runs/7724725863/job/21057437161

Run QL tests 46s
Run codeql test run --threads=0 --ram 50000 --search-path "/home/runner/work/codeql/codeql/ruby/extractor-pack" --check-databases --check-undefined-labels --check-unused-labels --check-repeated-labels --check-redefined-labels --check-use-before-definition --consistency-queries ql/consistency-queries ql/test --compilation-cache "/home/runner/work/_temp/compilation-dir"
Executing 1145 tests in 121 directories.
Extracting test database in /home/runner/work/codeql/codeql/ruby/ql/test/library-tests/frameworks/mysql2.
Extracting test database in /home/runner/work/codeql/codeql/ruby/ql/test/library-tests/frameworks/json.
Extracting test database in /home/runner/work/codeql/codeql/ruby/ql/test/library-tests/frameworks/graphql.
Extracting test database in /home/runner/work/codeql/codeql/ruby/ql/test/library-tests/frameworks/http_clients.
Compiling queries in /home/runner/work/codeql/codeql/ruby/ql/test/library-tests/frameworks/json.
Compiling queries in /home/runner/work/codeql/codeql/ruby/ql/test/library-tests/frameworks/mysql2.
Compiling queries in /home/runner/work/codeql/codeql/ruby/ql/test/library-tests/frameworks/http_clients.
Compiling queries in /home/runner/work/codeql/codeql/ruby/ql/test/library-tests/frameworks/graphql.
[1/1145 eval 605ms] PASSED /home/runner/work/codeql/codeql/ruby/ql/test/library-tests/frameworks/mysql2/DB-CHECK
[2/1145 eval 647ms] PASSED /home/runner/work/codeql/codeql/ruby/ql/test/library-tests/frameworks/json/DB-CHECK
[3/1145 eval 502ms] PASSED /home/runner/work/codeql/codeql/ruby/ql/test/library-tests/frameworks/graphql/DB-CHECK
[4/1145 eval 589ms] PASSED /home/runner/work/codeql/codeql/ruby/ql/test/library-tests/frameworks/http_clients/DB-CHECK
Killed
Error: The operation was canceled.

This workflow (Ruby: Run QL Tests) seems to consistently die. I don't understand why.

https://github.com/check-spelling-sandbox/codeql/actions/runs/7724725861

qltest (1/2)
The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
qltest (1/2)
The operation was canceled.
qltest (2/2)
The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.
qltest (2/2)
The operation was canceled.

My guess is that this one involves exceeding some resource allocations and being killed by a manager. But I can't tell (it's consistent, I've tried rerunning it).

The only outstanding one is: compile-queries which has been running for 4 hours (it's possible it will not fit in the 6 hour window).

If at all possible, it'd be great if this could be run using a matrix to get some parallel runners -- forks aren't likely to pay for large runners, but they can generally use a matrix of additional runners (my org does hit limits from some really rude repositories, but usually I'm only dealing w/ one or two of those repositories at a time).

aibaars · 2024-01-31T15:01:55Z

This workflow (Ruby: Run QL Tests) seems to consistently die. I don't understand why.

The --ram 50000 may be to blame ;-) A small runner does not have that much memory.

jsoref · 2024-01-31T16:07:30Z

I've seen some workflows that actively shoot a bunch of ancillary services in order to free memory :).

oscarsj

Thanks for your contribution! We understand the problem and agree that the current experience might not be optimal. We appreciate your efforts to improve the developer experience for CodeQL contributors.
We aim to avoid adding complexity and maintenance to this repo as much as possible. Therefore, we suggest changing the approach so the original repo does not need to add any variables to continue working as usual. We feel that accepting this PR as-is will add a (admittedly small, but >0) maintenance burden to this repo, which might set a precedent that could create maintenance issues on our end in the long run.
We suggest to explore the possibility of approaching this in a way that the extra maintenance, if any, falls on the side of the forkers, not the original repo.

oscarsj

Again, thanks for your time reviewing this and addressing our concerns. This approach removes some the maintenance burden we want to avoid, that's great 👍🏾
Still, our internal tests are showing that actually this approach will not fix the DevExp for forkers, as using non-large runners to run those workflows result in OOM kills or timeouts most of the times. We appreciate it can have the opposite results: confusing forkers with ❌ tests that actually are not broken, just run on inadequate runners.
To move forward, we suggest to consider disabling entirely those workflows for non github owned forks, using something like if: ${{ github.repository_owner == 'github' }}. That would mitigate the original issue of leaving workflows pending in non-github forks, and as a secondary (but much sought) benefit, the code will be simpler.

jsoref · 2024-02-09T17:34:23Z

I have a fix for the oom thing in #15564, and I haven't hit timeouts...

oscarsj · 2024-02-12T09:24:11Z

I have a fix for the oom thing in #15564, and I haven't hit timeouts...

Right, there are not timeouts, that's incorrect on my part. Our tests show checks taking almost 3 hours to run with small runners, and some OOMs, not timeouts.
Let's see if how #15564 goes, but our appreciation at this point is that the complexity that PR and this one add to the CI system is high compared to the gains in devexp forkers will get by them, with slow tests and resource usage. We think it would be a simpler, more maintainable approach, to just disable those tests in external forks, and have them run on GitHub's infrastructure at PR time.

smowton · 2024-02-12T10:05:18Z

@oscarsj note the author has opened #15584 with an alternative solution

jsoref · 2024-02-12T10:16:18Z

The resource usage is definitely a thing. The fact that personal repositories have 0 monitoring ability is also a thing (but GitHub should just fix that) -- I'm doing all the work for this PR in an org so that I can at least see things.

I personally like being able to test things in my forks before I make public facing PRs to upstream repositories -- I'm human, and I make mistakes, but I prefer to not show my mistakes too publicly if I can avoid doing so.

That said, it would be possible for someone like me to change the if: and the runs-on: fields together if they knew they wanted to test a change to one of these workflows.

(But definitely, having the auto-tune change integrated would make things much less frustrating, as that side is quite surprising and it was not particularly obvious how to fix it.)

oscarsj · 2024-02-12T11:32:34Z

The resource usage is definitely a thing. The fact that personal repositories have 0 monitoring ability is also a thing (but GitHub should just fix that) -- I'm doing all the work for this PR in an org so that I can at least see things.

I personally like being able to test things in my forks before I make public facing PRs to upstream repositories -- I'm human, and I make mistakes, but I prefer to not show my mistakes too publicly if I can avoid doing so.

That said, it would be possible for someone like me to change the if: and the runs-on: fields together if they knew they wanted to test a change to one of these workflows.

(But definitely, having the auto-tune change integrated would make things much less frustrating, as that side is quite surprising and it was not particularly obvious how to fix it.)

Agreed! I'd suggest then we close this PR in favor of #15584 - we can always revisit this decision in the future if any of the mentioned premises changes.

jsoref requested review from a team as code owners January 30, 2024 14:06

github-actions bot added the QL-for-QL label Jan 30, 2024

jsoref force-pushed the choose-runner branch 2 times, most recently from ab54657 to a666577 Compare January 31, 2024 10:35

oscarsj requested changes Feb 7, 2024

View reviewed changes

jsoref force-pushed the choose-runner branch from a666577 to e1c1ef8 Compare February 7, 2024 17:04

jsoref requested a review from oscarsj February 7, 2024 17:05

jsoref mentioned this pull request Feb 9, 2024

Autotune memory #15564

Open

oscarsj requested changes Feb 9, 2024

View reviewed changes

jsoref force-pushed the choose-runner branch from e1c1ef8 to 8e78d17 Compare February 9, 2024 17:35

jsoref mentioned this pull request Feb 12, 2024

Limit xl runner jobs to github org #15584

Merged

jsoref closed this Feb 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Check for large runners #15471

Check for large runners #15471

jsoref commented Jan 30, 2024 •

edited

Loading

jsoref commented Jan 30, 2024 •

edited

Loading

jsoref commented Jan 31, 2024

aibaars commented Jan 31, 2024

jsoref commented Jan 31, 2024

oscarsj left a comment

oscarsj left a comment

jsoref commented Feb 9, 2024

oscarsj commented Feb 12, 2024 •

edited

Loading

smowton commented Feb 12, 2024

jsoref commented Feb 12, 2024

oscarsj commented Feb 12, 2024

Check for large runners #15471

Check for large runners #15471

Conversation

jsoref commented Jan 30, 2024 • edited Loading

jsoref commented Jan 30, 2024 • edited Loading

jsoref commented Jan 31, 2024

aibaars commented Jan 31, 2024

jsoref commented Jan 31, 2024

oscarsj left a comment

Choose a reason for hiding this comment

oscarsj left a comment

Choose a reason for hiding this comment

jsoref commented Feb 9, 2024

oscarsj commented Feb 12, 2024 • edited Loading

smowton commented Feb 12, 2024

jsoref commented Feb 12, 2024

oscarsj commented Feb 12, 2024

jsoref commented Jan 30, 2024 •

edited

Loading

jsoref commented Jan 30, 2024 •

edited

Loading

oscarsj commented Feb 12, 2024 •

edited

Loading