This post is about a mistake I made that wasted a fair bit of my time until the folks over in Support set me straight. :)
The Issue
We have been using GitHub Actions within the platform team on our Enterprise instance for some time.
We are using the following open source project, which enables the use of AWS Spot instances as ephemeral runners. Please go check it out and give them a star.
However, we had several organisations where a few repositories refused to complete Actions workflows. The workflow would kick off within Actions and a spot instance was provisioned, but it never picked up the job.
Troubleshooting
Firstly, I was convinced this was an issue with the webhook trigger or something in the automation spinning up the runner within AWS. Checking the logs in CloudWatch showed the expected behaviour: the workflow job was received and a spot instance was spun up.
The instance would register the runner with the GitHub instance but then sit there, with no error but also no job to run. After a couple of minutes of inactivity, the instance was terminated, which again was expected.
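If anyone wants to follow the same breadcrumbs, tailing the Lambda log groups from the CLI is a quick way to confirm the job event arrives and an instance is requested. The log group names below are only placeholders; yours will depend on how the runner automation was deployed.

aws logs tail /aws/lambda/github-runners-webhook --since 30m --follow
aws logs tail /aws/lambda/github-runners-scale-up --since 30m --follow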
After digging deep into logs, it was time for a support ticket. I admit I was dubious about how this odd issue would progress, but I had nowhere else to turn.
Support
The first thing support suggested was a connectivity test, which made total sense. What I presented sounded like a communication issue between the runner and the GitHub Enterprise instance.
This passed with flying colours; however, there is a catch: you can't run the check from within a sudo shell. As these are spot instances accessed via SSM Session Manager, the clock is ticking, so I only get a few minutes to capture the output I need.
sh-4.2$ sudo su -
[root@ip-192-168-4-180 ~]# cd /opt/actions-runner/
[root@ip-192-168-4-180 actions-runner]# ls
bin config.sh _diag env.sh externals run-helper.cmd.template run-helper.sh.template run.sh runsvc.sh safe_sleep.sh svc.sh _work
[root@ip-192-168-4-180 actions-runner]# ./run.sh --check
Must not run interactively with sudo
Exiting runner...
[root@ip-192-168-4-180 actions-runner]#
Session terminated, killing shell... ...killed.
Terminated
sh-4.2$
I wasn't interested in a long-term solution here, so I hacked the /opt/actions-runner/run-helper.sh.template file to get the command past this check.
[root@ip-192-168-4-196 actions-runner]# head /opt/actions-runner/run-helper.sh.template
#!/bin/bash
# Validate not sudo
#user_id=`id -u`
#if [ $user_id -eq 0 -a -z "$RUNNER_ALLOW_RUNASROOT" ]; then
# echo "Must not run interactively with sudo"
# exit 1
#fi
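In hindsight, judging by the check I commented out, a less invasive hack would have been to set RUNNER_ALLOW_RUNASROOT before running the check. Something along these lines should work (the URL and PAT are placeholders, and I haven't gone back to re-test it):

export RUNNER_ALLOW_RUNASROOT=1
./run.sh --check --url https://github.example.com/my-org/my-repo --pat <personal-access-token>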
Next up was to enable debugging on the runner; this is done by simply creating some specific secrets in the target repo.
See here for the steps - https://docs.github.com/en/actions/monitoring-and-troubleshooting-workflows/enabling-debug-logging#enabling-runner-diagnostic-logging
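For reference, the secrets in question are ACTIONS_RUNNER_DEBUG (runner diagnostic logs) and ACTIONS_STEP_DEBUG (step debug logs). Something like the following gh CLI commands should set them, assuming gh is authenticated against your instance (the repo name is a placeholder):

gh secret set ACTIONS_RUNNER_DEBUG --body "true" --repo my-org/my-repo
gh secret set ACTIONS_STEP_DEBUG --body "true" --repo my-org/my-repo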
Again, this provided nothing for me to go on.
Runner Groups
Support then suggested investigating what permissions had been assigned to the runner groups.
In my frustration at wanting this fixed, I stupidly said I wasn't using runner groups.
However, you are always using a runner group: if you haven't explicitly defined a custom runner group, your runners are registered in the Default runner group.
As the above shows, the default group's repository access is set to "All repositories, excluding public repositories".
There we go, the penny drops: the only repositories that were not working were public ones, and I now feel stupid. :) How did I not pick that up?
I tick the box to allow public repositories, and everything in the matrix is fixed.
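For anyone who prefers the command line over clicking through the UI, the org-level runner groups API exposes the same setting, so you can spot (and flip) it with a couple of gh api calls. The org name and group id below are placeholders:

gh api /orgs/my-org/actions/runner-groups --jq '.runner_groups[] | {id, name, allows_public_repositories}'
gh api -X PATCH /orgs/my-org/actions/runner-groups/1 -F allows_public_repositories=true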
Summary
While public repositories in our Enterprise instance are not as public as on GitHub.com, this setting makes complete sense from a security perspective. I should have dug a little further into this to avoid having to raise the support ticket.
If possible, I would like to see Actions workflows on public repositories fail immediately if public repositories are excluded in the runner group configuration above.
However, this was just one of those configurations that was always going to present this issue, and I missed the glaringly obvious!
Thank you to GitHub support for dealing with my stupid answers so gracefully.
I hope this helps someone else.
Cheers