Re: ETA on GHA self-hosted runner security audit?

Chesnay Schepler Fri, 08 Apr 2022 23:10:04 -0700

i think we'd be alright with those caveats.

1) It depends on how far you wanna go. If the goal is to only protectthe workflow files themselves (and not any scripts that the workflow calls),then you only need to check out a single version of the code, namely thePR. The way I see it whether the workflow calls some PR-controlledscript or test largely results in the same threats (that is, runningweird stuff or making changes to the current shell).

2) That's gonna be unintuitive and will definitely end up breaking CI afew times, but for the most part changes to the CI workflow can also beverifiedfrom runs in the contributors fork in github-hosted runners. Thealternative is to "merge" the PR into a separate branch in the main repo fortesting purposes. It's inconvenient, but not a problem I think given howrarely we change things.

3) The "bitcoin miner problem" is giving me headaches. On the one handI'd love to just say "hey let's just run manual builds for for allnon-committers",on the other hand I know that will be tedious AF and result in stuffbeing merged without CI being run at all.My current plan is to add some form of self-contained pre-flight checkas another job to the pull_request_targetworkflow that fails with an error if it is a first-time contributor.That _may_ cover most cases; I think we'd be fine with re-evaluatingthis later onif necessary, accepting some risk (so long as that risk is limited toreduced CI capacity/stability).

> you should [...] make sure that your `pull_request_target" workflowis the only one that can use self-hosted runners and make sure no otherworkflows can.

If there are no 'pull_request' workflows this is covered, correct? If soI think we'd just ban those entirely.



On 07/04/2022 11:00, Jarek Potiuk wrote:

Cool. Thanks for the explanation :). Happy to brainstorm a bit - looks
like you are well prepared. But still there are a few things left which
might be difficult or maybe even impossible to handle.

I think if you do this extra hoop and indeed not use at all
pull_request_target workflow then as of two weeks you could actually "some"
security in place:
https://github.blog/2022-03-23-github-actions-secure-self-hosted-runners-specific-workflows/
- we've been discussing it with Zeppelin as well and I started to look at
it recently, but in essence, you should be able to make sure that your
`pull_request_target" workflow is the only one that can use self-hosted
runners and make sure no other workflows can. Otherwise you risk the
problem that the users will actually modify any of the existing workflows
they have access to (maybe you indeed do not need any) and make them run on
self-hosted machines of yours.

There are two small problems with using pull_request_target and one huge:

1) By default you get only the main version of both - the workflow and the
code that is checked out. If you want to run any actions on the code of the
pull request, you need to specifically check out the code of the pull
request (likely in a different directory so that you don't override the
scripts that you actually run in that workflow). We do it in Airflow
already so it's possible. But effectively you have two copies of code that
you manage - one that comes with your "pull_request_target" (from main) and
one that comes from the user from pull request.

2) The second caveat of the "pull_request_targe" is that if you want to
make any changes in the workflow itself, then you either "test in
production" or you need to push your changes as `main` branch in your fork
to test it which adds a lot of extra hoops. Those are very annoying things
but we made it work in Airflow with quite extra effort.

3) The huge problem with using pull_request_target on long running machines
is that it opens you up for cryptocurrency mining by bad actors. And there
is very little that you can do to protect yourself "with certainty" on a
living system with "static machines" - short of going the same protections
that cloud vendors do for their computing instances, no more, no less (and
that's really the difficult part)

Actually, crypto-currency mining is a much bigger problem for
"pull_request_target" - if it will at any step execute an unreviewed user's
code from the PR (and I cannot imagine any non-trivial CI workflow that
would not execute user code in some ways). The possibility of
"cryptocurrency abuse" is not "academic" problem - this had actually
happened and this is main reason why GitHub introduced "Manual Approval":
https://github.blog/2021-04-22-github-actions-update-helping-maintainers-combat-bad-actors/.
After this change, GitHub Actions will prevent the "pull_request" workflows
originating from "new users": all fine there. They protect their own
infrastructure. But - if you run the user's code in "pull_request_target" -
this protection DOES NOT WORK. Look at this image here:
https://pasteboard.co/x2t7T3T3CP2k.png. When new users submit PR, the
"pull_request" workflow is indeed protected, But "pull_request_target" is
considered safe (as it only runs from main) and it will run immediately,
without waiting for approval. So in a sense "pull_request_target" is more
sensitive for cryptocurrency mining cases.

If your "pull_request_target" workflow at any time will execute code coming
from the user - then the "new user" can very easily sneak in something that
will continue running in a background and run cryptocurrency mining without
any need for approval. If you have a long running machine - this is a
really bad scenario.
In the case of Airflow we have "ephemeral" machines that get killed after
the job is done. They are also "spot instances" so they get killed within
24 hours no matter what. Also we very, very carefully review any changes
coming to pull_request_target workflow.

Currently, the only code (I think!!) from user we run there is "building
Docker image from Dockerfile changeable by the user" and "running python
code in Docker to build packages" - both have time limit and docker
containers are killed after specified time (and user cannot influence the
time) - which makes it totally unattractive for currency mining (but if
they escape docker, the machines also will be shutdown relatively quickly).

But even with those careful reviews, we already had a few cases where even
with the thorough review, I realised just a day or so later that I opened a
door for the anonymous user to execute their own code on our
self-hosted machines directly. For example here is a fix of such
"vulnerability":
https://github.com/apache/airflow/commit/197b1390ed039486512436a6457e6961aef41a94
.
It looks pretty innocent, but those few lines fix the problem where
anonymous/new users could open a PR and execute code and (for example) get
a remote shell to our machine. If that was a "long running" one - that
would give pretty much full access to mine bitcoin on our build machines
and do all kinds of stuff.

I personally would never run a long-running machine for CI self-hosted
runners. It's far too dangerous IMHO. And really this is the main reason
why the problem is that hard - because you have to make a lot of effort to
make sure that bad actor's code will not escape from the workflow code into
the machines of yours. If you run them continuously, I think ultimately
there will always be a way to escape (either via some vulnerability or a
bug in your code like in our case). This is why I think to get really
"reasonably protected", you need to do both things:

* ephemeral machines started on demand controlled by webhooks of github
actions
* docker-contained user code in short running jobs

Other comments:
- If you do not need write permission - then it's cool and makes even
accidental slips less problematic, because the most you can do is to impact
the builds of others. If you do not use artifacts of the CI elsewhere, this
is secure.
- Running the entire workflow in docker containers makes perfect sense.
It does not give full protection but makes it considerably harder to escape

On Thu, Apr 7, 2022 at 8:52 AM Chesnay Schepler <ches...@apache.org> wrote:

Our thinking goes along these lines:

One of the main issues seems to be the possibility of contributors
changing the workflow files and having those workflows being run in the PR.
This alone would void several safety measures that one could set up in
the workflow itself.
The main thing missing here on githubs side is some form of (secure)
validation step that a workflow should even run.

AFAIK you solved this by modifying the runner to check for an allow list
of the person running the workflow.
We intend to not use the pull_request event but either
* pull_request_target
* have a bot trigger workflow_dispatch after some validation (like
checking the diff for changes to certain files/directories) with the
workflow checking out the PR.
Depends a bit on whether there are any surprises in the
pull_request_target option.

Both of these have the disadvantage of potentially exposing a write
token and secrets.
As we don't need write permissions at all we can both disable write
permissions in the UI and set a restrictive permissions map in the
workflow.
As for secrets, according to the documentation reusable workflows must
opt-in to specific secrets, so the main workflow run for PRs will not do
anything but reuse another workflow which doesn't use opt-in to any
secrets.

On the topic of secrets, branches and PRs builds are run by different
sets of machines via labels, to prevent cases of someone breaking out of
a PR build and into another branch build.

We intend to run the entire workflow in docker containers, which run on
a static set of machines.
Setting up ephemeral virtual machines would of course be nicer, but our
machines don't support that.

For 3rd party actions we, as of right now, only intend to use those
provided by Github (things like checkout or artifact management). Our
pipeline isn't really complex or fancy, just expensive.
Since these actions offer pretty basic functionality there is no need to
use the latest-and-greatest stuff, so we pin them to specific commits
and quite likely don't need to ever touch them.
Chances are that if we need a new feature, by the time that happens it
has already been out in the wild for months.

3rd party dependencies are certainly tricky, but it doesn't seem
specific to self-hosted runners.

On 06/04/2022 23:28, Jarek Potiuk wrote:

So could you explain what you've done to fulfill that?

If you want to use GitHub Actions, consider using your own self-hosted

runner, but only if you can afford to build and maintain your own
self-hosted infrastructure (this is not an easy task due to security
limitations of the official GitHub Actions runners).

What have you planned to make your infrastructure ready for the
security challenges there ?

Could you please explain your understanding of it and what you've done
to fulfill it ?

J.

On Wed, Apr 6, 2022 at 10:44 PM Chesnay Schepler <ches...@apache.org>

wrote:

I have very much read that.

On 06/04/2022 19:22, Jarek Potiuk wrote:

Since you referred Ash's link you probably have not read this:

    However this is not something to tackle lightly, as Infra *will not

manage

or secure your VM* - that is up to you.


On Wed, Apr 6, 2022 at 7:21 PM Chesnay Schepler <ches...@apache.org>

wrote:

This article also lists self-hosted runners as an option:

https://cwiki.apache.org/confluence/display/INFRA/GitHub+self-hosted+runners

On 06/04/2022 11:56, Chesnay Schepler wrote:

Did you find some documentation somewhere that we might have said

otherwise?

We knew that Airflow is using them and thus thought it would be fine.
We also had a chat with the Airflow folks and IIRC it also wasn't
mentioned.

There were several tickets where other projects requested token where
no limitation was mentioned:
* Arrow; token was provided:
https://issues.apache.org/jira/browse/INFRA-19875
* Beam: https://issues.apache.org/jira/browse/INFRA-22840
* Zeppelin: https://issues.apache.org/jira/browse/INFRA-22674
And in fact our own latest request for 2 tokens was also granted in
https://issues.apache.org/jira/browse/INFRA-23086. The alarm bells
only went off when we requested more tokens.

Then we have https://infra.apache.org/self-hosted-runners.html which
states /"//Apache permits projects to use self-hosted runners [but
does not recommend them]./
/
/
At last, we have

https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status

(admittedly not an official INFRA resource, but it is linked in some
INFRA tickets / discussions), which again lists self-hosted runners

as

an option (while listing /caveats/)./
/
/
/
TL;DR://There was plenty of information from which one would conclude
that self-hosted runners are allowed, and no information to the

contrary.

//


On 06/04/2022 11:43, Gavin McDonald wrote:

Hi.

On Wed, Apr 6, 2022 at 11:31 AM Chesnay Schepler<ches...@apache.org
wrote:

Hello,

Inhttps://issues.apache.org/jira/browse/INFRA-23086  it was

mentioned

that a security audit of self-hosted runners for github actions is
being
conducted at the moment, and that until this is complete no

significant

number of self-hosted runners can be set up.
This came as a bit of a surprise to us (the Flink project); we
wanted to
complete our migration to github actions within the next 2-3 weeks,
which is now effectively blocked.

I wanted to ask about this part, why was it a surprise?

Self Hosted Github Runners
has never been approved for general projects use at the moment. Did

you

find
some documentation somewhere that we might have said otherwise?

We are still evaluating a safe and secure way in which we can deploy
self
hosted runners
at the  ASF. Currently Airflow are the only approved project, and

we are

working with Beam
to ensure the same level of security if not better. the result of

this

experiment will determine
when we can open up self hosted runners for all projects.

2 to 3 weeks MIGHT be do-able but I'll let you know, still working

with

Beam currently.

I wanted to ask whether there is some form of ETA on when this

audit is

complete.

Regards,
Chesnay

Re: ETA on GHA self-hosted runner security audit?

Reply via email to