Re: ETA on GHA self-hosted runner security audit?

Jarek Potiuk Thu, 07 Apr 2022 02:01:10 -0700

Cool. Thanks for the explanation :). Happy to brainstorm a bit -  looks
like you are well prepared. But still there are a few things left which
might be difficult or maybe even impossible to handle.

I think if you do this extra hoop and indeed not use at all
pull_request_target workflow then as of two weeks you could actually "some"
security in place:
https://github.blog/2022-03-23-github-actions-secure-self-hosted-runners-specific-workflows/
- we've been discussing it with Zeppelin as well and I started to look at
it recently, but in essence, you should be able to make sure that your
`pull_request_target"  workflow is the only one that can use self-hosted
runners and make sure no other workflows can. Otherwise you risk the
problem that the users will actually modify any of the existing workflows
they have access to (maybe you indeed do not need any) and make them run on
self-hosted machines of yours.

There are two small problems with using pull_request_target and one huge:

1) By default you get only the main version of both - the workflow and the
code that is checked out. If you want to run any actions on the code of the
pull request, you need to specifically check out the code of the pull
request (likely in a different directory so that you don't override the
scripts that you actually run in that workflow). We do it in Airflow
already so it's possible. But effectively you have two copies of code that
you manage - one that comes with your "pull_request_target" (from main) and
one that comes from the user from pull request.

2) The second caveat of the "pull_request_targe" is that if you want to
make any changes in the workflow itself, then you either "test in
production" or you need to push your changes as `main` branch in your fork
to test it which adds a lot of extra hoops. Those are very annoying things
but we made it work in Airflow with quite extra effort.

3) The huge problem with using pull_request_target on long running machines
is that it opens you up for cryptocurrency mining by bad actors. And there
is very little that you can do to protect yourself "with certainty" on a
living system with "static machines" - short of going the same protections
that cloud vendors do for their computing instances, no more, no less (and
that's really the difficult part)

Actually, crypto-currency mining is a much bigger problem for
"pull_request_target" - if it will at any step execute an unreviewed user's
code from the PR (and I cannot imagine any non-trivial CI workflow that
would not execute user code in some ways). The possibility of
"cryptocurrency abuse" is not "academic" problem - this had actually
happened and this is main reason why GitHub introduced "Manual Approval":
https://github.blog/2021-04-22-github-actions-update-helping-maintainers-combat-bad-actors/.
After this change, GitHub Actions will prevent the "pull_request" workflows
originating from "new users":  all fine there. They protect their own
infrastructure. But - if you run the user's code in "pull_request_target" -
this protection DOES NOT WORK. Look at this image here:
https://pasteboard.co/x2t7T3T3CP2k.png. When new users submit PR, the
"pull_request" workflow is indeed protected, But "pull_request_target" is
considered safe (as it only runs from main) and it will run immediately,
without waiting for approval. So in a sense "pull_request_target" is more
sensitive for cryptocurrency mining cases.

If your "pull_request_target" workflow at any time will execute code coming
from the user - then the "new user" can very easily sneak in something that
will continue running in a background and run cryptocurrency mining without
any need for approval. If you have a long running machine - this is a
really bad scenario.
In the case of Airflow we have "ephemeral" machines that get killed after
the job is done. They are also "spot instances" so they get killed within
24 hours no matter what. Also we very, very carefully review any changes
coming to pull_request_target workflow.

Currently, the only code (I think!!)  from user we run there is "building
Docker image from Dockerfile changeable by the user" and "running python
code in Docker to build packages" - both have time limit and docker
containers are killed after specified time (and user cannot influence the
time) - which makes it totally unattractive for currency mining (but if
they escape docker, the machines also will be shutdown relatively quickly).

But even with those careful reviews, we already had a few cases where even
with the thorough review, I realised just a day or so later that I opened a
door for the anonymous user to execute their own code on our
self-hosted machines directly. For example here is a fix of such
"vulnerability":
https://github.com/apache/airflow/commit/197b1390ed039486512436a6457e6961aef41a94
.
It looks pretty innocent, but those few lines fix the problem where
anonymous/new users could open a PR and execute code and (for example) get
a remote shell to our machine. If that was a "long running" one - that
would give pretty much full access to mine bitcoin on our build machines
and do all kinds of stuff.

I personally would never run a long-running machine for CI self-hosted
runners. It's far too dangerous IMHO. And really this is the main reason
why the problem is that hard - because you have to make a lot of effort to
make sure that bad actor's code will not escape from the workflow code into
the machines of yours. If you run them continuously, I think ultimately
there will always be a way to escape (either via some vulnerability or a
bug in your code like in our case). This is why I think to get really
"reasonably protected", you need to do both things:

* ephemeral machines started on demand controlled by webhooks of github
actions
* docker-contained user code in short running jobs

Other comments:
- If you do not need write permission - then it's cool and makes even
accidental slips less problematic, because the most you can do is to impact
the builds of others. If you do not use artifacts of the CI elsewhere, this
is secure.
- Running the entire workflow in docker containers makes perfect sense.
It does not give full protection but makes it considerably harder to escape

J.

On Thu, Apr 7, 2022 at 8:52 AM Chesnay Schepler <ches...@apache.org> wrote:

> Our thinking goes along these lines:
>
> One of the main issues seems to be the possibility of contributors
> changing the workflow files and having those workflows being run in the PR.
> This alone would void several safety measures that one could set up in
> the workflow itself.
> The main thing missing here on githubs side is some form of (secure)
> validation step that a workflow should even run.
>
> AFAIK you solved this by modifying the runner to check for an allow list
> of the person running the workflow.
> We intend to not use the pull_request event but either
> * pull_request_target
> * have a bot trigger workflow_dispatch after some validation (like
> checking the diff for changes to certain files/directories) with the
> workflow checking out the PR.
> Depends a bit on whether there are any surprises in the
> pull_request_target option.
>
> Both of these have the disadvantage of potentially exposing a write
> token and secrets.
> As we don't need write permissions at all we can both disable write
> permissions in the UI and set a restrictive permissions map in the
> workflow.
> As for secrets, according to the documentation reusable workflows must
> opt-in to specific secrets, so the main workflow run for PRs will not do
> anything but reuse another workflow which doesn't use opt-in to any
> secrets.
>
> On the topic of secrets, branches and PRs builds are run by different
> sets of machines via labels, to prevent cases of someone breaking out of
> a PR build and into another branch build.
>
> We intend to run the entire workflow in docker containers, which run on
> a static set of machines.
> Setting up ephemeral virtual machines would of course be nicer, but our
> machines don't support that.
>
> For 3rd party actions we, as of right now, only intend to use those
> provided by Github (things like checkout or artifact management). Our
> pipeline isn't really complex or fancy, just expensive.
> Since these actions offer pretty basic functionality there is no need to
> use the latest-and-greatest stuff, so we pin them to specific commits
> and quite likely don't need to ever touch them.
> Chances are that if we need a new feature, by the time that happens it
> has already been out in the wild for months.
>
> 3rd party dependencies are certainly tricky, but it doesn't seem
> specific to self-hosted runners.
>
> On 06/04/2022 23:28, Jarek Potiuk wrote:
> > So could you explain what you've done to fulfill that?
> >
> >> If you want to use GitHub Actions, consider using your own self-hosted
> runner, but only if you can afford to build and maintain your own
> self-hosted infrastructure (this is not an easy task due to security
> limitations of the official GitHub Actions runners).
> > What have you planned to make your infrastructure ready for the
> > security challenges there ?
> >
> > Could you please explain your understanding of it and what you've done
> > to fulfill it ?
> >
> > J.
> >
> > On Wed, Apr 6, 2022 at 10:44 PM Chesnay Schepler <ches...@apache.org>
> wrote:
> >> I have very much read that.
> >>
> >> On 06/04/2022 19:22, Jarek Potiuk wrote:
> >>> Since you referred Ash's link you probably have not read this:
> >>>
> >>>    However this is not something to tackle lightly, as Infra *will not
> manage
> >>> or secure your VM* - that is up to you.
> >>>
> >>>
> >>> On Wed, Apr 6, 2022 at 7:21 PM Chesnay Schepler <ches...@apache.org>
> wrote:
> >>>
> >>>> This article also lists self-hosted runners as an option:
> >>>>
> >>>>
> https://cwiki.apache.org/confluence/display/INFRA/GitHub+self-hosted+runners
> >>>>
> >>>> On 06/04/2022 11:56, Chesnay Schepler wrote:
> >>>>>> Did you find some documentation somewhere that we might have said
> >>>>> otherwise?
> >>>>>
> >>>>> We knew that Airflow is using them and thus thought it would be fine.
> >>>>> We also had a chat with the Airflow folks and IIRC it also wasn't
> >>>>> mentioned.
> >>>>>
> >>>>> There were several tickets where other projects requested token where
> >>>>> no limitation was mentioned:
> >>>>> * Arrow; token was provided:
> >>>>> https://issues.apache.org/jira/browse/INFRA-19875
> >>>>> * Beam: https://issues.apache.org/jira/browse/INFRA-22840
> >>>>> * Zeppelin: https://issues.apache.org/jira/browse/INFRA-22674
> >>>>> And in fact our own latest request for 2 tokens was also granted in
> >>>>> https://issues.apache.org/jira/browse/INFRA-23086. The alarm bells
> >>>>> only went off when we requested more tokens.
> >>>>>
> >>>>> Then we have https://infra.apache.org/self-hosted-runners.html which
> >>>>> states /"//Apache permits projects to use self-hosted runners [but
> >>>>> does not recommend them]./
> >>>>> /
> >>>>> /
> >>>>> At last, we have
> >>>>>
> https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status
> >>>>> (admittedly not an official INFRA resource, but it is linked in some
> >>>>> INFRA tickets / discussions), which again lists self-hosted runners
> as
> >>>>> an option (while listing /caveats/)./
> >>>>> /
> >>>>> /
> >>>>> /
> >>>>> TL;DR://There was plenty of information from which one would conclude
> >>>>> that self-hosted runners are allowed, and no information to the
> contrary.
> >>>>> //
> >>>>>
> >>>>>
> >>>>> On 06/04/2022 11:43, Gavin McDonald wrote:
> >>>>>> Hi.
> >>>>>>
> >>>>>> On Wed, Apr 6, 2022 at 11:31 AM Chesnay Schepler<ches...@apache.org
> >
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hello,
> >>>>>>>
> >>>>>>> Inhttps://issues.apache.org/jira/browse/INFRA-23086  it was
> mentioned
> >>>>>>> that a security audit of self-hosted runners for github actions is
> >>>>>>> being
> >>>>>>> conducted at the moment, and that until this is complete no
> significant
> >>>>>>> number of self-hosted runners can be set up.
> >>>>>>> This came as a bit of a surprise to us (the Flink project); we
> >>>>>>> wanted to
> >>>>>>> complete our migration to github actions within the next 2-3 weeks,
> >>>>>>> which is now effectively blocked.
> >>>>>>>
> >>>>>> I wanted to ask about this part, why was it a surprise?
> >>>>>>
> >>>>>> Self Hosted Github Runners
> >>>>>> has never been approved for general projects use at the moment. Did
> you
> >>>>>> find
> >>>>>> some documentation somewhere that we might have said otherwise?
> >>>>>>
> >>>>>> We are still evaluating a safe and secure way in which we can deploy
> >>>>>> self
> >>>>>> hosted runners
> >>>>>> at the  ASF. Currently Airflow are the only approved project, and
> we are
> >>>>>> working with Beam
> >>>>>> to ensure the same level of security if not better. the result of
> this
> >>>>>> experiment will determine
> >>>>>> when we can open up self hosted runners for all projects.
> >>>>>>
> >>>>>> 2 to 3 weeks MIGHT be do-able but I'll let you know, still working
> with
> >>>>>> Beam currently.
> >>>>>>
> >>>>>>
> >>>>>>> I wanted to ask whether there is some form of ETA on when this
> audit is
> >>>>>>> complete.
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> Chesnay
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
>
>

Re: ETA on GHA self-hosted runner security audit?

Reply via email to