I only know of @OrielResearch's Eila Arich-Landkof <e...@orielresearch.org> potentially doing applied work with custom containers (there must be others)!
As a plug for her and @BeamSummit: I think plenty of related material (with Conda specifics) will be covered in
https://2020.beamsummit.org/sessions/workshop-using-conda-on-beam/

I'm sure others will have more things to say that are actually helpful, on-list, before that occurs (~3 weeks).

On Fri, Aug 7, 2020 at 6:32 PM Eugene Kirpichov <ekirpic...@gmail.com> wrote:

> Hi old Beam friends,
>
> I left Google to work on climate change
> <https://www.linkedin.com/posts/eugenekirpichov_i-am-leaving-google-heres-a-snipped-to-activity-6683408492444962816-Mw5U>
> and am now doing a short engagement with Pachama <https://pachama.com/>.
> Right now I'm trying to get a Beam Python pipeline to work; the pipeline
> will use fancy requirements and native dependencies, and we plan to run it
> on Cloud Dataflow (so custom containers are not yet an option), so I'm
> going straight for the direct PortableRunner as per
> https://beam.apache.org/documentation/runtime/environments/.
>
> Basically I can't get a minimal Beam program with a minimal
> requirements.txt file to work - the .tar.gz of the dependency mysteriously
> ends up being ungzipped and non-installable inside the Docker container
> running the worker. Details below.
>
> === main.py ===
> import argparse
> import logging
>
> import apache_beam as beam
> from apache_beam.options.pipeline_options import PipelineOptions
> from apache_beam.options.pipeline_options import SetupOptions
>
> def run(argv=None):
>     parser = argparse.ArgumentParser()
>     known_args, pipeline_args = parser.parse_known_args(argv)
>
>     pipeline_options = PipelineOptions(pipeline_args)
>     pipeline_options.view_as(SetupOptions).save_main_session = True
>
>     with beam.Pipeline(options=pipeline_options) as p:
>         (p | 'Create' >> beam.Create(['Hello'])
>            | 'Write' >> beam.io.WriteToText('/tmp'))
>
>
> if __name__ == '__main__':
>     logging.getLogger().setLevel(logging.INFO)
>     run()
>
> === requirements.txt ===
> alembic
>
> When I run the program:
> $ python3 main.py --runner=PortableRunner --job_endpoint=embed --requirements_file=requirements.txt
>
>
> I get some normal output and then:
>
> INFO:apache_beam.runners.portability.fn_api_runner.worker_handlers:b'
> File
> "/usr/local/lib/python3.7/site-packages/pip/_internal/utils/unpacking.py",
> line 261, in unpack_file\n untar_file(filename, location)\n File
> "/usr/local/lib/python3.7/site-packages/pip/_internal/utils/unpacking.py",
> line 177, in untar_file\n tar = tarfile.open(filename, mode)\n File
> "/usr/local/lib/python3.7/tarfile.py", line 1591, in open\n return
> func(name, filemode, fileobj, **kwargs)\n File
> "/usr/local/lib/python3.7/tarfile.py", line 1648, in gzopen\n raise
> ReadError("not a gzip file")\ntarfile.ReadError: not a gzip
> file\n2020/08/08 01:17:07 Failed to install required packages: failed to
> install requirements: exit status 2\n'
>
> This greatly puzzled me and, after some looking, I found something really
> surprising.
> Here is the package in the *directory to be staged*:
>
> $ file /var/folders/07/j09mnhmd2q9_kw40xrbfvcg80000gn/T/dataflow-requirements-cache/alembic-1.4.2.tar.gz
> ...: gzip compressed data, was "dist/alembic-1.4.2.tar", last modified:
> Thu Mar 19 21:48:31 2020, max compression, original size modulo 2^32 4730880
> $ ls -l /var/folders/07/j09mnhmd2q9_kw40xrbfvcg80000gn/T/dataflow-requirements-cache/alembic-1.4.2.tar.gz
> -rw-r--r-- 1 jkff staff 1092045 Aug 7 16:56 ...
>
> So far so good. But here is the same file inside the Docker container (I ssh'd
> into the dead container
> <https://thorsten-hans.com/how-to-run-commands-in-stopped-docker-containers>
> ):
>
> # file /tmp/staged/alembic-1.4.2.tar.gz
> /tmp/staged/alembic-1.4.2.tar.gz: POSIX tar archive (GNU)
> # ls -l /tmp/staged/alembic-1.4.2.tar.gz
> -rwxr-xr-x 1 root root 4730880 Aug 8 01:17 /tmp/staged/alembic-1.4.2.tar.gz
>
> The file has clearly been unzipped and now of course pip can't install it!
> What's going on here? Am I using the direct/portable runner combination
> wrong?
>
> Thanks!
>
> --
> Eugene Kirpichov
> http://www.linkedin.com/in/eugenekirpichov
>
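
In case it helps anyone reproduce this outside Beam, here is a minimal Python sketch (standard library only) of the symptom quoted above. Note that the staged file's size (4730880) equals the "original size" recorded in the gzip header of the cached file, which is consistent with a plain decompression somewhere along the way. The helper names (sniff_archive, gunzip_keeping_name, open_as_targz) are hypothetical and are not part of Beam or pip. The traceback suggests pip opened the file in gzip mode, so open_as_targz assumes mode "r:gz"; that is an inference from the log, not something I have checked in pip's source.

```python
import gzip
import shutil
import tarfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def sniff_archive(path):
    """Classify a file by its content, not its name: 'gzip', 'tar', or 'unknown'."""
    with open(path, "rb") as f:
        if f.read(2) == GZIP_MAGIC:
            return "gzip"
    try:
        # Mode "r:" accepts only an uncompressed tar archive.
        with tarfile.open(path, mode="r:"):
            return "tar"
    except tarfile.ReadError:
        return "unknown"


def gunzip_keeping_name(src, dst):
    """Write the decompressed bytes of src to dst (hypothetical helper).

    Giving dst a .tar.gz name imitates the state of the staged file.
    """
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)


def open_as_targz(path):
    """Open path strictly as a gzip-compressed tar (assumed to mirror pip)."""
    with tarfile.open(path, mode="r:gz") as tar:
        return tar.getnames()
```

Running sniff_archive on both the cached copy and /tmp/staged/alembic-1.4.2.tar.gz should mirror the two `file` results in the quoted message. Calling open_as_targz on a file produced by gunzip_keeping_name raises tarfile.ReadError with the same "not a gzip file" message seen in the worker log.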