Hi, just a quick note in agreement with Kyle.
For many corporates that are behind the times, using Docker is quite a challenge, and it actually prevents me from running Beam Python against Flink. Since the code is really just Java (I think?), could there be an option to simply build the jar and run the portable runner manually? Of course, the correct way should be Docker, but this option would be very handy.

Cheers,
Fred

From: Kyle Weaver [mailto:kcwea...@google.com]
Sent: 13 September 2019 01:57
To: user@beam.apache.org
Cc: dev <d...@beam.apache.org>
Subject: EXTERNAL: Re: How do you write portable runner pipeline on separate python code?

I prefer loopback because a) it writes output files to the local filesystem, as the user expects, and b) you don't have to pull or build Docker images, or even have Docker installed on your system -- which is one less point of failure.

Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com

On Thu, Sep 12, 2019 at 5:48 PM Thomas Weise <t...@apache.org> wrote:

This should become much better with 2.16, when we have the Docker images prebuilt. Docker is probably still the best option for Python on a JVM-based runner in a local environment that does not have a development setup.

On Thu, Sep 12, 2019 at 1:09 PM Kyle Weaver <kcwea...@google.com> wrote:

+dev <d...@beam.apache.org>

I think we should probably point new users of the portable Flink/Spark runners to loopback or some other non-Docker environment, as Docker adds operational complexity that isn't really needed to run a word count example. For example, Yu's pipeline errored here because the expected Docker container wasn't built before running.
Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com

On Thu, Sep 12, 2019 at 11:27 AM Robert Bradshaw <rober...@google.com> wrote:

On this note, making local files easy to read is something we'd definitely like to improve, as the current behavior is quite surprising. This could be useful not just for running with Docker and the portable runner locally, but more generally when running on a distributed system (e.g. a Flink/Spark cluster or Dataflow). It would be very convenient if we could automatically stage local files to be read as artifacts consumable by any worker (possibly via external directory mounting in the local Docker case rather than an actual copy), and conversely copy small outputs back to the local machine (with a similar optimization for local Docker). At the very least, however, we should add obvious messaging when the local filesystem is used from within Docker, which is often a non-obvious and hard-to-debug mistake.

On Thu, Sep 12, 2019 at 10:34 AM Lukasz Cwik <lc...@google.com> wrote:

When you use a local filesystem path and a docker environment, "/tmp" is written inside the container. You can solve this issue by:
* Using a "remote" filesystem such as HDFS/S3/GCS/...
* Mounting an external directory into the container so that any "local" writes appear outside the container
* Using a non-docker environment such as external or process.

On Thu, Sep 12, 2019 at 4:51 AM Yu Watanabe <yu.w.ten...@gmail.com> wrote:

Hello. I would like to ask for help with my sample code using the portable runner with Apache Flink. I was able to work out wordcount.py using this page: https://beam.apache.org/roadmap/portability/

I got the two files below under /tmp:
-rw-r--r-- 1 ywatanabe ywatanabe 185 Sep 12 19:56 py-wordcount-direct-00001-of-00002
-rw-r--r-- 1 ywatanabe ywatanabe 190 Sep 12 19:56 py-wordcount-direct-00000-of-00002

Then I wrote sample code with the steps below.

1. Installed apache_beam using pip3, separate from the source code directory.
2. Wrote the sample code below and named it "test-portable-runner.py", placed in a separate directory from the source code.

-----------------------------------------------------------------------------------
(python) ywatanabe@debian-09-00:~$ ls -ltr
total 16
drwxr-xr-x 18 ywatanabe ywatanabe 4096 Sep 12 19:06 beam (<- source code directory)
-rw-r--r-- 1 ywatanabe ywatanabe 634 Sep 12 20:25 test-portable-runner.py
-----------------------------------------------------------------------------------

3. Executed the code with "python3 test-portable-runner.py".

==========================================================================================
#!/usr/bin/env python3
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io import WriteToText

def printMsg(line):
    print("OUTPUT: {0}".format(line))
    return line

options = PipelineOptions(["--runner=PortableRunner",
                           "--job_endpoint=localhost:8099",
                           "--shutdown_sources_on_final_watermark"])

p = beam.Pipeline(options=options)

output = (p
          | 'create' >> beam.Create(["a", "b", "c"])
          | beam.Map(printMsg))

output | 'write' >> WriteToText('/tmp/sample.txt')
=======================================================================================

The job seemed to go all the way to the FINISHED state.

-------------------------------------------------------------------------------------------------------------------------------------------------------
[DataSource (Impulse) (1/1)] INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: DataSource (Impulse) (1/1) (9df528f4d493d8c3b62c8be23dc889a8) [DEPLOYING].
[DataSource (Impulse) (1/1)] INFO org.apache.flink.runtime.taskmanager.Task - DataSource (Impulse) (1/1) (9df528f4d493d8c3b62c8be23dc889a8) switched from DEPLOYING to RUNNING.
[flink-akka.actor.default-dispatcher-3] INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - DataSource (Impulse) (1/1) (9df528f4d493d8c3b62c8be23dc889a8) switched from DEPLOYING to RUNNING.
[flink-akka.actor.default-dispatcher-3] INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN MapPartition (MapPartition at [2]write/Write/WriteImpl/DoOnce/{FlatMap(<lambda at core.py:2415>), Map(decode)}) -> FlatMap (FlatMap at ExtractOutput[0]) (1/1) (0b60f1a9e620b061f85e97998aa4b181) switched from CREATED to SCHEDULED.
[flink-akka.actor.default-dispatcher-3] INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - CHAIN MapPartition (MapPartition at [2]write/Write/WriteImpl/DoOnce/{FlatMap(<lambda at core.py:2415>), Map(decode)}) -> FlatMap (FlatMap at ExtractOutput[0]) (1/1) (0b60f1a9e620b061f85e97998aa4b181) switched from SCHEDULED to DEPLOYING.
[flink-akka.actor.default-dispatcher-3] INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - Deploying CHAIN MapPartition (MapPartition at [2]write/Write/WriteImpl/DoOnce/{FlatMap(<lambda at core.py:2415>), Map(decode)}) -> FlatMap (FlatMap at ExtractOutput[0]) (1/1) (attempt #0) to aad5f142-dbb7-4900-a5ae-af8a6fdcc787 @ localhost (dataPort=-1)
[flink-akka.actor.default-dispatcher-2] INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Received task CHAIN MapPartition (MapPartition at [2]write/Write/WriteImpl/DoOnce/{FlatMap(<lambda at core.py:2415>), Map(decode)}) -> FlatMap (FlatMap at ExtractOutput[0]) (1/1).
[DataSource (Impulse) (1/1)] INFO org.apache.flink.runtime.taskmanager.Task - Loading JAR files for task DataSource (Impulse) (1/1) (d51e4171c0af881342eaa8a7ab06def8) [DEPLOYING].
[DataSource (Impulse) (1/1)] INFO org.apache.flink.runtime.taskmanager.Task - Registering task at network: DataSource (Impulse) (1/1) (d51e4171c0af881342eaa8a7ab06def8) [DEPLOYING].
[DataSource (Impulse) (1/1)] INFO org.apache.flink.runtime.taskmanager.Task - DataSource (Impulse) (1/1) (9df528f4d493d8c3b62c8be23dc889a8) switched from RUNNING to FINISHED.
-------------------------------------------------------------------------------------------------------------------------------------------------------

But I ended up with a docker error on the client side.

-------------------------------------------------------------------------------------------------------------------------------------------------------
(python) ywatanabe@debian-09-00:~$ python3 test-portable-runner.py
/home/ywatanabe/python/lib/python3.5/site-packages/apache_beam/__init__.py:84: UserWarning: Some syntactic constructs of Python 3 are not yet fully supported by Apache Beam.
  'Some syntactic constructs of Python 3 are not yet fully supported by '
ERROR:root:java.io.IOException: Received exit code 125 for command 'docker run -d --network=host --env=DOCKER_MAC_CONTAINER=null ywatanabe-docker-apache.bintray.io/beam/python3:latest --id=13-1 --logging_endpoint=localhost:39049 --artifact_endpoint=localhost:33779 --provision_endpoint=localhost:34827 --control_endpoint=localhost:36079'. stderr: Unable to find image 'ywatanabe-docker-apache.bintray.io/beam/python3:latest' locally
docker: Error response from daemon: unknown: Subject ywatanabe was not found.
See 'docker run --help'.
Traceback (most recent call last):
  File "test-portable-runner.py", line 27, in <module>
    result.wait_until_finish()
  File "/home/ywatanabe/python/lib/python3.5/site-packages/apache_beam/runners/portability/portable_runner.py", line 446, in wait_until_finish
    self._job_id, self._state, self._last_error_message()))
RuntimeError: Pipeline BeamApp-ywatanabe-0912112516-22d84d63_d3b5e778-d771-4118-9ba7-c9dde85938e9 failed in state FAILED: java.io.IOException: Received exit code 125 for command 'docker run -d --network=host --env=DOCKER_MAC_CONTAINER=null ywatanabe-docker-apache.bintray.io/beam/python3:latest --id=13-1 --logging_endpoint=localhost:39049 --artifact_endpoint=localhost:33779 --provision_endpoint=localhost:34827 --control_endpoint=localhost:36079'. stderr: Unable to find image 'ywatanabe-docker-apache.bintray.io/beam/python3:latest' locally
docker: Error response from daemon: unknown: Subject ywatanabe was not found.
See 'docker run --help'.
-------------------------------------------------------------------------------------------------------------------------------------------------------

As a result, I got nothing under /tmp. The code works when using DirectRunner. May I ask where I should look in order to get the pipeline to write its results to text files under /tmp?

Best Regards,
Yu Watanabe

--
Yu Watanabe
Weekend Freelancer who loves to challenge building data platform
yu.w.ten...@gmail.com