Re: Is there a way to decide what RDDs get cached in the Spark Runner?

2019-05-14 Thread Kyle Weaver
Minor correction: Slack channel is actually #beam-spark Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com | +1650203 *From: *Kyle Weaver *Date: *Tue, May 14, 2019 at 9:38 AM *To: * Hi Augusto, > > Right now the default behavior is to cache all intermediat

Re: Is there a way to decide what RDDs get cached in the Spark Runner?

2019-05-14 Thread Kyle Weaver
/SparkPipelineOptions.java#L150 Thanks *From: *augusto@gmail.com *Date: *Tue, May 14, 2019 at 5:01 AM *To: * Hi, > > I guess the title says it all, right now it seems like BEAM caches all the > interme

Re: How to setup portable flink runner with remote flink cluster

2019-05-24 Thread Kyle Weaver
ption: Connection refused > ... 11 more

Re: [ANNOUNCE] Apache Beam 2.13.0 released!

2019-06-07 Thread Kyle Weaver
https://beam.apache.org/blog/2019/05/22/beam-2.13.0.html > > Thanks to everyone who contributed to this release, and we hope you enjoy > using Beam 2.13.0. > > -- Ankur Goenka, on behalf of The Apache Beam team

Re: grpc cancelled without enough information

2019-06-13 Thread Kyle Weaver
I'm not sure the log excerpt you attached contains the root cause of your issue. Unfortunately, Beam's portable runners print a great deal of irrelevant error messages like these, even when the SDK harness disconnects normally. So it is possible the real issue is buried elsewhere in the

[ANNOUNCE] Spark portable runner (batch) now available for Java, Python, Go

2019-06-14 Thread Kyle Weaver
] https://beam.apache.org/documentation/runners/spark/ [4] https://builds.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch [5] https://builds.apache.org/job/beam_PostCommit_Python_VR_Spark/ [6] https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/

Re: Beam python pipeline on spark

2019-08-02 Thread Kyle Weaver
am wondering if this has changed to date? > > Let me know. > > Thanks, > Mahesh

Re: python setup develop?

2019-09-11 Thread Kyle Weaver
Hi Matt, As of recently, I think you need to specify the Python version in the Gradle command, e.g. ./gradlew :sdks:python:container:py35:docker Please don't hesitate to let us know if you have any further questions -- this is how we know where to improve our documentation :)

Re: How do you write portable runner pipeline on separate python code ?

2019-09-12 Thread Kyle Weaver
e expected Docker container wasn't built before running. On Thu, Sep 12, 2019 at 11:27 AM Robert Bradshaw wrote: > On this note, making local files easy to read is something we'd definitely > like to improv

Re: Python errors when using batch+windows+textio

2019-09-12 Thread Kyle Weaver
Hi Pawel, could you tell us which version of the Beam Python SDK you are using? For the record, this looks like a known issue: https://issues.apache.org/jira/browse/BEAM-6860 On Wed, Sep 11, 2019 at 6:33 AM Paweł Kordek

Re: How do you write portable runner pipeline on separate python code ?

2019-09-12 Thread Kyle Weaver
I prefer loopback because a) it writes output files to the local filesystem, as the user expects, and b) you don't have to pull or build Docker images, or even have Docker installed on your system -- which is one less point of failure.

Re: How do you write portable runner pipeline on separate python code ?

2019-09-13 Thread Kyle Weaver
> Is it one of the best guarded secrets? ;-) Apparently so! Filed a few related jiras and assigned them to myself. [1] https://issues.apache.org/jira/browse/BEAM-8214 [2] https://issues.apache.org/jira/browse/BEAM-8232 [3] https://issues.apache.org/jira/browse/BEAM-8233

Re: Flink Runner logging FAILED_TO_UNCOMPRESS

2019-09-14 Thread Kyle Weaver
Try adding "--experiments=beam_fn_api" to your pipeline options. (This is a known issue with Beam 2.15 that will be fixed in 2.16.) On Sat, Sep 14, 2019 at 12:52 AM Yu Watanabe wrote: > Hello. > > I a

Re: Running Python Wordcount issues

2019-09-16 Thread Kyle Weaver
This is a known issue with Beam 2.15. Using master or the 2.16 branch should fix the problem. On Mon, Sep 16, 2019 at 4:31 PM Tom Barber wrote: > Hello folks, > > Trying to get started running the python w

Re: How to use google container registry for FlinkRunner ?

2019-09-16 Thread Kyle Weaver
> > > Perhaps is there any environment variable to specify which image to use ? > > Best Regards, > Yu Watanabe

Re: How to use the loopback?

2019-09-16 Thread Kyle Weaver
Could you share more of the stack trace? On Mon, Sep 16, 2019 at 7:49 PM Benjamin Tan wrote: > I'm trying to use the loopback via the environment_type option: > > options = PipelineOptions(["--ru

Re: Python Portable Runner Issues

2019-09-17 Thread Kyle Weaver
R SerializingExecutor: Exception while executing runnable org.apache.beam.vendor.grpc.v1p21p0.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed@47db89b1 java.lang.IllegalStateException: call already closed I will try to get rid of them also, but for now you can just ignore them. They ar

Re: How to use the loopback?

2019-09-17 Thread Kyle Weaver
The StopWorker "errors" are harmless and there's an easy patch for them: https://github.com/apache/beam/pull/9600 (I already included this in my reply on the other thread, but putting it down here for the record)

Re: Flink Runner logging FAILED_TO_UNCOMPRESS

2019-09-17 Thread Kyle Weaver
The flag is automatically set, but not in a smart way. Taking another look at the code, a more resilient fix would be to just check whether the runner is an instance of PortableRunner. On Tue, Sep 17, 2019 at 2:14 PM Ahmet Altay

Re: Flink Runner logging FAILED_TO_UNCOMPRESS

2019-09-17 Thread Kyle Weaver
Actually, the reported issues are already fixed on head. We're just trying to prevent similar issues in the future. On Tue, Sep 17, 2019 at 3:38 PM Ahmet Altay wrote: > > > On Tue, Sep 17, 2019 at 2:26

Re: Python Portable Runner Issues

2019-09-17 Thread Kyle Weaver
The Flink runner is definitely more stable, as it's been around longer and has more developers and users. But a lot of the code is shared, so for example some of the issues above would also happen on the Flink runner.

Re: Word-count example

2019-09-17 Thread Kyle Weaver
--experiments=beam_fn_api doesn't apply here, as this is a Java pipeline using the non-portable version of the Flink runner. On Tue, Sep 17, 2019 at 4:41 PM Benjamin Tan wrote: > Could you try adding "

Re: Does loopback mode work on Spark clusters?

2019-09-17 Thread Kyle Weaver
explanations of these options, stay tuned. On Tue, Sep 17, 2019 at 4:54 PM Benjamin Tan wrote: > I'm having connection refused errors, though code on PySpark works on the > clusters so I'm pretty sure it's

Re: Submitting job to AWS EMR fails with error=2, No such file or directory

2019-09-18 Thread Kyle Weaver
the logs, or b) you can use the master or Beam 2.16 branches, which have better Docker logging (https://github.com/apache/beam/pull/9389). On Wed, Sep 18, 2019 at 8:04 AM Yu Watanabe wrote: > Thank you for the reply.

Re: Word-count example

2019-09-19 Thread Kyle Weaver
"/Python instructions on https://beam.apache.org/documentation/runners/flink/ and let us know how that goes? Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com On Thu, Sep 19, 2019 at 10:52 AM Matthew Patterson wrote: > Thanks Ankur, > > > > As one who s

Re: Word-count example

2019-09-19 Thread Kyle Weaver
You should probably use 2.15, since 2.16 release artifacts have not been published yet. Just follow the instructions that say --runner=PortableRunner, not --runner=FlinkRunner, otherwise you'll hit that other deserialization bug that was mentioned.

Re: Word-count example

2019-09-19 Thread Kyle Weaver
I'm guessing you need to install virtualenv: `pip install virtualenv` On Thu, Sep 19, 2019 at 11:27 AM Matthew Patterson wrote: > Kyle, > > > > Excellent, will do: unfortunately switch to 2.16 was o

Re: Word-count example

2019-09-23 Thread Kyle Weaver
"WARNING:root:No unique name set for transform ..." should not affect the pipeline's ability to complete successfully. Is the pipeline failing? If so, could you share more logs? Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com On Mon, Sep 23, 2019 at 1

Re: Word-count example

2019-09-23 Thread Kyle Weaver
would start by making sure there are no unneeded Flink clusters, job servers, etc. left running on your machine, as port conflicts can cause silent failures. On Mon, Sep 23, 2019 at 12:58 PM Matthew Patterson wrote: > K

Re: How to reference manifest from apache flink worker node ?

2019-09-23 Thread Kyle Weaver
The relevant configuration flag for the job server is `--artifacts-dir`. @Robert Bradshaw I added this info to the log message: https://github.com/apache/beam/pull/9646 On Mon, Sep 23, 2019 at 11:36 AM Robert Bradshaw

Re: How do you call user defined ParDo class from the pipeline in Portable Runner (Apache Flink) ?

2019-09-25 Thread Kyle Weaver
You will need to set the save_main_session pipeline option to True. On Wed, Sep 25, 2019 at 3:44 PM Yu Watanabe wrote: > Hello. > > I would like to ask a question about ParDo. > > I am getting the below error ins

Re: Word-count example

2019-09-25 Thread Kyle Weaver
rtify Flink taskmanger with docker? Docker in docker (dind)? Just install > in image? Docker-out-of-docker? Any working example folks could point me to > would be great. > > > > Thanks All, > > Matt > > > > [dind?]( > https://medium.com/hootsuite-engineering/b

Re: How to import external module inside ParDo using Apache Flink ?

2019-09-26 Thread Kyle Weaver
Did you try moving the imports from the process function to the top of main.py? On Wed, Sep 25, 2019 at 11:27 PM Yu Watanabe wrote: > Hello. > > I would like to ask for help with resolving a dependency issue for

Re: Running Python Beam Functions on Spark Kubernetes Cluster

2019-10-31 Thread Kyle Weaver
> Looks like for Java, a standalone Job Service is not required to run beam functions on Spark, and spark-submit handles everything in cluster mode. But this is not the case for Python runner. That's correct. > Are you aware of any example in Python that runs in a (i.e. Kubernetes) cluster? Not

Re: Beam with Flink without containers

2019-11-07 Thread Kyle Weaver
Hi Robert, I replied to your main question on SO. (This is becoming a frequently asked question, so we're going to look for ways to improve or at least document this process. Stay tuned.) > Generally, I’d like to have some sort of state diagram to describe who calls what when, if anything like

Re: Installing system dependencies in a DataFlow worker - how

2019-11-27 Thread Kyle Weaver
You can also configure your own Docker images if you like, instructions here: https://beam.apache.org/documentation/runtime/environments/ On Wed, Nov 27, 2019 at 12:38 AM Carl Thomé wrote: > Hi, > > I have a Beam pipeline written in the Python SDK that decodes audio files > into TFRecord:s. I'd

Re: Installing system dependencies in a DataFlow worker - how

2019-12-02 Thread Kyle Weaver
age": "(24f8c9b6e647d55d): The workflow could not be created. >> Causes: (24f8c9b6e647de48): Invalid worker harness container image: >> my_image. Custom images are not yet supported.", >> "status": "INVALID_ARGUMENT" >> } >> > >

Re: Beam portable runner

2019-12-02 Thread Kyle Weaver
Hi David, I recall gradlew isn't included in some of our source archives, but it should be included if you download the source from github: https://github.com/Apache/beam On Mon, Dec 2, 2019 at 1:40 PM David Edwards wrote: > Hi all, > > Looking for help trying to get the beam portable runner for

Re: What's every part's responsiblity for python sdk with flink?

2019-12-12 Thread Kyle Weaver
The order is: user python code -> job server -> *flink cluster -> SDK harness* 1. User python code defines the Beam pipeline. 2. The job server executes the Beam pipeline on the Flink cluster. To do so, it must translate Beam operations into Flink native operations. 3. The Flink cluster executes

Re: What's every part's responsiblity for python sdk with flink?

2019-12-12 Thread Kyle Weaver
tor" > ? > It sounds like we need to depoloy it on every flink cluster node? > > > -- 原始邮件 -- > *发件人:* "Kyle Weaver"; > *发送时间:* 2019年12月13日(星期五) 凌晨0:26 > *收件人:* "user"; > *抄送:* "Maximilian Michels"; >

Re: Please assist; how do i use a Sample transform ?

2019-12-17 Thread Kyle Weaver
Looks like you need to choose a subclass of Sample. Probably FixedSizeGlobally in your case. For example, beam.transforms.combiners.Sample.FixedSizeGlobally(5) Source: https://github.com/apache/beam/blob/df376164fee1a8f54f3ad00c45190b813ffbdd34/sdks/python/apache_beam/transforms/combiners.py#L6

Re: Please assist; how do i use a Sample transform ?

2019-12-17 Thread Kyle Weaver
eFn(...)), , > etc. > > On Tue, Dec 17, 2019 at 2:10 PM Kyle Weaver wrote: > > > > Looks like you need to choose a subclass of sample. Probably > FixedSizeGlobally in your case. For example, > > > > beam.transforms.combiners.Sample.FixedSizeGlobally(5)

Re: Beam on Flink with Python SDK and using GCS as artifacts directory

2019-12-23 Thread Kyle Weaver
> It will be great if you can get the error from the failing process. Note that you will have to set the log level to DEBUG to get output from the process. On Fri, Dec 20, 2019 at 6:23 PM Ankur Goenka wrote: > Hi Matthew, > > It will be great if you can get the error from the failing process. >

Re: Protocol message had invalid UTF-8

2019-12-30 Thread Kyle Weaver
This error can happen when the job server and SDK versions are mismatched (due to protobuf incompatibilities). The SDK and job server containers should use the same Beam version. On Mon, Dec 30, 2019 at 11:47 AM Yu Watanabe wrote: > Hello. > > I would like to get help with an issue I am having in job-se

Re: Worker pool dies with error: context deadline exceeded

2020-01-02 Thread Kyle Weaver
This is the root cause: > python-sdk_1 | 2019/12/31 02:59:45 Failed to obtain provisioning > information: failed to dial server at localhost:45759 The Flink task manager and Beam SDK harness use connections over `localhost` to communicate. You will have to put `taskmanager` and `python-sdk` on
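A hedged docker-compose sketch of the fix described above (the service names `taskmanager` and `python-sdk` come from the thread; the images, versions, and layout are illustrative assumptions):

```yaml
# Illustrative fragment only: share a network namespace so that
# `localhost` connections between the Flink task manager and the
# Beam SDK worker pool succeed.
services:
  taskmanager:
    image: flink:1.9
  python-sdk:
    image: apache/beam_python3.7_sdk
    command: ["--worker_pool"]
    network_mode: "service:taskmanager"
```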

Re: runShadow: prebuild and build in read-only directory

2020-01-09 Thread Kyle Weaver
You can build the job server jar using: ./gradlew runners:flink:1.8:job-server:shadowJar The output jar will be located in: runners/flink/1.8/job-server/build/libs/ You can run the jar using `java -jar`. Hope that helps. On Thu, Jan 9, 2020 at 10:47 AM Robert Lugg wrote: > I am able to run

Re: Unable to reliably have multiple cores working on a dataset with DirectRunner

2020-01-29 Thread Kyle Weaver
> I also tried briefly SparkRunner with version 2.16 but was not able to achieve any throughput. What do you mean by this? On Wed, Jan 29, 2020 at 1:20 PM Julien Lafaye wrote: > I confirm the situation gets better after the commit: 4 cores used for 18 > seconds rather than one core used for 50 s

Re: Beam 2.19.0 / Flink 1.9.1 - Session cluster error when submitting job "Multiple environments cannot be created in detached mode"

2020-02-20 Thread Kyle Weaver
Hi Tobi, This seems like a bug with Beam 2.19. I filed https://issues.apache.org/jira/browse/BEAM-9345 to track the issue. > What puzzles me is that the session cluster should be allowed to have multiple environments in detached mode - or am I wrong? It looks like that check is removed in Flink

Re: RDD Caching in SparkRunner

2020-02-26 Thread Kyle Weaver
> Persisting is usually the right thing to do. +1, cacheDisabled should only be used if you're certain that *in aggregate* recomputation is faster than writing to and reading from the cache. Keep in mind that cacheDisabled applies to the whole pipeline, meaning you're out of luck if you want to re

Re: GCS numShards doubt

2020-03-02 Thread Kyle Weaver
As Luke and Robert indicated, unsetting num shards _may_ cause the runner to optimize it automatically. For example, the Flink [1] and Dataflow [2] runners override num shards. However, in the Spark runner, I don't see any such override. So I have two questions: 1. Does the Spark runner override

Re: SparkRunner on k8s

2020-04-13 Thread Kyle Weaver
Hi Buvana, Running Beam Python on Spark on Kubernetes is more complicated, because Beam has its own solution for running Python code [1]. Unfortunately there's no guide that I know of for Spark yet, however we do have instructions for Flink [2]. Beam's Flink and Spark runners, and I assume GCP's (

Re: SparkRunner on k8s

2020-04-13 Thread Kyle Weaver
type) > > File > "/usr/local/lib/python3.6/site-packages/apache_beam/io/localfilesystem.py", > line 137, in _path_open > > raw_file = open(path, mode) > > FileNotFoundError: [Errno 2] No such file or directory: > '/tmp/beam-temp-result.txt-43eab494

Re: SparkRunner on k8s

2020-04-15 Thread Kyle Weaver
e > "/usr/local/lib/python3.6/site-packages/apache_beam/io/filebasedsink.py", > line 134, in open > > return FileSystems.create(temp_path, self.mime_type, > self.compression_type) > > File > "/usr/local/lib/python3.6/site-packages/apache_beam/io/files

Re: Sending email from Apache Beam

2020-04-15 Thread Kyle Weaver
The easiest way is to do something like this (assuming you are using Python): CombineGlobally(lambda elements: ','.join(elements)) For more info on Combines, check out the programming guide: https://beam.apache.org/documentation/programming-guide/#core-beam-transforms On Wed, Apr 15, 2020 at 4:4

Re: SparkRunner on k8s

2020-04-16 Thread Kyle Weaver
id a git checkout of release-2.19.0 branch and > executed the portable runner and I still encounter this error. ☹ > > > > -Buvana > > > > *From: *Kyle Weaver > *Reply-To: *"user@beam.apache.org" > *Date: *Wednesday, April 15, 2020 at 2:48 PM > *To: *&qu

Re: Kafka IO: value of expansion_service

2020-04-22 Thread Kyle Weaver
You can build the Java SDK image from source by running the following command: ./gradlew :sdks:java:container:docker On Wed, Apr 22, 2020 at 4:43 PM Piotr Filipiuk wrote: > Thanks for quick response. > > Since Beam 2.21.0 is not yet available via pip >

Re: Kafka IO: value of expansion_service

2020-04-22 Thread Kyle Weaver
Apr 22, 2020 at 1:47 PM Kyle Weaver wrote: > >> You can build the Java SDK image from source by running the following >> command: ./gradlew :sdks:java:container:docker >> >> On Wed, Apr 22, 2020 at 4:43 PM Piotr Filipiuk >> wrote: >> >>> Thanks for q

Re: SparkRunner on k8s

2020-04-24 Thread Kyle Weaver
ns to the job runner that would eventually > translate to ' --volume /storage1:/storage1 ' while the docker container is > being run by Flink? Even if it means code changes and building from source, > its fine. Please point me in the right direction. > > Thanks, > Buvana > --

Re: Kafka IO: value of expansion_service

2020-04-27 Thread Kyle Weaver
>> at >> org.apache.beam.fn.harness.FnApiDoFnRunner$ProcessBundleContext.outputWithTimestamp(FnApiDoFnRunner.java:1411) >> at >> org.apache.beam.fn.harness.FnApiDoFnRunner$ProcessBundleContext.output(FnApiDoFnRunner.java:1406) >> at >> org.apache.beam.sdk.transforms.D

Re: Beam + Flink + Docker - Write to host system

2020-04-29 Thread Kyle Weaver
> This seems to have worked, as the output file is created on the host system. However the pipeline silently fails, and the output file remains empty. Have you checked the SDK container logs? They are most likely to contain relevant failure information. > I don't know if this is a result of me re

Re: Set parallelism for each operator

2020-04-29 Thread Kyle Weaver
Which runner are you using? On Wed, Apr 29, 2020 at 1:32 PM Eleanore Jin wrote: > Hi all, > > I just wonder can Beam allow to set parallelism for each operator > (PTransform) separately? Flink provides such feature. > > The usecase I have is the source is kafka topics, which has less > partition

Re: Beam + Flink + Docker - Write to host system

2020-04-30 Thread Kyle Weaver

Re: Beam + Flink + Docker - Write to host system

2020-05-01 Thread Kyle Weaver

Re: Warsaw Beam Meetup - May 14th!

2020-05-04 Thread Kyle Weaver
Looks like some cool talks! Thanks for sharing Brittany. On Mon, May 4, 2020 at 7:52 PM Brittany Hermann wrote: > Happy Monday, folks! > > I just wanted to share that I am partnering with Polidea to host the > second Digital Warsaw Apache Beam Meetup on May 14th. Save the date and > check out th

Re: Upgrading from 2.15 to 2.19 makes compilation fail on trigger

2020-05-05 Thread Kyle Weaver
> Maybe we should add a statement like "did you mean to wrap it in Repeatedly.forever?" to the error message +1. IMO the less indirection between the user and the fix, the better. On Tue, May 5, 2020 at 12:08 PM Luke Cwik wrote: > Pointing users to the website with additional details in the err

Re: beam python on spark-runner

2020-05-14 Thread Kyle Weaver
Keep in mind that those instructions about spark-submit are meant only to apply to the Java-only runner. For Python, running spark-submit in this manner is not going to work. See https://issues.apache.org/jira/browse/BEAM-8970 On Thu, May 14, 2020 at 2:55 PM Heejong Lee wrote: > How did you sta

Re: Portable Runner performance optimisation

2020-05-15 Thread Kyle Weaver
> 2. Is it possible to pre-run SDK Harness containers and reuse them for every Portable Runner pipeline? I could win quite a lot of time on this for more complicated pipelines. Yes, you can start docker containers before hand using the worker_pool option: docker run -p=5:5 apachebeam/pyth

Re: Portable Runner performance optimisation

2020-05-15 Thread Kyle Weaver
> Yes, you can start docker containers before hand using the worker_pool option: However, it only works for Python. Java doesn't have it yet: https://issues.apache.org/jira/browse/BEAM-8137 On Fri, May 15, 2020 at 12:00 PM Kyle Weaver wrote: > > 2. Is it possible to pre-r

Re: Portable Runner performance optimisation

2020-05-15 Thread Kyle Weaver
Note also that a worker pool should only retrieve artifacts once: https://github.com/apache/beam/pull/9398 On Fri, May 15, 2020 at 12:15 PM Luke Cwik wrote: > > > On Fri, May 15, 2020 at 9:01 AM Kyle Weaver wrote: > >> > Yes, you can start docker containers before hand

[ANNOUNCE] Beam 2.21.0 Released

2020-05-28 Thread Kyle Weaver
here: https://beam.apache.org/get-started/downloads/ This release includes bug fixes, features, and improvements detailed on the Beam blog: https://beam.apache.org/blog/beam-2.21.0/ Thanks to everyone who contributed to this release, and we hope you enjoy using Beam 2.21.0. -- Kyle Weaver, on

Re: Issue while submitting python beam pipeline on flink - local

2020-05-28 Thread Kyle Weaver
Hi Ashish, can you check to make sure apachebeam/flink1.9_job_server is also on version 2.21.0? On Thu, May 28, 2020 at 7:13 AM Ashish Raghav wrote: > Hello Guys , > > > > I am trying to run a python beam pipeline on flink. I am trying to run > apache_beam.examples.wordcount_minimal but with Pip

Re: Issue while submitting python beam pipeline on flink - local

2020-05-28 Thread Kyle Weaver
Latest Version available on docker hub is 2.20.0 > *From:* Kyle Weaver > *Sent:* 28 May 2020 16:51 > *To:* user@beam.apache.org > *Subject:* Re: Issue while submitting python beam pipeline on flink - > local

Re: Issue while submitting python beam pipeline on flink - local

2020-05-28 Thread Kyle Weaver
. On Thu, May 28, 2020 at 7:31 AM Kyle Weaver wrote: > 2.21.0 should be available now: > https://hub.docker.com/layers/apache/beam_flink1.9_job_server/2.21.0/images/sha256-eeac6dd4571794a8f985e9967fa0c1522aa56a28b5b0a0a34490a600065f096d?context=explore > > On Thu, May 28, 2020 at 7:

Re: Issue while submitting python beam pipeline on flink - local

2020-05-28 Thread Kyle Weaver
> You are using the LOOPBACK environment which requires that the Flink > cluster can connect back to your local machine. Since the loopback > environment by default binds to localhost, that should not be possible. On the Flink runner page, we recommend using --net=host to avoid the kinds of networ

Re: Cross-Language pipeline fails with PROCESS SDK Harness

2020-05-28 Thread Kyle Weaver
What source are you using? On Thu, May 28, 2020 at 1:24 PM Alexey Romanenko wrote: > Hello, > > I’m trying to run a Cross-Language pipeline (Beam 2.21, Java pipeline with > an external Python transform) with a PROCESS SDK Harness and Spark Portable > Runner but it fails. > To do that I have a ru

Re: Pipeline Processing Time

2020-05-28 Thread Kyle Weaver
Which runner are you using? On Thu, May 28, 2020 at 1:43 PM Talat Uyarer wrote: > Hi, > > I have a pipeline which has 5 steps. What is the best way to measure > processing time for my pipeline? > > Thanks >

Re: Cross-Language pipeline fails with PROCESS SDK Harness

2020-05-28 Thread Kyle Weaver
Can you try removing the cross-language component(s) from the pipeline and see if it still has the same error? On Thu, May 28, 2020 at 4:15 PM Alexey Romanenko wrote: > For testing purposes, it’s just “Create.of(“Name1”, “Name2”, ...)" > > On 28 May 2020, at 19:29, Kyle Weaver w

Re: HDFS I/O with Beam on Spark and Flink runners - consistent Error messages

2020-05-28 Thread Kyle Weaver
Hi Buvana, I suspect this is a bug. If you can try running your pipeline again with these changes: 1. Remove `--spark-master-url spark://:7077` from your Docker run command. 2. Add `--environment_type=LOOPBACK` to your pipeline options. It will help us confirm the cause of the issue. On

Re: Cross-Language pipeline fails with PROCESS SDK Harness

2020-05-29 Thread Kyle Weaver
beam:transform:read:v1 primitives, but > transform Create.Values/Read(CreateSource) executes in environment > Optional[urn: "beam:env:docker:v1" > payload: "\n\033apache/beam_java_sdk:2.20.0" > > Do you think it’s a bug or I miss something in configuration? > >

Re: Cross-Language pipeline fails with PROCESS SDK Harness

2020-05-29 Thread Kyle Weaver
il.concurrent.UncheckedExecutionException: > java.lang.IllegalStateException: Process died with exit code 1 > > If it’s unknown issue, I’ll create a Jira for that. > > On 29 May 2020, at 16:46, Kyle Weaver wrote: > > Alexey, can you try adding --experiments=beam_fn_api to you

Re: Worker pool question

2020-05-29 Thread Kyle Weaver
> Does this mean, as an end user of Beam, I can start a worker pool and have my Pipeline executed by this pool of workers (or, are these options strictly internal)? What should be the runner value in PipelineOptions in that case? Not exactly. The runner and Beam workers work together to execute y

Re: Cross-Language pipeline fails with PROCESS SDK Harness

2020-06-01 Thread Kyle Weaver
etting *virtualenv* for a worker console > where it should be running. > > It would be useful to print out such errors with Error level log, I think. > > On 29 May 2020, at 18:55, Kyle Weaver wrote: > > That's probably a problem with your worker. You'll need to get

Re: Issue while submitting python beam pipeline on flink - local

2020-06-05 Thread Kyle Weaver
don't think this is an access > issue as the instance on which the flink cluster is running has full > access to gcs. > > I tried following this > https://stackoverflow.com/questions/59429897/beam-running-on-flink-with-python-sdk-and-us

Re: Streaming Beam jobs keep restarting on Spark/Kubernetes?

2020-06-08 Thread Kyle Weaver
> There is no error Are you sure? That sounds like a crash loop to me. It might take some digging through various Kubernetes logs to find the cause. Can you provide more information about how you're running the job? On Mon, Jun 8, 2020 at 1:50 PM Joseph Zack wrote: > Anybody out there running

Re: Building Dataflow Worker

2020-06-15 Thread Kyle Weaver
> Looks like there is an issue with Gradle Errorprone's plugin. I saw the same error earlier today: https://jira.apache.org/jira/browse/BEAM-10263 I haven't done a full investigation yet, but it seems this broke the build for several (?) past Beam source releases. I'm not sure what Beam's policy

Re: Not able to create a checkpoint path

2020-07-10 Thread Kyle Weaver
For properties not exposed through Beam's FlinkPipelineOptions, Flink's usual configuration management (usually conf/flink-conf.yaml) applies. You should be able to set the checkpoint directory via state.checkpoints.dir per https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/checkpoint
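A minimal flink-conf.yaml sketch of the setting named above (the directory path and state backend choice are illustrative placeholders):

```yaml
# Illustrative fragment only: set the checkpoint directory through
# Flink's own configuration, since it is not a FlinkPipelineOptions flag.
state.backend: filesystem
state.checkpoints.dir: hdfs:///flink/checkpoints
```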

Re: Testing Apache Beam pipelines / python SDK

2020-07-17 Thread Kyle Weaver
'Map to String' >> beam.Map(add_year) assert_that(lines, equal_to(["expected_value1", "expected_value2", ...])) On Fri, Jul 17, 2020 at 3:02 PM Kyle Weaver wrote: > > I had a look at the util_test.py, and i see that in those tests > pipelines are being cr

Re: Testing Apache Beam pipelines / python SDK

2020-07-17 Thread Kyle Weaver
> I had a look at the util_test.py, and i see that in those tests pipelines are being created as part of tests., and in these tests what are being tested are beam functions - eg beam.Map etc. assert_that checks the results of an entire pipeline, not individual transforms. You should be able to a

Re: [ANNOUNCE] Beam 2.23.0 Released

2020-07-30 Thread Kyle Weaver
Hi Eleanore, there have been no changes to Beam's supported Flink versions since Beam 2.21.0. Beam supports Flink 1.8, 1.9, and 1.10. If you are looking for Flink 1.11 support, I didn't find an existing issue, so I filed https://issues.apache.org/jira/browse/BEAM-10612. On Thu, Jul 30, 2020 at 2:

Re: Need Support for ElasticSearch 7.x for beam

2020-08-24 Thread Kyle Weaver
This ticket indicates Elasticsearch 7.x has been supported since Beam 2.19: https://issues.apache.org/jira/browse/BEAM-5192 Are there any specific features you need that aren't supported? On Mon, Aug 24, 2020 at 11:33 AM Mohil Khare wrote: > Hello, > > Firstly I am on java sdk 2.23.0 and we hea

Re: Getting Beam(Python)-on-Flink-on-k8s to work

2020-08-26 Thread Kyle Weaver
> - With the Flink operator, I was able to submit a Beam job, but hit the issue that I need Docker installed on my Flink nodes. I haven't yet tried changing the operator's yaml files to add Docker inside them. Running Beam workers via Docker on the Flink nodes is not recommended (and probably not
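For reference, a submission using an external (pre-started worker pool) environment, rather than Docker-on-the-node, looks roughly like the fragment below. The module name and endpoints are placeholders for whatever your deployment exposes:

```shell
# Sketch: submit a Python pipeline to a portable job server and tell the
# runner to dial an already-running SDK worker pool instead of spawning Docker.
python -m my_pipeline \
  --runner=PortableRunner \
  --job_endpoint=flink-jobserver:8099 \
  --environment_type=EXTERNAL \
  --environment_config=localhost:50000
```

With `EXTERNAL`, the SDK harness typically runs as a sidecar container next to each Flink task manager, which is why it avoids the Docker-in-Docker problem.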

Re: Getting Beam(Python)-on-Flink-on-k8s to work

2020-08-28 Thread Kyle Weaver
>>>> - name: taskmanger >>>> image: myregistry:5000/docker-flink:1.10 >>>> env: >>>> - name: DOCKER_HOST >>>> value: tcp://localhost:2375 >>>> ... >>>> >>>> I quickly threw all these pieces

Re: OOM issue on Dataflow Worker by doing string manipulation

2020-09-02 Thread Kyle Weaver
You can try scoping the string builder instance to processElement, instead of making it a member of your DoFn. The same DoFn instance can be used for a bundle of many elements, or possibly even across multiple bundles. https://beam.apache.org/releases/javadoc/2.23.0/org/apache/beam/sdk/transforms/
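The scoping advice can be illustrated in plain Python (a stdlib sketch of the principle, not the poster's Java code): keep the buffer local to the per-element method, so it becomes garbage-collectible as soon as each element is processed instead of growing for the DoFn's whole lifetime.

```python
import io

class FormatRecordFn:
    """Illustrative DoFn-like class. The buffer is created inside
    process(), not stored on the instance, so no state accumulates
    across elements or bundles."""

    def process(self, element):
        buf = io.StringIO()  # local to this element; released on return
        for field in element:
            buf.write(str(field))
            buf.write(",")
        yield buf.getvalue().rstrip(",")

fn = FormatRecordFn()
print(list(fn.process(["id-1", 42, "ok"])))  # -> ['id-1,42,ok']
```

The same reasoning applies to the Java `StringBuilder` case: a builder held as a DoFn field lives as long as the DoFn instance, which can span many bundles.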

Re: OOM issue on Dataflow Worker by doing string manipulation

2020-09-02 Thread Kyle Weaver
> It looks like `writer.setLength(0)` may actually allocate a new buffer, and then the buffer may also need to be resized as the String grows, so you could be creating a lot of orphaned buffers very quickly. I'm not that familiar with StringBuilder, is there a way to reset it and re-use the existin

Re: Flink JobService on k8s

2020-09-22 Thread Kyle Weaver
> The issue is that the jobserver does not provide the proper endpoints to the SDK harness when it submits the job to flink. I would not be surprised if there was something weird going on with Docker in Docker. The defaults mostly work fine when an external SDK harness is used [1]. Can you provid

Re: Flink JobService on k8s

2020-09-22 Thread Kyle Weaver
-6mhtq > :/tmp/beam-artifact-staging/3024e5d862fef831e830945b2d3e4e9511e0423bfb9c48de75aa2b3b67decce4 > > Do the jobserver and the taskmanager need to share the artifact staging > volume? > > On Tue, Sep 22, 2020 at 4:04 PM Kyle Weaver wrote: > >> > rpc error: c
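One common way to make the staged artifacts visible to both processes on Kubernetes is a volume mounted at the same path in the job server and the task manager. A sketch of the relevant spec fragments (the claim name is a placeholder; in practice the two containers usually live in separate deployments sharing the same claim):

```yaml
# Fragment only -- not a complete manifest.
volumes:
  - name: beam-artifact-staging
    persistentVolumeClaim:
      claimName: beam-artifact-staging-pvc   # placeholder claim
containers:
  - name: beam-jobserver
    volumeMounts:
      - name: beam-artifact-staging
        mountPath: /tmp/beam-artifact-staging
  - name: taskmanager
    volumeMounts:
      - name: beam-artifact-staging
        mountPath: /tmp/beam-artifact-staging
```

The mount path should match the artifact staging directory the job server is configured with, so workers resolve the same files the job server wrote.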

Re: Beam Flink Kafka example: issues with docker

2020-09-28 Thread Kyle Weaver
> This looks to me like an issue with artifact staging. It looks like the worker is trying to start the apache/beam_java_sdk:2.24.0 environment, but can't find the jar that we staged that contains the code for the Java KafkaIO. Yeah. This kind of error most often happens when the job server and Be

Re: Beam Flink Kafka example: issues with docker

2020-09-28 Thread Kyle Weaver
> is resolved). On Mon, Sep 28, 2020 at 11:32 AM Kyle Weaver wrote: > > This looks to me like an issue with artifact staging. It looks like the > worker is trying to start the apache/beam_java_sdk:2.24.0 environment, but > can't find the jar that we staged that contains the code

Re: Beam Flink Kafka example: issues with docker

2020-09-29 Thread Kyle Weaver
worked with Dataflow yet). Will any of the issues with > xlang kafka also be an issue when writing an MQTT transform? > > > On Mon, Sep 28, 2020 at 8:34 PM Kyle Weaver wrote: > >> Sorry, didn't read closely.. LOOPBACK won't work if you're doing >> c

Re: Java/Flink - Flink's shaded Netty and Beam's Netty clash

2020-10-01 Thread Kyle Weaver
Can you provide your beam and flink versions as well? On Thu, Oct 1, 2020 at 5:59 AM Tomo Suzuki wrote: > To fix the problem we need to identify which JAR file contains > io.grpc.netty.shaded.io.netty.util.collection.IntObjectHashMap. Can you > check which version of which artifact (I suspect i
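To answer the "which JAR contains the class" question mechanically, a short stdlib script can scan a directory of jars for the class entry (`find_class` is a made-up helper for illustration, not a Beam utility):

```python
import pathlib
import zipfile

def find_class(jar_dir, class_entry):
    """Return names of jars under jar_dir containing class_entry, e.g.
    'io/grpc/netty/shaded/io/netty/util/collection/IntObjectHashMap.class'."""
    hits = []
    for jar in sorted(pathlib.Path(jar_dir).glob("*.jar")):
        # Jars are zip archives, so zipfile can list their entries directly.
        with zipfile.ZipFile(jar) as zf:
            if class_entry in zf.namelist():
                hits.append(jar.name)
    return hits
```

Pointing it at both the Flink `lib/` directory and the job's shaded jar shows which artifacts each ship the conflicting `IntObjectHashMap`, which is the information needed to pick a version or relocation fix.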
