Re: Apache Beam Python word count example is failing for Flink Runner

2019-05-28 Thread Ankur Goenka
Hi Anjana, when using the local file system, Docker containers started during pipeline execution can only access that container's local filesystem. Also, multiple containers are started during pipeline execution, and they do not have access to each other's file systems. So, in this case, files created

Re: Apache Beam Python word count example is failing for Flink Runner

2019-05-29 Thread Ankur Goenka
ache.bintray.io > > 2. As different log levels can be used to control the verbosity of > logging, when using the logging level 'INFO' and attempting to log any > messages using 'logging.info', it does not show the messages on the > console. > > Regards,

Re: Apache Beam Python word count example is failing for Flink Runner

2019-05-29 Thread Ankur Goenka
not $ > userId-docker-apache.bintray.io > ' > 2. Logging is working as expected now. I need to set up log levels > properly at jobserver and the script side. > > Regards, > Anjana

Re: How to build a beam python pipeline which does GET/POST request to API's

2019-06-01 Thread Ankur Goenka
Hi Anjana, you can write your API logic in a ParDo and subsequently pass the elements to other ParDos to transform them and eventually make an API call to another endpoint. However, this might not be a good fit for Beam, as the input is not well defined, and hence scaling and "once processing" of ele
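A minimal, stdlib-only sketch of the per-element API call described above. The class name, endpoint, and payload shape are illustrative assumptions, not details from the thread; in a real pipeline this would subclass `apache_beam.DoFn`:

```python
import json
import urllib.request

class CallApiDoFn:
    """Sketch of a DoFn-style class; in a real Beam pipeline this would
    subclass apache_beam.DoFn. Endpoint and payload are assumptions."""

    def __init__(self, endpoint):
        self.endpoint = endpoint

    def build_request(self, element):
        # Turn one pipeline element (a dict) into a JSON POST request.
        data = json.dumps(element).encode("utf-8")
        return urllib.request.Request(
            self.endpoint, data=data,
            headers={"Content-Type": "application/json"})

    def process(self, element):
        # Called once per element; yields the decoded API response so
        # downstream ParDos can keep transforming it.
        with urllib.request.urlopen(self.build_request(element)) as resp:
            yield json.loads(resp.read())
```

Keeping the request-building separate from the network call makes the element-to-request logic easy to unit test without hitting the endpoint.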

Re: How to build a beam python pipeline which does GET/POST request to API's

2019-06-03 Thread Ankur Goenka
ications. > > Thanks, > Anjana

Re: How to build a beam python pipeline which does GET/POST request to API's

2019-06-03 Thread Ankur Goenka
> Thanks, > Anjana

[ANNOUNCE] Apache Beam 2.13.0 released!

2019-06-07 Thread Ankur Goenka
-- Ankur Goenka, on behalf of The Apache Beam team

Re: Parallel computation of windows in Flink

2019-06-10 Thread Ankur Goenka
Hi Mike, this can be because of the partitioning logic of the data. If possible, could you share your pipeline code at a high level? On Mon, Jun 10, 2019 at 12:58 PM Mike Kaplinskiy wrote: > > Ladder. The smart, modern way to insure your life. > > > On Mon, Jun 10, 2019 at

Re: grpc cancelled without enough information

2019-06-13 Thread Ankur Goenka
The error seems to originate from https://github.com/apache/beam/blob/master/sdks/java/fn-execution/src/main/java/org/apache/beam/sdk/fn/data/BeamFnDataGrpcMultiplexer.java#L166 Can you check what the endpoint for the Data Channel is, and why it was not set to begin with? A few more questions: *

Re: [ANNOUNCE] Spark portable runner (batch) now available for Java, Python, Go

2019-06-17 Thread Ankur Goenka
Thanks Kyle! This is a great addition towards supporting portability on Beam. On Mon, Jun 17, 2019 at 9:21 AM Ahmet Altay wrote: > Thank you Kyle! This is great news :) > > On Mon, Jun 17, 2019 at 6:40 AM Andres Angel < > ingenieroandresan...@gmail.com> wrote: > >> Really great achievement!!! co

Re: Dropping support for Flink 1.5/1.6

2019-08-13 Thread Ankur Goenka
Thanks for the heads-up. On Tue, Aug 13, 2019 at 3:08 AM Maximilian Michels wrote: > Dear users of the Flink Runner, > > Just a heads-up that we may remove Flink 1.5/1.6 support for future > versions of Beam (>=2.16.0): https://jira.apache.org/jira/browse/BEAM-7962 > > Why? We support 1.5 - 1.8 a

Re: Submitting job to AWS EMR fails with error=2, No such file or directory

2019-09-18 Thread Ankur Goenka
Adding to the previous suggestions: you can also add "--retain_docker_container" to your pipeline options and later log in to the machine to check the docker container logs. Also, in my experience running on yarn, the yarn user sometimes does not have access to use docker. I would suggest checking if t
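A sketch of where the flag from the reply would go; the runner and endpoint values are illustrative assumptions. With the flag set, SDK harness containers are kept after the job finishes so their logs can be inspected (e.g. via `docker ps -a` and `docker logs`):

```python
# Hypothetical portable-runner options; only --retain_docker_container
# comes from the reply, the other values are placeholders.
pipeline_args = [
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    "--retain_docker_container",  # keep stopped SDK containers for log inspection
]
```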

Re: Word-count example

2019-09-18 Thread Ankur Goenka
Hi Matthew, Beam 2.16.0 is not yet released, hence the error you are getting. Can you try using version 2.15.0? Thanks, Ankur On Wed, Sep 18, 2019 at 6:59 AM Matthew Patterson wrote: > Tried > > " > mvn archetype:generate \ > -DarchetypeGroupId=org.apache.beam \ > -DarchetypeArtifac

Re: Word-count example

2019-09-19 Thread Ankur Goenka
AM Matthew Patterson wrote: > Hi Ankur, > > Yes, I was using 2.15, but was getting failure to deserialize. > > Thanks, > Matt

Re: Streaming + Window not working when ran using direct runner

2019-09-20 Thread Ankur Goenka
On the surface, the pipeline looks fine. As this is a Java pipeline running locally, I would recommend checking the thread stack using jstack to see where the process is stuck. On Fri, Sep 20, 2019 at 10:39 AM Daniel Amadei wrote: > Hi all, > > I'm trying to run the following code using direct ru

Re: Streaming + Window not working when ran using direct runner

2019-09-20 Thread Ankur Goenka
a.lang.Thread.run(java.base@11.0.2/Thread.java:834) On Fri, Sep 20, 2019 at 11:44 AM Daniel Amadei wrote: > Hi Ankur, > > I tried to do that but nothing stood out to my eyes as the culprit. Stack trace > attached in case you want to take a look. > > Thanks > Daniel > > On Fri,

Re: How to reference manifest from apache flink worker node ?

2019-09-23 Thread Ankur Goenka
The Flink job server does not have an artifacts-dir option yet. We have a PR to add it: https://github.com/apache/beam/pull/9648 However, for now you can do a few manual steps to achieve this. Start the job server: 1. Download https://repo.maven.apache.org/maven2/org/apache/beam/beam-runners-flink-1.8-job-s

Re: How do you call user defined ParDo class from the pipeline in Portable Runner (Apache Flink) ?

2019-09-25 Thread Ankur Goenka
super has some issues while pickling in Python 3. Please refer to https://github.com/uqfoundation/dill/issues/300 for more details. Removing the reference to super in your DoFn should help. On Wed, Sep 25, 2019 at 5:13 PM Yu Watanabe wrote: > Thank you for the reply. > > "save_main_session" did not wor
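A stdlib-only sketch of the workaround described above. The class and method names are illustrative (a real DoFn would subclass `apache_beam.DoFn`); both variants behave identically, but the first carries the hidden `__class__` cell that dill can trip over:

```python
class Base:
    def setup(self):
        return "base-setup"

# Problematic with dill on Python 3: a zero-argument super() call makes
# the method carry a hidden __class__ cell, which dill may fail to
# pickle (see uqfoundation/dill#300 linked above).
class DoFnWithSuper(Base):
    def setup(self):
        return super().setup() + "/child"

# The workaround from the reply: drop the super() reference and call
# the base class explicitly instead.
class DoFnWithoutSuper(Base):
    def setup(self):
        return Base.setup(self) + "/child"
```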

Re: Beam with Flink without containers

2019-11-07 Thread Ankur Goenka
Hi Robert, different environment types are used to execute user code in a portable fashion. These environments are started and managed by the Beam layer on Flink and run on the workers. You need to create a script which honors the Beam portability API. The config contains the script name. You can use

Re: Beam on Flink with Python SDK and using GCS as artifacts directory

2019-12-20 Thread Ankur Goenka
Hi Matthew, it would be great if you can get the error from the failing process. My suspicions are: - the Flink task manager container's access to the GCS location; - as you are running the Flink task manager in a container, the "/beam_src_code/sdks/python/container/build/target/launcher/linux_amd64/boot\" path is no

Re: Help with setup.py

2020-01-27 Thread Ankur Goenka
You can possibly have a single setup.py which calls the appropriate setup logic based on a flag (task name). I think this Stack Overflow question explains it: https://stackoverflow.com/questions/677577/distutils-how-to-pass-a-user-defined-parameter-to-setup-py On Mon, Jan 27, 2020 at 10:02 AM André Roch
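A hypothetical sketch of the single-setup.py dispatch idea: the `--task` flag name and the requirement sets are illustrative assumptions (the exact mechanism is in the linked Stack Overflow answer):

```python
# Hypothetical: pop a custom --task=<name> flag from argv before
# handing the rest to setuptools.setup().
def select_requirements(argv):
    """Return (requirements for the selected task, remaining argv)."""
    task, remaining = "default", []
    for arg in argv:
        if arg.startswith("--task="):
            task = arg.split("=", 1)[1]
        else:
            remaining.append(arg)
    per_task = {
        "default": ["apache-beam"],
        "ml": ["apache-beam", "tensorflow"],  # illustrative task sets
    }
    return per_task.get(task, per_task["default"]), remaining

# In a real setup.py you would then call something like:
#   reqs, argv = select_requirements(sys.argv[1:])
#   setuptools.setup(..., install_requires=reqs)
```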

Re: [ANNOUNCE] Beam 2.18.0 Released

2020-01-28 Thread Ankur Goenka
Thanks Udi! On Tue, Jan 28, 2020 at 11:30 AM Yichi Zhang wrote: > Thanks Udi! > > On Tue, Jan 28, 2020 at 11:28 AM Hannah Jiang > wrote: > >> Thanks Udi! >> >> >> On Tue, Jan 28, 2020 at 11:09 AM Pablo Estrada >> wrote: >> >>> Thanks Udi! >>> >>> On Tue, Jan 28, 2020 at 11:08 AM Rui Wang wrot

Re: Running Beam on Flink

2020-02-07 Thread Ankur Goenka
It seems that pipeline submission from the SDK is not able to reach the job server which was started in Docker. Can you try running "telnet localhost 8099" to make sure that pipeline submission can reach the job server? On Thu, Feb 6, 2020 at 8:16 PM Xander Song wrote: > I am having difficulty followi
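If telnet is not installed, the same reachability check can be sketched in a few lines of stdlib Python (function name is illustrative):

```python
import socket

def can_reach(host, port, timeout=2.0):
    """Equivalent of the `telnet host port` check: True if a TCP
    connection to the endpoint succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. can_reach("localhost", 8099) before submitting the pipeline
```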

Re: Running Beam on Flink

2020-02-07 Thread Ankur Goenka
net localhost 8099, I receive > > Trying ::1... > > telnet: connect to address ::1: Connection refused > > Trying 127.0.0.1... > > telnet: connect to address 127.0.0.1: Connection refused > > telnet: Unable to connect to remote host > > > On Fri, Feb 7, 2

Re: Beam Katas YouTube

2020-03-27 Thread Ankur Goenka
Thanks for sharing. The video is very informative and entertaining. Nicely done 👏 On Thu, Mar 26, 2020 at 8:32 PM Henry Suryawirawan wrote: > Hello, > > Just would like to share that recently the Apache Beam Katas is featured > in the Google Cloud Level Up YouTube video ( > https://www.youtube.co

Re: Best Walkthrough/Tutorial Today (Data Scientist writing their own pipelines)

2020-04-22 Thread Ankur Goenka
To start, the Beam website has a list of learning material: https://beam.apache.org/documentation/resources/learning-resources/#getting-started I personally found the Beam Katas really useful: https://beam.apache.org/blog/2019/05/30/beam-kata-release.html Further, the Beam documentation: https://beam.apache.org/

Re: Set parallelism for each operator

2020-04-29 Thread Ankur Goenka
Beam does support parallelism for the job, which applies to all the transforms in the job, when executing on Flink using the "--parallelism" flag. From the use case you mentioned, the Kafka read operations will be over-parallelised, but that should be ok as they will only have a small amount of memory impa

Re: Difference between Flink runner and portable runner on Python

2021-01-14 Thread Ankur Goenka
DOCKER is the recommended way but for testing and development you can also use LOOPBACK. More details here https://github.com/apache/beam/blob/17fbe8be70b71bc1236e553a5fe7902a2df5dda4/sdks/python/apache_beam/options/pipeline_options.py#L1115 On Thu, Jan 14, 2021 at 11:40 AM Nir Gazit wrote: > Th
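The two environment types can be contrasted as pipeline options; the runner and endpoint values below are illustrative assumptions, only the `--environment_type` values come from the reply:

```python
# DOCKER (recommended): the SDK harness runs in a container per worker.
docker_args = [
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    "--environment_type=DOCKER",
]

# LOOPBACK: the SDK harness runs in the submitting process, which is
# convenient for local testing and development.
loopback_args = [
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    "--environment_type=LOOPBACK",
]
```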

Re: preserve order of records while writing in a file

2018-08-21 Thread Ankur Goenka
In the case of multiple files, you can use Dataflow to parallelize processing across individual files. However, as mentioned earlier, records within a single file are not worth parallelizing in this case. Your pipeline can start with a fixed set of file names followed by a GroupBy (to shuffle the file nam

Re: Advice for piping many CSVs with different columns names to one bigQuery table

2018-09-25 Thread Ankur Goenka
Hi Eila, if I understand correctly, the objective is to read a large number of CSV files, each of which contains a single row with multiple columns, deduplicate the columns in each file, and write them to BQ. You are using a pandas DataFrame to deduplicate the columns for a small set of files, which might not
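Since the per-file step is deduplicating column names, here is a stdlib-only sketch of it (the function name and keep-first policy are assumptions; the thread itself used a pandas DataFrame):

```python
def dedupe_columns(header):
    """Return the indices of columns to keep from a CSV header row,
    dropping repeated names and keeping the first occurrence of each
    (column order preserved)."""
    seen = set()
    keep = []
    for i, name in enumerate(header):
        if name not in seen:
            seen.add(name)
            keep.append(i)
    return keep
```

The returned indices can then be used to project both the header and the single data row before writing to BQ.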

Re: Advice for piping many CSVs with different columns names to one bigQuery table

2018-09-26 Thread Ankur Goenka
t your thoughts are > Many thanks, > Eila > > On Wed, Sep 26, 2018 at 12:04 PM OrielResearch Eila Arich-Landkof < > e...@orielresearch.org> wrote: > >> Hi Ankur, >> >> Thank you. Trying this approach now. Will let you know if I have any >> issue impleme

Re: Modular IO presentation at Apachecon

2018-09-26 Thread Ankur Goenka
Thanks for sharing. Great slides, and I'm looking forward to the recorded session. Do we have a central location where we link all the Beam presentations for discoverability? On Wed, Sep 26, 2018 at 9:35 PM Thomas Weise wrote: > Thanks for sharing. I'm looking forward to see the recording of the talk > (ho

Re: CoGroupByKey only on window end.

2018-10-01 Thread Ankur Goenka
CoGBK and GBK need consistent windowing across the input PCollections. In your case, a custom solution is needed. Here is another way which only needs pipeline orchestration and might be simpler. Let's say you have PCollection A with a 15-min window and PCollection B with a 1-min window. Step 1: GBK PCollection A for 1
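To see why CoGBK needs consistent windowing, here is a simplified sketch of fixed-window assignment (integer-second timestamps; this mirrors the idea behind Beam's FixedWindows, not its actual implementation):

```python
def fixed_window(timestamp, size_secs):
    """Return the (start, end) of the fixed window of length size_secs
    that contains the given timestamp."""
    start = timestamp - (timestamp % size_secs)
    return (start, start + size_secs)

# An element at t=61s lands in different windows under the two sizes,
# so its keys would never meet inside a single CoGBK window:
#   1-min windowing  -> window starting at 60
#   15-min windowing -> window starting at 0
```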

Re: Python SDK support for Flink runner

2018-10-15 Thread Ankur Goenka
Hi Wei, we have made substantial progress towards supporting Python on the Flink runner, though it's not production-ready yet. We will soon be updating the Beam website to better reflect Flink support, but for now here is how you can run the Python wordcount on Flink: https://beam.apache.org/contribute/portab

Re: Python SDK support for Flink runner

2018-10-15 Thread Ankur Goenka
Flink runner. Let me know if you are interested in contributing to expedite spark runner :) On Mon, Oct 15, 2018 at 3:34 PM Wei Yan wrote: > Great. Thanks, Ankur. I'll give a try. > BTW, do you happen to know the progress on spark runner support side? > > -Wei > > On Mon, O

Re: How to kick off a Beam pipeline with PortableRunner in Java?

2018-12-03 Thread Ankur Goenka
Thanks Ruoyun! For Flink, we use a different job server, which you can start using "./gradlew beam-runners-flink_2.11-job-server:runShadow". The host:port for this job server is localhost:8099. On Mon, Dec 3, 2018 at 2:24 PM Ruoyun Huang wrote: > Maybe this helps: > https://cwiki.apache.org/conflu

Re: Beam Python streaming pipeline on Flink Runner

2019-01-31 Thread Ankur Goenka
Hi Matthias, Unfortunately, unbounded reads including pubsub are not yet supported for portable runners. Thanks, Ankur On Thu, Jan 31, 2019 at 2:44 PM Matthias Baetens wrote: > Hi everyone, > > Last few days I have been trying to run a streaming pipeline (code on > Github

Re: Visual Beam - First demonstration - London

2019-02-11 Thread Ankur Goenka
Thanks for sharing the video. On Sun, Feb 10, 2019 at 12:49 PM Dan wrote: > Here's the video. Enjoy! > > https://skillsmatter.com/skillscasts/13405-how-to-run-kettle-on-apache-beam > > Sent from my phone > > On Wed, 6 Feb 2019, 5:03 pm Maximilian Michels >> Hi Dan, >> >> Thanks for the info. Wo

[ANNOUNCE] Beam 2.32.0 Released

2021-08-30 Thread Ankur Goenka
The Apache Beam team is pleased to announce the release of version 2.32.0. Apache Beam is an open source unified programming model to define and execute data processing pipelines, including ETL, batch and stream (continuous) processing. See https://beam.apache.org You can download the release her

Re: Local Test Beam Portable Flink Runner in mac

2022-11-14 Thread Ankur Goenka
Hi Lydia, the LOOPBACK environment is nothing but an EXTERNAL environment with automated setup of the SDK Harness process manager in the pipeline submission process. The LOOPBACK environment sets up a server on a random port in the submitting process to start the SDK Harness. This is what you are probably
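The "server on a random port" part can be sketched with the usual OS trick of binding to port 0 and letting the kernel choose (function name is illustrative, not from Beam's source):

```python
import socket

def pick_free_port():
    """Bind to port 0 so the OS assigns a free port — the common way a
    server on a 'random port' (here, for the SDK Harness) is set up."""
    with socket.socket() as s:
        s.bind(("localhost", 0))
        return s.getsockname()[1]
```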

Re: Dqchecks on beam

2023-01-03 Thread Ankur Goenka
Hi Marco, it is not very clear which checks you are interested in. Beam does not have any standard business-specific data quality checks. However, you can add your own checks at various stages of the pipeline. The checks will broadly fall into two categories. 1. Check a single element: There are ea
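A stdlib-only sketch of the first category (single-element checks); the field names and rules are assumptions for illustration. In a Beam pipeline such a predicate would run inside a Map/Filter/ParDo, typically routing failures to a dead-letter output:

```python
# Hypothetical per-element quality check; "id" and "amount" are
# illustrative field names, not from the original thread.
def is_valid(record):
    return (
        record.get("id") is not None
        and isinstance(record.get("amount"), (int, float))
        and record["amount"] >= 0
    )
```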

Re: Deduplicate usage

2023-03-02 Thread Ankur Goenka
Hi Binh, the Deduplicate transform uses the state API to do the de-duplication, which does the necessary bookkeeping to work across multiple concurrent workers. Thanks, Ankur On Thu, 2 Mar 2023 at 10:18, Binh Nguyen Van wrote: > Hi, > > I am writing a pipeline and want to apply deduplication. I lo
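A stdlib sketch of the bookkeeping Deduplicate performs per key/window: remember what has been seen and emit each element only the first time. (In Beam the "seen" record lives in the state API, which is what keeps it consistent across concurrent workers; the class here is purely illustrative.)

```python
class Deduper:
    """Toy in-memory analogue of per-key/window deduplication state."""

    def __init__(self):
        self._seen = set()

    def process(self, element):
        # Emit only first occurrences; duplicates are silently dropped.
        if element not in self._seen:
            self._seen.add(element)
            yield element
```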