Re: pubsub -> IO

2019-07-17 Thread Eugene Kirpichov
I think full-blown SDF is not needed for this - someone just needs to implement a MongoDbIO.readAll() variant, using a composite transform. The regular pattern for this sort of thing will do (ParDo split, reshuffle, ParDo read). Whether it's worth replacing MongoDbIO.read() with a redirect to readA
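
The "ParDo split, reshuffle, ParDo read" shape mentioned above can be modeled in a few lines of plain Python. This is only a sketch of the data flow, not Beam or MongoDbIO code: `FAKE_COLLECTION`, `split`, `reshuffle`, `read`, and `read_all` are hypothetical names, and a query is assumed to be a simple `(start, end)` range.

```python
import random
from typing import Iterable, Iterator, List, Tuple

# Hypothetical stand-in for a database collection: documents 0..9.
FAKE_COLLECTION = list(range(10))

Range = Tuple[int, int]


def split(query: Range, num_splits: int) -> Iterator[Range]:
    """Step 1 (split): turn one query range into several sub-ranges."""
    start, end = query
    step = max(1, -(-(end - start) // num_splits))  # ceiling division
    for lo in range(start, end, step):
        yield (lo, min(lo + step, end))


def reshuffle(items: Iterable[Range]) -> List[Range]:
    """Step 2 (reshuffle): redistribute sub-ranges so each can be read independently."""
    shuffled = list(items)
    random.shuffle(shuffled)
    return shuffled


def read(sub_range: Range) -> Iterator[int]:
    """Step 3 (read): read the records covered by one sub-range."""
    lo, hi = sub_range
    yield from FAKE_COLLECTION[lo:hi]


def read_all(queries: Iterable[Range]) -> Iterator[int]:
    """Composite of the three steps, applied to every incoming query."""
    for query in queries:
        for sub_range in reshuffle(split(query, num_splits=3)):
            yield from read(sub_range)


records = sorted(read_all([(0, 4), (4, 10)]))
```

In a real pipeline each step would be its own transform, and the reshuffle is what lets a runner spread the sub-ranges across workers.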

Re: Design patterns while using Beam

2019-09-23 Thread Eugene Kirpichov
There's also https://beam.apache.org/contribute/ptransform-style-guide/ (disclaimer: I'm the main author) which is currently under "Technical docs" but probably should be moved under "Patterns" instead. On Mon, Sep 23, 2019 at 8:27 AM Chad Dombrova wrote: > There are also these two helpful artic

Re: What's the difference between Cloud Dataflow and Spark in terms of the execution?

2019-09-30 Thread Eugene Kirpichov
Hi, Cloud Dataflow also has worker nodes and a master - though the master is part of the Cloud Dataflow service and runs on Google's internal servers. I believe all distributed data processing tools use a similar architecture. Some major differences I can point out quickly: - As you said, Sp

Re: Joining PCollections to aggregates of themselves

2019-10-10 Thread Eugene Kirpichov
" input elements can pass through the Joiner DoFn before the sideInput corresponding to that element is present" I don't think this is correct. Runners will evaluate a DoFn with side inputs on elements in a given window only after all side inputs are ready (have triggered at least once) in this wi

Re: Sharding in apache beam

2019-11-07 Thread Eugene Kirpichov
I think this might help you: (in FileIO.write()) public Write<DestinationT, UserT> withSharding(PTransform<PCollection<UserT>, PCollectionView<Integer>> sharding) { So as long as you can write such a transform, you can control the sharding "dynamically". On Thu, Nov 7, 2019 at 11:19 AM Luke Cwik wrote: > By not specifying an exp

Re:

2019-11-19 Thread Eugene Kirpichov
On Tue, Nov 19, 2019 at 1:56 AM 👌👌 <1150693...@qq.com> wrote: > Hello! I am a beam user. > I want to ask you two questions. > First. > I use beam of my project and my data is jsonobject.I find in my flow,the > data will be Serializable and deSerializable many times,but i am not know > where use S

Re: JDBCIO reader + BigQuery writer extremly slow due to bundle size = 1

2020-01-02 Thread Eugene Kirpichov
On Thu, Jan 2, 2020 at 11:52 AM Chamikara Jayalath wrote: > > > On Thu, Jan 2, 2020 at 10:10 AM Konstantinos P. > wrote: > >> Hi! >> >> I have setup a beam pipeline to read from a postGreSQL server and write >> to BigQuery table and it takes for ever even for size of 20k records. While >> invest

Re: snowflake source/sink

2020-01-29 Thread Eugene Kirpichov
Note that https://nl.devoteam.com/en/blog-post/querying-jdbc-database-parallel-google-dataflow-apache-beam/ , while focused on the JdbcIO connector, still explains how to read from a database in parallel - that solution would equally apply to your code, because it is not specific to JDBC. You'd nee

Re: Apply Wait.on() pattern after AvroIO, KinesisIO writes

2020-06-27 Thread Eugene Kirpichov
If a transform doesn't return something waitable, there is no way to wait on it. However: * AvroIO.write is waitable - if not through AvroIO.write() (I don't remember off the top of my head), then at least through FileIO.write().via(AvroIO.sink()). * KinesisIO.write is *very* easy to change to be w

Staged PIP package mysteriously ungzipped, non-installable inside the worker

2020-08-07 Thread Eugene Kirpichov
om/how-to-run-commands-in-stopped-docker-containers> ): # file /tmp/staged/alembic-1.4.2.tar.gz /tmp/staged/alembic-1.4.2.tar.gz: POSIX tar archive (GNU) # ls -l /tmp/staged/alembic-1.4.2.tar.gz -rwxr-xr-x 1 root root 4730880 Aug 8 01:17 /tmp/staged/alembic-1.4.2.tar.gz The file has clearl

Re: Staged PIP package mysteriously ungzipped, non-installable inside the worker

2020-08-07 Thread Eugene Kirpichov
ics) --> > https://2020.beamsummit.org/sessions/workshop-using-conda-on-beam/ > > I'm sure others will have more things to say that are actually helpful, > on-list, before that occurs (~3 weeks). > > > > On Fri, Aug 7, 2020 at 6:32 PM Eugene Kirpichov > wrote: > >>

Re: Staged PIP package mysteriously ungzipped, non-installable inside the worker

2020-08-10 Thread Eugene Kirpichov
't find any mention of compression/decompression in the container boot code. My next step will be to add a bunch of debugging, rebuild the containers, and see what the artifact services think they're serving. On Fri, Aug 7, 2020 at 6:47 PM Eugene Kirpichov wrote: > Thanks Austin! Good s

Re: Staged PIP package mysteriously ungzipped, non-installable inside the worker

2020-08-10 Thread Eugene Kirpichov
s some debugging. > > cc: +Ankur Goenka +Kyle Weaver who > recently worked on Portable Runner and may be interested. > > By the way, you should be able to use custom containers with Dataflow, if > you set --experiments=use_runner_v2. > > On Mon, Aug 10, 2020 at 9:06 AM Eugen

Re: Staged PIP package mysteriously ungzipped, non-installable inside the worker

2020-08-11 Thread Eugene Kirpichov
e pipeline with the > FlinkRunner? Use: --runner=FlinkRunner and do not specify an endpoint. > It will run the Python code embedded (loopback environment) without > additional containers. > > Cheers, > Max > > On 10.08.20 21:59, Eugene Kirpichov wrote: > > Thanks Vale

Re: Staged PIP package mysteriously ungzipped, non-installable inside the worker

2020-08-11 Thread Eugene Kirpichov
On Tue, Aug 11, 2020 at 1:43 PM Eugene Kirpichov wrote: > Hi Maximilian, > > Thank you - it works fine with the embedded Flink runner (per below, seems > like it's not using Docker for running Python code? What is it using then?). > > However, the original bug appears t

Re: Staged PIP package mysteriously ungzipped, non-installable inside the worker

2020-08-11 Thread Eugene Kirpichov
(FYI Sam +sbrot...@gmail.com ) On Tue, Aug 11, 2020 at 5:00 PM Eugene Kirpichov wrote: > Ok I found the bug, and now I don't understand how it could have possibly > ever worked. And if this was never tested, then I don't understand why it > works after fixing this one bug :

Re: Staged PIP package mysteriously ungzipped, non-installable inside the worker

2020-08-13 Thread Eugene Kirpichov
FWIW I sent a PR to fix this https://github.com/apache/beam/pull/12571 However, I'm not up to date on the portable test infrastructure and would appreciate guidance on what tests I can add for this. On Tue, Aug 11, 2020 at 5:28 PM Eugene Kirpichov wrote: > (FYI Sam +sbrot...@gmail.com

Re: Staged PIP package mysteriously ungzipped, non-installable inside the worker

2020-08-13 Thread Eugene Kirpichov
Kirpichov wrote: > FWIW I sent a PR to fix this https://github.com/apache/beam/pull/12571 > > However, I'm not up to date on the portable test infrastructure and would > appreciate guidance on what tests I can add for this. > > On Tue, Aug 11, 2020 at 5:28 PM Eugene Kirpichov

Re: Staged PIP package mysteriously ungzipped, non-installable inside the worker

2020-08-13 Thread Eugene Kirpichov
+Daniel as in charge of 2.24 per dev@ thread. On Thu, Aug 13, 2020 at 6:24 PM Eugene Kirpichov wrote: > The PR is merged. > > Do folks think this warrants being cherrypicked into v2.24? My hunch is > yes, cause basically one of the runners (local portable python runner) is >

Re: Staged PIP package mysteriously ungzipped, non-installable inside the worker

2020-08-18 Thread Eugene Kirpichov
> > On Mon, Aug 17, 2020 at 11:27 AM Valentyn Tymofieiev < > valen...@google.com> wrote: > > >> > > >> Will defer to the release manager; one reason to cherry-pick is that > 2.24.0 will be the last release with Python 2 support, so Py2 users of > Portable Pyt

Getting Beam(Python)-on-Flink-on-k8s to work

2020-08-26 Thread Eugene Kirpichov
n starts. I looked at a few conference talks: - https://www.cncf.io/wp-content/uploads/2020/02/CNCF-Webinar_-Apache-Flink-on-Kubernetes-Operator-1.pdf - seems to imply that I need to add a Beam worker "sidecar" to the Flink workers; and that I need to submit my job using "kubectl appl

Re: Getting Beam(Python)-on-Flink-on-k8s to work

2020-08-27 Thread Eugene Kirpichov
ob server? > > I'm guessing it goes through k8s for monitoring purposes. I see no reason > it shouldn't be possible to submit to the job server directly through > Python, network permitting, though I haven't tried this. > > > > On Wed, Aug 26, 2020 at 4:10 PM Eugene

Re: Getting Beam(Python)-on-Flink-on-k8s to work

2020-08-28 Thread Eugene Kirpichov
ful. > > The documents you linked are very informative. It would be great to > aggregate all this into digestible documentation. Let me know if you have > any further questions! > > Cheers, > Sam > > On Thu, Aug 27, 2020 at 10:25 AM Eugene Kirpichov > wrote: > >

Re: Getting Beam(Python)-on-Flink-on-k8s to work

2020-08-28 Thread Eugene Kirpichov
he unreleased Beam SDK, and should be fixed if running on clean Beam 2.24. Thank you again! On Fri, Aug 28, 2020 at 10:21 AM Eugene Kirpichov wrote: > Holy shit, thanks Sam, this is more help than I could have asked for!! > I'll give this a shot later today and report back. >

Re: Getting Beam(Python)-on-Flink-on-k8s to work

2020-08-28 Thread Eugene Kirpichov
at 4:54 PM Eugene Kirpichov wrote: > Hi Sam, > > You're a wizard - this got me *way* farther than my previous attempts. > Here's a PR https://github.com/sambvfx/beam-flink-k8s/pull/1 with a > couple of changes I had to make. > > I had to make some additional changes

Re: Getting Beam(Python)-on-Flink-on-k8s to work

2020-08-28 Thread Eugene Kirpichov
wn issue: https://issues.apache.org/jira/browse/BEAM-10762 > > On Fri, Aug 28, 2020 at 4:57 PM Eugene Kirpichov > wrote: > >> P.S. Ironic how back in 2018 I was TL-ing the portable runners effort for >> a few months on Google side, and now I need community help to get it to >

Re: Getting Beam(Python)-on-Flink-on-k8s to work

2020-08-29 Thread Eugene Kirpichov
eam_flink1.10_job_server:2.23.0" with a custom-built one with the Beam 2.24 snapshot we're using. > Cheers, > Sam > > On Fri, Aug 28, 2020 at 5:50 PM Eugene Kirpichov > wrote: > >> Woohoo thanks Kyle, adding --save_main_session made it work!!! >> >

Re: Dataflow isn't parallelizing

2020-09-11 Thread Eugene Kirpichov
allelize this step :( > > Any advice on what else to try to make it do this? > > Thanks so much! > -- Eugene Kirpichov http://www.linkedin.com/in/eugenekirpichov

Re: Unsubscribe

2017-01-23 Thread Eugene Kirpichov
To manage your subscriptions to these mailing lists, please see instructions at https://beam.apache.org/get-started/support/ . Please do not post anything more via "reply all" to this thread. On Mon, Jan 23, 2017 at 2:47 PM amir bahmanyari wrote: > Unsubscribe > > > > >

A batch of minor incompatible style changes coming

2017-03-01 Thread Eugene Kirpichov
Hello, Recently we introduced a style guide for developing PTransforms https://beam.apache.org/contribute/ptransform-style-guide/ and a natural consequence of that was making Beam itself comply with its own API best practices. That work is tracked in JIRA https://issues.apache.org/jira/browse/BEA

Re: grpc IO?

2017-03-10 Thread Eugene Kirpichov
Can you clarify the requirements? E.g. show a short code snippet with the hypothetical grpc IO and explain what it should do, so we can compare it to how the same task would be done without it. On Fri, Mar 10, 2017 at 10:18 AM Borisa Zivkovic wrote: > ok thanks JB, > > I guess I can create JIRA a

Re: grpc IO?

2017-03-10 Thread Eugene Kirpichov
Your first option would be to simply do RPCs from your ParDo's. Have you tried that / is it not working for some reason / is the code turning out unnecessarily complex or not performant enough etc.? On Fri, Mar 10, 2017 at 12:36 PM Borisa Zivkovic wrote: > well what we are looking into is using g

Re: grpc IO?

2017-03-10 Thread Eugene Kirpichov
em, > we can always use custom solution here... > > On Fri, 10 Mar 2017 at 13:01 Eugene Kirpichov > wrote: > > Your first option would be to simply do RPCs from your ParDo's. > Have you tried that / is it not working for some reason / is the code > turning out unnecessa

Re: grpc IO?

2017-03-13 Thread Eugene Kirpichov
ew weeks and we keep JIRA open. After that I will > either close JIRA or propose how API would look like. > > makes sense? > > On Fri, 10 Mar 2017 at 14:20 Eugene Kirpichov > wrote: > > Hi, > > I want to clarify: I am not pushing back on your proposal, merely trying >

Re: Approach to writing to Redis in Streaming Pipeline

2017-03-16 Thread Eugene Kirpichov
Yes, please use a ParDo. The Custom Sink API is not intended for use with unbounded collections (i.e. in pretty much any streaming pipelines) and it's generally due for a redesign. ParDo is currently almost always a better choice when you want to implement a connector writing data to a third-party

Re: IO's for image processing

2017-04-12 Thread Eugene Kirpichov
Caveat: you need to break fusion between list_of_images and processed_images if you're using a runner that supports fusion, like Dataflow. See https://cloud.google.com/dataflow/service/dataflow-service-desc#preventing-fusion On Wed, Apr 12, 2017 at 1:23 PM Sourabh Bajaj wrote: > Hi, > > One idea

Re: Using watermarks with bounded sources

2017-04-22 Thread Eugene Kirpichov
Hi! This is an excellent question; don't have time to reply in much detail right now, but please take a look at http://s.apache.org/splittable-do-fn - it unifies the concepts of bounded and unbounded sources, and the use case you mentioned is one of the motivating examples. Also, see recent discus

Re: HadoopInputFormatIO - HBase - Flink runner

2017-05-10 Thread Eugene Kirpichov
Hi Seshadri, There are two things you need to configure: 1) Running on Flink runner. For this you only need to set the runner on your PipelineOptions: options.setRunner(FlinkRunner.class) (plus whatever other options the Flink runner needs - see https://beam.apache.org/documentation/runners/flink/) 2)

Re: Enriching stream messages based on external data

2017-06-01 Thread Eugene Kirpichov
It's probably because of the BigtableSession variable - mark it transient. On Thu, Jun 1, 2017 at 3:33 PM Csaba Kassai wrote: > Hi Gwilym, > > try to extract the DoFn into a separate static inner class or into a > separate file as a top level class, instead of declaring as an > anonymous inner c
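
The thread concerns Java, where `transient` excludes a field from serialization. As a rough Python analog, and only a sketch under that assumption, a pickled object can drop its non-serializable member in `__getstate__` and recreate it later. `EnrichFn`, `setup`, and `_session` are hypothetical names, and a `threading.Lock` stands in for a session object that cannot be serialized.

```python
import pickle
import threading


class EnrichFn:
    """Hypothetical stand-in for a DoFn holding a non-serializable session."""

    def __init__(self, table_name):
        self.table_name = table_name
        self._session = None  # deliberately not created at construction time

    def __getstate__(self):
        # Analog of marking the field transient: never serialize the session.
        state = self.__dict__.copy()
        state["_session"] = None
        return state

    def setup(self):
        # A lock cannot be pickled, so it plays the role of the session here.
        self._session = threading.Lock()


fn = EnrichFn("users")
fn.setup()

# Without __getstate__ this round trip would raise TypeError because of the lock.
clone = pickle.loads(pickle.dumps(fn))
```

The clone keeps its configuration but arrives without a session, so it has to call `setup()` again before use, which is the same contract a transient field imposes in Java.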

Re: Scaling dataflow python SDK 2.0.0

2017-06-05 Thread Eugene Kirpichov
Do you have a Dataflow job ID to look at? It might be due to fusion https://cloud.google.com/dataflow/service/dataflow-service-desc#preventing-fusion On Mon, Jun 5, 2017 at 12:13 PM Prabeesh K. wrote: > Please try using *--worker_machine_type* n1-standard-4 or n1-standard-8 > > On 5 June 2017 at

Re: Scaling dataflow python SDK 2.0.0

2017-06-05 Thread Eugene Kirpichov
(200) -> combine (20M) -> clean (20M) -> filter (20M) -> insert >>>> >>>> Will become: >>>> Read (200) -> combine (20M) -> GroupByKey (20M) -> ungroup (20M) -> >>>> clean (20M) -> filter (20M) -> insert >>>> >>

Re: Scaling dataflow python SDK 2.0.0

2017-06-10 Thread Eugene Kirpichov
better. >>>> >>>> Regards, >>>> >>>> *Sébastien MORAND* >>>> Team Lead Solution Architect >>>> Technology & Operations / Digital Factory >>>> Veolia - Group Information Systems & Technology (IS&T)

Re: Action in the pipeline after Write

2017-06-10 Thread Eugene Kirpichov
Hi! It sounds like you want to write data to BigQuery and then load the same data back from BigQuery? Why? I'm particularly confused by your comment "nothing left in the PCollection" - writing a collection to BigQuery doesn't remove data from the collection, a PCollection is just a logical descript

Re: Action in the pipeline after Write

2017-06-11 Thread Eugene Kirpichov
com> > <https://www.facebook.com/veoliaenvironment/> > <https://www.youtube.com/user/veoliaenvironnement> > <https://www.linkedin.com/company/veolia-environnement> > <https://twitter.com/veolia> > > On 11 June 2017 at 04:14, Eugene Kirpichov wrote: > >>

Re: Scaling dataflow python SDK 2.0.0

2017-06-11 Thread Eugene Kirpichov
57 71 08 > <+33%201%2085%2057%2071%2008> > Bureau 0144C (Ouest) > 30, rue Madeleine-Vionnet - 93300 Aubervilliers, France > *www.veolia.com <http://www.veolia.com>* > <http://www.veolia.com> > <https://www.facebook.com/veoliaenvironment/> > <https://www.y

Re: Scaling dataflow python SDK 2.0.0

2017-06-11 Thread Eugene Kirpichov
Oh sorry, I think I misunderstood you. Looks like the tar.gz file contains the actual data, not just filenames. Then my suggestion doesn't apply. On Sun, Jun 11, 2017, 9:54 AM Eugene Kirpichov wrote: > Seems like there's a lot of parallelism you could exploit here and make > t

Re: Action in the pipeline after Write

2017-06-11 Thread Eugene Kirpichov
(IS&T) >>>> Cell.: +33 7 52 66 20 81 / Direct: +33 1 85 57 71 08 >>>> <+33%201%2085%2057%2071%2008> >>>> Bureau 0144C (Ouest) >>>> 30, rue Madeleine-Vionnet - 93300 Aubervilliers, France >>>> *www.veolia.com <http://www.veolia.com&

Re: Redshift source for Python

2017-06-12 Thread Eugene Kirpichov
1: Using BoundedSource is not an antipattern per se. It is *recommended* in case you are able to use the capabilities that it provides over a ParDo - otherwise, it's recommended to use ParDo: see https://beam.apache.org/documentation/io/authoring-overview/#when-to-implement-using-the-source-api .

Re: Creating side input map with global window

2017-06-14 Thread Eugene Kirpichov
Seems related to https://issues.apache.org/jira/browse/BEAM-1197? On Wed, Jun 14, 2017 at 11:39 AM Kevin Peterson wrote: > Hi all, > > I am working on a (streaming) pipeline which reads elements from Pubsub, > and schemas for those elements from a separate pubsub topic. I'd like to be > able to

Re: Using watermarks with bounded sources

2017-06-19 Thread Eugene Kirpichov
> > Local Time: April 23, 2017 10:18 AM > UTC Time: April 23, 2017 2:18 PM > From: p...@protonmail.com > To: Eugene Kirpichov > user@beam.apache.org > > Ah, I didn't know about that. This is *really* great -- from a quick look, > the API looks both very natural and v

Re: Using watermarks with bounded sources

2017-06-20 Thread Eugene Kirpichov
Hi! The PR just got submitted. You can play with SDF in Dataflow streaming runner now :) Hope it doesn't get rolled back (fingers crossed)... On Mon, Jun 19, 2017 at 6:06 PM Eugene Kirpichov wrote: > Hi, > The PR is ready and I'm just struggling with setup of tests - Dataflow &

Re: Using watermarks with bounded sources

2017-06-21 Thread Eugene Kirpichov
> UTC Time: June 20, 2017 11:52 PM > From: kirpic...@google.com > > To: peay , k...@google.com > user@beam.apache.org > > Hi! > > The PR just got submitted. You can play with SDF in Dataflow streaming > runner now :) Hope it doesn't get rolled back (fingers c

Re: Looking for a good "write-here-if-fails" pattern

2017-06-22 Thread Eugene Kirpichov
I don't think it's possible to come up with something that will catch failing elements at the level of whole PTransform's, because: - You can't return elements failing some internal transform of the composite PTransform, because 1) they are a private implementation detail of that transform and shou

Re: Using watermarks with bounded sources

2017-06-22 Thread Eugene Kirpichov
Shading issues have been fixed. I believe it should be all good to use now. On Wed, Jun 21, 2017 at 2:24 PM Eugene Kirpichov wrote: > Yes, there's a pending PR https://github.com/apache/beam/pull/3360 > > Note that there are some shading issues with the Dataflow streaming runner

Re: Windowing a batch (python SDK 2.0.0)

2017-06-26 Thread Eugene Kirpichov
bastien, >> >> >> >> Streaming is not supported in the Python SDK yet. There is a steady >> >> progress on that and soon we will be able to offer it. It is already >> >> supported in Java. >> >> >> >> Thank you, >> >

Re: Windowing a batch (python SDK 2.0.0)

2017-06-27 Thread Eugene Kirpichov
w.veolia.com <http://www.veolia.com>* > <http://www.veolia.com> > <https://www.facebook.com/veoliaenvironment/> > <https://www.youtube.com/user/veoliaenvironnement> > <https://www.linkedin.com/company/veolia-environnement> > <https://twitter.com/veolia> > &

Re: Missing getOptions on Pipeline class

2017-07-26 Thread Eugene Kirpichov
Hi Csaba, getOptions() was removed, and capturing PipelineOptions in the transform constructor is discouraged (or perhaps forbidden - not sure) because of the addition of templates (ValueProvider's) - the pipeline may be constructed, saved in a template, and then the template can be run with diffe

Re: Two example pipelines built by Yahoo intern

2017-08-08 Thread Eugene Kirpichov
Hi Claire, Thank you - happy to see a paper with such a detailed description of your experience with both usability of Beam per se and the execution on the Flink runner! The paper looks well-written, and, from a quick look at the code, it seems to be using the Beam API properly without obvious opp

Re: Two example pipelines built by Yahoo intern

2017-08-08 Thread Eugene Kirpichov
+Aljoscha Krettek for comments on Flink runner +Thomas Weise likewise for Apex runner On Tue, Aug 8, 2017 at 4:52 PM Eugene Kirpichov wrote: > Hi Claire, > > Thank you - happy to see a paper with such a detailed description of your > experience with both usability of Beam pe

New blog post: Splittable DoFn

2017-08-16 Thread Eugene Kirpichov
Hi all, The blog post "Powerful and modular IO connectors with Splittable DoFn in Apache Beam" just went live - take a look! *One of the most important parts of the Apache Beam ecosystem is its quickly growing set of connectors that al

Re: Testing transforms with generics using PAssert.containsAnyOrder

2017-09-20 Thread Eugene Kirpichov
The error suggests that you .setCoder() on your resulting collection, have you tried that? The error seems unrelated to PAssert. On Wed, Sep 20, 2017, 7:46 AM Daniel Harper wrote: > I’ve been able to use generics + PAssert testing before, I have a class > that looks like this > > public class Ta

Re: [Call for items] Beam October Newsletter

2017-10-02 Thread Eugene Kirpichov
Hi Griselda, Thanks for organizing this! A few questions: - What time period shall this newsletter cover? It will be more clear in future newsletters, but currently I'm not sure how far back to go in my memory :) - Should this focus only on things that are visible to users or will be at some point

Re: Run a ParDo after a Partition

2017-10-03 Thread Eugene Kirpichov
Hmm, partition + flatten is semantically a no-op (it, at most, may or may not cause the intermediate dataset to be materialized, there are no guarantees) - why do you need the Partition in the first place? On Tue, Oct 3, 2017 at 6:18 AM Tim Robertson wrote: > > Answering my own question. > > Aft
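
The claim that partition followed by flatten is semantically a no-op can be checked on a small in-memory model. This is plain Python rather than Beam code: `partition` and `flatten` are hypothetical helpers, and element order is ignored because collections of this kind are unordered.

```python
from collections import Counter
from itertools import chain


def partition(elements, num_partitions, partition_fn):
    """Route each element to exactly one of num_partitions buckets."""
    parts = [[] for _ in range(num_partitions)]
    for element in elements:
        parts[partition_fn(element, num_partitions)].append(element)
    return parts


def flatten(parts):
    """Merge all buckets back into a single collection."""
    return list(chain.from_iterable(parts))


elements = ["a", "bb", "ccc", "dd", "e"]
parts = partition(elements, 3, lambda element, n: len(element) % n)
round_trip = flatten(parts)

# Same elements with the same multiplicities: nothing was added or dropped.
same_contents = Counter(round_trip) == Counter(elements)
```

Because every element lands in exactly one bucket, merging the buckets reproduces the input, so any observable effect has to come from what is done between the two steps.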

Re: [Call for items] Beam October Newsletter

2017-10-04 Thread Eugene Kirpichov
Thanks Griselda, I added a bunch of items that I participated in, and encourage others to do the same! On Wed, Oct 4, 2017 at 3:48 PM Griselda Cuevas wrote: > Hi everyone - just a reminder to add updates to our Newsletter, it needs > some love <3 > > If you have questions on what to add or just

Re: What is the recommended approach to use a static variable in a DoFn?

2017-10-07 Thread Eugene Kirpichov
Hi, I'm not sure what you mean by this: "But they are non-serializable so I can't just create a static constructor and create it while starting the pipeline." You can definitely use static variables in DoFn's, same way as you can use them in any other Java code. I'm not sure how serializability is

Re: Is there a way to access PipelineOptions in DoFn.Setup?

2017-10-10 Thread Eugene Kirpichov
I don't think there is a way to do this, and I agree that this is an oversight. Feel free to file a JIRA. Depending on what you're doing, using a ValueProvider may be sufficient - their run-time values will be accessible in Setup. On Tue, Oct 10, 2017 at 1:37 PM Derek Hao Hu wrote: > Hi, > > I'm

Re: DoFn setup/teardown sequence

2017-10-16 Thread Eugene Kirpichov
A worker can execute several instances of the same DoFn at the same time. They will be clones of the original DoFn specified in the pipeline and an individual instance will be accessed only serially but not concurrently. On Mon, Oct 16, 2017, 8:38 AM Jacob Marble wrote: > Perfect, thanks. > > Ja

Re: DoFn setup/teardown sequence

2017-10-16 Thread Eugene Kirpichov
- can you explain a bit more about > "an individual instance will be accessed only serially but not > concurrently"? > > Thanks, > > Derek​ > > On Mon, Oct 16, 2017 at 8:50 AM, Eugene Kirpichov > wrote: > >> A worker can execute several instances of the s

Re: [VOTE] [DISCUSSION] Remove support for Java 7

2017-10-17 Thread Eugene Kirpichov
+1 to removing Java 7 support. In terms of release 3.0, we can handle this two ways: - Wait until enough other potentially incompatible changes accumulate, do all of them, and call it a "3.0" release, so that 3.0 will truly differ in a lot of incompatible and hopefully nice ways from 2.x. This mig

Re: KafkaIO and Avro

2017-10-18 Thread Eugene Kirpichov
It seems that KafkaAvroDeserializer implements Deserializer<Object>, though I suppose with proper configuration that Object will at run-time be your desired type. Have you tried adding some Java type casts to make it compile? On Wed, Oct 18, 2017 at 7:26 AM Tim Robertson wrote: > I just tried quickly an

Re: [VOTE] [DISCUSSION] Remove support for Java 7

2017-10-18 Thread Eugene Kirpichov
l into the camp that this isn't >>>>> sufficiently incompatible to warrant a major version increase. >>>>> Semantic versioning is all about messaging, and upgrading the major >>>>> version so soon after GA for such a minor change would IMHO cause m

Re: KafkaIO and Avro

2017-10-19 Thread Eugene Kirpichov
>>>> props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, >>>> "kafka:9092"); >>>> >>>> props.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, " >>>> http://registry:8081"); >>>> >>>>

Re: KafkaIO and Avro

2017-10-19 Thread Eugene Kirpichov
>> similarly with >> (Class>) KafkaAvroDeserializer.class >> >> >> >> On Thu, Oct 19, 2017 at 9:00 PM, Eugene Kirpichov >> wrote: >> >>> I don't think extending the class is necessary. Not sure I understand >>> why a

Re: KafkaIO and Avro

2017-10-20 Thread Eugene Kirpichov
> > > On Fri, Oct 20, 2017, at 06:57 AM, Tim Robertson wrote: > > Thanks Eugene > > On Thu, Oct 19, 2017 at 9:36 PM, Raghu Angadi wrote: > > Ah, nice. It works. > > On Thu, Oct 19, 2017 at 1:44 PM, Eugene Kirpichov > wrote: > > The following compiles fine:

Re: PipelineTest with TestStreams: unable to serialize

2017-11-03 Thread Eugene Kirpichov
When debugging serialization exceptions, I always find it very helpful to use -Dsun.io.serialization.extendedDebugInfo=true . On Fri, Nov 3, 2017 at 9:21 AM Aleksandr wrote: > Hello, > Probably error is in your tuple tag classes, which are anonymous classes. > It means that your test is trying

Re: IBM Streams now supports Apache Beam Java applications

2017-11-07 Thread Eugene Kirpichov
Wow, this is very exciting, thank you for the announcement! This obviously provokes curiosity, so a few questions: - What are some features unique to this runner, i.e. features that might make somebody who isn't using IBM Streams to consider using it just because of how good is the Beam experience

Re: PubsubIO unable to set topic and subscription

2017-11-13 Thread Eugene Kirpichov
The error says "Can't set both the topic and the subscription": PubSub subscribers read from a subscription, and messages sent to a topic are sent to all subscriptions bound to this topic. That's why, when reading from PubSub in Beam you can specify either a topic (then a new subscription to this t
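
The topic and subscription relationship described above can be illustrated with a toy in-memory model. This is not the Pub/Sub client API: `FakePubSub` and its methods are hypothetical, and the model captures only the rule that a message published to a topic is delivered to every subscription bound to it.

```python
from collections import defaultdict


class FakePubSub:
    """Toy model: each topic maps subscription names to their own queues."""

    def __init__(self):
        self._subscriptions = defaultdict(dict)

    def create_subscription(self, topic, name):
        self._subscriptions[topic][name] = []

    def publish(self, topic, message):
        # Every subscription bound to the topic receives its own copy.
        for queue in self._subscriptions[topic].values():
            queue.append(message)

    def pull(self, topic, name):
        # Readers consume from a subscription, never from the topic itself.
        queue = self._subscriptions[topic][name]
        messages = list(queue)
        queue.clear()
        return messages


bus = FakePubSub()
bus.create_subscription("events", "sub-a")
bus.create_subscription("events", "sub-b")
bus.publish("events", "hello")

pulled_a = bus.pull("events", "sub-a")
pulled_b = bus.pull("events", "sub-b")
pulled_a_again = bus.pull("events", "sub-a")
```

Since reading always happens through a subscription, a reader needs either an existing subscription or a topic from which a subscription can be derived; supplying both is redundant.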

Re: @DoFn.Setup not called

2017-11-16 Thread Eugene Kirpichov
Could you give more details, e.g. a code snippet that reproduces the issue, and describe how you determine that @Setup hasn't been called? On Thu, Nov 16, 2017 at 6:58 PM Derek Hao Hu wrote: > ​I've been using DoFn.Setup method in Dataflow and it seems to be working > fine.​ > > On Thu, Nov 16,

Re: unique DoFn id

2017-11-19 Thread Eugene Kirpichov
You could create a private variable with a UUID, filled in in @Setup or (if you're hitting that bug where @Setup wasn't being called) in readObject()? On Sun, Nov 19, 2017 at 8:17 AM Jacob Marble wrote: > Is there a recommended way to get a unique id for each instance of a DoFn? > > - DataflowWo
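
The suggestion above is about Java, where `readObject()` runs on deserialization. A loose Python analog, offered as a sketch only, is to assign the UUID either in a setup method or in `__setstate__`, which pickle calls when a copy is rebuilt. `TaggedFn`, `setup`, and `instance_id` are hypothetical names.

```python
import pickle
import uuid


class TaggedFn:
    """Hypothetical function object that wants one unique id per instance."""

    def __init__(self):
        self.instance_id = None

    def setup(self):
        # Option 1: assign the id in a per-instance setup hook.
        self.instance_id = uuid.uuid4()

    def __setstate__(self, state):
        # Option 2 (analog of readObject): runs whenever a copy is deserialized.
        self.__dict__.update(state)
        self.instance_id = uuid.uuid4()


original = TaggedFn()
payload = pickle.dumps(original)

# Each deserialized copy is a separate instance and gets its own id.
copies = [pickle.loads(payload) for _ in range(3)]
ids = {copy.instance_id for copy in copies}
```

Generating the id at deserialization time means it does not depend on a setup hook being invoked, which matters if that hook is unreliable on a given runner.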

Re: unique DoFn id

2017-11-19 Thread Eugene Kirpichov
, then that > constructed object is copied as many times as needed, each instance getting > its own thread and @Setup,@StartBundle,@etc loop. Is that correct? > > Thanks for the help. > > Jacob > > On Sun, Nov 19, 2017 at 10:24 AM, Eugene Kirpichov > wrote: >

Re: Handling errors in I/O transformation

2017-11-20 Thread Eugene Kirpichov
Note that this works only for streaming inserts, because with failed batch loads it is not possible to isolate which individual writes failed. On Mon, Nov 20, 2017 at 10:01 AM Lukasz Cwik wrote: > BigQueryIO has been written in such a way to support emitting failed > records to a "dead letter qu

Re: BigQueryIO withFailedInsertRetryPolicy is endlessly retrying "invalid" rows

2017-11-27 Thread Eugene Kirpichov
Hmm, the code doesn't even get to the retry policy because it throws a GoogleJsonResponseException rather than returning a TableDataInsertAllResponse with InsertErrors in it. I think BigQuery client library didn't use to throw GoogleJsonResponseException in this case before... I'm not sure if this

Re: BigQueryIO withFailedInsertRetryPolicy is endlessly retrying "invalid" rows

2017-11-27 Thread Eugene Kirpichov
this. On Mon, Nov 27, 2017 at 10:48 AM Eugene Kirpichov wrote: > Hmm, the code doesn't even get to the retry policy because it throws a > GoogleJsonResponseException rather than returning a > TableDataInsertAllResponse with InsertErrors in it. I think BigQuery client > library di

Re: Using JDBC IO read transform, running out of memory on DataflowRunner.

2017-11-29 Thread Eugene Kirpichov
Hi, I think you're hitting something that can be fixed by configuring Redshift driver: http://docs.aws.amazon.com/redshift/latest/dg/queries-troubleshooting.html#set-the-JDBC-fetch-size-parameter *By default, the JDBC driver collects all the results for a query at one time. As a result, when you at
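
The memory problem quoted above comes from materializing a whole result set at once. The sketch below shows the general idea of fetching in bounded batches, using Python's built-in `sqlite3` purely as a stand-in; it is not the Redshift JDBC driver or JdbcIO, and `iter_rows` and the table `t` are hypothetical.

```python
import sqlite3


def iter_rows(cursor, fetch_size):
    """Yield rows while holding at most fetch_size of them in memory at once."""
    while True:
        batch = cursor.fetchmany(fetch_size)
        if not batch:
            return
        yield from batch


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10)])

cursor = conn.execute("SELECT n FROM t ORDER BY n")
# cursor.fetchall() would load every row at once; this streams them in batches of 3.
total = sum(row[0] for row in iter_rows(cursor, fetch_size=3))
conn.close()
```

With a JDBC driver the corresponding knob is the fetch size setting that the linked Redshift page describes.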

Re: Pubsub -> Bream -> many files

2017-11-30 Thread Eugene Kirpichov
TextIO.write().to(DynamicDestinations), available in Beam 2.2, does exactly this. On Thu, Nov 30, 2017, 9:35 AM Andrew Jones wrote: > Hi, > > I'm new to Beam. I have a use case where I want to read from a Pubsub > stream, transform the data in Beam, and write to many outputs. > > As a simple exa

Re: Apache Beam, version 2.2.0

2017-12-04 Thread Eugene Kirpichov
Thanks JB for sending the detailed notes about new stuff in 2.2.0! A lot of exciting things indeed. Regarding Java 8: I thought our consensus was to have the release notes say that we're *considering* going Java8-only, and use that to get more opinions from the user community - but I can't find th

Re: Callbacks/other functions run after a PDone/output transform

2017-12-04 Thread Eugene Kirpichov
I agree that the proper API for enabling the use case "do something after the data has been written" is to return a PCollection of objects where each object represents the result of writing some identifiable subset of the data. Then one can apply a ParDo to this PCollection, in order to "do somethi

Re: Callbacks/other functions run after a PDone/output transform

2017-12-04 Thread Eugene Kirpichov
of files that were written for this pane of this destination. On Mon, Dec 4, 2017 at 3:58 PM Eugene Kirpichov wrote: > I agree that the proper API for enabling the use case "do something after > the data has been written" is to return a PCollection of objects where each > obje

Re: Bounded KafkaIO read to offset

2017-12-06 Thread Eugene Kirpichov
I think you'd have to manually implement a pipeline that uses the Kafka API to capture the current offsets of each partition (say, as a PCollection>) and passes them to a ParDo that uses the Kafka API to read each respective partition until the respective offset. This could be something useful to
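
The two steps described above (capture the current offset of each partition, then read each partition up to that offset) can be sketched against an in-memory log. This is not the Kafka client API or KafkaIO: `LOG`, `capture_end_offsets`, and `read_until` are hypothetical, and offsets are assumed to be plain list indices.

```python
from typing import Dict, Iterator, List

# Hypothetical in-memory topic: partition id -> records in offset order.
LOG: Dict[int, List[str]] = {0: ["a0", "a1", "a2"], 1: ["b0", "b1"]}


def capture_end_offsets(log: Dict[int, List[str]]) -> Dict[int, int]:
    """Step 1: snapshot the current end offset of each partition."""
    return {partition: len(records) for partition, records in log.items()}


def read_until(log: Dict[int, List[str]], partition: int, end_offset: int) -> Iterator[str]:
    """Step 2: read one partition up to, but not including, its captured offset."""
    offset = 0
    while offset < end_offset:
        yield log[partition][offset]
        offset += 1


end_offsets = capture_end_offsets(LOG)

# This record arrives after the snapshot, so the bounded read must not see it.
LOG[0].append("a3")

bounded = {
    partition: list(read_until(LOG, partition, end))
    for partition, end in end_offsets.items()
}
```

Fixing the end offsets up front is what makes the read bounded: each partition has a well-defined stopping point even while producers keep appending.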

Re: Data loss in BeamFlinkRunner

2017-12-07 Thread Eugene Kirpichov
Most likely this is late data https://beam.apache.org/documentation/programming-guide/#watermarks-and-late-data . Try configuring a trigger with a late data behavior that is more appropriate for your particular use case. On Thu, Dec 7, 2017 at 3:03 PM Nishu wrote: > Hi, > > I am running a Stream

[POLL] Dropping support for Java 7 in Beam 2.3

2017-12-07 Thread Eugene Kirpichov
This is a follow-up on a previous similar thread https://lists.apache.org/thread.html/2e1890c62d9f022f09b20e9f12f130fe9f1042e391979087f725d2e0@%3Cuser.beam.apache.org%3E in which the community consistently expressed support for transitioning Beam Java to Java8-only. Now that the release of Beam 2.

Re: [POLL] Dropping support for Java 7 in Beam 2.3

2017-12-07 Thread Eugene Kirpichov
option 3. On Thu, Dec 7, 2017 at 3:48 PM Eugene Kirpichov wrote: > This is a follow-up on a previous similar thread > https://lists.apache.org/thread.html/2e1890c62d9f022f09b20e9f12f130fe9f1042e391979087f725d2e0@%3Cuser.beam.apache.org%3E > in > which the community consistently expre

Re: [POLL] Dropping support for Java 7 in Beam 2.3

2017-12-07 Thread Eugene Kirpichov
A Twitter poll has been sent out too https://twitter.com/ApacheBeam/status/938926195910905857 However, due to limitations of Twitter it can only be open for 7 days. I encourage people who are late to the poll to comment on this thread instead. On Thu, Dec 7, 2017 at 3:50 PM Eugene Kirpichov

Re: Data loss in BeamFlinkRunner

2017-12-08 Thread Eugene Kirpichov
te data issue. >> >> I found a JIRA issue(https://issues.apache.org/jira/browse/BEAM-3225 ) >> for the same issue in Beam. >> >> Today I am going to try to write similar implementation in Flink. >> >> Thanks, >> Nishu >> >> >> On Fri

Re: Data loss in BeamFlinkRunner

2017-12-08 Thread Eugene Kirpichov
case of GlobalWindows? My assumption is > that in case of Global windows, it will capture entire data due to > accumulation and watermark won't affect. > Please correct me. > > Thanks & regards, > Nishu > > > On Fri, Dec 8, 2017 at 5:43 PM, Eugene Kirpichov > wrot

Re: [POLL] Dropping support for Java 7 in Beam 2.3

2017-12-11 Thread Eugene Kirpichov
ts* at Java 8 language level, which will give the bulk of the benefit of encouraging Java8-friendly APIs. Thoughts on the above? On Thu, Dec 7, 2017 at 6:22 PM Eugene Kirpichov wrote: > A Twitter poll has been sent out too > https://twitter.com/ApacheBeam/status/938926195910905857 > Howe

Re: [POLL] Dropping support for Java 7 in Beam 2.3

2017-12-12 Thread Eugene Kirpichov
ptiste Onofré wrote: > Hi Eugene, > > thanks for the report. > > I would go to Java 8 language level, starting building tests. > > Regards > JB > > On 12/11/2017 06:00 PM, Eugene Kirpichov wrote: > > Intermediate update: right now it's 219 votes on Twitter

Re: Callbacks/other functions run after a PDone/output transform

2017-12-15 Thread Eugene Kirpichov
> > Interesting. Maybe tough to do since sinks often don't have that knowledge. > > >> >> I think those concepts map to the more detailed description Eugene >> provided, but I find it helpful to focus on what information comes out of >> the sink and how i

Re: Callbacks/other functions run after a PDone/output transform

2017-12-18 Thread Eugene Kirpichov
nes). There are multiple ways you can do this: one > would be to CGBK the two PCollections together, and trigger the new > transform only on the final pane. Another would be to add a combiner that > returns a Void, triggering only on the final pane, and then make this > singleton Void a si

Re: Callbacks/other functions run after a PDone/output transform

2017-12-19 Thread Eugene Kirpichov
I figured out the Never.ever() approach and it seems to work. Will finish this up and send a PR at some point. Woohoo, thanks Kenn! Seems like this will be quite a useful transform. On Mon, Dec 18, 2017 at 1:23 PM Eugene Kirpichov wrote: > I'm a bit confused by all of these suggestio
