Re: Is it safe to cache the value of a singleton view (with a global window) in a DoFn?

2019-05-07 Thread Lukasz Cwik
Keep your code simple and rely on the runner caching the value locally so it should be very cheap to access. If you have a performance issue due to a runner lacking caching, it would be good to hear about it so we could file a JIRA about it. On Mon, May 6, 2019 at 4:24 PM Kenneth Knowles wrote:

Re: How to configure TLS connections for Dataflow job that use Kafka and Schema Registry

2019-05-08 Thread Lukasz Cwik
I replied to the SO question with more details. The issue is that you're trying to load a truststore (file) on the VM which doesn't exist. You need to make that file accessible in some way. The other SO question ( https://stackoverflow.com/questions/42726011/truststore-and-google-cloud-dataflow?nore

Re: Problem with gzip

2019-05-10 Thread Lukasz Cwik
+user@beam.apache.org Reshuffle on Google Cloud Dataflow for a bounded pipeline waits till all the data has been read before the next transforms can run. After the reshuffle, the data should have been processed in parallel across the workers. Did you see this? Are you able to change the input of

Re: Problem with gzip

2019-05-10 Thread Lukasz Cwik
to run, according to > one test (24 gzip files, 17 million lines in total) I did. > > The file format for our users are mostly gzip format, since uncompressed > files would be too costly to store (It could be in hundreds of GB). > > Thanks, > > Allie > > > *From: *Lukasz Cw

Re: Problem with gzip

2019-05-10 Thread Lukasz Cwik
2019 at 1:25 PM Allie Chen wrote: > Yes, that is correct. > > *From: *Allie Chen > *Date: *Fri, May 10, 2019 at 4:21 PM > *To: * > *Cc: * > > Yes. >> >> *From: *Lukasz Cwik >> *Date: *Fri, May 10, 2019 at 4:19 PM >> *To: *dev >> *Cc: * &

Re: Problem with gzip

2019-05-10 Thread Lukasz Cwik
t;> >>> The file format for our users are mostly gzip format, since uncompressed >>> files would be too costly to store (It could be in hundreds of GB). >>> >>> Thanks, >>> >>> Allie >>> >>> >>> *From: *Lukasz Cwik

Re: How to configure TLS connections for Dataflow job that use Kafka and Schema Registry

2019-05-13 Thread Lukasz Cwik
s Apache Beam use Confluent Schema Registry client internally ? >>>> >>>> Yohei Onishi >>>> >>>> >>>> On Thu, May 9, 2019 at 1:12 PM Vishwas Bm wrote: >>>> >>>>> Hi Yohei, >>>>> >>>>> I had trie

Re: Problem with gzip

2019-05-14 Thread Lukasz Cwik
yKey won't wait until all data has been read? > > Thanks! > Allie > > *From: *Lukasz Cwik > *Date: *Fri, May 10, 2019 at 5:36 PM > *To: *dev > *Cc: *user > > There is no such flag to turn off fusion. >> >> Writing 100s of GiBs of uncompressed data to

Re: Problem with gzip

2019-05-14 Thread Lukasz Cwik
oud Dataflow > specific features with Python streaming execution.* > >- > >*Streaming autoscaling* > > I doubt whether this approach can solve my issue. > > > Thanks so much! > > Allie > > *From: *Lukasz Cwik > *Date: *Tue, May 14, 2019 at 11:16 AM > *To: *dev &

Re: Question about --environment_type argument

2019-05-28 Thread Lukasz Cwik
Are you losing the META-INF/services ServiceLoader entries related to binding the FileSystem via the FileSystemRegistrar when building the uber jar[1]? It does look like the Flink JobServer driver is registering the file systems[2]. 1: https://github.com/apache/beam/blob/95297dd82bd2fd3986900093cc1797c806
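When building an uber jar with the Maven Shade plugin, the ServicesResourceTransformer merges META-INF/services files from all dependencies instead of letting one jar's copy overwrite another's, which is a common cause of lost FileSystemRegistrar entries. A minimal sketch (plugin version and surrounding pom omitted):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <transformers>
          <!-- Concatenates META-INF/services entries from all jars so every
               registered implementation survives the merge -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
```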

Re: Why Apache beam can't infer the default coder when using KV?

2019-06-05 Thread Lukasz Cwik
A large cause of coder inference issues is due to type erasure in Java[1]. For your example, I would suspect that it should have worked since your ConcatWordsCombineFn doesn't have any type variables declared. Can you add the message and stacktrace for the exception that is failing here[2] to your
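As an illustration of the type-erasure point (plain Java, no Beam types): the element type of a generic instance is gone at runtime, but a concrete subclass with no free type variables keeps its type argument in class metadata, which is what makes reflection-based inference possible for classes like the ConcatWordsCombineFn described above.

```java
import java.lang.reflect.ParameterizedType;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: shows why inference can fail under type erasure, and why
// a concrete subclass without type variables can still be inspected.
class ErasureDemo {
    // At runtime both lists report the same class: the element type is erased.
    static boolean sameRuntimeClass(List<String> a, List<Integer> b) {
        return a.getClass() == b.getClass();
    }

    // A concrete subclass of a generic type retains its type argument in the
    // class file, readable via reflection.
    static class StringList extends ArrayList<String> {}

    static String capturedTypeArgument() {
        ParameterizedType t =
            (ParameterizedType) StringList.class.getGenericSuperclass();
        return t.getActualTypeArguments()[0].getTypeName();
    }
}
```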

Re: Why is my RabbitMq message never acknowledged ?

2019-06-14 Thread Lukasz Cwik
In the future, you will be able to check and give a hard error if checkpointing is disabled yet finalization is requested for portable pipelines: https://github.com/apache/beam/blob/2be7457a4c0b311c3bd784b3f00b425596adeb06/model/pipeline/src/main/proto/beam_runner_api.proto#L382 On Fri, Jun 14, 20

Re: Can we do pardo inside a pardo?

2019-06-17 Thread Lukasz Cwik
Typically you would apply your first ParDo getting back a PCollection and then apply your second ParDo to the return PCollection. You can get a lot more details in the programming guide[1]. For example: PCollection input = ... input.apply("ParDo1", ParDo.of(myDoFn1)).apply("ParDo2", ParDo.of(myDoF

Re: Beam vs Serverless (Dataflow vs Functions)

2019-06-20 Thread Lukasz Cwik
It's likely just a difference in cost and ease of implementation but Cloud Functions seems like it may fit your use case fine. On Thu, Jun 20, 2019 at 5:48 AM Joshua Fox wrote: > I need to take data that is sent over PubSub and simply store it into > MongoDB. > > Is there an advantage to use Dat

Re: [ANNOUNCE] Spark portable runner (batch) now available for Java, Python, Go

2019-06-21 Thread Lukasz Cwik
This is great news. Can't wait to see more. On Fri, Jun 21, 2019 at 8:56 AM Alexey Romanenko wrote: > Amazing job! Thank you, Kyle! > > On 19 Jun 2019, at 18:10, David Morávek wrote: > > Great job Kyle, thanks for pushing this forward! > > Sent from my iPhone > > On 18 Jun 2019, at 12:58, Ismaë

Re: gRPC method to get a pipeline definition?

2019-06-28 Thread Lukasz Cwik
The InMemoryJobService is meant to be a simple implementation. Adding a job expiration N minutes after the job completes might make sense. In reality, a more complex job service is needed that is backed by some kind of persistent storage or stateful service. On Thu, Jun 27, 2019 at 10:45 PM Chad

Re: gRPC method to get a pipeline definition?

2019-06-28 Thread Lukasz Cwik
I think the simplest solution would be to have some kind of override/hook that allows Flink/Spark/... to provide storage. They already have a concept of a job and know how to store them so we can piggyback the Beam pipeline there. On Fri, Jun 28, 2019 at 7:49 AM Chad Dombrova wrote: > > In reali

Re: gRPC method to get a pipeline definition?

2019-06-28 Thread Lukasz Cwik
+dev On Fri, Jun 28, 2019 at 8:20 AM Chad Dombrova wrote: > > I think the simplest solution would be to have some kind of override/hook >> that allows Flink/Spark/... to provide storage. They already have a concept >> of a job and know how to store them so can we piggyback the Beam pipeline >>

Re: Apache Beam issue | Reading Avro files and pushing to Bigquery

2019-07-09 Thread Lukasz Cwik
+user (please use user@ for questions about using the product and restrict to dev@ for questions related to developing the product). Can you provide the rest of the failing reason (and any stacktraces from the workers related to the failures)? On Tue, Jul 9, 2019 at 11:04 AM Dhiraj Sardana wrot

Re: [Python SDK] Avro read/write & Indexing

2019-07-09 Thread Lukasz Cwik
Typically this would be done by reading in the contents of the entire file into a map side input and then consuming that side input within a DoFn. Unfortunately, only Dataflow supports really large side inputs with an efficient access pattern and only when using Beam Java for bounded pipelines. Su

Re: [Opinion] [Question] Python SDK & Java SDK

2019-07-10 Thread Lukasz Cwik
Age is the largest consideration since the Python SDK was started a few years after the Java one. Another consideration was that the Python SDK only worked on Dataflow until recently, when, due to the work on portability, a few other runners became able to execute Python pipelines. An

Re: [Java] Using a complex datastructure as Key for KV

2019-07-12 Thread Lukasz Cwik
TreeMapCoder.of(StringUtf8Coder.of(), ListCoder.of(VarIntCoder.of())); On Fri, Jul 12, 2019 at 10:22 AM Shannon Duncan wrote: > So I have my custom coder created for TreeMap and I'm ready to set it... > > So my Type is "TreeMap>" > > What do I put for ".setCoder(TreeMapCoder.of(???, ???))" > > O

Re: [Java] Using a complex datastructure as Key for KV

2019-07-12 Thread Lukasz Cwik
at 9:31 AM Shannon Duncan > wrote: > >> Aha, makes sense. Thanks! >> >> On Fri, Jul 12, 2019 at 9:26 AM Lukasz Cwik wrote: >> >>> TreeMapCoder.of(StringUtf8Coder.of(), ListCoder.of(VarIntCoder.of())); >>> >>> On Fri, Jul 12, 2019 at 10:22 AM

Re: Beam python pipeline on spark

2019-08-05 Thread Lukasz Cwik
I added a new stackoverflow answer pointing to the link about getting started. Please upvote the answer to increase visibility. On Sun, Aug 4, 2019 at 12:46 AM Chad Dombrova wrote: > It's in the doc that Kyle sent, but it's worth mentioning here that > streaming is not yet supported. > > -chad >

Re: How to deal with failed Checkpoint? What is current behavior for subsequent checkpoints?

2019-08-05 Thread Lukasz Cwik
https://s.apache.org/beam-finalizing-bundles should give you a bunch more details but I replied inline to your questions as well. On Fri, Jul 19, 2019 at 10:40 AM Ken Barr wrote: > Reading the below two statements I conclude that CheckpointMark.finalize > Checkpoint() will be called in order, un

Re: Save state on tear down

2019-08-05 Thread Lukasz Cwik
This is not possible today. There have been discussions about pipeline drain, snapshot and update [1, 2] which may provide additional details of what is planned and could use your feedback. 1: https://docs.google.com/document/d/1NExwHlj-2q2WUGhSO4jTu8XGhDPmm3cllSN8IMmWci8 2: https://docs.google.c

Re: Latency of Google Dataflow with Pubsub

2019-08-05 Thread Lukasz Cwik
+dev On Mon, Aug 5, 2019 at 12:49 PM Dmitry Minaev wrote: > Hi there, > > I'm building streaming pipelines in Beam (using Google Dataflow runner) > and using Google Pubsub as a message broker. I've made a couple of > experiments with a very simple pipeline: consume events from Pubsub > subscrip

Re: WordCount example breaks for FlinkRunner (local) for 2.14.0

2019-08-07 Thread Lukasz Cwik
The uber jar you're using looks like it is incorrectly merging dependencies which is causing the issue you reported. Please try using mvn to run the example directly: mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \ -Dexec.args="--runner=FlinkRunner --inputFile=pom.xm

Re: Caused by: java.lang.IllegalArgumentException: URI is not hierarchical

2019-08-08 Thread Lukasz Cwik
+user Can you supply the full stacktrace for the exception? On Tue, Aug 6, 2019 at 3:22 PM Jayanth Kolli wrote: > Getting following error while trying to run DataFloeRunner example in Self > executing JAR mode. > I have example word count spring boot application running fine with > spring-boot

Re: BigQueryIO - insert retry policy in Apache Beam

2019-08-08 Thread Lukasz Cwik
On Wed, Aug 7, 2019 at 10:55 PM Yohei Onishi wrote: > Hi, > > If you are familiar with BiqQuery insert retry policies in Apache Beam API > (BigQueryIO), please help me understand the following behavior. I am using > Dataflow runner. > >- How Dataflow job behave if I specify retryTransientErro

Re: Late data handling in Python SDK

2019-08-09 Thread Lukasz Cwik
+dev Related JIRAs I found are BEAM-3759 and BEAM-7825. This has been a priority as the community has been trying to get streaming Python execution working on multiple Beam runners. On Wed, Aug 7, 2019 at 2:31 AM Sam Stephens wrote: > Hi all, > > I’ve been reading into, and experimentin

Re: Multiple file systems configuration

2019-08-20 Thread Lukasz Cwik
ually, in other IOs, we can do this easily by having specific methods, > like “withConfiguration()”, “withCredentialsProvider()”, etc. for Read and > Write, but FileSystems based IO could be configured only > with PipelineOptions afaik. There was a thread about that a while ago [1] &

Re: Design question regarding streaming and sorting

2019-08-24 Thread Lukasz Cwik
Side inputs don't change for the lifetime of a bundle. Only on new bundles would you get a possibly updated view of the side input, so you may not see the changes to priority as quickly as you may expect. How quickly this happens is all dependent on the runner's internal implementation details.

Re: Design question regarding streaming and sorting

2019-08-26 Thread Lukasz Cwik
Once you choose to start processing a row, can a different row preempt that work or when you start processing you'll finish it? So far, I'm agreeing with what Jan has said. I believe there is a way to get what you want working but with a lot of unneeded complexity since not much in your problem de

Re: NoSuchElementException after Adding SplittableDoFn

2019-08-27 Thread Lukasz Cwik
This is a known issue and was fixed with https://github.com/apache/beam/commit/5d9bb4595c763025a369a959e18c6dd288e72314#diff-f149847d2c06f56ea591cab8d862c960 It is meant to be released as part of 2.16.0 On Tue, Aug 27, 2019 at 11:41 AM Zhiheng Huang wrote: > Hi all, > > Looking for help to unde

Re: NoSuchElementException after Adding SplittableDoFn

2019-08-27 Thread Lukasz Cwik
ks! Glad that this will be fixed in future release. > > Is there anyway that I can avoid hitting this problem before 2.16 is > released? > > On Tue, Aug 27, 2019 at 12:57 PM Lukasz Cwik wrote: > >> This is a known issue and was fixed with >

Re: Limit log files count with Dataflow Runner Logging

2019-08-29 Thread Lukasz Cwik
The only logging options that Dataflow exposes today limit what gets logged and not anything about how many rotated logs there are or how big they are. All Dataflow logging options are available here: https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/src/main/java/org/a

Re: Limit log files count with Dataflow Runner Logging

2019-08-30 Thread Lukasz Cwik
/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/logging/DataflowWorkerLoggingHandler.java > > Thanks > > On Thu, Aug 29, 2019 at 9:44 PM Lukasz Cwik wrote: > >> The only logging options that Dataflow exposes today li

Re: Setting environment and system properties on Dataflow workers

2019-08-30 Thread Lukasz Cwik
There is a way to run arbitrary code on JVM startup via a JVM initializer[1] in the Dataflow worker and in the portable Java worker as well. You should be able to mutate system properties at that point in time since Java allows for system properties to be mutated. The standard Java runtime doesn't
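The system-property mutation described above can be sketched in plain Java; the property name is hypothetical, and the JVM initializer hook mentioned in the thread is not reproduced here.

```java
// Sketch: system properties are plain mutable JVM state, so code run early
// (e.g. from a startup hook on the worker) can set them before other code
// reads them. "my.custom.flag" is a made-up property for illustration.
class PropertyInit {
    static void initialize() {
        // Only set a default if the property was not already passed via -D
        // on the command line, mirroring what a startup hook would do.
        if (System.getProperty("my.custom.flag") == null) {
            System.setProperty("my.custom.flag", "enabled");
        }
    }

    static String flag() {
        return System.getProperty("my.custom.flag");
    }
}
```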

Re: Save state on tear down

2019-09-03 Thread Lukasz Cwik
execution graph causing windows to close, timers to fire, state to be emitted and then garbage collected and so forth. > > thanks, > -chad > > > On Fri, Aug 16, 2019 at 2:47 PM Jose Delgado > wrote: > >> I see, thank you Lukasz. >> >> >> >&g

Re: Setting environment and system properties on Dataflow workers

2019-09-03 Thread Lukasz Cwik
nment in Google's docs, so perhaps that would be the > best place for this. > > On Fri, Aug 30, 2019 at 1:53 PM Lukasz Cwik wrote: > >> There is a way to run arbitrary code on JVM startup via a JVM >> initializer[1] in the Dataflow worker and in the portable Java worker

Re: [Java] Compressed SequenceFile

2019-09-05 Thread Lukasz Cwik
Sorry for the poor experience and thanks for sharing a solution with others. On Thu, Sep 5, 2019 at 6:34 AM Shannon Duncan wrote: > FYI this was due to hadoop version. 3.2.0 was throwing this error, but > rolled back to version in googles pom.xml 2.7.4 and it is working fine now. > > Kindof anno

Re: installing Apache Beam on Pycharm with Python 3.7

2019-09-05 Thread Lukasz Cwik
+user -dev Can you provide any error messages or details as to what failed? On Thu, Sep 5, 2019 at 9:43 AM Priti Badami < pbadami.srdataengin...@gmail.com> wrote: > Hi Dev Team, > > I am trying to install Apache Beam. I have pip 19.2.3 but I am facing > issues while installing Beam. > > please ad

Re: How to buffer events using spark portable runner ?

2019-09-08 Thread Lukasz Cwik
Try using Apache Flink. On Sun, Sep 8, 2019 at 6:23 AM Yu Watanabe wrote: > Hello . > > I would like to ask question related to timely processing as stated in > below page. > > https://beam.apache.org/blog/2017/08/28/timely-processing.html > > Python version: 3.7.4 > apache beam version: 2.15.0

Re: How do I run Beam Python pipelines using Flink deployed on Kubernetes?

2019-09-10 Thread Lukasz Cwik
External environments are mainly used for testing since they represent environments that are already running. There are other clever uses of it as well but unlikely what you're looking for. The docker way will be nice once there is integration between the FlinkRunner and its supported cluster manage

Re: How do I run Beam Python pipelines using Flink deployed on Kubernetes?

2019-09-11 Thread Lukasz Cwik
n the python > container (source file boot.go). Is that right? > > > > Best, > > Andrea > > > > *From: *Lukasz Cwik > *Reply-To: *"user@beam.apache.org" > *Date: *Tuesday, 10 September 2019 at 23:19 > *To: *user > *Subject: *Re: How do I r

Re: How do you write portable runner pipeline on separate python code ?

2019-09-12 Thread Lukasz Cwik
When you use a local filesystem path and a docker environment, "/tmp" is written inside the container. You can solve this issue by: * Using a "remote" filesystem such as HDFS/S3/GCS/... * Mounting an external directory into the container so that any "local" writes appear outside the container * Usi
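The mounting option above can be sketched as a docker invocation; the image name and paths are hypothetical, and in practice the runner launches the SDK container itself rather than the user invoking it by hand.

```shell
# Illustrative only: mount a host directory over /tmp inside the container so
# writes to /tmp land on the host instead of vanishing with the container.
docker run -v /host/beam-output:/tmp my-beam-python-sdk-image
```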

Re: How can I work with multiple pcollections?

2019-09-12 Thread Lukasz Cwik
Yes you can create multiple output PCollections using a ParDo with multiple outputs instead of inserting them into Mongo. It could be useful to read through the programming guide related to PCollections[1] and PTransforms with multiple outputs[2] and feel free to return with more questions. 1: ht

Re: How do you write portable runner pipeline on separate python code ?

2019-09-13 Thread Lukasz Cwik
em >>>>> (e.g. a Flink/Spark cluster or Dataflow). It would be very convenient if >>>>> we >>>>> could automatically stage local files to be read as artifacts that could >>>>> be >>>>> consumed by any worker (possibly via

Re: How to debug dataflow locally

2019-09-13 Thread Lukasz Cwik
In general there is no generic source/sink fake/mock/emulator that runs for all sources/sinks locally. Configuration and data is on a case by case basis. Most of the Apache Beam integration tests either launch a local implementation that is specific to the source/sink (such as a DB for JdbcIO) or u

Re: Where is /opt/apache/beam/boot?

2019-09-18 Thread Lukasz Cwik
It is embedded inside the docker container that corresponds to which SDK you're using. Python container boot src: https://github.com/apache/beam/blob/master/sdks/python/container/boot.go Java container boot src: https://github.com/apache/beam/blob/master/sdks/java/container/boot.go Go container boot

Re: Any possibility to run larger data sets with DirectRunner?

2019-09-25 Thread Lukasz Cwik
+1 for local execution using Flink. On Tue, Sep 17, 2019 at 4:24 AM Paweł Kordek wrote: > Hi Steve > > Maybe local execution on a Flink cluster will work for you: > https://beam.apache.org/documentation/runners/flink/ ? > > Cheers > Pawel > > On Tue, 17 Sep 2019 at 10:51, Steve973 wrote: > >> H

Re: Python Portable Runner Issues

2019-09-25 Thread Lukasz Cwik
Google Dataflow currently uses a JSON representation of the pipeline graph and also the pipeline proto. We represent the graph in two different ways which leads to some wonderful *features*. Google Dataflow also side steps the Beam job service since Dataflow has its own Job API. Supporting the Beam

Re: Multiple iterations after GroupByKey with SparkRunner

2019-09-26 Thread Lukasz Cwik
Jan, in Beam users expect to be able to iterate the GBK output multiple times even from within the same ParDo. Is this something that Beam on Spark Runner never supported? On Thu, Sep 26, 2019 at 6:50 AM Jan Lukavský wrote: > Hi Gershi, > > could you please outline the pipeline you are trying to

Re: Avro SortedKeyValueFile

2019-09-27 Thread Lukasz Cwik
Not to my knowledge. On Fri, Sep 27, 2019 at 7:48 AM Shannon Duncan wrote: > Does beam support the avro SortedKeyValueFile > ? > Not able to find any documentation in beam area on this. > > I need to

Re: One-to-many mapping between unbounded input source and pipelines with session windows

2016-12-21 Thread Lukasz Cwik
= “foo” or record[“record_kind”] == “bar.” However, > I’d be curious if the answer changes if I wanted to do this globally for > the whole stream. > > Thanks! > > Ray > > From: Lukasz Cwik > Reply-To: "u...@beam.incubator.apache.org" > > Date: Wednesday, December

Re: One-to-many mapping between unbounded input source and pipelines with session windows

2016-12-22 Thread Lukasz Cwik
asz! > > Suppose type A is common but type B is rare. Would your second suggestion > be more appropriate in this case? > > Ray > > From: Lukasz Cwik > Reply-To: "user@beam.apache.org" > Date: Wednesday, December 21, 2016 at 8:14 PM > To: "u...@beam.incubator

Re: window settings for recovery scenario

2016-12-27 Thread Lukasz Cwik
The withAllowedLateness controls when data can enter the system and still be considered valid. The timestamp of the data is always relative to the watermark. timestamp is before watermark - withAllowedLateness -> data can be dropped timestamp is after watermark - withAllowedLateness -> data can not
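The rule above can be sketched as a small predicate; times are epoch millis, and this is an illustration of the stated semantics, not Beam's actual implementation.

```java
// Minimal sketch: an element is droppable when its timestamp falls earlier
// than (watermark - allowedLateness), per the rule described in the message.
class LatenessRule {
    static boolean canBeDropped(long elementTimestamp,
                                long watermark,
                                long allowedLatenessMillis) {
        return elementTimestamp < watermark - allowedLatenessMillis;
    }
}
```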

Re: window settings for recovery scenario

2016-12-28 Thread Lukasz Cwik
, Mingmin wrote: > Thanks Lukasz. With the provided window function, can I control how the > watermark move forward ? Or a customized WindowFn is required. > > Sent from my iPhone > > On Dec 27, 2016, at 10:40 AM, Lukasz Cwik wrote: > > The withAllowedLateness controls when d

Re: some questions about metrics in beam

2016-12-28 Thread Lukasz Cwik
There is this tracking issue (https://issues.apache.org/jira/browse/BEAM-147) for the work and also a doc http://s.apache.org/beam-metrics-api that describes some additional details. Feel free to comment on the issue / doc with questions. On Tue, Dec 27, 2016 at 10:55 PM, 陈竞 wrote: > is there an

Re: Can beam create a iterative streaming job?

2017-01-09 Thread Lukasz Cwik
There are no built-in constructs which allow for a conditional iterative execution. This issue (https://issues.apache.org/jira/browse/BEAM-106) tracks the feature request. Today, if you know some upper bound on how many times your pipeline needs to loop then you can manually unroll the loop. Alterna

Re: KafkaIO Example

2017-01-12 Thread Lukasz Cwik
I'm assuming you mean DirectRunner and not DataflowDirectRunner. On Wed, Jan 11, 2017 at 4:23 PM, Raghu Angadi wrote: > > On Wed, Jan 11, 2017 at 1:56 PM, Madhire, Naveen < > naveen.madh...@capitalone.com> wrote: > >> I can confirm the authorization and authentication to Kafka is behaving >> cor

Re: Debugging a failed Dataflow job

2017-01-23 Thread Lukasz Cwik
Please take a look at https://cloud.google.com/dataflow/pipelines/logging#monitoring-pipeline-logs On Mon, Jan 23, 2017 at 12:21 PM, Gareth Western wrote: > I'm having trouble running my pipeline using the dataflow runner. The job > is submitted successfully: > > Dataflow SDK version: 0.4.0 > Su

Re: Documentation - Programming Guide - Creating a PCollection

2017-01-24 Thread Lukasz Cwik
Yes, please report documentation issues via JIRA (added you as a contributor so that you can create it), also feel free to open a PR addressing the issue. On Tue, Jan 24, 2017 at 5:34 AM, Tobias Feldhaus < tobias.feldh...@localsearch.ch> wrote: > Hi, > > in the Programming Guide, under the “Creat

Re: Having a local cache (per JVM) to use in DoFns

2017-04-06 Thread Lukasz Cwik
You should follow any valid singleton pattern and preferably initialize on class load or within a method annotated with @Setup [1]. @Setup/@Teardown is called each time an instance of a DoFn is created/discarded respectively. @Setup/@Teardown generally will be called fewer times than startBundle/fi
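The singleton pattern suggested above can be sketched in plain Java, independent of Beam; "ExpensiveResource" is a placeholder for whatever you need to cache across DoFn instances.

```java
// Sketch: the expensive resource is created at most once per JVM and shared
// by all DoFn instances; get() would typically be called from @Setup.
class ResourceHolder {
    static class ExpensiveResource {
        final long createdAt = System.nanoTime();
    }

    private static volatile ExpensiveResource instance;

    // Double-checked locking ensures a single construction even when several
    // DoFn instances are set up concurrently.
    static ExpensiveResource get() {
        if (instance == null) {
            synchronized (ResourceHolder.class) {
                if (instance == null) {
                    instance = new ExpensiveResource();
                }
            }
        }
        return instance;
    }
}
```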

Re: howto zip pcollection with index

2017-04-10 Thread Lukasz Cwik
If the PCollection is small you can just convert it into a PCollectionView<List<T>> using View.asList and then in another ParDo read in this list as a side input and iterate over all the elements using the index offset in the list. To parallelize the above, you need to break up the List into ranges which

Re: How to skip processing on failure at BigQueryIO sink?

2017-04-11 Thread Lukasz Cwik
Have you thought of fetching the schema upfront from BigQuery and prefiltering out any records in a preceding DoFn instead of relying on BigQuery telling you that the schema doesn't match? Otherwise you are correct in believing that you will need to update BigQueryIO to have the retry/error seman

Re: dealing with metadata coming in a separate pubsub

2017-04-21 Thread Lukasz Cwik
How do you know when a record in the data pipeline has enough meta information stored so that it can be processed? How far behind is the meta data pubsub compared to the main pubsub? Do you expect late data/metadata, and if so what do you want to do? Also, side inputs aren't meant to be slow and t

Re: Cloning options instances

2017-04-21 Thread Lukasz Cwik
It was removed because many of the fields stored in PipelineOptions were not really cloneable but used as a way to pass around items such as an ExecutorService or Credentials for dependency injection reasons. With the above caveat that you're not getting a true clone, feel free to copy the code show

Re: Slack Channel Invite

2017-04-24 Thread Lukasz Cwik
Done On Mon, Apr 24, 2017 at 7:34 AM, Paulo Bittencourt wrote: > Hello, > > Could I please have an invite to the Slack Channel? We're testing out > running Beam on Google Dataflow. Would love to be able to share experience > and ask for help/insight. > > Thanks, > Paulo >

Re: Using watermarks with bounded sources

2017-04-24 Thread Lukasz Cwik
BoundedSource is able to report the timestamp[1] for records. It is just that runners know that it is a fixed dataset so they have a trivial optimization where the watermark goes from negative infinity to positive infinity once all the data is read. For bounded splittable DoFns, it's likely that run

Re: Triggers in 0.6.0 buffering behavior aligns with multiples of 10

2017-04-24 Thread Lukasz Cwik
Triggers are designed as you had mentioned to fire eventually after the conditions are met. This gives control to runners between being low latency by evaluating triggers often or highly performant with larger bundle sizes and less trigger evaluations. For testing of runners, we specifically have

Re: Slack Channel Invite

2017-04-25 Thread Lukasz Cwik
Done On Tue, Apr 25, 2017 at 1:24 PM, Alexandre Crayssac < alexandre.crays...@polynom.io> wrote: > Same here! > > Thanks > > On Mon, Apr 24, 2017 at 5:16 PM, Lukasz Cwik wrote: > >> Done >> >> On Mon, Apr 24, 2017 at 7:34 AM, Paulo Bittencourt >

Re: Reprocessing historic data with streaming jobs

2017-05-01 Thread Lukasz Cwik
I believe that if your data from the past can't affect the data of the future because the windows/state are independent of each other then just reprocessing the old data using a batch job is simplest and likely to be the fastest. About your choices 1, 2, and 3, allowed lateness is relative to the

Re: HDFSFileSource and distributed Apex questions

2017-05-02 Thread Lukasz Cwik
Moving this to user@beam.apache.org In the latest snapshot version of Apache Beam, file based sources like AvroIO/TextIO were updated to support reading from Hadoop, see HadoopFileSystem

Re: Reprocessing historic data with streaming jobs

2017-05-03 Thread Lukasz Cwik
An extremely convenient point is that you should be able to use the same pipeline (albeit with a different data source) to process bounded (backfill/reprocessing) and unbounded datasets (live). On Wed, May 3, 2017 at 12:55 AM, Lars BK wrote: > Thanks for your input and sorry for the late reply.

Re: [HEADS UP] Using "new" filesystem layer

2017-05-04 Thread Lukasz Cwik
JB, for your second point it seems as though you may not be setting the Hadoop configuration on HadoopFileSystemOptions. Also, I just merged https://github.com/apache/beam/pull/2890 which will auto detect Hadoop configuration based upon your HADOOP_CONF_DIR and YARN_CONF_DIR environment variables.

Re: Slack channel invite pls

2017-05-04 Thread Lukasz Cwik
Sent On Thu, May 4, 2017 at 9:32 AM, Seshadri Raghunathan wrote: > Hi , > > Please add me to the Slack channel in the next possible cycle. > > sesh...@gmail.com > > Thanks, > Seshadri >

Re: Slack invite

2017-05-04 Thread Lukasz Cwik
Sent On Thu, May 4, 2017 at 10:06 AM, Vinoth Chandar wrote: > Can you please add me to the slack group? > > vinoth at uber dot com > > Thanks, > Vinoth >

Re: [HEADS UP] Using "new" filesystem layer

2017-05-05 Thread Lukasz Cwik
documentation about > HDFS support. > > Regards > JB > > On 05/04/2017 06:07 PM, Lukasz Cwik wrote: > >> JB, for your second point it seems as though you may not be setting the >> Hadoop >> configuration on HadoopFileSystemOptions. >> Also, I just merged ht

Re: Learn testLateDataAccumulating

2017-05-08 Thread Lukasz Cwik
There are 3 groupings of firings: before the watermark has passed the end of the window, when the watermark reaches the end of the window, and after the watermark has passed the end of the window. The before- and after-watermark firings typically represent speculative and late data, respectively.

Re: Modify default command line options

2017-05-10 Thread Lukasz Cwik
You're not missing anything. Are you speaking about being able to override defaults or being able to see if the response from pipelineOptions.getY() is the default response? Overriding defaults is tricky because many parts of the system call options.getRunner(), now which default should be returned

Re: Modify default command line options

2017-05-11 Thread Lukasz Cwik
fill in a PipelineOptions instance before calling > .fromArgs. That way I'd set my own defaults for the pipeline, but retain > the ability to override them through the command line. > > On Wed, May 10, 2017 at 6:16 PM, Lukasz Cwik wrote: > >> Your not missing any

Re: weird serialization issue on beam 2.0.0rc2

2017-05-11 Thread Lukasz Cwik
We were able to reproduce the issue and this is a regression in how we serialize the SerializableCoder, we used to only save the class but now save the class and type variable. The type variable is unnecessary and being marked as transient. On Thu, May 11, 2017 at 4:52 PM, Thomas Groh wrote: > H

Re: weird serialization issue on beam 2.0.0rc2

2017-05-11 Thread Lukasz Cwik
Being tracked on https://issues.apache.org/jira/browse/BEAM-2275 On Thu, May 11, 2017 at 5:37 PM, Dan Halperin wrote: > Thanks Anthony -- I was wrong here when I said that PCollectionViews don't > need to be serializable ; your original report is great and your new one is > better. > > Great to

Re: Custom log appenders with Dataflow runner

2017-05-17 Thread Lukasz Cwik
Have you tried installing a logger onto the root JUL logger once the pipeline starts executing in the worker (so inside one of your DoFn(s) setup methods)? ROOT_LOGGER_NAME = "" LogManager.getLogManager().getLogger(ROOT_LOGGER_NAME).addHandler(myCustomHandler); Also, the logging integration seems
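The root-logger suggestion above can be sketched with plain java.util.logging; here the handler just counts records, where a real one would forward them to the custom appender.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Handler;
import java.util.logging.LogManager;
import java.util.logging.LogRecord;

// Sketch: install a custom Handler on the root JUL logger (named "") so that
// records published by any child logger flow through it.
class RootHandlerDemo {
    static final AtomicInteger SEEN = new AtomicInteger();

    static void install() {
        Handler counting = new Handler() {
            @Override public void publish(LogRecord record) { SEEN.incrementAndGet(); }
            @Override public void flush() {}
            @Override public void close() {}
        };
        // "" is the root logger's name in java.util.logging.
        LogManager.getLogManager().getLogger("").addHandler(counting);
    }
}
```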

Re: Getting "Unable to find registrar for c" error

2017-05-22 Thread Lukasz Cwik
This is a known problem with how the local file system is being interacted with as is evident by the similar test failures when running the Apache Beam unit tests on Windows: https://issues.apache.org/jira/browse/BEAM-2299 I would suggest trying to use "file:///c:/my/path/to/file.txt" There is a
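The error message can be reproduced with plain Java URI parsing: in "c:/my/path" the drive letter parses as a URI scheme, so a registrar named "c" is looked up. The fallback-to-"file" behavior below is an assumption for illustration, not Beam's exact logic.

```java
import java.net.URI;

// Sketch: show which filesystem "scheme" a path string resolves to when
// parsed as a URI, which is why a Windows drive letter is mistaken for one.
class SchemeDemo {
    static String schemeOf(String path) {
        String scheme = URI.create(path).getScheme();
        // Assumed default: schemeless paths are treated as local files.
        return scheme == null ? "file" : scheme;
    }
}
```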

Re: Getting "Unable to find registrar for c" error

2017-05-23 Thread Lukasz Cwik
t; as input file as seen in >> https://issues.apache.org/jira/browse/BEAM-2298, however both failed. >> >> "file:///C:/Users/ges/Workspace/word-count-beampom.xml" failed >> with Illegal char <:> at index 4 >> "C:/Users/ges/Workspace/word-count-beampom.

Re: Best way to load heavy object into memory on nodes (python sdk)

2017-05-24 Thread Lukasz Cwik
Why not use a singleton-like pattern and have a function which either loads and caches the ML model from a side input or returns the singleton if it has been loaded? You'll want to use some form of locking to ensure that you really only load the ML model once. On Wed, May 24, 2017 at 6:18 AM, Vilh

Re: How to partition a stream by key before writing with FileBasedSink?

2017-05-24 Thread Lukasz Cwik
Since you're using a small number of shards, add a Partition transform which uses a deterministic hash of the key to choose one of 4 partitions. Write each partition with a single shard. (Fixed-width diagram below) Pipeline -> AvroIO(numShards = 4) Becomes: Pipeline -> Partition --> AvroIO(numShards
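The deterministic hash routing described above boils down to a small pure function; in Beam it would be the body of the PartitionFn you hand to the Partition transform, with each resulting partition written via AvroIO with numShards = 1. A standalone sketch (the key values are made up):

```java
public class KeyPartitioner {
    // Map a key deterministically to one of numPartitions buckets.
    // Math.floorMod keeps the result non-negative even when hashCode() is negative.
    static int partitionFor(String key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        int n = 4;
        for (String key : new String[] {"alpha", "beta", "gamma", "delta"}) {
            System.out.println(key + " -> shard " + partitionFor(key, n));
        }
        // The same key always lands in the same shard:
        System.out.println(partitionFor("alpha", n) == partitionFor("alpha", n));  // prints "true"
    }
}
```

Because the routing is a pure function of the key, all elements with the same key end up in the same output file regardless of which worker processes them.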

Re: How to decrease latency when using PubsubIO.Read?

2017-05-24 Thread Lukasz Cwik
What runner are you using (Flink, Spark, Google Cloud Dataflow, Apex, ...)? On Wed, May 24, 2017 at 8:09 AM, Ankur Chauhan wrote: > Sorry that was an autocorrect error. I meant to ask - what dataflow runner > are you using? If you are using google cloud dataflow then the PubsubIO > class is not

Re: How to partition a stream by key before writing with FileBasedSink?

2017-05-24 Thread Lukasz Cwik
; partitions so that Dataflow won't override numShards. > > Josh > > > On Wed, May 24, 2017 at 4:10 PM, Lukasz Cwik wrote: > >> Since you're using a small number of shards, add a Partition transform >> which uses a deterministic hash of the key to choose one of 4 part

Re: Getting "Unable to find registrar for c" error

2017-05-25 Thread Lukasz Cwik
onTargetException: >> java.lang.IllegalStateException: Unable to find registrar for c -> [Help >> 1] >> >> Is there a way to know which "c" beams tried to read but failed to find >> registrar? The reason I ask is because from what I can see, the compile an

Re: Call CombineFns.compose() with uncertain combinefn

2017-05-30 Thread Lukasz Cwik
For the compilation error, have you tried the following? Long maxLatency = e.get((TupleTag) (TupleTag) finalOutputTags.get(0)); I'm not sure whether I fully understand the problem. On Sat, May 27, 2017 at 2:15 AM, 郭亚峰(默岭) wrote: > > Hi there, > I'm working with a small DSL project on top of beam sdk. In th

Re: Enriching stream messages based on external data

2017-06-01 Thread Lukasz Cwik
Combining PubSub + Bigtable is common. You should try to use the BigtableSession approach because the HBase approach adds a lot of dependencies (leading to dependency conflicts). You should use the same version of the Bigtable libraries that Apache Beam is using (Apache Beam 2.0.0 uses Bigtable 0.9.6.

Re: Enriching stream messages based on external data

2017-06-02 Thread Lukasz Cwik
>>> ava:1178) >>>>> at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348) >>>>> at org.apache.beam.sdk.util.SerializableUtils.serializeToByteAr >>>>> ray(SerializableUtils.java:49) >>>>> ... 10 more >>>>&

Re: Running Beam in Hadoop Cluster

2017-06-02 Thread Lukasz Cwik
Flattening all the dependencies into one jar is build-system dependent. If you're using Maven, I would look into the Maven Shade Plugin ( https://maven.apache.org/plugins/maven-shade-plugin/). Jar files are also just zip files, so you could merge them manually as well, but you'll need to deal with dependency
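A minimal Maven Shade Plugin configuration for building such a fat jar might look like the fragment below (version tag omitted; adapt to your pom.xml). The ServicesResourceTransformer merges META-INF/services files, which matters for Beam because its FileSystem registrars are discovered via ServiceLoader — see the "Unable to find registrar" threads above:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <transformers>
          <!-- Merge META-INF/services entries from all dependencies -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
```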

Re: Running Beam in Hadoop Cluster

2017-06-02 Thread Lukasz Cwik
eah, we're working on altering the build file to include all dependencies > in one, huge jar. Is there a better way than this to run Beam jobs on a > cluster? Putting everything into a jar seems like a clunky solution. > > > On Friday, June 2, 2017 11:41 AM, Lukasz Cwik wrote: &

Re: How to partition a stream by key before writing with FileBasedSink?

2017-06-06 Thread Lukasz Cwik
n Wed, May 24, 2017 at 6:15 PM, Josh wrote: > >> Ahh I see - Ok I'll try out this solution then. Thanks Lukasz! >> >> On Wed, May 24, 2017 at 5:20 PM, Lukasz Cwik wrote: >> >>> Google Cloud Dataflow won't override your setting. The dynamic sharding &

Re: How to partition a stream by key before writing with FileBasedSink?

2017-06-06 Thread Lukasz Cwik
across e.g. 8 instances instead of 4? If there are two input > elements with the same key are they actually guaranteed to be processed on > the same instance? > > > Thanks, > > Josh > > > > > On Tue, Jun 6, 2017 at 4:51 PM, Lukasz Cwik wrote: > >> I t
