Beam Pipeline: storing files in CloudBucket / overriding

2021-07-21 Thread Sofia’s World
Hi all, if I remember correctly, when my pipeline writes a file to a GCP bucket and the file already exists, the default behaviour is that the file is not overridden. What is the pattern to follow if I want the existing file to be overwritten every time the pipeline runs and the file is stored to the GCP bucket?
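A minimal Java sketch of such a write, with hypothetical bucket, path and contents: pinning the output to one deterministic filename means every run targets the same object, and whether that object then gets replaced (or has to be deleted first) is exactly the behaviour being asked about.

    // Hypothetical bucket/path/contents: one deterministically named output file per run.
    pipeline
        .apply(Create.of("record-1", "record-2"))
        .apply(TextIO.write()
            .to("gs://my-bucket/reports/daily-report")   // same prefix on every run
            .withSuffix(".txt")
            .withoutSharding());                          // single file: daily-report.txt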

Re: Checking a Pcoll is empty in Apache Beam

2021-07-21 Thread Rajnil Guha
Yes, I am just thinking about how to modify/rewrite this piece of code if I want to run my pipeline on the Dataflow runner. Thanks & Regards Rajnil Guha On Wed, Jul 21, 2021 at 1:12 AM Robert Bradshaw wrote: > On Tue, Jul 20, 2021 at 12:33 PM Rajnil Guha > wrote: > > > > Hi, > > > > Thank you so much fo
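For the Dataflow part specifically, a hedged sketch assuming the Java SDK (project, region and bucket are placeholders): the runner is usually selected purely through pipeline options, so the transform code itself should not need to change.

    // Assumed project/region/bucket; select the Dataflow runner via options only.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(
            "--runner=DataflowRunner",
            "--project=my-project",
            "--region=us-central1",
            "--tempLocation=gs://my-bucket/tmp")
        .create();
    Pipeline pipeline = Pipeline.create(options);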

Re: [Help] KeyError: 'beam:coders:javasdk:0.1' with KafkaIO and Python external transform

2021-07-21 Thread Ignacio Taranto
OK, I replaced: .apply(KafkaIO.read() .withBootstrapServers(servers) .withTopic(readTopic) .withKeyDeserializer(StringDeserializer.class) .withValueDeserializer(StringDeserializer.class) .withoutMetadata()
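For reference, a sketch of that KafkaIO read with the type parameters spelled out (broker address and topic name are placeholders); withoutMetadata() drops the Kafka metadata and yields plain key/value pairs.

    // Placeholder broker and topic; String keys and values, metadata dropped.
    PCollection<KV<String, String>> messages = pipeline.apply(
        KafkaIO.<String, String>read()
            .withBootstrapServers("broker-1:9092")
            .withTopic("readTopic")
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata());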

[2.28.0] [Java] [Dataflow] ParquetIO writeDynamic stuck in Garbage Collection when writing ~125K files to dynamic destinations

2021-07-21 Thread Andrew Kettmann
Worker machines are n1-standard-2s (2 CPUs and 7.5 GB of RAM). The pipeline is simple but produces a large number of output files, ~125K temp files written in at least one case:
1. Scan Bigtable (NoSQL DB)
2. Transform with business logic
3. Convert to GenericRecord
4. WriteDynamic to a Google bucket
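For context, a hedged sketch of what step 4 can look like with FileIO.writeDynamic and ParquetIO.sink (the input PCollection, routing field, Avro schema and bucket path are assumptions, not the poster's actual code):

    // Assumed routing field, Avro schema and output path.
    records.apply(FileIO.<String, GenericRecord>writeDynamic()
        .by((GenericRecord record) -> record.get("destination").toString())
        .withDestinationCoder(StringUtf8Coder.of())
        .via(ParquetIO.sink(avroSchema))                  // same sink for every destination
        .to("gs://my-bucket/output/")
        .withNaming((String dest) -> FileIO.Write.defaultNaming(dest, ".parquet")));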

Mapping *part* of a PCollection possible? (Lens optics for PCollection?)

2021-07-21 Thread Vincent Marquez
Let's say I have a PCollection<A> and I want to use the 'readAll' pattern to enhance some data from an additional source such as Redis (which has a readKeys PTransform). However, I don't want to 'lose' the original A. There *are* a few ways to do this currently (side inputs, joining two streams with CoGroupByKey
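One hedged sketch of the CoGroupByKey variant (the type A, its key function, and where the enrichment values come from are all placeholders): key both collections the same way, so each original A stays paired with whatever the lookup returns.

    // Placeholders: A, a.key(), and the source of the enrichment values.
    TupleTag<A> originalTag = new TupleTag<A>() {};
    TupleTag<String> enrichedTag = new TupleTag<String>() {};

    PCollection<KV<String, A>> keyedOriginals = originals
        .apply(WithKeys.of((A a) -> a.key()).withKeyType(TypeDescriptors.strings()));
    PCollection<KV<String, String>> keyedEnrichments = ...;  // e.g. values read back from redis, keyed the same way

    PCollection<KV<String, CoGbkResult>> joined =
        KeyedPCollectionTuple.of(originalTag, keyedOriginals)
            .and(enrichedTag, keyedEnrichments)
            .apply(CoGroupByKey.create());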

Re: [2.28.0] [Java] [Dataflow] ParquetIO writeDynamic stuck in Garbage Collection when writing ~125K files to dynamic destinations

2021-07-21 Thread Vincent Marquez
Windowing doesn't work with Batch jobs. You could dump your BQ data to pubsub and then use a streaming job to window. *~Vincent* On Wed, Jul 21, 2021 at 10:13 AM Andrew Kettmann wrote: > Worker machines are n1-standard-2s (2 cpus and 7.5GB of RAM) > > Pipeline is simple, but large amounts of e

Re: [2.28.0] [Java] [Dataflow] ParquetIO writeDynamic stuck in Garbage Collection when writing ~125K files to dynamic destinations

2021-07-21 Thread Andrew Kettmann
Oof, none of the documentation around windowing I have read has said anything about it not working in a batch job. Where did you find that info, if you remember? From: Vincent Marquez Sent: Wednesday, July 21, 2021 12:21 PM To: user Subject: Re: [2.28.0] [Java]

Re: Mapping *part* of a PCollection possible? (Lens optics for PCollection?)

2021-07-21 Thread Andrew Kettmann
Worth noting that you never "lose" a PCollection. You can use the same PCollection in as many transforms as you like and every time you reference that PCollection it will be in the same state it was when you first read it in. So if you have: PCollection colA = ...; PCollection = colA.apply(ParD
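A small sketch of that point (input path and transforms are made up): colA is applied twice and each branch sees the same elements.

    // Made-up input path and transforms; colA feeds two independent branches.
    PCollection<String> colA = pipeline.apply(TextIO.read().from("gs://my-bucket/input/*"));
    PCollection<String> upper = colA.apply("Uppercase",
        MapElements.into(TypeDescriptors.strings()).via((String s) -> s.toUpperCase()));
    PCollection<Integer> lengths = colA.apply("Lengths",
        MapElements.into(TypeDescriptors.integers()).via((String s) -> s.length()));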

Re: Mapping *part* of a PCollection possible? (Lens optics for PCollection?)

2021-07-21 Thread Vincent Marquez
On Wed, Jul 21, 2021 at 10:37 AM Andrew Kettmann wrote: > Worth noting that you never "lose" a PCollection. You can use the same > PCollection in as many transforms as you like and every time you reference > that PCollection it will be in the same state it was when you first read > it in. > > So

Re: Checking a Pcoll is empty in Apache Beam

2021-07-21 Thread Rajnil Guha
Hi, Thanks for your suggestions, I will surely check them out. My exact use-case is to check if the Pcoll is empty, and if it is, publish a message into a Pub/Sub topic. This message will then be further used downstream by some other processes. Thanks & Regards Rajnil Guha On Wed, Jul 21, 2021 a
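A hedged Java sketch of that use-case (the input PCollection and topic name are placeholders, and it is worth confirming that Pub/Sub writes are available for your runner and batch/streaming mode): count the elements globally, keep only a zero count, and turn it into the notification message.

    // Placeholder input and topic; emit one message only when the input had no elements.
    records
        .apply(Count.globally())
        .apply(Filter.by((Long count) -> count == 0L))
        .apply(MapElements.into(TypeDescriptors.strings())
            .via((Long count) -> "no records found in this run"))
        .apply(PubsubIO.writeStrings().to("projects/my-project/topics/empty-input"));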

Dynamic insert sql statement in JDBCIO.write.withStatement (based on inputfile and mapping)

2021-07-21 Thread Sarath Sowri
Hi Team, We are working on an ETL pipeline. As part of some transformations, some columns are deleted from the input PCollection (those that are not mapped to appropriate transactional table columns). Based on the input mapping, the attributes that get imported will vary with the input file and the mapping
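A hedged sketch of that idea (driver, connection string, table name and column list are all assumptions): build the INSERT statement from the mapped columns when the pipeline is constructed for the new input file, then set the parameters positionally from each Row.

    // Assumed connection details and mapped columns for this input file.
    List<String> columns = Arrays.asList("id", "name", "amount");
    String statement = String.format("INSERT INTO transactions (%s) VALUES (%s)",
        String.join(", ", columns),
        columns.stream().map(c -> "?").collect(Collectors.joining(", ")));

    rows.apply(JdbcIO.<Row>write()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "org.postgresql.Driver", "jdbc:postgresql://db-host/etl"))
        .withStatement(statement)
        .withPreparedStatementSetter((Row row, PreparedStatement ps) -> {
          for (int i = 0; i < columns.size(); i++) {
            ps.setObject(i + 1, row.getValue(columns.get(i)));
          }
        }));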

How to create a Schema at runtime and use it

2021-07-21 Thread Sarath Sowri
Hi Team, We are designing an ETL pipeline. The pipeline is invoked when there is a new input file (CSV), and the input file schema (columns) can be dynamic. We are using an Apache Beam Schema to validate the rows in the input file. We are creating a Schema (after reading the header row from the CSV) at runtime in a DoFn and s
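A minimal sketch of the schema-building part (the header line and row values are placeholders, and every column is treated as a nullable string here):

    // Build a Beam Schema from the CSV header at runtime; headerLine is a placeholder.
    String[] header = headerLine.split(",");
    Schema.Builder builder = Schema.builder();
    for (String column : header) {
      builder.addNullableField(column.trim(), Schema.FieldType.STRING);
    }
    Schema schema = builder.build();

    // Rows can then be built against it, e.g. inside the DoFn (values is a placeholder String[]).
    Row row = Row.withSchema(schema).addValues((Object[]) values).build();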