Beam Pipeline: storing files in CloudBucket / overriding

2021-07-21 Thread Sofia’s World
Hi all, if I remember correctly, when my pipeline writes a file to a GCP bucket and the file already exists, the default behaviour is that the file is not overridden. What is the pattern to follow if I want the existing file to be overwritten every time the pipeline runs and the file is stored to the GCP bucket?
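A minimal Java sketch of such a write, with hypothetical bucket, path and contents: pinning the output to one deterministic filename means every run targets the same object, and whether that object then gets replaced (or has to be deleted first) is exactly the behaviour being asked about.

    // Hypothetical bucket/path/contents: one deterministically named output file per run.
    pipeline
        .apply(Create.of("record-1", "record-2"))
        .apply(TextIO.write()
            .to("gs://my-bucket/reports/daily-report")   // same prefix on every run
            .withSuffix(".txt")
            .withoutSharding());                          // single file: daily-report.txt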

Re: Checking a Pcoll is empty in Apache Beam

2021-07-21 Thread Rajnil Guha
Yes, I am just thinking about how to modify/rewrite this piece of code if I want to run my pipeline on the Dataflow runner. Thanks & Regards Rajnil Guha On Wed, Jul 21, 2021 at 1:12 AM Robert Bradshaw wrote: > On Tue, Jul 20, 2021 at 12:33 PM Rajnil Guha > wrote: > > > > Hi, > > > > Thank you so much fo
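For the Dataflow part specifically, a hedged sketch assuming the Java SDK (project, region and bucket are placeholders): the runner is usually selected purely through pipeline options, so the transform code itself should not need to change.

    // Assumed project/region/bucket; select the Dataflow runner via options only.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(
            "--runner=DataflowRunner",
            "--project=my-project",
            "--region=us-central1",
            "--tempLocation=gs://my-bucket/tmp")
        .create();
    Pipeline pipeline = Pipeline.create(options);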

Re: [Help] KeyError: 'beam:coders:javasdk:0.1' with KafkaIO and Python external transform

2021-07-21 Thread Ignacio Taranto
OK, I replaced: .apply(KafkaIO.read() .withBootstrapServers(servers) .withTopic(readTopic) .withKeyDeserializer(StringDeserializer.class) .withValueDeserializer(StringDeserializer.class) .withoutMetadata()
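For reference, a sketch of that KafkaIO read with the type parameters spelled out (broker address and topic name are placeholders); withoutMetadata() drops the Kafka metadata and yields plain key/value pairs.

    // Placeholder broker and topic; String keys and values, metadata dropped.
    PCollection<KV<String, String>> messages = pipeline.apply(
        KafkaIO.<String, String>read()
            .withBootstrapServers("broker-1:9092")
            .withTopic("readTopic")
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata());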

[2.28.0] [Java] [Dataflow] ParquetIO writeDynamic stuck in Garbage Collection when writing ~125K files to dynamic destinations

2021-07-21 Thread Andrew Kettmann
Worker machines are n1-standard-2s (2 CPUs and 7.5 GB of RAM). The pipeline is simple but produces a large number of output files, ~125K temp files written in at least one case:
1. Scan Bigtable (NoSQL DB)
2. Transform with business logic
3. Convert to GenericRecord
4. WriteDynamic to a Google bucket
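For context, a hedged sketch of what step 4 can look like with FileIO.writeDynamic and ParquetIO.sink (the input PCollection, routing field, Avro schema and bucket path are assumptions, not the poster's actual code):

    // Assumed routing field, Avro schema and output path.
    records.apply(FileIO.<String, GenericRecord>writeDynamic()
        .by((GenericRecord record) -> record.get("destination").toString())
        .withDestinationCoder(StringUtf8Coder.of())
        .via(ParquetIO.sink(avroSchema))                  // same sink for every destination
        .to("gs://my-bucket/output/")
        .withNaming((String dest) -> FileIO.Write.defaultNaming(dest, ".parquet")));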

Mapping *part* of a PCollection possible? (Lens optics for PCollection?)

2021-07-21 Thread Vincent Marquez
Let's say I have a PCollection<A> and I want to use the 'readAll' pattern to enhance some data from an additional source such as Redis (which has a readKeys PTransform). However, I don't want to 'lose' the original A. There *are* a few ways to do this currently (side inputs, joining two streams with CoGroupByKey
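One hedged sketch of the CoGroupByKey variant (the type A, its key function, and where the enrichment values come from are all placeholders): key both collections the same way, so each original A stays paired with whatever the lookup returns.

    // Placeholders: A, a.key(), and the source of the enrichment values.
    TupleTag<A> originalTag = new TupleTag<A>() {};
    TupleTag<String> enrichedTag = new TupleTag<String>() {};

    PCollection<KV<String, A>> keyedOriginals = originals
        .apply(WithKeys.of((A a) -> a.key()).withKeyType(TypeDescriptors.strings()));
    PCollection<KV<String, String>> keyedEnrichments = ...;  // e.g. values read back from redis, keyed the same way

    PCollection<KV<String, CoGbkResult>> joined =
        KeyedPCollectionTuple.of(originalTag, keyedOriginals)
            .and(enrichedTag, keyedEnrichments)
            .apply(CoGroupByKey.create());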

Re: [2.28.0] [Java] [Dataflow] ParquetIO writeDynamic stuck in Garbage Collection when writing ~125K files to dynamic destinations

2021-07-21 Thread Vincent Marquez
Windowing doesn't work with Batch jobs. You could dump your BQ data to pubsub and then use a streaming job to window. *~Vincent* On Wed, Jul 21, 2021 at 10:13 AM Andrew Kettmann wrote: > Worker machines are n1-standard-2s (2 cpus and 7.5GB of RAM) > > Pipeline is simple, but large amounts of e

Re: [2.28.0] [Java] [Dataflow] ParquetIO writeDynamic stuck in Garbage Collection when writing ~125K files to dynamic destinations

2021-07-21 Thread Andrew Kettmann
Oof, none of the documentation around windowing I have read has said anything about it not working in a batch job. Where did you find that info, if you remember? From: Vincent Marquez Sent: Wednesday, July 21, 2021 12:21 PM To: user Subject: Re: [2.28.0] [Java]

Re: Mapping *part* of a PCollection possible? (Lens optics for PCollection?)

2021-07-21 Thread Andrew Kettmann
Worth noting that you never "lose" a PCollection. You can use the same PCollection in as many transforms as you like and every time you reference that PCollection it will be in the same state it was when you first read it in. So if you have: PCollection colA = ...; PCollection = colA.apply(ParD
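A small sketch of that point (input path and transforms are made up): colA is applied twice and each branch sees the same elements.

    // Made-up input path and transforms; colA feeds two independent branches.
    PCollection<String> colA = pipeline.apply(TextIO.read().from("gs://my-bucket/input/*"));
    PCollection<String> upper = colA.apply("Uppercase",
        MapElements.into(TypeDescriptors.strings()).via((String s) -> s.toUpperCase()));
    PCollection<Integer> lengths = colA.apply("Lengths",
        MapElements.into(TypeDescriptors.integers()).via((String s) -> s.length()));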

Re: Mapping *part* of a PCollection possible? (Lens optics for PCollection?)

2021-07-21 Thread Vincent Marquez
On Wed, Jul 21, 2021 at 10:37 AM Andrew Kettmann wrote: > Worth noting that you never "lose" a PCollection. You can use the same > PCollection in as many transforms as you like and every time you reference > that PCollection it will be in the same state it was when you first read > it in. > > So

Re: Checking a Pcoll is empty in Apache Beam

2021-07-21 Thread Rajnil Guha
Hi, Thanks for your suggestions, I will surely check them out. My exact use-case is to check if the Pcoll is empty, and if it is, publish a message into a Pub/Sub topic. This message will then be further used downstream by some other processes. Thanks & Regards Rajnil Guha On Wed, Jul 21, 2021 a
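A hedged Java sketch of that use-case (the input PCollection and topic name are placeholders, and it is worth confirming that Pub/Sub writes are available for your runner and batch/streaming mode): count the elements globally, keep only a zero count, and turn it into the notification message.

    // Placeholder input and topic; emit one message only when the input had no elements.
    records
        .apply(Count.globally())
        .apply(Filter.by((Long count) -> count == 0L))
        .apply(MapElements.into(TypeDescriptors.strings())
            .via((Long count) -> "no records found in this run"))
        .apply(PubsubIO.writeStrings().to("projects/my-project/topics/empty-input"));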

Dynamic insert sql statement in JDBCIO.write.withStatement (based on inputfile and mapping)

2021-07-21 Thread Sarath Sowri
Hi Team, We are working on an ETL pipeline. As part of some transformations, some columns are deleted from the input PCollection (those that are not mapped to appropriate transactional table columns). Based on the input mapping, the attributes that get imported will vary with the input file and the mapping
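A hedged sketch of that idea (driver, connection string, table name and column list are all assumptions): build the INSERT statement from the mapped columns when the pipeline is constructed for the new input file, then set the parameters positionally from each Row.

    // Assumed connection details and mapped columns for this input file.
    List<String> columns = Arrays.asList("id", "name", "amount");
    String statement = String.format("INSERT INTO transactions (%s) VALUES (%s)",
        String.join(", ", columns),
        columns.stream().map(c -> "?").collect(Collectors.joining(", ")));

    rows.apply(JdbcIO.<Row>write()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "org.postgresql.Driver", "jdbc:postgresql://db-host/etl"))
        .withStatement(statement)
        .withPreparedStatementSetter((Row row, PreparedStatement ps) -> {
          for (int i = 0; i < columns.size(); i++) {
            ps.setObject(i + 1, row.getValue(columns.get(i)));
          }
        }));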

How to create a Schema at runtime and use it

2021-07-21 Thread Sarath Sowri
Hi Team, We are designing an ETL pipeline. The pipeline is invoked when there is a new input file (CSV), and the input file schema (columns) can be dynamic. We are using an Apache Beam Schema to validate the rows in the input file. We are creating a Schema (after reading the header row from the CSV) at runtime in a DoFn and s
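A minimal sketch of the schema-building part (the header line and row values are placeholders, and every column is treated as a nullable string here):

    // Build a Beam Schema from the CSV header at runtime; headerLine is a placeholder.
    String[] header = headerLine.split(",");
    Schema.Builder builder = Schema.builder();
    for (String column : header) {
      builder.addNullableField(column.trim(), Schema.FieldType.STRING);
    }
    Schema schema = builder.build();

    // Rows can then be built against it, e.g. inside the DoFn (values is a placeholder String[]).
    Row row = Row.withSchema(schema).addValues((Object[]) values).build();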