Re: [Question] [CassandraIO] Using a query

2023-02-07 Thread Vincent Marquez
…draToBigtable.java#L220), a BeamtoBigtableFn. Any hints on how to add a write/output to Bigtable? TIA, Adam. On Fri, Feb 3, 2023 at 1:25 PM Vincent Marquez wrote: There are some examples in the test code that should be ea…

Re: [Question] [CassandraIO] Using a query

2023-02-03 Thread Vincent Marquez
There are some examples in the test code that should be easy enough to follow. Here is an example of just querying the entire table: https://github.com/apache/beam/blob/master/sdks/java/io/cassandra/src/test/java/org/apache/beam/sdk/io/cassandra/CassandraIOTest.java#L460 Here's an example of usi
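A minimal sketch of the query variant, loosely following the shape of those tests; the keyspace, table, query, and entity class below are placeholders, not the ones in the linked file:

    // Scientist is a Serializable, Cassandra-mapped entity class (placeholder).
    PCollection<Scientist> rows = pipeline.apply(
        CassandraIO.<Scientist>read()
            .withHosts(Collections.singletonList("localhost"))
            .withPort(9042)
            .withKeyspace("my_keyspace")
            .withTable("scientist")
            // Restrict the read to a query instead of scanning the whole table.
            .withQuery("SELECT * FROM my_keyspace.scientist WHERE person_department = 'math' ALLOW FILTERING")
            .withEntity(Scientist.class)
            .withCoder(SerializableCoder.of(Scientist.class)));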

Re: Beam CassandraIO

2023-02-02 Thread Vincent Marquez
*~Vincent* On Thu, Feb 2, 2023 at 3:01 AM Alexey Romanenko wrote: - d...@beam.apache.org + user@beam.apache.org. Hi Enzo, Can you make sure that all your workers were properly added and listed in Spark WebUI? Did you specify a “--master spark://HOST:PORT” option while running…

Re: Kafka manually commit offsets

2021-12-13 Thread Vincent Marquez
…for most of the pipeline. The checkpoint advancement code is likely limited to the number of partitions but can be a very small portion of the pipeline. On Fri, Dec 10, 2021 at 10:20 AM Vincent Marquez <vincent.marq...@gmail.com> wrote: If you want to en…

Re: Kafka manually commit offsets

2021-12-10 Thread Vincent Marquez
If you want to ensure you have at least once processing I think the *maximum* amount of parallelization you can have would be the number of partitions you have, so you'd want to group by partition, process a bundle of that partition, then commit the last offset for a given partition. *~Vincent*
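A rough sketch of that shape, assuming the KafkaRecord metadata is kept and records are batched per partition. The commitOffsets helper at the end is hypothetical (a thin wrapper around a plain KafkaConsumer.commitSync), since Beam itself doesn't expose a per-record commit API here:

    p.apply(KafkaIO.<String, String>read()
            .withBootstrapServers("broker:9092")
            .withTopic("events")
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class))
     // Key every record by its Kafka partition.
     .apply(ParDo.of(new DoFn<KafkaRecord<String, String>, KV<Integer, KafkaRecord<String, String>>>() {
       @ProcessElement
       public void keyByPartition(ProcessContext c) {
         c.output(KV.of(c.element().getPartition(), c.element()));
       }
     }))
     // Depending on SDK version you may need to set the KV coder explicitly here.
     .apply(GroupIntoBatches.ofSize(500))
     .apply(ParDo.of(new DoFn<KV<Integer, Iterable<KafkaRecord<String, String>>>, Void>() {
       @ProcessElement
       public void processBatch(ProcessContext c) {
         long lastOffset = -1L;
         for (KafkaRecord<String, String> record : c.element().getValue()) {
           // ... do the real per-record work here ...
           lastOffset = Math.max(lastOffset, record.getOffset());
         }
         // Hypothetical helper: commitSync the next offset for this partition.
         commitOffsets("events", c.element().getKey(), lastOffset + 1);
       }
     }));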

Re: Mapping *part* of a PCollection possible? (Lens optics for PCollection?)

2021-07-21 Thread Vincent Marquez
1. Do whatever else you need to do with the combined data. -- *From:* Vincent Marquez *Sent:* Wednesday, July 21, 2021 12:14 PM *To:* user *Subject:* Mapping *part* of a PCollection possible? (Lens optics for PCollection?)

Re: [2.28.0] [Java] [Dataflow] ParquetIO writeDynamic stuck in Garbage Collection when writing ~125K files to dynamic destinations

2021-07-21 Thread Vincent Marquez
Windowing doesn't work with Batch jobs. You could dump your BQ data to pubsub and then use a streaming job to window. *~Vincent* On Wed, Jul 21, 2021 at 10:13 AM Andrew Kettmann wrote: Worker machines are n1-standard-2s (2 CPUs and 7.5GB of RAM). Pipeline is simple, but large amounts of e…

Mapping *part* of a PCollection possible? (Lens optics for PCollection?)

2021-07-21 Thread Vincent Marquez
Let's say I have PCollection and I want to use the 'readAll' pattern to enhance some data from an additional source such as redis (which has a readKeys PTransform). However I don't want to 'lose' the original A. There *are* a few ways to do this currently (side inputs, joining two streams with Co
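A concrete version of the CoGroupByKey option; the record types and the lookUpInRedis step are placeholders (not RedisIO's real API), but the join machinery is the standard one:

    final TupleTag<MyRecord> originalTag = new TupleTag<>();
    final TupleTag<String> enrichmentTag = new TupleTag<>();

    // Key the originals by the lookup key so they can be joined back later.
    PCollection<KV<String, MyRecord>> keyedOriginals = originals
        .apply(WithKeys.of((MyRecord r) -> r.getLookupKey())
                       .withKeyType(TypeDescriptors.strings()));

    // Placeholder for the readAll-style enrichment (e.g. a redis lookup per key).
    PCollection<KV<String, String>> enrichment = lookUpInRedis(keyedOriginals.apply(Keys.create()));

    // Join originals and enrichment on the key; the original A is never lost.
    PCollection<KV<String, CoGbkResult>> joined = KeyedPCollectionTuple
        .of(originalTag, keyedOriginals)
        .and(enrichmentTag, enrichment)
        .apply(CoGroupByKey.create());

    joined.apply(ParDo.of(new DoFn<KV<String, CoGbkResult>, EnrichedRecord>() {
      @ProcessElement
      public void process(ProcessContext c) {
        MyRecord original = c.element().getValue().getOnly(originalTag);
        Iterable<String> values = c.element().getValue().getAll(enrichmentTag);
        c.output(new EnrichedRecord(original, values));
      }
    }));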

Re: Rate Limiting in Beam

2021-06-18 Thread Vincent Marquez
…to put an upper bound on the total concurrency of a step. On Thu, Jun 17, 2021 at 4:54 PM Vincent Marquez wrote: Do individual stages of a beam job exhibit backpressure to the consumer though? I would think buffering elements with Beam…

Re: Rate Limiting in Beam

2021-06-17 Thread Vincent Marquez
Do individual stages of a beam job exhibit backpressure to the consumer though? I would think buffering elements with Beam's BagState might lead to OOM errors on the workers if the consumerIO continues to feed in data. Or does something else happen? --Vincent On Thu, Jun 17, 2021 at 11:42 AM Lu

Re: Is there a way (settings) to limit the number of elements per worker machine

2021-06-02 Thread Vincent Marquez
On Wed, Jun 2, 2021 at 11:27 AM Robert Bradshaw wrote: On Wed, Jun 2, 2021 at 11:18 AM Vincent Marquez wrote: On Wed, Jun 2, 2021 at 11:11 AM Robert Bradshaw wrote: If you want to control the total number of elements being proc…

Re: Is there a way (settings) to limit the number of elements per worker machine

2021-06-02 Thread Vincent Marquez
On Wed, Jun 2, 2021 at 11:11 AM Robert Bradshaw wrote: If you want to control the total number of elements being processed across all workers at a time, you can do this by assigning random keys of the form RandomInteger() % TotalDesiredConcurrency followed by a GroupByKey. If you want…
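Roughly what that suggestion looks like in code; the element type, concurrency value, and the throttled call are placeholders, and in a streaming pipeline you would also need a windowing/triggering strategy before the GroupByKey:

    int totalDesiredConcurrency = 20;

    input
        // Assign each element a random key in [0, totalDesiredConcurrency).
        .apply(WithKeys.of((MyElement e) -> ThreadLocalRandom.current().nextInt(totalDesiredConcurrency))
                       .withKeyType(TypeDescriptors.integers()))
        // Grouping caps the work at ~20 keys, i.e. at most 20 groups in flight at once.
        .apply(GroupByKey.create())
        .apply(ParDo.of(new DoFn<KV<Integer, Iterable<MyElement>>, MyResult>() {
          @ProcessElement
          public void process(ProcessContext c) {
            for (MyElement e : c.element().getValue()) {
              c.output(callRateLimitedService(e));  // placeholder for the expensive call
            }
          }
        }));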

Re: File processing triggered from external source

2021-05-25 Thread Vincent Marquez
On Tue, May 25, 2021 at 11:14 AM Sozonoff Serge wrote: Hi, thanks for the clarification. What is an issue with applying windowing/triggering strategy for your case? The problem was actually not the trigger but the whole approach I took. I guess fundamentally the whole issue…

Re: Checkpointing Dataflow Pipeline

2021-04-07 Thread Vincent Marquez
…has come up a few times in the beam slack so I think at the very least some extra documentation on these types of use cases might be welcome. On Wed, Apr 7, 2021 at 11:28 AM Vincent Marquez wrote: Looks like this is a common source of confusion, I had similar questions…

Re: Checkpointing Dataflow Pipeline

2021-04-07 Thread Vincent Marquez
Looks like this is a common source of confusion, I had similar questions about checkpointing in the beam slack. In Spark Structured Streaming, checkpoints are saved to an *external* HDFS location and persist *beyond* each run, so in the event of a stream crashing, you can just point your next exec

Re: Write to multiple IOs in linear fashion

2021-03-24 Thread Vincent Marquez
…builders) that compose well with Wait, etc. would be welcome. On Wed, Mar 24, 2021 at 11:14 AM Alexey Romanenko <aromanenko@gmail.com> wrote: In this way, I think…

Re: Write to multiple IOs in linear fashion

2021-03-24 Thread Vincent Marquez
…On Wed, Mar 24, 2021 at 9:49 AM Alexey Romanenko <aromanenko@gmail.com> wrote: Do you want to wait for ALL records to be written for Cassandra and then write all successfully written records to PubSub, or should it be performed…

Re: Write to multiple IOs in linear fashion

2021-03-24 Thread Vincent Marquez
…for Cassandra and then write all successfully written records to PubSub, or should it be performed "record by record"? On 24 Mar 2021, at 04:58, Vincent Marquez wrote: I have a common use case where my pipeline looks like this: CassandraIO.readAll ->…

Write to multiple IOs in linear fashion

2021-03-23 Thread Vincent Marquez
I have a common use case where my pipeline looks like this: CassandraIO.readAll -> Aggregate -> CassandraIO.write -> PubSubIO.write I do NOT want my pipeline to look like the following: CassandraIO.readAll -> Aggregate -> CassandraIO.write
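For reference, the shape I'm after would look something like this if the Cassandra write produced an output to wait on. The CassandraWriteWithResult step below is hypothetical (CassandraIO.write() returns PDone today, which is the crux of the question), while Wait.on is the real built-in transform:

    PCollection<AggRow> aggregated = input.apply(new Aggregate());  // placeholder upstream steps

    // Hypothetical: a Cassandra write that re-emits the rows it has written.
    PCollection<AggRow> writeSignal = aggregated.apply(new CassandraWriteWithResult());

    aggregated
        // Hold back elements until the corresponding write signal is complete.
        .apply(Wait.on(writeSignal))
        .apply(MapElements.into(TypeDescriptors.strings()).via((AggRow r) -> r.toJson()))
        .apply(PubsubIO.writeStrings().to("projects/my-project/topics/agg-topic"));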

KafkaIO.read without dataloss?

2021-02-22 Thread Vincent Marquez
Forgive me for the long email; I figured more details would be better. I also asked on SO if you prefer there: https://stackoverflow.com/questions/66325929/dataflow-reading-from-kafka-without-data-loss We're currently big users of Beam/Dataflow batch…
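For reference, the kind of read in question is roughly the following sketch (broker, topic, and group id are placeholders); commitOffsetsInFinalize() is the built-in knob that ties offset commits to checkpoint finalization:

    p.apply(KafkaIO.<String, String>read()
        .withBootstrapServers("broker:9092")
        .withTopic("events")
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        .withConsumerConfigUpdates(ImmutableMap.of(ConsumerConfig.GROUP_ID_CONFIG, "my-group"))
        // Only commit offsets back to Kafka once the reader's checkpoint is finalized.
        .commitOffsetsInFinalize());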

Re: Doubts on Looping inside a beam transform. Processing sequentially using Apache Beam

2020-12-15 Thread Vincent Marquez
Hi Feba, I can't say for sure *where* your pipeline is running out of memory, but I'm going to guess that it's due to the fact that CassandraIO currently only has the ability to read an entire table, or to have a single query attached. So if you are calling CassandraIO.read() that grabs all the…

Re: Global window with Bounded Source

2020-04-16 Thread Vincent Marquez
I actually ran into the same issue, and would love some guidance! I had a list of avro files within folders in GCS, each folder representing a single day, and I needed to de-dupe events per day (by a key). I didn't want a GroupByKey to hold billions of events when it didn't matter, so I added a
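A rough sketch of the shape I was going for; the Event type, timestamp field, and de-dupe key are placeholders:

    PCollection<Event> dedupedPerDay = events
        // Timestamp each element from the event itself.
        .apply(WithTimestamps.of((Event e) -> new Instant(e.getEventTimeMillis())))
        // One-day windows, so state only ever has to hold a single day of keys.
        .apply(Window.<Event>into(FixedWindows.of(Duration.standardDays(1))))
        // De-dupe per key within each daily window.
        .apply(Distinct.withRepresentativeValueFn((Event e) -> e.getDedupeKey())
                       .withRepresentativeType(TypeDescriptors.strings()));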

Re: Input File Tracking

2020-04-13 Thread Vincent Marquez
On first glance it sounds like a problem for a persistent queue such as Kafka or Google Cloud's pubsub. You could write a path to the queue upon download, which would trigger Beam to read the file and then bump the offset only upon completion of the read to the queue. If the read of the file fail
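A sketch of that wiring using the built-in IOs; the subscription name and the downstream DoFn are placeholders:

    p.apply(PubsubIO.readStrings()
            .fromSubscription("projects/my-project/subscriptions/new-files"))
     // Each message is a file path/pattern written by the downloader.
     .apply(FileIO.matchAll())
     .apply(FileIO.readMatches())
     .apply(TextIO.readFiles())
     .apply(ParDo.of(new ProcessLineFn()));  // placeholder per-line processing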