Keying a data stream by tenant and performing ML on each sub-stream - help

2023-07-18 Thread Catalin Stavaru
Hello everyone, Here is my use case: I have an event data stream which I need to key by a certain field (tenant id) and then for each tenant's events I need to independently perform ML clustering using FlinkML's OnlineKMeans component. I am using Java. I tried different approaches but none of the
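A minimal sketch of the keying part, assuming the per-tenant model can live in keyed state: key the stream by tenant id and update the centroids inside a KeyedProcessFunction. This is not FlinkML's OnlineKMeans (which operates on Tables rather than on per-key sub-streams); TenantEvent, getTenantId(), features() and ClusterAssignment are hypothetical placeholders.

DataStream<TenantEvent> events = ...;  // hypothetical event type carrying a tenant id

events
    .keyBy(TenantEvent::getTenantId)
    .process(new KeyedProcessFunction<String, TenantEvent, ClusterAssignment>() {
        // one set of centroids per tenant, held in keyed state
        private transient ValueState<double[][]> centroids;

        @Override
        public void open(Configuration parameters) {
            centroids = getRuntimeContext().getState(
                new ValueStateDescriptor<>("centroids", double[][].class));
        }

        @Override
        public void processElement(TenantEvent event, Context ctx,
                                   Collector<ClusterAssignment> out) throws Exception {
            double[][] model = centroids.value();  // null on the first event of a tenant
            // ... assign event.features() to the nearest centroid and update it online ...
            centroids.update(model);
        }
    });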

Re: Backpressure due to busy sub-tasks

2022-12-16 Thread Alexis Sarda-Espinosa
Hi Martijn, yes, that's what I meant, the throughput in the process function(s) didn't change, so even if they were busy 100% of the time with parallelism=2, they were processing data quickly enough. Regards, Alexis. On Fri, Dec 16, 2022 at 14:20, Martijn Visser < martijnvis...@apach

Re: Backpressure due to busy sub-tasks

2022-12-16 Thread Martijn Visser
Hi, Backpressure implies that it's actually a later operator that is busy. So in this case, that would be your process function that can't handle the incoming load from your Kafka source. Best regards, Martijn On Tue, Dec 13, 2022 at 7:46 PM Alexis Sarda-Espinosa < sarda.espin...@gmail.com> wro

Backpressure due to busy sub-tasks

2022-12-13 Thread Alexis Sarda-Espinosa
Hello, I have a Kafka source (the new one) in Flink 1.15 that's followed by a process function with parallelism=2. Some days, I see long periods of backpressure in the source. During those times, the pool-usage metrics of all tasks stay between 0 and 1%, but the process function appears 100% busy.
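For reference, a minimal sketch of the topology being described (a Flink 1.15 KafkaSource followed by a process function with parallelism 2); the broker address, topic and group id are placeholders. If the source reports backpressure while its own buffer-pool usage stays low, the slow part is the downstream operator or whatever follows it.

KafkaSource<String> source = KafkaSource.<String>builder()
        .setBootstrapServers("broker:9092")                 // placeholder
        .setTopics("events")                                // placeholder
        .setGroupId("my-group")                             // placeholder
        .setStartingOffsets(OffsetsInitializer.earliest())
        .setValueOnlyDeserializer(new SimpleStringSchema())
        .build();

env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source")
   .process(new ProcessFunction<String, String>() {
       @Override
       public void processElement(String value, Context ctx, Collector<String> out) {
           // if this work (or anything downstream of it) is slow, the source
           // shows backpressure even though its own pool usage stays near 0%
           out.collect(value);
       }
   })
   .setParallelism(2);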

Stateful function with GCP Pub/Sub ingress/egress

2022-03-16 Thread David Dixon
The statefun docs have some nice examples of how to use Kafka and Kinesis for ingress/egress in conjunction with a function. Is there some documentation or example code I could reference to do the same with a GCP Pub/Sub topic? Thanks. Dave

Re: Do data streams partitioned by same keyBy but from different operators converge to the same sub-task?

2021-03-29 Thread David Anderson
Yes, since the two streams have the same type, you can union the two streams, key the resulting stream, and then apply something like a RichFlatMapFunction. Or you can connect the two streams (again, they'll need to be keyed so you can use state), and apply a RichCoFlatMapFunction. You can use whic
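A minimal sketch of the first option, assuming both streams carry the same (hypothetical) Event type and are keyed by getKey(): union them, key the result, and keep keyed state in a RichFlatMapFunction. Records from either upstream operator that share a key are routed to the same sub-task and see the same state (the Result output type is also a placeholder).

DataStream<Event> merged = streamFromO1.union(streamFromO2);

merged
    .keyBy(Event::getKey)
    .flatMap(new RichFlatMapFunction<Event, Result>() {
        private transient ValueState<Long> seen;

        @Override
        public void open(Configuration parameters) {
            seen = getRuntimeContext().getState(
                new ValueStateDescriptor<>("seen", Long.class));
        }

        @Override
        public void flatMap(Event event, Collector<Result> out) throws Exception {
            // events from both upstream operators with the same key end up here
            Long count = seen.value();
            seen.update(count == null ? 1L : count + 1);
        }
    });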

Re: Do data streams partitioned by same keyBy but from different operators converge to the same sub-task?

2021-03-29 Thread vishalovercome
I've gone through the example as well as the documentation and I still couldn't understand whether my use case requires joining. 1. What would happen if I didn't join? 2. As the 2 incoming data streams have the same type, if joining is absolutely necessary then just a union (oneStream.union(anotherS

Re: Do data streams partitioned by same keyBy but from different operators converge to the same sub-task?

2021-03-24 Thread vishalovercome
Let me make the example more concrete. Say O1 gets as input a data stream T1 which it splits into two using some function and produces DataStreams of type T2 and T3, each of which are partitioned by the same key function TK. Now after O2 processes a stream, it could sometimes send the stream to O3

Re: Do data streams partitioned by same keyBy but from different operators converge to the same sub-task?

2021-03-24 Thread David Anderson
For an example of a similar join implemented as a RichCoFlatMap, see [1]. For more background, the Flink docs have a tutorial [2] on how to work with connected streams. [1] https://github.com/apache/flink-training/tree/master/rides-and-fares [2] https://ci.apache.org/projects/flink/flink-docs-stab
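The connected-streams variant from the rides-and-fares exercise looks roughly like the sketch below; Ride, Fare and their getRideId() accessors stand in for the exercise's types. Each side buffers its event in keyed state until the matching event from the other side arrives.

rides.connect(fares)
    .keyBy(Ride::getRideId, Fare::getRideId)
    .flatMap(new RichCoFlatMapFunction<Ride, Fare, Tuple2<Ride, Fare>>() {
        private transient ValueState<Ride> rideState;
        private transient ValueState<Fare> fareState;

        @Override
        public void open(Configuration parameters) {
            rideState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("ride", Ride.class));
            fareState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("fare", Fare.class));
        }

        @Override
        public void flatMap1(Ride ride, Collector<Tuple2<Ride, Fare>> out) throws Exception {
            Fare fare = fareState.value();
            if (fare != null) {
                fareState.clear();               // matching fare already arrived
                out.collect(Tuple2.of(ride, fare));
            } else {
                rideState.update(ride);          // wait for the fare
            }
        }

        @Override
        public void flatMap2(Fare fare, Collector<Tuple2<Ride, Fare>> out) throws Exception {
            Ride ride = rideState.value();
            if (ride != null) {
                rideState.clear();               // matching ride already arrived
                out.collect(Tuple2.of(ride, fare));
            } else {
                fareState.update(fare);          // wait for the ride
            }
        }
    });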

Re: Do data streams partitioned by same keyBy but from different operators converge to the same sub-task?

2021-03-24 Thread Matthias Pohl
1. yes - the same key would affect the same state variable 2. you need a join to have the same operator process both streams Matthias On Wed, Mar 24, 2021 at 7:29 AM vishalovercome wrote: > Let me make the example more concrete. Say O1 gets as input a data stream > T1 > which it splits into two

Do data streams partitioned by same keyBy but from different operators converge to the same sub-task?

2021-03-23 Thread vishalovercome
type and value for a key then will the two streams end up in the same sub-task and therefore affect the same state variables keyed to that particular key? Do the streams themselves have to have the same type or is it enough that just the keys of each of the input streams have the same type and va

Re: Do data streams partitioned by same keyBy but from different operators converge to the same sub-task?

2021-03-23 Thread Matthias Pohl
stream partitioned by keyBy > > If operator O3 receives inputs from two operators and both inputs have the > same type and value for a key then will the two streams end up in the same > sub-task and therefore affect the same state variables keyed to that > particular key? Do the streams

Re: Change in sub-task id assignment from 1.9 to 1.10?

2020-08-12 Thread Zhu Zhu
that each slot can contain a source task. With config cluster.evenly-spread-out-slots set to true, slots can be evenly distributed across all available TaskManagers in most cases. Thanks, Zhu Zhu. Ken Krugler wrote on Fri, Aug 7, 2020 at 5:28 AM: > Hi all, > > Was there any change in how sub-tasks get all

Change in sub-task id assignment from 1.9 to 1.10?

2020-08-06 Thread Ken Krugler
Hi all, Was there any change in how sub-tasks get allocated to TMs, from Flink 1.9 to 1.10? Specifically for consecutively numbered sub-tasks (e.g. 0, 1, 2) did it become more or less likely that they’d be allocated to the same Task Manager? Asking because a workflow that ran fine in 1.9 now

Re: sub

2020-04-14 Thread Sivaprasanna
Hi, To subscribe, you have to send a mail to user-subscr...@flink.apache.org On Wed, 15 Apr 2020 at 7:33 AM, lamber-ken wrote: > user@flink.apache.org >

sub

2020-04-14 Thread lamber-ken
user@flink.apache.org

Re: Colocating Sub-Tasks across jobs / Sharing Task Slots across jobs

2020-02-27 Thread Andrey Zagrebin
>> Hello all! >>>> >>>> I have a setup composed of several streaming pipelines. These have >>>> different deployment lifecycles: I want to be able to modify and redeploy >>>> the topology of one while the other is still up. I am thus putting the

Re: Colocating Sub-Tasks across jobs / Sharing Task Slots across jobs

2020-02-25 Thread Xintong Song
gy of one while the other is still up. I am thus putting them in >>> different jobs. >>> >>> The problem is I have a Co-Location constraint between one subtask of >>> each pipeline; I'd like them to run on the same TaskSlots, much like if >>&

Re: Colocating Sub-Tasks across jobs / Sharing Task Slots across jobs

2020-02-25 Thread Benoît Paris
like if >> they were sharing a TaskSlot; or at least have them on the same JVM. >> >> A semi-official feature >> "DataStream.getTransformation().setCoLocationGroupKey(stringKey)" [1] >> exists for this, but seem to be tied to the Sub-Tasks actually being able &g

Re: Colocating Sub-Tasks across jobs / Sharing Task Slots across jobs

2020-02-24 Thread Xintong Song
fficial feature > "DataStream.getTransformation().setCoLocationGroupKey(stringKey)" [1] > exists for this, but seem to be tied to the Sub-Tasks actually being able > to be co-located on the same Task Slot. > > The documentation mentions [2] that it might be impossible to

Colocating Sub-Tasks across jobs / Sharing Task Slots across jobs

2020-02-24 Thread Benoît Paris
between one subtask of each pipeline; I'd like them to run on the same TaskSlots, much like if they were sharing a TaskSlot; or at least have them on the same JVM. A semi-official feature "DataStream.getTransformation().setCoLocationGroupKey(stringKey)" [1] exists for this, but seems to be
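Within a single job, the semi-official API mentioned above can be used roughly as in this sketch (the group key string is arbitrary and the mapper classes are placeholders); sub-tasks with the same index of the two operators are then scheduled into the same slot, so both operators typically need the same parallelism. As the replies discuss, this does not extend across separate jobs.

SingleOutputStreamOperator<A> first  = input.map(new FirstMapper());   // hypothetical operators
SingleOutputStreamOperator<B> second = first.map(new SecondMapper());

first.getTransformation().setCoLocationGroupKey("colocate-these");
second.getTransformation().setCoLocationGroupKey("colocate-these");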

Re: Sub-user

2020-01-02 Thread vino yang
Hi Jary, All of Flink's mailing list information can be found here [1]. [1]: https://flink.apache.org/community.html#mailing-lists Best, Vino. Benchao Li wrote on Thu, Jan 2, 2020 at 4:56 PM: > Hi Jary, > > You need to send an email to *user-subscr...@flink.apache.org > * to subscribe, not user@flink.apache.o

Re: Sub-user

2020-01-02 Thread Benchao Li
Hi Jary, You need to send an email to *user-subscr...@flink.apache.org * to subscribe, not user@flink.apache.org. Jary Zhen wrote on Thu, Jan 2, 2020 at 4:53 PM: > > -- Benchao Li School of Electronics Engineering and Computer Science, Peking University Tel:+86-15650713730 Email: libenc...@gmail.com; libenc

Sub-user

2020-01-02 Thread Jary Zhen

Re: All sub-tasks stuck in CREATED state after "BlobServerConnection: GET operation failed"

2018-03-29 Thread Nico Kruber
)start but what makes it worse is that it didn't exit > as failed either. > > Next time I tried running the same job (but new EMR > cluster & all from scratch) it just worked normally. > > On the problematic ru

Re: All sub-tasks stuck in CREATED state after "BlobServerConnection: GET operation failed"

2018-03-29 Thread Gary Yao
On Thu, Mar 29, 2018 at 8:59 AM, Juho Autio >>> wrote: >>> >>>> I built a new Flink distribution from release-1.5 branch yesterday. >>>> >>>> The first time I tried to run a job with it ended up in some stalled >>>> state so that the jo

Re: All sub-tasks stuck in CREATED state after "BlobServerConnection: GET operation failed"

2018-03-29 Thread Juho Autio
branch yesterday. >>> >>> The first time I tried to run a job with it ended up in some stalled >>> state so that the job didn't manage to (re)start but what makes it worse is >>> that it didn't exit as failed either. >>> >>> Next ti

Re: All sub-tasks stuck in CREATED state after "BlobServerConnection: GET operation failed"

2018-03-29 Thread Juho Autio
didn't exit as failed either. >> >> Next time I tried running the same job (but new EMR cluster & all from >> scratch) it just worked normally. >> >> On the problematic run, The YARN job was started and Flink UI was being >> served, but

Re: All sub-tasks stuck in CREATED state after "BlobServerConnection: GET operation failed"

2018-03-29 Thread Gary Yao
YARN job was started and Flink UI was being > served, but Flink UI kept showing status CREATED for all sub-tasks and > nothing seemed to be happening. > > I found this in Job manager log first (could be unrelated) : > > 2018-03-28 15:26:17,449 INFO > org.apache.flink.r

All sub-tasks stuck in CREATED state after "BlobServerConnection: GET operation failed"

2018-03-28 Thread Juho Autio
e job (but new EMR cluster & all from scratch) it just worked normally. On the problematic run, The YARN job was started and Flink UI was being served, but Flink UI kept showing status CREATED for all sub-tasks and nothing seemed to be happening. I found this in Job manager log first (coul

Re: Re-keying / sub-keying a stream without repartitioning

2017-04-26 Thread Elias Levy
stream keyed by (A,B) are already being processed by the local task. Reshuffling as a result of rekeying would have no benefit and would double the network traffic. It is why I suggested subKey(B) may be a good way to clearly indicate that the new key just sub-partitions the existing key partition wi

Re: Re-keying / sub-keying a stream without repartitioning

2017-04-26 Thread Aljoscha Krettek
Hi Elias, sorry for the delay, this must have fallen under the table after Flink Forward. I did spend some time thinking about this and we had the idea for a while now to add an operation like “keyByWithoutPartitioning()” (name not final ;-) that would allow the user to tell the system that we d

Re: Re-keying / sub-keying a stream without repartitioning

2017-04-25 Thread Elias Levy
Anyone? On Fri, Apr 21, 2017 at 10:15 PM, Elias Levy wrote: > This is something that has come up before on the list, but in a different > context. I have a need to rekey a stream but would prefer the stream to > not be repartitioned. There is no gain to repartitioning, as the new > partition k

Re-keying / sub-keying a stream without repartitioning

2017-04-21 Thread Elias Levy
This is something that has come up before on the list, but in a different context. I have a need to rekey a stream but would prefer the stream to not be repartitioned. There is no gain to repartitioning, as the new partition key is a composite of the stream key, going from a key of A to a key of
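Concretely, the situation looks like the sketch below: a stream keyed by A is later re-keyed by the composite (A, B). With the current API the second keyBy repartitions the stream again, even though every (A, B) record is already on the task that owns key A, which is the shuffle this thread would like to avoid; Event, its accessors, and the keyed function are hypothetical.

DataStream<Event> byA = events
        .keyBy(Event::getA)
        .process(new SomeKeyedProcessFunction());   // hypothetical per-A processing

// Re-keying by the composite (A, B): today this is a full repartition,
// even though all records for (A, B) already live on the task that owns A.
KeyedStream<Event, Tuple2<String, String>> byAandB = byA.keyBy(
        e -> Tuple2.of(e.getA(), e.getB()),
        Types.TUPLE(Types.STRING, Types.STRING));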

Re: fan out parallel-able operator sub-task beyond total slots number

2016-04-18 Thread Till Rohrmann
> https://ci.apache.org/projects/flink/flink-docs-master/concepts/concepts.html#workers-slots-resources > > My question is, which is the best way to fan out a large number of sub-tasks in parallel within a task? > > public void testFanOut() throws Exception{ > env =

fan out parallel-able operator sub-task beyond total slots number

2016-04-17 Thread Chen Qin
at. Yet it seems it was not designed to address the scenario above. https://ci.apache.org/projects/flink/flink-docs-master/concepts/concepts.html#workers-slots-resources My question is, which is the best way to fan out a large number of sub-tasks in parallel within a task? public void testFanOut() thr

Re: streaming hdfs sub folders

2016-02-26 Thread Stephan Ewen
> (that might be relevant to how the flag needs to be applied) >>>>>> See the code snippet below: >>>>>> >>>>>> DataStream inStream = >>>>>> env.readFile(new AvroInputFormat(new >>>>>> Path(filePat

Re: streaming hdfs sub folders

2016-02-23 Thread Martin Neumann
> See the code snippet below: >>>>> >>>>> DataStream inStream = >>>>> env.readFile(new AvroInputFormat(new >>>>> Path(filePath), EndSongCleanedPq.class), filePath); >>>>> >>>>> >>>>> On

Re: streaming hdfs sub folders

2016-02-19 Thread Robert Metzger
usually it gets the data from >>>>> kafka. It's an anomaly detection program that learns from the stream >>>>> itself. The reason I want to read from files is to test different settings >>>>> of the algorithm and compare them.

Re: streaming hdfs sub folders

2016-02-18 Thread Martin Neumann
I don't need to replay things in the exact order (which is not >>>> possible with parallel reads anyway) and I have written the program so it >>>> can deal with out of order events. >>>> I only need the subfolders to be processed roughly in order. It's fine

Re: streaming hdfs sub folders

2016-02-18 Thread Martin Neumann
exact order (which is not >>> possible with parallel reads anyway) and I have written the program so it >>> can deal with out of order events. >>> I only need the subfolders to be processed roughly in order. It's fine to >>> process some stuff from 01 before everythin

Re: streaming hdfs sub folders

2016-02-18 Thread Stephan Ewen
ssed roughly in order. Its fine to >> process some stuff from 01 before everything from 00 is finished, if I get >> records from all 24 subfolders at the same time things will break though. >> If I set the flag will it try to get data from all sub dir's in parallel or

Re: streaming hdfs sub folders

2016-02-17 Thread Martin Neumann
process some stuff from 01 before everything from 00 is finished, if I get > records from all 24 subfolders at the same time things will break though. > If I set the flag will it try to get data from all sub dir's in parallel or > will it go sub dir by sub dir? > > Also can you poi

Re: streaming hdfs sub folders

2016-02-17 Thread Martin Neumann
rom all 24 subfolders at the same time things will break though. If I set the flag will it try to get data from all sub dir's in parallel or will it go sub dir by sub dir? Also can you point me to some documentation or something where I can see how to set the Flag? cheers Martin On Wed, Feb

Re: streaming hdfs sub folders

2016-02-17 Thread Stephan Ewen
Hi! Going through nested folders is pretty simple, there is a flag on the FileInputFormat that makes sure those are read. Tricky is the part that all "00" files should be read before the "01" files. If you still want parallel reads, that means you need to sync at some point, wait for all parallel
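A minimal sketch of the flag in question, applied to the readFile/AvroInputFormat call quoted earlier in the thread (EndSongCleanedPq is the poster's Avro class). Note that this only makes the format recurse into sub-directories; it does not by itself make all "00" files finish before "01" starts.

AvroInputFormat<EndSongCleanedPq> format =
        new AvroInputFormat<>(new Path(filePath), EndSongCleanedPq.class);
format.setNestedFileEnumeration(true);   // also read files in nested sub-folders

DataStream<EndSongCleanedPq> inStream = env.readFile(format, filePath);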

streaming hdfs sub folders

2016-02-16 Thread Martin Neumann
Hi, I have a streaming machine learning job that usually runs with input from Kafka. To tweak the models I need to run on some old data from HDFS. Unfortunately the data on HDFS is spread out over several subfolders. Basically I have a datum with one subfolder for each hour; within those are the a

Re: Flink Streaming and Google Cloud Pub/Sub?

2015-09-15 Thread Robert Metzger
Hey Martin, I don't think anybody has used Google Cloud Pub/Sub with Flink yet. There are no tutorials for implementing streaming sources and sinks, but Flink has a few connectors that you can use as a reference. For the sources, you basically have to extend RichSourceFunctio
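A skeleton of the custom-source approach described above; the actual Google Cloud Pub/Sub client calls are left as comments and the class name is a placeholder. (Later Flink releases ship a dedicated Pub/Sub connector, flink-connector-gcp-pubsub, but at the time a hand-rolled source like this was the way to go.)

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext;

// Skeleton of a custom streaming source; Pub/Sub client wiring is omitted.
public class PubSubSource extends RichSourceFunction<String> {

    private volatile boolean running = true;

    @Override
    public void open(Configuration parameters) {
        // create and configure the Pub/Sub subscriber client here
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (running) {
            // pull the next message from the subscription (client call omitted)
            String message = "...";
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(message);
                // acknowledge the message after it has been emitted
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}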

Flink Streaming and Google Cloud Pub/Sub?

2015-09-14 Thread Martin Neumann
Hi, Has anyone tried to connect Flink Streaming to Google Cloud Pub/Sub and has a code example for me? If I have to implement my own sources and sinks, are there any good tutorials for that? cheers Martin