Hello everyone,
Here is my use case: I have an event data stream which I need to key by a
certain field (tenant id), and then for each tenant's events I need to
independently perform ML clustering using FlinkML's OnlineKMeans component.
I am using Java.
I tried different approaches but none of the
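One pattern worth noting here: as far as I know, FlinkML's OnlineKMeans is built on the Table API and trains a single global model, so it does not plug directly into a keyBy(tenantId) DataStream topology. A hand-rolled sketch of the per-tenant alternative keeps each tenant's centroids in keyed state; everything below (class name, seeding, learning-rate update) is illustrative, not FlinkML's implementation:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Input: (tenantId, featureVector). Output: that tenant's current centroids.
public class PerTenantKMeans
        extends KeyedProcessFunction<String, Tuple2<String, double[]>, double[][]> {

    private final int k;
    private final double lr; // step size of the incremental update
    private transient ValueState<double[][]> centroids; // scoped per tenant

    public PerTenantKMeans(int k, double lr) {
        this.k = k;
        this.lr = lr;
    }

    @Override
    public void open(Configuration parameters) {
        centroids = getRuntimeContext().getState(
                new ValueStateDescriptor<>("centroids", double[][].class));
    }

    @Override
    public void processElement(Tuple2<String, double[]> in, Context ctx,
                               Collector<double[][]> out) throws Exception {
        double[] p = in.f1;
        double[][] c = centroids.value();
        if (c == null) { // naive seeding; real code would seed from the first k distinct points
            c = new double[k][p.length];
            for (int i = 0; i < k; i++) {
                for (int j = 0; j < p.length; j++) {
                    c[i][j] = p[j] + i * 1e-3; // small offset to break symmetry
                }
            }
        }
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < k; i++) { // find the nearest centroid
            double d = 0;
            for (int j = 0; j < p.length; j++) {
                double diff = c[i][j] - p[j];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        for (int j = 0; j < p.length; j++) { // move the winner toward the point
            c[best][j] += lr * (p[j] - c[best][j]);
        }
        centroids.update(c);
        out.collect(c);
    }
}

// usage: events.keyBy(e -> e.f0).process(new PerTenantKMeans(4, 0.05));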
Hi Martijn,
Yes, that's what I meant: the throughput in the process function(s) didn't
change, so even though they were busy 100% of the time with parallelism=2,
they were processing data quickly enough.
Regards,
Alexis.
On Fri, Dec 16, 2022 at 14:20, Martijn Visser <martijnvis...@apach
Hi,
Backpressure implies that it's actually a later operator that is busy. So
in this case, that would be your process function that can't handle the
incoming load from your Kafka source.
Best regards,
Martijn
On Tue, Dec 13, 2022 at 7:46 PM Alexis Sarda-Espinosa <sarda.espin...@gmail.com> wrote:
Hello,
I have a Kafka source (the new one) in Flink 1.15 that's followed by a
process function with parallelism=2. Some days, I see long periods of
backpressure in the source. During those times, the pool-usage metrics of
all tasks stay between 0 and 1%, but the process function appears 100% busy.
The statefun docs have some nice examples of how to use Kafka and Kinesis
for ingress/egress in conjunction with a function. Is there some
documentation or example code I could reference to do the same with a GCP
Pub/Sub topic? Thanks.
Dave
Yes, since the two streams have the same type, you can union the two
streams, key the resulting stream, and then apply something like a
RichFlatMapFunction. Or you can connect the two streams (again, they'll
need to be keyed so you can use state), and apply a RichCoFlatMapFunction.
You can use whic
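For illustration, a minimal sketch of the connect variant (the Event type and its key accessor are placeholders): both inputs are keyed the same way, so flatMap1 and flatMap2 see the same keyed state for a given key.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

public class MergeBothStreams extends RichCoFlatMapFunction<Event, Event, Event> {

    private transient ValueState<Long> count; // shared by both inputs for the same key

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void flatMap1(Event e, Collector<Event> out) throws Exception {
        emit(e, out);
    }

    @Override
    public void flatMap2(Event e, Collector<Event> out) throws Exception {
        emit(e, out); // same key => same state variable as flatMap1
    }

    private void emit(Event e, Collector<Event> out) throws Exception {
        Long c = count.value();
        count.update(c == null ? 1L : c + 1);
        out.collect(e);
    }
}

// usage: both streams keyed by the same function
// one.connect(other).keyBy(Event::getKey, Event::getKey).flatMap(new MergeBothStreams());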
I've gone through the example as well as the documentation and I still
couldn't understand whether my use case requires joining.
1. What would happen if I didn't join?
2. As the 2 incoming data streams have the same type, if joining is
absolutely necessary then just a union (oneStream.union(anotherS
Let me make the example more concrete. Say O1 gets as input a data stream T1
which it splits into two using some function and produces DataStreams of
type T2 and T3, each of which is partitioned by the same key function TK.
Now after O2 processes a stream, it could sometimes send the stream to O3
For an example of a similar join implemented as a RichCoFlatMap, see [1].
For more background, the Flink docs have a tutorial [2] on how to work with
connected streams.
[1] https://github.com/apache/flink-training/tree/master/rides-and-fares
[2]
https://ci.apache.org/projects/flink/flink-docs-stab
1. yes - the same key would affect the same state variable
2. you need a join to have the same operator process both streams
Matthias
On Wed, Mar 24, 2021 at 7:29 AM vishalovercome wrote:
> stream partitioned by keyBy
> If operator O3 receives inputs from two operators and both inputs have the
> same type and value for a key, will the two streams end up in the same
> sub-task and therefore affect the same state variables keyed to that
> particular key? Do the streams themselves have to have the same type, or is
> it enough that just the keys of each of the input streams have the same
> type and value?
that each slot can contain a source task. With the config option
cluster.evenly-spread-out-slots set to true, slots will be evenly
distributed across all available TaskManagers in most cases.
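For reference, that is a single line in flink-conf.yaml:

cluster.evenly-spread-out-slots: true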
Thanks,
Zhu Zhu
Ken Krugler wrote on Fri, Aug 7, 2020 at 5:28 AM:
Hi all,
Was there any change in how sub-tasks get allocated to TMs, from Flink 1.9 to
1.10?
Specifically, for consecutively numbered sub-tasks (e.g. 0, 1, 2), did it become
more or less likely that they'd be allocated to the same Task Manager?
Asking because a workflow that ran fine in 1.9 now
Hi,
To subscribe, you have to send a mail to user-subscr...@flink.apache.org
On Wed, 15 Apr 2020 at 7:33 AM, lamber-ken wrote:
> user@flink.apache.org
Hello all!
I have a setup composed of several streaming pipelines. These have
different deployment lifecycles: I want to be able to modify and redeploy
the topology of one while the other is still up. I am thus putting them in
different jobs.
The problem is I have a co-location constraint between one subtask of each
pipeline; I'd like them to run on the same TaskSlots, much like if they
were sharing a TaskSlot, or at least have them on the same JVM.
A semi-official feature
"DataStream.getTransformation().setCoLocationGroupKey(stringKey)" [1]
exists for this, but seems to be tied to the sub-tasks actually being able
to be co-located on the same Task Slot.
The documentation mentions [2] that it might be impossible to
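For reference, a minimal sketch of how that feature is applied within a single job (the stream contents are placeholders); as discussed above, it does not help across job boundaries:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<String> a = env.fromElements("a1", "a2");
DataStream<String> b = env.fromElements("b1", "b2");

// sub-tasks of both transformations are requested to run in the same task slots
a.getTransformation().setCoLocationGroupKey("shared-group");
b.getTransformation().setCoLocationGroupKey("shared-group");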
Hi Jary,
All the Flink's mailing list information can be found here[1].
[1]: https://flink.apache.org/community.html#mailing-lists
Best,
Vino
Benchao Li wrote on Thu, Jan 2, 2020 at 4:56 PM:
Hi Jary,
You need to send an email to user-subscr...@flink.apache.org
to subscribe, not user@flink.apache.org.
Jary Zhen wrote on Thu, Jan 2, 2020 at 4:53 PM:
--
Benchao Li
School of Electronics Engineering and Computer Science, Peking University
Tel:+86-15650713730
Email: libenc...@gmail.com; libenc
On Thu, Mar 29, 2018 at 8:59 AM, Juho Autio wrote:

> I built a new Flink distribution from the release-1.5 branch yesterday.
>
> The first time I tried to run a job with it, it ended up in some stalled
> state: the job didn't manage to (re)start, but what makes it worse is
> that it didn't exit as failed either.
>
> Next time I tried running the same job (but on a new EMR cluster & all
> from scratch) it just worked normally.
>
> On the problematic run, the YARN job was started and the Flink UI was
> being served, but the Flink UI kept showing status CREATED for all
> sub-tasks and nothing seemed to be happening.
>
> I found this in the JobManager log first (could be unrelated):
>
> 2018-03-28 15:26:17,449 INFO
> org.apache.flink.r
stream keyed by (A,B) are already being processed by the local task.
Reshuffling as a result of rekeying would have no benefit and would double
the network traffic. That is why I suggested subKey(B) may be a good way to
clearly indicate that the new key just sub-partitions the existing key
partition wi
Hi Elias,
sorry for the delay, this must have slipped through the cracks after Flink
Forward. I did spend some time thinking about this, and we have had the idea
for a while now to add an operation like "keyByWithoutPartitioning()" (name
not final ;-) that would allow the user to tell the system that we d
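For reference, the experimental helper that later shipped for this kind of case is DataStreamUtils.reinterpretAsKeyedStream, which tells the runtime a stream is already partitioned instead of reshuffling it. A sketch (the tuple types are placeholders, and note the documented contract: the stream must be partitioned exactly as keyBy of the given selector would partition it, so rekeying A -> (A, B) this way is the idea under discussion, not a blessed recipe):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamUtils;
import org.apache.flink.streaming.api.datastream.KeyedStream;

DataStream<Tuple2<String, String>> input = ...; // stream of (A, B) records

// stream already keyed by A
KeyedStream<Tuple2<String, String>, String> byA = input.keyBy(t -> t.f0);

// reinterpret as keyed by (A, B) without a network shuffle
KeyedStream<Tuple2<String, String>, Tuple2<String, String>> byAB =
        DataStreamUtils.reinterpretAsKeyedStream(
                byA,
                t -> Tuple2.of(t.f0, t.f1),
                Types.TUPLE(Types.STRING, Types.STRING));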
Anyone?
On Fri, Apr 21, 2017 at 10:15 PM, Elias Levy wrote:
This is something that has come up before on the list, but in a different
context. I have a need to rekey a stream but would prefer the stream to
not be repartitioned. There is no gain to repartitioning, as the new
partition key is a composite of the stream key, going from a key of A to a
key of (A, B).
Yet it seems it was not designed to address the scenario above.
https://ci.apache.org/projects/flink/flink-docs-master/concepts/concepts.html#workers-slots-resources
My question is: what is the best way to fan out a large number of sub-tasks
in parallel within a task?

public void testFanOut() throws Exception {
    env =
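A sketch of one common approach (the workload here is a stand-in): keep the source narrow and raise the downstream operator's parallelism, letting rebalance() distribute records round-robin across the fanned-out sub-tasks.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FanOutSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> input = env.fromElements("a", "b", "c"); // stand-in source

        input.rebalance()               // round-robin to all downstream sub-tasks
             .map(String::toUpperCase)  // the fanned-out work
             .setParallelism(32)        // 32 parallel sub-tasks for this operator
             .print();

        env.execute("fan-out sketch");
    }
}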
(that might be relevant to how the flag needs to be applied)
See the code snippet below:

DataStream<EndSongCleanedPq> inStream =
    env.readFile(new AvroInputFormat<>(new Path(filePath), EndSongCleanedPq.class), filePath);

Usually it gets the data from Kafka. It's an anomaly detection program that
learns from the stream itself. The reason I want to read from files is to
test different settings of the algorithm and compare them.
I don't need to replay things in the exact order (which is not possible
with parallel reads anyway) and I have written the program so it can deal
with out-of-order events.
I only need the subfolders to be processed roughly in order. It's fine to
process some stuff from 01 before everything from 00 is finished; if I get
records from all 24 subfolders at the same time, things will break though.
If I set the flag, will it try to get data from all subdirectories in
parallel, or will it go subdirectory by subdirectory?
Also, can you point me to some documentation or something where I can see
how to set the flag?
cheers Martin
On Wed, Feb
Hi!
Going through nested folders is pretty simple: there is a flag on the
FileInputFormat that makes sure those are read.
The tricky part is that all "00" files should be read before the "01"
files. If you still want parallel reads, that means you need to sync at
some point, wait for all parallel
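For reference, assuming the flag meant here is FileInputFormat's nested file enumeration, a minimal sketch (reusing the types from the snippet in this thread):

import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.avro.AvroInputFormat;
import org.apache.flink.streaming.api.datastream.DataStream;

AvroInputFormat<EndSongCleanedPq> format =
        new AvroInputFormat<>(new Path(filePath), EndSongCleanedPq.class);

// recurse into subdirectories instead of reading only the top-level files
format.setNestedFileEnumeration(true);

DataStream<EndSongCleanedPq> inStream = env.readFile(format, filePath);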
Hi,
I have a streaming machine learning job that usually runs with input from
Kafka. To tweak the models I need to run on some old data from HDFS.
Unfortunately the data on HDFS is spread out over several subfolders.
Basically I have a folder per date with one subfolder for each hour, and
within those are the a
Hey Martin,
I don't think anybody has used Google Cloud Pub/Sub with Flink yet.
There are no tutorials for implementing streaming sources and sinks, but
Flink has a few connectors that you can use as a reference.
For the sources, you basically have to extend RichSourceFunction
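For what it's worth, a skeleton of what such a source could look like; PubSubClient and its pull() are hypothetical stand-ins for whatever Google Cloud client you wrap, not an existing connector:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

public class PubSubSource extends RichSourceFunction<String> {

    private volatile boolean running = true;
    private transient PubSubClient client; // hypothetical wrapper around the Google Cloud SDK

    @Override
    public void open(Configuration parameters) {
        client = new PubSubClient("my-subscription"); // hypothetical
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (running) {
            String message = client.pull(); // hypothetical blocking pull
            if (message != null) {
                // emit under the checkpoint lock so emission and state stay consistent
                synchronized (ctx.getCheckpointLock()) {
                    ctx.collect(message);
                }
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    @Override
    public void close() throws Exception {
        if (client != null) {
            client.close(); // hypothetical
        }
    }
}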
Hej,
Has anyone tried to connect Flink Streaming to Google Cloud Pub/Sub and
has a code example for me?
If I have to implement my own sources and sinks, are there any good
tutorials for that?
cheers Martin