Does anyone know if this is a correct assumption:
DataStream sorted = stream.keyBy("partition");
Will it automatically route records with the same key to the same sink thread?
The behavior I am seeing is that a sink set up with multiple threads is seeing
data from the same hour.
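For what it's worth, that is the guarantee keyBy gives: records are routed by a deterministic hash of the key, so every record with the same key lands on the same downstream subtask. A minimal plain-Java sketch (no Flink dependency, hypothetical routing function) of the determinism — Flink's real routing uses murmur-hashed key groups, not this simple modulo:

```java
public class KeyRouting {
    // Hypothetical routing function, NOT Flink's actual implementation;
    // it only illustrates that routing is a pure function of the key.
    static int subtaskFor(Object key, int parallelism) {
        return (key.hashCode() & Integer.MAX_VALUE) % parallelism;
    }

    public static void main(String[] args) {
        int parallelism = 4;
        // The same key resolves to the same subtask index every time,
        // which is why one sink thread sees all records for one hour key.
        System.out.println(
            subtaskFor("hour-15", parallelism) == subtaskFor("hour-15", parallelism)); // prints true
    }
}
```

So seeing one sink thread receive all the data for a given hour is consistent with keying on an hour-derived field.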
Any good examples of how to sort data so
I have a process that will take 250,000 records from Kafka and produce a file.
(Using a CustomFileSync)
Currently I just have the following:
DataStream stream = env.addSource(new
FlinkKafkaConsumer010("topic", schema,
properties)).setParallelism(40).flatMap(new
SchemaRecordSplit()).setParalle
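A plain-Java sketch (no Flink, hypothetical record format and key extraction) of the grouping that the keyBy ahead of the file sink is meant to achieve: records carrying the same key end up in the same bucket, and each bucket would become one output file:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BucketByKey {
    // Hypothetical: derive a key from a record; in the pipeline above this
    // is roughly what SchemaRecordSplit plus keyBy would decide.
    static String keyOf(String record) {
        return record.split(",")[0]; // assume "key,payload" records
    }

    // Group records by key; each resulting bucket maps to one output file.
    static Map<String, List<String>> bucket(List<String> records) {
        Map<String, List<String>> buckets = new HashMap<>();
        for (String r : records) {
            buckets.computeIfAbsent(keyOf(r), k -> new ArrayList<>()).add(r);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> records = List.of("a,1", "b,2", "a,3");
        // Two distinct keys -> two buckets -> two output files.
        System.out.println(bucket(records).size()); // prints 2
    }
}
```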
All,
I'm looking to process files in a directory as they come in via file
transfer.
The files are renamed to .DONE once the transfer is complete.
These are binary files and I need to process billions per day.
What I want to do is process the file and then create a new file called
.
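One way to pick up only fully-transferred files is to scan the directory for the .DONE suffix, e.g. with a glob over `java.nio.file`. A small self-contained sketch (hypothetical names, not tied to any Flink source):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class DoneFileScanner {
    // Collect only files the transfer has finished with (renamed to *.DONE).
    static List<Path> doneFiles(Path dir) throws IOException {
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(dir, "*.DONE")) {
            for (Path p : ds) result.add(p);
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("transfer");
        Files.createFile(dir.resolve("a.bin"));      // still transferring
        Files.createFile(dir.resolve("b.bin.DONE")); // ready to process
        System.out.println(doneFiles(dir).size());   // prints 1
    }
}
```

At billions of files per day the per-scan cost matters, so a real job would track already-seen files rather than re-listing everything, but the suffix filter itself is this simple.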
number of partitions in your topic. If you have 240 partitions
that's fine, but if you have fewer, the extra subtasks will be idle. Only one
task can read from a given partition at a time.
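The arithmetic in that reply can be sketched directly: a partition is consumed by at most one source subtask, so any subtasks beyond the partition count sit idle. A trivial plain-Java illustration:

```java
public class SourceUtilization {
    // Subtasks actually reading = min(parallelism, partitions);
    // everything beyond the partition count is idle.
    static int idleSubtasks(int parallelism, int partitions) {
        return Math.max(0, parallelism - partitions);
    }

    public static void main(String[] args) {
        System.out.println(idleSubtasks(240, 240)); // prints 0
        System.out.println(idleSubtasks(240, 60));  // prints 180
    }
}
```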
On Tue, Apr 18, 2017 at 3:38 PM Telco Phone wrote:
I am trying to use the task number as a keyBy value to help fan out the
workload of reading from Kafka.
Given:
DataStream stream = env.addSource(new
FlinkKafkaConsumer010("topicA", schema, properties)
).setParallelism(240).flatMap(new SchemaRecordSplit()).se
Getting this:
DataStream stream = env.addSource(new
FlinkKafkaConsumer08<>("raw", schema, properties)
).setParallelism(30).flatMap(new RecordSplit()).setParallelism(30).
name("Raw splitter").keyBy("id","keyByHelper","schema");
Field expressio
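One way to read the keyByHelper idea (a sketch in plain Java, with hypothetical names — not the poster's actual SchemaRecordSplit): assign each record a rotating helper value so the subsequent keyBy spreads records across subtasks instead of collapsing onto a handful of natural keys:

```java
import java.util.HashSet;
import java.util.Set;

public class KeyByHelper {
    private int counter = 0;
    private final int fanOut;

    KeyByHelper(int fanOut) {
        this.fanOut = fanOut;
    }

    // Rotating helper key: successive records get 0, 1, ..., fanOut-1, 0, ...
    // so a downstream keyBy on this field sees fanOut distinct keys.
    int nextHelper() {
        int h = counter;
        counter = (counter + 1) % fanOut;
        return h;
    }

    public static void main(String[] args) {
        KeyByHelper helper = new KeyByHelper(4);
        Set<Integer> seen = new HashSet<>();
        for (int i = 0; i < 100; i++) {
            seen.add(helper.nextHelper());
        }
        // All four helper values are produced, so keyBy can reach all subtasks.
        System.out.println(seen.size()); // prints 4
    }
}
```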
I am looking to get the Kafka readers, keyBy, and sink all working with 60
threads.
For the most part it is working correctly.
DataStream stream = env.addSource(new
FlinkKafkaConsumer08<>("kafkatopic", schema,
properties)).setParallelism(60).flatMap(new
SchemaRecordSplit()).setParallelism(60).n
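To keep all 60 threads busy, the key space also has to be large enough: with hash routing, K distinct keys can occupy at most K subtasks, and usually fewer because of hash collisions. A plain-Java sketch (hypothetical routing function, same shape as the hash idea above, not Flink's real algorithm):

```java
import java.util.HashSet;
import java.util.Set;

public class Occupancy {
    // Hypothetical key-to-subtask routing, not Flink's implementation.
    static int subtaskFor(Object key, int parallelism) {
        return (key.hashCode() & Integer.MAX_VALUE) % parallelism;
    }

    // Count how many subtasks at least one key maps onto.
    static int occupiedSubtasks(int distinctKeys, int parallelism) {
        Set<Integer> occupied = new HashSet<>();
        for (int k = 0; k < distinctKeys; k++) {
            occupied.add(subtaskFor("key-" + k, parallelism));
        }
        return occupied.size();
    }

    public static void main(String[] args) {
        // With only 24 distinct hour keys, at most 24 of 60 threads get work.
        System.out.println(occupiedSubtasks(24, 60) <= 24); // prints true
    }
}
```

So if fewer than 60 of the threads are active, the first thing to check is how many distinct key values the stream actually carries.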
I am trying to access the keyBy value in the open() method of a RichSink.
Is there a way to access the actual keyBy value in the RichSink?
DataStream stream =
env.addSource(new FlinkKafkaConsumer08<>("test", schema, properties)
).setParallelism(1).keyBy("partition");
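As far as I know the key is not available in open(): open() runs once per subtask before any record arrives, so there is no key yet — the key has to be read from each record inside invoke(). A plain-Java sketch of that lifecycle (hypothetical interface, not Flink's RichSinkFunction):

```java
public class KeyedSinkSketch {
    // Mimics a rich sink's lifecycle: open() once, then invoke() per record.
    static class Sink {
        String lastKey;

        void open() {
            // No record has been seen yet, so no key is available here.
        }

        void invoke(String record) {
            // The key must come from the record itself.
            lastKey = record.split(",")[0]; // assume "partition,payload"
        }
    }

    public static void main(String[] args) {
        Sink sink = new Sink();
        sink.open();
        sink.invoke("7,payload");
        System.out.println(sink.lastKey); // prints 7
    }
}
```

If per-key setup is needed, it has to be done lazily on first sight of a key in invoke(), not in open().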