Flink 1.1.1 is running on AWS / EMR: 3 boxes with 24 cores and 90 GB of RAM
in total.

The job is submitted via YARN.

Topology:

read CSV files from SQS -> parse each file by line and create an object per
line -> key by a 'KeySelector' (on a hash) to pair entries over a 60-second
window -> write the original and matched sets to BigQuery.
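
For reference, a minimal sketch of the pipeline in the DataStream API
(SqsCsvFileSource, Entry, PairMatchingWindowFunction and BigQuerySink are
placeholders for my actual source, per-line object, window function and sink;
only the structure matters):

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class MatchingJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder source that polls SQS and emits each CSV file as one String.
        DataStream<String> files = env.addSource(new SqsCsvFileSource());

        // Each file fans out into ~15K per-line objects, so this operator sees ~150K records / sec.
        DataStream<Entry> entries = files.flatMap(new FlatMapFunction<String, Entry>() {
            @Override
            public void flatMap(String file, Collector<Entry> out) {
                for (String line : file.split("\n")) {
                    out.collect(Entry.parse(line)); // Entry stands in for the per-line object
                }
            }
        });

        entries
            // KeySelector groups entries by their hash so candidate pairs meet on the same key.
            .keyBy(new KeySelector<Entry, String>() {
                @Override
                public String getKey(Entry e) {
                    return e.getHash();
                }
            })
            // 60-second tumbling window per key.
            .timeWindow(Time.seconds(60))
            // Placeholder window function that emits the original and matched sets.
            .apply(new PairMatchingWindowFunction())
            // Placeholder sink that writes results to BigQuery.
            .addSink(new BigQuerySink());

        env.execute("SQS CSV matching");
    }
}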

Each file contains ~15K lines and there are ~10 files / second, i.e. roughly
150K lines / second in total.

My topology can't keep up with this stream. What am I doing wrong?

Articles like this one
(http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/)
speak of > 1 million events / sec / core. I'm not clear what constitutes an
'event', but at ~150K lines / second spread over 24 cores my job only needs
~6K records / sec / core, so I would expect far higher throughput than I'm
seeing.

I run the job as:

HADOOP_CONF_DIR=/etc/hadoop/conf bin/flink run -m yarn-cluster -yn 3 -ys 8
-yst -ytm 4096 ../flink_all.jar
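
To spell out what I think those flags give me (per the Flink 1.1 YARN client):

-m yarn-cluster  -> start a per-job Flink cluster on YARN
-yn 3            -> 3 TaskManager containers
-ys 8            -> 8 task slots per TaskManager (24 slots in total)
-yst             -> streaming mode
-ytm 4096        -> 4096 MB of memory per TaskManager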
