Hi,
My system has two different parts:
1. A batch application that, every x minutes, runs SQL queries across several
tables containing millions of rows to build entities, and sends those
entities to Kafka.
2. A streaming application that processes the data from Kafka.
The entire system is working now, but I want to improve the performance of
the batch part: if I have 100 million entities, I currently send them all to
Kafka in a single foreach, in one go, which makes no sense for the downstream
streaming application. I would like to send them to Kafka in chunks of 10
million events, for example.
I have a query; imagine:
*select ... from table 1 left outer join table 2 on ... left outer join
table 3 on ... left outer join table 4 on ...*
My goal is to do *pagination* on table 1: take 10 million rows into a
separate RDD, do the joins and send the result to Kafka, then take the next
10 million and do the same, and so on. All tables are stored in Parquet
format on HDFS.
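To make the idea concrete, here is a rough sketch of what I mean (the HDFS
paths, the "id" join column and the batch-id trick are only placeholders,
not my real schema or code):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Placeholder HDFS paths for the Parquet tables
val table1 = sqlContext.read.parquet("hdfs:///data/table1")
val table2 = sqlContext.read.parquet("hdfs:///data/table2")
// ... same for table 3 and table 4

// Tag every row of table 1 with a batch id, 10 million rows per batch
val batchSize = 10000000L
val indexed = table1.rdd.zipWithIndex().map { case (row, idx) => (idx / batchSize, row) }.cache()
val numBatches = (indexed.count() + batchSize - 1) / batchSize

(0L until numBatches).foreach { batch =>
  // Rebuild a DataFrame for this 10-million-row slice of table 1
  val batchDf = sqlContext.createDataFrame(indexed.filter(_._1 == batch).map(_._2), table1.schema)
  val joined = batchDf.join(table2, batchDf("id") === table2("id"), "left_outer")
  // ... join table 3 and table 4 the same way, build the entities and send this batch to Kafka
}

The cache() on the indexed RDD is only there so table 1 is not re-read from
Parquet on every batch.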
I am thinking of using the *toLocalIterator* method, something like the
snippet below, but I have doubts about memory and parallelism, and I am sure
there is a better way to do it.
import org.apache.spark.rdd.RDD

rdd.toLocalIterator.grouped(10000000).foreach { seq =>
  // Bring 10 million elements to the driver, then redistribute them as an RDD
  val batchRdd: RDD[(String, Int)] = sc.parallelize(seq)
  // Do the processing and send this batch to Kafka
}
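And for completeness, the Kafka side of each batch would be roughly like this
(the broker list, topic name and String serialization are just placeholders,
not my actual producer code):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.rdd.RDD

def sendBatchToKafka(batch: RDD[String]): Unit = {
  batch.foreachPartition { records =>
    // One producer per partition, created on the executor side
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    records.foreach(r => producer.send(new ProducerRecord[String, String]("entities-topic", r)))
    producer.close()
  }
}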
What do you think?
Regards.
--
Gaspar Muñoz
@gmunozsoria
<http://www.stratio.com/>
Vía de las dos Castillas, 33, Ática 4, 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 352 59 42 // *@stratiobd <https://twitter.com/StratioBD>*