posting my question again :)
Thanks for the pointer. Looking at the description from the site below, it
looks like the block size in Spark is not fixed: it is determined by the
block interval, and in fact, within the same batch you could have blocks of
different sizes. Did I get that right?
-
Hi Mohit,
please make sure you use the "Reply to all" button and include the mailing
list, otherwise only I will get your message ;)
Regarding your question:
Yes, that's also my understanding. You can partition streaming RDDs only by
time intervals, not by size. So depending on your incoming rate, the blocks
(and therefore the partitions of each batch RDD) will end up with different
sizes.
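
To make that concrete, here is a minimal sketch in Scala (the socket source,
host/port, and interval values are assumptions for illustration, not from
this thread). Each receiver cuts incoming data into blocks every
spark.streaming.blockInterval, and each block becomes one partition of the
batch RDD, so a batch has roughly batchInterval / blockInterval partitions
whose sizes depend on the arrival rate:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BlockIntervalSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("BlockIntervalSketch")
      // Receivers cut incoming data into blocks at this interval
      // (default 200 ms); each block becomes one partition of the batch RDD.
      .set("spark.streaming.blockInterval", "200ms")

    // 2 s batches with 200 ms blocks -> roughly 2000 / 200 = 10 partitions
    // per batch; each partition holds whatever arrived during its 200 ms
    // slice, so partition sizes vary with the incoming rate.
    val ssc = new StreamingContext(conf, Seconds(2))

    val lines = ssc.socketTextStream("localhost", 9999)
    lines.foreachRDD(rdd => println(s"partitions in this batch: ${rdd.getNumPartitions}"))

    ssc.start()
    ssc.awaitTermination()
  }
}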
Hi Mohit,
it also depends on what the source for your streaming application is.
If you use Kafka, you can easily partition topics and have multiple
receivers on different machines.
If you have something like an HTTP or socket stream, you probably can't do
that; the Spark RDDs generated by your receiver will then all be created on
the machine where that single receiver runs.
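
For the Kafka case, a sketch of the multi-receiver pattern might look like
the following (the ZooKeeper quorum, group id, topic name, and receiver
count below are assumptions for illustration). Each createStream call starts
its own receiver on some executor, and the per-receiver streams are unioned
back into one DStream:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object MultiReceiverSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MultiReceiverSketch")
    val ssc = new StreamingContext(conf, Seconds(2))

    val zkQuorum = "zk1:2181"           // assumption: your ZooKeeper quorum
    val groupId  = "my-consumer-group"  // assumption: your consumer group
    val topics   = Map("events" -> 2)   // assumed topic -> threads per receiver

    // One receiver per createStream call; each receiver occupies a core on
    // some executor, so the receiving work is spread across the cluster.
    val numReceivers = 4
    val streams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, groupId, topics)
    }

    // Union the per-receiver streams back into a single DStream.
    val unified = ssc.union(streams)
    unified.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}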
1. If you are consuming data from Kafka or any other receiver-based source,
then you can start 1-2 receivers per worker (assuming you'll have a minimum
of 4 cores per worker).
2. If you have a single receiver, or a fileStream, then what you can do to
distribute the data across machines is a repartition of each batch right
after receiving it (see the sketch below).
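
A sketch of that repartition approach, assuming a single socket receiver
(the host, port, and partition count of 8 are illustrative assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RepartitionSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RepartitionSketch")
    val ssc = new StreamingContext(conf, Seconds(2))

    // A single socket receiver: all blocks land on one executor.
    val lines = ssc.socketTextStream("localhost", 9999)

    // repartition() shuffles each batch RDD across the cluster, so the
    // downstream processing is spread over all machines despite the single
    // receiver. The count of 8 is assumed; tune it to your cluster.
    val distributed = lines.repartition(8)

    distributed.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}

The shuffle that repartition introduces has a cost, but it lets every
machine share the per-batch processing even though ingestion happens at a
single point.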