Thanks for the info, Joe - yes, I do think this will be very useful. Will look 
out for this, eh?!

On June 24, 2014 at 10:32:08 AM, Joe Stein (joe.st...@stealth.ly) wrote:

You could then chunk the data (wrapped in an outer message so you have metadata 
like file name, total size, and current chunk size) and produce each chunk with 
the file name as the partition key.
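
For example, here is a minimal sketch against the 0.8 producer API - the topic 
name ("file-chunks"), broker address, chunk size, and the pipe-delimited header 
layout are all just placeholders, not anything f2k actually does:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class FileChunkProducer {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092");               // placeholder broker
        props.put("serializer.class", "kafka.serializer.DefaultEncoder"); // byte[] values
        props.put("key.serializer.class", "kafka.serializer.StringEncoder");
        Producer<String, byte[]> producer = new Producer<>(new ProducerConfig(props));

        String fileName = args[0];
        long totalSize = new File(fileName).length();
        int chunkSize = 1024 * 1024;                                      // 1MB chunks, arbitrary
        byte[] buffer = new byte[chunkSize];

        try (FileInputStream in = new FileInputStream(fileName)) {
            int read;
            int chunkIndex = 0;
            while ((read = in.read(buffer)) != -1) {
                // Outer "wrapper": a simple header carrying file name, total size,
                // chunk index and current chunk size, followed by the chunk bytes.
                byte[] header = String.format("%s|%d|%d|%d|", fileName, totalSize, chunkIndex, read)
                        .getBytes(StandardCharsets.UTF_8);
                ByteBuffer payload = ByteBuffer.allocate(header.length + read);
                payload.put(header).put(buffer, 0, read);
                // File name as the partition key keeps all chunks of one file on one partition,
                // so a consumer sees them in order.
                producer.send(new KeyedMessage<>("file-chunks", fileName, payload.array()));
                chunkIndex++;
            }
        }
        producer.close();
    }
}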

We are working on a system for loading files into Kafka, which will eventually 
support both chunking and pointers (initially chunking line by line, since the 
first use case is to read from a closed file handle location): 
https://github.com/stealthly/f2k. There is not much there yet (maybe more in the 
next few days / later this week), but it may be useful for your use case, or we 
could eventually add your use case to it.

/*******************************************
 Joe Stein
 Founder, Principal Consultant
 Big Data Open Source Security LLC
 http://www.stealth.ly
 Twitter: @allthingshadoop
********************************************/


On Tue, Jun 24, 2014 at 12:37 PM, Denny Lee <denny.g....@gmail.com> wrote:
Hey Joe,

Yes, I have - my original plan was to do something similar to what you suggested: 
simply push the data into HDFS / S3 and keep only the event information in Kafka, 
so that multiple consumers can just read the event information and ping HDFS/S3 
for the actual message itself.

Part of the reason for considering pushing the entire message into Kafka is that 
we may end up with a firehose of messages of this size, and we will need to push 
this data to multiple locations.

Thanks,
Denny

On June 24, 2014 at 9:26:49 AM, Joe Stein (joe.st...@stealth.ly) wrote:

Hi Denny, have you considered saving those files to HDFS and sending the
"event" information to Kafka?

You could then pass that off to Apache Spark in a consumer and get data
locality for the file saved (or something of the sort [no pun intended]).

You could also stream every line of the file (or however you want to "chunk" it) 
as a separate message to the broker, with a wrapping message object so you know 
which file you are dealing with when consuming.
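
The wrapping object itself can be very small; a hypothetical sketch (the field 
names are my own invention, not from any existing library):

// Hypothetical wrapper carried as the Kafka message value, so the consumer
// knows which file (and which piece of it) each message belongs to.
public class FileChunk implements java.io.Serializable {
    public String fileName;   // which file this chunk came from
    public long totalSize;    // total size of the original file in bytes
    public int chunkIndex;    // position of this chunk within the file
    public int chunkLength;   // number of valid bytes in 'bytes'
    public byte[] bytes;      // the chunk (or line) payload itself
}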

What you plan to do with the data has a lot to do with how you are going to
process and manage it.

/*******************************************
Joe Stein
Founder, Principal Consultant
Big Data Open Source Security LLC
http://www.stealth.ly
Twitter: @allthingshadoop
********************************************/


On Tue, Jun 24, 2014 at 11:35 AM, Denny Lee <denny.g....@gmail.com> wrote:

> By any chance has anyone worked with Kafka using message sizes that are
> approximately 50MB? Based on some of the previous threads, there are
> probably some concerns about memory pressure due to compression on the
> broker and decompression on the consumer, and about best practices for
> choosing the batch size (so that the compressed message does not exceed
> the message size limit).
>
> Any other best practices or thoughts concerning this scenario?
>
> Thanks!
> Denny
>
>
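
For reference on the size-limit part of the question, the 0.8 settings that 
usually matter for large messages are the broker's message.max.bytes and 
replica.fetch.max.bytes and the (high-level) consumer's fetch.message.max.bytes; 
the values below are only illustrative and would need to comfortably exceed the 
largest compressed message:

# broker (server.properties) - illustrative sizes only
message.max.bytes=52428800
replica.fetch.max.bytes=52428800

# consumer - must be at least as large as message.max.bytes
fetch.message.max.bytes=52428800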
