Good to hear! Take five or six thread dumps of it and then send them our way.

On Wed, Jan 16, 2013 at 2:30 PM, Andrew Otto <[email protected]> wrote:
> Cool, thanks for the advice! That's a great blog post.
>
> I've changed my ways (for now at least). I've got lots of disks to use once
> the memory channel setup is working, and this node has tooooons of memory
> (192G).
>
> Here's my new flume.conf:
> https://gist.github.com/4551513
>
> This is doing better, for sure. Note that I took out the timestamp
> regex_extractor just in case that was impacting performance. I'm using the
> regular ol' timestamp interceptor now.
>
> I'm still not doing so great, though. I'm getting about 300 MB per minute in
> my HDFS files. I should be getting about 3 GB. That's better than before,
> though. I've got 10% of the data this time, rather than 0.14% :)
>
> On Jan 16, 2013, at 4:36 PM, Brock Noland <[email protected]> wrote:
>
>> Hi,
>>
>> I would use the memory channel for now as opposed to the file channel. For
>> the file channel to keep up with that, you'd need multiple disks. Also,
>> your checkpoint period is super low, which will cause lots of checkpoints
>> and slow things down.
>>
>> However, I think the biggest issue is probably batch size. With that
>> much data you are likely going to want a large batch size for all
>> components involved -- something like a low multiple of 1000. There is a
>> good article on this:
>> https://blogs.apache.org/flume/entry/flume_performance_tuning_part_1
>>
>> To recap, I would:
>>
>> Use the memory channel for now. Once you prove things work, you can work
>> on tuning the file channel (you're going to want larger batch sizes and
>> multiple disks).
>>
>> Increase the batch size for your source/sink.
>>
>> On Wed, Jan 16, 2013 at 1:22 PM, Andrew Otto <[email protected]> wrote:
>>> OK, I'm trying my new UDPSource with Wikimedia's webrequest log stream.
>>> This is available to me via UDP multicast. Everything seems to be working
>>> great, except that I seem to be missing a lot of data.
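As a sanity check on the numbers in this thread: at ~50 MB per second the stream is ~3 GB per minute, so 300 MB per minute is 10% of it, and a 4.2 MB file rolled every 60 seconds works out to 0.14%. A quick arithmetic check (the class and method names are mine, not from the thread):

```java
public class ThroughputCheck {

    // What percentage of the expected rate does the observed rate represent?
    public static double percentOf(double observedBytesPerSec, double expectedBytesPerSec) {
        return 100.0 * observedBytesPerSec / expectedBytesPerSec;
    }

    public static void main(String[] args) {
        double expected = 50.0 * 1024 * 1024;           // ~50 MB/s webrequest stream

        // First attempt: 4.2 MB HDFS files, rolled every 60 seconds.
        double firstRun = 4.2 * 1024 * 1024 / 60.0;     // ~72 KB/s
        // Second attempt: ~300 MB per minute.
        double secondRun = 300.0 * 1024 * 1024 / 60.0;  // 5 MB/s

        System.out.printf("first run:  %.2f%% of the stream%n", percentOf(firstRun, expected));
        System.out.printf("second run: %.1f%% of the stream%n", percentOf(secondRun, expected));
        // prints 0.14% and 10.0%, matching the figures in the thread
    }
}
```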
>>>
>>> Our webrequest log stream consists of about 100,000 events per second,
>>> which amounts to around 50 MB per second.
>>>
>>> I understand that this is probably too much for a single node to handle,
>>> but I should be able to either see most of the data written to HDFS, or
>>> at least see errors about channels being filled to capacity. HDFS files
>>> are set to roll every 60 seconds. Each of my files is only about 4.2 MB,
>>> which is only 72 KB per second. That's only 0.14% of the data I'm
>>> expecting to consume.
>>>
>>> Where did the rest of it go? If Flume is dropping it, why doesn't it tell
>>> me!?
>>>
>>> Here's my flume.conf:
>>>
>>> https://gist.github.com/4551001
>>>
>>> Thanks!
>>>
>>> On Jan 15, 2013, at 2:31 PM, Andrew Otto <[email protected]> wrote:
>>>
>>>> I just submitted the patch on
>>>> https://issues.apache.org/jira/browse/FLUME-1838.
>>>>
>>>> Would love some reviews, thanks!
>>>> -Andrew
>>>>
>>>> On Jan 14, 2013, at 1:01 PM, Andrew Otto <[email protected]> wrote:
>>>>
>>>>> Thanks guys! I've opened up a JIRA here:
>>>>>
>>>>> https://issues.apache.org/jira/browse/FLUME-1838
>>>>>
>>>>> On Jan 14, 2013, at 12:43 PM, Alexander Alten-Lorenz
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Hey Andrew,
>>>>>>
>>>>>> For your reference, we have a lot of developer information in our wiki:
>>>>>>
>>>>>> https://cwiki.apache.org/confluence/display/FLUME/Developer+Section
>>>>>> https://cwiki.apache.org/confluence/display/FLUME/Developers+Quick+Hack+Sheet
>>>>>>
>>>>>> Cheers,
>>>>>> Alex
>>>>>>
>>>>>> On Jan 14, 2013, at 6:37 PM, Hari Shreedharan
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Andrew,
>>>>>>>
>>>>>>> Really happy to hear the Wikimedia Foundation is considering Flume. I
>>>>>>> am fairly sure that if you find such a source useful, there would
>>>>>>> definitely be others who find it useful too. I'd recommend filing a
>>>>>>> JIRA and starting a discussion, and then submitting the patch.
>>>>>>> We would be happy to review and commit it.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Hari
>>>>>>>
>>>>>>> --
>>>>>>> Hari Shreedharan
>>>>>>>
>>>>>>> On Monday, January 14, 2013 at 9:29 AM, Andrew Otto wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I'm a Systems Engineer at the Wikimedia Foundation, and we're
>>>>>>>> investigating using Flume for our web request log HDFS imports.
>>>>>>>> We've previously been using Kafka, but have had to change short-term
>>>>>>>> architecture plans in order to get data into HDFS reliably and
>>>>>>>> regularly soon.
>>>>>>>>
>>>>>>>> Our current web request logs are available for consumption over a
>>>>>>>> multicast UDP stream. I could hack something together to try to pipe
>>>>>>>> this into Flume using the existing sources (SyslogUDPSource, or
>>>>>>>> maybe some combination of socat + NetcatSource), but I'd rather
>>>>>>>> reduce the number of moving parts. I'd like to consume directly from
>>>>>>>> the multicast UDP stream as a Flume source.
>>>>>>>>
>>>>>>>> I coded up a proof of concept based on the SyslogUDPSource, mainly
>>>>>>>> just stripping out the syslog event header extraction and adding in
>>>>>>>> multicast datagram connection code. I plan on cleaning this up and
>>>>>>>> making it a generic raw UDP source, with multicast being a
>>>>>>>> configuration option.
>>>>>>>>
>>>>>>>> My question to you guys is: is this something the Flume community
>>>>>>>> would find useful? If so, should I open up a JIRA to track this?
>>>>>>>> I've got a fork of the Flume git repo over on GitHub and will be
>>>>>>>> doing my work there. I'd love to share it upstream if it would be
>>>>>>>> useful.
>>>>>>>>
>>>>>>>> Thanks!
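The proof of concept Andrew describes (SyslogUDPSource with the syslog header extraction stripped out and a multicast group join added) might look roughly like this. This is a sketch only: the group address, port, and class name are made up, and a real Flume source would wrap each payload in an Event and hand it to the channel processor rather than print it:

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;
import java.util.Arrays;

public class MulticastUdpSourceSketch {

    static final int MAX_PACKET = 65536; // max UDP datagram size

    // Trim trailing CR/LF so each datagram payload becomes one clean event body.
    static byte[] eventBody(byte[] buf, int length) {
        int end = length;
        while (end > 0 && (buf[end - 1] == '\n' || buf[end - 1] == '\r')) {
            end--;
        }
        return Arrays.copyOf(buf, end);
    }

    public static void main(String[] args) throws IOException {
        InetAddress group = InetAddress.getByName("233.58.59.1"); // hypothetical group
        try (MulticastSocket socket = new MulticastSocket(8420)) { // hypothetical port
            socket.joinGroup(group);            // the multicast-specific step
            byte[] buf = new byte[MAX_PACKET];
            while (true) {
                DatagramPacket packet = new DatagramPacket(buf, buf.length);
                socket.receive(packet);         // blocks until a datagram arrives
                byte[] body = eventBody(packet.getData(), packet.getLength());
                // A real source would build a Flume Event here; we just print.
                System.out.println(new String(body, "UTF-8"));
            }
        }
    }
}
```

Making the plain unicast `DatagramSocket` path the default and the `joinGroup` call conditional on a config flag would match the "generic raw UDP source, with multicast being a configuration option" plan above.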
>>>>>>>> -Andrew Otto
>>>>>>>> Systems Engineer
>>>>>>>> Wikimedia Foundation
>>>>>>
>>>>>> --
>>>>>> Alexander Alten-Lorenz
>>>>>> http://mapredit.blogspot.com
>>>>>> German Hadoop LinkedIn Group: http://goo.gl/N8pCF
>>
>> --
>> Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
--
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
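For reference, Brock's tuning advice in this thread (memory channel for now, batch sizes "a low multiple of 1000") might look roughly like this in flume.conf. This is a sketch only: the agent and component names are assumed, not taken from Andrew's gists, and the capacities are illustrative:

```properties
# Sketch of the tuning advice above -- agent/component names are assumed.
agent.sources = udp-source
agent.channels = mem-channel
agent.sinks = hdfs-sink

# Memory channel, sized to absorb bursts; transactionCapacity must be
# at least as large as the batch sizes used against it.
agent.channels.mem-channel.type = memory
agent.channels.mem-channel.capacity = 1000000
agent.channels.mem-channel.transactionCapacity = 10000

agent.sources.udp-source.channels = mem-channel
agent.sinks.hdfs-sink.channel = mem-channel

# Batch size "a low multiple of 1000" on the HDFS sink.
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.hdfs.batchSize = 5000
# Roll every 60 seconds, as in the thread; disable size/count-based rolls.
agent.sinks.hdfs-sink.hdfs.rollInterval = 60
agent.sinks.hdfs-sink.hdfs.rollSize = 0
agent.sinks.hdfs-sink.hdfs.rollCount = 0
```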
