Ok, thanks.  Quick Q:  Won't each sink consume the same data?  Do I need to set 
up the load balancing sink processor to keep that from happening?


On Jan 16, 2013, at 5:47 PM, Hari Shreedharan <[email protected]> wrote:

> Also can you try adding more HDFS sinks reading from the same channel. I'd 
> recommend using different file prefixes, or paths for each sink, to avoid 
> collision. Since each sink really has just one thread driving them, adding 
> multiple sinks might help. Also, keep an eye on the memory channel's sizes 
> and see if it is filling up (there will be ChannelExceptions in the logs if 
> it is). 
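A sketch of the multi-sink setup Hari describes (agent, channel, and sink names here are placeholders, not taken from Andrew's actual config). Note that sinks attached to the same channel compete for events — each event is taken by exactly one sink — so the distinct file prefixes are only needed to avoid filename collisions in HDFS:

```properties
# Two HDFS sinks draining the same channel. Each event goes to exactly
# one sink; prefixes differ only to avoid HDFS filename collisions.
agent.sinks = hdfs1 hdfs2
agent.sinks.hdfs1.type = hdfs
agent.sinks.hdfs1.channel = mem1
agent.sinks.hdfs1.hdfs.path = /flume/webrequest
agent.sinks.hdfs1.hdfs.filePrefix = part1
agent.sinks.hdfs2.type = hdfs
agent.sinks.hdfs2.channel = mem1
agent.sinks.hdfs2.hdfs.path = /flume/webrequest
agent.sinks.hdfs2.hdfs.filePrefix = part2
```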
> 
> 
> Hari
> 
> -- 
> Hari Shreedharan
> 
> On Wednesday, January 16, 2013 at 2:34 PM, Brock Noland wrote:
> 
>> Good to hear! Take five or six thread dumps of it and then send them our way.
>> 
>> On Wed, Jan 16, 2013 at 2:30 PM, Andrew Otto <[email protected]> wrote:
>>> Cool, thanks for the advice! That's a great blog post.
>>> 
>>> I've changed my ways (for now at least). I've got lots of disks to use for
>>> the file channel once things work with the memory channel, and this node
>>> has tooooons of memory (192G).
>>> 
>>> Here's my new flume.conf:
>>> https://gist.github.com/4551513
>>> 
>>> This is doing better, for sure. Note that I took out the timestamp 
>>> regex_extractor just in case that was impacting performance. I'm using the 
>>> regular ol' timestamp interceptor now.
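The interceptor swap described above might look something like this in flume.conf (the source and interceptor names are illustrative, not taken from the linked gist):

```properties
# Plain timestamp interceptor stamping each event's header,
# in place of the regex_extractor.
agent.sources.udp.interceptors = ts
agent.sources.udp.interceptors.ts.type = timestamp
```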
>>> 
>>> I'm still not doing so great though. I'm getting about 300 MB per minute in
>>> my HDFS files. At ~50 MB per second I should be getting about 3 GB. That's
>>> better than before though. I've got 10% of the data this time, rather than
>>> 0.14% :)
>>> 
>>> 
>>> 
>>> 
>>> On Jan 16, 2013, at 4:36 PM, Brock Noland <[email protected]> wrote:
>>> 
>>>> Hi,
>>>> 
>>>> I would use memory channel for now as opposed to file channel. For
>>>> file channel to keep up with that you'd need multiple disks. Also your
>>>> checkpoint period is super-low which will cause lots of checkpoints
>>>> and slow things down.
>>>> 
>>>> However, I think the biggest issue is probably batch size. With that
>>>> much data you are likely going to want a large batch size for all
>>>> components involved. Something like a low multiple of 1000. There is a good
>>>> article on this:
>>>> https://blogs.apache.org/flume/entry/flume_performance_tuning_part_1
>>>> 
>>>> To re-cap, I would:
>>>> 
>>>> Use memory channel for now. Once you prove things work you can work on
>>>> tuning file channel (you're going to want larger batch sizes and multiple
>>>> disks).
>>>> 
>>>> Increase the batch size for your source/sink.
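A hedged sketch of that tuning in flume.conf terms (agent, channel, and sink names are placeholders; the specific capacity numbers are illustrative, not recommendations from this thread):

```properties
# Memory channel with generous capacity, plus batch sizes in the
# low thousands on the components draining it.
agent.channels.mem1.type = memory
agent.channels.mem1.capacity = 1000000
agent.channels.mem1.transactionCapacity = 5000
agent.sinks.hdfs1.hdfs.batchSize = 5000
```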
>>>> 
>>>> On Wed, Jan 16, 2013 at 1:22 PM, Andrew Otto <[email protected]> wrote:
>>>>> Ok, I'm trying my new UDPSource with Wikimedia's webrequest log stream. 
>>>>> This is available to me via UDP Multicast. Everything seems to be working 
>>>>> great, except that I seem to be missing a lot of data.
>>>>> 
>>>>> Our webrequest log stream consists of about 100000 events per second, 
>>>>> which amounts to around 50 MB per second.
>>>>> 
>>>>> I understand that this is probably too much for a single node to handle, 
>>>>> but I should be able to either see most of the data written to HDFS, or 
>>>>> at least see errors about channels being filled to capacity. HDFS files 
>>>>> are set to roll every 60 seconds. Each of my files is only about 4.2MB, 
>>>>> which is only 72 KB per second. That's only 0.14% of the data I'm 
>>>>> expecting to consume.
>>>>> 
>>>>> Where did the rest of it go? If Flume is dropping it, why doesn't it tell 
>>>>> me!?
>>>>> 
>>>>> Here's my flume.conf:
>>>>> 
>>>>> https://gist.github.com/4551001
>>>>> 
>>>>> 
>>>>> Thanks!
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Jan 15, 2013, at 2:31 PM, Andrew Otto <[email protected]> wrote:
>>>>> 
>>>>>> I just submitted the patch on 
>>>>>> https://issues.apache.org/jira/browse/FLUME-1838.
>>>>>> 
>>>>>> Would love some reviews, thanks!
>>>>>> -Andrew
>>>>>> 
>>>>>> 
>>>>>> On Jan 14, 2013, at 1:01 PM, Andrew Otto <[email protected]> wrote:
>>>>>> 
>>>>>>> Thanks guys! I've opened up a JIRA here:
>>>>>>> 
>>>>>>> https://issues.apache.org/jira/browse/FLUME-1838
>>>>>>> 
>>>>>>> 
>>>>>>> On Jan 14, 2013, at 12:43 PM, Alexander Alten-Lorenz 
>>>>>>> <[email protected]> wrote:
>>>>>>> 
>>>>>>>> Hey Andrew,
>>>>>>>> 
>>>>>>>> for your reference, we have a lot of developer information in our 
>>>>>>>> wiki:
>>>>>>>> 
>>>>>>>> https://cwiki.apache.org/confluence/display/FLUME/Developer+Section
>>>>>>>> https://cwiki.apache.org/confluence/display/FLUME/Developers+Quick+Hack+Sheet
>>>>>>>> 
>>>>>>>> cheers,
>>>>>>>> Alex
>>>>>>>> 
>>>>>>>> On Jan 14, 2013, at 6:37 PM, Hari Shreedharan 
>>>>>>>> <[email protected]> wrote:
>>>>>>>> 
>>>>>>>>> Hi Andrew,
>>>>>>>>> 
>>>>>>>>> Really happy to hear Wikimedia Foundation is considering Flume. I am 
>>>>>>>>> fairly sure that if you find such a source useful, there would 
>>>>>>>>> definitely be others who find it useful too. I'd recommend filing a 
>>>>>>>>> jira and starting a discussion, and then submitting the patch. We 
>>>>>>>>> would be happy to review and commit it.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Thanks,
>>>>>>>>> Hari
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Hari Shreedharan
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Monday, January 14, 2013 at 9:29 AM, Andrew Otto wrote:
>>>>>>>>> 
>>>>>>>>>> Hi all,
>>>>>>>>>> 
>>>>>>>>>> I'm a Systems Engineer at the Wikimedia Foundation, and we're 
>>>>>>>>>> investigating using Flume for our web request log HDFS imports. 
>>>>>>>>>> We've previously been using Kafka, but we've had to change our 
>>>>>>>>>> short-term architecture plans in order to get data into HDFS 
>>>>>>>>>> reliably and regularly in the near term.
>>>>>>>>>> 
>>>>>>>>>> Our current web request logs are available for consumption over a 
>>>>>>>>>> multicast UDP stream. I could hack something together to try and 
>>>>>>>>>> pipe this into Flume using the existing sources (SyslogUDPSource, or 
>>>>>>>>>> maybe some combination of socat + NetcatSource), but I'd rather 
>>>>>>>>>> reduce the number of moving parts. I'd like to consume directly from 
>>>>>>>>>> the multicast UDP stream as a Flume source.
>>>>>>>>>> 
>>>>>>>>>> I coded up a proof of concept based on the SyslogUDPSource, mainly 
>>>>>>>>>> just stripping out the syslog event header extraction, and adding in 
>>>>>>>>>> multicast Datagram connection code. I plan on cleaning this up, and 
>>>>>>>>>> making this a generic raw UDP source, with multicast being a 
>>>>>>>>>> configuration option.
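The core of such a source, minus the Flume plumbing, might look roughly like this (the group address and port are placeholders; this is a sketch of the general multicast-receive approach, not the actual patch):

```java
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;

public class MulticastUdpPoC {
    // Block until the next datagram arrives and return its payload.
    // In the real Flume source, each datagram would become one event body.
    static String receiveOne(MulticastSocket socket) throws Exception {
        byte[] buf = new byte[65535];
        DatagramPacket packet = new DatagramPacket(buf, buf.length);
        socket.receive(packet);
        return new String(packet.getData(), 0, packet.getLength(), "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // Placeholder group/port, not Wikimedia's actual stream.
        InetAddress group = InetAddress.getByName("230.0.0.1");
        MulticastSocket socket = new MulticastSocket(9999);
        socket.joinGroup(group);
        while (true) {
            System.out.println(receiveOne(socket));
        }
    }
}
```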
>>>>>>>>>> 
>>>>>>>>>> My question to you guys is, is this something the Flume community 
>>>>>>>>>> would find useful? If so, should I open up a JIRA to track this? 
>>>>>>>>>> I've got a fork of the Flume git repo over on github and will be 
>>>>>>>>>> doing my work there. I'd love to share it upstream if it would be 
>>>>>>>>>> useful.
>>>>>>>>>> 
>>>>>>>>>> Thanks!
>>>>>>>>>> -Andrew Otto
>>>>>>>>>> Systems Engineer
>>>>>>>>>> Wikimedia Foundation
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Alexander Alten-Lorenz
>>>>>>>> http://mapredit.blogspot.com
>>>>>>>> German Hadoop LinkedIn Group: http://goo.gl/N8pCF
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Apache MRUnit - Unit testing MapReduce - 
>>>> http://incubator.apache.org/mrunit/
>> 
>> 
>> 
>> --
>> Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
> 
