Good to hear! Take five or six thread dumps of it and then send them our way.

On Wed, Jan 16, 2013 at 2:30 PM, Andrew Otto <[email protected]> wrote:
> Cool, thanks for the advice! That's a great blog post.
>
> I've changed my ways (for now at least). I've got lots of disks to use once
> the memory channel setup is working, and this node has tooooons of memory
> (192G).
>
> Here's my new flume.conf:
> https://gist.github.com/4551513
>
> This is doing better, for sure. Note that I took out the timestamp
> regex_extractor just in case that was impacting performance. I'm using the
> regular ol' timestamp interceptor now.
>
> I'm still not doing so great, though. I'm getting about 300 MB per minute in
> my HDFS files. I should be getting about 3 GB. That's better than before,
> though. I've got 10% of the data this time, rather than 0.14% :)
>
> On Jan 16, 2013, at 4:36 PM, Brock Noland <[email protected]> wrote:
>
>> Hi,
>>
>> I would use the memory channel for now as opposed to the file channel. For
>> the file channel to keep up with that, you'd need multiple disks. Also,
>> your checkpoint period is super low, which will cause lots of checkpoints
>> and slow things down.
>>
>> However, I think the biggest issue is probably batch size. With that
>> much data you are likely going to want a large batch size for all
>> components involved -- something like a low multiple of 1000. There is a
>> good article on this:
>> https://blogs.apache.org/flume/entry/flume_performance_tuning_part_1
>>
>> To recap, I would:
>>
>> Use the memory channel for now. Once you prove things work, you can work
>> on tuning the file channel (you're going to want larger batch sizes and
>> multiple disks).
>>
>> Increase the batch size for your source/sink.
>>
>> On Wed, Jan 16, 2013 at 1:22 PM, Andrew Otto <[email protected]> wrote:
>>> OK, I'm trying my new UDPSource with Wikimedia's webrequest log stream.
>>> This is available to me via UDP multicast. Everything seems to be working
>>> great, except that I seem to be missing a lot of data.
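As a sanity check on the numbers in this thread: at ~50 MB per second the stream is ~3 GB per minute, so 300 MB per minute is 10% of it, and a 4.2 MB file rolled every 60 seconds works out to 0.14%. A quick arithmetic check (the class and method names are mine, not from the thread):

```java
public class ThroughputCheck {

    // What percentage of the expected rate does the observed rate represent?
    public static double percentOf(double observedBytesPerSec, double expectedBytesPerSec) {
        return 100.0 * observedBytesPerSec / expectedBytesPerSec;
    }

    public static void main(String[] args) {
        double expected = 50.0 * 1024 * 1024;           // ~50 MB/s webrequest stream

        // First attempt: 4.2 MB HDFS files, rolled every 60 seconds.
        double firstRun = 4.2 * 1024 * 1024 / 60.0;     // ~72 KB/s
        // Second attempt: ~300 MB per minute.
        double secondRun = 300.0 * 1024 * 1024 / 60.0;  // 5 MB/s

        System.out.printf("first run:  %.2f%% of the stream%n", percentOf(firstRun, expected));
        System.out.printf("second run: %.1f%% of the stream%n", percentOf(secondRun, expected));
        // prints 0.14% and 10.0%, matching the figures in the thread
    }
}
```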
>>>
>>> Our webrequest log stream consists of about 100,000 events per second,
>>> which amounts to around 50 MB per second.
>>>
>>> I understand that this is probably too much for a single node to handle,
>>> but I should be able to either see most of the data written to HDFS, or
>>> at least see errors about channels being filled to capacity. HDFS files
>>> are set to roll every 60 seconds. Each of my files is only about 4.2 MB,
>>> which is only 72 KB per second. That's only 0.14% of the data I'm
>>> expecting to consume.
>>>
>>> Where did the rest of it go? If Flume is dropping it, why doesn't it tell
>>> me!?
>>>
>>> Here's my flume.conf:
>>>
>>> https://gist.github.com/4551001
>>>
>>> Thanks!
>>>
>>> On Jan 15, 2013, at 2:31 PM, Andrew Otto <[email protected]> wrote:
>>>
>>>> I just submitted the patch on
>>>> https://issues.apache.org/jira/browse/FLUME-1838.
>>>>
>>>> Would love some reviews, thanks!
>>>> -Andrew
>>>>
>>>> On Jan 14, 2013, at 1:01 PM, Andrew Otto <[email protected]> wrote:
>>>>
>>>>> Thanks guys! I've opened up a JIRA here:
>>>>>
>>>>> https://issues.apache.org/jira/browse/FLUME-1838
>>>>>
>>>>> On Jan 14, 2013, at 12:43 PM, Alexander Alten-Lorenz
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Hey Andrew,
>>>>>>
>>>>>> For your reference, we have a lot of developer information in our wiki:
>>>>>>
>>>>>> https://cwiki.apache.org/confluence/display/FLUME/Developer+Section
>>>>>> https://cwiki.apache.org/confluence/display/FLUME/Developers+Quick+Hack+Sheet
>>>>>>
>>>>>> Cheers,
>>>>>> Alex
>>>>>>
>>>>>> On Jan 14, 2013, at 6:37 PM, Hari Shreedharan
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Andrew,
>>>>>>>
>>>>>>> Really happy to hear the Wikimedia Foundation is considering Flume. I
>>>>>>> am fairly sure that if you find such a source useful, there would
>>>>>>> definitely be others who find it useful too. I'd recommend filing a
>>>>>>> JIRA and starting a discussion, and then submitting the patch.
>>>>>>> We would be happy to review and commit it.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Hari
>>>>>>>
>>>>>>> --
>>>>>>> Hari Shreedharan
>>>>>>>
>>>>>>> On Monday, January 14, 2013 at 9:29 AM, Andrew Otto wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I'm a Systems Engineer at the Wikimedia Foundation, and we're
>>>>>>>> investigating using Flume for our web request log HDFS imports.
>>>>>>>> We've previously been using Kafka, but have had to change short-term
>>>>>>>> architecture plans in order to get data into HDFS reliably and
>>>>>>>> regularly soon.
>>>>>>>>
>>>>>>>> Our current web request logs are available for consumption over a
>>>>>>>> multicast UDP stream. I could hack something together to try to pipe
>>>>>>>> this into Flume using the existing sources (SyslogUDPSource, or
>>>>>>>> maybe some combination of socat + NetcatSource), but I'd rather
>>>>>>>> reduce the number of moving parts. I'd like to consume directly from
>>>>>>>> the multicast UDP stream as a Flume source.
>>>>>>>>
>>>>>>>> I coded up a proof of concept based on the SyslogUDPSource, mainly
>>>>>>>> just stripping out the syslog event header extraction and adding in
>>>>>>>> multicast datagram connection code. I plan on cleaning this up and
>>>>>>>> making it a generic raw UDP source, with multicast being a
>>>>>>>> configuration option.
>>>>>>>>
>>>>>>>> My question to you guys is: is this something the Flume community
>>>>>>>> would find useful? If so, should I open up a JIRA to track this?
>>>>>>>> I've got a fork of the Flume git repo over on GitHub and will be
>>>>>>>> doing my work there. I'd love to share it upstream if it would be
>>>>>>>> useful.
>>>>>>>>
>>>>>>>> Thanks!
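The proof of concept Andrew describes (SyslogUDPSource with the syslog header extraction stripped out and a multicast group join added) might look roughly like this. This is a sketch only: the group address, port, and class name are made up, and a real Flume source would wrap each payload in an Event and hand it to the channel processor rather than print it:

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.MulticastSocket;
import java.util.Arrays;

public class MulticastUdpSourceSketch {

    static final int MAX_PACKET = 65536; // max UDP datagram size

    // Trim trailing CR/LF so each datagram payload becomes one clean event body.
    static byte[] eventBody(byte[] buf, int length) {
        int end = length;
        while (end > 0 && (buf[end - 1] == '\n' || buf[end - 1] == '\r')) {
            end--;
        }
        return Arrays.copyOf(buf, end);
    }

    public static void main(String[] args) throws IOException {
        InetAddress group = InetAddress.getByName("233.58.59.1"); // hypothetical group
        try (MulticastSocket socket = new MulticastSocket(8420)) { // hypothetical port
            socket.joinGroup(group);            // the multicast-specific step
            byte[] buf = new byte[MAX_PACKET];
            while (true) {
                DatagramPacket packet = new DatagramPacket(buf, buf.length);
                socket.receive(packet);         // blocks until a datagram arrives
                byte[] body = eventBody(packet.getData(), packet.getLength());
                // A real source would build a Flume Event here; we just print.
                System.out.println(new String(body, "UTF-8"));
            }
        }
    }
}
```

Making the plain unicast `DatagramSocket` path the default and the `joinGroup` call conditional on a config flag would match the "generic raw UDP source, with multicast being a configuration option" plan above.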
>>>>>>>> -Andrew Otto
>>>>>>>> Systems Engineer
>>>>>>>> Wikimedia Foundation
>>>>>>
>>>>>> --
>>>>>> Alexander Alten-Lorenz
>>>>>> http://mapredit.blogspot.com
>>>>>> German Hadoop LinkedIn Group: http://goo.gl/N8pCF
>>
>> --
>> Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
--
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
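For reference, Brock's tuning advice in this thread (memory channel for now, batch sizes "a low multiple of 1000") might look roughly like this in flume.conf. This is a sketch only: the agent and component names are assumed, not taken from Andrew's gists, and the capacities are illustrative:

```properties
# Sketch of the tuning advice above -- agent/component names are assumed.
agent.sources = udp-source
agent.channels = mem-channel
agent.sinks = hdfs-sink

# Memory channel, sized to absorb bursts; transactionCapacity must be
# at least as large as the batch sizes used against it.
agent.channels.mem-channel.type = memory
agent.channels.mem-channel.capacity = 1000000
agent.channels.mem-channel.transactionCapacity = 10000

agent.sources.udp-source.channels = mem-channel
agent.sinks.hdfs-sink.channel = mem-channel

# Batch size "a low multiple of 1000" on the HDFS sink.
agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.hdfs.batchSize = 5000
# Roll every 60 seconds, as in the thread; disable size/count-based rolls.
agent.sinks.hdfs-sink.hdfs.rollInterval = 60
agent.sinks.hdfs-sink.hdfs.rollSize = 0
agent.sinks.hdfs-sink.hdfs.rollCount = 0
```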
