MapReduce is very coarse-grained. It might seem that more cores is always better, but once the data size falls well below the block size, the overhead of starting JVM processes and all the other per-task background work becomes a significant percentage of the overall runtime, so you quickly reach the point of diminishing returns. Hadoop wasn't designed for small data, so this overhead was never optimized away. For big data, you amortize that overhead over the actual processing time.
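If you do end up querying inputs this small, one practical knob is the split size: Hive's CombineHiveInputFormat can pack several small files or blocks into a single map task, so the JVM startup cost isn't paid once per tiny split. A minimal sketch, assuming a Hadoop 1.x-era configuration; the property names and the 256 MB figure are illustrative and may differ in your distribution:

-- combine small input splits into fewer, larger map tasks
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SET mapred.max.split.size=268435456;   -- target ~256 MB per split (illustrative)
SET mapred.min.split.size=268435456;

With 256 MB of input and 64 MB blocks you would otherwise get roughly 4 map tasks anyway; these settings matter most once there are many files much smaller than a block.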
It sounds like 1 day is a good granularity. Your production cluster might use a larger block size, a multiple of 64MB, which is increasingly common these days.

On Wed, Mar 6, 2013 at 9:44 AM, Sékine Coulibaly <scoulib...@gmail.com> wrote:

> Hi Dean,
>
> Indeed, switching from RCFiles to SequenceFiles brought the query duration
> down 35% (82 secs down to 53 secs)! I also added Snappy/Gzip block
> compression. Things are getting better, down to 30 secs (SequenceFile + Snappy).
>
> Yes, most requests have a WHERE clause with a time range, so I will give
> partitioning a try. For now, my tests span one day of log data. I will
> ingest more, partition, and see how it goes.
>
> However, it's not clear to me why I should minimize the mappers. Having
> 16 cores, would it make sense to use as many of them as possible to
> parallelize? So far, one day's worth of logs is 256 MB. In my understanding,
> provided that HDFS has 64MB blocks, I should use 4 mappers, right?
>
> If this is the case, since I'm in pseudo-distributed mode for the moment,
> my number of mappers = 1, so I could try to configure my setup with
> additional mappers.
>
> Does this make sense?
>
> Thank you for your help!
>
> Sekine
>
>
> 2013/3/4 Dean Wampler <dean.wamp...@thinkbiganalytics.com>
>
>> RCFile won't help much (and apparently not at all in this case ;) unless
>> you have a lot of columns and you always query just a few of them. However,
>> you should get better results with Sequence Files (binary format) and
>> usually with a compression scheme like BZip that supports block-level (as
>> opposed to file-level) compression. Why? Compressed files and sequence
>> files reduce the amount of disk IO and hence improve IO performance (a
>> bottleneck).
>>
>> Do you almost always query with a WHERE clause with a time range? If so,
>> consider partitioning your data by time ranges, e.g., year/month/day. Your
>> actual timestamp granularity would be chosen so that each folder (and yes,
>> they'll be individual folders) holds data files of at least 64MB, or
>> whatever multiple of 64MB you're using in your cluster. It could be that
>> per-day is the finest granularity, or even per hour or minute if you really
>> have a lot of data. Briefly, you want to minimize the number of mapper
>> processes used to process the data, and this is the granularity per mapper.
>> Why partition? Because when you do SELECT * FROM mytable WHERE year = 2012
>> AND month = 3 AND day = 4, Hive knows it only has to read the contents of
>> that single directory, not all the directories...
>>
>> You might also consider clustering by URL. This feature (and the others)
>> is described on the Hive wiki. It can also speed up sampling of large data
>> sets and joins.
>>
>> I assume you're just using the virtual machine for experimenting. Lots of
>> overhead there, too!
>>
>> Hope this helps.
>> dean
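To make the SequenceFile-plus-block-compression and partitioning advice above concrete, here is a minimal sketch. The names logs_part, raw_logs, and log_ts are made up for illustration, the bucket count is arbitrary, and the raw table is assumed to carry a parseable timestamp string, which the schema quoted further down doesn't actually have:

-- day-partitioned, url-bucketed table stored as SequenceFile
CREATE TABLE logs_part (
  url STRING,
  orig_country STRING,
  http_rt INT
)
PARTITIONED BY (year INT, month INT, day INT)
CLUSTERED BY (url) INTO 32 BUCKETS
STORED AS SEQUENCEFILE;

-- write with block-level Snappy compression
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

-- load via dynamic partitions from a raw table with a log_ts column (an assumption)
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.enforce.bucketing=true;

INSERT OVERWRITE TABLE logs_part PARTITION (year, month, day)
SELECT url, orig_country, http_rt,
       year(log_ts), month(log_ts), day(log_ts)
FROM raw_logs;

-- a day-level predicate now reads only that partition's directory
SELECT COUNT(*) FROM logs_part WHERE year = 2013 AND month = 3 AND day = 4;

The year()/month()/day() calls assume a 'yyyy-MM-dd HH:mm:ss' string; adjust the parsing to however the timestamps actually arrive.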
>> On Mon, Mar 4, 2013 at 4:33 PM, Sékine Coulibaly <scoulib...@gmail.com> wrote:
>>
>>> Hi there,
>>>
>>> I've set up a virtual machine hosting Hive.
>>> My use case is Web traffic analytics, hence most of the requests are:
>>>
>>> - how many requests today?
>>> - how many requests today, grouped by country?
>>> - most requested URLs?
>>> - average HTTP server response time (5-minute slots)?
>>>
>>> In other words, let's consider:
>>> CREATE TABLE logs ( url STRING, orig_country STRING, http_rt INT )
>>> and
>>>
>>> SELECT COUNT(*) FROM logs;
>>> SELECT COUNT(*), orig_country FROM logs GROUP BY orig_country;
>>> SELECT COUNT(*), url FROM logs GROUP BY url;
>>> SELECT AVG(http_rt) FROM logs ...
>>>
>>> 2 questions here:
>>>
>>> - How do I generate 5-minute slots to compute my averages (in PostgreSQL,
>>> I used generate_series() and a JOIN)? I wish I could avoid running multiple
>>> queries, each with a 'WHERE date > ... AND date < ...'. Maybe a mapper,
>>> mapping the date string to a slot number?
>>>
>>> - What is the best storage method for this table? Since its purpose is
>>> analytical, I thought a columnar format was the way to go. So I tried
>>> RCFILE, but the results for around 1 million rows (quite small, I know)
>>> are as follows, and are quite the opposite of what I was expecting:
>>>
>>> Storage / query duration / disk table size
>>> TEXTFILE / 22 seconds / 250 MB
>>> RCFILE / 31 seconds / 320 MB
>>>
>>> I thought getting values in columns would speed up the aggregation.
>>> Maybe the dataset is too small to tell, or I missed something? Will adding
>>> Snappy compression help (not sure whether RCFiles are compressed or not)?
>>>
>>> Thank you!
>>
>> --
>> *Dean Wampler, Ph.D.*
>> thinkbiganalytics.com
>> +1-312-339-1330

--
*Dean Wampler, Ph.D.*
thinkbiganalytics.com
+1-312-339-1330
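On the 5-minute-slots question quoted above: in Hive there is no need for generate_series() or one query per slot; each row can be bucketed into its slot with the built-in date functions and a single GROUP BY. A rough sketch, again assuming a timestamp string column (here called log_ts, which is not part of the schema shown above):

-- average response time per 5-minute slot, in one scan
SELECT from_unixtime(floor(unix_timestamp(log_ts) / 300) * 300) AS slot_start,
       AVG(http_rt) AS avg_http_rt
FROM logs
GROUP BY from_unixtime(floor(unix_timestamp(log_ts) / 300) * 300);

Slots with no rows simply won't appear in the output, so any gap-filling has to happen downstream.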