Hi Dean,

Indeed, switching from RCFile to SequenceFile brought the query duration down 35% (from 82 seconds to 53 seconds)! On top of that, I added Snappy/Gzip block compression, and things got better still: down to 30 seconds (SequenceFile + Snappy).
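For reference, here is roughly what I ran for the Snappy variant (just a sketch from my test setup: logs_seq is a name I made up here, and the codec line assumes the Snappy libraries are installed on the node):

-- Enable block-level Snappy compression for query output, then copy
-- the logs table into a SequenceFile-backed table.
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

CREATE TABLE logs_seq (url STRING, orig_country STRING, http_rt INT)
STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE logs_seq
SELECT url, orig_country, http_rt FROM logs;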
Yes, most requests have a WHERE clause with a time range, so I will give partitioning a try. For now, my tests span one day of log data; I will ingest more of it, partition, and see how it goes.
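If I understand the partitioning idea correctly, the layout would look something like the following (just a sketch: logs_part is a made-up name, and day-level granularity is a guess to be tuned toward the 64 MB folder size you suggest below):

-- One folder per (year, month, day); Hive prunes to the matching
-- directories when the WHERE clause fixes the partition columns.
CREATE TABLE logs_part (url STRING, orig_country STRING, http_rt INT)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS SEQUENCEFILE;

-- Reads only the single 2012/3/4 folder, not the whole table:
SELECT COUNT(*) FROM logs_part
WHERE year = 2012 AND month = 3 AND day = 4;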
However, it's not clear to me why I should minimize the number of mappers. Having 16 cores, wouldn't it make sense to use as many of them as possible to parallelize? So far, one day of logs is 256 MB. My understanding is that, with 64 MB HDFS blocks, I should get 4 mappers, right? If so, since I'm in pseudo-distributed mode for the moment, my number of mappers is 1, so I could try to configure my setup with additional mappers. Does this make sense?

Thank you for your help!

Sekine

2013/3/4 Dean Wampler <dean.wamp...@thinkbiganalytics.com>

> RCFile won't help much (and apparently not at all in this case ;) unless
> you have a lot of columns and you always query just a few of them.
> However, you should get better results with sequence files (a binary
> format), usually combined with a compression scheme like BZip that
> supports block-level (as opposed to file-level) compression. Why?
> Compression and sequence files reduce the amount of disk IO, and disk IO
> is usually the bottleneck.
>
> Do you almost always query with a WHERE clause on a time range? If so,
> consider partitioning your data by time ranges, e.g., year/month/day. The
> actual timestamp granularity should be chosen so that each folder (and
> yes, they'll be individual folders) holds data files of at least 64 MB, or
> whatever multiple of 64 MB you're using in your cluster. Per-day might be
> the finest granularity, or even per hour or minute if you really have a
> lot of data. Briefly, you want to minimize the number of mapper processes
> used to process the data, and this is the granularity per mapper. Why
> partition? Because when you do SELECT * FROM mytable WHERE year = 2012 AND
> month = 3 AND day = 4, Hive knows it only has to read the contents of that
> single directory, not all the directories...
>
> You might also consider clustering by URL. This feature (and the others)
> is described on the Hive wiki. It can also speed up sampling of large data
> sets and joins.
>
> I assume you're just using the virtual machine for experimenting. Lots of
> overhead there, too!
>
> Hope this helps.
> dean
>
> On Mon, Mar 4, 2013 at 4:33 PM, Sékine Coulibaly <scoulib...@gmail.com> wrote:
>
>> Hi there,
>>
>> I've set up a virtual machine hosting Hive.
>> My use case is Web traffic analytics, so most of the requests are:
>>
>> - how many requests today?
>> - how many requests today, grouped by country?
>> - most requested URLs?
>> - average HTTP server response time (5-minute slots)?
>>
>> In other words, let's consider:
>>
>> CREATE TABLE logs ( url STRING, orig_country STRING, http_rt INT )
>>
>> and
>>
>> SELECT COUNT(*) FROM logs;
>> SELECT COUNT(*), orig_country FROM logs GROUP BY orig_country;
>> SELECT COUNT(*), url FROM logs GROUP BY url;
>> SELECT AVG(http_rt) FROM logs ...
>>
>> Two questions here:
>>
>> - How do I generate 5-minute slots for my averages (in PostgreSQL, I
>> used generate_series() and a JOIN)? I wish I could avoid running multiple
>> queries, each with a 'WHERE date > ... AND date < ...'. Maybe a mapper,
>> mapping the date string to a slot number?
>>
>> - What is the best storage method for this table? Since its purpose is
>> analytical, I thought a columnar format was the way to go. So I tried
>> RCFILE, but for around 1 million rows (quite small, I know) the results
>> are as follows, quite the opposite of what I was expecting:
>>
>> Storage  / Query duration / Disk table size
>> TEXTFILE / 22 seconds     / 250 MB
>> RCFILE   / 31 seconds     / 320 MB
>>
>> I thought getting values in columns would speed up the aggregation.
>> Maybe the dataset is too small to tell, or I missed something? Will
>> adding Snappy compression help (I'm not sure whether RCFiles are
>> compressed or not)?
>>
>> Thank you!
>
> --
> *Dean Wampler, Ph.D.*
> thinkbiganalytics.com
> +1-312-339-1330