RCFile won't help much (and apparently not at all in this case ;) unless you have a lot of columns and you always query just a few of them. However, you should get better results with Sequence Files (a binary format), usually combined with a compression scheme like BZip2 that supports block-level (as opposed to file-level) compression, so the files stay splittable across mappers. Why? Compressed files, and sequence files in general, reduce the amount of disk IO and hence improve IO performance, which is usually the bottleneck.
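As a rough sketch of what that switch could look like (the configuration property names assume a Hadoop 1.x-era setup, and logs_seq is just an illustrative name for a copy of the table):

-- turn on block-level BZip2 compression for query output
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;

-- copy the existing text-format table into a sequence-file table
CREATE TABLE logs_seq (url STRING, orig_country STRING, http_rt INT)
STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE logs_seq
SELECT url, orig_country, http_rt FROM logs;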
Do you almost always query with a WHERE clause on a time range? If so, consider partitioning your data by time ranges, e.g., year/month/day. The actual timestamp granularity should be chosen so that each folder (and yes, they'll be individual folders) holds data files of at least 64MB, or whatever multiple of 64MB you're using as the block size in your cluster. It could be that per-day is the finest granularity, or even per hour or minute if you really have a lot of data. Briefly, you want to minimize the number of mapper processes used to process the data, and this is the granularity per mapper.

Why partition? Because when you do SELECT * FROM mytable WHERE year = 2012 AND month = 3 AND day = 4, Hive knows it only has to read the contents of that single directory, not all the directories...

You might also consider clustering by URL. This feature (and the others) is described on the Hive wiki. It can also speed up sampling of large data sets and joins. (A DDL sketch for both ideas appears at the end of this message.)

I assume you're just using the virtual machine for experimenting. Lots of overhead there, too!

Hope this helps.

dean

On Mon, Mar 4, 2013 at 4:33 PM, Sékine Coulibaly <scoulib...@gmail.com> wrote:

> Hi there,
>
> I've set up a virtual machine hosting Hive.
> My use case is Web traffic analytics, so most of the requests are:
>
> - how many requests today?
> - how many requests today, grouped by country?
> - most requested URLs?
> - average HTTP server response time (5-minute slots)?
>
> In other words, let's consider:
>
> CREATE TABLE logs ( url STRING, orig_country STRING, http_rt INT )
>
> and
>
> SELECT COUNT(*) FROM logs;
> SELECT COUNT(*), orig_country FROM logs GROUP BY orig_country;
> SELECT COUNT(*), url FROM logs GROUP BY url;
> SELECT AVG(http_rt) FROM logs ...
>
> Two questions here:
>
> - How do I generate 5-minute slots to compute my averages (in PostgreSQL, I
> used generate_series() and a JOIN)? I wish I could avoid doing multiple
> requests, each with a 'WHERE date > ... AND date < ...'. Maybe a mapper,
> mapping the date string to a slot number?
>
> - What is the best storage method for this table? Since its purpose is
> analytical, I thought a columnar format was the way to go. So I tried RCFILE,
> but the results for around 1 million rows (quite small, I know) are as
> follows, and they are quite the opposite of what I was expecting:
>
> Storage / query duration / disk table size
> TEXTFILE / 22 seconds / 250MB
> RCFILE / 31 seconds / 320MB
>
> I thought getting values in columns would speed up the aggregate process.
> Maybe the dataset is too small to tell, or I missed something? Will adding
> Snappy compression help (not sure whether RCFiles are compressed or not)?
>
> Thank you!

--
Dean Wampler, Ph.D.
thinkbiganalytics.com
+1-312-339-1330
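The DDL sketch referenced above: one rough way the partitioned, clustered table could look, reusing the columns from the original CREATE TABLE. The logs_part name, the INT partition columns, the bucket count, and the SEQUENCEFILE storage are illustrative assumptions, not something specified in the thread.

CREATE TABLE logs_part (url STRING, orig_country STRING, http_rt INT)
PARTITIONED BY (year INT, month INT, day INT)
CLUSTERED BY (url) INTO 32 BUCKETS
STORED AS SEQUENCEFILE;

-- needed on Hive of that era so INSERTs actually honor the bucketing
SET hive.enforce.bucketing=true;

-- Hive prunes to the single year=2012/month=3/day=4 directory
SELECT COUNT(*) FROM logs_part
WHERE year = 2012 AND month = 3 AND day = 4;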