Hi Dean,

Indeed, switching from RCFile to SequenceFile brought the query duration down 35% (from 82 seconds to 53 seconds)! On top of that, I added Snappy/Gzip block compression, and things got better still: down to 30 seconds (SequenceFile + Snappy).
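For reference, here is roughly what I ran for the Snappy variant (just a sketch from my test setup: logs_seq is a name I made up here, and the codec line assumes the Snappy libraries are installed on the node):

-- Enable block-level Snappy compression for query output, then copy
-- the logs table into a SequenceFile-backed table.
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;

CREATE TABLE logs_seq (url STRING, orig_country STRING, http_rt INT)
STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE logs_seq
SELECT url, orig_country, http_rt FROM logs;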
Yes, most requests have a WHERE clause with a time range, so I will give partitioning a try. For now, my tests span one day of log data; I will ingest more of it, partition, and see how it goes.
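If I understand the partitioning idea correctly, the layout would look something like the following (just a sketch: logs_part is a made-up name, and day-level granularity is a guess to be tuned toward the 64 MB folder size you suggest below):

-- One folder per (year, month, day); Hive prunes to the matching
-- directories when the WHERE clause fixes the partition columns.
CREATE TABLE logs_part (url STRING, orig_country STRING, http_rt INT)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS SEQUENCEFILE;

-- Reads only the single 2012/3/4 folder, not the whole table:
SELECT COUNT(*) FROM logs_part
WHERE year = 2012 AND month = 3 AND day = 4;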
However, it's not clear to me why I should minimize the number of mappers. Having 16 cores, wouldn't it make sense to use as many of them as possible to parallelize? So far, one day of logs is 256 MB. My understanding is that, with 64 MB HDFS blocks, I should get 4 mappers, right? If so, since I'm in pseudo-distributed mode for the moment, my number of mappers is 1, so I could try to configure my setup with additional mappers. Does this make sense?

Thank you for your help!

Sekine

2013/3/4 Dean Wampler <dean.wamp...@thinkbiganalytics.com>

> RCFile won't help much (and apparently not at all in this case ;) unless
> you have a lot of columns and you always query just a few of them.
> However, you should get better results with sequence files (a binary
> format), usually combined with a compression scheme like BZip that
> supports block-level (as opposed to file-level) compression. Why?
> Compression and sequence files reduce the amount of disk IO, and disk IO
> is usually the bottleneck.
>
> Do you almost always query with a WHERE clause on a time range? If so,
> consider partitioning your data by time ranges, e.g., year/month/day. The
> actual timestamp granularity should be chosen so that each folder (and
> yes, they'll be individual folders) holds data files of at least 64 MB, or
> whatever multiple of 64 MB you're using in your cluster. Per-day might be
> the finest granularity, or even per hour or minute if you really have a
> lot of data. Briefly, you want to minimize the number of mapper processes
> used to process the data, and this is the granularity per mapper. Why
> partition? Because when you do SELECT * FROM mytable WHERE year = 2012 AND
> month = 3 AND day = 4, Hive knows it only has to read the contents of that
> single directory, not all the directories...
>
> You might also consider clustering by URL. This feature (and the others)
> is described on the Hive wiki. It can also speed up sampling of large data
> sets and joins.
>
> I assume you're just using the virtual machine for experimenting. Lots of
> overhead there, too!
>
> Hope this helps.
> dean
>
> On Mon, Mar 4, 2013 at 4:33 PM, Sékine Coulibaly <scoulib...@gmail.com> wrote:
>
>> Hi there,
>>
>> I've set up a virtual machine hosting Hive.
>> My use case is Web traffic analytics, so most of the requests are:
>>
>> - how many requests today?
>> - how many requests today, grouped by country?
>> - most requested URLs?
>> - average HTTP server response time (5-minute slots)?
>>
>> In other words, let's consider:
>>
>> CREATE TABLE logs ( url STRING, orig_country STRING, http_rt INT )
>>
>> and
>>
>> SELECT COUNT(*) FROM logs;
>> SELECT COUNT(*), orig_country FROM logs GROUP BY orig_country;
>> SELECT COUNT(*), url FROM logs GROUP BY url;
>> SELECT AVG(http_rt) FROM logs ...
>>
>> Two questions here:
>>
>> - How do I generate 5-minute slots for my averages (in PostgreSQL, I
>> used generate_series() and a JOIN)? I wish I could avoid running multiple
>> queries, each with a 'WHERE date > ... AND date < ...'. Maybe a mapper,
>> mapping the date string to a slot number?
>>
>> - What is the best storage method for this table? Since its purpose is
>> analytical, I thought a columnar format was the way to go. So I tried
>> RCFILE, but for around 1 million rows (quite small, I know) the results
>> are as follows, quite the opposite of what I was expecting:
>>
>> Storage  / Query duration / Disk table size
>> TEXTFILE / 22 seconds     / 250 MB
>> RCFILE   / 31 seconds     / 320 MB
>>
>> I thought getting values in columns would speed up the aggregation.
>> Maybe the dataset is too small to tell, or I missed something? Will
>> adding Snappy compression help (I'm not sure whether RCFiles are
>> compressed or not)?
>>
>> Thank you!
>
> --
> *Dean Wampler, Ph.D.*
> thinkbiganalytics.com
> +1-312-339-1330