Gopal,

Thanks for taking the time to try and help. A few things in relation to your response:
* Yes, the 'epoch' column is an hourly timestamp. Clustering by a column with high cardinality would make little sense.

* I'm interested in your statement that CLUSTERED BY does not CLUSTER BY. My understanding was that it related to the number of buckets, but you are relating it to ORC stripes. It is odd that none of the examples I've seen include the SORTED BY clause other than in relation to ORC indexes (which I do understand). So the question is: regardless of whether efficient ORC stripes are created, wouldn't I also have to specify 'orc.create.index'='true' for this to have much of an effect?

* Thanks for the hint about using primes for the bucket count. Given that r_crs_id has a wide range of values I think it would have worked OK, but any optimizations are appreciated, and I hadn't thought of the even-number issue, not having looked at the function; I had assumed it was a hash, not a modulus; shame on me. Reading the docs I see that a hash is only used on string columns.

Finally, I'm not sure that I got a specific answer to my original question, which is: can I force Tez to create all bucket files so that Presto queries can succeed? Anyway, I will be testing today, and the solution will either be to forgo buckets completely or to simply rely on ORC indexes.

Thanks.

Rick

----- Original Message -----
From: "Gopal Vijayaraghavan" <[email protected]>
To: [email protected]
Sent: Monday, April 2, 2018 2:16:46 AM
Subject: Re: Hive, Tez, clustering, buckets, and Presto

There's more here than Bucketing or Tez.

> PARTITIONED BY(daydate STRING, epoch BIGINT)
> CLUSTERED BY(r_crs_id) INTO 64 BUCKETS

I hope the epoch partition column is actually a day rollup and not 1 partition for every timestamp.

CLUSTERED BY does not CLUSTER BY, which it should (but it doesn't, yet). CLUSTERED BY needs a SORTED BY to produce good ORC stripes here.

> If I perform the inserts with the mr engine, which is now deprecated, when
> the partition is created there are always 64 bucket files, even if some are 0
> length. In this case I can query with Presto just fine.

The files should be at least 3 bytes long, containing the 3 letters 'O','R','C'.

If you have a bucketing impl that produces empty files because hash() % 64 over 20k values yields 0 rows for some buckets, the skew is unnatural.

> However, when the Tez engine is used, only buckets that contain data are
> created, and Presto fails since the table properties in the metastore (64
> buckets) are a mismatch with the actual bucket count.

That was done to prevent creating these 3-byte files, which actually cost real money to store (because storage is usually allocated at the 4k block level, this uses about 1000x more storage than the data needs).

You can end up with >64 or <64 or exactly 64 files for a validly bucketed table (in the >64 case, at least some files have a _Copy suffix, but the bucket prefix is consistent).

Presto just bails out when it finds something strange, because they do not trust the Hive bucketing impl - there is a really good reason why there are 2 bucket pruning configs in Tez for the same feature (hive.tez.bucket.pruning & hive.tez.bucket.pruning.compat).
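For completeness, both of those are session-level settings - flipping them from a Hive session looks roughly like this (an illustrative sketch only; the defaults differ across Hive versions, so check what yours ships with):

-- enable bucket pruning under Tez
SET hive.tez.bucket.pruning=true;
-- the companion config for the same feature, shown here at 'true'
SET hive.tez.bucket.pruning.compat=true;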
> we'd really like to use buckets, since the r_crs_id in production can
> contain 20,000 values.

With 20k values and 64 buckets, if you have zero-sized buckets you might want to run your hash values through this:

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/6

> "hive.enforce.bucketing" appears to have been deprecated as of 2.0 and is
> always set to "true", but I can't seem to find a description of its intent.
> Anyway, if it's supposed to force the creation of buckets it's not working
> with the Tez engine.

Previously you could disable it ("hive.enforce.bucketing=false") and inserts would not use the fixed number of buckets & could play other tricks with the data layout. That's not the config you're looking for.

There is something which is a very sharp no-handle knife in the Hive toolkit called "hive.exec.infer.bucket.sort.num.buckets.power.two", which is not worth explaining right now (suffice to say it is possible to use, but not very easily).

> configuration so that r_crs_id was at least row optimized and sorted within
> the ORC files.

SORTED BY in the table DDL should do the trick - I like to use a multi-dim sort in some of these scenarios.

CLUSTERED BY(r_crs_id) SORTED BY(r_crs_id, id) INTO 67 BUCKETS

If you want to know why I like primes, look at what % 64 does to even numbers:

len(set([(x*2) % 64 for x in xrange(1000)]))

fills exactly 32 buckets out of 64 - so 32 buckets have 2x data and 32 buckets have 0x data, while

len(set([(x*2) % 61 for x in xrange(1000)]))

fills all 61 buckets - producing better bucketing & no 0-sized files.

FYI, any prime number other than 31 works nicer than a 2^n - https://issues.apache.org/jira/browse/HIVE-7074

Cheers,
Gopal
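PS: to pull the pieces together, an untested sketch of what the DDL could look like - the table name, the payload column, and all the column types are placeholders; only daydate, epoch, r_crs_id and id come from the thread, and the bucket count is the prime discussed above:

CREATE TABLE rcrs_events (
  id        BIGINT,
  r_crs_id  BIGINT,
  payload   STRING
)
PARTITIONED BY (daydate STRING, epoch BIGINT)
CLUSTERED BY (r_crs_id) SORTED BY (r_crs_id, id) INTO 67 BUCKETS
STORED AS ORC
-- explicit, per your earlier mail; this property controls ORC row-level indexes
TBLPROPERTIES ('orc.create.index'='true');

Any prime bucket count works here - 61, 67, 71 - the only things worth avoiding, per the JIRA above, are a power of two and 31.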
