Gopal,

Thanks for taking the time to try and help. A few things in relation to your response:
* Yes, the 'epoch' column is an hourly timestamp. Clustering by a column with high cardinality would make little sense.

* I'm interested in your statement that CLUSTERED BY does not CLUSTER BY. My understanding was that it related to the number of buckets, but you are relating it to ORC stripes. It is odd that none of the examples I've seen include the SORTED BY clause other than in relation to ORC indexes (which I do understand). So the question is: regardless of whether efficient ORC stripes are created, wouldn't I also have to specify 'orc.create.index'='true' for this to have much of an effect?

* Thanks for the hint about using primes for the bucket count. Given that r_crs_id has a wide range of values I think it would have worked OK, but any optimizations are appreciated, and I hadn't thought of the even-number issue, not having looked at the function; I had assumed it was a hash, not a modulus; shame on me. Reading the docs I see that a hash is only used on string columns.

Finally, I'm not sure that I got a specific answer to my original question, which is: can I force Tez to create all bucket files so that Presto queries can succeed? Anyway, I will be testing today, and the solution will either be to forgo buckets completely or to simply rely on ORC indexes.

Thanks.

Rick

----- Original Message -----
From: "Gopal Vijayaraghavan" <[email protected]>
To: [email protected]
Sent: Monday, April 2, 2018 2:16:46 AM
Subject: Re: Hive, Tez, clustering, buckets, and Presto

There's more here than Bucketing or Tez.

> PARTITIONED BY(daydate STRING, epoch BIGINT)
> CLUSTERED BY(r_crs_id) INTO 64 BUCKETS

I hope the epoch partition column is actually a day rollup and not 1 partition for every timestamp.

CLUSTERED BY does not CLUSTER BY, which it should (but it doesn't, yet). CLUSTERED BY needs a SORTED BY to produce good ORC stripes here.

> If I perform the inserts with the mr engine, which is now deprecated, when
> the partition is created there are always 64 bucket files, even if some are 0
> length. In this case I can query with Presto just fine.

The files should be at least 3 bytes long, containing the 3 letters 'O','R','C'.

If you have a bucketing impl that produces empty files because hash() % 64 over 20k values yields 0 rows for some buckets, the skew is unnatural.

> However, when the Tez engine is used, only buckets that contain data are
> created, and Presto fails since the table properties in the metastore (64
> buckets) are a mismatch with the actual bucket count.

That was done to prevent creating these 3-byte files, which actually cost real money to store (because storage is usually allocated at the 4k block level, this uses about 1000x more storage than the data needs).

You can end up with >64 or <64 or exactly 64 files for a validly bucketed table (in the >64 case, at least some files have a _Copy suffix, but the bucket prefix is consistent).

Presto just bails out when it finds something strange, because they do not trust the Hive bucketing impl - there is a really good reason why there are 2 bucket pruning configs in Tez for the same feature (hive.tez.bucket.pruning & hive.tez.bucket.pruning.compat).
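For completeness, both of those are session-level settings - flipping them from a Hive session looks roughly like this (an illustrative sketch only; the defaults differ across Hive versions, so check what yours ships with):

-- enable bucket pruning under Tez
SET hive.tez.bucket.pruning=true;
-- the companion config for the same feature, shown here at 'true'
SET hive.tez.bucket.pruning.compat=true;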
> we'd really like to use buckets, since the r_crs_id in production can
> contain 20,000 values.

With 20k values and 64 buckets, if you have zero-sized buckets you might want to run your hash values through this:

https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/6

> "hive.enforce.bucketing" appears to have been deprecated as of 2.0 and is
> always set to "true", but I can't seem to find a description of its intent.
> Anyway, if it's supposed to force the creation of buckets it's not working
> with the Tez engine.

Previously you could disable it ("hive.enforce.bucketing=false") and inserts would not use the fixed number of buckets & could play other tricks with the data layout. That's not the config you're looking for.

There is something which is a very sharp no-handle knife in the Hive toolkit called "hive.exec.infer.bucket.sort.num.buckets.power.two", which is not worth explaining right now (suffice to say it is possible to use, but not very easily).

> configuration so that r_crs_id was at least row optimized and sorted within
> the ORC files.

SORTED BY in the table DDL should do the trick - I like to use a multi-dim sort in some of these scenarios.

CLUSTERED BY(r_crs_id) SORTED BY(r_crs_id, id) INTO 67 BUCKETS

If you want to know why I like primes, look at what % 64 does to even numbers:

len(set([(x*2) % 64 for x in xrange(1000)]))

fills exactly 32 buckets out of 64 - so 32 buckets have 2x data and 32 buckets have 0x data, while

len(set([(x*2) % 61 for x in xrange(1000)]))

fills all 61 buckets - producing better bucketing & no 0-sized files.

FYI, any prime number other than 31 works nicer than a 2^n - https://issues.apache.org/jira/browse/HIVE-7074

Cheers,
Gopal
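PS: to pull the pieces together, an untested sketch of what the DDL could look like - the table name, the payload column, and all the column types are placeholders; only daydate, epoch, r_crs_id and id come from the thread, and the bucket count is the prime discussed above:

CREATE TABLE rcrs_events (
  id        BIGINT,
  r_crs_id  BIGINT,
  payload   STRING
)
PARTITIONED BY (daydate STRING, epoch BIGINT)
CLUSTERED BY (r_crs_id) SORTED BY (r_crs_id, id) INTO 67 BUCKETS
STORED AS ORC
-- explicit, per your earlier mail; this property controls ORC row-level indexes
TBLPROPERTIES ('orc.create.index'='true');

Any prime bucket count works here - 61, 67, 71 - the only things worth avoiding, per the JIRA above, are a power of two and 31.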
