True. The spec does not mandate that the bucket files be present if they are empty (missing directories are 0-row tables).
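That contract - bucket id taken from the file name (cf. Utilities::getBucketIdFromFile), with a missing file meaning a 0-row bucket - can be sketched roughly as below. This is a hedged illustration, not Hive's actual code: file names of the form `000000_0` and the `hash % buckets` mapping are assumptions based on Hive's conventions as described in this thread.

```java
// Sketch of the Hive bucketing contract discussed in this thread:
// the bucket id comes from the file name, and a file that does not
// exist simply means that bucket holds 0 rows.
public class BucketContractSketch {
    // Parse the bucket id from a bucket file name such as "000002_0".
    static int bucketIdFromFileName(String name) {
        int underscore = name.indexOf('_');
        String prefix = underscore >= 0 ? name.substring(0, underscore) : name;
        return Integer.parseInt(prefix);
    }

    // Which bucket an integer key lands in: hash % buckets, where
    // Java's hash for integers is the identity function (see below).
    static int bucketForKey(int key, int numBuckets) {
        return (Integer.hashCode(key) & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        System.out.println(bucketIdFromFileName("000002_0")); // 2
        System.out.println(bucketForKey(42, 8));              // 42 % 8 = 2
    }
}
```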
Thanks,
Edward

On Tue, Apr 3, 2018 at 4:42 PM, Richard A. Bross <[email protected]> wrote:

> Gopal,
>
> The Presto devs say they are willing to make the changes to adhere to the
> Hive bucket spec. I quoted:
>
> "Presto could fix their fail-safe for bucketing implementation to actually
> trust the Hive bucketing spec & get you out of this mess - the bucketing
> contract for Hive is actual file name -> hash % buckets
> (Utilities::getBucketIdFromFile)."
>
> So they're asking "where is the Hive bucketing spec?". Is it just to read
> the code for that function? They were looking for something more explicit,
> I think.
>
> Thanks
>
> ----- Original Message -----
> From: "Gopal Vijayaraghavan" <[email protected]>
> To: [email protected]
> Sent: Tuesday, April 3, 2018 3:15:46 AM
> Subject: Re: Hive, Tez, clustering, buckets, and Presto
>
> * I'm interested in your statement that CLUSTERED BY does not CLUSTER
> BY. My understanding was that this was related to the number of buckets,
> but you are relating it to ORC stripes. It is odd that no examples that
> I've seen include the SORTED BY clause other than in relation to ORC
> indexes (that I understand). So the question is: regardless of whether
> efficient ORC stripes are created (wouldn't I also have to specify
> 'orc.create.index'='true' for this to have much of an effect?)
>
> ORC + bucketing is something I've spent a lot of time on - a lot of this
> has to do with secondary characteristics of the data (i.e. the same device
> has natural progressions for its metrics), which, when combined with a
> columnar format & ordering within files, produces better storage and
> runtimes together (which I guess is usually a trade-off).
> Without a SORTED BY, the organizing function for the data shuffle does not
> order in any specific way - the partition key for the shuffle is the
> modulus, while the order key is 0 bytes long, so it sorts by (modulus,),
> which for a quick-sort also loses the input order into the shuffle, and
> each bucket file is produced in random order within itself.
>
> An explicit sort with bucketing is what I recommend to most of the HDP
> customers who have performance problems with ORC.
>
> This turns the shuffle key into (modulus, key1, key2), producing a more
> predictable order during the shuffle.
>
> Then key1 can be RLE-encoded so that the ORC vectorized impl will pass it
> on as key1 x 1024 repetitions & do 1000x fewer comparisons when filtering
> rows on integers.
>
> https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/5
>
> was written as a warning to customers who use bucketing to try & solve
> performance problems, but have ended up with bucketing as their main
> problem.
>
> Most of what I have written above was discussed a few years back, and in
> general, bucketing on a high-cardinality column + sorting on a
> low-cardinality one together has given good results for my customers.
>
> > I hadn't thought of the even number issue, not having looked at the
> > function; I had assumed that it was a hash, not a modulus; shame on me.
> > Reading the docs I see that a hash is only used on string columns.
>
> Actually a hash is used in theory, but I entirely blame Java for it - the
> Java hash is an identity function for Integers.
>
> scala> 42.hashCode
> res1: Int = 42
>
> scala> 42L.hashCode
> res2: Int = 42
>
> > Finally, I'm not sure that I got a specific answer to my original
> > question, which is: can I force Tez to create all bucket files so Presto
> > queries can succeed? Anyway, I will be testing today, and the solution
> > will either be to forgo buckets completely or to simply rely on ORC
> > indexes.
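Gopal's point about Java's identity hash, and the "even number issue" Richard mentions, can be reproduced in plain Java. A hedged illustration, not Hive's actual code path: since the hash of an integer is the integer itself, bucketing reduces to `key % numBuckets`, and with an even bucket count a dataset of all-even keys (e.g. ids stepped by 2) leaves every odd-numbered bucket empty.

```java
// Java's hashCode for boxed integers is the identity function, so
// bucketing on an integer column is effectively key % numBuckets.
public class IntegerHashSkew {
    public static void main(String[] args) {
        System.out.println(Integer.valueOf(42).hashCode()); // 42 (identity)
        System.out.println(Long.valueOf(42L).hashCode());   // 42

        int numBuckets = 8; // even bucket count
        // All-even keys only ever land in even-numbered buckets,
        // so half the buckets stay empty.
        for (int key = 0; key < 100; key += 2) {
            int bucket = (Integer.hashCode(key) & Integer.MAX_VALUE) % numBuckets;
            assert bucket % 2 == 0;
        }
    }
}
```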
> There's no config to do that today, & Presto is already incompatible with
> Hive 3.0 tables (Update/Delete support).
>
> Presto could fix their fail-safe for bucketing implementation to actually
> trust the Hive bucketing spec & get you out of this mess - the bucketing
> contract for Hive is actual file name -> hash % buckets
> (Utilities::getBucketIdFromFile).
>
> The file count is a very flaky way to check whether the table is bucketed
> correctly - either you trust the user to have properly bucketed the table
> or you don't use it. Failing on valid tables looks pretty bad, compared to
> a soft fallback.
>
> I wrote a few UDFs which were used to validate suspect tables and fix them
> for customers who had bad historical data, loaded with
> "enforce.bucketing=false" or during the short hive-0.13 period with
> HIVE-12945.
>
> https://github.com/t3rmin4t0r/hive-bucket-helpers/blob/master/src/main/java/org/notmysock/hive/udf/BucketCheckUDF.java#L27
>
> LLAP has a bucket pruning implementation if Presto wants to copy from it
> (LLAP's S3 BI mode goes further and caches column indexes in memory or on
> SSD).
>
> Optimizer: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/FixedBucketPruningOptimizer.java#L236
> Runtime: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L281
>
> That actually does things according to the Hive bucketing contract, where
> uncovered buckets are assumed to have 0 rows when no file is present,
> instead of erroring out.
>
> If you do have the ability to redeploy Hive, the change you are looking
> for is a 1-liner to enable:
>
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L1248
>
> Cheers,
> Gopal
>
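Bucket pruning in the spirit of the contract Gopal describes - compute the bucket for the predicate's key, scan only the matching file, and treat a missing file as 0 rows rather than an error - can be sketched as follows. Names and structure here are illustrative assumptions, not Hive's or Presto's actual APIs.

```java
import java.util.List;
import java.util.Optional;

// Sketch of bucket pruning per the Hive bucketing contract: for a
// predicate "col = key", only one bucket file can contain matches,
// and its absence means the bucket is empty, not that the table is broken.
public class BucketPruneSketch {
    static int bucketForKey(int key, int numBuckets) {
        return (Integer.hashCode(key) & Integer.MAX_VALUE) % numBuckets;
    }

    // Pick the single bucket file worth scanning for "col = key";
    // empty result means 0 rows for that key.
    static Optional<String> fileForKey(List<String> files, int key, int numBuckets) {
        String prefix = String.format("%06d_", bucketForKey(key, numBuckets));
        return files.stream().filter(f -> f.startsWith(prefix)).findFirst();
    }

    public static void main(String[] args) {
        // Only 2 of 4 bucket files were materialized by the writer.
        List<String> files = List.of("000000_0", "000002_0");
        System.out.println(fileForKey(files, 6, 4)); // bucket 2 -> 000002_0
        System.out.println(fileForKey(files, 5, 4)); // bucket 1 -> no file, 0 rows
    }
}
```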
