True. The spec does not mandate that the bucket files be present if they are empty (missing directories are 0-row tables).
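That contract - bucket id taken from the file name (cf. Utilities::getBucketIdFromFile), with a missing file meaning a 0-row bucket - can be sketched roughly as below. This is a hedged illustration, not Hive's actual code: file names of the form `000000_0` and the `hash % buckets` mapping are assumptions based on Hive's conventions as described in this thread.

```java
// Sketch of the Hive bucketing contract discussed in this thread:
// the bucket id comes from the file name, and a file that does not
// exist simply means that bucket holds 0 rows.
public class BucketContractSketch {
    // Parse the bucket id from a bucket file name such as "000002_0".
    static int bucketIdFromFileName(String name) {
        int underscore = name.indexOf('_');
        String prefix = underscore >= 0 ? name.substring(0, underscore) : name;
        return Integer.parseInt(prefix);
    }

    // Which bucket an integer key lands in: hash % buckets, where
    // Java's hash for integers is the identity function (see below).
    static int bucketForKey(int key, int numBuckets) {
        return (Integer.hashCode(key) & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        System.out.println(bucketIdFromFileName("000002_0")); // 2
        System.out.println(bucketForKey(42, 8));              // 42 % 8 = 2
    }
}
```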
Thanks,
Edward

On Tue, Apr 3, 2018 at 4:42 PM, Richard A. Bross <[email protected]> wrote:

> Gopal,
>
> The Presto devs say they are willing to make the changes to adhere to the
> Hive bucket spec. I quoted:
>
> "Presto could fix their fail-safe for bucketing implementation to actually
> trust the Hive bucketing spec & get you out of this mess - the bucketing
> contract for Hive is actual file name -> hash % buckets
> (Utilities::getBucketIdFromFile)."
>
> So they're asking "where is the Hive bucketing spec?". Is it just to read
> the code for that function? They were looking for something more explicit,
> I think.
>
> Thanks
>
> ----- Original Message -----
> From: "Gopal Vijayaraghavan" <[email protected]>
> To: [email protected]
> Sent: Tuesday, April 3, 2018 3:15:46 AM
> Subject: Re: Hive, Tez, clustering, buckets, and Presto
>
> * I'm interested in your statement that CLUSTERED BY does not CLUSTER
> BY. My understanding was that this was related to the number of buckets,
> but you are relating it to ORC stripes. It is odd that no examples that
> I've seen include the SORTED BY clause other than in relation to ORC
> indexes (that I understand). So the question is: regardless of whether
> efficient ORC stripes are created (wouldn't I also have to specify
> 'orc.create.index'='true' for this to have much of an effect?)
>
> ORC + bucketing is something I've spent a lot of time on - a lot of this
> has to do with secondary characteristics of the data (i.e. the same device
> has natural progressions for its metrics), which, when combined with a
> columnar format & ordering within files, produces better storage and
> runtimes together (which I guess is usually a trade-off).
> Without a SORTED BY, the organizing function for the data shuffle does not
> order in any specific way - the partition key for the shuffle is the
> modulus, while the order key is 0 bytes long, so it sorts by (modulus,),
> which for a quick-sort also loses the input order into the shuffle, and
> each bucket file is produced in random order within itself.
>
> An explicit sort with bucketing is what I recommend to most of the HDP
> customers who have performance problems with ORC.
>
> This turns the shuffle key into (modulus, key1, key2), producing a more
> predictable order during the shuffle.
>
> Then key1 can be RLE-encoded so that the ORC vectorized impl will pass it
> on as key1 x 1024 repetitions & do 1000x fewer comparisons when filtering
> rows on integers.
>
> https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/5
>
> was written as a warning to customers who use bucketing to try & solve
> performance problems, but have ended up with bucketing as their main
> problem.
>
> Most of what I have written above was discussed a few years back, and in
> general, bucketing on a high-cardinality column + sorting on a
> low-cardinality one together has given good results for my customers.
>
> > I hadn't thought of the even number issue, not having looked at the
> > function; I had assumed that it was a hash, not a modulus; shame on me.
> > Reading the docs I see that a hash is only used on string columns.
>
> Actually a hash is used in theory, but I entirely blame Java for it - the
> Java hash is an identity function for Integers.
>
> scala> 42.hashCode
> res1: Int = 42
>
> scala> 42L.hashCode
> res2: Int = 42
>
> > Finally, I'm not sure that I got a specific answer to my original
> > question, which is: can I force Tez to create all bucket files so Presto
> > queries can succeed? Anyway, I will be testing today, and the solution
> > will either be to forgo buckets completely or to simply rely on ORC
> > indexes.
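Gopal's point about Java's identity hash, and the "even number issue" Richard mentions, can be reproduced in plain Java. A hedged illustration, not Hive's actual code path: since the hash of an integer is the integer itself, bucketing reduces to `key % numBuckets`, and with an even bucket count a dataset of all-even keys (e.g. ids stepped by 2) leaves every odd-numbered bucket empty.

```java
// Java's hashCode for boxed integers is the identity function, so
// bucketing on an integer column is effectively key % numBuckets.
public class IntegerHashSkew {
    public static void main(String[] args) {
        System.out.println(Integer.valueOf(42).hashCode()); // 42 (identity)
        System.out.println(Long.valueOf(42L).hashCode());   // 42

        int numBuckets = 8; // even bucket count
        // All-even keys only ever land in even-numbered buckets,
        // so half the buckets stay empty.
        for (int key = 0; key < 100; key += 2) {
            int bucket = (Integer.hashCode(key) & Integer.MAX_VALUE) % numBuckets;
            assert bucket % 2 == 0;
        }
    }
}
```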
> There's no config to do that today, & Presto is already incompatible with
> Hive 3.0 tables (Update/Delete support).
>
> Presto could fix their fail-safe for bucketing implementation to actually
> trust the Hive bucketing spec & get you out of this mess - the bucketing
> contract for Hive is actual file name -> hash % buckets
> (Utilities::getBucketIdFromFile).
>
> The file count is a very flaky way to check whether the table is bucketed
> correctly - either you trust the user to have properly bucketed the table
> or you don't use it. Failing on valid tables looks pretty bad, compared to
> a soft fallback.
>
> I wrote a few UDFs which were used to validate suspect tables and fix them
> for customers who had bad historical data, loaded with
> "enforce.bucketing=false" or during the short hive-0.13 period with
> HIVE-12945.
>
> https://github.com/t3rmin4t0r/hive-bucket-helpers/blob/master/src/main/java/org/notmysock/hive/udf/BucketCheckUDF.java#L27
>
> LLAP has a bucket pruning implementation if Presto wants to copy from it
> (LLAP's S3 BI mode goes further and caches column indexes in memory or on
> SSD).
>
> Optimizer: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/FixedBucketPruningOptimizer.java#L236
> Runtime: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/tez/HiveSplitGenerator.java#L281
>
> That actually does things according to the Hive bucketing contract, where
> uncovered buckets are assumed to have 0 rows when no file is present,
> instead of erroring out.
>
> If you do have the ability to redeploy Hive, the change you are looking
> for is a 1-liner to enable:
>
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L1248
>
> Cheers,
> Gopal
>
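Bucket pruning in the spirit of the contract Gopal describes - compute the bucket for the predicate's key, scan only the matching file, and treat a missing file as 0 rows rather than an error - can be sketched as follows. Names and structure here are illustrative assumptions, not Hive's or Presto's actual APIs.

```java
import java.util.List;
import java.util.Optional;

// Sketch of bucket pruning per the Hive bucketing contract: for a
// predicate "col = key", only one bucket file can contain matches,
// and its absence means the bucket is empty, not that the table is broken.
public class BucketPruneSketch {
    static int bucketForKey(int key, int numBuckets) {
        return (Integer.hashCode(key) & Integer.MAX_VALUE) % numBuckets;
    }

    // Pick the single bucket file worth scanning for "col = key";
    // empty result means 0 rows for that key.
    static Optional<String> fileForKey(List<String> files, int key, int numBuckets) {
        String prefix = String.format("%06d_", bucketForKey(key, numBuckets));
        return files.stream().filter(f -> f.startsWith(prefix)).findFirst();
    }

    public static void main(String[] args) {
        // Only 2 of 4 bucket files were materialized by the writer.
        List<String> files = List.of("000000_0", "000002_0");
        System.out.println(fileForKey(files, 6, 4)); // bucket 2 -> 000002_0
        System.out.println(fileForKey(files, 5, 4)); // bucket 1 -> no file, 0 rows
    }
}
```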
