Hi Gopal,

Thank you very much for your response and your very helpful insight into my
problem.

As you correctly guessed, I'm quite new to Hive.
I tried the query you suggested in your gist for creating the table without
LLAP, and it worked fine: it didn't hang like the one I tried before. I'll
test the performance of the read queries and check back if needed.
In the slides you linked, you mention that the average partition should be
at least 1 GB. Right now I have 181 partitions with an uneven data
distribution (partition sizes range from 1.7 kB to 64 MB, in ORC format).
Could that be a problem? Should I be aware of any specific issue with this?
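For reference, this is roughly how I measured the partition sizes (passaggi
and the partition value below are placeholders for my actual ones):

    -- list the partitions of the new table
    show partitions passaggi;
    -- "totalSize" under Partition Parameters gives the on-disk size
    describe formatted passaggi partition (DATAPASSAGGIO='2016-12-01');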

In any case, the dataset I'm using would be a single partition of a bigger
dataset (there are passaggi1617, passaggi1516 and so on, each with the same
raw size of about 20GB), so I was thinking that the eventual final table
would have two levels of partitioning (SEASONPASSAGE, DATEPASSAGE), as
sketched below. Is there any best practice when it comes to partition
hierarchies, other than what you already mentioned, i.e. making sure the
partitioning is not too fine-grained?
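Something like the following is what I had in mind (just a sketch, with
most columns omitted; passaggi_all is a made-up name):

    create table passaggi_all (
        ID_TICKETTYPE int,
        ID_PERSONTYPE int
        -- ... remaining columns ...
    )
    partitioned by (SEASONPASSAGE string, DATEPASSAGE string)
    stored as orc;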

Thank you very much, Gopal,
I really appreciate your help.

Kind regards,
Federico



On Fri, 13 Jul 2018 at 20:18, Gopal Vijayaraghavan <gop...@apache.org>
wrote:

>
> > I'm using Hive 1.2.1 with LLAP on HDP 2.6.5. Tez AM is 3GB, there are 3
> > daemons for a total of 34816 MB.
>
> I'm assuming you're using Hive2 here (with LLAP). LLAP kinda sucks for ETL
> workloads, but that's a different problem.
>
> > PARTITIONED BY (DATAPASSAGGIO string, ORAPASSAGGIO string)
> > CLUSTERED BY (ID_TICKETTYPE, ID_PERSONTYPE, NPOOLNR, NKASSANR) INTO 8
> > BUCKETS
> > STORED AS ORC
> ...
> > Total number of partitions is 137k.
>
> 20 GB divided by 137k partitions is roughly 150 kB per partition, which
> makes for very poorly written ORC files; I'd guess each file has far too
> few rows and will be much smaller than one HDFS block. Partitioning this
> fine is actually a performance issue at compile time.
>
> You can make this insert work by changing the insert shuffle mechanism
> (run an explain with and without it to see the difference):
>
> set hive.optimize.sort.dynamic.partition=true;
> -- https://issues.apache.org/jira/browse/HIVE-6455
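>
> For example (passaggi and passaggi_staging are placeholder names for your
> target and source tables):
>
>   set hive.optimize.sort.dynamic.partition=true;
>   explain insert overwrite table passaggi
>     partition (DATAPASSAGGIO, ORAPASSAGGIO)
>   select * from passaggi_staging;
>
> With the flag on, the plan gains a reduce stage that sorts rows on the
> dynamic partition columns before the file writer, so each writer only
> keeps one ORC file open at a time.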
>
> But I suspect you will be very disappointed by the performance of the read
> queries after this insert.
>
> >      ,NPOOLNR decimal(4,0)
> >     ,NZUTRNR decimal(3,0)
> >     ,NKASSANR decimal(3,0)
> >     ,ID_TICKETTYPE decimal(5,0)
> >     ,ID_PERSONTYPE decimal(6,0)
> >     ,ID_TICKETPERSONTYPEDEF decimal(6,0)
>
> That's also going to hurt - your schema raises a lot of red flags that I
> often see when people first migrate to Hive.
>
> https://www.slideshare.net/t3rmin4t0r/data-organization-hive-meetup/
>
> In general, you need to fix the partition count, the bucketing structure
> (note that "clustered by" alone does not actually cluster the rows; you
> need an additional "sorted by"), and the zero-scale decimals, roughly as
> sketched below.
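>
> Roughly along these lines (just a sketch, reusing your column names):
>
>   CLUSTERED BY (ID_TICKETTYPE, ID_PERSONTYPE, NPOOLNR, NKASSANR)
>   SORTED BY (ID_TICKETTYPE, ID_PERSONTYPE) INTO 8 BUCKETS
>
> with the decimal(4,0)-style columns switched to plain int types.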
>
> Can you try running with the following (and see what your query read-perf
> looks like)?
>
> https://gist.github.com/t3rmin4t0r/087b61f79514673c307bb9a88327a4db
>
> Cheers,
> Gopal

-- 
Federico D'Ambrosio
