Hi,

I would say the most efficient approach is option (3), where all the subtables are partitioned by date and clustered + *sorted* by id. This way, efficient SMB map joins can be performed over the 10 tables within the same partition.
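For reference, the "wide" view from your Solution 2 would look something like the sketch below, written against the example tables I define further down (the view name and columns are just for illustration):

CREATE VIEW test_wide AS
SELECT a.id, a.ds,
       a.c1, a.c2,   -- columns owned by subtable test_1
       b.c3, b.c4,   -- columns owned by subtable test_2
       c.c5, c.c6    -- columns owned by subtable test_3
FROM test_1 a
JOIN test_2 b ON a.id = b.id AND a.ds = b.ds
JOIN test_3 c ON b.id = c.id AND b.ds = c.ds;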
Unfortunately, I haven't found a way to achieve SMB map joins *over more than one* partition :(

Example:

CREATE TABLE test_1 (id INT, c1 FLOAT, c2 FLOAT)
  PARTITIONED BY (ds STRING)
  CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS
  STORED AS ORC;

CREATE TABLE test_2 (id INT, c3 FLOAT, c4 FLOAT)
  PARTITIONED BY (ds STRING)
  CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS
  STORED AS ORC;

CREATE TABLE test_3 (id INT, c5 FLOAT, c6 FLOAT)
  PARTITIONED BY (ds STRING)
  CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS
  STORED AS ORC;

Over these tables, one can perform an efficient single-stage SMB map join, *if you filter on a single partition*:

set hive.execution.engine=tez;
set hive.enforce.sortmergebucketmapjoin=false;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask.size=0;

EXPLAIN
SELECT *
FROM test_1 a
JOIN test_2 b ON a.id = b.id AND a.ds = b.ds
JOIN test_3 c ON b.id = c.id AND b.ds = c.ds
WHERE a.ds = '2019-10-01';

But when you query over two partitions, it falls back to a shuffle :(

set hive.execution.engine=tez;
set hive.enforce.sortmergebucketmapjoin=false;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask.size=0;

EXPLAIN
SELECT *
FROM test_1 a
JOIN test_2 b ON a.id = b.id AND a.ds = b.ds
JOIN test_3 c ON b.id = c.id AND b.ds = c.ds
WHERE a.ds IN ('2019-10-01', '2019-10-02');
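One thing that might be worth trying, though I have *not* verified that the optimizer actually converts each branch, is to rewrite the multi-partition query as a UNION ALL of single-partition joins, so that every branch keeps the constant a.ds predicate that allowed the SMB map join above:

-- Unverified sketch: one SMB-eligible join per partition, unioned together.
EXPLAIN
SELECT *
FROM test_1 a
JOIN test_2 b ON a.id = b.id AND a.ds = b.ds
JOIN test_3 c ON b.id = c.id AND b.ds = c.ds
WHERE a.ds = '2019-10-01'
UNION ALL
SELECT *
FROM test_1 a
JOIN test_2 b ON a.id = b.id AND a.ds = b.ds
JOIN test_3 c ON b.id = c.id AND b.ds = c.ds
WHERE a.ds = '2019-10-02';

Even if each branch does get converted, this duplicates the whole join once per day, so I doubt it scales to ranges of many months.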
My problem is very similar, so let's hope someone out there has the answer :)

Cheers,

Pau.

On Tue, Oct 1, 2019 at 8:48, Saurabh Santhosh <saurabhsanth...@gmail.com> wrote:

> Hi,
>
> I am facing the following problem while trying to store/use a huge
> partitioned table with 1000+ columns in Hive. I would like to know how to
> solve this problem, either using Hive or any other store.
>
> Requirement:
>
> 1). There is a table with around 1000+ columns which is partitioned by
> date.
> 2). Each day consists of data about around 500 million entities. There is
> an id column with the id of the entity and around 1000+ columns which
> represent attributes of the given entity for each day.
> 3). We have to store data for around 2 years.
> 4). New columns may be added, or the logic of an existing column may
> change, on any day, and when this happens we have to populate data for the
> given column for the last 2 years.
>
> Our Solution 1:
>
> 1). We created a table with 1000+ columns, partitioned by date.
> 2). Every day we create a new partition and delete partitions older than
> 2 years.
>
> Problems faced in Solution 1:
>
> Whenever we had to add/modify certain columns, the backfill of data took
> a long time; it was taking months to backfill the data for 2 years,
> because of the heavy IO from reading and rewriting each partition.
>
> Our Solution 2:
>
> 1). We created 10 tables with around 100+ columns each, each of them
> partitioned by date.
> 2). Every day we create a new partition in each of the small tables and
> delete partitions older than 2 years.
> 3). We created a view which joins all the tables on id and date.
>
> Problems faced in Solution 2:
>
> The backfill time was considerably reduced, from months to weeks, as we
> only need to refresh the small table containing the columns to be
> backfilled, which reduces the IO drastically.
>
> But this led to very slow queries on top of the view. Especially when we
> query 6 months of data, the queries take more than 10 hrs due to the
> shuffling of data.
>
> Our Solution 3:
>
> 1). We also tried to bucket each small table on the id column, but this
> did not give us the desired result, as the shuffling was still happening.
>
> Can anyone suggest the best approach to go with in the above scenario?

--
----------------------------------
Pau Tallada Crespí
Dep. d'Astrofísica i Cosmologia
Port d'Informació Científica (PIC)
Tel: +34 93 170 2729
----------------------------------