Hi,

I would say the most efficient approach is option (3), where all the
subtables are partitioned by date, and clustered + *sorted* by id.
This way, efficient SMB (sort-merge-bucket) map joins can be performed
over the 10 tables of the same partition.

Unfortunately, I haven't found a way to achieve SMB map joins *over more
than one* partition :(

Example:

CREATE TABLE test_1 (id INT, c1 FLOAT, c2 FLOAT)
PARTITIONED BY (ds STRING)
CLUSTERED BY (id)
SORTED BY (id ASC)
INTO 4 BUCKETS
STORED AS ORC;

CREATE TABLE test_2 (id INT, c3 FLOAT, c4 FLOAT)
PARTITIONED BY (ds STRING)
CLUSTERED BY (id)
SORTED BY (id ASC)
INTO 4 BUCKETS
STORED AS ORC;

CREATE TABLE test_3 (id INT, c5 FLOAT, c6 FLOAT)
PARTITIONED BY (ds STRING)
CLUSTERED BY (id)
SORTED BY (id ASC)
INTO 4 BUCKETS
STORED AS ORC;
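
(Note: for the SMB join to kick in, the data has to actually be bucketed
and sorted on disk, not just declared so in the DDL. On Hive 1.x the
enforcement settings also have to be enabled when loading; a minimal
sketch, assuming a hypothetical staging table raw_test_1 with the same
columns:)

-- Hive 1.x only; on Hive 2.x+ bucketing/sorting are always enforced
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;

-- raw_test_1 is an assumed staging table holding the unbucketed daily data
INSERT OVERWRITE TABLE test_1 PARTITION (ds='2019-10-01')
SELECT id, c1, c2
FROM raw_test_1
WHERE ds = '2019-10-01';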

Over these tables, one can perform efficient single-stage SMB map joins, *if
you filter on a single partition*:

set hive.execution.engine=tez;
set hive.enforce.sortmergebucketmapjoin=false;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask.size=0;

EXPLAIN
SELECT *
FROM test_1 a
JOIN test_2 b
  ON a.id = b.id AND a.ds = b.ds
JOIN test_3 c
  ON b.id = c.id AND b.ds = c.ds
WHERE a.ds = '2019-10-01'
;

When you query over two partitions, however, it falls back to a shuffle :(

set hive.execution.engine=tez;
set hive.enforce.sortmergebucketmapjoin=false;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.auto.convert.sortmerge.join=true;
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask.size=0;

EXPLAIN
SELECT *
FROM test_1 a
JOIN test_2 b
  ON a.id = b.id AND a.ds = b.ds
JOIN test_3 c
  ON b.id = c.id AND b.ds = c.ds
WHERE a.ds IN ('2019-10-01', '2019-10-02')
;


My problem is very similar, so let's hope someone out there has the answer
:)

Cheers,

Pau.

Message from Saurabh Santhosh <saurabhsanth...@gmail.com> on Tue., 1
Oct 2019 at 8:48:

> Hi,
>
> I am facing the following problem while trying to store/use a huge
> partitioned table with 1000+ columns in Hive. I would like to know how to
> solve this problem either using Hive or any other store.
>
> Requirement:
>
> 1). There is a table with around 1000+ columns which is partitioned by
> date.
> 2). Each day consists of data for around 500 million entities. There will
> be an id column with the id of the entity and around 1000+ columns which
> represent the attributes of the given entity for that day.
> 3). We have to store data for around 2 years.
> 4). New columns may be added/the logic of an existing column may be
> changed on any day, and when this happens we have to populate data for the
> given column for the last 2 years.
>
>
> Our Solution 1:
>
> 1). We created a table with 1000+ columns, partitioned by date.
> 2). Every day we create a new partition and delete partitions older than 2
> years.
>
> Problems Faced in Solution 1:
>
> Whenever we had to add/modify certain columns, the backfill took a long
> time: it was taking months to backfill the data for 2 years (because of
> the large amount of I/O involved in reading and rewriting each partition).
>
>
> Our Solution 2:
>
> 1). We created 10 tables with around 100+ columns each, each of them
> partitioned by date.
> 2). Every day we create a new partition in each of the small tables and
> delete partitions older than 2 years.
> 3). We created a view which joins all the tables, with id, date as the
> join key.
>
>
> Problems Faced in Solution 2:
>
> Now the backfill time was considerably reduced from months to weeks, as we
> only need to refresh the small table containing the columns to be
> backfilled, reducing the I/O drastically.
> But this led to very slow queries on top of the view. In particular, when
> we query 6 months of data, the queries take more than 10 hrs due to the
> shuffling of data.
>
> Our Solution 3:
>
> 1). We also tried to bucket each small table based on the id column, but
> this did not give us the desired result, as the shuffling was still
> happening.
>
>
> Can anyone suggest the best approach for the above scenario?
>


-- 
----------------------------------
Pau Tallada Crespí
Dep. d'Astrofísica i Cosmologia
Port d'Informació Científica (PIC)
Tel: +34 93 170 2729
----------------------------------
