Hi,

I am facing the following problem while trying to store and query a huge
partitioned table with 1000+ columns in Hive. I would like to know how to
solve this problem, either in Hive or in any other store.

Requirement:

1). There is a table with 1000+ columns which is partitioned by date
(sketched below).
2). Each day consists of data for around 500 million entities. There is an
id column with the id of the entity and 1000+ columns which represent the
attributes of that entity for the given day.
3). We have to store data for around 2 years.
4). New columns may be added, or the logic of an existing column may be
changed, on any day, and when this happens we have to backfill data for the
given column for the last 2 years.
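
For concreteness, the table we have in mind looks roughly like this (names,
types, and the storage format are made up for illustration; attr_1 .. attr_3
stand in for the 1000+ attribute columns):

  CREATE TABLE entity_daily (
    id     BIGINT,
    attr_1 STRING,
    attr_2 STRING,
    attr_3 STRING
    -- ... plus the remaining ~1000 attribute columns ...
  )
  PARTITIONED BY (dt STRING)   -- one partition per day
  STORED AS ORC;               -- format shown only as an example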


Our Solution 1:

1). We created a single table with all 1000+ columns, partitioned by date.
2). Every day we create a new partition and drop the partition that is
older than 2 years.
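
The daily rotation is plain partition maintenance, roughly like this (the
staging table and the dates are hypothetical):

  -- load today's data as a new partition
  INSERT OVERWRITE TABLE entity_daily PARTITION (dt = '2016-01-15')
  SELECT id, attr_1, attr_2, attr_3
  FROM staging_entity_daily;

  -- drop the partition that just fell out of the 2-year window
  ALTER TABLE entity_daily DROP IF EXISTS PARTITION (dt = '2014-01-15');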

Problems Faced in Solution 1:

Whenever we had to add or modify columns, the backfill took a very long
time: backfilling 2 years of data was taking months, because there is a lot
of IO from reading and rewriting every partition.
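
The root cause is that a partition can only be overwritten as a whole, so
backfilling even a single column still reads and rewrites every column of
every daily partition. Using the sketch above (new_attr_2_logic is a
hypothetical UDF standing in for the changed logic):

  -- rerun per day, for ~730 days
  INSERT OVERWRITE TABLE entity_daily PARTITION (dt = '2015-06-01')
  SELECT
    id,
    attr_1,
    new_attr_2_logic(attr_2) AS attr_2,  -- the only column that changed
    attr_3
    -- ... every one of the remaining ~1000 columns must still be
    -- listed, read, and written back unchanged ...
  FROM entity_daily
  WHERE dt = '2015-06-01';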


Our Solution 2:

1). We created 10 tables with around 100 columns each, each of them
partitioned by date.
2). Every day we create a new partition in each of the small tables and
drop the partition that is older than 2 years.
3). We created a view which is a join between all the tables, with (id,
date) as the join key.
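
The view is a straight multi-way equi-join, along the lines of the sketch
below (table names entity_daily_1 .. entity_daily_10 are illustrative; only
the first two joins are written out):

  CREATE VIEW entity_daily_v AS
  SELECT
    t1.id, t1.dt,
    t1.attr_1,    -- columns from the first slice
    t2.attr_101   -- columns from the second slice
    -- ... and likewise for t3 through t10 ...
  FROM entity_daily_1 t1
  JOIN entity_daily_2 t2
    ON t1.id = t2.id AND t1.dt = t2.dt
  -- JOIN entity_daily_3 t3 ON t1.id = t3.id AND t1.dt = t3.dt
  -- ... same pattern up to entity_daily_10 ...
  ;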


Problems Faced in Solution 2:

Now the backfill time was considerably reduced, from months to weeks, since
we only need to refresh the small table which contains the columns being
backfilled, reducing the IO drastically.
But this led to very slow queries on top of the view. In particular, when
we query 6 months of data, the queries take more than 10 hrs due to the
shuffling of data in the joins.
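
For reference, the kind of query that takes 10+ hrs is simply a date-range
scan over the view, e.g.:

  -- 6 months of data: this expands into the full 10-way join, and
  -- each join shuffles ~500M rows/day x ~180 days on (id, dt)
  SELECT id, attr_1, attr_101
  FROM entity_daily_v
  WHERE dt BETWEEN '2015-07-01' AND '2015-12-31';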

Our Solution 3:

1). We also tried bucketing each small table on the id column, but this did
not give us the desired result, as the shuffling was still happening (our
bucketing setup is sketched below).
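
What we tried looks roughly like this; the bucket count and settings are
illustrative, and it is possible we are missing something here, which is
partly why we are asking (setting names are from the Hive docs; the
hive.enforce.* ones apply to pre-2.0 versions):

  CREATE TABLE entity_daily_1 (
    id     BIGINT,
    attr_1 STRING
    -- ... the remaining ~100 columns of this slice ...
  )
  PARTITIONED BY (dt STRING)
  CLUSTERED BY (id) SORTED BY (id) INTO 256 BUCKETS
  STORED AS ORC;

  -- settings relevant to bucket map / sort-merge-bucket joins
  SET hive.enforce.bucketing = true;   -- pre-Hive-2.0, when loading
  SET hive.enforce.sorting = true;
  SET hive.optimize.bucketmapjoin = true;
  SET hive.optimize.bucketmapjoin.sortedmerge = true;
  SET hive.auto.convert.sortmerge.join = true;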


Can anyone suggest the best approach for the above scenario?
