Hi, I am facing the following problem while trying to store/use a huge partitioned table with 1000+ columns in Hive. I would like to know how to solve this problem either using hive or any other store.
Requirement: 1).There is a table with around 1000+ columns which is partitioned by date. 2).Every day consist of data about around 500 million entities. There will be an id column with the id of the entity and around 1000+ columns which represent attributes of given entity for each day. 3).We have to store data for around 2 years 4). New columns may be added/logic of existing column may be changed any day and when this happens we have to populate data for the given column for last 2 years Our Solution 1: 1). We created a table with 1000+ columns and partitioned by date. 2). Every day we create a new partition and delete partition older than 2 years Problems Faced in Solution 1: Whenever we had to add/modify certain columns, the backfill of data took a long time and it was taking months to backfill the data for 2 years (this was because there is lot of IO due to the read/write of each partition) Our Solution 2: 1). We created 10 tables with around 100+ columns each and each of them was partitioned by date. 2). Every day we create a new partition in each of the small tables and delete partition older than 2 years 3). Created a view which was a join between all the tables with id, date as join key Problems Faced in Solution 2: Now the backfill time was considerably reduced from months to weeks as we need to only refresh the small table which contained the columns to be backfilled thus reducing the IO drastically. But this lead to very slow queries on top of the view. Especially when we query for 6 months data, the queries were taking more than 10 hrs due to the shuffling of data Our Solution 3: 1). We also tried to bucket each small table based on the id column, but this did not give us the desired result as the shuffling was still happening Can anyone suggest what is the best approach to go with in the above scenario?