Re: One Schema Per Partition? (Multiple schemas per table?)

Ashutosh Chauhan Thu, 06 Oct 2011 20:05:54 -0700

How is it broken? What was the result you were expecting?
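
For what it's worth, the output quoted below looks consistent with the delimited fields of every file being matched purely by position against the table's current column list. Here is a minimal sketch of that reading in plain Python. It is not Hive code: `read_row` and `TABLE_COLUMNS` are hypothetical names made up for illustration, and the positional rule is an assumption inferred from the quoted results, not taken from the Hive source.

```python
# Illustrative sketch in plain Python, not Hive code.
# Assumption: every tab-delimited line is split and its fields are matched
# by position against the table's current column list; columns with no
# corresponding field come back as None (shown as NULL in the Hive output).

TABLE_COLUMNS = ["id", "name", "gender", "age", "crdt"]  # after "replace columns"


def read_row(line, columns=TABLE_COLUMNS):
    """Map one delimited line onto the column list purely by position."""
    fields = line.rstrip("\n").split("\t")
    fields = fields[:len(columns)]  # ignore any surplus trailing fields
    padded = fields + [None] * (len(columns) - len(fields))
    return dict(zip(columns, padded))


# Sample lines copied from the two data files quoted below.
old_rows = ["1\t2010-07-01\tJeff\t32", "2\t2010-07-01\tLisa\t33"]  # dtp=20110101
new_rows = ["100\tGregory\tMale\t45\t2010-08-01"]                  # dtp=20110102

# Equivalent of "select name, gender" under the positional assumption.
for line in new_rows + old_rows:
    row = read_row(line)
    print(row["name"], row["gender"], sep="\t")
```

Under that assumption, the rows from the older partition return their crdt value in the name column and their name value in the gender column, which is what the quoted "select name,gender" output shows.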

Ashutosh


On Thu, Oct 6, 2011 at 11:13, Time Less <timelessn...@gmail.com> wrote:

> I have finally gotten around to testing this functionality, and it
> doesn't work. The ALTER table change columns command just changes the
> metadata for the table, not for the partition. What follows is exactly
> what I did to test this, and the (broken) result:
>
> hive> create table multischema_test (
>     >   id      int,
>     >   crdt    string,
>     >   name    string,
>     >   age     int
>     > )
>     > partitioned by (dtp string)
>     > row format delimited
>     > fields terminated by '\t'
>     > lines terminated by '\n'
>     > stored as textfile;
> OK
> Time taken: 0.345 seconds
> hive> alter table multischema_test add partition (dtp=20110101) location
> '/user/hive/warehouse/test/multischema_test/20110101';
> OK
> Time taken: 0.662 seconds
> hive> alter table multischema_test replace columns (id int, name string,
> gender string, age int, crdt string) ;
> OK
> Time taken: 0.119 seconds
> hive> alter table multischema_test add partition (dtp=20110102) location
> '/user/hive/warehouse/test/multischema_test/20110102';
> OK
> Time taken: 0.186 seconds
> hive> select * from multischema_test ;
> OK
> 1    2010-07-01    Jeff    32    NULL    20110101
> 2    2010-07-01    Lisa    33    NULL    20110101
> 3    2010-07-01    Bob    22    NULL    20110101
> 4    2010-07-01    Fred    27    NULL    20110101
> 100    Gregory    Male    45    2010-08-01    20110102
> 101    Horus    Male    14    2010-08-01    20110102
> 102    Verdann    Male    33    2010-08-01    20110102
> 103    Gennefer    Female    32    2010-08-01    20110102
> Time taken: 0.348 seconds
> hive> select name,gender from multischema_test ;
> Total MapReduce jobs = 1
> Launching Job 1 out of 1
> Number of reduce tasks is set to 0 since there's no reduce operator
> Starting Job = job_201108291505_28322, Tracking URL =
> http://laxhadoop1-001:50030/jobdetails.jsp?jobid=job_201108291505_28322
> Kill Command = /usr/lib/hadoop/bin/hadoop job
> -Dmapred.job.tracker=laxhadoop1-001:54311 -kill job_201108291505_28322
> 2011-10-06 11:02:27,099 Stage-1 map = 0%,  reduce = 0%
> 2011-10-06 11:02:31,129 Stage-1 map = 100%,  reduce = 0%
> 2011-10-06 11:02:33,144 Stage-1 map = 100%,  reduce = 100%
> Ended Job = job_201108291505_28322
> OK
> Gregory    Male
> Horus    Male
> Verdann    Male
> Gennefer    Female
> 2010-07-01    Jeff
> 2010-07-01    Lisa
> 2010-07-01    Bob
> 2010-07-01    Fred
> Time taken: 8.977 seconds
> hive>
>
> The two text files that make up the data in the table are like this:
>
> [hdfs@laxhadoop1-012 ~/Tim] :) cat multischema-datafile-20110101 | hadoop
> fs -put - /user/hive/warehouse/test/multischema_test/20110101/datafile
> [hdfs@laxhadoop1-012 ~/Tim] :) cat multischema-datafile-20110102 | hadoop
> fs -put - /user/hive/warehouse/test/multischema_test/20110102/datafile
> [hdfs@laxhadoop1-012 ~/Tim] :) cat multischema-datafile-20110101
> 1    2010-07-01    Jeff    32
> 2    2010-07-01    Lisa    33
> 3    2010-07-01    Bob    22
> 4    2010-07-01    Fred    27
> [hdfs@laxhadoop1-012 ~/Tim] :) cat multischema-datafile-20110102
> 100    Gregory    Male    45    2010-08-01
> 101    Horus    Male    14    2010-08-01
> 102    Verdann    Male    33    2010-08-01
> 103    Gennefer    Female    32    2010-08-01
>
> Did I do something wrong?
>
>
>
> On Mon, Aug 29, 2011 at 10:46 PM, Ashutosh Chauhan 
> <hashut...@apache.org> wrote:
>
>> Hi Tim,
>>
>> I figured that out from reading both the code and the manual. I don't
>> think it's explicitly documented anywhere, so it would be great if you
>> documented this. That page looks like the right place for this piece of
>> information to live.
>> Thanks for the help in making Hive better.
>>
>> Ashutosh
>>
>> On Mon, Aug 29, 2011 at 15:26, Time Less <timelessn...@gmail.com> wrote:
>>
>>> Hello, Ashutosh,
>>>
>>> I did nothing like that... :)
>>>
>>> It seems the problem here is I didn't RTFM. Perchance, could you say
>>> where you figured this out? I am going from the Hive DDL page on
>>> confluence[1], and although it mentions partitions and it mentions the
>>> "replace columns" you've mentioned here, it doesn't mention them together
>>> that I see. I would like to document this for future generations. Is that
>>> the proper page where I'd document this?
>>>
>>> I would probably explicitly create a section titled "Different Schemas
>>> per Partition" and basically give the syntax you gave in the quoted
>>> message below (assuming it works once I have tested it).
>>>
>>>
>>> [1]
>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable%2FPartitionStatements
>>>
>>>
>>> On Wed, Aug 24, 2011 at 6:14 PM, Ashutosh Chauhan 
>>> <hashut...@apache.org> wrote:
>>>
>>>> Hey Tim,
>>>>
>>>> Hive does support different schemas for different partitions. If your
>>>> data comes out garbled, that seems to be a bug. In your case, does the
>>>> following sequence of steps resemble what you did?
>>>>
>>>> a) create table tbl (id int, name string, level int) partitioned by
>>>> (date string);
>>>> b) -- add partitions
>>>> c) alter table tbl replace columns (id int, level int, name_id int);
>>>> d) -- add more partitions.
>>>>
>>>> If you do select * from tbl, then this should work. You need not
>>>> rewrite any of your data. Can you provide more info about what output you
>>>> were expecting and what you got? Are there any error logs?
>>>>
>>>> Ashutosh
>>>>
>>>>
>>>> On Mon, Aug 22, 2011 at 14:34, Time Less <timelessn...@gmail.com> wrote:
>>>>
>>>>> I found a set of slides from Facebook online about Hive that claims you
>>>>> can have a schema per partition in the table. This is exciting to us
>>>>> because we have a table like so:
>>>>>
>>>>> id     int
>>>>> name   string
>>>>> level  int
>>>>> date   string
>>>>>
>>>>> And it's broken up into partitions by date. However, on a particular
>>>>> date last year, the table dramatically changed its schema to:
>>>>>
>>>>> id       int
>>>>> level    int
>>>>> date     string
>>>>> name_id  int
>>>>>
>>>>> So now if I do "select * from table" in hive, the data is completely
>>>>> garbled for whichever portion of data doesn't fit the Hive schema. We are
>>>>> considering re-writing the datafiles so they're the same before/after that
>>>>> date, but if Hive supports having two entirely different schemas depending
>>>>> on the partition, that'd be really convenient, since these datafiles are
>>>>> hundreds of gigabytes in size (and we do sort of like the idea of knowing
>>>>> how the datafile looked back then...).
>>>>>
>>>>> This page:
>>>>> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterTable%2FPartitionStatements
>>>>> doesn't seem to have an appropriate example, so I'm left wondering.
>>>>>
>>>>> Has anyone done anything like this?
>>>>>
>>>>> --
>>>>> Tim Ellis
>>>>> Data Architect, Riot Games
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Tim
>>>
>>
>>
>
>
> --
> Tim
>
