[ https://issues.apache.org/jira/browse/HIVE-6131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13941289#comment-13941289 ]
Pala M Muthaia commented on HIVE-6131:
--------------------------------------

Browsing the code, it seems this was introduced by the fix for HIVE-3833 in Hive 0.11. In that patch, the partition schema is used to read results, instead of the table schema as before. Since the partition schema is a snapshot of the table schema at the time of partition creation, it doesn't contain columns added later. As a result, rows are read using the stale schema and therefore do not contain the new column values, even though those values are present in the underlying data.

Clearly the intent of the HIVE-3833 patch is to use partition-specific metadata, to allow different serdes for different partitions of a table (as I understand it). This issue appears to be a regression introduced by that patch.

One possible fix is to keep using the partition metadata, but update its column list from the table metadata. While this will likely work, it may not be the 'right' fix.

[~namitjain] [~ashutoshc], any thoughts on this?

> New columns after table alter result in null values despite data
> ----------------------------------------------------------------
>
>                 Key: HIVE-6131
>                 URL: https://issues.apache.org/jira/browse/HIVE-6131
>             Project: Hive
>          Issue Type: Bug
>            Reporter: James Vaughan
>            Priority: Minor
>
> Hi folks,
> I found and verified a bug on our CDH 4.0.3 install of Hive when adding columns to partitioned tables using 'REPLACE COLUMNS'. I dug through Jira a little bit and didn't see anything for it, so hopefully this isn't just noise on the radar.
> Basically, when you alter a partitioned table and then re-upload data to one of its partitions, Hive doesn't recognize the extra data that actually exists in HDFS: it returns NULL values for the new column despite the data being present and the new column appearing in the metadata.
> Here are some steps to reproduce using a basic table:
> 1. Run this Hive command: CREATE TABLE jvaughan_test (col1 string) partitioned by (day string);
> 2. Create a simple file on the local filesystem with a couple of entries, something like "hi" and "hi2" separated by newlines.
> 3. Run this Hive command, pointing it at the file: LOAD DATA LOCAL INPATH '<FILEDIR>' OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');
> 4. Confirm the data with: SELECT * FROM jvaughan_test WHERE day = '2014-01-02';
> 5. Alter the column definitions: ALTER TABLE jvaughan_test REPLACE COLUMNS (col1 string, col2 string);
> 6. Edit your file, adding a second column using the default separator (Ctrl+V, then Ctrl+A in Vim) with two more entries, such as "hi3" on the first row and "hi4" on the second.
> 7. Run step 3 again.
> 8. Check the data again as in step 4.
> For me, these are the results that get returned:
> hive> select * from jvaughan_test where day = '2014-01-02';
> OK
> hi    NULL    2014-01-02
> hi2   NULL    2014-01-02
> This is despite the fact that there is data in the file stored by the partition in HDFS.
> Let me know if you need any other information. The only workaround for me currently is to drop the partition for any data I'm replacing and THEN re-upload the new data file.
> Thanks,
> -James



--
This message was sent by Atlassian JIRA
(v6.2#6252)
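
For reference, a minimal HiveQL sketch of the workaround the reporter describes (dropping the affected partition before re-loading the data), plus an optional schema check. The table, partition and '<FILEDIR>' placeholder follow the reproduction steps above; the DESCRIBE ... PARTITION comparison is an assumption about how the stale partition-level column list can be observed, not something stated in the report.

-- Optional check (assumption: DESCRIBE ... PARTITION shows the partition-level
-- column list; for a partition created before the REPLACE COLUMNS, col2 would
-- be expected to be missing here even though the table-level DESCRIBE shows it).
DESCRIBE jvaughan_test;
DESCRIBE jvaughan_test PARTITION (day = '2014-01-02');

-- Workaround sketch, per the reporter's description: drop the stale partition,
-- then re-load the data file so the partition is recreated from the current
-- table schema. '<FILEDIR>' is the same placeholder for the local data file path.
ALTER TABLE jvaughan_test DROP PARTITION (day = '2014-01-02');
LOAD DATA LOCAL INPATH '<FILEDIR>' OVERWRITE INTO TABLE jvaughan_test PARTITION (day = '2014-01-02');

Note that on a managed table, DROP PARTITION also removes the partition's data from HDFS, which is acceptable here only because the file is re-loaded immediately afterwards.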