[jira] [Updated] (HIVE-22438) Additional comma is added to projection column ids

Wenning Ding (Jira) Thu, 31 Oct 2019 14:18:05 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-22438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Wenning Ding updated HIVE-22438:
--------------------------------
    Description: 
I ran into this issue when querying a Hudi dataset through Hive.

Basically, to query a Hudi-style table, Hudi implements its own InputFormat 
class and overrides the getRecordReader method. In this method, Hudi manually 
adds several projection column ids and projection column names each time 
getRecordReader is called. Like this:

 
{code:java}
public RecordReader<NullWritable, ArrayWritable> getRecordReader(
        final InputSplit split, final JobConf job, final Reporter reporter)
        throws IOException {
    // Add the Hudi projection column name if it is not already present.
    if (!job.get(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR).contains("col_a")) {
        job.set(ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR, "col_a");
    }
    // Add the matching projection column id if it is not already present.
    if (!job.get(ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR).contains("1")) {
        job.set(ColumnProjectionUtils.READ_COLUMN_IDS_CONF_STR, "1");
    }
    // Delegate to the parent InputFormat and return its reader.
    return super.getRecordReader(split, job, reporter);
}
{code}
 

In this situation, a COUNT(\*) or COUNT(1) query causes a problem. Note that 
for COUNT(\*) or COUNT(1), Hive doesn't need to read any column, so the 
projection column ids value is an empty string.

Here is a log excerpt that shows the whole workflow.
{code:java}
[DEBUG] [TezChild] |split.TezGroupedSplitsInputFormat|: Init record reader for index 0 of 2
[INFO] [TezChild] |realtime.HoodieParquetRealtimeInputFormat|: Before adding Hoodie columns, Projections : Ids :
[INFO] [TezChild] |hadoop.HoodieParquetInputFormat|: After adding Hoodie columns, Projections :col_a Ids :1
[DEBUG] [TezChild] |split.TezGroupedSplitsInputFormat|: Init record reader for index 1 of 2
[INFO] [TezChild] |realtime.HoodieParquetRealtimeInputFormat|: Before adding Hoodie columns, Projections :col_a Ids :,1
{code}
As we can see, on the second call the projection ids value becomes ",1", and 
that additional comma causes exceptions in the code that later reads it.
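
The failure can be illustrated with a minimal sketch. It assumes that the code 
reading the ids splits the string on commas and parses each token as an 
integer; parseIds is a hypothetical helper written for this example, not 
Hive's actual parser.

```java
import java.util.ArrayList;
import java.util.List;

public class ProjectionIdParseSketch {

    // Hypothetical helper (an assumption for illustration, not Hive's code):
    // split the id string on commas and parse every token as an integer.
    public static List<Integer> parseIds(String ids) {
        List<Integer> result = new ArrayList<>();
        if (ids.isEmpty()) {
            return result; // COUNT(*) case: no column needs to be read
        }
        for (String token : ids.split(",")) {
            // A leading comma produces an empty first token, and
            // Integer.parseInt("") throws NumberFormatException.
            result.add(Integer.parseInt(token));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parseIds("1"));
        try {
            parseIds(",1");
        } catch (NumberFormatException e) {
            System.out.println("could not parse \",1\": " + e.getMessage());
        }
    }
}
```

Under that assumption, "1" parses cleanly while ",1" fails on the empty token 
in front of the comma.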

 

This error happens in Hive versions earlier than 3.0.0.

Before Hive 3.0.0, Hive directly concatenates the new column ids and the old 
column ids.
{code:java}
  public static void appendReadColumns(Configuration conf, List<Integer> ids) {
    String id = toReadColumnIDString(ids);
    String old = conf.get(READ_COLUMN_IDS_CONF_STR, null);
    String newConfStr = id;
    if (old != null && !old.isEmpty()) {
      newConfStr = newConfStr + StringUtils.COMMA_STR + old;
    }
    setReadColumnIDConf(conf, newConfStr);
    // Set READ_ALL_COLUMNS to false
    conf.setBoolean(READ_ALL_COLUMNS, false);
  }
{code}
In this case, if id (the initial value of newConfStr) is empty and old is "1", 
then newConfStr becomes ",1".

Hive 3.0.0 has fixed this problem, but versions earlier than Hive 3.0.0 still 
need the fix.
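
The concatenation quoted above can be reproduced without Hive on the 
classpath. The sketch below is only an illustration: legacyAppend mirrors the 
quoted pre-3.0.0 logic on plain strings (assuming StringUtils.COMMA_STR is 
","), and guardedAppend is one possible way to skip the comma when the new id 
string is empty. It is not the actual Hive 3.0.0 change, and both method names 
are hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

public class AppendReadColumnsSketch {

    // Join ids with commas; an empty list yields an empty string.
    static String toIdString(List<Integer> ids) {
        return ids.stream().map(String::valueOf).collect(Collectors.joining(","));
    }

    // Mirrors the pre-3.0.0 logic quoted above, on plain strings:
    // the comma is added whenever the old value is non-empty.
    public static String legacyAppend(String old, List<Integer> ids) {
        String newConfStr = toIdString(ids);
        if (old != null && !old.isEmpty()) {
            newConfStr = newConfStr + "," + old;
        }
        return newConfStr;
    }

    // Hypothetical guarded variant (an assumption, not the Hive 3.0.0 patch):
    // only insert the comma when both sides are non-empty.
    public static String guardedAppend(String old, List<Integer> ids) {
        String id = toIdString(ids);
        if (old == null || old.isEmpty()) {
            return id;
        }
        if (id.isEmpty()) {
            return old;
        }
        return id + "," + old;
    }

    public static void main(String[] args) {
        List<Integer> none = Collections.emptyList();
        // COUNT(*) case: no new ids, while the old value is already "1".
        System.out.println("legacy:  \"" + legacyAppend("1", none) + "\"");
        System.out.println("guarded: \"" + guardedAppend("1", none) + "\"");
        // Normal case: both variants agree.
        System.out.println("legacy:  \"" + legacyAppend("1", Arrays.asList(2)) + "\"");
        System.out.println("guarded: \"" + guardedAppend("1", Arrays.asList(2)) + "\"");
    }
}
```

With an empty id list and an old value of "1", the legacy variant yields ",1" 
while the guarded variant keeps "1"; for non-empty inputs the two agree.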


> Additional comma is added to projection column ids
> --------------------------------------------------
>
>                 Key: HIVE-22438
>                 URL: https://issues.apache.org/jira/browse/HIVE-22438
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Wenning Ding
>            Assignee: Wenning Ding
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
