[ 
https://issues.apache.org/jira/browse/HIVE-25765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ganesha Shreedhara updated HIVE-25765:
--------------------------------------
    Description: 
When the _skip.header.line.count_ property is set in the table properties, simple 
select queries that are converted into a FetchTask skip rows at the start of 
every block instead of skipping the header lines once per file. This happens 
when the file is large enough to be read in multiple blocks. The issue does not 
occur when the select query is converted into a map-only job by setting 
_hive.fetch.task.conversion_ to _none_, because the header lines are then 
skipped only for the first block, thanks to 
[this check|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/HiveContextAwareRecordReader.java#L330]. 
We should have a similar check in FetchOperator to avoid this issue. 
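
The difference between the two behaviours can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: the class {{HeaderSkipSketch}}, its {{Split}} type and the methods {{split}} and {{read}} are hypothetical and are not Hive classes, splits are modelled by line index rather than byte offset, and the "skip only at file start" rule is assumed to be the essence of the check linked above.

```java
import java.util.ArrayList;
import java.util.List;

public class HeaderSkipSketch {

    // Hypothetical stand-in for one block/split of a file: the index of its
    // first line within the file, plus the lines it contains.
    public static final class Split {
        final long start;
        final List<String> lines;

        Split(long start, List<String> lines) {
            this.start = start;
            this.lines = lines;
        }
    }

    // Cut a file (a list of lines) into splits of at most linesPerSplit lines.
    public static List<Split> split(List<String> fileLines, int linesPerSplit) {
        List<Split> splits = new ArrayList<>();
        for (int i = 0; i < fileLines.size(); i += linesPerSplit) {
            int end = Math.min(i + linesPerSplit, fileLines.size());
            splits.add(new Split(i, new ArrayList<>(fileLines.subList(i, end))));
        }
        return splits;
    }

    // Read all splits. With skipOnlyAtFileStart == false, headerCount lines
    // are dropped from every split (the behaviour reported in this ticket).
    // With skipOnlyAtFileStart == true, they are dropped only from the split
    // that starts at offset 0 (the assumed shape of the proposed check).
    public static List<String> read(List<Split> splits, int headerCount,
                                    boolean skipOnlyAtFileStart) {
        List<String> rows = new ArrayList<>();
        for (Split s : splits) {
            boolean skipHere = !skipOnlyAtFileStart || s.start == 0;
            int toSkip = skipHere ? Math.min(headerCount, s.lines.size()) : 0;
            rows.addAll(s.lines.subList(toSkip, s.lines.size()));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> file = new ArrayList<>();
        file.add("header");
        for (int i = 1; i <= 9; i++) {
            file.add("row" + i);
        }
        // 10 lines in splits of 4 lines each gives 3 splits (4, 4, 2).
        List<Split> splits = split(file, 4);
        System.out.println(read(splits, 1, false).size()); // prints 7
        System.out.println(read(splits, 1, true).size());  // prints 9
    }
}
```

In this toy model the per-split variant loses one row per split, so its result depends on how the file is divided, while the offset-based variant always returns every data row.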

 

*Steps to reproduce:* 
{code:sql}
-- Create a table on top of the data file (uncompressed size: ~239M) attached
-- to this ticket
CREATE EXTERNAL TABLE test_table(
  col1 string,
  col2 string,
  col3 string,
  col4 string,
  col5 string,
  col6 string,
  col7 string,
  col8 string,
  col9 string,
  col10 string,
  col11 string,
  col12 string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'location_of_data_file'
TBLPROPERTIES ('skip.header.line.count'='1');


-- Counting the number of rows gives the correct result, with only one header
-- line skipped

select count(*) from test_table;
-- 3145727

-- The select query skips more rows, and the result depends on the number of
-- blocks configured in the underlying filesystem. 3 rows are skipped when the
-- file is read in 3 blocks.

select * from test_table;
-- .
-- .
-- Fetched 3145724 rows
 {code}


> skip.header.line.count property skips rows of each block in FetchOperator 
> when file size is larger
> --------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-25765
>                 URL: https://issues.apache.org/jira/browse/HIVE-25765
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 3.1.2
>            Reporter: Ganesha Shreedhara
>            Assignee: Ganesha Shreedhara
>            Priority: Major
>         Attachments: data.txt.gz
>



