[ https://issues.apache.org/jira/browse/HIVE-25765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17453204#comment-17453204 ]
Panagiotis Garefalakis edited comment on HIVE-25765 at 12/3/21, 8:51 PM:
-------------------------------------------------------------------------

Hey [~ganeshas] – thanks for reporting this!
Is this also reproducible in the latest master branch?


was (Author: pgaref):
Hey [~ganeshas] – thanks for reporting this!
Is this bug also visible in the latest master branch?

> skip.header.line.count property skips rows of each block in FetchOperator when file size is larger
> --------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-25765
>                 URL: https://issues.apache.org/jira/browse/HIVE-25765
>             Project: Hive
>          Issue Type: Bug
>   Affects Versions: 3.1.2
>            Reporter: Ganesha Shreedhara
>            Assignee: Ganesha Shreedhara
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: data.txt.gz
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> When the _skip.header.line.count_ property is set in table properties, simple
> select queries that get converted into a FetchTask skip rows of each block
> instead of skipping only the header lines of each file. This happens when the
> file is large enough to be read in multiple blocks. The issue doesn't occur
> when the select query is converted into a map-only job by setting
> _hive.fetch.task.conversion_ to _none_, because in that path the header lines
> are skipped only for the first block, thanks to
> [this check|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/HiveContextAwareRecordReader.java#L330].
> We should have a similar check in FetchOperator to avoid this issue.
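To illustrate the kind of check the description asks for, here is a minimal, self-contained sketch. It is not Hive code: `HeaderSkipSketch`, `Split`, `headerLinesToSkip`, and `readAll` are hypothetical names, and the sketch assumes that the "first block" test amounts to comparing a split's start offset with 0.

```java
import java.util.ArrayList;
import java.util.List;

public class HeaderSkipSketch {

    /** Hypothetical stand-in for a file split: its byte offset plus the lines it holds. */
    public static final class Split {
        public final long start;
        public final List<String> lines;

        public Split(long start, List<String> lines) {
            this.start = start;
            this.lines = lines;
        }
    }

    /** Assumption: header lines belong only to the split that starts at offset 0. */
    public static int headerLinesToSkip(long splitStart, int headerCount) {
        return splitStart == 0 ? headerCount : 0;
    }

    /**
     * Reads every split of one file. With firstBlockOnly=false the header count is
     * applied to each split, which mimics the behaviour reported in this ticket.
     */
    public static List<String> readAll(List<Split> splits, int headerCount, boolean firstBlockOnly) {
        List<String> rows = new ArrayList<>();
        for (Split split : splits) {
            int skip = firstBlockOnly ? headerLinesToSkip(split.start, headerCount) : headerCount;
            int from = Math.min(skip, split.lines.size());
            rows.addAll(split.lines.subList(from, split.lines.size()));
        }
        return rows;
    }
}
```

With one header line and a file read as three splits, the unguarded variant drops one row per split, while the guarded variant drops only the header.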
>
> *Steps to reproduce:*
> {code:java}
> -- Create table on top of the data file (uncompressed size: ~239M) attached in this ticket
> CREATE EXTERNAL TABLE test_table(
>   col1 string,
>   col2 string,
>   col3 string,
>   col4 string,
>   col5 string,
>   col6 string,
>   col7 string,
>   col8 string,
>   col9 string,
>   col10 string,
>   col11 string,
>   col12 string)
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.mapred.TextInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION
>   'location_of_data_file'
> TBLPROPERTIES ('skip.header.line.count'='1');
>
> -- Counting number of rows gives correct result with only one header line skipped
> select count(*) from test_table;
> 3145727
>
> -- Select query skips more rows and the result depends upon the number of blocks
> -- configured in underlying filesystem. 3 rows are skipped when the file is read in 3 blocks.
> select * from test_table;
> .
> .
> Fetched 3145724 rows
> {code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)