[ https://issues.apache.org/jira/browse/HIVE-27498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mayank Kunwar reopened HIVE-27498:
----------------------------------

The issue is occurring again, so reopening the ticket.

> Support custom delimiter in SkippingTextInputFormat
> ---------------------------------------------------
>
>                 Key: HIVE-27498
>                 URL: https://issues.apache.org/jira/browse/HIVE-27498
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>            Reporter: Taraka Rama Rao Lethavadla
>            Priority: Major
>
> A simple select returns the expected results when the table has these configs
> {noformat}
> 'skip.header.line.count'='1',
> 'textinputformat.record.delimiter'='|'{noformat}
> but select count(*), or any other query that launches a Tez job, treats
> the whole text as a single line.
> *Test case*
> data.csv
> {noformat}
> Code Name|A AAAA|B BBBB
> CCCC|C DDDD{noformat}
> DDL
> {noformat}
> create external table test(code string,name string)
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
> WITH SERDEPROPERTIES (
>   'field.delim'='\t')
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.mapred.TextInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> location '${system:test.tmp.dir}/test'
> TBLPROPERTIES (
>   'skip.header.line.count'='1',
>   'textinputformat.record.delimiter'='|');{noformat}
> Query result
> select code,name from test;
> {noformat}
> A AAAA
> B BBBB
> CCCC
> C DDDD{noformat}
> *Problem:* But the query _+select count(*) from test+_ returns 1 instead of
> 3.
> It used to work in older Hive versions.
> The behaviour changed after the introduction of the feature
> https://issues.apache.org/jira/browse/HIVE-21924
> The feature aims at splitting text files while reading even when the
> table is configured to skip headers, thereby increasing the number of
> mappers that process the query and improving its throughput.
> The actual problem lies in how the new feature reads a file.
> It does not consider the 'textinputformat.record.delimiter' property and
> reads the file looking for newline characters. Since the input file does
> not have a newline after every record, the whole file is read as a single
> line and the count is returned as 1.
> Ref:
> [https://github.com/apache/hive/blob/24a82a65f96b65eeebe4e23b2fec425037a70216/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java#L548]
>
> *Workaround*
> If we remove the headers from the data and the skip-header config from the
> table properties, or compress the files, we will not hit this issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
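
The mismatch described in the report can be illustrated with a small standalone Java sketch. This is not Hive's code and does not use any Hive or Hadoop API: the class RecordCountSketch and the method countRecords are hypothetical names, and the sample string assumes that data.csv separates fields with tabs (as implied by 'field.delim'='\t') and has no trailing newline. It only models "split on a record delimiter, then skip the header records".

```java
import java.util.regex.Pattern;

public class RecordCountSketch {

    // Counts the records in `data` after skipping `headerCount` leading records.
    // `delimiter` plays the role of 'textinputformat.record.delimiter';
    // `headerCount` plays the role of 'skip.header.line.count'.
    public static int countRecords(String data, String delimiter, int headerCount) {
        if (data.isEmpty()) {
            return 0;
        }
        // Pattern.quote is required because "|" is a regex metacharacter.
        String[] records = data.split(Pattern.quote(delimiter));
        return Math.max(0, records.length - headerCount);
    }

    public static void main(String[] args) {
        // Assumed contents of data.csv from the test case (tabs between fields).
        String data = "Code\tName|A\tAAAA|B\tBBBB\nCCCC|C\tDDDD";

        // Honouring the custom delimiter: 4 records minus 1 header.
        System.out.println(countRecords(data, "|", 1));  // prints 3

        // Ignoring it and splitting on newlines: 2 lines minus 1 header.
        System.out.println(countRecords(data, "\n", 1)); // prints 1
    }
}
```

Under this simplified model, honouring the '|' delimiter gives the expected count of 3, while newline-based reading leaves a single record after the header is skipped, which matches the count of 1 reported above.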