John Omernik created HIVE-5506:
----------------------------------

             Summary: Hive SPLIT function does not return array correctly
                 Key: HIVE-5506
                 URL: https://issues.apache.org/jira/browse/HIVE-5506
             Project: Hive
          Issue Type: Bug
          Components: SQL, UDF
    Affects Versions: 0.11.0, 0.10.0, 0.9.0
         Environment: Hive
            Reporter: John Omernik
Hello all,

I think I have found a bug in the Hive split function.

Summary: When calling split on a string of data, it only returns all array items if the last array item has a value. For example, if I have a string of text delimited by tabs with 7 columns, and the first four are filled but the last three are blank, split returns only a 4-position array. If any number of "middle" columns are empty but the last item still has a value, it returns the proper number of columns. This was tested in Hive 0.9 and Hive 0.11.

Data:
(Note: \t represents a tab character, \x09. The line endings should be \n (UNIX style); I am not sure what email will do to them.)

My data is 7 lines, each containing the first 7 letters of the alphabet separated by tabs. On some lines I have left out certain letters, but kept the number of tabs exactly the same.

input.txt

    a\tb\tc\td\te\tf\tg
    a\tb\tc\td\te\t\tg
    a\tb\t\td\t\tf\tg
    \t\t\td\te\tf\tg
    a\tb\tc\td\t\t\t
    a\t\t\t\te\tf\tg
    a\t\t\td\t\t\tg

I then created a table with one column from that data:

    DROP TABLE tmp_jo_tab_test;
    CREATE table tmp_jo_tab_test (message_line STRING) STORED AS TEXTFILE;
    LOAD DATA LOCAL INPATH '/tmp/input.txt' OVERWRITE INTO TABLE tmp_jo_tab_test;

To validate the data, I created a Python counting script:

    #!/usr/bin/python
    import sys

    for line in sys.stdin:
        line = line[0:-1]         # drop the trailing newline
        out = line.split("\t")    # split on tab; empty fields are kept
        print(len(out))           # number of fields on this line

The output there is:

    $ cat input.txt |./cnt_tabs.py
    7
    7
    7
    7
    7
    7
    7

Based on that, split on tab in Hive should return 7 for each line as well:

    hive -e "select size(split(message_line, '\\t')) from tmp_jo_tab_test;"
    7
    7
    7
    7
    4
    7
    7

However, it does not. The line where only the first four letters are filled in (and blanks are passed in for the last three) returns only 4 splits, where there should be 7: 4 for the letters included, plus 3 blanks. The affected line is:

    a\tb\tc\td\t\t\t

--
This message was sent by Atlassian JIRA
(v6.1#6144)
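
The counts reported above match what happens when trailing empty strings are discarded after splitting. Java's String.split(regex) is documented to behave this way. Whether Hive's split UDF delegates to it is an assumption here, not something confirmed in this report. The following Python sketch reproduces both sets of counts from the same seven rows. The helper names drop_trailing_empties and field_counts are hypothetical, introduced only for illustration, and the emulation covers only the trailing-empty rule.

```python
# Sketch only: the helper names below are hypothetical, not Hive or Java APIs.
# The rows mirror input.txt from the report (real tab characters).
ROWS = [
    "a\tb\tc\td\te\tf\tg",
    "a\tb\tc\td\te\t\tg",
    "a\tb\t\td\t\tf\tg",
    "\t\t\td\te\tf\tg",
    "a\tb\tc\td\t\t\t",
    "a\t\t\t\te\tf\tg",
    "a\t\t\td\t\t\tg",
]


def drop_trailing_empties(fields):
    """Return fields with any run of empty strings at the end removed."""
    end = len(fields)
    while end > 0 and fields[end - 1] == "":
        end -= 1
    return fields[:end]


def field_counts(rows, keep_trailing):
    """Count tab-separated fields per row, optionally dropping trailing empties."""
    counts = []
    for row in rows:
        fields = row.split("\t")  # Python keeps empty fields, including trailing ones
        if not keep_trailing:
            fields = drop_trailing_empties(fields)
        counts.append(len(fields))
    return counts


if __name__ == "__main__":
    print(field_counts(ROWS, keep_trailing=True))   # matches the Python script's counts
    print(field_counts(ROWS, keep_trailing=False))  # matches the counts Hive returned
```

Only the fifth row ends in empty fields, so it is the only row whose count changes between the two modes.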