+ Improvement The “Count” can be done in a cleaner way (The previous way works also with simple ‘replace’)
hive> select RowID,length(regexp_replace(stringColumn,'[^,]',''))+1 as count from t; 1 2 2 5 3 24 4 17 5 8 6 11 7 26 8 18 9 9 From: Markovitz, Dudu [mailto:dmarkov...@paypal.com] Sent: Thursday, June 09, 2016 11:30 PM To: user@hive.apache.org Subject: RE: Get 100 items in Comma Separated strings from Hive Column. ---------------------------------------------------------------------------------------------------- -- bash ---------------------------------------------------------------------------------------------------- mkdir t cat>t/data.txt 1|44,85 2|56,37,83,68,43 3|33,48,42,18,23,80,31,86,48,42,37,52,99,55,93,1,63,67,32,75,44,57,70,2 4|77,26,95,53,11,99,74,82,7,55,75,6,32,87,75,99,80 5|48,78,39,62,16,44,43,63 6|35,97,99,19,22,50,29,84,82,25,77 7|80,43,82,94,81,58,70,8,70,6,62,100,60,84,55,24,100,75,84,15,53,5,19,45,61,73 8|66,44,66,4,80,72,81,63,51,24,51,77,87,85,10,36,43,2 9|39,64,29,14,9,42,66,56,33 hdfs dfs -put t /tmp ---------------------------------------------------------------------------------------------------- -- hive ---------------------------------------------------------------------------------------------------- create external table t ( RowID int ,stringColumn string ) row format delimited fields terminated by '|' location '/tmp/t' ; select RowID,regexp_extract (stringColumn,'(,?[^,]*){0,10}',0) as string10,length(stringColumn)-length(regexp_replace(stringColumn,',',''))+1 as count from t; 1 44,85 2 2 56,37,83,68,43 5 3 33,48,42,18,23,80,31,86,48,42 24 4 77,26,95,53,11,99,74,82,7,55 17 5 48,78,39,62,16,44,43,63 8 6 35,97,99,19,22,50,29,84,82,25 11 7 80,43,82,94,81,58,70,8,70,6 26 8 66,44,66,4,80,72,81,63,51,24 18 9 39,64,29,14,9,42,66,56,33 9 Extracting the first 100 (10 in my example) tokens can be done with regexp_extract or regexp_replace hive> select regexp_extract ('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','(,?[^,]*){0,10}',0); 1,2,3,4,5,6,7,8,9,10 hive> select regexp_replace ('1,2,3,4,5,6,7,8,9,10,11,12,13,14,15','((,?[^,]*){0,10}).*','$1'); 1,2,3,4,5,6,7,8,9,10 From: Mahender Sarangam [mailto:mahender.bigd...@outlook.com] Sent: Thursday, June 09, 2016 7:13 PM To: user@hive.apache.org<mailto:user@hive.apache.org> Subject: Get 100 items in Comma Separated strings from Hive Column. Hi, We have hive table which has a single column with more than 1000 comma separated string items. Is there a way to retrieve only 100 string items from that Column. Also we need to capture number of comma separated string items. We are looking for more of "substring_index" functionality, since we are using Hive 1.2 version, we couldn't find "substring_index" UDF function, Is there a way to achieve the same functionality with "regexp_extract" and I also see there is UDF available not sure whether this helps us achieving same functionality. https://github.com/brndnmtthws/facebook-hive-udfs/blob/master/src/main/java/com/facebook/hive/udf/UDFRegexpExtractAll.java Scenario : Table1 (Source Table) RowID stringColumn 1 1,2,3,4...10000 2 2,4,5,8,4 3 10,11,98,100 Now i Would like to show table result structure like below Row ID 100String count 1 1,2,3...100 10000 2 2,4,5,8,4 5