[ https://issues.apache.org/jira/browse/HIVE-664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13877185#comment-13877185 ]
Navis commented on HIVE-664:
----------------------------

I ran a simple micro-benchmark on splitting alone and found that it is not significantly faster than the current implementation (at most about 15%, and sometimes even slower). Reusing the previous pattern string does seem like a good idea, though. Furthermore, if the OI (ObjectInspector) for the regex is a constant type, the comparison itself can be skipped. Could you do that too?

> optimize UDF split
> ------------------
>
>                 Key: HIVE-664
>                 URL: https://issues.apache.org/jira/browse/HIVE-664
>             Project: Hive
>          Issue Type: Bug
>          Components: UDF
>            Reporter: Namit Jain
>            Assignee: Teddy Choi
>              Labels: optimization
>         Attachments: HIVE-664.1.patch.txt, HIVE-664.2.patch.txt,
> HIVE-664.3.patch.txt
>
>
> Min Zhou added a comment - 21/Jul/09 07:34 AM
> It's very useful for us.
> Some comments:
> 1. Can you implement it directly with Text? Avoiding string decoding and
> encoding would be faster. Of course, that trick may lead to another problem,
> as String.split uses a regular expression for splitting.
> 2. getDisplayString() always returns a string in lowercase.
>
> Namit Jain added a comment - 21/Jul/09 09:22 AM
> Committed. Thanks Emil
>
> Emil Ibrishimov added a comment - 21/Jul/09 10:48 AM
> There are some easy (compromise) ways to optimize split:
> 1. Check whether the regex argument actually contains any "regex-specific
> characters" and, if it doesn't, do a straightforward split without converting
> to strings.
> 2. Assume some default value for the second argument (for example,
> split(str) being equivalent to split(str, ' ')) and optimize for this value.
> 3. Have two separate split functions: one that does regex and one that
> splits around plain text.
> I think that 1 is a good choice and can be done rather quickly.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
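
For illustration only, here is a minimal Java sketch of two ideas raised in this thread: reusing the compiled pattern while the regex string stays the same, and Emil's option 1 (falling back to a plain-text split when the delimiter contains no regex-specific characters). It is not taken from any of the attached patches. The class name `SplitSketch` and its methods are hypothetical, it works on `String` rather than Hadoop `Text` to stay self-contained, and it assumes a split limit of -1 (trailing empty strings kept), which may differ from what the UDF actually does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Hive.
final class SplitSketch {
    // Characters with special meaning in java.util.regex patterns.
    private static final String REGEX_META = "\\[]{}()*+?.^$|";

    private String lastRegex;       // regex string seen on the previous call
    private Pattern lastPattern;    // compiled form of lastRegex, null if literal
    private boolean lastIsLiteral;  // whether lastRegex is a plain delimiter

    // True if the delimiter is non-empty and has no regex metacharacters.
    public static boolean isLiteral(String regex) {
        if (regex.isEmpty()) {
            return false;
        }
        for (int i = 0; i < regex.length(); i++) {
            if (REGEX_META.indexOf(regex.charAt(i)) >= 0) {
                return false;
            }
        }
        return true;
    }

    // Splits around a plain-text delimiter without using the regex engine.
    public static List<String> splitLiteral(String input, String delim) {
        List<String> out = new ArrayList<>();
        int start = 0;
        int idx;
        while ((idx = input.indexOf(delim, start)) >= 0) {
            out.add(input.substring(start, idx));
            start = idx + delim.length();
        }
        out.add(input.substring(start)); // remainder after the last delimiter
        return out;
    }

    // Re-analyses and recompiles only when the regex string changes.
    public List<String> split(String input, String regex) {
        if (!regex.equals(lastRegex)) {
            lastRegex = regex;
            lastIsLiteral = isLiteral(regex);
            lastPattern = lastIsLiteral ? null : Pattern.compile(regex);
        }
        if (lastIsLiteral) {
            return splitLiteral(input, regex);
        }
        return new ArrayList<>(Arrays.asList(lastPattern.split(input, -1)));
    }
}
```

If the regex argument is known to be constant, as Navis suggests, the `equals` check in `split` could be skipped entirely by doing the analysis and compilation once up front.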