[ https://issues.apache.org/jira/browse/HIVE-14318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gopal V updated HIVE-14318:
---------------------------
    Status: Patch Available  (was: Open)

> Vectorization: LIKE should use matches() instead of find(0)
> -----------------------------------------------------------
>
>                 Key: HIVE-14318
>                 URL: https://issues.apache.org/jira/browse/HIVE-14318
>             Project: Hive
>          Issue Type: Bug
>          Components: Vectorization
>    Affects Versions: 1.2.1, 1.3.0, 2.2.0
>            Reporter: Gopal V
>            Assignee: Gopal V
>         Attachments: HIVE-14318.1.patch
>
>
> Checking with matches() instead of find() would allow the matcher to exit early
> instead of looking for sub-sequences beyond the first non-match.
> In UDFLike.java, the complex pattern checker uses matches(), while the
> vectorized version uses find(0), which is more expensive.
> {code}
> Benchmark                            Mode  Cnt    Score    Error  Units
> RegexBench.testGreedyRegexHit        avgt    5  379.316 ± 32.444  ns/op
> RegexBench.testGreedyRegexHitCheck   avgt    5  344.895 ± 15.436  ns/op
> RegexBench.testGreedyRegexMiss       avgt    5  497.193 ± 18.168  ns/op
> RegexBench.testGreedyRegexMissCheck  avgt    5  171.872 ±  8.588  ns/op
> {code}
> A miss is nearly 3x more expensive per row with .find(0) than with the
> .matches() check version.
> When the pattern matches (the hit case), the two are nearly the same.
> In the lazy scenario, matches() is slower when there's a hit (because it runs
> the check to the end of the input), but more than 2x faster when there's a miss.
> {code}
> RegexBench.testLazyRegexHit          avgt    5   78.398 ±  6.007  ns/op
> RegexBench.testLazyRegexHitCheck     avgt    5  120.557 ±  4.396  ns/op
> RegexBench.testLazyRegexMiss         avgt    5  387.594 ± 25.672  ns/op
> RegexBench.testLazyRegexMissCheck    avgt    5  154.489 ± 13.622  ns/op
> {code}
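
For readers unfamiliar with the two java.util.regex calls discussed in the issue, here is a minimal sketch of how they differ. It is not the Hive code or the attached patch: the class name LikeMatchSketch, the helper names viaFind and viaMatches, the sample rows, and the regex "^a.*b.*$" (standing in for a translated LIKE pattern such as 'a%b%') are all assumptions made for illustration. Matcher.find(0) resets the matcher and searches for a matching subsequence starting from index 0, while Matcher.matches() requires the entire input to match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LikeMatchSketch {
    // Hypothetical anchored regex, assumed to resemble a translated LIKE 'a%b%'.
    public static final Pattern ANCHORED = Pattern.compile("^a.*b.*$", Pattern.DOTALL);

    // Search-style check: resets the matcher and looks for a matching
    // subsequence, starting the search at index 0.
    public static boolean viaFind(Pattern p, CharSequence row) {
        Matcher m = p.matcher(row);
        return m.find(0);
    }

    // Whole-input check: succeeds only if the entire row matches.
    public static boolean viaMatches(Pattern p, CharSequence row) {
        return p.matcher(row).matches();
    }

    public static void main(String[] args) {
        // With the anchored pattern, both calls agree on these sample rows.
        String[] rows = {"a-b-c", "ab", "xab", "a", ""};
        for (String row : rows) {
            System.out.println("'" + row + "' find(0)=" + viaFind(ANCHORED, row)
                    + " matches()=" + viaMatches(ANCHORED, row));
        }

        // Without anchors the two calls can disagree, because find(0)
        // may accept a match that starts later in the row.
        Pattern unanchored = Pattern.compile("a.*b");
        System.out.println("unanchored 'xab' find(0)=" + viaFind(unanchored, "xab")
                + " matches()=" + viaMatches(unanchored, "xab"));
    }
}
```

The sketch only illustrates the semantics on a handful of rows; it says nothing about timing. The per-row costs are the ones reported in the benchmark results in the issue.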