[jira] [Commented] (HIVE-2327) Optimize REGEX UDFs with constant parameter information

Alexander Pivovarov (JIRA) Mon, 18 May 2015 12:45:23 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-2327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14548626#comment-14548626
 ]


Alexander Pivovarov commented on HIVE-2327:
-------------------------------------------

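I compared the performance of the old and new regexp UDF implementations. The new implementation is about 1% faster, measured as average wall-clock time over the four test queries below.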

table files generator
{code}
import java.io.FileNotFoundException;
import java.io.PrintWriter;
import java.math.BigInteger;
import java.security.SecureRandom;

public class RandomStringGenerator {
  private SecureRandom random = new SecureRandom();

  public String nextSessionId() {
    return new BigInteger(130, random).toString(32);
  }

  public static void main(String[] args) throws FileNotFoundException {
    RandomStringGenerator g = new RandomStringGenerator();

    // generate 10 files with 1 million rows each (10 million rows total)
    for (int j = 0; j < 10; j++) {
      System.out.println("start file " + j);
      PrintWriter pw = new PrintWriter("/tmp/regexp_test/00000" + j + "_0");
      try {
        for (int i = 0; i < 1000000; i++) {
          String id = g.nextSessionId();
          pw.println(id);
        }
      } finally {
        pw.close();
      }
    }
    System.out.println("All Done");
  }
}
{code}

create table
{code}
hadoop fs -put -f /tmp/regexp_test /tmp

create table regexp_test (
  a string
)
stored as textfile
location '/tmp/regexp_test';
{code}

test queries
{code}
--1
time bin/hive -e "select * from regexp_test where regexp(a, '.*abcd.*')"
--2
time bin/hive -e "select * from regexp_test where regexp(a, '.*efgh.*')"
--3
time bin/hive -e "select * from regexp_test where regexp(a, '.*ijkl.*')"
--4
time bin/hive -e "select a from regexp_test where regexp(a, '.*mnop.*')"
{code}

old regexp implementation
{code}
--1  233 rows
real    1m6.881s
user    1m10.582s
sys     0m1.652s

--2  247 rows
real    1m6.520s
user    1m10.082s
sys     0m1.534s

--3  224 rows
real    1m8.037s
user    1m11.718s
sys     0m1.608s

--4  232 rows
real    1m6.698s
user    1m10.378s
sys     0m1.499s

--AVG 67.034
{code}

new regexp implementation
{code}
--1  233 rows
real    1m6.762s
user    1m10.517s
sys     0m1.471s

--2  247 rows
real    1m6.362s
user    1m9.961s
sys     0m1.558s

--3  224 rows
real    1m5.854s
user    1m9.534s
sys     0m1.452s

--4  232 rows
real    1m6.435s
user    1m10.816s
sys     0m1.571s

--AVG 66.35325
{code}

delta = AVG1 - AVG2 = 0.68075 sec  (AVG1 = old implementation, AVG2 = new implementation)
new implementation is about 1% faster  (delta / max(AVG1, AVG2) = 0.68075 / 67.034)
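
For reference, the averages and the relative difference can be recomputed from the `real` times listed above with a small sketch like the one below. The class and method names are made up for illustration and are not part of the patch; only the eight timing values come from the runs above.

```java
public class TimingSummary {
  // Converts a `time` reading such as "1m6.881s" into seconds.
  static double toSeconds(String real) {
    int m = real.indexOf('m');
    double minutes = Double.parseDouble(real.substring(0, m));
    double seconds = Double.parseDouble(real.substring(m + 1, real.length() - 1));
    return minutes * 60 + seconds;
  }

  // Arithmetic mean of several `time` readings, in seconds.
  static double average(String... reals) {
    double sum = 0;
    for (String r : reals) {
      sum += toSeconds(r);
    }
    return sum / reals.length;
  }

  public static void main(String[] args) {
    // "real" times of queries 1-4, old and new implementation
    double avgOld = average("1m6.881s", "1m6.520s", "1m8.037s", "1m6.698s");
    double avgNew = average("1m6.762s", "1m6.362s", "1m5.854s", "1m6.435s");
    double delta = avgOld - avgNew;
    double pct = 100.0 * delta / Math.max(avgOld, avgNew);
    System.out.println("AVG old = " + avgOld + " sec");
    System.out.println("AVG new = " + avgNew + " sec");
    System.out.println("delta   = " + delta + " sec (" + pct + "%)");
  }
}
```

Each query was timed once per implementation, so a difference of this size may be within run-to-run noise.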

> Optimize REGEX UDFs with constant parameter information
> -------------------------------------------------------
>
>                 Key: HIVE-2327
>                 URL: https://issues.apache.org/jira/browse/HIVE-2327
>             Project: Hive
>          Issue Type: Improvement
>          Components: UDF
>            Reporter: Adam Kramer
>            Assignee: Alexander Pivovarov
>         Attachments: HIVE-2327.01.patch, HIVE-2327.2.patch
>
>
> There are a lot of UDFs which would show major performance differences if one 
> assumes that some of their arguments are constant.
> Consider, for example, any UDF that takes a regular expression as input: This 
> can be compiled once (fast) if it's a constant, or once per row (wicked slow) 
> if it's not a constant.
> Or, consider any UDF that reads from a file and/or takes a filename as input; 
> it would have to re-read the whole file if the filename changes.
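
The compile-once idea described in the issue can be illustrated with a minimal sketch, assuming a matcher that remembers the last regex string it saw and recompiles only when that string changes. This is not the code from the attached patches; `ConstantPatternSketch`, `matches`, and `compileCount` are hypothetical names used only for this example.

```java
import java.util.regex.Pattern;

public class ConstantPatternSketch {
  private String lastRegex;     // regex string seen on the previous call
  private Pattern lastPattern;  // compiled form of lastRegex
  int compileCount = 0;         // how many times Pattern.compile was called

  // Returns true if the whole value matches the regex.
  // The pattern is recompiled only when the regex string changes.
  boolean matches(String value, String regex) {
    if (!regex.equals(lastRegex)) {
      lastPattern = Pattern.compile(regex);
      lastRegex = regex;
      compileCount++;
    }
    return lastPattern.matcher(value).matches();
  }

  public static void main(String[] args) {
    ConstantPatternSketch udf = new ConstantPatternSketch();
    String[] rows = {"xxabcdyy", "zzzz", "abcd", "qrst"};
    int hits = 0;
    for (String row : rows) {
      // the regex argument is the same constant for every row
      if (udf.matches(row, ".*abcd.*")) {
        hits++;
      }
    }
    System.out.println("rows matched: " + hits);
    System.out.println("patterns compiled: " + udf.compileCount);
  }
}
```

With a constant regex argument the pattern is compiled on the first row only; a regex that changed on every row would still be compiled once per row.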



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
