Thanks, Marcos! I ended up going with a UDF and it's working great.
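
A minimal sketch of the UDF approach (option 1 in the quoted message), assuming a Jython UDF rather than a Java one. The file name `extract_test.py`, the function name `extract_test`, and the alias `myfuncs` are hypothetical. The pattern uses `[^;\s]+` instead of `\S+` so that the capture stops at the semicolon, which is a change from the regex in the thread. Because the regex lives outside the Pig script, the script itself never contains a semicolon inside a quoted string.

```python
import re

# Pig makes `outputSchema` available when the script is registered with
# `USING jython`. The fallback below is an assumption added so the file
# also runs under plain Python for local testing.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator

# Hypothetical pattern: capture the value of the `test` key, stopping at
# the next semicolon or whitespace character.
TEST_VALUE = re.compile(r'test=([^;\s]+)')


@outputSchema('test:chararray')
def extract_test(str_of_interest):
    """Return the value of the `test` key, or None if the key is absent."""
    if str_of_interest is None:
        return None
    match = TEST_VALUE.search(str_of_interest)
    if match is None:
        return None
    return match.group(1)

# Possible usage from Pig (untested on EMR, names are hypothetical):
#   REGISTER 'extract_test.py' USING jython AS myfuncs;
#   blah = FOREACH data GENERATE myfuncs.extract_test(str_of_interest) AS test;
```

Returning None for a missing key should surface as a null in Pig, so rows without `test=` can be filtered afterwards.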

On Tue, Apr 16, 2013 at 4:06 AM, MARCOS MEDRADO RUBINELLI <[email protected]> wrote:

> Dylan,
>
> It seems my first message fell through a crack, so I apologize if you
> receive it twice, but: yes, it is a known issue, and there isn't a stable
> version with the fix yet. I see two ways to work around it:
>
> 1. write a UDF that encapsulates the regex
>
> 2. load the regex from a file
>
> I actually tested number 2. I ran it on 0.10.0, but it should work on a
> recent version of EMR too:
>
> $ echo "test=(\\S+);?" > testregex.txt
> $ hadoop fs -put testregex.txt /tmp
>
> B = LOAD '/tmp/testregex.txt' as (regex :chararray);
>
> blah =
>     FOREACH
>         data
>     GENERATE
>         FLATTEN (
>             REGEX_EXTRACT (
>                 str_of_interest, B.regex, 1
>             )
>         )
>         AS (
>             test: chararray
>         )
> ;
>
> Cheers,
> Marcos
>
> On 16-04-2013 02:03, Dylan Sather wrote:
> > Hi y'all,
> >
> > First time on this list, and hoping you might be able to help me with a
> > (possible) issue.
> >
> > I'm working with some data in Pig that includes strings of interest,
> > optionally separated by semicolons and in random order, e.g.
> >
> > test=12345;foo=bar
> > test=12345
> > foo=bar;test=12345
> >
> > The following code should extract the value of the string for the test
> > 'key':
> >
> > blah =
> >     FOREACH
> >         data
> >     GENERATE
> >         FLATTEN (
> >             EXTRACT (
> >                 str_of_interest,
> >                 'test=(\\S+);?'
> >             )
> >         )
> >         AS (
> >             test: chararray
> >         )
> > ;
> >
> > However, when running the code, I encounter the following error:
> >
> > <line 46, column 0> mismatched character '<EOF>' expecting '''
> > 2013-04-16 04:46:05,245 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> > ERROR 1200: <line 46, column 0> mismatched character '<EOF>' expecting '''
> >
> > I thought I had my regex escape syntax off at first, but that doesn't
> > appear to be the problem. The only information I get from a Google search
> > is a bug report (https://issues.apache.org/jira/browse/PIG-2507) that
> > appears to have been recently fixed, but it's still an issue on the
> > Amazon EMR cluster I'm running (spun up ad hoc, just now, for this
> > analysis).
> >
> > As in the bug report and as suggested elsewhere, replacing the semicolon
> > with its Unicode equivalent (\u003B) yields the same error.
> >
> > I could be crazy and this could be a syntax issue, so I'm hoping someone
> > might be able to point me in the right direction or confirm that this is
> > an existing problem. If the latter, are there any workarounds (either in
> > Pig, or for matching the string I want)?
> >
> > Cheers.
> > Dylan
