Dylan,
It seems my first message fell through the cracks, so I apologize if you
receive it twice. Yes, it is a known issue, and there isn't a stable
version with the fix yet. I see two ways to work around it:
1. write a UDF that encapsulates the regex
2. load the regex from a file
I actually tested option 2. I ran it on Pig 0.10.0, but it should work on a
recent version of EMR too:
$ echo "test=(\\S+);?" > testregex.txt
$ hadoop fs -put testregex.txt /tmp
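-- Load the pattern from HDFS. B holds a single row with one field.
B = LOAD '/tmp/testregex.txt' AS (regex:chararray);
-- B.regex is read as a scalar, so the script needs no quoted regex literal.
blah =
    FOREACH
        data
    GENERATE
        FLATTEN (
            REGEX_EXTRACT (
                str_of_interest, B.regex, 1
            )
        )
        AS (
            test: chararray
        )
;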
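For option 1, here is a rough sketch of the matching logic such a UDF could wrap. It is plain Java and has not been run inside Pig. The class name ExtractTest and the method extract are made up for illustration; in a real UDF this logic would presumably sit in the exec(Tuple) method of a class extending org.apache.pig.EvalFunc<String>. Note one deliberate change from the pattern in the thread: \S also matches ';', so a greedy \S+ would capture "12345;foo=bar" from the first sample line, and the sketch uses [^;\s]+ instead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (name is an assumption, not from the thread).
public class ExtractTest {
    // The semicolon lives in Java source, so the Pig script parser never sees it.
    // [^;\s]+ stops at the first semicolon or whitespace character.
    private static final Pattern TEST_VALUE = Pattern.compile("test=([^;\\s]+)");

    // Returns the value of the 'test' key, or null when the key is absent.
    public static String extract(String input) {
        if (input == null) {
            return null;
        }
        Matcher m = TEST_VALUE.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] samples = {"test=12345;foo=bar", "test=12345", "foo=bar;test=12345"};
        for (String s : samples) {
            System.out.println(s + " -> " + extract(s));
        }
    }
}
```

Once wrapped in an EvalFunc and made available to the script with REGISTER, the GENERATE clause could call the UDF on str_of_interest directly, with no regex literal in the Pig script at all.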
Cheers,
Marcos
On 16-04-2013 02:03, Dylan Sather wrote:
> Hi y'all,
>
> First time on this list, and hoping you might be able to help me with a
> (possible) issue.
>
> I'm working with some data in Pig that includes strings of interest,
> optionally separated by semicolons and in random order, e.g.
>
> test=12345;foo=bar
> test=12345
> foo=bar;test=12345
>
> The following code should extract the value of the string for the test
> 'key':
>
> blah =
> FOREACH
> data
> GENERATE
> FLATTEN (
> EXTRACT (
> str_of_interest,
> 'test=(\\S+);?'
> )
> )
> AS (
> test: chararray
> )
> ;
>
> However, when running the code, I encounter the following error:
>
> <line 46, column 0> mismatched character '<EOF>' expecting '''
> 2013-04-16 04:46:05,245 [main] ERROR org.apache.pig.tools.grunt.Grunt -
> ERROR 1200: <line 46, column 0> mismatched character '<EOF>' expecting '''
>
> I thought I had my regex escape syntax off at first, but that doesn't
> appear to be the problem. The only information I get from a Google search
> is a bug report (https://issues.apache.org/jira/browse/PIG-2507) that
> appears to have been recently fixed, but it's still an issue on the Amazon
> EMR cluster I'm running (spun up ad hoc, just now, for this analysis).
>
> As in the bug report and as suggested elsewhere, replacing the semicolon
> with its Unicode equivalent (\u003B) yields the same error.
>
> I could be crazy and this could be a syntax issue, so I'm hoping someone
> might be able to point me in the right direction or confirm that this is an
> existing problem. If the latter, are there any workarounds (either in Pig,
> or for matching the string I want)?
>
> Cheers.
> Dylan
>