I'm trying to use InvokeForString to call a simple static method that wraps twitter-text-java (API docs: http://mzsanford.github.com/twitter-text-java/docs/api/index.html, source: https://github.com/twitter/twitter-text-java), specifically the extractURLs method of the Extractor class. Since the logical result is a list of URLs, perhaps I should be writing a proper Pig-centric wrapper that returns a tuple, but for now I thought a stringified list would be OK for my immediate purpose: pulling out all the URLs from a corpus of tweets, so we can expand the bit.ly and other short URLs.

So I built the extra class (src below), packaged it inside the twitter-text jar, and verified it's in there and usable as follows:

    danbri$ java -cp twitter-text-1.3.1-plus-tv.notube.TwitterExtractor.jar tv.notube.TwitterExtractor "hello http://example.com/ http://example.org/ world"
    URLs: [http://example.com/, http://example.org/]

Then, from the same directory, I try to run this as a Pig job:

    tw06 = load '/user/danbri/twitter/tweets2009-06.tab.txt.lzo' AS ( when: chararray, who: chararray, msg: chararray);
    REGISTER twitter-text-1.3.1-plus-tv.notube.TwitterExtractor.jar;
    DEFINE ExtractURLs InvokeForString('tv.notube.TwitterExtractor.urls', 'String');
    urls = FOREACH tw06 GENERATE ExtractURLs(msg);
    x = SAMPLE urls 0.001;
    dump x;

...but we don't get past InvokeForString:

    2011-03-01 14:50:31,033 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. could not instantiate 'InvokeForString' with arguments '[tv.notube.TwitterExtractor.urls, String]'
    Details at logfile: /home/danbri/twitter/pig_1298987430385.log
    ...->
    Caused by: java.lang.reflect.InvocationTargetException
    Caused by: java.lang.ClassNotFoundException: tv.notube.TwitterExtractor

I checked that Pig is finding the jar by misspelling the filename in the "REGISTER" line (which, as expected, causes things to fail earlier). I also double-checked that the class is in the jar:

    danbri$ jar -tvf twitter-text-1.3.1-plus-tv.notube.TwitterExtractor.jar | grep tv
         0 Tue Mar 01 12:03:04 CET 2011 tv/
         0 Tue Mar 01 12:03:04 CET 2011 tv/notube/
      1114 Tue Mar 01 13:40:30 CET 2011 tv/notube/TwitterExtractor.class

...so I'm finding myself stuck. I'm sure the answer is staring me in the face, but I can't see it. Perhaps I should just do things properly with "extends EvalFunc<String>" and return the tuples separately anyway...
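
For reference, the same kind of failure can be reproduced outside Pig. The following is a minimal sketch, assuming that InvokeForString resolves 'pkg.Class.method' by loading the class by name and calling the static method through reflection; the InvocationTargetException wrapping a ClassNotFoundException in the log is consistent with that, but nothing above confirms how Pig does it or which class loader it uses. The class ReflectiveInvokeSketch and its methods shout and invokeStatic are hypothetical names made up for illustration, not part of Pig or twitter-text.

```java
import java.lang.reflect.Method;

public class ReflectiveInvokeSketch {

    // Hypothetical stand-in for a static String -> String method such as
    // tv.notube.TwitterExtractor.urls; it only exists so the sketch is runnable.
    public static String shout(String s) {
        return s.toUpperCase();
    }

    // Splits "pkg.Class.method" at the last dot, loads the class by name,
    // and invokes the named static method with a single String argument.
    // ASSUMPTION: this mirrors what a reflective invoker might do; it is not Pig's code.
    public static String invokeStatic(String fullName, String arg) throws Exception {
        int dot = fullName.lastIndexOf('.');
        String className = fullName.substring(0, dot);
        String methodName = fullName.substring(dot + 1);
        // Class.forName only sees classes visible to the calling class loader;
        // a jar that is not on that loader's path gives ClassNotFoundException.
        Class<?> cls = Class.forName(className);
        Method m = cls.getMethod(methodName, String.class);
        return (String) m.invoke(null, arg);
    }

    public static void main(String[] args) throws Exception {
        // Resolvable: the class is on the classpath of this JVM.
        String self = ReflectiveInvokeSketch.class.getName();
        System.out.println(invokeStatic(self + ".shout", "hello"));

        // Not resolvable unless the extended twitter-text jar is on the classpath.
        try {
            System.out.println(invokeStatic("tv.notube.TwitterExtractor.urls", "hello http://example.com/"));
        } catch (ClassNotFoundException e) {
            System.out.println("class not visible to this class loader: " + e.getMessage());
        }
    }
}
```

Run without the jar on the classpath, the second call fails at the Class.forName step. If Pig behaves similarly, the question becomes whether the jar named in REGISTER is visible to the class loader used when the DEFINE is parsed, which is a separate matter from whether the class is present in the jar.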

Thanks for any pointers,

Dan

    package tv.notube;

    import com.twitter.Extractor;
    import java.util.List;

    class TwitterExtractor {

        public static void main (String[] args) {
            String in = args[0];
            System.out.println("URLs: " + urls(in));
        }

        public static String urls(String tweet) {
            Extractor ex = new Extractor();
            List urls = ex.extractURLs(tweet);
            String o = urls.toString();
            return o;
        }
    }
