Thanks Pradeep! I'll give it a try and report back.

Ryan

On Fri, Dec 5, 2014 at 12:30 PM, Pradeep Gollakota <[email protected]> wrote:

> I forgot to mention earlier that you should probably move the parser
> initialization code out of the exec method. Initializing it on every call
> will probably cause significant overhead, both in terms of GC and runtime
> performance. You'll want to initialize your parser once and evaluate all
> your docs against it.
>
> - Pradeep
>
> On Fri Dec 05 2014 at 9:18:16 AM Pradeep Gollakota <[email protected]>
> wrote:
>
> > Java strings are immutable. So "pdfText.concat()" returns a new string
> > and the original string is left unmolested. So at the end, all you're
> > doing is returning an empty string. Instead, you can do
> > "pdfText = pdfText.concat(...)". But the better way to write it is to
> > use a StringBuilder.
> >
> > StringBuilder pdfText = ...;
> > pdfText.append(...);
> > pdfText.append(...);
> > ...
> > return pdfText.toString();
> >
> > On Fri Dec 05 2014 at 9:12:37 AM Ryan <[email protected]>
> > wrote:
> >
> >> Hi,
> >>
> >> I'm working on an open source project attempting to convert raw content
> >> from a pdf (stored as a databytearray) into plain text using a Pig UDF
> >> and Apache Tika. I could use your help. For some reason, the UDF I'm
> >> using isn't working. The script succeeds but no output is written.
> >>
> >> *This is the Pig script I'm following:*
> >>
> >> register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';
> >> DEFINE ExtractTextFromPDFs org.warcbase.pig.piggybank.ExtractTextFromPDFs();
> >> DEFINE ArcLoader org.warcbase.pig.ArcLoader();
> >>
> >> raw = load '/data/arc/sample.arc' using ArcLoader as (url: chararray,
> >>     date: chararray, mime: chararray, content: bytearray); --load the data
> >>
> >> a = FILTER raw BY (url matches '.*\\.pdf$'); --gets all PDF pages from the arc file
> >> b = LIMIT a 2; --limit to 2 pages to speed up testing time
> >> c = foreach b generate url, ExtractTextFromPDFs(content);
> >> store c into 'output/pdf_test';
> >>
> >>
> >> *This is the UDF I wrote:*
> >>
> >> public class ExtractTextFromPDFs extends EvalFunc<String> {
> >>
> >>   @Override
> >>   public String exec(Tuple input) throws IOException {
> >>     String pdfText = "";
> >>
> >>     if (input == null || input.size() == 0 || input.get(0) == null) {
> >>       return "N/A";
> >>     }
> >>
> >>     DataByteArray dba = (DataByteArray)input.get(0);
> >>     pdfText.concat(String.valueOf(dba.size())); //my attempt at debugging. Nothing written
> >>
> >>     InputStream is = new ByteArrayInputStream(dba.get());
> >>
> >>     ContentHandler contenthandler = new BodyContentHandler();
> >>     Metadata metadata = new Metadata();
> >>     DefaultDetector detector = new DefaultDetector();
> >>     AutoDetectParser pdfparser = new AutoDetectParser(detector);
> >>
> >>     try {
> >>       pdfparser.parse(is, contenthandler, metadata, new ParseContext());
> >>     } catch (SAXException | TikaException e) {
> >>       // TODO Auto-generated catch block
> >>       e.printStackTrace();
> >>     }
> >>     pdfText.concat(" : "); //another attempt at debugging. Still nothing written
> >>     pdfText.concat(contenthandler.toString());
> >>
> >>     //close the input stream
> >>     if(is != null){
> >>       is.close();
> >>     }
> >>     return pdfText;
> >>   }
> >>
> >> }
> >>
> >> Thank you for your assistance,
> >> Ryan
> >>
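
For anyone reading this thread later, here is a minimal, self-contained sketch of the string issue discussed above, using plain Java with no Pig or Tika dependencies. The class and method names (ConcatDemo, discardingConcat, reassigningConcat, withStringBuilder) are hypothetical and exist only for illustration; they are not part of the warcbase code. It contrasts the discarded concat() result from the original UDF with the two fixes suggested in the reply.

```java
// Hypothetical demo class; none of these names come from the warcbase project.
final class ConcatDemo {

    // Mirrors the pattern in the original UDF: concat() returns a NEW String,
    // and because the result is never assigned, "text" stays empty.
    static String discardingConcat(String... parts) {
        String text = "";
        for (String part : parts) {
            text.concat(part); // return value is thrown away
        }
        return text;
    }

    // Minimal fix: keep the String that concat() returns.
    static String reassigningConcat(String... parts) {
        String text = "";
        for (String part : parts) {
            text = text.concat(part);
        }
        return text;
    }

    // Approach recommended in the thread: accumulate in a mutable StringBuilder
    // and convert to a String once at the end.
    static String withStringBuilder(String... parts) {
        StringBuilder text = new StringBuilder();
        for (String part : parts) {
            text.append(part);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + discardingConcat("42", " : ", "body") + "]");  // prints "[]"
        System.out.println("[" + reassigningConcat("42", " : ", "body") + "]"); // prints "[42 : body]"
        System.out.println("[" + withStringBuilder("42", " : ", "body") + "]"); // prints "[42 : body]"
    }
}
```

In the UDF itself, the same change would presumably mean declaring pdfText as a StringBuilder, calling append() where concat() is used now, and returning pdfText.toString().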
