Thanks, Pradeep! I'll give it a try and report back.

Ryan

On Fri, Dec 5, 2014 at 12:30 PM, Pradeep Gollakota <[email protected]>
wrote:

> I forgot to mention earlier that you should probably move the parser
> initialization code (the AutoDetectParser) out of the exec method. Creating
> it on every call will probably cause significant overhead in terms of both
> gc and runtime performance. You'll want to initialize your parser once and
> evaluate all your docs against it.
>
> - Pradeep
>
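The "initialize once, reuse for every document" advice above can be sketched without any Pig or Tika dependency. Everything named here is hypothetical: CostlyParser is a stand-in for an expensive-to-build parser (it is not a Tika class), and Extractor only mimics the shape of a UDF whose exec-style method is called once per record. In a real UDF, per-document objects such as the content handler and metadata would presumably still be created on each call; only the parser itself moves into a field.

```java
final class ReuseParserDemo {

    /** Hypothetical stand-in for a parser that is expensive to construct. */
    static final class CostlyParser {
        static int constructions = 0; // counts how many parsers were built

        CostlyParser() {
            constructions++;
        }

        String parse(byte[] content) {
            return "parsed " + content.length + " bytes";
        }
    }

    /** Mimics a UDF: one instance is created, then exec is called per record. */
    static final class Extractor {
        // Built once when the Extractor is created, not once per exec call.
        private final CostlyParser parser = new CostlyParser();

        String exec(byte[] content) {
            if (content == null) {
                return "N/A";
            }
            return parser.parse(content);
        }
    }

    public static void main(String[] args) {
        Extractor extractor = new Extractor();
        System.out.println(extractor.exec(new byte[3]));
        System.out.println(extractor.exec(new byte[5]));
        System.out.println("parsers built: " + CostlyParser.constructions);
    }
}
```

However many records pass through exec, the constructor runs once per Extractor instance, which is the saving being suggested.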
> On Fri Dec 05 2014 at 9:18:16 AM Pradeep Gollakota <[email protected]>
> wrote:
>
> > Java strings are immutable, so "pdfText.concat()" returns a new string
> > and leaves the original unchanged. At the end, all you're doing is
> > returning an empty string. Instead, you can write "pdfText =
> > pdfText.concat(...)", but the better way is to use a StringBuilder.
> >
> > StringBuilder pdfText = ...;
> > pdfText.append(...);
> > pdfText.append(...);
> > ...
> > return pdfText.toString();
> >
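To make the immutability point above concrete, here is a minimal, self-contained sketch. The class and method names (ConcatDemo, discardResult, reassign, withBuilder) are made up for illustration and are not part of the UDF in this thread; only String.concat and StringBuilder are standard Java.

```java
final class ConcatDemo {

    // Mirrors the bug: concat returns a new String, which is thrown away here.
    static String discardResult(String a, String b) {
        String text = "";
        text.concat(a);
        text.concat(b);
        return text; // still the original empty string
    }

    // First fix: keep the String that concat returns.
    static String reassign(String a, String b) {
        String text = "";
        text = text.concat(a);
        text = text.concat(b);
        return text;
    }

    // Preferred fix: accumulate in a StringBuilder and convert once at the end.
    static String withBuilder(String... parts) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            sb.append(part);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + discardResult("42", " : ") + "]");
        System.out.println("[" + reassign("42", " : ") + "]");
        System.out.println("[" + withBuilder("42", " : ", "body text") + "]");
    }
}
```

The first call shows why the UDF's result stays empty: each concat result is dropped, so the variable never changes.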
> > On Fri Dec 05 2014 at 9:12:37 AM Ryan <[email protected]>
> > wrote:
> >
> >> Hi,
> >>
> >> I'm working on an open source project attempting to convert raw content
> >> from a PDF (stored as a DataByteArray) into plain text using a Pig UDF
> >> and Apache Tika. I could use your help. For some reason, the UDF I'm
> >> using isn't working. The script succeeds but no output is written.
> >> *This is the Pig script I'm following:*
> >>
> >> register 'target/warcbase-0.1.0-SNAPSHOT-fatjar.jar';
> >> DEFINE ExtractTextFromPDFs
> >>  org.warcbase.pig.piggybank.ExtractTextFromPDFs();
> >> DEFINE ArcLoader org.warcbase.pig.ArcLoader();
> >>
> >> raw = load '/data/arc/sample.arc' using ArcLoader as (url: chararray,
> >>   date: chararray, mime: chararray, content: bytearray); --load the data
> >>
> >> a = FILTER raw BY (url matches '.*\\.pdf$');  --gets all PDF pages from the arc file
> >> b = LIMIT a 2; --limit to 2 pages to speed up testing time
> >> c = foreach b generate url, ExtractTextFromPDFs(content);
> >> store c into 'output/pdf_test';
> >>
> >>
> >> *This is the UDF I wrote:*
> >>
> >> public class ExtractTextFromPDFs extends EvalFunc<String> {
> >>
> >>   @Override
> >>   public String exec(Tuple input) throws IOException {
> >>       String pdfText = "";
> >>
> >>       if (input == null || input.size() == 0 || input.get(0) == null) {
> >>           return "N/A";
> >>       }
> >>
> >>       DataByteArray dba = (DataByteArray)input.get(0);
> >>       // my attempt at debugging. Nothing written
> >>       pdfText.concat(String.valueOf(dba.size()));
> >>
> >>       InputStream is = new ByteArrayInputStream(dba.get());
> >>
> >>       ContentHandler contenthandler = new BodyContentHandler();
> >>       Metadata metadata = new Metadata();
> >>       DefaultDetector detector = new DefaultDetector();
> >>       AutoDetectParser pdfparser = new AutoDetectParser(detector);
> >>
> >>       try {
> >>         pdfparser.parse(is, contenthandler, metadata, new ParseContext());
> >>       } catch (SAXException | TikaException e) {
> >>         // TODO Auto-generated catch block
> >>         e.printStackTrace();
> >>       }
> >>       // another attempt at debugging. Still nothing written
> >>       pdfText.concat(" : ");
> >>       pdfText.concat(contenthandler.toString());
> >>
> >>       //close the input stream
> >>       if(is != null){
> >>         is.close();
> >>       }
> >>       return pdfText;
> >>   }
> >>
> >> }
> >>
> >> Thank you for your assistance,
> >> Ryan
> >>
> >
>
