Re: Best practices for dealing with large no of PDF files

Deepak Sharma Mon, 23 Apr 2018 09:47:02 -0700

Is there any open source code base I can refer to for this kind of use case?

Thanks
Deepak


On Mon, Apr 23, 2018, 22:13 Nicolas Paris <[email protected]> wrote:

> Hi
>
> The problem is the number of files on Hadoop.
>
>
> I deal with 50M PDF files. What I did was put them in an Avro table on
> HDFS, as a binary column.
>
> Then I read it with Spark and pass each record to PDFBox.
>
> Transforming the 50M PDFs into text took 2 hours on a 5-machine cluster.
>
> As for colors and formatting, I guess PDFBox is able to get that information,
> and then maybe you could add HTML tags to your text output.
> That's some extra work indeed.
>
>
>
>
> 2018-04-23 18:25 GMT+02:00 unk1102 <[email protected]>:
>
>> Hi, I need guidance on dealing with a large number of PDF files when using
>> Hadoop and Spark. Can I store them as binary files using sc.binaryFiles and
>> then convert them to text using PDF parsers like Apache Tika or PDFBox?
>> Or should I convert them to text with these parsers first and store them as
>> text files? In doing so I am losing colors, formatting, etc.
>> Please guide.
>
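
The idea quoted above (many small PDFs consolidated into one container with a binary column, then fed record by record to a parser) can be sketched roughly as follows. This is only a minimal illustration using the Python standard library. The length-prefixed record layout is an assumption made for this sketch and is not the Avro format. The names `pack_records`, `read_records` and `extract_all` are hypothetical, and `extract_text` is a placeholder for whatever parser call (PDFBox, Tika) you would actually use.

```python
import io
import struct
from typing import BinaryIO, Callable, Iterable, Iterator, Tuple

# Record layout (assumed for this sketch, NOT Avro):
#   4-byte big-endian name length, UTF-8 name,
#   8-byte big-endian payload length, raw payload bytes.


def pack_records(files: Iterable[Tuple[str, bytes]], out: BinaryIO) -> int:
    """Append (name, bytes) pairs to one container stream; return the count."""
    count = 0
    for name, payload in files:
        encoded = name.encode("utf-8")
        out.write(struct.pack(">I", len(encoded)))
        out.write(encoded)
        out.write(struct.pack(">Q", len(payload)))
        out.write(payload)
        count += 1
    return count


def read_records(src: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Yield (name, bytes) pairs back from a container written by pack_records."""
    while True:
        header = src.read(4)
        if not header:
            return  # clean end of stream
        if len(header) < 4:
            raise ValueError("truncated record header")
        (name_len,) = struct.unpack(">I", header)
        name = src.read(name_len).decode("utf-8")
        size_bytes = src.read(8)
        if len(size_bytes) < 8:
            raise ValueError("truncated payload length")
        (size,) = struct.unpack(">Q", size_bytes)
        payload = src.read(size)
        if len(payload) != size:
            raise ValueError("truncated payload")
        yield name, payload


def extract_all(
    src: BinaryIO, extract_text: Callable[[bytes], str]
) -> Iterator[Tuple[str, str]]:
    """Apply a caller-supplied parser (placeholder) to every stored document."""
    for name, payload in read_records(src):
        yield name, extract_text(payload)


if __name__ == "__main__":
    buf = io.BytesIO()
    pack_records([("a.pdf", b"%PDF-1.4 first"), ("b.pdf", b"%PDF-1.4 second")], buf)
    buf.seek(0)
    # Stand-in extractor: it only decodes bytes and is not a real PDF parser.
    for name, text in extract_all(buf, lambda data: data.decode("latin-1")):
        print(name, text)
```

On a real cluster the container would be an Avro (or similar) file on HDFS and the loop would be a Spark map over the binary column. The point of the sketch is only that the file system sees one large file instead of millions of small ones, while each document stays individually readable.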
