Re: Best practices for dealing with large no of PDF files

Nicolas Paris Mon, 23 Apr 2018 10:46:53 -0700

Sure, then let me recap the steps:
1. load the PDFs from a local folder into HDFS as Avro
2. load the Avro into Spark as an RDD
3. apply PDFBox to each PDF and return its content as a string
4. write the result as one huge CSV file
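
In the meantime, here is a rough sketch of steps 2 to 4, not the code I will publish. It rests on several assumptions: it is written in PySpark, pypdf stands in for PDFBox (PDFBox is a Java library), the Avro records are assumed to have a `path` field and a `content` field holding the raw PDF bytes, and the `avro` data source must be available on the Spark classpath. The names `extract_text`, `to_csv_line` and `run` are made up for illustration.

```python
import csv
import io


def extract_text(pdf_bytes):
    """Step 3: return the text of one PDF given its raw bytes.

    Assumption: pypdf is used here in place of PDFBox.
    """
    from pypdf import PdfReader  # lazy import: only the Spark workers need it

    reader = PdfReader(io.BytesIO(pdf_bytes))
    # extract_text() can come back empty for image-only pages, hence the "or".
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def to_csv_line(path, text):
    """Step 4 helper: quote one record so that commas, quotes and
    newlines inside the PDF text stay within a single CSV field."""
    buf = io.StringIO()
    csv.writer(buf, quoting=csv.QUOTE_ALL).writerow([path, text])
    # Drop the writer's own line terminator; Spark adds one per record.
    return buf.getvalue().rstrip("\r\n")


def run(spark, avro_dir, out_dir):
    """Steps 2 to 4. `path` and `content` are assumed Avro field names."""
    # Step 2: load the Avro records and work on them as an RDD.
    records = spark.read.format("avro").load(avro_dir).rdd
    # Step 3 + 4: extract the text and format each record as a CSV line.
    lines = records.map(
        lambda r: to_csv_line(r["path"], extract_text(bytes(r["content"])))
    )
    # Writes one part file per partition; concatenate them (or coalesce
    # first) if a single huge CSV file is really needed.
    lines.saveAsTextFile(out_dir)
```

Quoting every field matters because extracted PDF text is full of line breaks, so whatever reads the CSV back has to accept multi-line quoted fields.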


That's some work for me to push all that, guys. I should find some time
within 7 days, however.

@unk1102: this won't cover the colors and formatting aspects, so you could
play with PDFBox until I release the other parts.

Cheers

2018-04-23 19:34 GMT+02:00 Deepak Sharma <[email protected]>:

> Yes Nicolas.
> It would be a great help if you can push the code to GitHub and share the URL.
>
> Thanks
> Deepak
>
>
> On Mon, Apr 23, 2018, 23:00 unk1102 <[email protected]> wrote:
>
>> Hi Nicolas, thanks much for the guidance, it was very useful information.
>> If you can push that code to GitHub and share the URL it would be a great
>> help. Looking forward. If you can find time to push early it would be an
>> even greater help, as I have to finish a POC on this use case ASAP.
