Is there any open source code base to refer to for this kind of use case?

Thanks
Deepak

On Mon, Apr 23, 2018, 22:13 Nicolas Paris wrote:

> Hi
>
> The problem is the number of files on Hadoop.
>
> I deal with 50M PDF files. What I did was put them in an Avro table on
> HDFS, as a binary column.
>
> Then I read it with Spark and push that into PDFBox.
>
> Transforming 50M PDFs into text took 2 hours on a 5-computer cluster.
>
> About colors and formatting, I guess PDFBox is able to get that
> information, and then maybe you could add HTML tags to your text output.
> That's some extra work indeed.
>
> 2018-04-23 18:25 GMT+02:00 unk1102 wrote:
>
>> Hi, I need guidance on dealing with a large number of PDF files when
>> using Hadoop and Spark. Can I store them as binary files using
>> sc.binaryFiles and then convert them to text using PDF parsers like
>> Apache Tika or PDFBox? Or should I convert them to text using these
>> parsers and store them as text files? In doing so I am losing colors,
>> formatting, etc. Please guide.
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
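
The core of the approach quoted above is to avoid millions of small files by packing them into a few large files of (name, bytes) records. A minimal sketch of that idea follows, using only the Python standard library. It is not Avro and not Spark: the length-prefixed layout is a simplified stand-in for an Avro table with a binary column, and the names pack_records and unpack_records are hypothetical. In a real job, the per-record step would hand each payload to a parser such as PDFBox or Tika.

```python
import io
import struct


def pack_records(files, out):
    """Write (name, content) pairs into one stream as length-prefixed records."""
    for name, content in files:
        name_bytes = name.encode("utf-8")
        # Header: two big-endian unsigned 32-bit lengths (name, payload).
        out.write(struct.pack(">II", len(name_bytes), len(content)))
        out.write(name_bytes)
        out.write(content)


def unpack_records(src):
    """Yield (name, content) pairs from a stream written by pack_records."""
    while True:
        header = src.read(8)
        if not header:
            return  # clean end of stream
        if len(header) < 8:
            raise ValueError("truncated record header")
        name_len, content_len = struct.unpack(">II", header)
        name_bytes = src.read(name_len)
        content = src.read(content_len)
        if len(name_bytes) != name_len or len(content) != content_len:
            raise ValueError("truncated record body")
        yield name_bytes.decode("utf-8"), content


# Placeholder payloads standing in for real PDF bytes.
pdfs = [("a.pdf", b"%PDF-1.4 first"), ("b.pdf", b"%PDF-1.4 second")]

container = io.BytesIO()  # stands in for one large file on HDFS
pack_records(pdfs, container)
container.seek(0)
restored = list(unpack_records(container))
```

Each record stays independently readable, so a reader can stream through one large file and convert documents one at a time, which is roughly what reading the binary column from Spark gives you.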
