Sure, then let me recap the steps:

1. load the pdfs from a local folder into avro on hdfs
2. load the avro in spark as an RDD
3. apply pdfbox to each pdf and return the content as a string
4. write the result as a huge csv file

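For steps 3 and 4, here is a minimal sketch in plain Python (no spark, no pdfbox) of what happens per record. It is not the code I will push: `records_to_csv` and `fake_extract` are hypothetical names, and `fake_extract` is only a stand-in for the real pdfbox call. The main point is that extracted text contains newlines, commas and quotes, so every field should be quoted in the csv.

```python
import csv
import io
from typing import Callable, Iterable, Tuple


def records_to_csv(
    records: Iterable[Tuple[str, bytes]],
    extract_text: Callable[[bytes], str],
) -> str:
    """Apply extract_text to each (filename, pdf_bytes) record; return CSV text."""
    out = io.StringIO()
    # QUOTE_ALL keeps multi-line text inside a single CSV field.
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(["filename", "content"])
    for name, pdf_bytes in records:
        try:
            text = extract_text(pdf_bytes)
        except Exception:
            # Assumption: one unreadable PDF should not abort the whole run.
            text = ""
        writer.writerow([name, text])
    return out.getvalue()


def fake_extract(pdf_bytes: bytes) -> str:
    # Hypothetical stand-in: decodes the bytes instead of parsing a real PDF.
    return pdf_bytes.decode("utf-8")


if __name__ == "__main__":
    sample = [("a.pdf", b'line one\nsaid "hi", bye')]
    print(records_to_csv(sample, fake_extract))
```

In the spark job the same per-record logic would run inside a map over the RDD, with the extractor replaced by the actual pdfbox text extraction.
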
That's some work for me to push all that, guys. I should find some time within 7 days, however.

@unk1102: this won't cover the colors and formatting aspects; you could play with pdfbox until I release the other parts.

Cheers

2018-04-23 19:34 GMT+02:00 Deepak Sharma <deepakmc...@gmail.com>:

> Yes Nicolas.
> It would be a great help if you can push the code to github and share the URL.
>
> Thanks
> Deepak
>
> On Mon, Apr 23, 2018, 23:00 unk1102 <umesh.ka...@gmail.com> wrote:
>
>> Hi Nicolas, thanks much for the guidance, it was very useful information. If you
>> can push that code to github and share the url it would be a great help. Looking
>> forward. If you can find time to push early it would be an even greater help, as
>> I have to finish a POC on this use case ASAP.