Hi Joel,
I built such a pipeline to transform PDF -> text:
https://github.com/EDS-APHP/SparkPdfExtractor
You can take a look. It transforms 20M PDFs in 2 hours on a 5-node Spark cluster.
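As a rough illustration (this is a hedged sketch, not the actual SparkPdfExtractor code), such a pipeline typically reads each file as a (path, bytes) pair and maps a text extractor over the pairs. The `extract_text` function below is a placeholder; a real job would call a PDF library such as Apache PDFBox or pdfminer at that point.

```python
# Sketch of a PDF -> text pipeline shape. extract_text is a stand-in
# for a real PDF parser; here it just decodes the bytes so the sketch runs.

def extract_text(pdf_bytes):
    # Placeholder: a real implementation would parse the PDF content.
    return pdf_bytes.decode("latin-1", errors="ignore")

def pdf_to_text(files):
    """files: iterable of (path, bytes) pairs, the shape sc.binaryFiles() yields."""
    return [(path, extract_text(data)) for path, data in files]

# With a live SparkContext the same shape is roughly:
#   sc.binaryFiles("hdfs:///pdfs").mapValues(extract_text) \
#     .saveAsTextFile("hdfs:///texts")
```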
On 2018-10-10 23:56, Joel D wrote:
> Hi,
>
> I need to process millions of PDFs in HDFS using Spark. First I’m trying
> with some 40k files. [...]
I believe your use case would be better covered by a custom data source that
reads PDF files.
On big-data platforms in general, individual PDF files are very small and
there are a lot of them - this is not very efficient for those platforms.
That could also be one source of your performance problems.
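One common workaround for this small-files problem (my suggestion, not something stated in the thread) is to pack many PDFs into a few large containers - sequence files, Avro, or even plain zip archives - so each task reads one big file instead of thousands of tiny ones. A minimal stdlib sketch; the file names and sizes are made up:

```python
import io
import zipfile

def pack(files, files_per_archive):
    """files: list of (name, bytes). Returns a list of in-memory zip archives,
    each holding at most files_per_archive entries."""
    archives = []
    for i in range(0, len(files), files_per_archive):
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w") as zf:
            for name, data in files[i:i + files_per_archive]:
                zf.writestr(name, data)
        archives.append(buf.getvalue())
    return archives

# 10 small files packed 4 per archive -> 3 archives to distribute.
files = [("doc%d.pdf" % i, b"%PDF-1.4 stub") for i in range(10)]
archives = pack(files, files_per_archive=4)
```

A Spark job would then parallelize over the archives and unzip inside each task, which keeps the per-file open cost off the driver and the NameNode.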
Hi,
I need to process millions of PDFs in HDFS using Spark. First I’m trying
with some 40k files. I’m using the binaryFiles API, with which I’m facing a
couple of issues:
1. It creates only 4 tasks and I can’t seem to increase the parallelism
there.
2. It took 2276 seconds for the 40k files, which means processing millions of
files would take far too long.
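On issue 1: sc.binaryFiles(path) accepts a minPartitions argument, which is the usual way to raise the task count. A hedged sketch - the files-per-task heuristic below is my own assumption, not something from this thread:

```python
def min_partitions(n_files, files_per_task=100, floor=8):
    """Suggest a minPartitions value for sc.binaryFiles: aim for roughly
    files_per_task files per task, never below a small floor."""
    return max(floor, -(-n_files // files_per_task))  # ceil division

# With a live SparkContext:
#   rdd = sc.binaryFiles("hdfs:///pdfs", minPartitions=min_partitions(40000))

print(min_partitions(40000))  # 400 tasks instead of 4 for the 40k-file trial
```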