I'm trying to extract text from PDF files stored in HDFS using PDFBox. However, it throws an error:

    Exception in thread "main" org.apache.spark.SparkException: ... java.io.FileNotFoundException: /nnAlias:8020/tmp/sample.pdf (No such file or directory)

What am I missing? Should I be working with the PortableDataStream instead of the String part of: val files: RDD[(String, PortableDataStream)]?

    def pdfRead(fileNameFromRDD: (String, PortableDataStream), sparkSession: SparkSession) = {
      val file: File = new File(fileNameFromRDD._1.drop(5))
      val document = PDDocument.load(file) // It throws an error here.
      if (!document.isEncrypted()) {
        val stripper = new PDFTextStripper()
        val text = stripper.getText(document)
        println("Text:" + text)
      }
      document.close()
    }

    // This is where I call the above PDF-to-text converter method.
    val files = sparkSession.sparkContext.binaryFiles("hdfs://nnAlias:8020/tmp/sample.pdf")
    files.foreach(println)
    files.foreach(f => println(f._1))
    files.foreach(fileStream => pdfRead(fileStream, sparkSession))

Thanks.
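For reference, here is a minimal sketch of the PortableDataStream approach the question asks about. The likely cause of the error is that java.io.File only resolves paths on the local filesystem, so dropping the "hdfs:" prefix leaves a path that does not exist locally. The sketch assumes PDFBox 2.x (where PDDocument.load accepts an InputStream) and the same sample path as above; the object name PdfFromHdfs and the helpers extractText and readPdf are hypothetical names chosen for illustration, not part of either library.

```scala
import java.io.InputStream

import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import org.apache.spark.input.PortableDataStream
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object; the names here are illustrative only.
object PdfFromHdfs {

  // Extracts text from a PDF supplied as a stream.
  // Returns None when the document is encrypted.
  def extractText(in: InputStream): Option[String] = {
    val document = PDDocument.load(in) // PDFBox 2.x API (assumption)
    try {
      if (document.isEncrypted) None
      else Some(new PDFTextStripper().getText(document))
    } finally {
      document.close()
    }
  }

  // Opens the stream that Spark already resolved against HDFS,
  // instead of building a java.io.File from the path string.
  def readPdf(stream: PortableDataStream): Option[String] = {
    val in = stream.open()
    try extractText(in)
    finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pdf-text").getOrCreate()

    val files = spark.sparkContext.binaryFiles("hdfs://nnAlias:8020/tmp/sample.pdf")

    // Extract on the executors, then bring the (path, text) pairs to the
    // driver so the output appears in the driver's console.
    files
      .map { case (path, stream) => (path, readPdf(stream)) }
      .collect()
      .foreach { case (path, text) =>
        println(s"$path -> ${text.getOrElse("<encrypted, skipped>")}")
      }

    spark.stop()
  }
}
```

The SparkSession is not passed into the per-file function here, since the extraction itself does not need it. Collecting to the driver is only reasonable for a small number of files; for larger inputs the extracted text could be written back out from the RDD instead.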