I'm trying to extract text from PDF files in HDFS using PDFBox.

However, it throws an error:

"Exception in thread "main" org.apache.spark.SparkException: ...
java.io.FileNotFoundException: /nnAlias:8020/tmp/sample.pdf (No such file or directory)"




What am I missing? Should I be working with the PortableDataStream instead of
the String (path) part of each element in:

val files: RDD[(String, PortableDataStream)]?

def pdfRead(fileNameFromRDD: (String, PortableDataStream), sparkSession: SparkSession) = {
  val file: File = new File(fileNameFromRDD._1.drop(5))
  val document = PDDocument.load(file) // It throws an error here.

  if (!document.isEncrypted()) {
    val stripper = new PDFTextStripper()
    val text = stripper.getText(document)
    println("Text:" + text)
  }

  document.close()
}


// This is where I call the above PDF-to-text converter method.
val files = sparkSession.sparkContext.binaryFiles("hdfs://nnAlias:8020/tmp/sample.pdf")

files.foreach(println)
files.foreach(f => println(f._1))
files.foreach(fileStream => pdfRead(fileStream, sparkSession))
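
For comparison, here is a minimal sketch of the PortableDataStream variant asked about above. It reads the bytes through the stream that binaryFiles provides rather than building a java.io.File from the path string, since java.io.File only resolves paths on the local filesystem. It assumes PDFBox 2.x, where PDDocument.load accepts an InputStream and PDFTextStripper lives in org.apache.pdfbox.text. The names PdfTextSketch and extractText and the app name are made up for illustration, and whether this resolves the error on a particular cluster is not confirmed here.

```scala
import java.io.InputStream

import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import org.apache.spark.input.PortableDataStream
import org.apache.spark.sql.SparkSession

// Hypothetical object name, used only for this sketch.
object PdfTextSketch {

  // Extracts text from an already-open stream.
  // Returns None when the document reports itself as encrypted.
  def extractText(in: InputStream): Option[String] = {
    val document = PDDocument.load(in) // assumes the PDFBox 2.x InputStream overload
    try {
      if (document.isEncrypted) None
      else Some(new PDFTextStripper().getText(document))
    } finally document.close()
  }

  // Opens the stream supplied by binaryFiles instead of creating a local File.
  def pdfRead(entry: (String, PortableDataStream)): Option[String] = {
    val in = entry._2.open() // DataInputStream over the file's bytes
    try extractText(in)
    finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val sparkSession = SparkSession.builder().appName("pdf-text-sketch").getOrCreate()
    val files = sparkSession.sparkContext.binaryFiles("hdfs://nnAlias:8020/tmp/sample.pdf")

    // Collecting to the driver is only reasonable for a handful of small files.
    files
      .map { case (path, stream) => path -> pdfRead((path, stream)) }
      .collect()
      .foreach { case (path, text) => println(s"$path: ${text.getOrElse("<encrypted>")}") }

    sparkSession.stop()
  }
}
```

The text is returned from the map and printed after collect() because println inside an RDD foreach runs on the executors, so on a cluster its output ends up in the executor logs rather than on the driver console.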


Thanks.
