I have a standalone Spark 3.2.0 cluster with two workers started on PC_A and want to run a PySpark job from PC_B. The job needs to load a text file, but I keep getting file-not-found errors when I execute it.
The file "/home/bddev/parrot/words.txt" exists on PC_B but not on PC_A.

try 1:

>>> df = spark.read.text("/home/bddev/parrot/words.txt")
>>> df.select("*").groupBy("value").count().orderBy("count",ascending=False).show()
22/03/14 14:14:44 WARN TaskSetManager: Lost task 11.0 in stage 0.0 (TID 11) (pca executor 0): java.io.FileNotFoundException: File file:/home/bddev/parrot/words.txt does not exist
...

try 2:

>>> from pyspark import SparkFiles
>>> sc.addFile("/home/bddev/parrot/words.txt")
>>> SparkFiles.get("words.txt")
'/tmp/spark-43bf6d61-45a5-463f-adb9-ad4240743010/userFiles-261ec611-2655-4e05-a76c-681122bd22f1/words.txt'
>>> df = spark.read.text("words.txt")
>>> df.select("*").groupBy("value").count().orderBy("count",ascending=False).show()
[Stage 1:> (0 + 16) / 16]
(lots of network activity here; it looks like the file is being copied over from PC_B to PC_A)
22/03/14 14:19:21 WARN TaskSetManager: Lost task 15.0 in stage 1.0 (TID 72) (pca executor 1): java.io.FileNotFoundException: File file:/home/bddev/parrot/words.txt does not exist
...
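Just to show what I was expecting in try 2: I assumed the pattern after addFile() was supposed to be roughly the following (purely a guess on my part, reusing the path that SparkFiles.get() printed above):

>>> path = SparkFiles.get("words.txt")
>>> df = spark.read.text("file://" + path)
>>> df.groupBy("value").count().orderBy("count", ascending=False).show()

but I don't know whether that /tmp/spark-.../userFiles-... path means anything on the executors, so maybe that is not the right direction at all.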