I have a standalone Spark 3.2.0 cluster with two workers started on PC_A and want to run a PySpark job from PC_B. The job needs to load a text file, but I keep getting file-not-found errors when I execute it.
The file "/home/bddev/parrot/words.txt" exists on PC_B but not on PC_A.

try 1:

>>> df = spark.read.text("/home/bddev/parrot/words.txt")
>>> df.select("*").groupBy("value").count().orderBy("count",ascending=False).show()
22/03/14 14:14:44 WARN TaskSetManager: Lost task 11.0 in stage 0.0 (TID 11) (pca executor 0): java.io.FileNotFoundException: File file:/home/bddev/parrot/words.txt does not exist
...

try 2:

>>> from pyspark import SparkFiles
>>> sc.addFile("/home/bddev/parrot/words.txt")
>>> SparkFiles.get("words.txt")
'/tmp/spark-43bf6d61-45a5-463f-adb9-ad4240743010/userFiles-261ec611-2655-4e05-a76c-681122bd22f1/words.txt'
>>> df = spark.read.text("words.txt")
>>> df.select("*").groupBy("value").count().orderBy("count",ascending=False).show()
[Stage 1:> (0 + 16) / 16]
(lots of network activity here; it looks like the file is being copied over from PC_B to PC_A)
22/03/14 14:19:21 WARN TaskSetManager: Lost task 15.0 in stage 1.0 (TID 72) (pca executor 1): java.io.FileNotFoundException: File file:/home/bddev/parrot/words.txt does not exist
...
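Just to show what I was expecting in try 2: I assumed the pattern after addFile() was supposed to be roughly the following (purely a guess on my part, reusing the path that SparkFiles.get() printed above):

>>> path = SparkFiles.get("words.txt")
>>> df = spark.read.text("file://" + path)
>>> df.groupBy("value").count().orderBy("count", ascending=False).show()

but I don't know whether that /tmp/spark-.../userFiles-... path means anything on the executors, so maybe that is not the right direction at all.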