Please read the docs - there is also saveAsObjectFile, for example - but you almost surely want to handle this as a DataFrame. You can then call .write.format("...") on it as desired.
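Something along these lines (an untested sketch; the parquet choice, path, and column names are mine, not from your mail) would round-trip the data with the schema, and therefore the int column, intact:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('example').getOrCreate()

# Build a DataFrame from the RDD of (name, count) pairs, with named columns
df = spark.createDataFrame(fruit, schema=["name", "qty"])

# Write in any supported format; parquet stores the schema with the data,
# so qty comes back as a long rather than a string
df.write.format("parquet").mode("overwrite").save("/tmp/fruit.parquet")

# In a later session, read it back with types preserved
df2 = spark.read.format("parquet").load("/tmp/fruit.parquet")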
On Sun, Nov 28, 2021 at 3:58 PM Kendall Wagner <kendawag...@gmail.com> wrote:

> Thanks Mich
>
> As you show, after reading back from textFile the int becomes str. Do I
> need another map to translate them back?
>
> Regards
> Kendall
>
>> Hi,
>>
>> In Pyspark you can persist a Dataframe (df) to disk by using the
>> following command:
>>
>> df.persist(pyspark.StorageLevel.DISK_ONLY)
>>
>> Note pyspark.StorageLevel above.
>>
>> But that only stores the dataframe df in temporary storage (a work area)
>> for Spark, akin to using the swap area on a Linux host. The temporary
>> storage will disappear as soon as your Spark session ends! Thus it is
>> not persistent.
>>
>> Spark, like many other tools, uses persistent files (a normal file, an
>> HDFS file, etc.) for storage. You can also write to a database table.
>> In either case you should be able to read that data back later.
>>
>> A simple example will show this:
>>
>> import pyspark
>> from pyspark.sql import SparkSession
>> spark = SparkSession.builder.appName('example').getOrCreate()
>> sc = spark.sparkContext
>> rdd = sc.parallelize(range(10))
>> file_path = "file:///tmp/abcd.txt"
>>
>> >>> rdd.getNumPartitions()
>> 6
>>
>> # save it as a text file in the /tmp directory on Linux. Use coalesce(1)
>> # to reduce the number of partitions to one before saving; check the docs
>> rdd.coalesce(1).saveAsTextFile(file_path)
>>
>> # read that saved file back
>> content = sc.textFile(file_path)
>>
>> >>> content.collect()
>> ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
>>
>> That file at file_path is persistent and will stay in the /tmp directory.
>>
>> HTH
>>
>> view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>> On Sun, 28 Nov 2021 at 03:13, Kendall Wagner <kendawag...@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> Sorry, I am a Spark newbie.
>>> In a pyspark session, I want to store an RDD so that the next time I
>>> run pyspark, the RDD can be reloaded.
>>>
>>> I tried this:
>>>
>>> >>> fruit.count()
>>> 1000
>>> >>> fruit.take(5)
>>> [('peach', 1), ('apricot', 2), ('apple', 3), ('haw', 1), ('persimmon', 9)]
>>> >>> fruit.persist(StorageLevel.DISK_ONLY)
>>> Traceback (most recent call last):
>>>   File "<stdin>", line 1, in <module>
>>> NameError: name 'StorageLevel' is not defined
>>>
>>> The RDD.persist method seems not to work for me.
>>> How do I store an RDD to disk, and how can I reload it again?
>>>
>>> Thank you in advance.
>>> Kendall
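P.S. On the NameError in your first mail: StorageLevel just needs importing, though as noted above, persist() is only a cache hint for the lifetime of the session. If you do want to stay with RDDs rather than DataFrames, saveAsPickleFile / pickleFile also round-trips Python objects without the str conversion you hit with textFile. A rough, untested sketch (the path is made up):

from pyspark import StorageLevel

# This fixes the NameError, but it only caches within the current session
fruit.persist(StorageLevel.DISK_ONLY)

# For durable storage with Python types intact, pickle the RDD to disk
fruit.saveAsPickleFile("file:///tmp/fruit_pickle")

# In a later pyspark session, reload it; the elements come back as
# ('peach', 1) tuples with the counts still ints
fruit2 = sc.pickleFile("file:///tmp/fruit_pickle")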