Hi,
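
First, the NameError in your snippet: StorageLevel has to be imported before it can be used. A minimal sketch (assuming an existing RDD called fruit, as in your example):

from pyspark import StorageLevel

# persist spills the RDD to Spark's local work area; it is only kept
# for the lifetime of the current SparkContext
fruit.persist(StorageLevel.DISK_ONLY)
fruit.count()   # an action is needed to actually materialise it

That fixes the error, but it does not give you storage that survives the session, for the reasons below.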
In PySpark you can persist a DataFrame (df) to disk with the following command:

df.persist(pyspark.StorageLevel.DISK_ONLY)

(note the pyspark.StorageLevel prefix above). But that only stores the DataFrame df in a temporary storage (work area) for Spark, akin to using the swap area on a Linux host. The temporary storage will disappear as soon as your Spark session ends, so it is not persistent.

Spark, like many other tools, uses persistent files (a normal file, an HDFS file, etc.) for storage; you can also write to a database table. In that case you will be able to read that file back later. A simple example will show it:

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('example').getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(10))
file_path = "file:///tmp/abcd.txt"

>>> rdd.getNumPartitions()
6

# save it as a text file in the /tmp directory on Linux. Use coalesce(1) to
# reduce the number of partitions to one before saving; check the docs
rdd.coalesce(1).saveAsTextFile(file_path)

# read that saved file back
content = sc.textFile(file_path)

>>> content.collect()
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

The output at file_path is persistent and will stay in the /tmp directory after the Spark session ends (saveAsTextFile actually writes a directory of part files at that path).

HTH

view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

On Sun, 28 Nov 2021 at 03:13, Kendall Wagner <kendawag...@gmail.com> wrote:

> Hello,
>
> Sorry I am a spark newbie.
> In pyspark session, I want to store the RDD so that next time I run
> pyspark again, the RDD will be reloaded.
>
> I tried this:
>
> >>> fruit.count()
> 1000
>
> >>> fruit.take(5)
> [('peach', 1), ('apricot', 2), ('apple', 3), ('haw', 1), ('persimmon', 9)]
>
> >>> fruit.persist(StorageLevel.DISK_ONLY)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> NameError: name 'StorageLevel' is not defined
>
> RDD.persist method seems not working for me.
> How to store a RDD to disk and how can I reload it again?
>
> Thank you in advance.
> Kendall
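
For the reload question quoted above: sc.textFile gives you an RDD of strings, so if you want the (fruit, count) tuples back with their original Python types in a later pyspark session, one option is saveAsPickleFile / pickleFile. A minimal sketch (the path is just an example):

# save the RDD with its Python types preserved
fruit.saveAsPickleFile("file:///tmp/fruit_rdd")

# in a later pyspark session
fruit = sc.pickleFile("file:///tmp/fruit_rdd")
fruit.take(5)   # tuples come back as tuples, not strings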