Thanks Mich. As you show, after reading back from textFile the ints become strs. Do I need another map to translate them back?
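Something like this, I suppose (an untested sketch, reusing sc and file_path from your example below):

content = sc.textFile(file_path)   # each element comes back as a string
numbers = content.map(int)         # parse each line back into an int
>>> numbers.collect()
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]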
Regards
Kendall

> Hi,
>
> In PySpark you can persist a DataFrame (df) to disk by using the
> following command:
>
> df.persist(pyspark.StorageLevel.DISK_ONLY)
>
> Note pyspark.StorageLevel above.
>
> But that only stores the DataFrame df in temporary storage (a work area)
> for Spark, akin to using the swap area on a Linux host. The temporary
> storage will disappear as soon as your Spark session ends, so it is not
> persistent.
>
> Spark, like many other tools, uses persistent files (a normal file, an
> HDFS file, etc.) for storage. You can also write to a database table. In
> either case you should be able to read that data back later.
>
> A simple example will show this:
>
> import pyspark
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('example').getOrCreate()
> sc = spark.sparkContext
> rdd = sc.parallelize(range(10))
> file_path = "file:///tmp/abcd.txt"
> >>> rdd.getNumPartitions()
> 6
> # Save it as a text file in the /tmp directory on Linux. Use coalesce(1)
> # to reduce the number of partitions to one before saving; check the docs.
> rdd.coalesce(1).saveAsTextFile(file_path)
> # Read that saved file back
> content = sc.textFile(file_path)
> >>> content.collect()
> ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
>
> The output at file_path is persistent and will stay in the /tmp directory.
>
> HTH
>
> Mich
>
> On Sun, 28 Nov 2021 at 03:13, Kendall Wagner <kendawag...@gmail.com>
> wrote:
>
>> Hello,
>>
>> Sorry, I am a Spark newbie.
>> In a pyspark session, I want to store the RDD so that the next time I
>> run pyspark, the RDD will be reloaded.
>>
>> I tried this:
>>
>> >>> fruit.count()
>> 1000
>>
>> >>> fruit.take(5)
>> [('peach', 1), ('apricot', 2), ('apple', 3), ('haw', 1), ('persimmon', 9)]
>>
>> >>> fruit.persist(StorageLevel.DISK_ONLY)
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> NameError: name 'StorageLevel' is not defined
>>
>> The RDD.persist method does not seem to work for me.
>> How can I store an RDD to disk, and how can I reload it again?
>>
>> Thank you in advance.
>> Kendall
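A note on the NameError in the quoted message: StorageLevel must be imported before use, and even then persist() only caches within the current session (Mich's point above). To reload an RDD in a later session with its tuple/int types intact, saveAsPickleFile and pickleFile can be used instead of plain text files. A minimal sketch, assuming the fruit RDD from the question and a hypothetical /tmp path:

from pyspark import StorageLevel

fruit.persist(StorageLevel.DISK_ONLY)   # fixes the NameError; caching lasts only for this session

# To keep the RDD beyond the session, write it out explicitly:
fruit.saveAsPickleFile("file:///tmp/fruit_pickle")   # hypothetical path

# ...and in the next pyspark session, read it back:
fruit2 = sc.pickleFile("file:///tmp/fruit_pickle")
fruit2.take(5)   # tuples come back as tuples, counts as ints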