Please read the docs - there is also saveAsObjectFile, for example - but you almost surely want to handle this as a DataFrame. You can then call .write.format("...") on it as desired.
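Something along these lines (an untested sketch; the parquet choice, path, and column names are mine, not from your mail) would round-trip the data with the schema, and therefore the int column, intact:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('example').getOrCreate()

# Build a DataFrame from the RDD of (name, count) pairs, with named columns
df = spark.createDataFrame(fruit, schema=["name", "qty"])

# Write in any supported format; parquet stores the schema with the data,
# so qty comes back as a long rather than a string
df.write.format("parquet").mode("overwrite").save("/tmp/fruit.parquet")

# In a later session, read it back with types preserved
df2 = spark.read.format("parquet").load("/tmp/fruit.parquet")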
On Sun, Nov 28, 2021 at 3:58 PM Kendall Wagner <kendawag...@gmail.com> wrote:

> Thanks Mich
>
> As you show, after reading back from textFile the int becomes str. Do I
> need another map to translate them back?
>
> Regards
> Kendall
>
>> Hi,
>>
>> In Pyspark you can persist a Dataframe (df) to disk by using the
>> following command:
>>
>> df.persist(pyspark.StorageLevel.DISK_ONLY)
>>
>> Note pyspark.StorageLevel above.
>>
>> But that only stores the dataframe df in temporary storage (a work area)
>> for Spark, akin to using the swap area on a Linux host. The temporary
>> storage will disappear as soon as your Spark session ends! Thus it is
>> not persistent.
>>
>> Spark, like many other tools, uses persistent files (a normal file, an
>> HDFS file, etc.) for storage. You can also write to a database table.
>> In either case you should be able to read that data back later.
>>
>> A simple example will show this:
>>
>> import pyspark
>> from pyspark.sql import SparkSession
>> spark = SparkSession.builder.appName('example').getOrCreate()
>> sc = spark.sparkContext
>> rdd = sc.parallelize(range(10))
>> file_path = "file:///tmp/abcd.txt"
>>
>> >>> rdd.getNumPartitions()
>> 6
>>
>> # save it as a text file in the /tmp directory on Linux. Use coalesce(1)
>> # to reduce the number of partitions to one before saving; check the docs
>> rdd.coalesce(1).saveAsTextFile(file_path)
>>
>> # read that saved file back
>> content = sc.textFile(file_path)
>>
>> >>> content.collect()
>> ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
>>
>> That file at file_path is persistent and will stay in the /tmp directory.
>>
>> HTH
>>
>> view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>> On Sun, 28 Nov 2021 at 03:13, Kendall Wagner <kendawag...@gmail.com>
>> wrote:
>>
>>> Hello,
>>>
>>> Sorry, I am a Spark newbie.
>>> In a pyspark session, I want to store an RDD so that the next time I
>>> run pyspark, the RDD can be reloaded.
>>>
>>> I tried this:
>>>
>>> >>> fruit.count()
>>> 1000
>>> >>> fruit.take(5)
>>> [('peach', 1), ('apricot', 2), ('apple', 3), ('haw', 1), ('persimmon', 9)]
>>> >>> fruit.persist(StorageLevel.DISK_ONLY)
>>> Traceback (most recent call last):
>>>   File "<stdin>", line 1, in <module>
>>> NameError: name 'StorageLevel' is not defined
>>>
>>> The RDD.persist method seems not to work for me.
>>> How do I store an RDD to disk, and how can I reload it again?
>>>
>>> Thank you in advance.
>>> Kendall
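P.S. On the NameError in your first mail: StorageLevel just needs importing, though as noted above, persist() is only a cache hint for the lifetime of the session. If you do want to stay with RDDs rather than DataFrames, saveAsPickleFile / pickleFile also round-trips Python objects without the str conversion you hit with textFile. A rough, untested sketch (the path is made up):

from pyspark import StorageLevel

# This fixes the NameError, but it only caches within the current session
fruit.persist(StorageLevel.DISK_ONLY)

# For durable storage with Python types intact, pickle the RDD to disk
fruit.saveAsPickleFile("file:///tmp/fruit_pickle")

# In a later pyspark session, reload it; the elements come back as
# ('peach', 1) tuples with the counts still ints
fruit2 = sc.pickleFile("file:///tmp/fruit_pickle")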