Re: Question on RDD storage

Mich Talebzadeh Sun, 28 Nov 2021 12:22:22 -0800

Hi,

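In PySpark you can persist a DataFrame (df) to disk with the following
command:

df.persist(pyspark.StorageLevel.DISK_ONLY)

Note pyspark.StorageLevel above. StorageLevel has to be imported, or
qualified with the pyspark module as shown, otherwise Python raises the
NameError you saw.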

But that only stores the DataFrame df in temporary storage (a work area)
for Spark, akin to using the swap area on a Linux host. That temporary
storage disappears as soon as your Spark session ends! Thus it is not
persistent.

Spark, like many other tools, uses persistent files (a normal file, an HDFS
file etc.) for storage. You can also write to a database table. In that case
you should be able to read that data later.

A simple example will show

import pyspark

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('example').getOrCreate()

sc = spark.sparkContext

rdd = sc.parallelize(range(10))

file_path = "file:///tmp/abcd.txt"

>>> rdd.getNumPartitions()

6

# Save it as a text file in the /tmp directory on Linux. Use coalesce(1) to
# reduce the number of partitions to one before saving; check the docs.

rdd.coalesce(1).saveAsTextFile(file_path)

# read that saved file

content = sc.textFile(file_path)

>>> content.collect()

['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']

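The output at file_path is persistent and will stay in the /tmp directory
after the Spark session ends. Note that saveAsTextFile creates a directory
called abcd.txt containing part files, not a single plain file, and it fails
if that path already exists.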

HTH

   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

On Sun, 28 Nov 2021 at 03:13, Kendall Wagner <kendawag...@gmail.com> wrote:

> Hello,
>
> Sorry I am a spark newbie.
> In pyspark session, I want to store the RDD so that next time I run
> pyspark again, the RDD will be reloaded.
>
> I tried this:
>
> >>> fruit.count()
> 1000
>
> >>> fruit.take(5)
> [('peach', 1), ('apricot', 2), ('apple', 3), ('haw', 1), ('persimmon', 9)]
>
> >>> fruit.persist(StorageLevel.DISK_ONLY)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> NameError: name 'StorageLevel' is not defined
>
>
> RDD.persist method seems not working for me.
> How to store a RDD to disk and how can I reload it again?
>
>
> Thank you in advance.
> Kendall
>
>
>
