Thanks, Mich.
As you show, after reading back with textFile the ints become strs. Do I need
another map to convert them back?
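
Something like this, perhaps (a minimal sketch, assuming the saved values
were plain ints)?

# convert the strings back to ints with another map
nums = content.map(int)
>>> nums.collect()
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]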

Regards
Kendall

> Hi,
>
>
> In Pyspark you can persist storage of a Dataframe (df) to disk by using
> the following command
>
>
> df.persist(pyspark.StorageLevel.DISK_ONLY)
>
>
> Note pyspark.StorageLevel above.
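>
>
> You can also import the class directly, which avoids the NameError in your
> example (a minimal sketch):
>
>
> from pyspark import StorageLevel
>
> df.persist(StorageLevel.DISK_ONLY)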
>
>
> But that only stores the dataframe df in temporary storage (a work area)
> for Spark, akin to the swap area on a Linux host. That temporary storage
> will disappear as soon as your Spark session ends! Thus it is not
> persistent.
>
>
> Spark, like many other tools, uses persistent files (a normal file, an HDFS
> file, etc.) for storage. You can also write to a database table. In either
> case you should be able to read that data back later.
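>
>
> For a dataframe, a minimal sketch of a persistent write and read back (the
> parquet path /tmp/df_example is just an illustration):
>
>
> df.write.mode("overwrite").parquet("file:///tmp/df_example")
>
> df2 = spark.read.parquet("file:///tmp/df_example")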
>
>
> A simple example will show this.
>
>
> import pyspark
>
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.appName('example').getOrCreate()
>
> sc = spark.sparkContext
>
> rdd = sc.parallelize(range(10))
>
> file_path = "file:///tmp/abcd.txt"
>
> >>> rdd.getNumPartitions()
>
> 6
>
> # Save it as a text file in the /tmp directory on Linux. Use coalesce(1) to
> # reduce the number of partitions to one before saving; check the docs.
>
> rdd.coalesce(1).saveAsTextFile(file_path)
>
> # read that saved file
>
> content = sc.textFile(file_path)
>
> >>> content.collect()
>
> ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
>
> The file at file_path is persistent and will stay in the /tmp directory.
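>
>
> Note that textFile reads everything back as strings. If you want to keep
> the original types, a pickle file works too; a minimal sketch using the
> standard saveAsPickleFile / pickleFile APIs (the path is just an
> illustration):
>
>
> pickle_path = "file:///tmp/abcd_pickle"
>
> rdd.coalesce(1).saveAsPickleFile(pickle_path)
>
> >>> sc.pickleFile(pickle_path).collect()
>
> [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]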
>
>
> HTH
>
>
>
>
>    view my LinkedIn profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sun, 28 Nov 2021 at 03:13, Kendall Wagner <kendawag...@gmail.com>
> wrote:
>
>> Hello,
>>
>> Sorry, I am a Spark newbie.
>> In a pyspark session, I want to store an RDD so that the next time I run
>> pyspark, the RDD can be reloaded.
>>
>> I tried this:
>>
>> >>> fruit.count()
>> 1000
>>
>> >>> fruit.take(5)
>> [('peach', 1), ('apricot', 2), ('apple', 3), ('haw', 1), ('persimmon', 9)]
>>
>> >>> fruit.persist(StorageLevel.DISK_ONLY)
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> NameError: name 'StorageLevel' is not defined
>>
>>
>> The RDD.persist method does not seem to work for me.
>> How do I store an RDD to disk, and how can I reload it later?
>>
>>
>> Thank you in advance.
>> Kendall
>>
>>
>>
