Forgot to mention:

You need the import below


from pyspark.sql.types import *
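
The wildcard import works, but you can equally import just the names used
for the schema, which keeps the namespace cleaner. A minimal sketch using
the same schema as in the message below:

from pyspark.sql.types import StructType, StructField, IntegerType

Schema = StructType([StructField("ID", IntegerType(), False)])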

Also you can get the history of commands in Python using below

import readline
# print the commands entered in this session, oldest first
for i in range(readline.get_current_history_length()):
    print(readline.get_history_item(i + 1))
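
And if you would rather stay with the RDD instead of reading the file back
as a dataframe, a plain map will turn the strings back into ints. A minimal
sketch, assuming the same file_path used in the example further down the
thread:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('example').getOrCreate()
sc = spark.sparkContext

file_path = "file:///tmp/abcd.txt"

# textFile always returns strings, so cast each element back to int
content = sc.textFile(file_path)
ints = content.map(int)
# ints.collect() would then give the values back as ints, e.g. [0, 1, ..., 9]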

HTH



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 28 Nov 2021 at 22:56, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi Kendall,
>
> We had the following before
>
> # read that saved file
>
> content = sc.textFile(file_path)
>
> >>> content.collect()
>
> ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
>
> Let us first define a schema for this list
>
> Schema = StructType([StructField("ID", IntegerType(), False)])
>
> Next, read this from the saved file together with the schema. The easiest
> solution (as pointed out by Sean) would be to read the data from the saved
> file as a dataframe in the first place:
>
> content = spark.read.csv(file_path, schema=Schema)
> content.printSchema()
> root
>  |-- ID: integer (nullable = true)
>
> >>> content.show()
> +---+
> | ID|
> +---+
> |  0|
> |  1|
> |  2|
> |  3|
> |  4|
> |  5|
> |  6|
> |  7|
> |  8|
> |  9|
> +---+
>
> HTH
>
>
>
>
>
>
>
> On Sun, 28 Nov 2021 at 21:58, Kendall Wagner <kendawag...@gmail.com>
> wrote:
>
>> Thanks Mich
>> As you show, after reading back from textFile the ints become strs. Do I
>> need another map to translate them?
>>
>> Regards
>> Kendall
>>
>>> Hi,
>>>
>>>
>>> In PySpark you can persist a dataframe (df) to disk by using
>>> the following command
>>>
>>>
>>> df.persist(pyspark.StorageLevel.DISK_ONLY)
>>>
>>>
>>> note pyspark.StorageLevel above
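>>>
>>> The NameError in the original question comes from StorageLevel not being
>>> imported; a minimal sketch of the fix, using a throwaway dataframe purely
>>> for illustration:
>>>
>>> from pyspark import StorageLevel
>>> from pyspark.sql import SparkSession
>>>
>>> spark = SparkSession.builder.appName('example').getOrCreate()
>>>
>>> # any dataframe will do for illustration
>>> df = spark.range(10)
>>>
>>> # keep df on local disk for the lifetime of this Spark session only
>>> df.persist(StorageLevel.DISK_ONLY)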
>>>
>>>
>>> But that only stores the dataframe df in temporary storage (a work area)
>>> for Spark, akin to using the swap area on a Linux host. That temporary
>>> storage disappears as soon as your Spark session ends, so it is not
>>> persistent.
>>>
>>>
>>> Spark, like many other tools, uses persistent files (a normal file, an HDFS
>>> file, etc.) for storage. You can also write to a database table. In either
>>> case you should be able to read that data back later.
>>>
>>>
>>> A simple example will show this:
>>>
>>>
>>> import pyspark
>>>
>>> from pyspark.sql import SparkSession
>>>
>>> spark = SparkSession.builder.appName('example').getOrCreate()
>>>
>>> sc = spark.sparkContext
>>>
>>> rdd = sc.parallelize(range(10))
>>>
>>> file_path = "file:///tmp/abcd.txt"
>>>
>>> >>> rdd.getNumPartitions()
>>>
>>> 6
>>>
>>> # save it as a text file in the /tmp directory on Linux. Use coalesce(1) to
>>> # reduce the number of partitions to one before saving; check the docs
>>>
>>> rdd.coalesce(1).saveAsTextFile(file_path)
>>>
>>> # read that saved file
>>>
>>> content = sc.textFile(file_path)
>>>
>>> >>> content.collect()
>>>
>>> ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
>>>
>>> That file (file_path) is persistent and will stay in the /tmp directory
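>>>
>>> For a dataframe the same idea applies: write it out in a persistent format
>>> and read it back in a later session. A minimal sketch, assuming a writable
>>> path file:///tmp/example_parquet (a hypothetical path, purely for
>>> illustration):
>>>
>>> from pyspark.sql import SparkSession
>>>
>>> spark = SparkSession.builder.appName('example').getOrCreate()
>>>
>>> # any dataframe will do for illustration
>>> df = spark.range(10)
>>> df.write.mode("overwrite").parquet("file:///tmp/example_parquet")
>>>
>>> # in a later session: the schema travels with the data, so ints stay ints
>>> df2 = spark.read.parquet("file:///tmp/example_parquet")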
>>>
>>>
>>> HTH
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Sun, 28 Nov 2021 at 03:13, Kendall Wagner <kendawag...@gmail.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> Sorry, I am a Spark newbie.
>>>> In a pyspark session, I want to store an RDD so that the next time I run
>>>> pyspark, the RDD can be reloaded.
>>>>
>>>> I tried this:
>>>>
>>>> >>> fruit.count()
>>>> 1000
>>>>
>>>> >>> fruit.take(5)
>>>> [('peach', 1), ('apricot', 2), ('apple', 3), ('haw', 1), ('persimmon',
>>>> 9)]
>>>>
>>>> >>> fruit.persist(StorageLevel.DISK_ONLY)
>>>> Traceback (most recent call last):
>>>>   File "<stdin>", line 1, in <module>
>>>> NameError: name 'StorageLevel' is not defined
>>>>
>>>>
>>>> The RDD.persist method does not seem to work for me.
>>>> How do I store an RDD to disk, and how can I reload it later?
>>>>
>>>>
>>>> Thank you in advance.
>>>> Kendall
>>>>
>>>>
>>>>
