I am trying to understand Spark's architecture.

For DataFrames that are created from Python objects, i.e. that are created
in memory, where are they stored?

Take the following example:

from pyspark.sql import Row
import datetime
courses = [
    {
        'course_id': 1,
        'course_title': 'Mastering Python',
        'course_published_dt': datetime.date(2021, 1, 14),
        'is_active': True,
        'last_updated_ts': datetime.datetime(2021, 2, 18, 16, 57, 25)
    }
]

courses_df = spark.createDataFrame([Row(**course) for course in courses])


Where is the DataFrame stored when I invoke this call:

courses_df = spark.createDataFrame([Row(**course) for course in courses])

Does it:

   1. Send the data to a random executor?
      - Does this mean this counts as a shuffle?
   2. Or does it stay on the driver node?
      - That would not make sense when the DataFrame grows large (see the
        sketch below).
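Here is a minimal sketch of how one could probe this, assuming a running
SparkSession bound to the name spark and the courses_df defined above (the
exact plan node names vary across Spark versions):

# A hypothetical probe, not an answer: inspect how Spark plans to read
# the in-memory data. explain() prints the physical plan (e.g. a scan of
# an existing RDD or a local table scan), and getNumPartitions() shows
# how many partitions the driver-side collection was split into.
courses_df.explain()
print(courses_df.rdd.getNumPartitions())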


-- 
Regards,
Sreyan Chakravarty
