I am trying to understand the Spark architecture.
For DataFrames that are created from Python objects, i.e. *created in memory*, where are they stored?
Take the following example:
```python
from pyspark.sql import Row
import datetime

courses = [
    {
        'course_id': 1,
        'course_title': 'Mastering Python',
        'course_published_dt': datetime.date(2021, 1, 14),
        'is_active': True,
        'last_updated_ts': datetime.datetime(2021, 2, 18, 16, 57, 25)
    }
]

courses_df = spark.createDataFrame([Row(**course) for course in courses])
```
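For reference, here is how I have been poking at this so far; `getNumPartitions()` and `explain()` are the only introspection hooks I know of, and the comments reflect my (possibly wrong) understanding:

```python
# How many partitions does the underlying RDD report?
print(courses_df.rdd.getNumPartitions())

# Print the logical and physical plans. On my machine this shows a scan
# over an existing (driver-local) RDD, but the exact plan text may vary
# across Spark versions.
courses_df.explain(True)
```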
Where is the DataFrame stored when I invoke the call:

```python
courses_df = spark.createDataFrame([Row(**course) for course in courses])
```
Does it:

1. Send the data to a random executor?
   - Does this mean it counts as a shuffle?
2. Or does it stay on the driver node?
   - That does not seem to make sense once the DataFrame grows large (see the check I ran after this list).
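For what it's worth, my working assumption is that evaluation is lazy, so the list stays a plain Python object in the driver process until an action forces a job; this minimal check (reusing the `spark` session and `courses` list from above) is what led me there:

```python
# Creating the DataFrame does not appear to launch a Spark job:
# nothing shows up in the Spark UI at this point, so presumably the
# data is still just a local Python list on the driver.
courses_df = spark.createDataFrame([Row(**course) for course in courses])

# Only when an action runs does a job get scheduled on the executors.
# I would expect the driver to ship the rows out here, not before.
print(courses_df.count())
```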
--
Regards,
Sreyan Chakravarty