Hi,

When you create a DataFrame from Python objects using spark.createDataFrame, here is what happens:
*Initial Local Creation:* The DataFrame is initially created in the memory of the driver node. The data is not yet distributed to the executors at this point.

*The role of lazy evaluation:* Spark applies lazy evaluation, *meaning transformations are not executed immediately*. It constructs a logical plan describing the operations, but no data movement occurs yet.

*Action trigger:* When you invoke an action (show(), collect(), etc.), Spark triggers the execution.

*When partitioning and distribution come in:* Spark partitions the DataFrame into logical chunks for parallel processing. For a DataFrame built from a local collection, the rows are simply sliced into roughly equal partitions (hash partitioning is what you see later when data is shuffled by key). Each partition is sent to a different executor for distributed execution. This stage involves data transfer across the cluster, but it is not the expensive shuffle you have heard about. Shuffles happen during repartitioning or certain join operations.

*Storage on executors:* Executors receive their assigned partitions and hold them in memory. If memory is limited, Spark spills partitions to disk. Look at the Stages tab in the Spark UI (default port 4040) to see this happening.

*In summary:*

- No data transfer during creation --> data transfer occurs only when an action is triggered.
- Distributed processing --> DataFrames are distributed for parallel execution, not stored entirely on the driver node.
- Lazy evaluation optimisation --> delaying data transfer until it is necessary enhances performance.
- Shuffle vs. partitioning --> data movement during the initial partitioning is not considered a shuffle in Spark terminology; shuffles involve more complex data rearrangement.

*Considerations:*

- Large DataFrames: for very large DataFrames, manage memory carefully to avoid out-of-memory errors. Consider options like:
  - Increasing executor memory
  - Using partitioning strategies to optimise memory usage
  - Employing techniques like checkpointing to persistent storage (disk) or caching for memory efficiency
- You can get additional information from the Spark UI (default port 4040), for example the SQL and Executors tabs.
- Spark uses the Catalyst optimiser to build efficient execution plans; df.explain("extended") shows both the logical and physical plans.

A few rough sketches below illustrate these points.
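To make the lazy evaluation and action points concrete, here is a rough sketch (untested; the filter step is just an illustrative transformation I have added, and the data comes from your example below):

from pyspark.sql import Row, SparkSession
import datetime

spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()

courses = [
    {
        'course_id': 1,
        'course_title': 'Mastering Python',
        'course_published_dt': datetime.date(2021, 1, 14),
        'is_active': True,
        'last_updated_ts': datetime.datetime(2021, 2, 18, 16, 57, 25)
    }
]

# At this point the rows exist only as Python objects on the driver.
# createDataFrame records a plan; nothing runs on the executors yet.
courses_df = spark.createDataFrame([Row(**course) for course in courses])

# Still no execution: filter() is a transformation, so Spark only extends
# the logical plan.
active_df = courses_df.filter(courses_df.is_active)

# Inspect the logical and physical plans that Catalyst has built so far.
active_df.explain("extended")

# show() is an action: only now are partitions shipped to the executors,
# tasks scheduled, and results returned to the driver.
active_df.show()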
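Carrying on with courses_df, a sketch of the partitioning side (again untested; the repartition count of 4 is arbitrary and the course_id column is taken from your example):

# How many partitions did the locally created DataFrame end up with?
print(courses_df.rdd.getNumPartitions())

# Explicit repartitioning *does* introduce a shuffle, unlike the initial
# distribution of the partitions to the executors.
repartitioned_df = courses_df.repartition(4, "course_id")
print(repartitioned_df.rdd.getNumPartitions())

# The number of partitions produced by a shuffle is governed by this setting.
print(spark.conf.get("spark.sql.shuffle.partitions"))

# Compare the physical plans: the repartitioned one shows an Exchange
# (i.e. a shuffle) step, the original does not.
courses_df.explain()
repartitioned_df.explain()

You can watch the resulting jobs, stages and tasks in the Spark UI on port 4040.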
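And for the memory considerations, something along these lines (a sketch only; the checkpoint directory path is merely a placeholder):

from pyspark import StorageLevel

# Keep the partitions in executor memory, spilling to disk if they do not fit.
courses_df.persist(StorageLevel.MEMORY_AND_DISK)
courses_df.count()   # an action is needed to materialise the cache

# Checkpointing truncates the lineage and writes the data to persistent
# storage (the directory below is a placeholder).
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")
checkpointed_df = courses_df.checkpoint()

# Release the cached partitions when they are no longer needed.
courses_df.unpersist()

The Storage and Executors tabs in the Spark UI show what is cached and where.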
HTH

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom

view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, quote "one test result is worth one-thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).


On Thu, 14 Mar 2024 at 19:46, Sreyan Chakravarty <sreya...@gmail.com> wrote:

> I am trying to understand Spark Architecture.
>
> For DataFrames that are created from Python objects, i.e. that are *created
> in memory, where are they stored?*
>
> Take the following example:
>
> from pyspark.sql import Row
> import datetime
>
> courses = [
>     {
>         'course_id': 1,
>         'course_title': 'Mastering Python',
>         'course_published_dt': datetime.date(2021, 1, 14),
>         'is_active': True,
>         'last_updated_ts': datetime.datetime(2021, 2, 18, 16, 57, 25)
>     }
> ]
>
> courses_df = spark.createDataFrame([Row(**course) for course in courses])
>
> Where is the dataframe stored when I invoke the call:
>
> courses_df = spark.createDataFrame([Row(**course) for course in courses])
>
> Does it:
>
> 1. Send the data to a random executor?
>    - Does this mean this counts as a shuffle?
> 2. Or does it stay on the driver node?
>    - That does not make sense when the dataframe grows large.
>
> --
> Regards,
> Sreyan Chakravarty