Hi @Mich Talebzadeh <mich.talebza...@gmail.com>, community,

Where can I find such insights into the Spark architecture?

I found a few sites below which cover the internals:

1. https://github.com/JerryLead/SparkInternals
2. https://books.japila.pl/apache-spark-internals/overview/
3. https://stackoverflow.com/questions/30691385/how-spark-works-internally

Most of them are quite old and I am hoping the basic internals have not
changed. Where can we find more information on the internals? Asking in
case you or someone from the community has more articles / videos /
document links to share.

Appreciate your help.

Regards,
Varun Shah

On Fri, Mar 15, 2024, 03:10 Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi,
>
> When you create a DataFrame from Python objects using
> spark.createDataFrame, here is what happens:
>
> *Initial Local Creation:*
> The DataFrame is initially created in the memory of the driver node.
> The data is not yet distributed to the executors at this point.
>
> *The Role of Lazy Evaluation:*
> Spark applies lazy evaluation, *meaning transformations are not
> executed immediately*. It constructs a logical plan describing the
> operations, but no data movement occurs yet.
>
> *Action Trigger:*
> When you initiate an action (things like show(), collect(), etc.),
> Spark triggers the execution.
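>
> A minimal sketch of these three steps (assuming a SparkSession is
> already available as spark; the DataFrame and column names here are
> only for illustration):
>
>     from pyspark.sql import Row
>
>     # Runs on the driver: builds a DataFrame from local Python objects.
>     # No job starts yet; Spark only records a logical plan.
>     df = spark.createDataFrame([Row(id=i, value=i * 10) for i in range(100)])
>
>     # A transformation: still lazy, it only extends the logical plan.
>     doubled = df.withColumn("value_x2", df.value * 2)
>
>     # An action: only now does Spark schedule a job, partition the data
>     # and send tasks to the executors.
>     doubled.show()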
>
> *When Partitioning and Distribution Come In:*
> Spark partitions the DataFrame into logical chunks for parallel
> processing. It divides the data based on a partitioning scheme (the
> default is hash partitioning). Each partition is sent to a different
> executor node for distributed execution. This stage involves data
> transfer across the cluster, but it is not the expensive shuffle you
> may have heard of. Shuffles happen during repartitioning or certain
> join operations.
>
> *Storage on Executors:*
> Executors receive their assigned partitions and store them in their
> memory. If memory is limited, Spark spills partitions to disk. Look at
> the Stages tab in the Spark UI (default port 4040).
>
> *In Summary:*
> No data transfer during creation --> data transfer occurs only when an
> action is triggered.
> Distributed processing --> DataFrames are distributed for parallel
> execution, not stored entirely on the driver node.
> Lazy evaluation optimisation --> delaying data transfer until necessary
> enhances performance.
> Shuffle vs. partitioning --> data movement during partitioning is not
> considered a shuffle in Spark terminology; shuffles involve more
> complex data rearrangement.
>
> *Considerations:*
> For very large DataFrames, manage memory carefully to avoid
> out-of-memory errors. Consider options like:
>
>    - Increasing executor memory
>    - Using partitioning strategies to optimise memory usage
>    - Employing techniques like checkpointing to persistent storage
>    (hard disks) or caching for memory efficiency
>    - Getting additional info from the Spark UI (default port 4040), in
>    tabs like SQL and Executors
>    - Remembering that Spark uses the Catalyst optimiser to build
>    efficient execution plans; df.explain("extended") shows both the
>    logical and physical plans, as sketched below
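>
> A rough sketch of how you can inspect this yourself (again assuming a
> SparkSession named spark; the example data is made up):
>
>     from pyspark.sql import Row
>
>     df = spark.createDataFrame([Row(id=i) for i in range(1000)])
>
>     # How many partitions Spark split the local data into.
>     print(df.rdd.getNumPartitions())
>
>     # Keep the DataFrame in executor memory once computed; Spark spills
>     # partitions to disk if memory runs short.
>     df.cache()
>
>     # An action to trigger the job and materialise the cache.
>     df.count()
>
>     # Catalyst's logical and physical plans for this DataFrame.
>     df.explain("extended")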
>
> HTH
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice: "one test result is worth one-thousand expert
> opinions" (Wernher von Braun
> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>
>
> On Thu, 14 Mar 2024 at 19:46, Sreyan Chakravarty <sreya...@gmail.com>
> wrote:
>
>> I am trying to understand the Spark architecture.
>>
>> For DataFrames that are created from Python objects, i.e. that are
>> *created in memory, where are they stored?*
>>
>> Take the following example:
>>
>> from pyspark.sql import Row
>> import datetime
>>
>> courses = [
>>     {
>>         'course_id': 1,
>>         'course_title': 'Mastering Python',
>>         'course_published_dt': datetime.date(2021, 1, 14),
>>         'is_active': True,
>>         'last_updated_ts': datetime.datetime(2021, 2, 18, 16, 57, 25)
>>     }
>> ]
>>
>> courses_df = spark.createDataFrame([Row(**course) for course in courses])
>>
>> Where is the DataFrame stored when I invoke the call:
>>
>> courses_df = spark.createDataFrame([Row(**course) for course in courses])
>>
>> Does it:
>>
>> 1. Send the data to a random executor?
>>    - Does this mean this counts as a shuffle?
>> 2. Or does it stay on the driver node?
>>    - That does not make sense when the DataFrame grows large.
>>
>> --
>> Regards,
>> Sreyan Chakravarty