Just to further clarify: the "Shuffle Write Size/Records" column in the Spark UI can be misleading when working with cached/persisted data, because it reflects the size and record count of the data actually shuffled, not of the entire cached/persisted dataset. So it is fair to say that this is a limitation of the UI's display, not necessarily a bug in the Spark framework itself.
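For example, here is a minimal sketch you could run in spark-shell to see the effect for yourself (the column names, row counts and app name are illustrative, not taken from your job):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder()
  .appName("shuffle-write-records-demo")   // illustrative app name
  .getOrCreate()

// 10 million rows, cached and fully materialized
val df = spark.range(0L, 10000000L)
  .withColumn("key", col("id") % 1000)
  .cache()
df.count()   // materializes the cache; the Storage tab shows ~10M cached rows

// A wide transformation that shuffles by "key"
val agg = df.groupBy("key").count()
agg.collect()

// In the Stages tab, the stage that reads the cached data reports
// "Shuffle Write Size/Records" for the rows written to the shuffle files
// (typically far fewer than 10M here, after map-side partial aggregation),
// not the number of cached rows. The next stage's "Shuffle Read Size/Records"
// is a better guide to how many records were actually processed downstream.
```

So the UI number is about the shuffle, not about the cache, which is exactly what you are seeing.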
HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer | Generative AI | FinCrime
London, United Kingdom

view my Linkedin profile

https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, "one test result is worth one-thousand expert opinions" (Werner von Braun).

On Sun, 26 May 2024 at 16:45, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Yep, the Spark UI's "Shuffle Write Size/Records" column can sometimes show
> unexpected record counts when data is retrieved from cached or persisted data.
> This happens because the record count reflects the number of records written
> to disk for shuffling, not the actual number of records in the cached or
> persisted data itself. In addition, because of lazy evaluation, Spark may only
> materialize a portion of the cached or persisted data when a task needs it.
> The "Shuffle Write Size/Records" might then only reflect the materialized
> portion, not the total number of records in the cache. While "Shuffle Write
> Size/Records" can be misleading for cached/persisted data, the "Shuffle Read
> Size/Records" column is more reliable. It shows the number of records read
> from the shuffle by the following stage, which should be closer to the actual
> number of records processed.
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer | Generative AI | FinCrime
> London, United Kingdom
>
> view my Linkedin profile
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> Disclaimer: The information provided is correct to the best of my knowledge
> but of course cannot be guaranteed. It is essential to note that, as with
> any advice, "one test result is worth one-thousand expert opinions"
> (Werner von Braun).
>
> On Thu, 23 May 2024 at 17:45, Prem Sahoo <prem.re...@gmail.com> wrote:
>>
>> Hello Team,
>> In the Spark DAG UI we have a Stages tab. Once you click on each stage you
>> can view the tasks.
>>
>> Each task has a column "Shuffle Write Size/Records". That column prints
>> wrong data when it gets the data from cache/persist: it typically shows the
>> wrong record count even though the data size is correct, e.g. 3.2G / 7400,
>> which is wrong.
>>
>> Please advise.