[PR] [SPARK-51314][DOCS][PS] Add proper note for distributed-sequence about indeterministic case [spark]

via GitHub Wed, 26 Feb 2025 00:02:56 -0800


itholic opened a new pull request, #50086:
URL: https://github.com/apache/spark/pull/50086


   
   
   ### What changes were proposed in this pull request?
   
   This PR proposes adding a note to the `distributed-sequence` documentation 
about its non-deterministic case.
   
   ### Why are the changes needed?
   
   
   With `distributed-sequence`, the mapping between rows and index values can 
be non-deterministic, which can confuse users, so we should document it.
   
   For example,
   
   ```python
   >>> import pyspark.pandas as ps
   
   # Reading the same data twice
   >>> df1 = ps.read_csv("big_data.csv")
   >>> df2 = ps.read_csv("big_data.csv")
   
   # The row-index mapping for `df1` and `df2` could be different when using
   # `distributed-sequence`.
   >>> df1.head(10)
        record_id start_date   end_date
   0  RECORD_1001 2024-01-01 2024-01-10
   1  RECORD_1002 2024-01-15 2024-01-20
   2  RECORD_1003 2024-02-01 2024-02-10
   3  RECORD_1004 2024-02-15 2024-02-20
   4  RECORD_1005 2024-03-01 2024-03-10
   5  RECORD_1006 2024-03-15 2024-03-20
   6  RECORD_1007 2024-04-01 2024-04-10
   7  RECORD_1008 2024-04-15 2024-04-20
   8  RECORD_1009 2024-05-01 2024-05-10
   9  RECORD_1010 2024-05-15 2024-05-20
   
   >>> df2.head(10)
        record_id start_date   end_date
   0  RECORD_2001 2024-06-01 2024-06-10
   1  RECORD_2002 2024-06-15 2024-06-20
   2  RECORD_2003 2024-07-01 2024-07-10
   3  RECORD_2004 2024-07-15 2024-07-20
   4  RECORD_2005 2024-08-01 2024-08-10
   5  RECORD_2006 2024-08-15 2024-08-20
   6  RECORD_2007 2024-09-01 2024-09-10
   7  RECORD_2008 2024-09-15 2024-09-20
   8  RECORD_2009 2024-10-01 2024-10-10
   9  RECORD_2010 2024-10-15 2024-10-20
   
   # Using `index_col` prevents the non-deterministic case
   >>> df1 = ps.read_csv("big_data.csv", index_col="record_id")
   >>> df2 = ps.read_csv("big_data.csv", index_col="record_id")
   
   # Now the order of the rows is guaranteed to be the same for both DataFrames
   >>> df1.head(10)
               start_date   end_date
   record_id
   RECORD_1001 2024-01-01 2024-01-10
   RECORD_1002 2024-01-15 2024-01-20
   RECORD_1003 2024-02-01 2024-02-10
   RECORD_1004 2024-02-15 2024-02-20
   RECORD_1005 2024-03-01 2024-03-10
   RECORD_1006 2024-03-15 2024-03-20
   RECORD_1007 2024-04-01 2024-04-10
   RECORD_1008 2024-04-15 2024-04-20
   RECORD_1009 2024-05-01 2024-05-10
   RECORD_1010 2024-05-15 2024-05-20
   
   >>> df2.head(10)
               start_date   end_date
   record_id
   RECORD_1001 2024-01-01 2024-01-10
   RECORD_1002 2024-01-15 2024-01-20
   RECORD_1003 2024-02-01 2024-02-10
   RECORD_1004 2024-02-15 2024-02-20
   RECORD_1005 2024-03-01 2024-03-10
   RECORD_1006 2024-03-15 2024-03-20
   RECORD_1007 2024-04-01 2024-04-10
   RECORD_1008 2024-04-15 2024-04-20
   RECORD_1009 2024-05-01 2024-05-10
   RECORD_1010 2024-05-15 2024-05-20
   ```
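   
   The behaviour in the example above can be illustrated without a cluster. The 
sketch below is a simplified pure-Python model, not Spark's actual 
implementation: it assumes the index is built by numbering rows partition by 
partition, and the helper names `distributed_sequence_index` and 
`index_by_column` are hypothetical. It shows that the same rows can receive 
different index values when partitions are processed in a different order, 
while keying rows by a column from the data (as `index_col` does) is unaffected.

```python
def distributed_sequence_index(partitions):
    """Number rows 0..N-1, partition by partition (simplified model)."""
    indexed = {}
    offset = 0
    for rows in partitions:
        for position, row in enumerate(rows):
            indexed[offset + position] = row
        offset += len(rows)  # the next partition continues the sequence
    return indexed


def index_by_column(partitions, key_position=0):
    """Key every row by one of its own values, independent of layout."""
    return {row[key_position]: row for rows in partitions for row in rows}


# Ten made-up records: (record_id, start_date)
records = [(f"RECORD_{n}", f"2024-01-{n - 1000:02d}") for n in range(1001, 1011)]

# Two "reads" of the same data whose partitions arrive in a different order.
run1 = [records[:5], records[5:]]
run2 = [records[5:], records[:5]]

seq1 = distributed_sequence_index(run1)
seq2 = distributed_sequence_index(run2)

# Both runs use index values 0..9, but index 0 points at different rows.
print(seq1[0][0], seq2[0][0])  # RECORD_1001 RECORD_1006

# Keyed by record_id, both runs describe exactly the same mapping.
print(index_by_column(run1) == index_by_column(run2))  # True
```

   In this model the index values themselves are always 0 to N-1; only the 
assignment of values to rows changes, which is why operations that align two 
DataFrames on the default index are the ones affected.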
   
   ### Does this PR introduce _any_ user-facing change?
   
   No API changes, but the note will be added to user-facing documentation.
   
   <img width="770" alt="Screenshot 2025-02-26 at 5 02 18 PM" src="https://github.com/user-attachments/assets/fbd351a7-1646-429e-98cf-69df02933957" />
   
   ### How was this patch tested?
   
   Manually tested, and also the existing CI should pass.
   
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


