In this blog post
https://sergei-ivanov.medium.com/why-you-should-not-use-randomsplit-in-pyspark-to-split-data-into-train-and-test-58576d539a36
I noticed a recommendation against using randomSplit to split data, because 
of the sorting it performs. Is the information in the blog post accurate? My 
understanding is that Spark sorts the data as part of partitioning it. Could 
anyone clarify whether we should continue using randomSplit to divide our data 
into training and test sets, or use filter() instead?

Thank you

