Hi all,

tl;dr: IMHO repartition(n) should be deprecated or red-flagged, so that everybody understands the consequences of using this method.
Following the conversation in https://issues.apache.org/jira/browse/SPARK-38388 (still relevant for recent versions of Spark), I think it's very important to mark this function somehow and to alert end users to the consequences of using it.

Basically, it may produce duplicates and data loss under retries for several kinds of input. Non-deterministic input is one of them, but more importantly so is input that is deterministic yet might not produce exactly the same results because of the precision of doubles (and floats), even in very simple queries like the following:

    sqlContext.sql(
      " SELECT integerColumn, SUM(someDoubleTypeValue) AS value FROM data GROUP BY integerColumn "
    ).repartition(3)

(see the comment from Tom in the ticket)

As an end user, I'd expect the retry mechanism to work consistently and neither drop data silently nor produce duplicates.

Any thoughts?

Thanks in advance,
Igor
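P.S. To make the point about doubles concrete, here is a minimal sketch in plain Python (no Spark involved). It only shows that adding the same doubles in a different order can give a result that differs in the last bit, which is how I understand a retried SUM over shuffled data can return a slightly different value. The helper name naive_sum and the sample values are made up for illustration, and the "first attempt" / "retry" labels are an assumption about arrival order, not something taken from Spark itself.

```python
def naive_sum(values):
    """Add doubles strictly left to right, with no compensation.

    Hypothetical helper for illustration. The built-in sum() is avoided on
    purpose because newer Python versions compensate for rounding error.
    """
    total = 0.0
    for v in values:
        total += v
    return total


# The same three values, arriving in two different orders
# (assumed here to stand for a first attempt and a retry).
first_attempt = naive_sum([0.1, 0.2, 0.3])
retry_attempt = naive_sum([0.3, 0.2, 0.1])

print(first_attempt)                   # 0.6000000000000001
print(retry_attempt)                   # 0.6
print(first_attempt == retry_attempt)  # False
```

So the input is deterministic as a set of values, but the aggregated double is not guaranteed to be bit-for-bit identical between attempts.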