Hi all,
TL;DR: IMHO repartition(n) should be deprecated or at least red-flagged, so that
everyone understands the consequences of using this method.

Following the conversation in https://issues.apache.org/jira/browse/SPARK-38388
(still relevant for recent versions of Spark), I think it's very important
to mark this function somehow and to alert end users to the consequences of
using it.

Basically, it may produce duplicates and data loss under retries for several
kinds of input: non-deterministic input, but more importantly input that is
deterministic yet might not produce exactly the same results because of the
limited precision of doubles (and floats), even in very simple queries like
the following:

sqlContext.sql(
"""SELECT integerColumn, SUM(someDoubleTypeValue) AS value
   FROM data
   GROUP BY integerColumn"""
).repartition(3)

(see Tom's comment in the ticket)
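
To illustrate the precision point: floating-point addition is not associative, so the same doubles summed in a different order can differ in the last bit. Below is a minimal Python sketch of that effect only; it does not touch Spark. The helper naive_sum is hypothetical and is used instead of the built-in sum, which may apply compensated summation in recent Python versions.

```python
def naive_sum(values):
    """Add values strictly left to right, like a simple running aggregate."""
    total = 0.0
    for v in values:
        total += v
    return total

# Same three values, two different arrival orders.
forward = naive_sum([0.1, 0.2, 0.3])
backward = naive_sum([0.3, 0.2, 0.1])

print(forward == backward)  # False: the two results differ in the last bit
```

If the rows feeding SUM(someDoubleTypeValue) arrive in a different order on a retry, the aggregated value can therefore change slightly even though the input data is the same.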

As an end user, I'd expect the retry mechanism to work consistently and
neither drop data silently nor produce duplicates.
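
To make the failure mode concrete, here is a toy Python model of position-based (round-robin) partitioning under a partial retry. It is a simplified sketch, not Spark's implementation: the function round_robin, the row labels, and the assumption that only partition 0 is recomputed are all made up for illustration.

```python
def round_robin(rows, num_partitions):
    """Assign rows to partitions purely by their position in the input."""
    partitions = [[] for _ in range(num_partitions)]
    for position, row in enumerate(rows):
        partitions[position % num_partitions].append(row)
    return partitions

all_rows = ["a", "b", "c", "d", "e", "f"]

# First attempt: rows arrive in this order.
first_attempt = round_robin(all_rows, 3)

# Retried attempt: same rows, but two of them swap places
# (assumed here to stand in for any change in upstream row order).
retry_attempt = round_robin(["b", "a", "c", "d", "e", "f"], 3)

# Assume only partition 0 is recomputed; partitions 1 and 2 are reused.
final = retry_attempt[0] + first_attempt[1] + first_attempt[2]

duplicates = sorted(r for r in set(final) if final.count(r) > 1)
missing = sorted(set(all_rows) - set(final))
print(duplicates, missing)  # ['b'] ['a']
```

In this model the row count stays at six, so the problem is invisible unless the contents are compared: one row is duplicated and another is silently lost.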

Any thoughts?
Thanks in advance,
Igor
