I don't have a good answer here, but I seem to recall that this is because of SQL semantics, which follow column ordering rather than naming when performing operations like this. It may well be intended behaviour.
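To illustrate, here is a minimal PySpark sketch of the positional behaviour as I understand it (the table and column names are made up here, not taken from the attached example):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Overwrite only the partitions present in the incoming data.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    df = spark.createDataFrame(
        [(1, "2021-03-01", "a"), (2, "2021-03-02", "b")],
        ["id", "day", "value"],
    )

    # saveAsTable moves the partition column to the end, so the stored
    # layout becomes (id, value, day) even though the dataframe was
    # written as (id, day, value).
    df.write.partitionBy("day").saveAsTable("events")

    # A new increment keeps the original (id, day, value) column order.
    increment = (df.filter(F.col("day") == "2021-03-02")
                   .withColumn("value", F.lit("b2")))

    # insertInto resolves columns by position, not by name: the "day"
    # values land in the table's "value" column and vice versa; a type
    # error at best, silent data corruption at worst.
    increment.write.insertInto("events", overwrite=True)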
On Tue, Mar 2, 2021 at 6:10 AM Oldrich Vlasic <oldrich.vla...@datasentics.com> wrote:

> Hi,
>
> I have encountered a weird and potentially dangerous behaviour of Spark
> concerning partial overwrites of partitioned data. Not sure if this is a
> bug or just an abstraction leak. I have checked the Spark section of
> Stack Overflow and haven't found any relevant questions or answers.
>
> Full minimal working example provided as attachment. Tested on Databricks
> runtime 7.3 LTS ML (Spark 3.0.1). Short summary:
>
> Write a dataframe partitioned by a column using saveAsTable. Filter out
> part of the dataframe, change some values (simulating a new increment of
> data) and write again, overwriting a subset of partitions using
> insertInto. This operation will either fail on schema mismatch or cause
> data corruption.
>
> Reason: on the first write, the ordering of the columns is changed (the
> partition column is placed at the end). On the second write this is not
> taken into consideration and Spark tries to insert values into the
> columns based on their order and not on their name. If they have
> different types this will fail. If not, values will be written to
> incorrect columns, causing data corruption.
>
> My question: is this a bug or intended behaviour? Can something be done
> about it to prevent it? The issue can be avoided by doing a select with
> the schema loaded from the target table. However, when the user is not
> aware of this, it could cause hard-to-track-down errors in data.
>
> Best regards,
> Oldřich Vlašic
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
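For the workaround Oldřich mentions, something along these lines should keep the write aligned with the target table (continuing the made-up names from the sketch above):

    # Reorder the increment to match the target table's column order
    # before calling insertInto.
    target_cols = spark.table("events").columns   # e.g. ['id', 'value', 'day']
    increment.select(*target_cols).write.insertInto("events", overwrite=True)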