We recently switched to Spark 1.5.0 from 1.4.1 and have noticed some inconsistent behavior in persisting DataFrames.
df1 = sqlContext.read.parquet("df1.parquet")
df1.count()
> 161,100,982

df2 = sqlContext.read.parquet("df2.parquet")
df2.count()
> 67,498,706

join_df = df1.join(df2, 'id')
join_df.count()
> 160,608,147

join_df.write.parquet("join.parquet")
join_parquet = sqlContext.read.parquet("join.parquet")
join_parquet.count()
> 67,698,892

join_df.write.json("join.json")
join_json = sqlContext.read.json("join.json")
join_json.count()
> 67,695,663

The first major issue is that the count of the join DataFrame and the count of the persisted join DataFrame differ by nearly 100 million rows. Second, persisting the same DataFrame in two different formats yields two different counts.

Does anyone have any idea what could be going on here?

--
Colin Alstad
Data Scientist
colin.als...@pokitdok.com
<http://www.pokitdok.com/>
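P.S. For what it's worth, below is the sort of sanity check I was planning to run next. It is only a rough sketch (the "join_cached.parquet" path and the repeated counts are just for illustration), but the idea is to see whether the join result itself is unstable across recomputations, or whether rows only go missing on the write/read round trip.

# Rough sketch, using the same sqlContext, df1, and df2 as above.
join_df = df1.join(df2, 'id')

# Each count() on an un-persisted DataFrame re-runs the join, so if these two
# numbers disagree, the join itself is nondeterministic rather than the write.
print(join_df.count())
print(join_df.count())

# Cache so a single materialized copy backs both the count and the write.
join_df.cache()
print(join_df.count())

# Write the cached result and read it back, to see whether the counts agree
# once one cached copy is used for both.
join_df.write.parquet("join_cached.parquet")
print(sqlContext.read.parquet("join_cached.parquet").count())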