We recently upgraded from Spark 1.4.1 to 1.5.0 and have noticed some
inconsistent behavior when persisting DataFrames.

df1 = sqlContext.read.parquet("df1.parquet")
df1.count()
> 161,100,982

df2 = sqlContext.read.parquet("df2.parquet")
df2.count()
> 67,498,706

join_df = df1.join(df2, 'id')
join_df.count()
> 160,608,147

join_df.write.parquet("join.parquet")
join_parquet = sqlContext.read.parquet("join.parquet")
join_parquet.count()
> 67,698,892

join_df.write.json("join.json")
join_json = sqlContext.read.json("join.json")
join_json.count()
> 67,695,663

The first major issue is that the persisted join DataFrame has less than
half as many rows as the in-memory join DataFrame (67.7 million vs.
160.6 million). Secondly, persisting the same DataFrame in two different
formats yields different row counts.
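
One thing that might be worth ruling out is the join being recomputed for
each action, since count() and each write kick off separate jobs. Roughly
what I have in mind (untested sketch, the output path is just a placeholder):

# Untested sketch: cache the join so count() and the write operate on the
# same materialized data instead of recomputing the join per action.
join_df = df1.join(df2, 'id').cache()
join_df.count()   # materializes the cached join

join_df.write.parquet("join_cached.parquet")   # placeholder path
sqlContext.read.parquet("join_cached.parquet").count()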

Does anyone have any idea what could be going on here?

-- 
Colin Alstad
Data Scientist
colin.als...@pokitdok.com

<http://www.pokitdok.com/>
