Re: A DataFrame cache bug

2017-02-26 Thread Liang-Chi Hsieh
t;overwrite").parquet(dir) >>> val df = spark.read.parquet(dir) >>> df.count // output 100 which is correct >>> f(df).count // output 89 which is correct >>> >>> spark.range(1000).write.mode("overwrite").parquet(dir) >>> spark.cata
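A quick way to check whether a count in the snippets above and below is served from the (possibly stale) cache is to inspect the physical plan. This probe is not from the thread itself, just a diagnostic sketch, with df1 as in the fragments quoted further down:

    df1.filter("id>10").explain()
    // An InMemoryTableScan node in the plan means Spark is answering from
    // cached data rather than re-scanning the files under dir.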

Re: A DataFrame cache bug

2017-02-22 Thread gen tang
>> …f(df).count // output 89 which is correct
>>
>> spark.range(1000).write.mode("overwrite").parquet(dir)
>> spark.catalog.refreshByPath(dir) // insert a NEW statement
>> val df1 = spark.read.parquet(dir)
>> df1.count // output 1000 which is correct; in fact other operations except
>> df1.filter("id>10") return the correct result.

Re: A DataFrame cache bug

2017-02-21 Thread gen tang
> …parquet(dir)
> spark.catalog.refreshByPath(dir) // insert a NEW statement
> val df1 = spark.read.parquet(dir)
> df1.count // output 1000 which is correct; in fact other operations except
> df1.filter("id>10") return the correct result.
> f(df1).count // output 89 which is incorrect

Re: A DataFrame cache bug

2017-02-21 Thread Kazuaki Ishizaki
t;id>10") return correct result. f(df1).count // output 89 which is incorrect Regards, Kazuaki Ishizaki From: gen tang To: dev@spark.apache.org Date: 2017/02/22 15:02 Subject:Re: A DataFrame cache bug Hi All, I might find a related issue on jira: https://issues.apa

Re: A DataFrame cache bug

2017-02-21 Thread gen tang
Hi All,

I might have found a related issue on JIRA:
https://issues.apache.org/jira/browse/SPARK-15678
This issue is closed; maybe we should reopen it.

Thanks
Cheers
Gen

On Wed, Feb 22, 2017 at 1:57 PM, gen tang wrote:
> Hi All,
>
> I found a strange bug which is related to reading data from an updated path and …
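On versions where SPARK-15678 is unfixed, dropping the cached entries before re-reading the path should work around the stale result. This is not a fix proposed in the thread, just a minimal sketch using the public Catalog API, with dir and f as in the original report below:

    spark.catalog.clearCache()       // drop all cached entries, including the stale one
    spark.catalog.refreshByPath(dir) // refresh the cached file listing for the path
    val df2 = spark.read.parquet(dir)
    f(df2).count                     // recomputed from the fresh files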

A DataFrame cache bug

2017-02-21 Thread gen tang
Hi All,

I found a strange bug which is related to reading data from an updated path and the cache operation. Please consider the following code:

import org.apache.spark.sql.DataFrame

def f(data: DataFrame): DataFrame = {
  val df = data.filter("id>10")
  df.cache
  df.count
  df
}

f(spark.range(100) …
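The archive truncates every message, so below is a consolidated sketch of the full reproduction, assembled from the fragments quoted across this thread. The SparkSession setup, the value of dir, and the .toDF conversion are assumptions added to make it self-contained; the outputs in the comments are the ones reported in the thread:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    // Assumed setup; the thread's code runs in spark-shell, where `spark` already exists.
    val spark = SparkSession.builder
      .appName("dataframe-cache-bug")
      .master("local[*]")
      .getOrCreate()

    val dir = "/tmp/cache_bug_data" // hypothetical path; the thread never shows it

    def f(data: DataFrame): DataFrame = {
      val df = data.filter("id>10")
      df.cache // register the filtered plan with the cache manager
      df.count // materialize the cache
      df
    }

    f(spark.range(100).toDF).count // 89, correct (ids 11..99)

    spark.range(100).write.mode("overwrite").parquet(dir)
    val df = spark.read.parquet(dir)
    df.count    // 100, correct
    f(df).count // 89, correct

    spark.range(1000).write.mode("overwrite").parquet(dir)
    spark.catalog.refreshByPath(dir) // the NEW statement the thread inserts
    val df1 = spark.read.parquet(dir)
    df1.count    // 1000, correct
    f(df1).count // 89 on affected versions, where 989 is expected: the stale
                 // cached filter plan is matched and reused

On versions that carry the eventual fix, the last count should return 989, since the filter is recomputed against the refreshed files.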