Yeah, it is not related to the timezone. I think you hit this issue: <https://issues.apache.org/jira/browse/SPARK-3173>. It was fixed after the 1.1 release.
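Until you can pick up a build with that fix, one possible workaround on 1.1.0 is to apply the timestamp predicate on the RDD itself, bypassing the SQL parser entirely. A rough, untested sketch, reusing the T case class and the sample data from the thread below:

case class T(a: String, ts: java.sql.Timestamp)
val data = sc.parallelize(1325548800000L :: 1335548800000L :: Nil)
  .map(i => T(i.toString, new java.sql.Timestamp(i)))
// valueOf parses the string in the JVM's local timezone.
val cutoff = java.sql.Timestamp.valueOf("1970-01-01 00:00:00")
// Compare Timestamp objects directly; !before(cutoff) means ts >= cutoff,
// so no string-to-timestamp casting happens inside the SQL layer.
val result = data.filter(t => !t.ts.before(cutoff)).map(_.a).collect()
// expected: Array(1325548800000, 1335548800000)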
On Mon, Oct 13, 2014 at 11:24 AM, Mohammed Guller <moham...@glassbeam.com> wrote:

> Good guess, but that is not the reason. Look at this code:
>
> scala> val data = sc.parallelize(1325548800000L::1335548800000L::Nil).map(i=> T(i.toString, new java.sql.Timestamp(i)))
> data: org.apache.spark.rdd.RDD[T] = MappedRDD[17] at map at <console>:17
>
> scala> data.collect
> res3: Array[T] = Array(T(1325548800000,2012-01-02 16:00:00.0), T(1335548800000,2012-04-27 10:46:40.0))
>
> scala> data.registerTempTable("x")
>
> scala> val s = sqlContext.sql("select a from x where ts>='1970-01-01 00:00:00';")
> s: org.apache.spark.sql.SchemaRDD =
> SchemaRDD[20] at RDD at SchemaRDD.scala:103
> == Query Plan ==
> == Physical Plan ==
> Project [a#2]
>  ExistingRdd [a#2,ts#3], MapPartitionsRDD[22] at mapPartitions at basicOperators.scala:208
>
> scala> s.collect
> res5: Array[org.apache.spark.sql.Row] = Array()
>
> Mohammed
>
> From: Yin Huai [mailto:huaiyin....@gmail.com]
> Sent: Monday, October 13, 2014 7:19 AM
> To: Mohammed Guller
> Cc: Cheng, Hao; Cheng Lian; user@spark.apache.org
> Subject: Re: Spark SQL parser bug?
>
> Seems the reason that you got "wrong" results was caused by timezone.
>
> The time in java.sql.Timestamp(long time) means "milliseconds since January 1, 1970, 00:00:00 GMT. A negative number is the number of milliseconds before January 1, 1970, 00:00:00 GMT."
>
> However, in ts>='1970-01-01 00:00:00', '1970-01-01 00:00:00' is using your local timezone.
>
> Thanks,
> Yin
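To make the timezone point above concrete, here is a small sketch that runs in a plain Scala shell, no Spark required. The printed values assume a machine on US Pacific time (GMT-8), which matches the 2012-01-02 16:00:00.0 rendering in the transcript above; other timezones will print different local times:

scala> val ts = new java.sql.Timestamp(1325548800000L)  // epoch millis, i.e. 2012-01-03 00:00:00 GMT
ts: java.sql.Timestamp = 2012-01-02 16:00:00.0          // toString renders the instant in the local timezone

scala> java.sql.Timestamp.valueOf("2012-01-03 00:00:00").getTime  // valueOf parses in the local timezone
res0: Long = 1325577600000                              // 8 hours (28800000 ms) after the GMT instant above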
> On Mon, Oct 13, 2014 at 9:58 AM, Mohammed Guller <moham...@glassbeam.com> wrote:
>
> Hi Cheng,
>
> I am using version 1.1.0.
>
> Looks like that bug was fixed sometime after 1.1.0 was released. Interestingly, I tried your code on 1.1.0 and it gives me a different (incorrect) result:
>
> case class T(a:String, ts:java.sql.Timestamp)
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.createSchemaRDD
> val data = sc.parallelize(10000::20000::Nil).map(i=> T(i.toString, new java.sql.Timestamp(i)))
> data.registerTempTable("x")
> val s = sqlContext.sql("select a from x where ts>='1970-01-01 00:00:00';")
>
> scala> s.collect
> res1: Array[org.apache.spark.sql.Row] = Array()
>
> Mohammed
>
> From: Cheng, Hao [mailto:hao.ch...@intel.com]
> Sent: Sunday, October 12, 2014 1:35 AM
> To: Mohammed Guller; Cheng Lian; user@spark.apache.org
> Subject: RE: Spark SQL parser bug?
>
> Hi, I couldn’t reproduce the bug with the latest master branch. Which version are you using? Can you also list data in the table “x”?
>
> case class T(a:String, ts:java.sql.Timestamp)
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.createSchemaRDD
> val data = sc.parallelize(10000::20000::Nil).map(i=> T(i.toString, new java.sql.Timestamp(i)))
> data.registerTempTable("x")
> val s = sqlContext.sql("select a from x where ts>='1970-01-01 00:00:00';")
> s.collect
>
> output:
> res1: Array[org.apache.spark.sql.Row] = Array([10000], [20000])
>
> Cheng Hao
>
> From: Mohammed Guller [mailto:moham...@glassbeam.com]
> Sent: Sunday, October 12, 2014 12:06 AM
> To: Cheng Lian; user@spark.apache.org
> Subject: RE: Spark SQL parser bug?
>
> I tried even without the “T” and it still returns an empty result:
>
> scala> val sRdd = sqlContext.sql("select a from x where ts >= '2012-01-01 00:00:00';")
> sRdd: org.apache.spark.sql.SchemaRDD =
> SchemaRDD[35] at RDD at SchemaRDD.scala:103
> == Query Plan ==
> == Physical Plan ==
> Project [a#0]
>  ExistingRdd [a#0,ts#1], MapPartitionsRDD[37] at mapPartitions at basicOperators.scala:208
>
> scala> sRdd.collect
> res10: Array[org.apache.spark.sql.Row] = Array()
>
> Mohammed
>
> From: Cheng Lian [mailto:lian.cs....@gmail.com]
> Sent: Friday, October 10, 2014 10:14 PM
> To: Mohammed Guller; user@spark.apache.org
> Subject: Re: Spark SQL parser bug?
>
> Hmm, there is a “T” in the timestamp string, which makes the string not a valid timestamp string representation. Internally Spark SQL uses java.sql.Timestamp.valueOf to cast a string to a timestamp.
>
> On 10/11/14 2:08 AM, Mohammed Guller wrote:
>
> scala> rdd.registerTempTable("x")
>
> scala> val sRdd = sqlContext.sql("select a from x where ts >= '2012-01-01T00:00:00';")
> sRdd: org.apache.spark.sql.SchemaRDD =
> SchemaRDD[4] at RDD at SchemaRDD.scala:103
> == Query Plan ==
> == Physical Plan ==
> Project [a#0]
>  ExistingRdd [a#0,ts#1], MapPartitionsRDD[6] at mapPartitions at basicOperators.scala:208
>
> scala> sRdd.collect
> res2: Array[org.apache.spark.sql.Row] = Array()
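As a quick check of Cheng Lian's point about the "T": java.sql.Timestamp.valueOf accepts only the JDBC timestamp escape format, yyyy-[m]m-[d]d hh:mm:ss[.f...], so the ISO-8601 "T" separator is rejected outright. A minimal sketch in a plain Scala shell (the exception message shown is the stock JDK one):

scala> java.sql.Timestamp.valueOf("2012-01-01 00:00:00")
res0: java.sql.Timestamp = 2012-01-01 00:00:00.0

scala> java.sql.Timestamp.valueOf("2012-01-01T00:00:00")
java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]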