Preprocessing (after loading the data into HDFS). I started with data in JSON format in text files stored in HDFS, then loaded that data into Parquet files with a bit of preprocessing. Now I always retrieve the data by creating a SchemaRDD from the Parquet file and using the SchemaRDD to back a table in a SQLContext.
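A minimal PySpark sketch of that flow, assuming a Spark 1.1-era API; the HDFS paths are hypothetical placeholders, and the table name just reuses "Stocks" from the thread below:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="json-to-parquet")
    sqlContext = SQLContext(sc)

    # One-time preprocessing: read the raw JSON text files from HDFS.
    # jsonFile infers a schema and returns a SchemaRDD.
    raw = sqlContext.jsonFile("hdfs:///data/stocks/json/")

    # ... any per-record preprocessing would happen here ...

    # Persist as Parquet so later jobs skip the JSON parsing entirely.
    raw.saveAsParquetFile("hdfs:///data/stocks/parquet")

    # Every subsequent job: load the Parquet file back as a SchemaRDD
    # and register it to back a table in the SQLContext.
    stocks = sqlContext.parquetFile("hdfs:///data/stocks/parquet")
    stocks.registerTempTable("Stocks")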
On Fri, Sep 5, 2014 at 9:53 AM, Benjamin Zaitlen <quasi...@gmail.com> wrote:

> Hi Brad,
>
> When you do the conversion, is this a Hive/Spark job or is it a
> pre-processing step before loading into HDFS?
>
> ---Ben
>
> On Fri, Sep 5, 2014 at 10:29 AM, Brad Miller <bmill...@eecs.berkeley.edu> wrote:
>
>> My approach may be partly influenced by my limited experience with SQL
>> and Hive, but I just converted all my dates to seconds-since-epoch and then
>> selected samples from specific time ranges using integer comparisons.
>>
>> On Thu, Sep 4, 2014 at 6:38 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
>>
>>> There are two SQL dialects: one is very basic SQL support and the other
>>> is Hive QL. In most cases I think people prefer using HQL, which also
>>> means you have to use HiveContext instead of SQLContext.
>>>
>>> In the particular query you showed, datetime seems to be of type Date;
>>> unfortunately, neither SQL dialect supports Date, only Timestamp.
>>>
>>> Cheng Hao
>>>
>>> *From:* Benjamin Zaitlen [mailto:quasi...@gmail.com]
>>> *Sent:* Friday, September 05, 2014 5:37 AM
>>> *To:* user@spark.apache.org
>>> *Subject:* TimeStamp selection with SparkSQL
>>>
>>> I may have missed this, but is it possible to select on datetime in a
>>> Spark SQL query?
>>>
>>> jan1 = sqlContext.sql("SELECT * FROM Stocks WHERE datetime = '2014-01-01'")
>>>
>>> Additionally, is there a guide as to what SQL is valid? The guide says,
>>> "Note that Spark SQL currently uses a very basic SQL parser." It would be
>>> great to post what is currently supported.
>>>
>>> --Ben
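A minimal sketch of the seconds-since-epoch workaround described above, assuming the Stocks table from the earlier messages and a hypothetical integer column named epoch_secs holding the converted dates:

    import calendar
    from datetime import datetime

    # Integer boundaries (seconds since the epoch, UTC) for Jan 1, 2014.
    start = calendar.timegm(datetime(2014, 1, 1).timetuple())
    end = calendar.timegm(datetime(2014, 1, 2).timetuple())

    # The date filter becomes a plain integer range comparison, which
    # both SQL dialects handle.
    jan1 = sqlContext.sql(
        "SELECT * FROM Stocks "
        "WHERE epoch_secs >= %d AND epoch_secs < %d" % (start, end))

Per Cheng Hao's note, the alternative is to store the column as a Timestamp rather than a Date, since Timestamp is the type both dialects do support.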