Preprocessing (after loading the data into HDFS).

I started with the data as JSON in text files stored in HDFS. I then did a
bit of preprocessing and loaded the data into Parquet files, and now I
always retrieve it by creating a SchemaRDD from the Parquet file and using
that SchemaRDD to back a table in a SQLContext.
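
For reference, here is a minimal sketch of that workflow in PySpark
(assuming the Spark 1.1 SQLContext API; the HDFS paths and table name are
made up):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="json-to-parquet")
    sqlContext = SQLContext(sc)

    # One-time preprocessing: infer a schema from the JSON text files in
    # HDFS and write the result back out as Parquet.
    raw = sqlContext.jsonFile("hdfs:///data/raw/stocks.json")
    raw.saveAsParquetFile("hdfs:///data/parquet/stocks")

    # From then on, load the Parquet file as a SchemaRDD and use it to
    # back a table in the SQLContext.
    stocks = sqlContext.parquetFile("hdfs:///data/parquet/stocks")
    stocks.registerTempTable("stocks")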


On Fri, Sep 5, 2014 at 9:53 AM, Benjamin Zaitlen <quasi...@gmail.com> wrote:

> Hi Brad,
>
> When you do the conversion, is it a Hive/Spark job or a pre-processing
> step before loading into HDFS?
>
> ---Ben
>
>
> On Fri, Sep 5, 2014 at 10:29 AM, Brad Miller <bmill...@eecs.berkeley.edu>
> wrote:
>
>> My approach may be partly influenced by my limited experience with SQL
>> and Hive, but I just converted all my dates to seconds since the epoch and
>> then selected samples from specific time ranges using integer comparisons.
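>>
>> As an illustration of that trick (not Brad's actual code; the field and
>> table names here are invented):
>>
>>     import calendar
>>     import time
>>
>>     # Preprocessing: replace the date string in each record with
>>     # integer seconds since the Unix epoch (UTC).
>>     def to_epoch(record):
>>         parsed = time.strptime(record['datetime'], '%Y-%m-%d')
>>         record['datetime'] = calendar.timegm(parsed)
>>         return record
>>
>>     # Querying: a date range becomes a plain integer comparison.
>>     # 1388534400 = 2014-01-01 00:00:00 UTC; 1388620800 = 2014-01-02.
>>     jan1 = sqlContext.sql(
>>         "SELECT * FROM stocks "
>>         "WHERE datetime >= 1388534400 AND datetime < 1388620800")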
>>
>>
>> On Thu, Sep 4, 2014 at 6:38 PM, Cheng, Hao <hao.ch...@intel.com> wrote:
>>
>>> There are two SQL dialects: one is a very basic SQL dialect, and the
>>> other is Hive QL (HQL). In most cases I think people prefer HQL, which
>>> also means you have to use a HiveContext instead of the SQLContext.
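>>>
>>> In PySpark, for example, switching dialects is just a matter of which
>>> context you construct (a minimal sketch, assuming the Spark 1.x API):
>>>
>>>     from pyspark.sql import HiveContext
>>>
>>>     # HiveContext supports everything SQLContext does, plus HiveQL.
>>>     sqlContext = HiveContext(sc)
>>>     results = sqlContext.sql("SELECT * FROM stocks LIMIT 10")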
>>>
>>>
>>>
>>> In the particular query you showed, it seems datetime is of type Date;
>>> unfortunately, neither of those SQL dialects supports Date, only Timestamp.
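>>>
>>> One workaround is to compare against a Timestamp instead, for example by
>>> casting the literal (a hypothetical sketch, reusing the table from the
>>> query below):
>>>
>>>     jan1 = sqlContext.sql(
>>>         "SELECT * FROM Stocks "
>>>         "WHERE datetime >= CAST('2014-01-01 00:00:00' AS TIMESTAMP) "
>>>         "AND datetime < CAST('2014-01-02 00:00:00' AS TIMESTAMP)")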
>>>
>>>
>>>
>>> Cheng Hao
>>>
>>>
>>>
>>> *From:* Benjamin Zaitlen [mailto:quasi...@gmail.com]
>>> *Sent:* Friday, September 05, 2014 5:37 AM
>>> *To:* user@spark.apache.org
>>> *Subject:* TimeStamp selection with SparkSQL
>>>
>>>
>>>
>>> I may have missed this, but is it possible to select on a datetime in a
>>> SparkSQL query?
>>>
>>>
>>>
>>> jan1 = sqlContext.sql("SELECT * FROM Stocks WHERE datetime =
>>> '2014-01-01'")
>>>
>>>
>>>
>>> Additionally, is there a guide as to what SQL is valid? The guide says,
>>> "Note that Spark SQL currently uses a very basic SQL parser." It would be
>>> great to post what is currently supported.
>>>
>>>
>>>
>>> --Ben
>>>
