Thank you! Databricks rules!!!!


On Fri, Jul 4, 2014 at 1:58 PM, Michael Armbrust <mich...@databricks.com>
wrote:

>> sqlContext.jsonFile("data.json")  <---- Is this already available in the
>> master branch???
>>
>
> Yes, and it will be available in the upcoming 1.0.1 release.
>
>
>> But the question about using a combination of resources (memory
>> processing and disk processing) still remains.
>>
>
> This code should work just fine off of disk.  I would not recommend trying
> to cache the JSON data in memory, as it is heavily nested and that is a
> case the columnar storage code does not handle well.  Instead, maybe try
> converting it to Parquet and reading that data from disk
> (tweets.saveAsParquetFile(...);
> sqlContext.parquetFile(...).registerAsTable(...)).  You should see improved
> compression and much better performance for queries that only read some of
> the columns.  You could also pull out just the relevant columns and cache
> only that data in memory:
>
> sqlContext.jsonFile("data.json").registerAsTable("allTweets")
> sql("SELECT text FROM allTweets").registerAsTable("tweetText")
> sqlContext.cacheTable("tweetText")
>
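
For reference, here is a minimal sketch of the Parquet route Michael describes
above, written against the 1.0.1 API as I understand it.  The file paths, the
"tweets" table name, and the local master are placeholders, not anything from
the thread:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("tweets").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// Load the heavily nested JSON once and persist it as columnar Parquet on disk.
val tweets = sqlContext.jsonFile("data.json")
tweets.saveAsParquetFile("tweets.parquet")

// Re-read the Parquet files and register them as a table for SQL queries.
// Only the columns a query touches get scanned, which keeps reads cheap.
sqlContext.parquetFile("tweets.parquet").registerAsTable("tweets")
sqlContext.sql("SELECT text FROM tweets").take(10).foreach(println)

And after the cacheTable call in the last quoted snippet, subsequent queries
such as sqlContext.sql("SELECT text FROM tweetText") should be served from the
in-memory columnar cache rather than re-parsing the JSON.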
