Re: SQL Filter of tweets (json) running on Disk

Michael Armbrust Fri, 04 Jul 2014 11:59:23 -0700

>
> sqlContext.jsonFile("data.json")  <---- Is this already available in the
> master branch???
>


Yes, and it will also be available in the upcoming 1.0.1 release.


> But the question about using a combination of resources (Memory
> processing & Disk processing) still remains.
>

This code should work just fine off of disk.  I would not recommend trying
to cache the JSON data in memory: it is heavily nested, and that is a case
the in-memory columnar storage code does not handle well.  Instead, maybe
try converting it to Parquet and reading that data from disk
(tweets.saveAsParquetFile(...);
 sqlContext.parquetFile(...).registerAsTable(...)).  You should see improved
compression and much better performance for queries that only read some of
the columns.  You could also just pull out the relevant columns and cache
only that data in memory:

sqlContext.jsonFile("data.json").registerAsTable("allTweets")
sqlContext.sql("SELECT text FROM allTweets").registerAsTable("tweetText")
sqlContext.cacheTable("tweetText")
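
For the Parquet route mentioned above, here is a minimal sketch of what the
one-time conversion might look like as a standalone job against the 1.0.x
API.  The object name, the app name, and the "tweets.parquet" output path are
made up for illustration, and it assumes the master is supplied by
spark-submit and that the nested tweet schema is one the Parquet writer in
your build accepts:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical driver; names and paths are placeholders, not from the thread.
object TweetsToParquet {
  def main(args: Array[String]): Unit = {
    // Master URL is expected to come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("TweetsToParquet"))
    val sqlContext = new SQLContext(sc)

    // One-time conversion: infer the schema from the JSON and write Parquet.
    val tweets = sqlContext.jsonFile("data.json")
    tweets.saveAsParquetFile("tweets.parquet") // assumed output path

    // Later queries read the Parquet copy from disk; only the columns
    // named in the query need to be scanned.
    sqlContext.parquetFile("tweets.parquet").registerAsTable("tweets")
    val texts = sqlContext.sql("SELECT text FROM tweets")

    // Print a small sample to check the result.
    texts.take(10).foreach(println)

    sc.stop()
  }
}
```

The conversion only needs to run once; after that, jobs can start directly
from parquetFile(...) and skip the JSON parsing and schema inference.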
