Is there a way to query fields by similarity (as in Lucene, or using some similarity metric), so that one could write something like WHERE language LIKE "it~0.5"?
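[Editor's note: Spark SQL has no built-in Lucene-style fuzzy operator, so a query like LIKE "it~0.5" would need a user-defined similarity function applied before (or instead of) the SQL layer. A minimal sketch in plain Scala; the `similarity` helper and the threshold semantics are assumptions for illustration, not an existing API:]

```scala
object FuzzyMatch {
  // Classic dynamic-programming Levenshtein edit distance
  def levenshtein(a: String, b: String): Int = {
    val d = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
      if (i == 0) j else if (j == 0) i else 0
    }
    for (i <- 1 to a.length; j <- 1 to b.length) {
      val cost = if (a(i - 1) == b(j - 1)) 0 else 1
      d(i)(j) = math.min(
        math.min(d(i - 1)(j) + 1, d(i)(j - 1) + 1),
        d(i - 1)(j - 1) + cost)
    }
    d(a.length)(b.length)
  }

  // Normalised similarity in [0, 1]: 1.0 means identical strings
  def similarity(a: String, b: String): Double = {
    val maxLen = math.max(a.length, b.length)
    if (maxLen == 0) 1.0 else 1.0 - levenshtein(a, b).toDouble / maxLen
  }
}

// e.g. keep rows whose language field is "close enough" to "it":
// interactions.filter(i => FuzzyMatch.similarity(i.language, "it") >= 0.5)
```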
Best,
Flavio

On Thu, May 22, 2014 at 8:56 AM, Michael Cutler <mich...@tumra.com> wrote:

> Hi Nick,
>
> Here is an illustrated example which extracts certain fields from Facebook
> messages. Each one is a JSON object, and they are serialised into files
> with one complete JSON object per line. Example of one such message:
> CandyCrush.json <https://gist.github.com/cotdp/131a1c9fc620ab7898c4>
>
> You need to define a case class which has all the fields you'll be able to
> query later in SQL, e.g.
>
> case class CandyCrushInteraction(id: String, user: String, level: Int,
>     gender: String, language: String)
>
> The basic objective is to use Spark to convert the file from
> RDD[String] --parse JSON--> RDD[JValue] --extract fields-->
> RDD[CandyCrushInteraction]
>
> // Produces an RDD[String]
> val lines = sc.textFile("facebook-2014-05-19.json")
>
> // Process the messages
> val interactions = lines.map(line => {
>   // Parse the JSON, returns RDD[JValue]
>   parse(line)
> }).filter(json => {
>   // Filter out only 'Candy Crush Saga' Facebook App activity
>   (json \ "facebook" \ "application").extract[String] == "Candy Crush Saga"
> }).map(json => {
>   // Extract fields we want; we use compact() because they may not exist
>   val id = compact(json \ "facebook" \ "id")
>   val user = compact(json \ "facebook" \ "author" \ "hash")
>   val gender = compact(json \ "demographic" \ "gender")
>   val language = compact(json \ "language" \ "tag")
>
>   // Extract the 'level' using a RegEx, or default to zero
>   var level = 0
>   pattern.findAllIn(compact(json \ "interaction" \ "title")).matchData.foreach(m => {
>     level = m.group(1).toInt
>   })
>
>   // Return an RDD[CandyCrushInteraction]
>   CandyCrushInteraction(id, user, level, gender, language)
> })
>
> Now you can register the RDD[CandyCrushInteraction] as a table and query
> it in SQL.
> interactions.registerAsTable("candy_crush_interaction")
>
> // Game level by Gender
> sql("SELECT gender, COUNT(level), MAX(level), MIN(level), AVG(level) FROM candy_crush_interaction WHERE level > 0 GROUP BY gender").collect().foreach(println)
>
> /* Returns:
> ["male",14727,590,1,104.71705031574659]
> ["female",15422,590,1,114.17202697445208]
> ["mostly_male",2824,590,1,97.08852691218131]
> ["mostly_female",1934,590,1,99.0517063081696]
> ["unisex",2674,590,1,113.42071802543006]
> [,11023,590,1,93.45677220357435]
> */
>
> Full working example:
> CandyCrushSQL.scala <https://gist.github.com/cotdp/b5b8155bb85e254d2a3c>
>
> MC
>
> *Michael Cutler*
> Founder, CTO
>
> On 22 May 2014 04:43, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>
>> That's a good idea. So you're saying create a SchemaRDD by applying a
>> function that deserializes the JSON and transforms it into a relational
>> structure, right?
>>
>> The end goal for my team would be to expose some JDBC endpoint for
>> analysts to query from, so once Shark is updated to use Spark SQL that
>> would become possible without having to resort to using Hive at all.
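[Editor's note: the `pattern` val referenced in the level-extraction step of Michael's snippet is defined in the full gist he links; a standalone sketch of that step, with an assumed regex that may differ from the one in the gist:]

```scala
object LevelExtractor {
  // Assumed regex: captures the digits after the word "level" in a title
  // such as "... passed level 576 ..." -- the real pattern lives in the gist.
  private val pattern = """level (\d+)""".r

  // Returns the last matched level in the title, or 0 if none is found,
  // mirroring the var-and-foreach structure of the quoted snippet.
  def extractLevel(title: String): Int = {
    var level = 0
    pattern.findAllIn(title).matchData.foreach(m => level = m.group(1).toInt)
    level
  }
}
```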
>>
>> On Wed, May 21, 2014 at 11:11 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
>>
>>> Hi,
>>>
>>> As far as I understand, if you create an RDD with a relational
>>> structure from your JSON, you should be able to do much of that
>>> already today. For example, take lift-json's deserializer and do
>>> something like
>>>
>>> val json_table: RDD[MyCaseClass] = json_data.flatMap(json =>
>>>   json.extractOpt[MyCaseClass])
>>>
>>> then I guess you can use Spark SQL on that. (Something like your
>>> likes[2] query won't work, though, I guess.)
>>>
>>> Regards,
>>> Tobias
>>>
>>> On Thu, May 22, 2014 at 5:32 AM, Nicholas Chammas
>>> <nicholas.cham...@gmail.com> wrote:
>>> > Looking forward to that update!
>>> >
>>> > Given a table of JSON objects like this one:
>>> >
>>> > {
>>> >   "name": "Nick",
>>> >   "location": {
>>> >     "x": 241.6,
>>> >     "y": -22.5
>>> >   },
>>> >   "likes": ["ice cream", "dogs", "Vanilla Ice"]
>>> > }
>>> >
>>> > It would be SUPER COOL if we could query that table in a way that is as
>>> > natural as follows:
>>> >
>>> > SELECT DISTINCT name
>>> > FROM json_table;
>>> >
>>> > SELECT MAX(location.x)
>>> > FROM json_table;
>>> >
>>> > SELECT likes[2]  -- Ice Ice Baby
>>> > FROM json_table
>>> > WHERE name = "Nick";
>>> >
>>> > Of course, this is just a hand-wavy suggestion of how I'd like to be
>>> > able to query JSON (particularly that last example) using SQL. I'm
>>> > interested in seeing what y'all come up with.
>>> >
>>> > A large part of what my team does is make it easy for analysts to
>>> > explore and query JSON data using SQL. We have a fairly complex
>>> > home-grown process to do that and are looking to replace it with
>>> > something more out of the box. So if you'd like more input on how users
>>> > might use this feature, I'd be glad to chime in.
>>> >
>>> > Nick
>>> >
>>> > On Wed, May 21, 2014 at 11:21 AM, Michael Armbrust
>>> > <mich...@databricks.com> wrote:
>>> >>
>>> >> You can already extract fields from JSON data using Hive UDFs. We have
>>> >> an intern working on better native support this summer. We will be
>>> >> sure to post updates once there is a working prototype.
>>> >>
>>> >> Michael
>>> >>
>>> >> On Tue, May 20, 2014 at 6:46 PM, Nick Chammas
>>> >> <nicholas.cham...@gmail.com> wrote:
>>> >>>
>>> >>> The Apache Drill home page has an interesting heading: "Liberate
>>> >>> Nested Data".
>>> >>>
>>> >>> Is there any current or planned functionality in Spark SQL or Shark
>>> >>> to enable SQL-like querying of complex JSON?
>>> >>>
>>> >>> Nick
>>> >>>
>>> >>> ________________________________
>>> >>> View this message in context: Using Spark to analyze complex JSON
>>> >>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
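[Editor's note: as Tobias points out, a query like likes[2] does not map onto a flat case class; the nested structure in Nick's example can, however, be modelled with nested case classes, after which nested-field and array access are ordinary Scala. The class and field names below mirror Nick's example JSON and are purely illustrative:]

```scala
case class Location(x: Double, y: Double)
case class Person(name: String, location: Location, likes: Seq[String])

object NestedAccess {
  val nick = Person("Nick", Location(241.6, -22.5),
    Seq("ice cream", "dogs", "Vanilla Ice"))

  // Nick's SQL wish-list becomes plain field access once deserialised:
  //   SELECT MAX(location.x) FROM json_table  ->  people.map(_.location.x).max
  //   SELECT likes[2] ... WHERE name = "Nick" ->  nick.likes(2)
  def thirdLike(p: Person): String = p.likes(2)
}
```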