Hello Joe,

The first step is acquiring some data, either through the Facebook API <https://developers.facebook.com/> or a third-party service like Datasift <https://datasift.com/> (paid). Once you have the data somewhere Spark can access it (like HDFS), you can load and manipulate it just like any other data.
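As a rough sketch of that loading step, something like the following should work. The HDFS path, the object name LoadFacebookData and the helper mentioning are made up for illustration, and it assumes the file holds one message per line. The "local[2]" master is only there so you can try it on a laptop; drop it when submitting to a real cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LoadFacebookData {
  // Hypothetical helper: keep only lines that mention the given application
  // name. This is a cheap textual pre-filter, not a JSON-aware one.
  def mentioning(lines: RDD[String], app: String): RDD[String] =
    lines.filter(_.contains(app))

  def main(args: Array[String]): Unit = {
    // Hypothetical location; pass your own path as the first argument.
    val path = args.headOption.getOrElse("hdfs:///data/facebook/facebook-2014-05-19.json")

    // Local master for experimenting only (assumption, not a recommendation).
    val conf = new SparkConf().setAppName("LoadFacebookData").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // RDD[String], assuming one message per line
      val lines = sc.textFile(path)
      println("messages: " + lines.count())
      println("Candy Crush mentions: " + mentioning(lines, "Candy Crush Saga").count())
    } finally {
      sc.stop()
    }
  }
}
```

The same sc.textFile call accepts local paths and directories as well, so you can test against a small local sample before pointing it at HDFS.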
Here is a pretty-printed example JSON message I got from a Datasift <https://datasift.com/> stream this morning. It illustrates an anonymised someone with *clearly too much time on their hands* having reached *level 576* on Candy Crush Saga.

{
  "demographic": {
    "gender": "mostly_female"
  },
  "facebook": {
    "application": "Candy Crush Saga",
    "author": {
      "type": "user",
      "hash_id": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    },
    "caption": "I just completed level 576, scored 494020 points and got 3 stars.",
    "created_at": "Tue, 20 May 2014 03:08:09 +0000",
    "description": "Click here to follow my progress!",
    "id": "100000000000000_123456789012345",
    "link": "http://apps.facebook.com/candycrush/?urlMessage=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
    "name": "Yay, I completed level 576 in Candy Crush Saga!",
    "source": "Candy Crush Saga (123456789012345)",
    "type": "link"
  },
  "interaction": {
    "schema": {
      "version": 3
    },
    "type": "facebook",
    "id": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
    "created_at": "Tue, 20 May 2014 03:08:09 +0000",
    "received_at": 1400555303.6832,
    "author": {
      "type": "user",
      "hash_id": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    },
    "title": "Yay, I completed level 576 in Candy Crush Saga!",
    "link": "http://www.facebook.com/100000000000000_123456789012345",
    "subtype": "link",
    "content": "Click here to follow my progress!",
    "source": "Candy Crush Saga (123456789012345)"
  },
  "language": {
    "tag": "en",
    "tag_extended": "en",
    "confidence": 97
  }
}

Much like processing Twitter streams, the data arrives as a single JSON object on each line. So you need to pass the RDD[String] you get from sc.textFile through a JSON parser. Spark has the json4s <https://github.com/json4s/json4s> and Jackson JSON parsers embedded in the assembly, so you can basically use those for 'free' without having to bundle them in your JAR.

Here is an example Spark job which answers the age-old question: "Who is better at Candy Crush, boys or girls?"
import org.apache.spark.SparkContext._   // pair-RDD operations such as reduceByKey
import org.json4s._
import org.json4s.jackson.JsonMethods._

// We want to extract the level number from "Yay, I completed level 576 in Candy Crush Saga!"
// The actual text will change based on the user's language, but parsing the 'last number' works
val pattern = """(\d+)""".r

// Produces an RDD[String]
val lines = sc.textFile("facebook-2014-05-19.json")

lines.map(line => {
  // Parse the JSON
  parse(line)
}).filter(json => {
  // Keep only 'Candy Crush Saga' activity
  json \ "facebook" \ "application" == JString("Candy Crush Saga")
}).map(json => {
  // Extract the 'level' (the last number in the title) or default to zero
  var level = 0
  pattern.findAllIn( compact(json \ "interaction" \ "title") ).matchData.foreach(m => {
    level = m.group(1).toInt
  })
  // Extract the gender
  val gender = compact(json \ "demographic" \ "gender")
  // Return a tuple of (gender: String, (level: Int, count: Int))
  ( gender, (level, 1) )
}).filter(a => {
  // Filter out entries with a level of zero
  a._2._1 > 0
}).reduceByKey( (a, b) => {
  // Sum the levels and counts so we can average later
  ( a._1 + b._1, a._2 + b._2 )
}).collect().foreach(entry => {
  // Print the results
  val gender = entry._1
  val values = entry._2
  // Integer division, so the average is truncated to a whole level
  val average = values._1 / values._2
  println(gender + ": average=" + average + ", count=" + values._2 )
})

See more: https://gist.github.com/cotdp/fda64b4248e43a3c8f46

If you run this on a small sample of data you get results like this:

- "female": average=114, count=15422
- "male": average=104, count=14727

Which basically says the average level achieved by women is slightly higher than the one achieved by guys.

Best of luck fishing through Facebook data!

MC

*Michael Cutler*
Founder, CTO

On 20 May 2014 05:07, Joe L <selme...@yahoo.com> wrote:

> Is there any way to get facebook data into Spark and filter the content of
> it?