Here is another interesting post: http://www.kdnuggets.com/2016/02/apache-spark-rdd-dataframe-dataset.html?utm_content=buffer31ce5&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
> On Feb 16, 2016, at 6:01 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>
> Hi,
>
> A Resilient Distributed Dataset (RDD) is a collection of data distributed
> across all nodes of the cluster. It is essentially raw data with little
> optimization applied to it. Remember that data is of little value until it
> is turned into information.
>
> A DataFrame, on the other hand, is the equivalent of a table in an RDBMS,
> akin to a table in Oracle or Sybase: a two-dimensional, array-like
> structure in which each column contains measurements on one variable and
> each row contains one case.
>
> A DataFrame therefore carries additional metadata by virtue of its tabular
> format, which the Spark optimizer, AKA Catalyst, can exploit for certain
> optimizations. After so many years, the relational model is arguably still
> the most elegant model known; it is used and emulated everywhere.
>
> Much like a table in an RDBMS, a DataFrame keeps track of its schema and
> supports various relational operations that lead to more optimized
> execution. Each DataFrame object represents a logical plan, but because
> DataFrames are "lazy", no execution occurs until the user calls a specific
> "output operation". This is very important to remember. You can go from a
> DataFrame to an RDD via its rdd method, and from an RDD to a DataFrame (if
> the RDD is in a tabular format) via the toDF method; a sketch of this
> round trip follows below.
>
> In general it is recommended to use a DataFrame where possible, because of
> the built-in query optimization.
>
> For those familiar with SQL, a DataFrame can be conveniently registered as
> a temporary table, and SQL operations can then be performed on it.
>
> Case in point: I am searching all my replication server log files,
> compressed and stored in an HDFS directory, for errors on a specific
> connection:
>
> // Needed for rdd.toDF; in spark-shell, sc and sqlContext are predefined
> import sqlContext.implicits._
>
> // Create an RDD of lines from the compressed log files
> val rdd = sc.textFile("/test/REP_LOG.gz")
> // Convert it to a DataFrame with a single column named "line"
> val df = rdd.toDF("line")
> // Register the DataFrame as a temporary table
> df.registerTempTable("t")
> println("\n Search for ERROR plus another word in table t\n")
> sqlContext.sql("SELECT * FROM t WHERE line LIKE '%ERROR%' AND line LIKE '%hiveserver2.asehadoop%'").collect().foreach(println)
>
> Alternatively, you can filter for the word with method calls on the
> DataFrame itself:
>
> import org.apache.spark.sql.functions.col // needed for col()
> df.filter(col("line").like("%ERROR%")).collect().foreach(println)
>
> HTH,
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
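To make the rdd/toDF round trip and the lazy-plan point above concrete, here is a minimal sketch. It assumes a Spark 1.x spark-shell where sc and sqlContext are predefined; the sample data and column names are made up for illustration:

// Needed for toDF and the $"..." column syntax
import sqlContext.implicits._

// Hypothetical tabular RDD of (connection, log line) pairs
val pairs = sc.parallelize(Seq(("conn1", "ERROR timeout"), ("conn2", "OK")))

// RDD -> DataFrame via toDF, naming the two columns
val df2 = pairs.toDF("connection", "line")
df2.printSchema() // the DataFrame carries its schema with it

// DataFrame -> RDD via the rdd method; elements come back as Row objects
val rows = df2.rdd

// filter merely extends the logical plan; nothing runs yet...
val errors = df2.filter($"line".like("%ERROR%"))
errors.explain() // ...and the plan can be inspected before any job runs

// Only an output operation such as collect() triggers execution
errors.collect().foreach(println)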
>
> From: Ashok Kumar [mailto:ashok34...@yahoo.com.INVALID]
> Sent: 16 February 2016 16:06
> To: User <user@spark.apache.org>
> Subject: Use case for RDD and Data Frame
>
> Gurus,
>
> What are the main differences between a Resilient Distributed Dataset
> (RDD) and a DataFrame (DF)?
>
> Where can one use an RDD without transforming it into a DF?
>
> Regards and obliged
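On the last question above: one common case for staying with an RDD is low-level work on unstructured data, where there is no schema for Catalyst to exploit, so a DataFrame buys you nothing. A minimal sketch (the input path is made up):

// Classic RDD-only word count: free-form text has no tabular structure,
// so there is no schema to hand to the optimizer and an RDD is a natural fit
val counts = sc.textFile("/test/unstructured.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.take(10).foreach(println)

Once the data has acquired a tabular shape, here (word, count) pairs, it can of course be converted with toDF for further relational processing.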