Hey guys, I've been digging around trying to figure out if I should transition from RDDs to DataFrames. I'm currently using RDDs to represent tiles of binary imagery data and I'm wondering if representing the data as a DataFrame is a better solution.
To get my feet wet, I ran a small comparison: a Word Count application on a 1 GB file of random text, implemented once with an RDD and once with a DataFrame. I got the following results:

RDD        Count total: 137733312   Time Elapsed: 44.5675378 s
DataFrame  Count total: 137733312   Time Elapsed: 69.201253448 s

I expected the DataFrame to outperform the RDD, since I've seen many sources that report better speeds with DataFrames. These results could be down to my implementation, the unstructured data, or the data source; I'm not really sure.

This leads me to take a step back and ask: which applications are better suited to DataFrames than to RDDs?

In my case, the original image file is unstructured, but the data is loaded into a pair RDD where the key contains multiple attributes that pertain to the value, and the value is a chunk of the image represented as an array of bytes. Since my data will be in a structured format, I don't see why I can't benefit from DataFrames. However, should I be concerned about performance issues when processing or moving the byte arrays? (Each chunk is a uniform size in the KB-MB range.) I'll potentially be scanning the entire image, selecting specific image tiles, and performing some work on them.

If DataFrames are well suited to my use case, how does the data source affect performance? I could load the data into an RDD and convert it to a DataFrame, or I could convert the image into a Parquet file and create DataFrames directly from it. Is one way recommended over the other?

That's a lot of questions, and I'm still trying to take in and make sense of everything. Any feedback would be greatly appreciated.
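For concreteness, the two loading options mentioned above could look roughly like the PySpark sketch below. The key fields (image_id, row, col), the TileKey type, the helper names, and the Parquet path argument are all placeholders made up for illustration, not the actual tile layout, and nothing here says which route is faster.

```python
from typing import NamedTuple

# pyspark is imported inside the Spark-facing functions so that the
# pure-Python helper (to_row) can be used without a Spark installation.


class TileKey(NamedTuple):
    # Hypothetical key attributes; substitute whatever the real key carries.
    image_id: str
    row: int
    col: int


def to_row(pair):
    """Flatten one (TileKey, bytes) pair into a tuple matching tile_schema()."""
    key, data = pair
    # bytearray is accepted by Spark's BinaryType.
    return (key.image_id, key.row, key.col, bytearray(data))


def tile_schema():
    """Explicit schema: three key columns plus one binary column for the chunk."""
    from pyspark.sql.types import (
        BinaryType, IntegerType, StringType, StructField, StructType,
    )
    return StructType([
        StructField("image_id", StringType(), False),
        StructField("row", IntegerType(), False),
        StructField("col", IntegerType(), False),
        StructField("data", BinaryType(), False),
    ])


def tiles_to_dataframe(spark, pair_rdd):
    """Option 1: keep loading into a pair RDD, then convert to a DataFrame."""
    return spark.createDataFrame(pair_rdd.map(to_row), schema=tile_schema())


def roundtrip_through_parquet(spark, df, path):
    """Option 2: write the tiles to Parquet once, then read DataFrames from it."""
    df.write.mode("overwrite").parquet(path)
    return spark.read.parquet(path)


def select_tiles(df, image_id, rows):
    """Pick specific tiles by filtering on the key columns only."""
    from pyspark.sql import functions as F
    return df.where((F.col("image_id") == image_id) & F.col("row").isin(rows))
```

With a SparkSession named spark and an existing pair RDD named tiles (both assumed), usage would be along the lines of df = tiles_to_dataframe(spark, tiles) followed by select_tiles(df, "img0", [0, 1]). Filtering on the key columns keeps the tile selection expressed in terms Spark SQL can see, while the byte payload stays an opaque binary column.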