I'd say that Datasets, not DataFrames, are the natural evolution of RDDs. DataFrames are for inherently tabular data, and are most naturally manipulated by SQL-like operations. Datasets, like RDDs, operate on programming-language objects.
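
To make that concrete for the word count in your message, here is a minimal PySpark sketch of the same count written against both APIs. It is only an illustration, not your actual code: the function names (split_words, rdd_word_count, dataframe_word_count) and the whitespace tokenization are my assumptions. Python has no typed Dataset API, so it covers only RDDs and DataFrames.

```python
def split_words(line):
    # Plain Python whitespace tokenizer (an assumed definition of "word").
    return line.split()


def rdd_word_count(spark, path):
    # RDD API: each record is an opaque Python string as far as Spark is
    # concerned, so it just applies split_words and counts the results.
    lines = spark.sparkContext.textFile(path)
    return lines.flatMap(split_words).count()


def dataframe_word_count(spark, path):
    # Imported here so split_words stays usable without PySpark installed.
    from pyspark.sql import functions as F

    # DataFrame API: the split is a column expression on a one-column
    # table, so the work goes through Spark SQL's tabular query plan.
    lines = spark.read.text(path)  # single string column named "value"
    words = lines.select(
        F.explode(F.split(F.col("value"), r"\s+")).alias("word")
    )
    # Splitting can yield empty strings (blank lines, leading spaces).
    return words.where(F.col("word") != "").count()
```

Given a SparkSession `spark` and an input path, both functions should agree on ordinary space-separated text. How their timings compare will depend on the data and the cluster.
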
So, comparing RDDs to DataFrames isn't quite apples-to-apples to begin with. It's just never true that "X is always faster than Y" in a case like this. Indeed, your case doesn't sound like one where a tabular representation would be beneficial; there's overhead to treating the data that way. You're doing almost nothing to the data itself except counting it, and RDDs have the lowest overhead of the three concepts because they treat their contents as opaque objects anyway.

The benefit comes when you do things like SQL-like operations on tabular data in the DataFrame API instead of the RDD API. That's where more optimization can kick in. Dataset brings some of the same possible optimizations to an RDD-like API because it has more knowledge of the type and nature of the entire data set.

If you're really only manipulating byte arrays, I don't know if DataFrame adds anything. I know Dataset has some specialization for byte[], so I'd expect you could see some storage benefits over RDDs, maybe.

On Tue, Aug 16, 2016 at 6:32 PM, jtgenesis <jtgene...@gmail.com> wrote:
> Hey guys, I've been digging around trying to figure out if I should
> transition from RDDs to DataFrames. I'm currently using RDDs to represent
> tiles of binary imagery data and I'm wondering if representing the data as a
> DataFrame is a better solution.
>
> To get my feet wet, I did a little comparison on a Word Count application,
> on a 1GB file of random text, using an RDD and DataFrame. And I got the
> following results:
>
> RDD Count total: 137733312 Time Elapsed: 44.5675378 s
> DataFrame Count total: 137733312 Time Elapsed: 69.201253448 s
>
> I figured the DataFrame would outperform the RDD, since I've seen many
> sources that state superior speeds with DataFrames. These results could just
> be an implementation issue, unstructured data, or a result of the data
> source. I'm not really sure.
>
> This leads me to take a step back and figure out what applications are
> better suited with DataFrames than RDDs?
> In my case, while the original image file is unstructured, the data is
> loaded in a pairRDD, where the key contains multiple attributes that pertain
> to the value. The value is a chunk of the image represented as an array of
> bytes. Since my data will be in a structured format, I don't see why I can't
> benefit from DataFrames. However, should I be concerned about any performance
> issues that pertain to processing/moving of byte arrays (each chunk is
> uniform size in the KB-MB range)? I'll potentially be scanning the entire
> image, selecting specific image tiles and performing some work on them.
>
> If DataFrames are well suited for my use case, how does the data source
> affect my performance? I could always just load data into an RDD and convert
> to DataFrame, or I could convert the image into a parquet file and create
> DataFrames directly. Is one way recommended over the other?
>
> These are a lot of questions, and I'm still trying to ingest and make sense
> of everything. Any feedback would be greatly appreciated.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/DataFrame-use-case-tp27543.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org