So the latest optimizations in the Spark 1.4 and 1.5 releases come mostly from Project Tungsten. The docs say it uses sun.misc.Unsafe to convert the physical record structure into byte arrays at some point, for more efficient GC and memory use. My question is: why is this only applicable to SQL/DataFrames and not to RDDs? RDDs have types too!
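For intuition, here is a minimal sketch (not Spark's actual code; the object name and values are made up for illustration) of the kind of raw off-heap access sun.misc.Unsafe enables. Because the data lives outside the Java object heap as plain bytes, the garbage collector never has to traverse it, which is the effect Tungsten is after:

    import sun.misc.Unsafe

    object UnsafeSketch {
      // Unsafe's constructor is private; grab the singleton via reflection.
      private val unsafe: Unsafe = {
        val f = classOf[Unsafe].getDeclaredField("theUnsafe")
        f.setAccessible(true)
        f.get(null).asInstanceOf[Unsafe]
      }

      def main(args: Array[String]): Unit = {
        // Allocate 16 bytes off-heap -- invisible to the garbage collector.
        val addr = unsafe.allocateMemory(16)
        try {
          unsafe.putLong(addr, 42L)         // write a long at the base address
          unsafe.putLong(addr + 8, 43L)     // write a second long right after it
          println(unsafe.getLong(addr))     // 42
          println(unsafe.getLong(addr + 8)) // 43
        } finally {
          unsafe.freeMemory(addr)           // manual lifecycle: we must free it ourselves
        }
      }
    }

The trade-off is manual memory management, which is why the engine has to know the exact layout of every record up front; that is the kind of information a schema gives it.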
On Mon, Jan 25, 2016 at 11:10 AM, Nirav Patel <npa...@xactlycorp.com> wrote:

> I haven't gone through much detail of the Spark Catalyst optimizer and
> Tungsten project, but we have been advised by Databricks support to use
> DataFrames to resolve the OOM errors we are getting during Join and
> GroupBy operations. We use Spark 1.3.1, and it looks like it cannot
> perform an external sort and blows up with an OOM.
>
> https://forums.databricks.com/questions/2082/i-got-the-akka-frame-size-exceeded-exception.html
>
> Now it's great that this has been addressed in the Spark 1.5 release, but
> why is Databricks advocating a switch to DataFrames? It may make sense for
> batch jobs or near-real-time jobs, but I'm not sure it does when you are
> developing real-time analytics where you want to optimize every
> millisecond you can. I am still educating myself on the DataFrame APIs and
> optimizations, and I will benchmark them against RDDs for our batch and
> real-time use cases as well.
>
> On Mon, Jan 25, 2016 at 9:47 AM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
>
>> What do you think is preventing you from optimizing your own RDD-level
>> transformations and actions? AFAIK, nothing that has been added in
>> Catalyst precludes you from doing that. The fact of the matter is,
>> though, that there is less type and semantic information available to
>> Spark from the raw RDD API than from Spark SQL, DataFrames, or Datasets.
>> That means that Spark itself can't optimize raw RDDs the same way it can
>> optimize higher-level constructs that leverage Catalyst; but if you want
>> to write your own optimizations based on your own knowledge of the data
>> types and semantics hiding in your raw RDDs, there's no reason you can't
>> do that.
>>
>> On Mon, Jan 25, 2016 at 9:35 AM, Nirav Patel <npa...@xactlycorp.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Perhaps I should write a blog post about why Spark is focusing more on
>>> making Spark jobs easier to write while hiding the underlying
>>> performance-optimization details from seasoned Spark users. It's one
>>> thing to provide an abstract framework that does the optimization for
>>> you, so that a data scientist or data analyst doesn't have to worry
>>> about it, but what about developers who do not want the overhead of SQL,
>>> optimizers, and unnecessary abstractions? Application designers who know
>>> their data and queries should be able to optimize at the level of RDD
>>> transformations and actions. Does Spark provide a way to achieve the
>>> same level of optimization through either SQL/Catalyst or raw RDD
>>> transformations?
>>>
>>> Thanks
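To make Mark's point concrete, here is an illustrative Scala sketch (Spark 1.5-era APIs; the object name and sample data are made up) of the same aggregation written three ways. In the first RDD version Spark sees only opaque closures; in the second we supply the partial aggregation by hand, which is exactly the kind of RDD-level tuning Mark describes; in the DataFrame version the schema and aggregation semantics are declared, so Catalyst can plan the partial aggregation itself and Tungsten can keep the rows in its compact binary format:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions._

    object RddVsDataFrame {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("rdd-vs-df").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

        // Raw RDD: Spark sees only opaque closures over JVM objects, so it
        // cannot rewrite or reorder this -- and groupByKey materializes all
        // values per key, a common source of OOM on skewed data.
        val rddCounts = pairs.groupByKey().mapValues(_.sum)

        // Hand-optimized RDD: we choose reduceByKey ourselves so the sum is
        // combined map-side before the shuffle. Spark can't make this choice
        // for us; we have to know our data and pick the right operator.
        val rddCountsOpt = pairs.reduceByKey(_ + _)

        // DataFrame: the schema ("key", "value") and the aggregation are
        // declared, so Catalyst plans the partial aggregation on its own and
        // Tungsten stores the rows as packed bytes off the Java object heap.
        val df = pairs.toDF("key", "value")
        val dfCounts = df.groupBy($"key").agg(sum($"value"))

        dfCounts.show()
        sc.stop()
      }
    }

The RDD versions differ only in which operator the developer picked, which is the "optimize it yourself" path; the DataFrame version hands that decision to Catalyst, which is why the type and schema information has to be there in the first place.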