Re: DataFrame#rdd doesn't respect DataFrame#cache, slowing down CrossValidator

2015-07-31 Thread Justin Uang
Sweet! It's here: https://issues.apache.org/jira/browse/SPARK-9141?focusedCommentId=14649437&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14649437 On Tue, Jul 28, 2015 at 11:21 PM Michael Armbrust wrote: > Can you add your description of the problem as a comment

Re: DataFrame#rdd doesn't respect DataFrame#cache, slowing down CrossValidator

2015-07-28 Thread Michael Armbrust
Can you add your description of the problem as a comment to that ticket and we'll make sure to test both cases and break it out if the root cause ends up being different. On Tue, Jul 28, 2015 at 2:48 PM, Justin Uang wrote: > Sweet! Does this cover DataFrame#rdd also using the cached query from >

Re: DataFrame#rdd doesn't respect DataFrame#cache, slowing down CrossValidator

2015-07-28 Thread Justin Uang
Sweet! Does this cover DataFrame#rdd also using the cached query from DataFrame#cache? I think the ticket 9141 is mainly concerned with whether a derived DataFrame (B) of a cached DataFrame (A) uses the cached query of A, not whether the rdd from A.rdd or B.rdd uses the cached query of A. On Tue, J

Re: DataFrame#rdd doesn't respect DataFrame#cache, slowing down CrossValidator

2015-07-28 Thread Joseph Bradley
Thanks for bringing this up! I talked with Michael Armbrust, and it sounds like this is from a bug in DataFrame caching: https://issues.apache.org/jira/browse/SPARK-9141 It's marked as a blocker for 1.5. Joseph On Tue, Jul 28, 2015 at 2:36 AM, Justin Uang wrote: > Hey guys, > > I'm running in

DataFrame#rdd doesn't respect DataFrame#cache, slowing down CrossValidator

2015-07-28 Thread Justin Uang
Hey guys, I'm running into some pretty bad performance issues when it comes to using a CrossValidator, because of caching behavior of DataFrames. The root of the problem is that while I have cached my DataFrame representing the features and labels, it is caching at the DataFrame level, while Cros
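
To make the complaint in this thread concrete, here is a small toy model in plain Python. It is not Spark code and does not reflect Spark's internals: `ToyFrame`, `rdd_ignoring_cache`, `rdd_using_cache`, and `run_folds` are hypothetical names invented for illustration. It only shows why repeated lower-level access that bypasses a frame-level cache re-runs the expensive computation once per fold, assuming that is the behaviour the subject line describes.

```python
class ToyFrame:
    """Toy stand-in for a cached DataFrame. Hypothetical, NOT Spark's implementation."""

    def __init__(self, compute):
        self._compute = compute          # expensive function that produces the rows
        self._cache_requested = False    # set by cache()
        self._cached = None              # filled lazily on first cached access
        self.evaluations = 0             # how many times the expensive compute ran

    def _run(self):
        self.evaluations += 1
        return list(self._compute())

    def cache(self):
        self._cache_requested = True
        return self

    def collect(self):
        # Frame-level access: honours the cache once it has been requested.
        if self._cache_requested:
            if self._cached is None:
                self._cached = self._run()
            return self._cached
        return self._run()

    def rdd_ignoring_cache(self):
        # Models the reported behaviour: lower-level access recomputes every time.
        return self._run()

    def rdd_using_cache(self):
        # Models the desired behaviour: lower-level access reuses the cached rows.
        return self.collect()


def run_folds(frame, accessor, k):
    """Access the frame k times, as a k-fold loop might, and return total evaluations."""
    for _ in range(k):
        accessor(frame)
    return frame.evaluations


slow = ToyFrame(lambda: range(5)).cache()
slow.collect()                                    # materialise the cache once
slow_evals = run_folds(slow, ToyFrame.rdd_ignoring_cache, 3)

fast = ToyFrame(lambda: range(5)).cache()
fast.collect()                                    # materialise the cache once
fast_evals = run_folds(fast, ToyFrame.rdd_using_cache, 3)

print(slow_evals, fast_evals)  # 4 1
```

In this sketch the cache-bypassing path costs one evaluation per fold on top of the initial one, while the cache-aware path stays at a single evaluation no matter how many folds run.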