Yeah, it happens because sortByKey samples the data eagerly to work out the right range partitions for it. We could defer that sampling until an action actually runs, as the suggestion in the linked JIRA (SPARK-1021) says.
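To illustrate why picking range partitions needs a look at the data, here is a minimal plain-Scala sketch. It is not Spark's RangePartitioner; the names rangeBounds and partitionFor, the sample size, and the seed are all hypothetical and chosen only for illustration. The point is that the boundary keys cannot be chosen without reading a sample, which is the work that runs right away.

```scala
import scala.util.Random

// Hypothetical helper (not Spark's implementation): choose numPartitions - 1
// boundary keys from a random sample of the keys.
def rangeBounds(keys: Seq[Int], numPartitions: Int, sampleSize: Int, seed: Long): Vector[Int] = {
  val rng = new Random(seed)
  // Drawing the sample means reading the data now; this is the eager step.
  val sample = rng.shuffle(keys).take(sampleSize).sorted.toVector
  if (sample.isEmpty || numPartitions <= 1) Vector.empty
  else (1 until numPartitions).map(i => sample(i * sample.size / numPartitions)).toVector
}

// Hypothetical helper: a key's partition is the number of bounds below it,
// so smaller keys land in lower-numbered partitions.
def partitionFor(key: Int, bounds: Vector[Int]): Int = bounds.count(_ < key)

val keys = (1 to 1000).toVector
val bounds = rangeBounds(keys, numPartitions = 4, sampleSize = 100, seed = 42L)
val sizes = keys.groupBy(k => partitionFor(k, bounds)).map { case (p, ks) => p -> ks.size }
println(s"bounds = $bounds, partition sizes = $sizes")
```

With evenly spread keys like these, the sampled bounds should split the data into roughly equal ranges. Deferring the sampling would mean computing the bounds only when the first action asks for the partitions.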
Matei

On Apr 7, 2014, at 10:06 AM, Diana Carroll <dcarr...@cloudera.com> wrote:

> Aha! Well I'm not crazy then, thanks.
>
>
> On Mon, Apr 7, 2014 at 11:51 AM, Mark Hamstra <m...@clearstorydata.com> wrote:
> https://issues.apache.org/jira/browse/SPARK-1021?jql=text%20~%20%22sortByKey%22
>
>
> On Mon, Apr 7, 2014 at 8:42 AM, Diana Carroll <dcarr...@cloudera.com> wrote:
> Until today, I was under the impression that *all* Spark transformations were
> "lazy"...that is, they wouldn't actually execute until an *action* such as
> count or take was performed.
>
> However today I'm using the "sortByKey" transformation, which would appear to
> execute immediately, rather than as a result of an operation. Am I
> misunderstanding something, is this a bug, or is this a deliberate difference
> between sortByKey and other transformations?
>
> Here's my test. I'm parsing a bunch of weblog files and I want to know which
> users made the most requests. So my code pulls out the 2nd field of each line
> (the user ID), adds up the total number of hits for each user ID, swaps user
> ID/hit count, and sorts by hit count.
>
> var userreqs =
> sc.textFile("file:/home/training/training_materials/sparkdev/data/weblogs/*").
> map(_.split(" ")).
> map(words => (words(2),1)).
> reduceByKey(_ + _).
> map(pair => (pair._2,pair._1)).
> sortByKey(false)
>
> I thought nothing would actually happen here until I did userreqs.take(10)
> but actually it did execute without the take(). It took about a minute for it
> to complete and if I look at the web UI I see completed execution of 3
> stages: (Why is sortByKey two stages?)
>
> <sparkdev-2014-03-26.png>
>
> Something else about this strikes me as odd, too. If I follow this command
> by userreqs.take(10), I think it executes the whole thing all over again, but
> doesn't show all the stages: stage 3 is missing in the UI:
> <sparkdev-2014-03-26.png>
>
>
> Plus it seems to automatically be caching my results? Because when I execute
> "take(10)" repeatedly, subsequent executions are very fast, and trigger only
> a single stage:
>
> <sparkdev-2014-03-26.png>
>
> And I confirmed it is caching because I tried deleting the underlying files
> and the take() still worked.
>
> Anyone have any insight?
>
> Diana