How many users and items do you have? Each ALS iteration has two halves: it first updates one side (users) and then the other (items), so every iteration ends up with 2 flatMap operations. I'd guess that you have many more users than items (or vice versa), which is why one of the two flatMaps writes much more shuffle data than the other.
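For a quick check of the two counts, something like the following could work. It is a minimal sketch in plain Python rather than Spark code: the `(user, item, rating)` tuple layout, the toy data, and the helper names `count_users_and_items` and `factor_bytes` are assumptions for illustration only, and the 8 bytes per value assumes the factors are stored as doubles. The rank of 200 is the feature number mentioned in the quoted message.

```python
from typing import Iterable, Tuple

# Hypothetical rating layout: (user_id, item_id, rating).
Rating = Tuple[int, int, float]


def count_users_and_items(ratings: Iterable[Rating]) -> Tuple[int, int]:
    """Return (number of distinct users, number of distinct items)."""
    users = set()
    items = set()
    for user, item, _ in ratings:
        users.add(user)
        items.add(item)
    return len(users), len(items)


def factor_bytes(n_rows: int, rank: int, bytes_per_value: int = 8) -> int:
    """Raw size of one copy of a factor matrix with n_rows rows.

    Assumes one value per feature and 8 bytes per value (doubles).
    """
    return n_rows * rank * bytes_per_value


# Toy data, made up for the example: 3 users, 2 items.
ratings = [(1, 10, 5.0), (2, 10, 3.0), (3, 10, 4.0), (3, 11, 1.0)]

n_users, n_items = count_users_and_items(ratings)
print("users:", n_users, "items:", n_items)
print("user factor bytes:", factor_bytes(n_users, rank=200))
print("item factor bytes:", factor_bytes(n_items, rank=200))
```

This only gives the raw size of one copy of each factor matrix. The actual shuffle write may differ, for example if a factor vector is sent to more than one block, but a large gap between the two counts would already point to the asymmetry described above.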
On Wed, Jun 25, 2014 at 11:39 AM, Lizhengbing (bing, BIPA) <[email protected]> wrote:

> Sometimes, shuffle write of flatMap is 14.8G and sometimes is 647.9M
> Why does this happen?
> The size of training data is about 1.5G. and the feature number is 200
>
> Stage Id  Description                              Submitted            Duration  Tasks: Succeeded/Total  Shuffle Read  Shuffle Write
> 114       flatMap at ALS.scala:434                 2014/06/25 17:13:39  6.3 min   48/48                   611.7 MB      14.8 GB
> 115       groupByKey at ALS.scala:442              2014/06/25 17:13:34  4 s       48/48                   337.5 MB      1275.9 MB
> 116       flatMap at ALS.scala:434                 2014/06/25 17:09:02  4.5 min   48/48                   12.2 GB       674.9 MB
> 117       groupByKey at ALS.scala:442              2014/06/25 17:07:05  2.0 min   48/48                   7.4 GB        25.5 GB
> 118       flatMap at ALS.scala:434                 2014/06/25 17:00:41  6.4 min   48/48                   664.2 MB      14.8 GB
> 119       groupByKey at ALS.scala:442              2014/06/25 17:00:30  10 s      48/48                   337.4 MB      1275.9 MB
> 120       flatMap at ALS.scala:434                 2014/06/25 16:55:19  5.2 min   48/48                   12.2 GB       674.9 MB
> 121       groupByKey at ALS.scala:442              2014/06/25 16:54:02  1.3 min   48/48                   7.4 GB        25.5 GB
> 122       flatMap at ALS.scala:434                 2014/06/25 16:53:52  9 s       48/48                                 14.8 GB
> 123       mapPartitionsWithIndex at ALS.scala:200  2014/06/25 16:53:40  12 s      48/48                   399.5 MB      737.4 MB
>           <http://10.71.123.101:4040/stages/stage?id=123>
> 6         map at ALS.scala:183                     2014/06/25 16:53:01  39 s      20/20                                 799.4 MB
>           <http://10.71.123.101:4040/stages/stage?id=6>
> 3         map at ALS.scala:186                     2014/06/25 16:53:01  39 s      20/20                                 652.2 MB
>           <http://10.71.123.101:4040/stages/stage?id=3>
