Re: Any suggestion about JIRA 1006 "MLlib ALS gets stack overflow with too many iterations"?

Xiangrui Meng Sun, 06 Apr 2014 21:16:23 -0700

Btw, explicit ALS doesn't need persist because each intermediate
factor is only used once. -Xiangrui
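
For reference, here is a minimal sketch of the "cut the lineage every k
iterations" idea discussed in the quoted messages below. It is not the MLlib
code: PeriodicCut, iterate, step and cut are hypothetical names, and the
helper is generic so that it runs without Spark. With Spark, T would be the
factor RDDs, and cut would presumably mark them with checkpoint() and run an
action such as count(), after SparkContext.setCheckpointDir has been called
(the RDD.checkpoint documentation asks for the mark to be set before any job
runs on that RDD).

```scala
// Hypothetical helper, not part of Spark or MLlib.
object PeriodicCut {
  /**
   * Applies `step` to `init` `iterations` times and calls `cut` on the
   * intermediate value after every `interval`-th iteration.
   *
   * Assumed Spark usage (not verified against ALS.scala):
   *   cut = rdd => { rdd.checkpoint(); rdd.count() }
   */
  def iterate[T](init: T, iterations: Int, interval: Int)
                (step: T => T)(cut: T => Unit): T = {
    require(iterations >= 0, "iterations must be non-negative")
    require(interval > 0, "interval must be positive")
    var current = init
    for (iter <- 1 to iterations) {
      current = step(current)
      // Truncate the dependency chain only on every interval-th iteration.
      if (iter % interval == 0) cut(current)
    }
    current
  }
}
```

With iterations = 10 and interval = 4, cut is invoked after iterations 4 and
8, so no chain of pending updates grows longer than 4 steps.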


On Sun, Apr 6, 2014 at 9:13 PM, Xiangrui Meng <men...@gmail.com> wrote:
> The persist used in implicit ALS doesn't help with the StackOverflow
> problem: persist doesn't cut lineage. We need to call count() and then
> checkpoint() to cut the lineage. Did you try the workaround mentioned
> in https://issues.apache.org/jira/browse/SPARK-958:
>
> "I tune JVM thread stack size to 512k via option -Xss512k and it works."
>
> Best,
> Xiangrui
>
> On Sun, Apr 6, 2014 at 10:21 AM, Debasish Das <debasish.da...@gmail.com> wrote:
>> At head I see a persist option for implicitPrefs. For cases like the ones
>> mentioned above, why don't we use a similar technique in explicit runs as
>> well, and take an input that says at which iterations to persist?
>>
>> for (iter <- 1 to iterations) {
>>   // perform ALS update
>>   logInfo("Re-computing I given U (Iteration %d/%d)".format(iter, iterations))
>>   products = updateFeatures(users, userOutLinks, productInLinks, partitioner,
>>     rank, lambda, alpha, YtY = None)
>>   logInfo("Re-computing U given I (Iteration %d/%d)".format(iter, iterations))
>>   users = updateFeatures(products, productOutLinks, userInLinks, partitioner,
>>     rank, lambda, alpha, YtY = None)
>> }
>>
>> Say I want to persist every k iterations out of N iterations of explicit
>> ALS: there should be an option to do that. Implicit ALS currently persists
>> at each iteration.
>>
>> Does this option make sense, or would you prefer to fix this issue in a
>> different way?
>>
>> I definitely see that for my 25M x 3M run, with 64 GB executor memory,
>> something goes wrong after the 5th iteration, and I wanted to run for 10
>> iterations...
>>
>> So my k is 4/5 for this particular problem...
>>
>> I can ask for the PR after testing the fix on the dataset I have...I will
>> also try to see if we can make such datasets public for more research...
>>
>> For the LDA problem mentioned earlier in this email chain, k is 10...NMF
>> can generate topics similar to LDA as well...Carrot2 project uses it...
>>
>>
>>
>> On Thu, Mar 27, 2014 at 3:20 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>
>>> Hi Matei,
>>>
>>> I am hitting similar problems with 10 ALS iterations. I am running with
>>> 24 GB executor memory on 10 nodes for a 20M x 3M matrix with rank = 50.
>>>
>>> The first iteration of flatMaps runs fine, which means that the
>>> per-iteration memory requirements are met.
>>>
>>> If I checkpoint the RDD, most likely the remaining 9 iterations will also
>>> run fine and I will get the results.
>>>
>>> Is there a plan to add a checkpoint option to ALS for such large
>>> factorization jobs?
>>>
>>> Thanks.
>>> Deb
>>>
>>>
>>>
>>>
>>>
>>> On Tue, Jan 28, 2014 at 11:10 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>
>>>> That would be great to add. Right now it would be easy to change it to
>>>> use another Hadoop FileSystem implementation at the very least (I think you
>>>> can just pass the URL for that), but for Cassandra you'd have to use a
>>>> different InputFormat or some direct Cassandra access API.
>>>>
>>>> Matei
>>>>
>>>> On Jan 28, 2014, at 5:02 PM, Evan Chan <e...@ooyala.com> wrote:
>>>>
>>>> > By the way, is there any plan to make a pluggable backend for
>>>> > checkpointing? We might be interested in writing, for example, a
>>>> > Cassandra backend.
>>>> >
>>>> > On Sat, Jan 25, 2014 at 9:49 PM, Xia, Junluan <junluan....@intel.com> wrote:
>>>> >> Hi all
>>>> >>
>>>> >> The description of this bug, submitted by Matei, is as follows:
>>>> >>
>>>> >>
>>>> >> The tipping point seems to be around 50. We should fix this by
>>>> >> checkpointing the RDDs every 10-20 iterations to break the lineage chain,
>>>> >> but checkpointing currently requires HDFS installed, which not all users
>>>> >> will have.
>>>> >>
>>>> >> We might also be able to fix DAGScheduler to not be recursive.
>>>> >>
>>>> >>
>>>> >> regards,
>>>> >> Andrew
>>>> >>
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > --
>>>> > Evan Chan
>>>> > Staff Engineer
>>>> > e...@ooyala.com  |
>>>>
>>>>
>>>
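
The other fix mentioned in the quoted bug description, making the scheduler's
traversal non-recursive, can be illustrated with a toy model. This is only a
sketch under stated assumptions, not DAGScheduler code: LineageWalk, Node,
chain and countReachable are hypothetical names, and Node is a stand-in for
an RDD with parent dependencies. A recursive walk needs one JVM stack frame
per ancestor, while a walk that keeps its own worklist on the heap does not
depend on the thread stack size.

```scala
import scala.collection.mutable

// Toy model only; none of these names exist in Spark.
object LineageWalk {
  /** A node in a dependency graph; `id` is assumed to be unique. */
  final class Node(val id: Int, val parents: List[Node])

  /** Builds a linear chain of `length` nodes and returns the last one. */
  def chain(length: Int): Node = {
    require(length >= 1, "length must be at least 1")
    var node = new Node(0, Nil)
    for (id <- 1 until length) node = new Node(id, List(node))
    node
  }

  /** Counts the nodes reachable from `root` using an explicit worklist. */
  def countReachable(root: Node): Int = {
    val visited = mutable.HashSet.empty[Int]
    var pending: List[Node] = List(root)
    while (pending.nonEmpty) {
      val node = pending.head
      pending = pending.tail
      // add returns true only the first time an id is seen.
      if (visited.add(node.id)) pending = node.parents ::: pending
    }
    visited.size
  }
}
```

Because the worklist grows on the heap, a long chain of iterations does not
translate into deep call-stack usage in this model.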
