The user and item identifiers need to be integers. Specifically, the input
to ALS is an RDD[Rating] and Rating is an (Int, Int, Double). I am wondering
if perhaps one of your identifiers exceeds MAX_INT; could you write a quick
check for that?

I have been running a very similar use case to yours (with more constrained
hardware resources) and I haven't seen this exact problem, but I'm sure we've
seen similar issues. Please let me know if you have other questions.
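
For what it's worth, a quick check along the lines suggested above could look
like the sketch below. It assumes the raw ratings sit in a delimited text file
of "userId,itemId,rating" lines; the path, delimiter, and field order are
placeholders rather than details from the original report.

// A minimal sketch of the suggested MAX_INT sanity check, run against the raw
// input (placeholder path and "userId,itemId,rating" format assumed).
import org.apache.spark.{SparkConf, SparkContext}

object IdRangeCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("IdRangeCheck"))

    // Parse the identifiers as Long so values larger than Int.MaxValue survive parsing.
    val ids = sc.textFile("hdfs:///path/to/ratings")
      .map(_.split(','))
      .map(f => (f(0).toLong, f(1).toLong))
      .cache()

    val maxUserId = ids.map(_._1).max()
    val maxItemId = ids.map(_._2).max()
    println(s"max user id = $maxUserId, max item id = $maxItemId, Int.MaxValue = ${Int.MaxValue}")

    if (maxUserId > Int.MaxValue || maxItemId > Int.MaxValue) {
      println("At least one identifier does not fit in an Int and cannot be used " +
        "directly in a Rating(user: Int, product: Int, rating: Double).")
    }
    sc.stop()
  }
}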
From: Bharath Ravi Kumar <reachb...@gmail.com>
Date: Thursday, November 27, 2014 at 1:30 PM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: ALS failure with size > Integer.MAX_VALUE

We're training a recommender with ALS in mllib 1.1 against a dataset of 150M
users and 4.5K items, with the total number of training records being 1.2
Billion (~30GB data). The input data is spread across 1200 partitions on
HDFS. For the training, rank=10, and we've configured {number of user data
blocks ...}

Any suggestions to address the described problem? In particular, considering
the skewed degree of some of the item nodes in the graph, I believe it should
be possible to define better block sizes to reflect that fact, but I am
unsure how to arrive at those sizes.

Thanks
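
For reference, a rough sketch of a training setup like the one described
above, using the MLlib 1.1 recommendation API, is shown below. The input path
and format, the iteration count, lambda, and the block count are illustrative
assumptions rather than values from the original report; the final `blocks`
argument of ALS.train is the knob that sets the number of user/item blocks,
which is the lever the block-size question refers to.

// A hedged sketch of an ALS training setup against the MLlib 1.1-era API.
// Path, delimiter, iterations, lambda, and block count are placeholders.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object AlsTrainingSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("AlsTrainingSketch"))

    // Assumed input: "userId,itemId,rating" text lines on HDFS.
    val ratings = sc.textFile("hdfs:///path/to/ratings")
      .map(_.split(','))
      .map(f => Rating(f(0).toInt, f(1).toInt, f(2).toDouble))

    // ALS.train(ratings, rank, iterations, lambda, blocks): rank = 10 as in the
    // report; the last argument explicitly sets the number of user/item blocks
    // instead of letting MLlib pick a default.
    val model = ALS.train(ratings, 10, 10, 0.01, 1200)

    // Quick sanity check that factors were produced for users and items.
    println(s"user factors: ${model.userFeatures.count()}, " +
      s"item factors: ${model.productFeatures.count()}")
    sc.stop()
  }
}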