Increase the number to Integer.MAX_VALUE, or to the larger of your number of 
users and number of items. The "or" means that the rows and the columns are 
both downsampled to that number or less.
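
As a rough illustration of what "both downsampled" means, here is a minimal 
Python sketch. It is not Mahout's actual sampling code; the function name 
downsample and the (user, item) pair layout are assumptions made for the 
example. It caps every user row, and then every item column, at max_prefs 
entries by random sampling.

```python
import random
from collections import defaultdict


def downsample(prefs, max_prefs, seed=0):
    """Cap the interactions kept per user and per item at max_prefs.

    prefs is a list of (user, item) pairs. Hypothetical helper for
    illustration only; Mahout's own sampling differs in detail.
    """
    rng = random.Random(seed)

    def cap(pairs, key_index):
        # Group the pairs by user (key_index=0) or by item (key_index=1).
        groups = defaultdict(list)
        for pair in pairs:
            groups[pair[key_index]].append(pair)
        kept = []
        for group in groups.values():
            # Groups larger than the cap are randomly sampled down.
            if len(group) > max_prefs:
                group = rng.sample(group, max_prefs)
            kept.extend(group)
        return kept

    # Cap rows (users) first, then columns (items).
    return cap(cap(prefs, 0), 1)


# 10 users who each interacted with the same 4 items.
prefs = [(u, i) for u in range(10) for i in range(4)]
sampled = downsample(prefs, max_prefs=3)
```

With max_prefs at least as large as both the user count and the item count, 
nothing is dropped, which is why raising the value that far keeps all data.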

To use all data you will also have to increase --maxSimilaritiesPerItem.

There are two matrices in the Hadoop itemsimilarity job. The input is A, with 
one row per user holding each item the user has interacted with. From this, 
AtA is calculated as the output, using LLR (log-likelihood ratio) instead of 
actual matrix multiplication. This yields an AtA whose values are weighted by 
LLR strength. --maxSimilaritiesPerItem will further limit the values here to 
no more than that number per item. There is also a quality threshold, which is 
pretty difficult to use.
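
To make the "LLR instead of matrix multiplication" step concrete, here is a 
small Python sketch under stated assumptions. It is not the Hadoop job itself. 
The llr function follows the usual entropy-based log-likelihood ratio on a 2x2 
contingency table; item_similarities and its argument names are hypothetical 
and only mimic the effect of --maxSimilaritiesPerItem by keeping the top 
entries per item.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized entropy of a set of counts.
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 table of co-occurrence counts.

    k11: users with both items, k12: first item only,
    k21: second item only, k22: neither item.
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    if row + col < mat:
        return 0.0  # guard against round-off
    return 2.0 * (row + col - mat)


def item_similarities(user_items, max_similarities_per_item):
    """Build LLR-weighted item-item scores from user rows (the matrix A)."""
    n_users = len(user_items)
    item_counts = Counter()
    pair_counts = Counter()
    for items in user_items.values():
        unique = sorted(set(items))
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    scores = defaultdict(list)
    for (a, b), k11 in pair_counts.items():
        k12 = item_counts[a] - k11
        k21 = item_counts[b] - k11
        k22 = n_users - k11 - k12 - k21
        s = llr(k11, k12, k21, k22)
        scores[a].append((b, s))
        scores[b].append((a, s))

    # Keep only the strongest similarities for each item.
    return {
        item: sorted(sims, key=lambda p: p[1], reverse=True)[:max_similarities_per_item]
        for item, sims in scores.items()
    }


user_items = {"u1": ["a", "b", "c"], "u2": ["a", "b"], "u3": ["c", "d"], "u4": ["d"]}
sims = item_similarities(user_items, max_similarities_per_item=1)
```

In this toy data, items a and b always occur together, so b is the one 
similarity kept for a; the a/c pair scores zero and is cut off by the limit.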

If you remove all of these downsampling params you will approach O(n^2) 
runtime; if you use them you will have O(n). You will also get rapidly 
diminishing returns by removing downsampling.
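
One way to see the runtime difference is to count the item pairs that each 
user row contributes to the cooccurrence step. The sketch below is only 
back-of-the-envelope arithmetic, not a model of the actual job; the function 
name cooccurrence_pairs and the row lengths are made up for illustration. A 
row with k items yields k*(k-1)/2 pairs, so an uncapped long row dominates the 
work, while a cap bounds the cost of every row by a constant.

```python
def cooccurrence_pairs(row_lengths, max_prefs=None):
    """Count item pairs generated by rows of the given lengths.

    If max_prefs is set, each row is treated as sampled down to that size.
    """
    total = 0
    for k in row_lengths:
        if max_prefs is not None:
            k = min(k, max_prefs)
        total += k * (k - 1) // 2
    return total


rows = [10, 100, 5000]  # hypothetical numbers of items per user
uncapped = cooccurrence_pairs(rows)               # 12502495 pairs
capped = cooccurrence_pairs(rows, max_prefs=500)  # 129745 pairs
```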

Without downsampling, the indicator matrix will have arbitrarily many similar 
items of diminishing strength, and some could be nearly useless. This 
potentially large vector may be unwieldy in your other calculations and has 
not had low-value similar items filtered out.

Bottom line is that the downsampling is possible to tweak, but removing it 
altogether is not likely to be a good thing.


On Dec 12, 2014, at 6:18 AM, Gruszowska Natalia 
<[email protected]> wrote:

Hi All, 

In the itemsimilarity method there is a parameter like:

--maxPrefs (-mppu) maxPrefs                               max number of
                                                         preferences to
                                                         consider per user or
                                                         item, users or items
                                                         with more preferences
                                                         will be sampled down
                                                         (default: 500)

How does it work exactly?
If I have 5 million users and 5000 items and I run itemsimilarity with the 
default maxPrefs, does it consider only 500 ranks from those 5 million, or 
what? Is it sampling? What can I do to force calculation for all input data? 

                        M1   M2   M3 .... M5000
U_1
U_2
...
U_5000000

What does "or" mean in the definition:
"max number of preferences to consider per user or item"


Thx in advance
Natalia


