The idea is to put each of the indicators in a doc or DB row, with a field for each indicator.
If you index docs as CSV files (not DB columns) with a row per doc, you'll have the following for [AtA]:

doc-ID,item1-ID item2-ID item10-ID

and so on. The doc-ID field contains the item-ID, and the indicator field contains a space-delimited list of similar items. You add a field for each of the cross-cooccurrence indicators. You calculate [AtA] with every run of the job, but [AtB] will be built from a different action, so the IDs for A and B don't have to be the same. For instance, if you have a user's location saved periodically, you could treat the action of "being at a location" as the input to B.

You run A and B into the job and get back two text CSV sparse matrices. Combine them and index the following:

combined csv: item1000-ID,item1-ID item2-ID item10-ID ...,location1-ID location2-ID ...

This has two indicators in two fields, created by joining the [AtA] and [AtB] files. The item/doc-ID for the row always comes from A; if you look at the output that much should be clear.

Set up Elasticsearch to index the docs with two fields. Then the query uses the history of the user that you want to make recommendations for. I don't know the Elasticsearch query format, but the idea is to use the user's history of items interacted with and of locations as the query:

Q: field1:"item1-ID item2-ID item3-ID" field2:"location1-ID locationN-ID"

field1 contains the data from the primary action indicator, and its query string is the user's history of the primary action. field2 contains the location cross-cooccurrence indicators and so will hold location-IDs; its query string is the user's history of locations. The query will result in an ordered list of items to recommend. So: two fields, two different histories for the same user, one combined two-field query.

The item IDs for the two actions may also be the same. For instance, if you were recording purchases and detail-views for an ecom site, you'd feed purchase history in as A and detail-view history in as B.
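To make the join concrete, here is a toy Python sketch of combining the two indicator CSVs (row IDs taken from [AtA]) and forming the two-field query string. The function names and the field1/field2 names are illustrative assumptions, not Mahout or Elasticsearch APIs:

```python
def parse_indicator_csv(lines):
    """Each line looks like 'itemID,similar1-ID similar2-ID ...'.
    Returns a dict of item -> space-delimited similar-item string."""
    out = {}
    for line in lines:
        item_id, _, similars = line.strip().partition(",")
        out[item_id] = similars
    return out

def join_indicators(ata_lines, atb_lines):
    """Join [AtA] and [AtB] output; row IDs always come from [AtA],
    so an item missing from [AtB] just gets an empty second field."""
    ata = parse_indicator_csv(ata_lines)
    atb = parse_indicator_csv(atb_lines)
    return {item: (ata[item], atb.get(item, "")) for item in ata}

def build_query(item_history, location_history):
    """One combined two-field query: the user's primary-action history
    against field1, location history against field2."""
    return 'field1:"%s" field2:"%s"' % (
        " ".join(item_history), " ".join(location_history))

docs = join_indicators(
    ["iphone,ipad nexus", "ipad,iphone"],   # toy [AtA] output
    ["iphone,loc1 loc2"])                   # toy [AtB] output
query = build_query(["iphone", "ipad"], ["loc1"])
```

In a real setup the joined docs would be bulk-indexed into Elasticsearch and the query string handed to its query parser; the sketch only shows the data shapes involved.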
Other answers below.

On Dec 10, 2014, at 7:25 AM, Ariel Wolfmann <[email protected]> wrote:

Hi Pat,

Sorry that I continue bothering you, but in the next days I need to finish my annual report, so it would be very helpful for me if you can answer the previous email, because there are things that I haven't got clear enough.

Cheers
Ariel

2014-12-03 12:41 GMT-03:00 Ariel Wolfmann <[email protected]>:

Hi Pat,

Thank you for your answers. For now it's OK for me to run Spark locally, so now I have another question. Once I have run spark-itemsimilarity, I need to make the recommendation. As I read in your blog, I can do it using a search engine as the similarity engine; I chose Elasticsearch, but I can't understand how this works to make the recommendations. It needs to calculate rp = hp[AtA], where [AtA] is the spark-itemsimilarity output (the output will be that only if I set it to calculate all the similarities, not just the top 100, right?)

hp[AtA] is actually what the search engine is doing. It is calculating the cosine-based similarity of your query with the indicator field you point it to. This is, very loosely speaking, = hp[AtA]. The number of results returned is defined in your config for the search engine and may even allow batched-up cursor-type queries; again, I haven't used Elasticsearch. But it's a search engine thing, not defined by the algorithm.

In most cases you will want to filter out the recommended items that are already known to the user. So usually you don't want to show them things they have already purchased.

How should I index this output? I suppose that I need to store the similarity calculated by Mahout with the itemId as the indicator, but in the example in the blog the indicators are just the names of the most similar items.
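Very loosely, the hp[AtA] scoring plus the "filter out known items" step can be sketched like this. This toy version just counts co-occurrence hits per candidate item rather than computing Elasticsearch's actual TF-IDF/cosine score, so treat it as an illustration of the shape of the computation, not the real ranking function:

```python
def recommend(history, indicators, known):
    """history: item IDs the user interacted with (hp).
    indicators: item -> list of similar items (the [AtA] rows).
    known: items to filter out, e.g. already purchased.
    Ranks candidates by how many history items point at them."""
    scores = {}
    for h in history:
        for similar in indicators.get(h, []):
            scores[similar] = scores.get(similar, 0) + 1
    # Sort by score descending, breaking ties alphabetically.
    ranked = sorted(scores, key=lambda i: (-scores[i], i))
    return [i for i in ranked if i not in known]

indicators = {
    "iphone": ["ipad", "nexus"],
    "ipad": ["iphone", "nexus"],
}
recs = recommend(["iphone", "ipad"], indicators,
                 known={"iphone", "ipad"})
```

Here "nexus" scores highest because both history items co-occur with it, and the items the user already has are dropped from the result.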
If I only pay attention to the most similar items, this will not calculate hp[AtA]. For example, there could be an item that is not in the top similar items for the items that the user viewed, but that has enough similarity to be a good recommendation when you sum the similarities from this item to the viewed items.

Nope, as described above the indicator is just the most similar items (based on which users preferred them) for the simple case. The IDs are actually strings, and it's assumed they have no spaces in them so the search engine won't get confused and break one ID into multiple tokens. The number of similar items in the indicator is configurable, but it's a good start to leave it at the default of 500. There are very quickly diminishing returns from using more, and by "downsampling" based on strength of similarity you tend to get the "best" similar items in that 500. Setting this too high will cause a much longer runtime and will give very little return in quality.

The code I wrote for Elasticsearch is like this: http://paste.ubuntu.com/9355001/ Also, I read that there are some similarity modules for Elasticsearch, but it is not clear to me whether they can help or not. As you can see, I'm just starting to learn about Elasticsearch and NoSQL search engines, so this part of the "tutorial" is not clear enough for me.

Another question, but less important: how do I do this with more than one cross-similarity? The blog says that you can run spark-itemsimilarity several times with the different kinds of interactions, but how should I combine these results to make a multiple cross-recommendation?

Answered above.

I'll really appreciate your answer.

Regards
Ariel

2014-11-14 14:23 GMT-03:00 Pat Ferrel <[email protected]>:

This is a Spark error, not Hadoop. Some temp file that Spark creates is not found in the raw file system. We are having a lot of problems with running the downloaded jars for Spark.
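The downsampling idea, keeping only the strongest ~500 similar items per indicator row, can be sketched as follows. The (item, strength) pair format and the downsample helper are illustrative assumptions; Mahout does this internally during the job, typically using LLR as the strength measure:

```python
def downsample(similars, max_per_row=500):
    """similars: list of (item_id, strength) pairs for one indicator row.
    Keep only the strongest max_per_row entries; beyond that there are
    quickly diminishing returns in quality but a much longer runtime.
    500 mirrors the default mentioned above."""
    ranked = sorted(similars, key=lambda pair: -pair[1])
    return [item for item, _ in ranked[:max_per_row]]

row = [("ipad", 4.2), ("nexus", 1.1), ("surface", 3.0)]
top2 = downsample(row, max_per_row=2)
```

The weakest entry ("nexus") is dropped; only the strongest similar items survive into the indexed indicator field.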
When you go from local[4] (where you are running Spark as it was linked in by Maven) to spark://ssh-aw:7077 (where you launch Spark with your local machine as master and slave) you are using two different sets of code. Did you build Spark? What version of Hadoop are you using?

BTW, I don't understand why you are doing these things; you get no benefit. The jobs will complete much faster if you use local[4] than if you go through pseudo-distributed Spark and Hadoop. It's true that you'd see a similar error if you ran this on a real cluster, so eventually you'll need to solve the problem. Try the process described in the second FAQ entry here, where you build Spark locally: https://mahout.apache.org/users/sparkbindings/faq.html The Mahout build now defaults to Hadoop 2: https://mahout.apache.org/developers/buildingmahout.html

On Nov 14, 2014, at 7:02 AM, Ariel Wolfmann <[email protected]> wrote:

Hi Pat,

Thank you very much for your help. After removing the spaces it worked fine in standalone mode.
With Hadoop in pseudo-distributed mode it also worked fine, running this command:

./bin/mahout spark-itemsimilarity --input hdfs://localhost:9000/user/ariel/intro_multiple.csv --output hdfs://localhost:9000/user/ariel/multiple --master local[4] --filter1 purchase --filter2 view --itemIDColumn 2 --rowIDColumn 0 --filterColumn 1

But trying also with Spark in pseudo-distributed mode I got an error.

The commands:

export MASTER=spark://ssh-aw:7077
./bin/mahout spark-itemsimilarity --input hdfs://localhost:9000/user/ariel/intro_multiple.csv --output hdfs://localhost:9000/user/ariel/outsis2 --master spark://ssh-aw:7077 --filter1 purchase --filter2 view --itemIDColumn 2 --rowIDColumn 0 --filterColumn 1

The error:

INFO scheduler.DAGScheduler: Failed to run collect at TextDelimitedReaderWriter.scala:76
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, ssh-aw): java.io.FileNotFoundException: /tmp/spark-local-20141114143728-65fb/0d/shuffle_0_0_1 (No such file or directory)

In /tmp/ related to Spark I have these files:

spark-aw-org.apache.spark.deploy.master.Master-1.pid
spark-aw-org.apache.spark.deploy.worker.Worker-1.pid

It seems that maybe I need to set up some other variables to run Spark pseudo-distributed? I just run ~/spark-1.1.0/sbin/start-all.sh

Running with Hadoop pseudo-distributed is good enough for now, because I'm just playing with small examples, so if you think this is a simple error, tell me how to solve it; if not, this is enough for me. I also take your tip that running Hadoop standalone is faster than pseudo-distributed.

Once again, thank you very much!

Cheers
Ariel

2014-11-12 14:31 GMT-03:00 Pat Ferrel <[email protected]>:

Let's solve the empty file problem first. There's no need to specify the inDelim, which is [ ,\t] by default (that's a space, comma, or tab).
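That default delimiter set is exactly why stray spaces are dangerous: with space, comma, and tab all acting as delimiters, a space after each comma produces an empty token where the filter value should be. A toy sketch, assuming the delimiters behave like a plain regex split (Mahout's actual tokenization may differ in detail):

```python
import re

IN_DELIM = r"[ ,\t]"  # the spark-itemsimilarity default described above

def columns(line):
    # Split on every single delimiter character; adjacent delimiters
    # (comma followed by space) yield empty-string columns.
    return re.split(IN_DELIM, line)

with_space = columns("u1, purchase, iphone")  # space after each comma
no_space = columns("u1,purchase,iphone")
```

With the space, column 1 is an empty string instead of "purchase", so --filter1 purchase never matches and the output ends up empty; with no spaces the columns line up as intended.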
Also, it looks like you have a space after each comma, which would make the filters not match what's in the file. If so, remove all spaces and try again. In delimited text files every character is significant. If there aren't any spaces, can you send me intro_multiple.csv?

On Nov 12, 2014, at 8:39 AM, Ariel Wolfmann <[email protected]> wrote:

Hi Pat,

First of all, thank you for your quick answer and sorry for the late response; I was having problems with the memory on my laptop, so I got access to a server (in a Docker container). I've pulled the last commit from Mahout's GitHub and updated to Spark 1.1.0. Running locally with MAHOUT_LOCAL=true, and without starting any service from Hadoop or Spark (that means running in standalone mode), I'm still getting empty output (the files part-00000, empty, and _SUCCESS).

I've tried with this line:

./bin/mahout spark-itemsimilarity --input intro_multiple.csv --output multiple --master local[4] --filter1 purchase --filter2 view --itemIDColumn 2 --rowIDColumn 0 --filterColumn 1 --inDelim ,

Running in pseudo-distributed mode (starting dfs, yarn and spark), adding hdfs://localhost:9000/ to the file paths, I got:

INFO scheduler.DAGScheduler: Failed to run collect at TextDelimitedReaderWriter.scala:76
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.4 in stage 1.0 (TID 8, ssh-aw): ExecutorLostFailure (executor lost)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)

I've tried with this line:

./bin/mahout spark-itemsimilarity --input hdfs://localhost:9000/user/ariel/intro_multiple.csv --output hdfs://localhost:9000/user/ariel/outsis --master spark://ssh-aw:7077 --filter1 purchase --filter2 view --itemIDColumn 2 --rowIDColumn 0 --filterColumn 1 --inDelim ,

Could it be anything related to Hadoop 2.4.1, or maybe the format of the input? The input file is like:

u1, purchase, iphone
u1, purchase, ipad

Any idea?

Cheers
Ariel

2014-10-29 14:51 GMT-03:00 Pat Ferrel <[email protected]>:

How recent a build are you using? I suggest the master branch on GitHub, but it was just bumped to use Spark 1.1.0, so be aware. I'd guess that there is a problem in how you are specifying your input. --input multiple_intro.csv will look in your current directory if you have Mahout set to run locally, or in your HDFS default directory if you are running Mahout in clustered mode. Check:

> env | grep MAHOUT_LOCAL   # that's a pipe character, not the letter "I"

to see if it is set. If it's not set, you are trying to get your file from HDFS. Try specifying the full HDFS path if using Hadoop in pseudo-distributed mode.

BTW, using a master of "local[4]" and setting MAHOUT_LOCAL=true (so Mahout uses local storage, not HDFS) will run the job much faster on your laptop. You can set the number of cores in "local[n]" to whatever you have. This only matters when using real data; the samples are tiny.

On Oct 29, 2014, at 8:47 AM, Ariel Wolfmann <[email protected]> wrote:

Hi Pat,

My name is Ariel Wolfmann, I'm from Cordoba, Argentina, and I study computer science. I'm very interested in using Mahout spark-itemsimilarity in my thesis, so I'm playing with that.
After reading carefully your posts on http://occamsmachete.com/ml/ and https://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html, I'm trying the examples, but I'm getting empty outputs when running:

* ./bin/mahout spark-itemsimilarity --input multiple_intro.csv --output outsis2 --master spark://ariel-Satellite-L655:7077 --filter1 purchase --filter2 view --itemIDColumn 2 --rowIDColumn 0 --filterColumn 1

where multiple_intro.csv contains things like:

u1, purchase, iphone
u1, purchase, ipad
u2, purchase, nexus

as the Mahout tutorial suggests (the first simple example I could run successfully, with no empty output), and:

* ./bin/mahout spark-itemsimilarity --input multiple_intro2.csv --output outmulti2 --master spark://ariel-Satellite-L655:7077 --filter1 like --itemIDColumn 2 --rowIDColumn 0 --filterColumn 1

where multiple_intro2.csv contains things like:

906507914, dislike, mars_needs_moms
906507914, like, moneyball

as you suggest in your post. In both cases, in the output folder I'm getting the files part-00000 and _SUCCESS, but part-00000 is empty. I'm using Mahout 1.0, Spark 1.0.1 and Hadoop 2.4.1, running Hadoop pseudo-distributed on my laptop. Do you have any idea what I'm doing wrong?

Thanks.

--
Ariel Wolfmann
