The idea is to put each of the indicators in a doc or DB row, with a field for each indicator.
If you index docs as CSV files (not DB columns) with a row per doc, you'll have the following for [AtA]:

doc-ID,item1-ID item2-ID item10-ID

and so on. The doc-ID field contains the item-ID, and the indicator field contains a space-delimited list of similar items. You add a field for each of the cross-cooccurrence indicators. You calculate [AtA] with every run of the job, but [AtB] will be built from a different action, so the IDs for A and B don't have to be the same. For instance, if you have a user's location saved periodically, you could treat the action of "being at a location" as the input to B.

You run A and B into the job and get back two text CSV sparse matrices. Combine them and index the following:

combined csv: item1000-ID,item1-ID item2-ID item10-ID ...,location1-ID location2-ID ...

This has two indicators in two fields, created by joining the [AtA] and [AtB] files. The item/doc-ID for the row always comes from A; if you look at the output that much should be clear.

Set up Elasticsearch to index the docs with two fields. Then the query uses the history of the user that you want to make recommendations for. I don't know the Elasticsearch query format, but the idea is to use the user's history of items interacted with and of locations as the query:

Q: field1:"item1-ID item2-ID item3-ID" field2:"location1-ID locationN-ID"

field1 contains the data from the primary action indicator, and its query string is the user's history of the primary action. field2 contains the location cross-cooccurrence indicators and so will hold location-IDs; its query string is the user's history of locations. The query will result in an ordered list of items to recommend. So: two fields, two different histories for the same user, one combined two-field query.

The item IDs for the two actions may also be the same. For instance, if you were recording purchases and detail-views for an ecom site, you'd feed purchase history in as A and detail-view history in as B.
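To make the join concrete, here is a toy Python sketch of combining the two indicator CSVs (row IDs taken from [AtA]) and forming the two-field query string. The function names and the field1/field2 names are illustrative assumptions, not Mahout or Elasticsearch APIs:

```python
def parse_indicator_csv(lines):
    """Each line looks like 'itemID,similar1-ID similar2-ID ...'.
    Returns a dict of item -> space-delimited similar-item string."""
    out = {}
    for line in lines:
        item_id, _, similars = line.strip().partition(",")
        out[item_id] = similars
    return out

def join_indicators(ata_lines, atb_lines):
    """Join [AtA] and [AtB] output; row IDs always come from [AtA],
    so an item missing from [AtB] just gets an empty second field."""
    ata = parse_indicator_csv(ata_lines)
    atb = parse_indicator_csv(atb_lines)
    return {item: (ata[item], atb.get(item, "")) for item in ata}

def build_query(item_history, location_history):
    """One combined two-field query: the user's primary-action history
    against field1, location history against field2."""
    return 'field1:"%s" field2:"%s"' % (
        " ".join(item_history), " ".join(location_history))

docs = join_indicators(
    ["iphone,ipad nexus", "ipad,iphone"],   # toy [AtA] output
    ["iphone,loc1 loc2"])                   # toy [AtB] output
query = build_query(["iphone", "ipad"], ["loc1"])
```

In a real setup the joined docs would be bulk-indexed into Elasticsearch and the query string handed to its query parser; the sketch only shows the data shapes involved.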
Other answers below.

On Dec 10, 2014, at 7:25 AM, Ariel Wolfmann <[email protected]> wrote:

Hi Pat,

Sorry that I continue bothering you, but in the next days I need to finish my annual report, so it would be very helpful for me if you can answer the previous email, because there are things that I haven't got clear enough.

Cheers
Ariel

2014-12-03 12:41 GMT-03:00 Ariel Wolfmann <[email protected]>:

Hi Pat,

Thank you for your answers. For now it's OK for me to run Spark locally, so now I have another question. Once I have run spark-itemsimilarity, I need to make the recommendation. As I read in your blog, I can do it using a search engine as the similarity engine; I chose Elasticsearch, but I can't understand how this works to make the recommendations. It needs to calculate rp = hp[AtA], where [AtA] is the spark-itemsimilarity output (the output will be that only if I set it to calculate all the similarities, not just the top 100, right?)

hp[AtA] is actually what the search engine is doing. It is calculating the cosine-based similarity of your query with the indicator field you point it to. This is, very loosely speaking, = hp[AtA]. The number of results returned is defined in your config for the search engine and may even allow batched-up cursor-type queries; again, I haven't used Elasticsearch. But it's a search engine thing, not defined by the algorithm.

In most cases you will want to filter out the recommended items that are already known to the user. So usually you don't want to show them things they have already purchased.

How should I index this output? I suppose that I need to store the similarity calculated by Mahout with the itemId as the indicator, but in the example in the blog the indicators are just the names of the most similar items.
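Very loosely, the hp[AtA] scoring plus the "filter out known items" step can be sketched like this. This toy version just counts co-occurrence hits per candidate item rather than computing Elasticsearch's actual TF-IDF/cosine score, so treat it as an illustration of the shape of the computation, not the real ranking function:

```python
def recommend(history, indicators, known):
    """history: item IDs the user interacted with (hp).
    indicators: item -> list of similar items (the [AtA] rows).
    known: items to filter out, e.g. already purchased.
    Ranks candidates by how many history items point at them."""
    scores = {}
    for h in history:
        for similar in indicators.get(h, []):
            scores[similar] = scores.get(similar, 0) + 1
    # Sort by score descending, breaking ties alphabetically.
    ranked = sorted(scores, key=lambda i: (-scores[i], i))
    return [i for i in ranked if i not in known]

indicators = {
    "iphone": ["ipad", "nexus"],
    "ipad": ["iphone", "nexus"],
}
recs = recommend(["iphone", "ipad"], indicators,
                 known={"iphone", "ipad"})
```

Here "nexus" scores highest because both history items co-occur with it, and the items the user already has are dropped from the result.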
If I only pay attention to the most similar items, this will not calculate hp[AtA]. For example, there could be an item that is not in the top similar items for the items that the user viewed, but that has enough similarity to be a good recommendation when you sum the similarities from this item to the viewed items.

Nope, as described above the indicator is just the most similar items (based on which users preferred them) for the simple case. The IDs are actually strings, and it's assumed they have no spaces in them so the search engine won't get confused and break one ID into multiple tokens. The number of similar items in the indicator is configurable, but it's a good start to leave it at the default of 500. There are very quickly diminishing returns from using more, and by "downsampling" based on strength of similarity you tend to get the "best" similar items in that 500. Setting this too high will cause a much longer runtime and will give very little return in quality.

The code I wrote for Elasticsearch is like this: http://paste.ubuntu.com/9355001/ Also, I read that there are some similarity modules for Elasticsearch, but it is not clear to me whether they can help or not. As you can see, I'm just starting to learn about Elasticsearch and NoSQL search engines, so this part of the "tutorial" is not clear enough for me.

Another question, but less important: how do I do this with more than one cross-similarity? The blog says that you can run spark-itemsimilarity several times with the different kinds of interactions, but how should I combine these results to make a multiple cross-recommendation?

Answered above.

I'll really appreciate your answer.

Regards
Ariel

2014-11-14 14:23 GMT-03:00 Pat Ferrel <[email protected]>:

This is a Spark error, not Hadoop. Some temp file that Spark creates is not found in the raw file system. We are having a lot of problems with running the downloaded jars for Spark.
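The downsampling idea, keeping only the strongest ~500 similar items per indicator row, can be sketched as follows. The (item, strength) pair format and the downsample helper are illustrative assumptions; Mahout does this internally during the job, typically using LLR as the strength measure:

```python
def downsample(similars, max_per_row=500):
    """similars: list of (item_id, strength) pairs for one indicator row.
    Keep only the strongest max_per_row entries; beyond that there are
    quickly diminishing returns in quality but a much longer runtime.
    500 mirrors the default mentioned above."""
    ranked = sorted(similars, key=lambda pair: -pair[1])
    return [item for item, _ in ranked[:max_per_row]]

row = [("ipad", 4.2), ("nexus", 1.1), ("surface", 3.0)]
top2 = downsample(row, max_per_row=2)
```

The weakest entry ("nexus") is dropped; only the strongest similar items survive into the indexed indicator field.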
When you go from local[4] (where you are running Spark as it was linked in by Maven) to spark://ssh-aw:7077 (where you launch Spark with your local machine as master and slave) you are using two different sets of code. Did you build Spark? What version of Hadoop are you using?

BTW, I don't understand why you are doing these things; you get no benefit. The jobs will complete much faster if you use local[4] than if you go through pseudo-distributed Spark and Hadoop. It's true that you'd see a similar error if you ran this on a real cluster, so eventually you'll need to solve the problem. Try the process described in the second FAQ entry here, where you build Spark locally: https://mahout.apache.org/users/sparkbindings/faq.html The Mahout build now defaults to Hadoop 2: https://mahout.apache.org/developers/buildingmahout.html

On Nov 14, 2014, at 7:02 AM, Ariel Wolfmann <[email protected]> wrote:

Hi Pat,

Thank you very much for your help. After removing the spaces it worked fine in standalone mode.
With Hadoop in pseudo-distributed mode it also worked fine, running this command:

./bin/mahout spark-itemsimilarity --input hdfs://localhost:9000/user/ariel/intro_multiple.csv --output hdfs://localhost:9000/user/ariel/multiple --master local[4] --filter1 purchase --filter2 view --itemIDColumn 2 --rowIDColumn 0 --filterColumn 1

But trying also with Spark in pseudo-distributed mode I got an error.

The commands:

export MASTER=spark://ssh-aw:7077
./bin/mahout spark-itemsimilarity --input hdfs://localhost:9000/user/ariel/intro_multiple.csv --output hdfs://localhost:9000/user/ariel/outsis2 --master spark://ssh-aw:7077 --filter1 purchase --filter2 view --itemIDColumn 2 --rowIDColumn 0 --filterColumn 1

The error:

INFO scheduler.DAGScheduler: Failed to run collect at TextDelimitedReaderWriter.scala:76
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, ssh-aw): java.io.FileNotFoundException: /tmp/spark-local-20141114143728-65fb/0d/shuffle_0_0_1 (No such file or directory)

In /tmp/ related to Spark I have these files:

spark-aw-org.apache.spark.deploy.master.Master-1.pid
spark-aw-org.apache.spark.deploy.worker.Worker-1.pid

It seems that maybe I need to set up some other variables to run Spark pseudo-distributed? I just run ~/spark-1.1.0/sbin/start-all.sh

Running with Hadoop pseudo-distributed is good enough for now, because I'm just playing with small examples, so if you think this is a simple error, tell me how to solve it; if not, this is enough for me. I also take your tip that running Hadoop standalone is faster than pseudo-distributed.

Once again, thank you very much!

Cheers
Ariel

2014-11-12 14:31 GMT-03:00 Pat Ferrel <[email protected]>:

Let's solve the empty file problem first. There's no need to specify the inDelim, which is [ ,\t] by default (that's a space, comma, or tab).
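That default delimiter set is exactly why stray spaces are dangerous: with space, comma, and tab all acting as delimiters, a space after each comma produces an empty token where the filter value should be. A toy sketch, assuming the delimiters behave like a plain regex split (Mahout's actual tokenization may differ in detail):

```python
import re

IN_DELIM = r"[ ,\t]"  # the spark-itemsimilarity default described above

def columns(line):
    # Split on every single delimiter character; adjacent delimiters
    # (comma followed by space) yield empty-string columns.
    return re.split(IN_DELIM, line)

with_space = columns("u1, purchase, iphone")  # space after each comma
no_space = columns("u1,purchase,iphone")
```

With the space, column 1 is an empty string instead of "purchase", so --filter1 purchase never matches and the output ends up empty; with no spaces the columns line up as intended.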
Also, it looks like you have a space after each comma, which would make the filters not match what's in the file. If so, remove all spaces and try again. In delimited text files every character is significant. If there aren't any spaces, can you send me intro_multiple.csv?

On Nov 12, 2014, at 8:39 AM, Ariel Wolfmann <[email protected]> wrote:

Hi Pat,

First of all, thank you for your quick answer and sorry for the late response; I was having problems with the memory on my laptop, so I got access to a server (in a Docker container). I've pulled the last commit from Mahout's GitHub and updated to Spark 1.1.0. Running locally with MAHOUT_LOCAL=true, and without starting any service from Hadoop or Spark (that means running in standalone mode), I'm still getting empty output (the files part-00000, empty, and _SUCCESS).

I've tried with this line:

./bin/mahout spark-itemsimilarity --input intro_multiple.csv --output multiple --master local[4] --filter1 purchase --filter2 view --itemIDColumn 2 --rowIDColumn 0 --filterColumn 1 --inDelim ,

Running in pseudo-distributed mode (starting dfs, yarn and spark), adding hdfs://localhost:9000/ to the file paths, I got:

INFO scheduler.DAGScheduler: Failed to run collect at TextDelimitedReaderWriter.scala:76
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.4 in stage 1.0 (TID 8, ssh-aw): ExecutorLostFailure (executor lost)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)

I've tried with this line:

./bin/mahout spark-itemsimilarity --input hdfs://localhost:9000/user/ariel/intro_multiple.csv --output hdfs://localhost:9000/user/ariel/outsis --master spark://ssh-aw:7077 --filter1 purchase --filter2 view --itemIDColumn 2 --rowIDColumn 0 --filterColumn 1 --inDelim ,

Could it be anything related to Hadoop 2.4.1, or maybe the format of the input? The input file is like:

u1, purchase, iphone
u1, purchase, ipad

Any idea?

Cheers
Ariel

2014-10-29 14:51 GMT-03:00 Pat Ferrel <[email protected]>:

How recent a build are you using? I suggest the master branch on GitHub, but it was just bumped to use Spark 1.1.0, so be aware. I'd guess that there is a problem in how you are specifying your input. --input multiple_intro.csv will look in your current directory if you have Mahout set to run locally, or in your HDFS default directory if you are running Mahout in clustered mode. Check:

> env | grep MAHOUT_LOCAL   # that's a pipe character, not the letter "I"

to see if it is set. If it's not set, you are trying to get your file from HDFS. Try specifying the full HDFS path if using Hadoop in pseudo-distributed mode.

BTW, using a master of "local[4]" and setting MAHOUT_LOCAL=true (so Mahout uses local storage, not HDFS) will run the job much faster on your laptop. You can set the number of cores in "local[n]" to whatever you have. This only matters when using real data; the samples are tiny.

On Oct 29, 2014, at 8:47 AM, Ariel Wolfmann <[email protected]> wrote:

Hi Pat,

My name is Ariel Wolfmann, I'm from Cordoba, Argentina, and I study computer science. I'm very interested in using Mahout spark-itemsimilarity in my thesis, so I'm playing with that.
After reading carefully your posts on http://occamsmachete.com/ml/ and https://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html, I'm trying the examples, but I'm getting empty outputs when running:

* ./bin/mahout spark-itemsimilarity --input multiple_intro.csv --output outsis2 --master spark://ariel-Satellite-L655:7077 --filter1 purchase --filter2 view --itemIDColumn 2 --rowIDColumn 0 --filterColumn 1

where multiple_intro.csv contains things like:

u1, purchase, iphone
u1, purchase, ipad
u2, purchase, nexus

as the Mahout tutorial suggests (the first simple example I could run successfully, with no empty output), and:

* ./bin/mahout spark-itemsimilarity --input multiple_intro2.csv --output outmulti2 --master spark://ariel-Satellite-L655:7077 --filter1 like --itemIDColumn 2 --rowIDColumn 0 --filterColumn 1

where multiple_intro2.csv contains things like:

906507914, dislike, mars_needs_moms
906507914, like, moneyball

as you suggest in your post. In both cases, in the output folder I'm getting the files part-00000 and _SUCCESS, but part-00000 is empty. I'm using Mahout 1.0, Spark 1.0.1 and Hadoop 2.4.1, running Hadoop pseudo-distributed on my laptop. Do you have any idea what I'm doing wrong?

Thanks.

--
Ariel Wolfmann
