There used to be an online page on mahout.apache.org that Pat Ferrel put together a few years ago. Not sure if it's still around. Pat?
If not, I can write up more detailed steps later today and send it your way.

On Thu, May 14, 2015 at 2:18 PM, Jonathan Seale wrote:

> Thanks, guys. Can you recommend any resources that show an example of these
> steps? A Google search returns very little information. Now I know what to
> do, but I can't find anything that tells me how to do it.
>
> On Wed, May 13, 2015 at 11:56 PM, Suneel Marthi wrote:
>
> > Hi Jonathan,
> >
> > Here's what you gotta do to run RowSimilarity on your CSV formatted data.
> > You would have to use the MapReduce version since the Spark version only
> > supports LLR.
> >
> > 1. Convert CSV to Vectors - use CSVIterator and store the vectors as
> >    SequenceFiles.
> > 2. Run RowIDJob on the SequenceFile output of (1). This should generate a
> >    Matrix of <IntWritable, VectorWritable> and a docIndex of
> >    <IntWritable, Text>.
> > 3. Run RowSimilarityJob on the matrix output from (2), specifying
> >    CosineDistance and a cutoff threshold. This should generate a matrix of
> >    Rows -> Most similar rows with distances.
> >
> > On Wed, May 13, 2015 at 11:42 PM, Jonathan Seale wrote:
> >
> > > Thanks, Charlie,
> > >
> > > The data has been through lots of processing, but in an attempt to make
> > > it more Mahout-friendly, I've converted it into a single CSV table with
> > > columns: star_id, wavelength, intensity. My motivation was to make it
> > > like a user_id, item_id, rating table you might see in other Mahout uses.
> > >
> > > As opposed to using my local machine, I've set up an instance on Amazon
> > > with hopes of turning this into a remote service. So the install is
> > > whatever comes with Amazon's default Mahout installation.
> > >
> > > Jonathan
> > >
> > > On Wed, May 13, 2015 at 11:29 PM, Charlie Hack wrote:
> > >
> > > > Hi Jonathan, how do you have the data stored? The more info about
> > > > your setup, the better.
> > > >
> > > > Charlie
> > > >
> > > > On Wednesday, May 13, 2015 at 23:16, Jonathan Seale wrote:
> > > >
> > > > Scientists,
> > > >
> > > > I have an astrophysical application for Mahout that I need help with.
> > > >
> > > > I have 1-dimensional stellar spectra for many, many stars. Each
> > > > spectrum consists of a series of intensity values, one per wavelength
> > > > of light. I need to be able to find the cosine similarity between ALL
> > > > pairs of stars. Seems to me this is simply a user-user similarity
> > > > problem where I have stars instead of users, wavelengths instead of
> > > > items, and intensities instead of ratings/clicks.
> > > >
> > > > But I'm having difficulty using Mahout's row similarity package (I'm
> > > > new to this, and these days astronomers code pretty exclusively in
> > > > Python). I know that I must 1) create a sparse matrix where each row
> > > > is a star, columns are wavelengths, and the values are intensity, and
> > > > 2) implement row similarity. But I'm just not sure how to do it.
> > > > Anyone have a good resource or be willing to help? I could probably
> > > > offer some compensation to anyone that would be willing to provide a
> > > > little focussed, personalized assistance.
> > > >
> > > > Thanks,
> > > >
> > > > Jonathan
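
For anyone following the thread, here is a rough Python sketch of what the three steps quoted above end up computing: group the (star_id, wavelength, intensity) rows into one sparse vector per star, then take the cosine similarity of every pair of stars and keep the pairs at or above a cutoff. This is not Mahout code and does not call any Mahout API. The function names (load_rows, cosine, row_similarity) and the sample data are made up for illustration, and it assumes a CSV with no header row and exactly those three columns.

```python
import csv
import io
import math
from collections import defaultdict


def load_rows(csv_text):
    """Group (star_id, wavelength, intensity) triples into one sparse row per star."""
    rows = defaultdict(dict)
    for star_id, wavelength, intensity in csv.reader(io.StringIO(csv_text)):
        rows[star_id][float(wavelength)] = float(intensity)
    return dict(rows)


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {column: value} dicts."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero row as similar to nothing
    return dot / (norm_a * norm_b)


def row_similarity(rows, threshold=0.0):
    """Return {(id_a, id_b): similarity} for each unordered pair at or above threshold."""
    ids = sorted(rows)
    result = {}
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            sim = cosine(rows[a], rows[b])
            if sim >= threshold:
                result[(a, b)] = sim
    return result


# Made-up sample data: star2 is a scaled copy of star1, star3 shares no wavelengths.
SAMPLE = """\
star1,400.0,1.0
star1,500.0,2.0
star2,400.0,2.0
star2,500.0,4.0
star3,600.0,3.0
"""

if __name__ == "__main__":
    for pair, sim in row_similarity(load_rows(SAMPLE)).items():
        print(pair, round(sim, 4))
```

The double loop visits every pair of stars, so this is only practical for checking results on a small subset of the data. For the full catalogue, the MapReduce job described in the thread is the intended route.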
