Re: Document Similarity

2012-07-30 Thread in.abdul
I understand your need. You can use k-means clustering in Mahout, which can help in your case. It would be better to post this question on the Mahout user list, where you will get different ideas. I also had a use case like this, which I did as a POC. But still, my suggestion is that you can post this question

RE: Document Similarity

2012-07-30 Thread Elshaimaa Ali
Thank you so much for the prompt reply. I need to extract a document from the index that is similar to an HTML document, and I need to use cosine similarity or latent semantic analysis, which means that I need to generate a term vector for the HTML document. The link you sent me doesn't contain any
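
For readers following the thread, here is a minimal sketch of what "generate a term vector for the HTML document and compare it by cosine similarity" could look like. It is plain Python using only the standard library, not Lucene's API and not the poster's actual approach. The helper names `term_vector` and `cosine_similarity` are hypothetical, the tokenizer is a crude lowercase alphanumeric split, and raw term counts are used as weights (an assumption; a real index would more likely use TF-IDF).

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Note: this also picks up text inside <script>/<style>, if present.
        self.chunks.append(data)


def term_vector(html):
    """Return a term -> raw count mapping for the text of an HTML string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    # Assumption: terms are runs of ASCII letters and digits.
    return Counter(re.findall(r"[a-z0-9]+", text))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

To rank candidates, one would build a vector for the query HTML document, compute `cosine_similarity` against the vector of each indexed document, and keep the highest-scoring ones.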

Re: Document Similarity

2012-07-30 Thread in.abdul
Hi Elshaimaa, I couldn't understand what your need is. Can you please explain your use case? If the case is "I need to use Lucene to find the most similar documents from the generated index", then go for the MoreLikeThis[1] component. Based on your use case, people can suggest some good wa

Small Vocabulary

2012-07-30 Thread Carsten Schnober
Dear list, I'm considering using Lucene for indexing sequences of part-of-speech (POS) tags instead of words; for those who don't know, POS tags are linguistically motivated labels that are assigned to tokens (words) to describe their morpho-syntactic function. Instead of sequences of words, I would