On Jun 28, 2013, at 2:29 PM, Emmanuel Espina wrote:
> I'm building a distributed index (mostly as a research project for
> school) and I'm evaluating indexing the entire collection in memory
> (like Google, Facebook and others did years ago). The obvious
> reason for this is performance c
I'm building a distributed index (mostly as a research project for
school) and I'm evaluating indexing the entire collection in memory
(like Google, Facebook and others did years ago). The obvious
reason for this is performance considering that the replication will
give me a reasonably good
You can add PatternReplaceFilter
(http://lucene.apache.org/core/4_3_1/analyzers-common/org/apache/lucene/analysis/pattern/PatternReplaceFilter.html)
to replace tokens consisting only of digits with their variant with leading
zeros removed.
Uwe
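A minimal sketch of the kind of pattern such a filter might be given, written with plain java.util.regex so it runs without Lucene on the classpath. The class name, method name and the regex itself are assumptions for illustration; the thread does not say which pattern to use.

```java
import java.util.regex.Pattern;

// Hypothetical helper: demonstrates the regex only, not the Lucene filter itself.
class LeadingZeros {
    // Matches the leading zeros of a token made up entirely of digits,
    // always leaving at least one digit behind (so "000" becomes "0").
    static final Pattern LEADING_ZEROS = Pattern.compile("^0+(?=\\d+$)");

    static String strip(String token) {
        return LEADING_ZEROS.matcher(token).replaceFirst("");
    }
}
```

Tokens that are not purely numeric, such as "0012ab", are left unchanged by this pattern. The same Pattern and an empty replacement string are what one would presumably hand to PatternReplaceFilter.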
Jack Krupansky schrieb:
>The user could use
The user could use a regular expression query to match the numbers, but
otherwise, you will have to write some specialized token filter to recognize
numeric tokens and generate extra tokens at the same position for each token
variant that you want to search for.
-- Jack Krupansky
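The "extra tokens at the same position" idea could be sketched as follows, independent of Lucene's TokenFilter API. The Tok class, the expand method and the choice of variants (only the form without leading zeros) are hypothetical; a real filter would express the same thing through position increments on the token stream.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of stacking a numeric variant on the original token's position.
class NumericVariants {
    // A term plus its position increment; 0 means "same position as the previous token".
    static final class Tok {
        final String term;
        final int posInc;
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
    }

    static List<Tok> expand(List<String> tokens) {
        List<Tok> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(new Tok(t, 1));             // the original token advances the position
            if (t.matches("0+\\d+")) {          // all digits, with at least one leading zero
                String stripped = t.replaceFirst("^0+(?=\\d)", "");
                out.add(new Tok(stripped, 0));  // variant stacked on the same position
            }
        }
        return out;
    }
}
```

With both forms indexed at one position, a search for either "00123" or "123" would match the same spot in the document, and phrase queries across it keep working.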
-Origina
I have an application that is indexing the text from various reports and forms
that are generated from our core system. The reports will contain dollar
amounts and various indexes that contain all numbers, but have leading zeros.
If a document contains the following text that is stored in one
I only have about a million docs right now so scaling is not a big issue.
I'm looking to provide a quick implementation and then worry about scale
when I get around to implementing a more robust recommender. I'm looking at
a content based approach because we are not tracking users and items viewed
Hi,
It doesn't have to be one or the other. In the past I've built a news
recommender engine based on CF (Mahout) and combined it with Content
Similarity-based engine (wasn't Solr/Lucene, but something custom that
worked with ngrams, but it may have as well been Lucene/Solr/ES). It
worked well.
More Like This already is kNN. It extracts features from the document (makes a
query), and runs that query against the collection.
If you want the items most similar to the current item, use MLT.
wunder
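The "extract features, make a query" step can be sketched roughly like this. It is not Lucene's MoreLikeThis implementation; the class, the method signature and the plain tf * idf weighting are assumptions chosen to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of picking query terms from a document; not Lucene's MoreLikeThis.
class MltSketch {
    // Returns the maxTerms terms of doc with the highest tf * idf weight.
    // docFreq maps a term to the number of documents containing it;
    // numDocs is the size of the collection.
    static List<String> queryTerms(List<String> doc, Map<String, Integer> docFreq,
                                   int numDocs, int maxTerms) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : doc) {
            tf.merge(term, 1, Integer::sum);
        }
        Map<String, Double> weight = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 1);
            double idf = Math.log((double) numDocs / df);
            weight.put(e.getKey(), e.getValue() * idf);
        }
        List<String> terms = new ArrayList<>(weight.keySet());
        // Highest weight first; ties are broken alphabetically for a deterministic result.
        terms.sort((a, b) -> {
            int c = Double.compare(weight.get(b), weight.get(a));
            return c != 0 ? c : a.compareTo(b);
        });
        return terms.subList(0, Math.min(maxTerms, terms.size()));
    }
}
```

The selected terms would then be combined into a query and run against the collection; the top hits are the nearest neighbours of the current item.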
On Jun 28, 2013, at 11:02 AM, Luis Carlos Guerrero Covo wrote:
> Hey saikat, thanks for you
You could build a custom recommender in Mahout to accomplish this. Also, just
out of curiosity, why the content-based approach as opposed to building a
recommender based on co-occurrence? One other thing: what is your data size?
Are you looking at a scale where you need something like Hadoop?
> Fro
Hi,
Have a look at http://www.youtube.com/watch?v=13yQbaW2V4Y . I'd say
it's easier than Mahout, especially if you already have and know your
way around Solr.
Otis
--
Solr & ElasticSearch Support -- http://sematext.com/
Performance Monitoring -- http://sematext.com/spm
On Fri, Jun 28, 2013 at
Hey saikat, thanks for your suggestion. I've looked into mahout and other
alternatives for computing k nearest neighbors. I would have to run a job
and compute the k nearest neighbors and track them in the index for
retrieval. I wanted to see if this was something I could do with lucene
using luce
Why not just use Mahout to do this? There is an item similarity algorithm in
Mahout that does exactly this :)
https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/hadoop/similarity/item/ItemSimilarityJob.html
You can use mahout in distributed and non-distributed mode a
Hi,
I'm using lucene and solr right now in a production environment with an
index of about a million docs. I'm working on a recommender that basically
would list the n most similar items to the user based on the current item
he is viewing.
I've been thinking of using solr/lucene since I already h
I am using MultiSimilarity to compute CombSum and I have noticed that the
computeNorm() method takes the value of the first Similarity in the array
of similarities. Is it safe to use MultiSimilarity with similarities that
have different computeNorm() implementations?
Also, I would like to perform