Re: Wikia search goes live today

2008-01-07 Thread Lukas Vlcek
BTW: 1) If they have made any improvements/changes to Nutch (or Lucene/Hadoop) code and they keep it closed, then how can they claim they are using open-source algorithms? 2) Wouldn't it be too expensive for them to keep their changes closed going forward? How about if Nutch changes significantly i

Re: Wikia search goes live today

2008-01-07 Thread Lukas Vlcek
This would be great! I am particularly interested in how they are going about customized search (if they have a plan to do it). I mean, if they can reorder raw search results based on some kind of collective knowledge (which is probably kept outside of the Lucene index - at least that is what I can see fr

Re: Deleting a single TermPosition for a Document

2008-01-07 Thread Otis Gospodnetic
Is your user field stored? If so, you could find the target Document, get the user field value, modify it, and re-add it to the Document (or something close to this -- I am doing this with one of the indices on simpy.com and it's working well). Otis -- Sematext -- http://sematext.com/ -- Luce
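
A minimal sketch of what Otis describes, assuming Lucene 2.x-era classes and illustrative field names ("id", "user"); note that only stored fields survive the round-trip through the retrieved Document, so every field you want to keep has to be stored (or rebuilt from the original source):

    import org.apache.lucene.document.*;
    import org.apache.lucene.index.*;
    import org.apache.lucene.search.*;

    // directory, docId and analyzer are assumed to be defined elsewhere.
    // Look up the target document by a unique key.
    IndexSearcher searcher = new IndexSearcher(directory);
    Hits hits = searcher.search(new TermQuery(new Term("id", docId)));
    Document doc = hits.doc(0);

    // Modify the stored "user" value (modify() is a hypothetical helper).
    String user = doc.get("user");
    doc.removeField("user");
    doc.add(new Field("user", modify(user), Field.Store.YES, Field.Index.TOKENIZED));

    // "Update" = delete the old copy and re-add the changed one.
    IndexWriter writer = new IndexWriter(directory, analyzer, false);
    writer.deleteDocuments(new Term("id", docId));
    writer.addDocument(doc);
    writer.close();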

Deleting a single TermPosition for a Document

2008-01-07 Thread Antony Bowesman
I'd like to 'update' a single Document in a Lucene index. In practice, this 'update' is actually just a removal of a single TermPosition for a given Term for a given doc Id. I don't think this is currently possible, but would it be easy to change Lucene to support this type of usage? The re

Re: Question regarding adding documents

2008-01-07 Thread Daniel Noll
On Tuesday 08 January 2008 00:52:35 Developer Developer wrote:
> here is another approach.
>
> StandardAnalyzer st = new StandardAnalyzer();
> StringReader reader = new StringReader("text to index...");
> TokenStream stream = st.tokenStream("content", reader);
>
> Then use the Field

Re: OutOfMemoryError on small search in large, simple index

2008-01-07 Thread Yonik Seeley
On Jan 7, 2008 5:00 AM, Lars Clausen <[EMAIL PROTECTED]> wrote: > Doesn't appear to be the case in our test. We had two fields with > norms, omitting saved only about 4MB for 50 million entries. It should be 50MB. If you are measuring with an external tool, then that tool is probably in error.
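
(For reference on the arithmetic behind Yonik's figure: norms take one byte per document per field, so 50,000,000 documents * 1 byte ≈ 50 MB per normed field, i.e. roughly 100 MB for the two fields mentioned.)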

Re: OutOfMemoryError on small search in large, simple index

2008-01-07 Thread Otis Gospodnetic
Please post your results, Lars! Thanks, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Lars Clausen <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Monday, January 7, 2008 5:00:54 AM Subject: Re: OutOfMemoryError on small search in la

Re: Wikia search goes live today

2008-01-07 Thread Otis Gospodnetic
See my comment (around #45-50) on Techcrunch about that from late last night. There is actually one Wikia guy helping Nutch - Dennis Kubes. He must have been hitting reload on that TC post, because he IMed me quickly after I posted my comment and clarified that he is that Wikia developer I was

Re: Using Lucene with Jarowinkler

2008-01-07 Thread Chris Lu
Hi, Shivani, From my understanding, Jaro-Winkler doesn't quite fit with Lucene's structure. Calculating the Jaro-Winkler distance for the query against each word in the index is quite computationally intensive. What may be possible is using Soundex, Metaphone, Double Metaphone, etc., instead. For each word
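
A rough sketch of the phonetic-key idea (the use of Commons Codec's DoubleMetaphone is an assumption here; Soundex or Metaphone would be wired in the same way): store a phonetic key next to each name at index time, search on the key to get a small candidate set cheaply, and only then rescore with Jaro-Winkler outside Lucene if needed.

    import org.apache.commons.codec.language.DoubleMetaphone;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.*;

    DoubleMetaphone encoder = new DoubleMetaphone();

    // Index time: keep the raw name plus its phonetic key.
    Document doc = new Document();
    doc.add(new Field("name", name, Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("name_phonetic", encoder.doubleMetaphone(name),
                      Field.Store.NO, Field.Index.UN_TOKENIZED));

    // Query time: a cheap exact lookup on the phonetic key narrows the
    // candidates before any expensive per-name similarity scoring.
    Query q = new TermQuery(new Term("name_phonetic", encoder.doubleMetaphone(input)));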

Re: Wikia search goes live today

2008-01-07 Thread Grant Ingersoll
One other thing to note: you can definitely see Lucene in action (or Nutch, that is) by clicking on the score returned for a given document (try searching for Lucene), and you see, in all its glory, the Lucene explain results... It even displays the Nutch logo, which makes me wonder if the

Re: Using Lucene with Jarowinkler

2008-01-07 Thread Grant Ingersoll
FuzzyQuery uses EditDistance, you probably could create a JaroWinklerQuery that mimics FuzzyQuery but calculates the JaroWinkler score instead of the edit distance. As for dealing with phrases, that would get a bit more complex, but you may be able to use PhraseQuery as an example and then
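
A rough sketch of the direction Grant suggests, using the Lucene 2.x TermEnum API and a jaroWinkler(a, b) similarity helper that is assumed here (hand-rolled or taken from a library): walk the indexed terms of the field, keep those whose Jaro-Winkler score clears a threshold, and OR them together, much as a fuzzy query expands over candidate terms.

    import org.apache.lucene.index.*;
    import org.apache.lucene.search.*;

    // jaroWinkler(a, b) is assumed to return a similarity in [0, 1];
    // directory and queryText are assumed to be defined elsewhere.
    IndexReader reader = IndexReader.open(directory);
    BooleanQuery query = new BooleanQuery();
    TermEnum terms = reader.terms(new Term("name", ""));
    try {
        do {
            Term t = terms.term();
            if (t == null || !t.field().equals("name")) break;
            if (jaroWinkler(queryText, t.text()) >= 0.85f) {
                query.add(new TermQuery(t), BooleanClause.Occur.SHOULD);
            }
        } while (terms.next());
    } finally {
        terms.close();
    }

A real version would want to cap the number of matching terms to stay under BooleanQuery's clause limit.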

Re: Merging Lucene documents

2008-01-07 Thread Developer Developer
Eric, Thank you very much for the insight on offsets. I think I may not really need to worry about offsets. Nonetheless, I solved my offset problem by overriding the Java StringReader class instead of overriding the TokenStream class. The StringReader class does the streaming nicely, thus solving the o

Re: Question regarding adding documents

2008-01-07 Thread Developer Developer
here is another approach.

    StandardAnalyzer st = new StandardAnalyzer();
    StringReader reader = new StringReader("text to index...");
    TokenStream stream = st.tokenStream("content", reader);

Then use the Field constructor such as Field
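
Presumably the truncated reference is to the Field(String, TokenStream) constructor, which indexes a pre-built token stream as a tokenized, un-stored field; a minimal completion of the snippet above (writer is assumed to be an open IndexWriter):

    import java.io.StringReader;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.*;

    StandardAnalyzer st = new StandardAnalyzer();
    StringReader reader = new StringReader("text to index...");
    TokenStream stream = st.tokenStream("content", reader);

    Document doc = new Document();
    doc.add(new Field("content", stream));   // the pre-analyzed tokens get indexed
    writer.addDocument(doc);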

Using Lucene with Jarowinkler

2008-01-07 Thread Shivani Sawhney
Hi All, I am using Jaro-Winkler scoring in my current project for matching names. The database of names against which the input value has to be matched is huge, and thus we are faced with performance issues. We now want Lucene to help us here; we want Lucene's speed for handling huge data

Re: Wikia search goes live today

2008-01-07 Thread Grant Ingersoll
On Jan 7, 2008, at 7:48 AM, Lukas Vlcek wrote: Hi, I noticed that Wikia search goes live today (see http://www.devxnews.com/article.php/3719906). Does anybody know where I could find more technical information about their solution? Are they going to contribute their enhancements back to Luc

Wikia search goes live today

2008-01-07 Thread Lukas Vlcek
Hi, I noticed that Wikia search goes live today (see http://www.devxnews.com/article.php/3719906). Does anybody know where I could find more technical information about their solution? Are they going to contribute their enhancements back to Lucene/Nutch/Hadoop code? My understanding is that as lon

Re: Question regarding adding documents

2008-01-07 Thread Doron Cohen
Or, very similarly, wrap the 'real' analyzer A with your own analyzer that delegates to A but also keeps the returned tokens, possibly by using a CachingTokenFilter. On Jan 7, 2008 7:11 AM, Daniel Noll <[EMAIL PROTECTED]> wrote: > On Monday 07 January 2008 11:35:59 chris.b wrote: > > is it possible to
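
A minimal sketch of Doron's suggestion, assuming the Lucene 2.x token API: pass the real analyzer's stream through a CachingTokenFilter, consume it once to look at the tokens, then reset() it so the cached tokens can be replayed into a Field without re-analyzing the text.

    import java.io.StringReader;
    import org.apache.lucene.analysis.*;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.*;

    // text is the raw document text, assumed to be defined elsewhere.
    Analyzer real = new StandardAnalyzer();
    TokenStream ts = real.tokenStream("content", new StringReader(text));
    CachingTokenFilter cached = new CachingTokenFilter(ts);

    // First pass: the filter caches each token as it is read.
    for (Token tok = cached.next(); tok != null; tok = cached.next()) {
        System.out.println(tok.termText());   // inspect or collect the tokens
    }

    // Second pass: reset() replays the cached tokens for indexing.
    cached.reset();
    Document doc = new Document();
    doc.add(new Field("content", cached));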

Re: LUCENE and UIMA

2008-01-07 Thread Jesiel Trevisan
I don't know UIMA, but you can use Nutch to create an index for Lucene's search. Nutch is quite easy to use; it implements both the indexing and the search functions. You can find it on the Lucene home page. Tks. On Jan 7, 2008 8:57 AM, vincenzo iafelice <[EMAIL PROTECTED]> wrote: > Hi all, > > i am a new user of

LUCENE and UIMA

2008-01-07 Thread vincenzo iafelice
Hi all, I am a new user of Lucene and I have a little problem. I used UIMA to create a search index for a document collection, so my question is: can I use this index with Lucene? Thanks Vincenzo

Re: OutOfMemoryError on small search in large, simple index

2008-01-07 Thread Lars Clausen
On Tue, 2008-01-01 at 23:38 -0800, Chris Hostetter wrote: > : On Wed, 2007-12-12 at 11:37 +0100, Lars Clausen wrote: > > : Seems there's a reason we still use all this memory: > : SegmentReader.fakeNorms() creates the full-size array for us anyway, so > : the memory usage cannot be avoided as lon