Bucketing (was Re: Wikia search goes live today)

2008-01-08 Thread Otis Gospodnetic
Sounds useful. I suppose this means one would have a custom function for within-bucket reordering? E.g. for a web search you might reorder based on the URL length if you think shorter URLs are an indicator of higher quality. It also sounds like something that can easily sit outside Lucene or
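
A minimal sketch of the within-bucket reordering idea above, in plain Java with no Lucene dependency. The Hit class, the fixed-width score buckets, and the bucket width are assumptions made for illustration only; the thread does not say how buckets are formed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class BucketReorder {
    // Hypothetical search hit: a URL plus a relevance score.
    static final class Hit {
        final String url;
        final double score;

        Hit(String url, double score) {
            this.url = url;
            this.score = score;
        }
    }

    // Assumed bucketing rule: hits whose scores fall into the same
    // fixed-width interval share a bucket.
    static long bucketOf(Hit h, double width) {
        return (long) Math.floor(h.score / width);
    }

    // Order by bucket (highest first), then by URL length (shortest first)
    // inside each bucket. The input list is left unmodified.
    static List<Hit> reorder(List<Hit> hits, double width) {
        List<Hit> out = new ArrayList<>(hits);
        out.sort(Comparator
                .comparingLong((Hit h) -> bucketOf(h, width)).reversed()
                .thenComparingInt(h -> h.url.length()));
        return out;
    }

    public static void main(String[] args) {
        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("http://example.com/a/very/long/path", 0.93));
        hits.add(new Hit("http://example.com/", 0.91));
        hits.add(new Hit("http://example.org/x", 0.55));
        for (Hit h : reorder(hits, 0.1)) {
            System.out.println(h.score + " " + h.url);
        }
    }
}
```

Because the reordering only needs the hit list and its scores, a step like this could run on the results after the search, which fits the remark that it can sit outside the index.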

Re: Wikia search goes live today

2008-01-08 Thread Dennis Kubes
Sorry about not responding to this before now, been a little busy :). For those of you who don't know me, I am a committer on the Nutch project. I have been working with Wikia since early July and more actively since the beginning of November. Before Wikia I helped start another search engin

Re: Wikia search goes live today

2008-01-08 Thread Andrzej Bialecki
Ryan McKinley wrote: Andrzej Bialecki wrote: Lukas Vlcek wrote: So starring will be accommodated only during the indexing phase. Does it mean it will be a pretty static value not a dynamically changing variable... correct? In other words if I add my stars to some document it won't affect the scorin

Re: Basic Named Entity Indexing

2008-01-08 Thread Doron Cohen
On Jan 8, 2008 11:48 PM, chris.b <[EMAIL PROTECTED]> wrote: > > Wrapping the whitespaceanalyzer with the ngramfilter creates unigrams > and > the ngrams that I indicate, while maintaining the whitespaces. :) > The reason I'm doing this is because I only wish to index names with more > than one t

Re: Query processing with Lucene

2008-01-08 Thread Doron Cohen
This is done by Lucene's scorers. You should however start in http://lucene.apache.org/java/docs/scoring.html, - scorers are described in the "Algorithm" section. "Offsets" are used by Phrase Scorers and by Span Scorer. Doron On Jan 8, 2008 11:24 PM, Marjan Celikik < [EMAIL PROTECTED]> wrote: >

Re: Basic Named Entity Indexing

2008-01-08 Thread chris.b
Wrapping the whitespaceanalyzer with the ngramfilter creates unigrams and the ngrams that I indicate, while maintaining the whitespaces. :) The reason I'm doing this is because I only wish to index names with more than one token.

Re: Sorting on tokenized fields

2008-01-08 Thread Michael Prichard
yes, no worries. i just check in advance what fields are available and build the Sort object accordingly. Eventually BCC would be there...but not necessary so at first. Anyway, got it to work! Thanks for your help. All the best, Michael On Jan 8, 2008, at 4:37 PM, Doron Cohen wrote: H

Re: Sorting on tokenized fields

2008-01-08 Thread Doron Cohen
Hi Michael, I think you mean the exception thrown when you search and sort with a field that was not yet indexed: RuntimeException: field "BBC" does not appear to be indexed I think the current behavior is correct, otherwise an application might (by a bug) attempt to sort by a wrong field, th

Re: Sorting on tokenized fields

2008-01-08 Thread Ryan McKinley
my mistake, I thought I was looking at the solr mailing list ;) If you change your analyzer, it does not change the tokens that are already in the index -- you will need to re-index for any changes to take effect. ryan Michael Prichard wrote: Meaning that it says "field is not indexed". Wh

Re: Query processing with Lucene

2008-01-08 Thread Marjan Celikik
Doron Cohen wrote: Hi Marjan, Lucene processes the query in what can be called one-doc-at-a-time. For the example query - x y - (not the phrase query "x y") - all documents containing either x or y are considered a match. When processing the query - x y - the posting lists of these two index ter

Re: Sorting on tokenized fields

2008-01-08 Thread Michael Prichard
Meaning that it says "field is not indexed". Where is sortMissingLastAttribute? thanks. On Jan 8, 2008, at 4:13 PM, Ryan McKinley wrote: what do you mean by "fail"? -- there is the sortMissingLast attribute Michael Prichard wrote: ok... i should read the manual more often. i went ahead a

Re: Sorting on tokenized fields

2008-01-08 Thread Ryan McKinley
what do you mean by "fail"? -- there is the sortMissingLast attribute Michael Prichard wrote: ok... i should read the manual more often. i went ahead and just added untokenized, unstored sort fields. Question: if I put a field in to sort on but say I have not indexed any as of yet...will

Re: Basic Named Entity Indexing

2008-01-08 Thread Doron Cohen
Hi Chris, A null pointer exception can be caused by not checking newToken for null after this line: Token newToken = input.next() I think Hoss meant to call next() on the input as long as returned tokens do not satisfy the check for being a named entity. Also, this code assumes white space i
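
The two points above (check for null, and keep calling next() until a token passes the check) can be sketched as follows. This is plain Java, not the Lucene TokenFilter API: TokenSource is a stand-in interface whose next() returns null at end of stream, and looksLikeName is a hypothetical placeholder for whatever named-entity check the filter actually uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class NullSafeFilter {
    // Stand-in for a token stream: next() returns null when exhausted.
    interface TokenSource {
        String next();
    }

    static TokenSource fromList(List<String> tokens) {
        Iterator<String> it = tokens.iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    // Hypothetical check: treat capitalized terms as name candidates.
    static boolean looksLikeName(String term) {
        return !term.isEmpty() && Character.isUpperCase(term.charAt(0));
    }

    // Skip tokens until one passes the check or the input runs out.
    // The null test comes first, so the token is never dereferenced
    // after the end of the stream.
    static TokenSource filter(TokenSource input) {
        return () -> {
            String token = input.next();
            while (token != null && !looksLikeName(token)) {
                token = input.next();
            }
            return token;
        };
    }

    // Collect every remaining token of a source into a list.
    static List<String> drain(TokenSource src) {
        List<String> out = new ArrayList<>();
        for (String t = src.next(); t != null; t = src.next()) {
            out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        TokenSource input = fromList(Arrays.asList("the", "Doron", "Cohen", "wrote"));
        System.out.println(drain(filter(input)));
    }
}
```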

Re: Wikia search goes live today

2008-01-08 Thread Lukas Vlcek
I should note that this technique is probably not easily applicable to the current Lucene scoring mechanism without additional development. On 1/8/08, Lukas Vlcek <[EMAIL PROTECTED]> wrote: > > After checking the Lucene API of ParallelReader it seems that the star > score could be stored in a different

Re: Wikia search goes live today

2008-01-08 Thread Lukas Vlcek
After checking the Lucene API of ParallelReader it seems that the star score could be stored in a different index which shares the same identifier for the documents. Such an index could be small (partitioned into many small indices?) so the updates can be fast. Is that what you meant, Andrzej? ;-) Anyway,

Re: Wikia search goes live today

2008-01-08 Thread Ryan McKinley
Andrzej Bialecki wrote: Lukas Vlcek wrote: So starring will be accommodated only during the indexing phase. Does it mean it will be a pretty static value not a dynamically changing variable... correct? In other words if I add my stars to some document it won't affect the scoring immediately but afte

Re: Sorting on tokenized fields

2008-01-08 Thread Michael Prichard
ok... i should read the manual more often. i went ahead and just added untokenized, unstored sort fields. Question: if I put a field in to sort on but say I have not indexed any as of yet...will the Sort fail? For example, say I have a BCC field and nothing has been indexed with that yet

Re: Wikia search goes live today

2008-01-08 Thread Andrzej Bialecki
Lukas Vlcek wrote: So starring will be accommodated only during the indexing phase. Does it mean it will be a pretty static value not a dynamically changing variable... correct? In other words if I add my stars to some document it won't affect the scoring immediately but after an indexing cycle. Correct?

Re: Wikia search goes live today

2008-01-08 Thread Lukas Vlcek
So starring will be accommodated only during the indexing phase. Does it mean it will be a pretty static value not a dynamically changing variable... correct? In other words if I add my stars to some document it won't affect the scoring immediately but after an indexing cycle. Correct? On 1/8/08, Dennis Ku

Re: Wikia search goes live today

2008-01-08 Thread Michael Stoppelman
I'm surprised they aren't keeping *any* logs, or so they claim. Seems foolish to me from a data-mining perspective. "A Wikia employee told me today that people were already asking what the most popular search terms were. He said there was no way of finding out as no logs are kept." [1] [1] http://r

Re: Wikia search goes live today

2008-01-08 Thread Dennis Kubes
Star ratings are being stored but not accounted for in the score as of yet. The plan is to include them in future indexing scores. :) Dennis Mike Klaas wrote: On 7-Jan-08, at 11:49 PM, Lukas Vlcek wrote: This would be great! I am particularly interested in how they are going about customized

Re: Wikia search goes live today

2008-01-08 Thread Mike Klaas
On 7-Jan-08, at 11:49 PM, Lukas Vlcek wrote: This would be great! I am particularly interested in how they are going about customized search (if they have a plan to do it). I mean if they can reorder raw search results based on some kind of collective knowledge (which is probably kept outsid

Re: Query processing with Lucene

2008-01-08 Thread Doron Cohen
Hi Marjan, Lucene processes the query in what can be called one-doc-at-a-time. For the example query - x y - (not the phrase query "x y") - all documents containing either x or y are considered a match. When processing the query - x y - the posting lists of these two index terms are traversed, and
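
A rough sketch of the one-doc-at-a-time idea for the query - x y -, assuming each posting list is simply a sorted array of document ids. This is an illustration, not Lucene's scorer code: real scorers compute a relevance score per document, whereas this sketch only counts how many of the two terms each document contains.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DocAtATime {
    // Walks two sorted posting lists in parallel and visits every matching
    // document exactly once, in increasing doc id order. The map value is
    // the number of query terms found in that document (1 or 2).
    // Assumes doc ids are smaller than Integer.MAX_VALUE, which is used
    // here as an "exhausted list" marker.
    static Map<Integer, Integer> union(int[] x, int[] y) {
        Map<Integer, Integer> matches = new LinkedHashMap<>();
        int i = 0;
        int j = 0;
        while (i < x.length || j < y.length) {
            int dx = i < x.length ? x[i] : Integer.MAX_VALUE;
            int dy = j < y.length ? y[j] : Integer.MAX_VALUE;
            int doc = Math.min(dx, dy); // next document to process
            int termCount = 0;
            if (dx == doc) { termCount++; i++; }
            if (dy == doc) { termCount++; j++; }
            matches.put(doc, termCount);
        }
        return matches;
    }

    public static void main(String[] args) {
        int[] postingsX = {1, 3, 5}; // docs containing x
        int[] postingsY = {3, 4};    // docs containing y
        System.out.println(union(postingsX, postingsY));
    }
}
```

Each document is fully handled before the traversal moves on to the next id, which is the sense in which the processing is one document at a time.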

Sorting on tokenized fields

2008-01-08 Thread Michael Prichard
Is it possible to sort on a tokenized field? For example, I break email address into pieces, i.e. [EMAIL PROTECTED] becomes [EMAIL PROTECTED] michael.prichard michael prichard email.com email so when sorting on this field I get some strange results. Do I need to create another field jus

Re: Self Join Query

2008-01-08 Thread Chris Lu
Hi, Sachin, If you like self-join, you may need to retrieve the data from the second query and merge them into each Document object. Then you can do the query in one shot. (It's redundant, but do not try to normalize data in the index.) Lucene is an index. Just like an index in a SQL database, which c

Re: Performance and BestFragments

2008-01-08 Thread Mark Miller
I think the problem is that he is calling getBestFrags on every hit result for 200-page documents. So he is probably getting the document for every result and running the Highlighter on each. That's some slow stuff there. The first simple thought is to page your results and only getBestFrags for

Re: Performance and BestFragments

2008-01-08 Thread Grant Ingersoll
Are you just trying to search or are you trying to highlight? Usually, you do your search, and then highlight 1 or more documents. You can also speed up highlighting by using term vectors. -Grant On Jan 8, 2008, at 9:38 AM, Yannick Caillaux wrote: Hello, First, sorry for my bad english.

Re: Custom Tokenization of a Single Field

2008-01-08 Thread Briggs
Cool. I just realized that compass also has an annotation value of analyzer. Now I'll just have to find out if you can truly have more than one per index. Thanks!

Re: Custom Tokenization of a Single Field

2008-01-08 Thread Erick Erickson
I *think* you want to consider writing your own analyzer that tokenizes your special fields however you want to. Then use PerFieldAnalyzerWrapper at both index and query time to break the stream up appropriately for this (and only this) field. This would give you two tokens for the attribute fiel

Re: Basic Named Entity Indexing

2008-01-08 Thread chris.b
Following your suggestion (I think), I built a tokenfilter with the following code for next(): public final Token next() throws IOException { Token newToken = input.next(); termText = newToken.termText(); Character tempChar = termText.charAt

Re: Self Join Query

2008-01-08 Thread Erick Erickson
It's often a mistake to try to force Lucene to act like a database. Is it possible to just use the database for the join portion and Lucene for the text search? Otherwise I agree with Developer Developer. You need to provide a higher level idea of *what* it is you're trying to accomplish to get go

Custom Tokenization of a Single Field

2008-01-08 Thread Briggs
I have an index that contains a couple of special fields that I need to tokenize differently from the rest. The case is that I basically have a key/value pair stored as the value. The field name is "attribute" and its value is "SomeValue=1.9". I need to tokenize the value so that I can search on t
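
One possible reading of this requirement, sketched in plain Java: split the raw value at the first '=' so the key and the value become separate tokens. The class and method names are hypothetical, and how the pieces would be wired into a custom analyzer is left out, since the message is cut off before the exact search requirement is stated.

```java
import java.util.Arrays;

public class KeyValueSplit {
    // Splits "key=value" at the first '=' into two tokens.
    // A raw string without '=' is returned as a single token.
    static String[] split(String raw) {
        int eq = raw.indexOf('=');
        if (eq < 0) {
            return new String[] { raw };
        }
        return new String[] { raw.substring(0, eq), raw.substring(eq + 1) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(split("SomeValue=1.9")));
    }
}
```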

Re: Self Join Query

2008-01-08 Thread Developer Developer
Provide more details please. Can you not use a boolean query and filters if need be? On Jan 8, 2008 7:23 AM, sachin <[EMAIL PROTECTED]> wrote: > > I need to write a Lucene query similar to SQL self joins. > > My current implementation is very primitive. I fire the first query, get the > res

Performance and BestFragments

2008-01-08 Thread Yannick Caillaux
Hello, First, sorry for my bad English. I have an index including 100 Dublin Core notices. I indexed title, creator and I added a field "fulltext" containing the PDF document referenced by the DC notice. (A PDF document is about 200 pages.) Indexing them is no problem. But when I try

Re: Wikia search goes live today

2008-01-08 Thread Grant Ingersoll
On Jan 8, 2008, at 2:55 AM, Lukas Vlcek wrote: BTW: 1) If they have made any improvements/changes to Nutch (or Lucene/Hadoop) code and they keep it closed then how can they claim they are using open sourced algorithms? They are "using" it, they just aren't sharing it. Many companies out

Self Join Query

2008-01-08 Thread sachin
I need to write a Lucene query similar to SQL self joins. My current implementation is very primitive. I fire the first query, get the results, then based on those results I fire a second query and merge the results from both queries. The whole process is very expensive. Doi

Re: PrefixQuery question

2008-01-08 Thread Shai Erera
Did you try SpanFirstQuery? I had the same need in my application, for implementing type-ahead functionality over the titles and I found that storing them as un_tokenized gives the best performance (of course, I don't run any query, but iterate over the terms in my solution). Span queries are expe

PrefixQuery question

2008-01-08 Thread Cam Bazz
Hello, I am having a problem with PrefixQuery: I have a field named item_title, which is indexed as: doc.add(new Field("item_title", item_title.trim().toLowerCase(), Field.Store.YES, Field.Index.TOKENIZED, Field.TermVector.YES)); and I am forming my query like: PrefixQuery pq = new PrefixQuery((

Re: OutOfMemoryError on small search in large, simple index

2008-01-08 Thread Lars Clausen
On Mon, 2008-01-07 at 14:20 -0800, Otis Gospodnetic wrote: > Please post your results, Lars! Tried the patch, and it failed to compile (plain Lucene compiled fine). In the process, I looked at TermQuery and found that it'd be easier to copy that code and just hardcode 1.0f for all norms. Did tha

Re: Deleting a single TermPosition for a Document

2008-01-08 Thread Antony Bowesman
Otis Gospodnetic wrote: Is your user field stored? If so, you could find the target Document, get the user field value, modify it, and re-add it to the Document (or something close to this -- I am doing this with one of the indices on simpy.com and it's working well). No, it's not stored. I'm