Searching for "keyword" fields using QueryParser

2005-11-24 Thread Tim.Wright
Hi,

Our index has a large text field, and a number of "keyword" fields with
things such as the publication code, article reference and so on. 

We're analysing using the StandardAnalyzer, which works well. Obviously
the fields which are defined as Field.Keyword don't run through the
analyzer. 

The problem comes when we search using a QueryParser. If we've inserted
a field using:

Field.Keyword("pubcode", "PUB123");

We then search using 

Query q = QueryParser.parse(query, "text", new StandardAnalyzer());

When we try to search on something like:

"pubcode:PUB123" - the analyzer promptly lowercases the search term, and
we get no hits back. 

Is there any way to tell QueryParser that, where a search term is being
applied against a field of type Keyword, it shouldn't be run through the
analyzer? 

We can get around this by creating separate Query objects and joining
them with a BooleanQuery, but we were hoping to be able to handle all of
our queries using QueryParser so that we could easily store queries as
String objects. 

Cheers,

Tim.




The information contained in this email message may be confidential. If you are 
not the intended recipient, any use, interference with, disclosure or copying 
of this material is unauthorised and prohibited. Although this message and any 
attachments are believed to be free of viruses, no responsibility is accepted 
by Informa for any loss or damage arising in any way from receipt or use 
thereof.  Messages to and from the company are monitored for operational 
reasons and in accordance with lawful business practices. 
If you have received this message in error, please notify us by return and 
delete the message and any attachments.  Further enquiries/returns can be sent 
to [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Searching for "keyword" fields using QueryParser

2005-11-24 Thread Tim.Wright
Excellent, that's exactly what I needed. Many thanks!

Cheers,

Tim.

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
Sent: 24 November 2005 14:51
To: java-user@lucene.apache.org
Subject: Re: Searching for "keyword" fields using QueryParser


Tim,

The trick is to use PerFieldAnalyzerWrapper with QueryParser, using  
StandardAnalyzer as the default, and using KeywordAnalyzer for each  
of the fields that should not be analyzed.  KeywordAnalyzer is in the  
trunk of Subversion right now, not in a released version.



It was developed for Lucene in Action also, so it is in that codebase  
which you can download from http://www.lucenebook.com

Keep in mind that QueryParser may still interfere with keyword fields  
if there are characters in them that are special to the parser, such  
as parentheses.
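
For illustration, the per-field idea can be modelled in plain Java. This is a toy sketch, not Lucene's API: the class PerFieldNormalizer and its methods are hypothetical names, and lowercasing merely stands in for whatever StandardAnalyzer does to a term. It only shows why routing "pubcode" through a pass-through analysis keeps PUB123 intact while other fields are still lowercased.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Toy model of per-field analysis (hypothetical class, not a Lucene type). */
class PerFieldNormalizer {
    private final UnaryOperator<String> defaultAnalysis;
    private final Map<String, UnaryOperator<String>> perField = new HashMap<>();

    PerFieldNormalizer(UnaryOperator<String> defaultAnalysis) {
        this.defaultAnalysis = defaultAnalysis;
    }

    /** Registers a field-specific analysis that overrides the default. */
    void addField(String field, UnaryOperator<String> analysis) {
        perField.put(field, analysis);
    }

    /** Applies the field's own analysis if one is registered, else the default. */
    String normalize(String field, String term) {
        return perField.getOrDefault(field, defaultAnalysis).apply(term);
    }

    /** Default lowercases (stand-in for StandardAnalyzer); "pubcode" passes through. */
    static PerFieldNormalizer example() {
        PerFieldNormalizer n = new PerFieldNormalizer(s -> s.toLowerCase(Locale.ROOT));
        n.addField("pubcode", UnaryOperator.identity());
        return n;
    }
}
```

With this setup, normalize("pubcode", "PUB123") leaves the term untouched, while normalize("text", "Broadband") is lowercased, so each query term matches what was stored for its field.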

Erik






RE: Boost value and LUKE

2006-01-23 Thread Tim.Wright
I'm pretty sure this is a bug or incompatibility with Luke - I'm using
boosted documents, and I seem to remember that Luke reported everything
as 1.0, even though my test applications showed things correctly. 

The boost in the final app is working fine, so the functionality of
Lucene appears to be correct. 

Cheers,

Tim.

-----Original Message-----
From: Don Vaillancourt [mailto:[EMAIL PROTECTED] 
Sent: 23 January 2006 17:48
To: java-user@lucene.apache.org
Subject: Boost value and LUKE


Hello All,

I am testing the boost value within the latest version of Lucene and I'm
inspecting the results through Luke.

For each FIELD that I want to boost I use the setBoost method, and 
everything looks good.  But Luke is refusing to expose the boost value
and keeps returning 1.0 for the specified columns (well, actually all 
columns) when I have set one column to 0.5 and the other to 0.75.

1. Is there something I am doing wrong?
2. Is there a bug in Lucene?
3. Is there a bug in Luke?

Does anyone have any hints?

Thank You

-- 
Don Vaillancourt
Director of Software Development
WEB IMPACT INC.
phone:   416-815-2000 ext. 245
fax: 416-815-2001
toll free:   866-319-1573 ext. 245
email:   [EMAIL PROTECTED] 
blackberry:  [EMAIL PROTECTED] 

web: http://www.web-impact.com
address: http://www.mapquest.ca 
 


This email message is intended only for the addressee(s) and contains 
information that may be confidential and/or copyright.

If you are not the intended recipient please notify the sender by reply 
email and immediately delete this email.

Use, disclosure or reproduction of this email by anyone other than the 
intended recipient(s) is strictly prohibited. No representation  made 
that this email or any attachments are free of viruses. Virus scanning 
is recommended and is the responsibility of the recipient.








RE: Boost value and LUKE

2006-01-23 Thread Tim.Wright
Ah - that's useful to know. Although in that case I'd suggest that the
sensible thing for Luke to do would be to either remove the boost field,
or show it as "unavailable", instead of (misleadingly) displaying it as
1.0... 

Cheers,

Tim.

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: 23 January 2006 19:07
To: java-user@lucene.apache.org
Cc: [EMAIL PROTECTED]
Subject: Re: Boost value and LUKE


[EMAIL PROTECTED] wrote:
> I'm pretty sure this is a bug or incompatibility with Luke - I'm using
> boosted documents, and I seem to remember that Luke reported everything
> as 1.0, even though my test applications showed things correctly. 
>
> The boost in the final app is working fine, so the functionality of
> Lucene appears to be correct. 
>   

This is not a bug or incompatibility - it's the way the boost values are
stored in the index. Once the index is written down, the original boost 
values per field and per doc cannot be retrieved, because they are 
folded into so-called "fieldNorm" values.
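
A rough plain-Java sketch of why the original boost cannot be read back. It assumes (this is not stated in the message) that the stored norm is the product of the boost and a length factor of 1/sqrt(number of terms); NormSketch and norm are illustrative names, not Lucene classes. Under that assumption, different boost and length combinations collapse to the same stored value.

```java
/** Toy model of folding a boost into a single norm value (assumed formula). */
class NormSketch {
    // Assumption: norm = boost * 1/sqrt(number of terms in the field).
    static float norm(float boost, int numTerms) {
        return boost * (float) (1.0 / Math.sqrt(numTerms));
    }
}
```

Here norm(0.5f, 4) and norm(1.0f, 16) are the same number, so a tool reading only the stored norm has no way to tell which boost was originally set.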

-- 
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com






RE: Sorting by calculated custom score at search time

2006-01-24 Thread Tim.Wright
Nick Vincent [mailto:[EMAIL PROTECTED] wrote:

[snip]

> From an earlier thread discussing a calculated score based on the hit
> score and the age of document I gather that TSS regenerate their indexes
> to alter the document boost based on date.  I need to be able to sort by
> either relevance or "popularity rated relevance" depending on user
> input, so I don't think adding a precalculated document boost at index
> time is an option.

The easiest way I can think of would be to build two indexes - one with
popularity boosted documents and one without, and search the one you
want. 

Alternatively, if you have to have a single index, you could add each
document twice, once with no boost, and once with "popularity" boost,
and specify a keyword field for each document defining whether it's
boosted or not. Then, when you want to search, restrict results to
documents where the keyword is "boosted" or "notboosted" respectively.

Both fairly low tech solutions, but I'm lazy like that!

Cheers,

Tim.







Changing default QueryParser operator from OR to AND

2006-02-10 Thread Tim.Wright
Hi guys, 

If QueryParser gets a phrase with a number of words (i.e. "here are
words"), it uses the implicit operator OR - "here OR are OR words". LIA
(p. 94) says the operator "by default is OR", implying that there may be
some way to change this. 

We'd really like the default to be AND. Is that possible?

Cheers,

Tim.




TooManyClauses exception in Lucene (1.4)

2006-03-16 Thread Tim.Wright
Hi,

We're using QueryParser to generate our queries (not ideal, and we're
planning on rewriting it, but at the moment we don't have the resources
to do so). 

We have a default field "text" which contains all of our text fields,
and a "date" field which is just a string field in the format YYYY-MM-DD
(so we can sort). 

I'm passing in the query string: broadband AND date:[2005-03-16 TO
2006-03-16]^0.01

(I'm weighting the date portion of the query so it doesn't affect the
sorting too heavily). 

Running query.toString gives this: +text:broadband +date:[2005-03-16 TO
2006-03-16]^0.01

When I try to run the query, though, I get this Exception:

org.apache.lucene.search.BooleanQuery$TooManyClauses
    at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:79)
    at org.apache.lucene.search.BooleanQuery.add(BooleanQuery.java:71)
    at org.apache.lucene.search.RangeQuery.rewrite(RangeQuery.java:99)
    at org.apache.lucene.search.BooleanQuery.rewrite(BooleanQuery.java:243)
    at org.apache.lucene.search.IndexSearcher.rewrite(IndexSearcher.java:166)
    at org.apache.lucene.search.Query.weight(Query.java:84)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:85)
    at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:64)
    at org.apache.lucene.search.Hits.<init>(Hits.java:43)
    at org.apache.lucene.search.Searcher.search(Searcher.java:33)
    at org.apache.lucene.search.Searcher.search(Searcher.java:27)

I know Lucene 1.4 has a limited number of clauses, but assumed two or
three would be okay. :)

Any ideas would be gratefully received! Oddly, this doesn't seem to
occur every time, just with certain date ranges...

Cheers,

Tim.




RE: TooManyClauses exception in Lucene (1.4)

2006-03-16 Thread Tim.Wright
Ouch! Yes, we're indexing with seconds, so that's almost certainly the
problem. :( I had no idea that RangeQuery worked by enumerating every
possible value; that's terrifying. 

We have a requirement to index data going back for about 20 years,
though, and although daily resolution would be fine, this is still 7000
days. That's more than BooleanQuery supports. 
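
A quick back-of-the-envelope check of the numbers in this thread, as a plain-Java sketch. It assumes a clause limit of 1024 (believed to be the BooleanQuery default, but check your version) and that every day in the range has at least one document; RangeTermCount and its methods are illustrative names, not Lucene classes.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

/** Estimates how many clauses a day-resolution range would expand to. */
class RangeTermCount {
    // Assumed clause limit; verify against the Lucene version in use.
    static final int ASSUMED_MAX_CLAUSES = 1024;

    /** Upper bound on distinct day-resolution terms in an inclusive date range. */
    static long dayTerms(LocalDate from, LocalDate to) {
        return ChronoUnit.DAYS.between(from, to) + 1;
    }

    /** True if a range with this many terms would exceed the assumed limit. */
    static boolean exceedsLimit(long terms) {
        return terms > ASSUMED_MAX_CLAUSES;
    }
}
```

At day resolution, the one-year range from the original query expands to at most 366 terms, which fits under the assumed limit; a 20-year range expands to over 7000 and does not. At second resolution, every distinct timestamp in the range becomes its own clause.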

So what's the solution? Surely we're not the only people who need to
search in date ranges going back this far? Does PrefixQuery work the
same way, or can we work around this by doing things like a PrefixQuery
for "2005-01-" to catch an entire month in one query, etc.?

Cheers,

Tim.

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: 16 March 2006 17:26
To: java-user@lucene.apache.org
Subject: Re: TooManyClauses exception in Lucene (1.4)


Tim,

This is possibly a lot of days:
  date:[2005-03-16 TO 2006-03-16]
And if your 'date' field is more granular than 'a day', then this is a
lot more hours/minutes/seconds/milliseconds.

Your range query is expanded to all unique values in the range.  This is
probably in the FAQ, but if not, look at the first snippet:
http://www.lucenebook.com/search?query=indexing+date+millisecond

If you don't need hours/minutes/seconds/milliseconds, don't index them.

Otis




RE: TooManyClauses exception in Lucene (1.4)

2006-03-17 Thread Tim.Wright
Thanks to everyone for the explanation. Given that RangeQuery is clearly
unsuitable for our requirements, ConstantScoreRangeQuery looks ideal.

However, we're building our queries (at the moment) using QueryParser.
Is there any way we can get QueryParser to use a ConstantScoreRangeQuery
instead of a RangeQuery?

Cheers,

Tim.

-----Original Message-----
From: Yonik Seeley [mailto:[EMAIL PROTECTED] 
Sent: 16 March 2006 18:04
To: java-user@lucene.apache.org
Subject: Re: TooManyClauses exception in Lucene (1.4)


On 3/16/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> I had no idea that rangequery worked by enumerating every
> possible value, that's terrifying.

You could use either a RangeFilter or a ConstantScoreRangeQuery

-Yonik
http://incubator.apache.org/solr Solr, The Open Source Lucene Search
Server




RE: Lucene 1.9.1 Query

2006-03-22 Thread Tim.Wright
You need to create a QueryParser instance and use that instead:

QueryParser qp = new QueryParser("text", new StandardAnalyzer());
Query query = qp.parse(this.searchvalue);

Cheers,

Tim.

-----Original Message-----
From: WATHELET Thomas [mailto:[EMAIL PROTECTED] 
Sent: 22 March 2006 11:25
To: java-user@lucene.apache.org
Subject: Lucene 1.9.1 Query


How do I replace this expression in Lucene 1.9.1, now that the parse
method is deprecated?

Query query = QueryParser.parse(this.searchvalue, "text", new StandardAnalyzer());





Handling hyphens and other punctuation in proper nouns

2006-05-24 Thread Tim.Wright
Hi all,

We're having issues searching for proper nouns (names) which have
punctuation in; things like "a-blah" or "blah'x". I suspect the
StandardAnalyzer is replacing the punctuation with spaces, and we get
back results that just contain "blah". 

Any suggestions? I'm guessing we could write our own Analyzer, but I
suspect this has come up before and wondered if there was an easier way
to solve the problem!

Many thanks, 

Tim.




Search oddities

2006-05-25 Thread Tim.Wright
It appears that I was confused about the way analyzers work. I
assumed that a typical analyzer would just remove hyphens and treat
them as spaces. We're just using StandardAnalyzer. 

When we search (using QueryParser) for the phrase "t-mobile" (including
quotes) we're getting results back which only have the word "mobile"
in them. I would have assumed the analyzer would convert this to "t
mobile" (again, in quotes) and would only return documents containing
that phrase. 

Oddly, however, if we search for "pay-tv", we only get back documents
that actually have the phrase "pay tv" or "pay-tv" - nothing which just
has "pay" or "tv". 

I'm not quite sure why "t-mobile" is behaving differently to "pay-tv".
If anyone could point me in the right direction I'd be very grateful!
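
One possible explanation, which this message does not confirm: if the analyzer splits on the hyphen and then drops stop words, and the single letter "t" is on the stop list, then "t-mobile" is reduced to just "mobile" while "pay-tv" keeps both words. The plain-Java sketch below models that behaviour; HyphenSketch, its tokenizing rule, and the stop list are assumptions for illustration, not StandardAnalyzer's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy analyzer: lowercase, split on non-alphanumerics, drop stop words. */
class HyphenSketch {
    // Assumed fragment of a stop list that includes the single letters "s" and "t".
    static final Set<String> ASSUMED_STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "s", "t", "the"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            // Skip empty fragments and anything on the assumed stop list.
            if (!part.isEmpty() && !ASSUMED_STOP_WORDS.contains(part)) {
                tokens.add(part);
            }
        }
        return tokens;
    }
}
```

Under these assumptions, analyze("t-mobile") yields only the token "mobile", whereas analyze("pay-tv") yields "pay" followed by "tv", which would match the difference described above.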

Many thanks, 

Tim.




Sorting results by both score and date

2005-09-16 Thread Tim.Wright
Hi,

I'm working in an industry which is fairly time sensitive, and older
documents are inherently less valuable. I'd like to be able to "weight"
the score of search results, so that older documents score lower. I
don't just want to sort by date, though - I'd still like results to be
ordered by score, just an "adjusted" score. 

I've read the excellent LIA, including the chapter on custom sort
methods, but from what I can tell that still only implements a sort on
one field - I really want to be able to sort on a "blend" of fields (one
of these being the actual document score). 

Could anyone suggest how I could implement this? I considered explicitly
weighting the documents with a function of their date at index time, but
this would mean the "weight" of the new documents would have to increase
exponentially over time, and I suspect things would get messy! (Our
dataset is around 250k documents, growing by a few thousand a month.)

Cheers,

Tim.







RE: Sorting results by both score and date

2005-09-16 Thread Tim.Wright
Ah - the one bit of LIA I haven't read yet is the case studies section!
Many thanks, I'll check it out. Sorting by multiple fields isn't quite
what I want - that sorts entirely by field A, then uses field B for
records where A is identical, correct? 

What I really want to do is sort by "A * (1-(B/700))", where A is the
score, and B is the age (in days) of the document, i.e. the score is
basically "scaled down" with age. 
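
The formula above can be written out as a small plain-Java helper. This is only a sketch: AgeWeightedScore and adjust are made-up names, 700 is the constant from the message, and clamping the factor at zero for documents older than 700 days is an added assumption (the formula as written would go negative). How to plug such a value into Lucene's ranking is not shown.

```java
/** Scales a relevance score down linearly with document age. */
class AgeWeightedScore {
    /** Returns score * (1 - ageDays/700), with the factor clamped at 0 (assumption). */
    static double adjust(double score, double ageDays) {
        double factor = 1.0 - ageDays / 700.0;
        return score * Math.max(0.0, factor);
    }
}
```

For example, a brand-new document keeps its full score, a 350-day-old document keeps half of it, and anything 700 days or older scores zero under the clamping assumption.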

Cheers,

Tim.

-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED] 
Sent: 16 September 2005 14:54
To: java-user@lucene.apache.org
Subject: Re: Sorting results by both score and date


Tim, check out p. 155 in LIA where we discuss "Sorting by multiple  
fields".

However, what you're really after it seems is boosting documents.   
Check out TheServerSide's case study (online or in LIA) - Dion  
discusses how he implemented boosting for more recent documents.  If  
you're indexing documents in ascending date order, perhaps you could  
leverage the document id in such a boosting factor?

 Erik









RE: Sorting results by both score and date

2005-09-16 Thread Tim.Wright
>> What I really want to do is sort by "A * (1-(B/700))", where A is the
>> score, and B is the age (in days) of the document. IE - the score is
>> basically "scaled down" with date.

> Maybe the TSS case study will help, though they rebuild their index  
> nightly and can adjust the boost based on the current day.

Just read this - it looks like the best option for us. I think we could 
get away with only periodically reindexing by just inflating the boost
marginally over time. Are there limits to boost? Any reason we can't 
use a boost of, say, 0.0001 or 10,000? 

Cheers,

Tim.







RE: Deleting documents

2005-09-16 Thread Tim.Wright
If you're indexing a field like this in order to be able to use it as a
reference later, you should normally index it using Field.Keyword
instead of Field.Text - if you use Text, it will go through your
Analyzer, which is probably what's changing the case. (I think this is
right - I'm sure someone will correct me if I'm wrong!)

Cheers,

Tim.

-----Original Message-----
From: Bogdan Munteanu [mailto:[EMAIL PROTECTED] 
Sent: 16 September 2005 15:40
To: java-user@lucene.apache.org
Subject: Deleting documents


I have a problem when deleting documents.
Let's say I have a Document object doc.
doc.add(Field.Text("id","index1,DML"));
doc.add(Field.Text("contents","some records"));
IndexWriter.addDocument(doc);
Now if I want to delete the document with id:index1,DML I do something
like this:
IndexReader.delete(new Term("id", "index1,DML"));
And it is not deleted.
I have debugged it and noticed that Lucene compares my "index1,DML" 
parameter with its internal value "index1,dml".
So when I do:
IndexReader.delete(new Term("id", "index1,dml"));
the document is deleted.
Now please explain to me why there is a lower-case value for my "id".
And excuse my poor English!



