Retrieving exact matches

2007-02-06 Thread Mile Rosu


Hello,


I have been looking through the documentation but haven't found a solution to
this:


Is there a way to retrieve only the record "picasso" when the query is
picasso, and not both "picasso" and "picasso pablo", i.e. a 100% match
of the query?



Thank you,
Mile Rosu
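
One common way to get this behaviour is to index the value a second time as a single untokenized (keyword) term and to query that field with the whole query string, so that "picasso pablo" is stored as one term and cannot match the one-word query. The sketch below only illustrates the idea with plain Java collections and does not use the Lucene API; the class ExactMatchIndex and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Toy index contrasting tokenized matching with whole-value (keyword) matching. */
final class ExactMatchIndex {
    // token -> records containing that token (what a tokenized field gives you)
    private final Map<String, List<String>> byToken = new HashMap<>();
    // whole normalized value -> records (what an untokenized field gives you)
    private final Map<String, List<String>> byWholeValue = new HashMap<>();

    void add(String record) {
        String normalized = record.trim().toLowerCase(Locale.ROOT);
        byWholeValue.computeIfAbsent(normalized, k -> new ArrayList<>()).add(record);
        for (String token : normalized.split("\\s+")) {
            List<String> records = byToken.computeIfAbsent(token, k -> new ArrayList<>());
            if (!records.contains(record)) {
                records.add(record);
            }
        }
    }

    /** Matches every record that contains the query as one of its tokens. */
    List<String> searchTokenized(String query) {
        String key = query.trim().toLowerCase(Locale.ROOT);
        return byToken.getOrDefault(key, new ArrayList<>());
    }

    /** Matches only records whose whole value equals the query. */
    List<String> searchExact(String query) {
        String key = query.trim().toLowerCase(Locale.ROOT);
        return byWholeValue.getOrDefault(key, new ArrayList<>());
    }
}
```

With the records "picasso" and "picasso pablo" added, the tokenized lookup for picasso returns both, while the exact lookup returns only the first.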



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Removing brackets before indexing

2006-05-31 Thread Mile Rosu
Hello!

I am currently trying to index Latin-language documents in which
missing letters are supplied inside square brackets, like
this: "[divinit]atis".

What would be the best practice for removing the
brackets before adding the text to the Lucene index (in the example, to store
the word "divinitatis")?

Thank you a lot,
Mile Rosu




RE: Removing brackets before indexing

2006-06-01 Thread Mile Rosu
Hello Otis,

Thank you for the hint. I have written a custom analyzer that uses a
custom tokenizer similar to CharTokenizer: it treats brackets as token
characters but removes them in the next() method, because I do
not want the word to be split when it is added to the index. It seems to work
OK but still needs more testing. With plain SimpleAnalyzer the words were
split at the brackets.
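
The tokenizer behaviour described above can be sketched without the Lucene classes as follows. This is only an illustration of the idea (brackets count as token characters, so the word is not split, but they are dropped from the emitted token); BracketTokenizerSketch is a hypothetical name, and the lowercasing is an assumption borrowed from SimpleAnalyzer rather than something stated in the message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class BracketTokenizerSketch {
    /** Letters and square brackets stay inside a token; anything else ends it. */
    static boolean isTokenChar(char c) {
        return Character.isLetter(c) || c == '[' || c == ']';
    }

    /** Splits text into lowercased tokens, removing brackets without splitting words. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // Iterate one position past the end so the last token is flushed.
        for (int i = 0; i <= text.length(); i++) {
            boolean inToken = i < text.length() && isTokenChar(text.charAt(i));
            if (inToken) {
                char c = text.charAt(i);
                if (c != '[' && c != ']') {
                    current.append(c); // keep letters, silently drop brackets
                }
            } else if (current.length() > 0) {
                tokens.add(current.toString().toLowerCase(Locale.ROOT));
                current.setLength(0);
            }
        }
        return tokens;
    }
}
```

For the input "[divinit]atis" this yields the single token divinitatis, whereas a tokenizer that treats brackets as separators would yield divinit and atis.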

Mile

  

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, May 31, 2006 7:36 PM
To: java-user@lucene.apache.org
Subject: Re: Removing brackets before indexing

Mile,

Any Analyzer that uses a Tokenizer that throws out non-letter characters will
do. For example, take a look at SimpleAnalyzer, which uses LowerCaseTokenizer.
If you read the javadoc for LowerCaseTokenizer, I think you will see it
suits you.

Otis




Using more than one index

2006-06-12 Thread Mile Rosu
Hello,

We have an application dealing with historical books. The books have
metadata consisting of event dates and person names, among other things.
The FullText, Person and Date indexes were kept separate until we realized that
for a larger number of documents (400K) combining the hits of the
sequential searches took far too long to complete (15 min).
The date index was built using the suggestion found at:
http://wiki.apache.org/jakarta-lucene/LargeScaleDateRangeProcessing (big
thanks for the hint)

Is there a recommended approach to combining results from different
indexes (with different fields)?

The index structure:

MainIndex:
    @ID@ - keyword (document id)
    @FULLTEXT@ - tokenized (used for full-text search)
    Ptitle - tokenized (used for full-text publication title search)
    Dtitle - tokenized (used for full-text document title search)
    Type - keyword (used for document type)

PersonIndex:
    @ID@ - keyword (document id == [EMAIL PROTECTED]@)
    Person - tokenized (full-text person name search)

DateIndex:
    @ID@ - keyword (document id == [EMAIL PROTECTED]@)
    Date - date as MMDD - keyword
    Type - type of date (document date, birthday, etc.)
    @@ - year of date
    @MM@ - year and month of date
    @DDD@ - decade
    @CC@ - century of date



Example:
If I want to search for documents that contain person "John", full text
"book" and a date before 06/12/2005:
Step 1: search PersonIndex for "John" and retrieve all @ID@ values from the hit
list
Step 2: search DateIndex for documents that have dates before
06/12/2005 and retrieve the IDs from the hit list
Step 3: search MainIndex for "book" and retrieve all @ID@ values
Step 4: combine all the lists
Step 5: search MainIndex for documents with the @ID@ values from the combined
ID list

Each search takes less than 1 second, but retrieving @ID@ from the index
takes much longer, and the time grows with the number of hits. This is
because when retrieving a field value from a document hit, the Lucene
engine loads all the fields from the index (the entire document). So if
one search returns 300,000 hits, I have to iterate through all of them and
retrieve the @ID@ field value, which takes a lot of time.

Regards,
Mile Rosu




RE: Best solution for the Date Range problem

2006-06-12 Thread Mile Rosu
Hello,

You might consider using the suggestion at
http://wiki.apache.org/jakarta-lucene/LargeScaleDateRangeProcessing
We successfully used it to search wide date ranges over a relatively large
number of date records.
This approach greatly simplifies the query you are suggesting in (3). Gluing
YYYY and MM together in a field like YYYYMM would also make your query look nicer.

Greets,
Mile Rosu

-Original Message-
From: Björn Ekengren [mailto:[EMAIL PROTECTED] 
Sent: Monday, June 12, 2006 11:51 AM
To: java-user@lucene.apache.org
Subject: Best solution for the Date Range problem

Hi,
I would like users to be able to search on both terms and within a date
range. The solutions I have come across so far are:
 
1. Use the default QueryParser which will use RangeQuery which will expand
into a number of Boolean clauses. It is quite likely that this will run into
the TooManyClauses error.
2. Extend QueryParser and override getRangeQuery() and let it return a
FilteredQuery containing a RangeFilter.
3. Split dates during indexing into YYYY, MM, DD and create a custom
RangeQuery that uses only the granularity needed:
 
  +date[20040830 TO 20060202]   
 
expands to 
 
(year:2004 AND month:08 AND day:30) OR
(year:2004 AND month:08 AND day:31) OR
(year:2004 AND month:09) OR 
(year:2004 AND month:10) OR 
(year:2004 AND month:11) OR 
(year:2004 AND month:12) OR
(year:2005) OR
(year:2006 AND month:01) OR
(year:2006 AND month:02 AND day:01) OR 
(year:2006 AND month:02 AND day:02)
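
The expansion shown for option 3 can be generated mechanically by walking from the start date to the end date and always emitting the coarsest unit (year, then month, then day) that fits entirely inside the range. The following is a minimal sketch under that assumption; DateRangeExpander and its methods are hypothetical names, the field names year, month and day are taken from the example above, and java.time is used for brevity even though it is much newer than the Lucene version discussed here.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class DateRangeExpander {
    /** Expands the inclusive range [start, end] into the coarsest year/month/day clauses. */
    static List<String> expand(LocalDate start, LocalDate end) {
        List<String> clauses = new ArrayList<>();
        LocalDate cursor = start;
        while (!cursor.isAfter(end)) {
            LocalDate lastOfYear = LocalDate.of(cursor.getYear(), 12, 31);
            LocalDate lastOfMonth = cursor.withDayOfMonth(cursor.lengthOfMonth());
            if (cursor.getDayOfYear() == 1 && !lastOfYear.isAfter(end)) {
                // The whole calendar year lies inside the range.
                clauses.add(String.format(Locale.ROOT, "(year:%04d)", cursor.getYear()));
                cursor = lastOfYear.plusDays(1);
            } else if (cursor.getDayOfMonth() == 1 && !lastOfMonth.isAfter(end)) {
                // The whole calendar month lies inside the range.
                clauses.add(String.format(Locale.ROOT, "(year:%04d AND month:%02d)",
                        cursor.getYear(), cursor.getMonthValue()));
                cursor = lastOfMonth.plusDays(1);
            } else {
                // Partial month: fall back to a single day.
                clauses.add(String.format(Locale.ROOT, "(year:%04d AND month:%02d AND day:%02d)",
                        cursor.getYear(), cursor.getMonthValue(), cursor.getDayOfMonth()));
                cursor = cursor.plusDays(1);
            }
        }
        return clauses;
    }

    /** Joins the clauses into one query string. */
    static String toQuery(LocalDate start, LocalDate end) {
        return String.join(" OR ", expand(start, end));
    }
}
```

Calling expand(LocalDate.of(2004, 8, 30), LocalDate.of(2006, 2, 2)) produces the same ten clauses as listed above. The number of clauses stays small for wide ranges, which is what avoids the TooManyClauses problem mentioned in option 1.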
 
Are there any other options, and which one is the best ?
 
/B




RE: Using more than one index

2006-06-13 Thread Mile Rosu
Hi Hoss,

Thanks for your quick answer. One remaining problem with the dates is
this:

A document (in our case an XML file with a lot of metadata) can have more
than one date, each date with 2 attributes:

Eg:

00-00-1886 

In the date index I have, for every  in the input xml, a document
with the fields: type (document | other), date, art (birthday | deportation |
death...). If I merge all the dates that belong to one
document, then the merged type field will contain all the values. So if I
search for a document that has type:document, art:birthday and a
date between a and b, I won't get correct results.

Regards,
Mile Rosu

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, June 13, 2006 9:55 AM
To: java-user@lucene.apache.org
Subject: Re: Using more than one index


A couple of suggestions...

1) don't use multiple indexes.  create one index, with one document per
"thing" you want to return (in this case it sounds like books) and index
all of the relevant data about each thing in that doc.  If multiple people
worked on a book, add all of their names to the same field.  add all of
the dates to the book doc -- if you need to distinguish the different
types of dates, make a separate field for each type.

If you *must* cross reference...

2) make sure you aren't using the Hits API to iterate over all the
results when gathering IDs -- use a lower level api (like a HitCollector)

3) use the FieldCache to get the IDs instead of the stored Document
fields.

4) don't extract full ID lists from all of the indexes and then search
on one of the indexes again with the ID list ... use the ID lists
generated from the supporting indexes (people and dates) to build a Filter
that you can use when searching the main index.
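
Suggestion 4 amounts to turning each supporting ID list into a set of allowed main-index documents and intersecting those sets, instead of building a second query out of IDs. The sketch below shows that step with java.util.BitSet only and does not touch the Lucene Filter API; IdFilterSketch is a hypothetical name, and the docNumberById map (ID to main-index document number) is an assumption about data you would have to obtain yourself, for example along the lines of suggestion 3.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.Set;

final class IdFilterSketch {
    /** Marks the main-index document numbers whose ID occurs in matchingIds. */
    static BitSet toBits(Set<String> matchingIds, Map<String, Integer> docNumberById, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (String id : matchingIds) {
            Integer docNumber = docNumberById.get(id);
            if (docNumber != null) {   // IDs unknown to the main index are ignored
                bits.set(docNumber);
            }
        }
        return bits;
    }

    /** Intersects any number of per-constraint bit sets without modifying the inputs. */
    static BitSet intersect(BitSet first, BitSet... rest) {
        BitSet combined = (BitSet) first.clone();
        for (BitSet other : rest) {
            combined.and(other);
        }
        return combined;
    }
}
```

The combined bit set can then be consulted once per main-index hit, which costs a bit lookup rather than loading a stored document.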






-Hoss





RE: Using more than one index

2006-06-14 Thread Mile Rosu
Hello again,

At http://people.level7.ro/mile.rosu/small_index.zip you will find a couple of
documents from our index, which might give you a better overview of our
problem (both the separate indexes and the merged version).

Our problem remains with the date index: a date record has additional
fields used for date range searches, which unfortunately cannot be merged into one
index, or at least we do not see a solution.

Thank you,
Mile

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED]
Sent: Tuesday, June 13, 2006 8:45 PM
To: java-user@lucene.apache.org
Cc: Mircea Pop
Subject: RE: Using more than one index

: A document (in our case an xml that has many metadata) can have more
: than one date, each date with 2 attributes:

: 00-00-1886
:
: In the date index I have for every  in the input xml a document
: with fields: type (document |other), date, art (birthday | deportation |
: death...). For example if I merge all the dates that correspond to a
: document then the new type field will contain all the values. So if I

so make a separate field for each type of date, lucene is very good at
supporting documents with heterogeneous sets of fields -- and it's even
better if you use OMIT_NORMS (which makes perfect sense for a date field
where norm values are meaningless anyway)


RE: Questions on Query Scorer

2006-06-15 Thread Mile Rosu
Hello,

The problem may rather lie in the name of the field you are querying,
"prohibited" in your case.
You can use Luke (http://www.getopt.org/luke/) to check the structure of
the index on which you are running your query.

Mile

-Original Message-
From: Ferdinand Chan [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 15, 2006 12:26 PM
To: java-user@lucene.apache.org
Subject: Questions on Query Scorer

How can I create a QueryScorer in Lucene 2.0?

When I create a QueryScorer using the following code,

BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.add(q1, BooleanClause.Occur.SHOULD);
booleanQuery.add(q2, BooleanClause.Occur.SHOULD);

QueryScorer scorer = new QueryScorer(booleanQuery);

it compiles successfully but throws a runtime exception when I execute
the code.

java.lang.NoSuchFieldError: prohibited
    at org.apache.lucene.search.highlight.QueryTermExtractor.getTermsFromBooleanQuery(QueryTermExtractor.java:91)
    at org.apache.lucene.search.highlight.QueryTermExtractor.getTerms(QueryTermExtractor.java:66)
    at org.apache.lucene.search.highlight.QueryTermExtractor.getTerms(QueryTermExtractor.java:59)
    at org.apache.lucene.search.highlight.QueryTermExtractor.getTerms(QueryTermExtractor.java:45)
    at org.apache.lucene.search.highlight.QueryScorer.<init>(QueryScorer.java:48)

Can anyone suggest a solution to this problem?

Thanks

Ferdinand





RE: How to search for europian word with and without special characters

2006-06-20 Thread Mile Rosu
Hello Supriya,

One possibility would be to search for both müller and mueller from the
interface. That means you would "normalize" the search query in some way
before running it. This solution does not affect the content of the existing index (no
reindexing needed).
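
A minimal sketch of such query-side normalization, assuming the usual German transliteration (ä to ae, ö to oe, ü to ue, ß to ss): the query term is expanded into its variants, which are then ORed together. UmlautVariants and its methods are hypothetical names. Note that the sketch only goes from müller to mueller; the reverse direction is ambiguous (not every "ue" stands for "ü") and is left out.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

final class UmlautVariants {
    /** Returns the lowercased term plus its digraph transliteration, if different. */
    static Set<String> variants(String term) {
        String lower = term.toLowerCase(Locale.GERMAN);
        String transliterated = lower
                .replace("\u00e4", "ae")   // a-umlaut
                .replace("\u00f6", "oe")   // o-umlaut
                .replace("\u00fc", "ue")   // u-umlaut
                .replace("\u00df", "ss");  // sharp s
        Set<String> result = new LinkedHashSet<>();
        result.add(lower);
        result.add(transliterated);      // a set, so identical variants collapse
        return result;
    }

    /** Builds an OR query string over all variants for the given field. */
    static String toQuery(String field, String term) {
        StringBuilder sb = new StringBuilder();
        for (String variant : variants(term)) {
            if (sb.length() > 0) {
                sb.append(" OR ");
            }
            sb.append(field).append(':').append(variant);
        }
        return sb.toString();
    }
}
```

For a field called name (an assumed field name) and the term Müller, toQuery returns a query that ORs name:müller with name:mueller; a term without special characters is returned unchanged as a single clause.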

Greets,
Mile

-Original Message-
From: Supriya Kumar Shyamal [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, June 20, 2006 3:09 PM
To: java-user@lucene.apache.org
Subject: How to search for europian word with and without special characters

Hi All,

I have a question regarding indexing and searching for German
characters. For example, when I search for the word "müller" I also want to
search for the word "mueller". How can I achieve this in Lucene?

Thanks,
supriya






RE: BooleanQuery

2006-06-21 Thread Mile Rosu
You should specify the field name for influenza as well, otherwise that term is searched in the default field.

Like this:

+doccontent:avian +doccontent:influenza +doctype:AM +docdate:[2005033122000 TO 2006062022000]
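
If the terms come from user input, the field prefix can be added programmatically before the query string is parsed. The following is a small sketch of that idea; FieldQualifier and requireAll are hypothetical names, and the helper does no escaping of query-syntax characters, so it is only suitable for plain words.

```java
final class FieldQualifier {
    /** Prefixes every whitespace-separated term with "+field:" so each one is required in that field. */
    static String requireAll(String field, String terms) {
        StringBuilder sb = new StringBuilder();
        for (String term : terms.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // happens only for blank input
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append('+').append(field).append(':').append(term);
        }
        return sb.toString();
    }
}
```

requireAll("doccontent", "avian influenza") gives the first two clauses of the query above; the doctype and docdate clauses would be appended unchanged.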

Mile

-Original Message-
From: WATHELET Thomas [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, June 21, 2006 4:40 PM
To: java-user@lucene.apache.org
Subject: BooleanQuery

Why do I retrieve hits with this query:

+doccontent:avian +doctype:AM +docdate:[2005033122000 TO 2006062022000]

and not with this one?

+doccontent:avian influenza +doctype:AM +docdate:[2005033122000 TO 2006062022000]







Searching for a phrase which spans on 2 pages

2006-07-11 Thread Mile Rosu

Hello,

I am working on an application similar to Google Books, which allows
searching documents that each represent one scanned page. Of course, one
might search for a phrase that starts at the end of one page and ends at
the beginning of the next one, and I do not know how to
handle this case. Both pages should be returned as hits.

Do you have any idea how this situation might be handled?

Thank you,
Mile Rosu




Re: Searching for a phrase which spans on 2 pages

2006-07-12 Thread Mile Rosu

Hello Erick,

I have been trying some scenarios on Google Books and apparently found a
Google bug ...

It looks like they use approach number 2, as this query illustrates.

http://books.google.com/books?vid=ISBN1564968316&id=14Xx2T8tmMYC&pg=PA8&lpg=PA8&dq=%2B%22the+site+is+unburdened%22&sig=QRJSkKNLm0JlbkcWe2m1-y8YYz0

The phrase returns 2 hits, but if you look at the documents, the phrase
is visible only in the first one.


Anyway, it makes it possible to find something like:

http://books.google.com/books?q=%22sense+of+dissatisfaction+with+existing+elements%22&btnG=Search+Books
The returned page is the first of the pages that the phrase spans (but
there is no highlighting any more).


It seems we are really close to a good solution; we are now looking for a way
to implement it in terms of index structure.


Thanks again,
Mile Rosu


Erick Erickson wrote:
I can think of several approaches, but the experts will no doubt show me up ..

1> index the entire book as a single document. Also, index the beginning and
ending offset of each page in separate "documents". Assuming you can find
the offset in the big doc of each matching phrase, you can also find out
what pages each match starts on and ends on, and if they are different you'd
know to display two pages. Not sure what this does to relevancy...

2> Index, say, the 10 words on the previous page and 10 words on the next
page with the current page. You'd have to make sure your match wasn't
entirely within the 10 words you prepended or appended to the "match" page
(again by match position) when you returned data.

3> Have a series of "joiner" "documents". One for the 9 words of page n, and
9 words of page n + 1 (along with the page number). Another set for 8
before and 8 after. etc. down to 1. If your phrase was 10 words, you'd
search your normal pages, and the 9 word "joiner" pages. Any match in the
joiners would be a page spanner. Again, what does that do to relevancy?
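
Approach 2, including the check that a match is not entirely inside the borrowed words, can be sketched as follows. This is a plain-Java illustration on whitespace-separated words and not a Lucene implementation; PageOverlapSketch, matchesPage and the overlap parameter are hypothetical names, and matching is case-sensitive for simplicity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PageOverlapSketch {
    static List<String> splitWords(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    /**
     * True if the phrase occurs in page pageIndex once the page is extended by the
     * last `overlap` words of the previous page and the first `overlap` words of the
     * next page, and the occurrence touches at least one word of the page itself.
     */
    static boolean matchesPage(List<String> pages, int pageIndex, int overlap, String phrase) {
        List<String> target = splitWords(phrase);
        if (target.isEmpty()) {
            return false;
        }
        List<String> all = new ArrayList<>();
        if (pageIndex > 0) {
            List<String> prev = splitWords(pages.get(pageIndex - 1));
            all.addAll(prev.subList(Math.max(0, prev.size() - overlap), prev.size()));
        }
        int ownStart = all.size();                 // first word of the page itself
        all.addAll(splitWords(pages.get(pageIndex)));
        int ownEnd = all.size();                   // one past the page's last word
        if (pageIndex + 1 < pages.size()) {
            List<String> next = splitWords(pages.get(pageIndex + 1));
            all.addAll(next.subList(0, Math.min(overlap, next.size())));
        }
        for (int pos = 0; pos + target.size() <= all.size(); pos++) {
            boolean touchesOwnPage = pos < ownEnd && pos + target.size() > ownStart;
            if (touchesOwnPage && all.subList(pos, pos + target.size()).equals(target)) {
                return true;
            }
        }
        return false;
    }
}
```

With the pages "the quick brown fox" and "jumps over the lazy dog" and an overlap of 2, the phrase "brown fox jumps" matches both pages, while "brown fox" matches only the first, because on the second page it lies entirely within the prepended words.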


Note that there is no requirement that every document have the same fields,
so your searches can be disjoint. Also, I'm assuming that you can reasonably
decide that, say, 10 word phrases are the max you'll respect, which may not
be true.

I have no idea whether these are reasonable approaches given your problem
domain

Best
Erick







PhraseQuery - retrieving the fieldname

2006-07-12 Thread Mile Rosu

Hello,

A small problem this time: I would like to retrieve the field name of a
PhraseQuery.

Could you please tell me the best way to do this?

Thank you,
Mile Rosu

