Retrieving exact matches
Hello,

I have been looking through the documentation but haven't found a solution to this: is there a way to retrieve only the record "picasso" when the query is picasso, and not both of the records "picasso" and "picasso pablo", i.e. a 100% match of the query?

Thank you,
Mile Rosu

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Removing brackets before indexing
Hello!

I am currently trying to index Latin-language documents in which missing letters are supplied inside square brackets, like this: "[divinit]atis". Could you please tell me the best practice for removing the brackets before adding the text to the Lucene index (in the example, so that the word "divinitatis" is stored)?

Thank you a lot,
Mile Rosu
RE: Removing brackets before indexing
Hello Otis,

Thank you for the hint. I have made a custom analyzer which uses a custom tokenizer similar to CharTokenizer: it treats brackets as token characters but removes them in the next() method, because I do not want the word to be split when it is added to the index. With plain SimpleAnalyzer the words were split at the brackets. It seems to work OK, but it still needs more testing.

Mile

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, May 31, 2006 7:36 PM
To: java-user@lucene.apache.org
Subject: Re: Removing brackets before indexing

Mile,

Any Analyzer that uses a Tokenizer that throws out non-letter characters will do. For example, take a look at SimpleAnalyzer. It uses LowerCaseTokenizer. If you read the javadoc for LowerCaseTokenizer, I think you will see it suits you.

Otis
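The tokenizer idea described in the reply above can be sketched in plain Java. This is only an illustration of the technique, not Mile's actual code and not the Lucene Tokenizer API: the class and method names are hypothetical, and the brackets are dropped while the token is being accumulated rather than in a separate next() step.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java stand-in for the custom tokenizer idea (hypothetical names, no Lucene classes). */
final class BracketTokenizerSketch {

    /** Letters and square brackets both count as token characters, so "[divinit]atis" is not split. */
    static boolean isTokenChar(char c) {
        return Character.isLetter(c) || c == '[' || c == ']';
    }

    /** Splits text at non-token characters and leaves the brackets out of the emitted tokens. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        // Iterate one position past the end so the last token is flushed too.
        for (int i = 0; i <= text.length(); i++) {
            boolean inToken = i < text.length() && isTokenChar(text.charAt(i));
            if (inToken) {
                char c = text.charAt(i);
                if (c != '[' && c != ']') {
                    current.append(c); // keep letters, skip the brackets themselves
                }
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Templum [divinit]atis et pacis"));
    }
}
```

A tokenizer that splits at every non-letter would instead produce "divinit" and "atis" as two tokens, which is the behaviour reported for SimpleAnalyzer.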
Using more than one index
Hello,

We have an application dealing with historical books. The books have metadata consisting of event dates and person names, among other things. The FullText, Person and Date indexes were kept separate until we realized that, for a larger number of documents (400K), combining the hits of the sequential searches took far too long to complete (15 min). The date index was built using the suggestion found at http://wiki.apache.org/jakarta-lucene/LargeScaleDateRangeProcessing (big thanks for the hint).

Is there a recommended approach to combining results from different indexes (with different fields)?

The index structure:

MainIndex:
  @ID@ - keyword (document id)
  @FULLTEXT@ - tokenized (used for full text search)
  Ptitle - tokenized (used for full text publication title search)
  Dtitle - tokenized (used for full text document title search)
  Type - keyword (used for document type)

PersonIndex:
  @ID@ - keyword (document id, equal to the @ID@ in MainIndex)
  Person - tokenized (full text person name search)

DateIndex:
  @ID@ - keyword (document id, equal to the @ID@ in MainIndex)
  Date - keyword (date as YYYYMMDD)
  Type - type of date (document date, birthday, etc.)
  @YYYY@ - year of date
  @YYYYMM@ - year and month of date
  @DDD@ - decade
  @CC@ - century of date

Example: to search for documents that contain person "John", full text "book" and a date before 06/12/2005:

Step 1: search PersonIndex for John and retrieve all @ID@ values from the hit list
Step 2: search DateIndex for documents that have dates before 06/12/2005 and retrieve the @ID@ values from the hit list
Step 3: search MainIndex for "book" and retrieve all @ID@ values
Step 4: combine all the lists
Step 5: search MainIndex for documents with the @ID@ values from the combined id list

Each search takes less than 1 second, but retrieving @ID@ from the index takes a lot longer, and the time grows with the number of hits. This is because, when a field value is retrieved from a document hit, the Lucene engine loads all the fields of that document from the index (the entire document). So if one search returns 300,000 hits, I have to iterate through all of them and retrieve the @ID@ field value, which takes a lot of time.

Regards,
Mile Rosu
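Step 4 above ("combine all the lists") is not spelled out in the message. A minimal sketch, assuming that all criteria must hold at once and that the combination is therefore a set intersection of the @ID@ values; the class and method names are hypothetical:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Combines per-index ID sets by intersection (assumed reading of "combine all the lists"). */
final class IdListIntersectionSketch {

    static Set<String> intersect(List<Set<String>> idSets) {
        if (idSets.isEmpty()) {
            return new TreeSet<>();
        }
        // Start from the smallest set so the working set never grows beyond it.
        List<Set<String>> bySize = new ArrayList<>(idSets);
        bySize.sort((a, b) -> Integer.compare(a.size(), b.size()));
        Set<String> result = new TreeSet<>(bySize.get(0));
        for (int i = 1; i < bySize.size(); i++) {
            result.retainAll(bySize.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up IDs standing in for the hit lists of steps 1 to 3.
        Set<String> personIds = new HashSet<>(Arrays.asList("b1", "b2", "b3"));
        Set<String> dateIds = new HashSet<>(Arrays.asList("b2", "b3", "b4"));
        Set<String> fullTextIds = new HashSet<>(Arrays.asList("b3", "b5"));
        System.out.println(intersect(Arrays.asList(personIds, dateIds, fullTextIds))); // prints [b3]
    }
}
```

The intersection itself is cheap; as the message notes, the expensive part is reading the @ID@ value for every hit before the sets can be built.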
RE: Best solution for the Date Range problem
Hello,

You might consider using the suggestion at http://wiki.apache.org/jakarta-lucene/LargeScaleDateRangeProcessing. We successfully used it to search over wide date ranges on a relatively large number of date records. This approach greatly simplifies the query you suggest in option 3. Gluing YYYY and MM together in a single field such as YYYYMM would also make your query look nicer.

Greets,
Mile Rosu

-----Original Message-----
From: Björn Ekengren [mailto:[EMAIL PROTECTED]]
Sent: Monday, June 12, 2006 11:51 AM
To: java-user@lucene.apache.org
Subject: Best solution for the Date Range problem

Hi,

I would like users to be able to search on both terms and within a date range. The solutions I have come across so far are:

1. Use the default QueryParser, which will use RangeQuery, which in turn expands into a number of Boolean clauses. It is quite likely that this will run into the TooManyClauses error.

2. Extend QueryParser, override getRangeQuery() and let it return a FilteredQuery containing a RangeFilter.

3. Split dates during indexing into YYYY, MM, DD and create a custom RangeQuery that uses only the granularity needed:

   +date[20040830 TO 20060202]

   expands to

   (year:2004 AND month:08 AND day:30) OR
   (year:2004 AND month:08 AND day:31) OR
   (year:2004 AND month:09) OR
   (year:2004 AND month:10) OR
   (year:2004 AND month:11) OR
   (year:2004 AND month:12) OR
   (year:2005) OR
   (year:2006 AND month:01) OR
   (year:2006 AND month:02 AND day:01) OR
   (year:2006 AND month:02 AND day:02)

Are there any other options, and which one is the best?

/B
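The expansion in option 3 can be generated mechanically. The following is a rough sketch of one way to do it, not code from the thread: it walks from the start date to the end date and emits a whole-year clause, a whole-month clause or a single-day clause, whichever is the coarsest that still fits inside the range. The field names year, month and day are taken from the example above; the class and method names are hypothetical.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Expands an inclusive date range into year / month / day clauses of minimal granularity. */
final class DateRangeExpansionSketch {

    static List<String> expand(LocalDate start, LocalDate end) {
        List<String> clauses = new ArrayList<>();
        LocalDate cursor = start;
        while (!cursor.isAfter(end)) {
            LocalDate yearEnd = LocalDate.of(cursor.getYear(), 12, 31);
            LocalDate monthEnd = cursor.withDayOfMonth(cursor.lengthOfMonth());
            if (cursor.getDayOfYear() == 1 && !yearEnd.isAfter(end)) {
                // The whole calendar year lies inside the range.
                clauses.add(String.format(Locale.ROOT, "(year:%04d)", cursor.getYear()));
                cursor = yearEnd.plusDays(1);
            } else if (cursor.getDayOfMonth() == 1 && !monthEnd.isAfter(end)) {
                // The whole calendar month lies inside the range.
                clauses.add(String.format(Locale.ROOT, "(year:%04d AND month:%02d)",
                        cursor.getYear(), cursor.getMonthValue()));
                cursor = monthEnd.plusDays(1);
            } else {
                // Only this single day can be covered.
                clauses.add(String.format(Locale.ROOT, "(year:%04d AND month:%02d AND day:%02d)",
                        cursor.getYear(), cursor.getMonthValue(), cursor.getDayOfMonth()));
                cursor = cursor.plusDays(1);
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        List<String> clauses = expand(LocalDate.of(2004, 8, 30), LocalDate.of(2006, 2, 2));
        System.out.println(String.join(" OR\n", clauses));
    }
}
```

For the range 20040830 TO 20060202 this yields the same ten clauses as listed in option 3. With a combined YYYYMM field, as suggested in the reply, each month clause would shrink to a single term.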
RE: Using more than one index
Hi Hoss,

Thanks for your quick answer. One remaining problem with the dates is this: a document (in our case an XML file with a lot of metadata) can have more than one date, and each date has 2 attributes (type and art).

Eg: 00-00-1886

In the date index I have, for every date element in the input XML, a document with the fields type (document | other), date, and art (birthday | deportation | death ...). If I merge all the dates that belong to one document, the new type field will contain all the values. So if I search for a document that has type:document, art:birthday and a date between a and b, I won't get the correct results.

Regards,
Mile Rosu

-----Original Message-----
From: Chris Hostetter [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, June 13, 2006 9:55 AM
To: java-user@lucene.apache.org
Subject: Re: Using more than one index

A couple of suggestions...

1) Don't use multiple indexes. Create one index, with one document per "thing" you want to return (in this case it sounds like books), and index all of the relevant data about each thing in that doc. If multiple people worked on a book, add all of their names to the same field. Add all of the dates to the book doc -- if you need to distinguish the different types of dates, make a separate field for each type.

If you *must* cross-reference...

2) Make sure you aren't using the Hits API to iterate over all the results when gathering IDs -- use a lower-level API (like a HitCollector).

3) Use the FieldCache to get the IDs instead of the stored Document fields.

4) Don't extract full ID lists from all of the indexes and then search one of the indexes again with the ID list ... use the ID lists generated from the supporting indexes (people and dates) to build a Filter that you can use when searching the main index.

-Hoss
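Suggestion 4 can be pictured without any Lucene classes. In the sketch below a java.util.BitSet with one bit per main-index document plays the role of the Filter, and a plain map from @ID@ value to document number stands in for a FieldCache-style lookup. Both are assumptions made for illustration, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Turns ID lists from supporting indexes into a bit-per-document filter for the main index. */
final class IdFilterSketch {

    /** Sets one bit for every ID that is known to the main index; unknown IDs are ignored. */
    static BitSet toBits(Iterable<String> ids, Map<String, Integer> idToDocNum, int maxDoc) {
        BitSet bits = new BitSet(maxDoc);
        for (String id : ids) {
            Integer docNum = idToDocNum.get(id);
            if (docNum != null) {
                bits.set(docNum);
            }
        }
        return bits;
    }

    /** A document passes only if both supporting indexes matched it. */
    static BitSet allowed(BitSet people, BitSet dates) {
        BitSet result = (BitSet) people.clone();
        result.and(dates);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical mapping from @ID@ value to main-index document number.
        Map<String, Integer> idToDocNum = new HashMap<>();
        idToDocNum.put("doc-a", 0);
        idToDocNum.put("doc-b", 1);
        idToDocNum.put("doc-c", 2);

        BitSet people = toBits(Arrays.asList("doc-a", "doc-c"), idToDocNum, 3);
        BitSet dates = toBits(Arrays.asList("doc-c", "doc-b"), idToDocNum, 3);
        System.out.println(allowed(people, dates)); // prints {2}
    }
}
```

The main-index search then only has to test a bit for each candidate document, instead of running a second query built from a long ID list.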
RE: Using more than one index
Hello again,

At http://people.level7.ro/mile.rosu/small_index.zip there are a couple of documents from our index which might give you a better overview of our problem (both the separate indexes and a merged version). Our problem remains with the date index: a date record has additional fields used for date range searches, and unfortunately these cannot be merged into one index, or at least we do not see a solution.

Thank you,
Mile

-----Original Message-----
From: Chris Hostetter [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, June 13, 2006 8:45 PM
To: java-user@lucene.apache.org
Cc: Mircea Pop
Subject: RE: Using more than one index

: A document (in our case an xml that has many metadata) can have more
: than one date, each date with 2 attributes:
: 00-00-1886
:
: In the date index I have for every date element in the input xml a document
: with fields: type (document | other), date, art (birthday | deportation |
: death...). For example if I merge all the dates that correspond to a
: document then the new type field will contain all the values. So if I

So make a separate field for each type of date. Lucene is very good at supporting documents with heterogeneous sets of fields -- and it's even better if you use OMIT_NORMS (which makes perfect sense for a date field, where norm values are meaningless anyway).
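A small sketch of the "separate field for each type of date" idea, using a plain map as a stand-in for a Lucene document. The field naming scheme (date_ plus type plus art) and the sample values are assumptions made for illustration; they do not appear in the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Models one merged book document whose dates are kept apart by field name. */
final class DateFieldsSketch {

    /** Hypothetical naming scheme: one field per (type, art) combination. */
    static String fieldName(String type, String art) {
        return "date_" + type + "_" + art;
    }

    /** Adds a date value to the field that belongs to its type and art. */
    static void addDate(Map<String, List<String>> doc, String type, String art, String yyyymmdd) {
        doc.computeIfAbsent(fieldName(type, art), k -> new ArrayList<>()).add(yyyymmdd);
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = new TreeMap<>();
        addDate(doc, "document", "birthday", "18860512"); // made-up sample values
        addDate(doc, "other", "death", "19420301");
        System.out.println(doc);
    }
}
```

A search for type:document, art:birthday and a date between a and b then becomes a range query on the single field date_document_birthday, so dates of other kinds stored in the same document cannot produce false matches.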
RE: Questions on Query Scorer
Hello,

The problem may rather be in the name of the field you are querying, "prohibited" in your case. You can use Luke (http://www.getopt.org/luke/) to check the structure of the index on which you are running your query.

Mile

-----Original Message-----
From: Ferdinand Chan [mailto:[EMAIL PROTECTED]]
Sent: Thursday, June 15, 2006 12:26 PM
To: java-user@lucene.apache.org
Subject: Questions on Query Scorer

How can I create a QueryScorer in Lucene 2.0?

When I create a QueryScorer using the following code,

    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.highlight.QueryScorer;

    // q1 and q2 are Query objects built elsewhere
    BooleanQuery booleanQuery = new BooleanQuery();
    booleanQuery.add(q1, BooleanClause.Occur.SHOULD);
    booleanQuery.add(q2, BooleanClause.Occur.SHOULD);
    QueryScorer scorer = new QueryScorer(booleanQuery);

it compiles successfully but throws a runtime exception when I execute it:

    java.lang.NoSuchFieldError: prohibited
        at org.apache.lucene.search.highlight.QueryTermExtractor.getTermsFromBooleanQuery(QueryTermExtractor.java:91)
        at org.apache.lucene.search.highlight.QueryTermExtractor.getTerms(QueryTermExtractor.java:66)
        at org.apache.lucene.search.highlight.QueryTermExtractor.getTerms(QueryTermExtractor.java:59)
        at org.apache.lucene.search.highlight.QueryTermExtractor.getTerms(QueryTermExtractor.java:45)
        at org.apache.lucene.search.highlight.QueryScorer.<init>(QueryScorer.java:48)

Can anyone suggest a solution to this problem?

Thanks
Ferdinand
RE: How to search for European words with and without special characters
Hello Supriya,

One possibility would be to search for both müller and mueller from the interface. This means you "normalize" the search query in some way before running it. This solution would not affect the content of the existing index (no reindexing needed).

Greets,
Mile

-----Original Message-----
From: Supriya Kumar Shyamal [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, June 20, 2006 3:09 PM
To: java-user@lucene.apache.org
Subject: How to search for European words with and without special characters

Hi All,

I have a question regarding indexing and searching for German characters. For example, when I search for the word "müller" I also want to find the word "mueller". How can I achieve this in Lucene?

Thanks,
supriya
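The query-side "normalization" suggested in the reply could look roughly like the sketch below: the search term is expanded into its original spelling plus a transliterated variant, and both are ORed together. The field name "name" and all class and method names are hypothetical, and only the German umlauts and sharp s are handled.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Expands a search term into its original and its transliterated spelling. */
final class UmlautVariantsSketch {

    /** Replaces German umlauts and sharp s with their two-letter spellings. */
    static String transliterate(String term) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            switch (c) {
                case '\u00e4': out.append("ae"); break; // a umlaut
                case '\u00f6': out.append("oe"); break; // o umlaut
                case '\u00fc': out.append("ue"); break; // u umlaut
                case '\u00c4': out.append("Ae"); break; // capital A umlaut
                case '\u00d6': out.append("Oe"); break; // capital O umlaut
                case '\u00dc': out.append("Ue"); break; // capital U umlaut
                case '\u00df': out.append("ss"); break; // sharp s
                default: out.append(c);
            }
        }
        return out.toString();
    }

    /** Builds "field:original OR field:variant", or a single clause if both spellings are equal. */
    static String toQuery(String field, String term) {
        Set<String> variants = new LinkedHashSet<>();
        variants.add(term);
        variants.add(transliterate(term));
        StringBuilder query = new StringBuilder();
        for (String variant : variants) {
            if (query.length() > 0) {
                query.append(" OR ");
            }
            query.append(field).append(':').append(variant);
        }
        return query.toString();
    }

    public static void main(String[] args) {
        System.out.println(toQuery("name", "m\u00fcller"));
    }
}
```

This only covers a user who types "müller". To let a user who types "mueller" find "müller" as well, the reverse mapping would be needed at query time, or the same transliteration would have to be applied at indexing time, which does require reindexing.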
RE: BooleanQuery
You should specify the field name for influenza as well, like this:

+doccontent:avian +doccontent:influenza +doctype:AM +docdate:[2005033122000 TO 2006062022000]

Mile

-----Original Message-----
From: WATHELET Thomas [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, June 21, 2006 4:40 PM
To: java-user@lucene.apache.org
Subject: BooleanQuery

Why do I retrieve hits with this query:

+doccontent:avian +doctype:AM +docdate:[2005033122000 TO 2006062022000]

and not with this one:

+doccontent:avian influenza +doctype:AM +docdate:[2005033122000 TO 2006062022000]
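In the query parser syntax a field prefix applies only to the term that directly follows it, so in the second query "influenza" is searched in the default field rather than in doccontent. A small helper along the following lines (hypothetical names, no escaping of special query characters) qualifies every word of the user input before the query string is parsed:

```java
/** Prefixes every whitespace-separated word with "+field:" so that all words are required in that field. */
final class FieldQualifierSketch {

    static String requireAll(String field, String text) {
        StringBuilder query = new StringBuilder();
        for (String term : text.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // happens only for blank input
            }
            if (query.length() > 0) {
                query.append(' ');
            }
            query.append('+').append(field).append(':').append(term);
        }
        return query.toString();
    }

    public static void main(String[] args) {
        // prints +doccontent:avian +doccontent:influenza
        System.out.println(requireAll("doccontent", "avian influenza"));
    }
}
```

The result can then be concatenated with the remaining clauses (+doctype:AM and the docdate range) exactly as in the corrected query above.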
Searching for a phrase which spans on 2 pages
Hello,

I am working on an application similar to Google Books which allows searching documents, each of which represents one scanned page. Of course, someone might search for a phrase that starts at the end of one page and ends at the beginning of the next. I do not know how to handle this case; both pages should be returned as hits. Do you have any idea how this situation might be handled?

Thank you,
Mile Rosu
Re: Searching for a phrase which spans on 2 pages
Hello Erick,

I have been trying some scenarios on Google Books and apparently found a Google bug... It looks like they use approach number 2, as this query illustrates:

http://books.google.com/books?vid=ISBN1564968316&id=14Xx2T8tmMYC&pg=PA8&lpg=PA8&dq=%2B%22the+site+is+unburdened%22&sig=QRJSkKNLm0JlbkcWe2m1-y8YYz0

The phrase returns 2 hits, but if you look at the documents, the phrase is visible only in the first one. Anyway, this makes it possible to find something like:

http://books.google.com/books?q=%22sense+of+dissatisfaction+with+existing+elements%22&btnG=Search+Books

The returned page is the first of the pages which the phrase spans (but there is no highlighting any more). It seems we are really close to a good solution; we are now looking for a way to implement it in terms of index structure.

Thanks again,
Mile Rosu

Erick Erickson wrote:

I can think of several approaches, but the experts will no doubt show me up...

1> Index the entire book as a single document. Also, index the beginning and ending offset of each page in separate "documents". Assuming you can find the offset in the big doc of each matching phrase, you can also find out what pages each match starts on and ends on, and if they are different you'd know to display two pages. Not sure what this does to relevancy...

2> Index, say, the last 10 words of the previous page and the first 10 words of the next page together with the current page. You'd have to make sure your match wasn't entirely within the 10 words you prepended or appended to the "match" page (again by match position) when you returned data.

3> Have a series of "joiner" "documents": one for the last 9 words of page n and the first 9 words of page n + 1 (along with the page number), another set for 8 before and 8 after, etc., down to 1. If your phrase was 10 words, you'd search your normal pages and the 9-word "joiner" pages. Any match in the joiners would be a page spanner. Again, what does that do to relevancy?

Note that there is no requirement that every document have the same fields, so your searches can be disjoint. Also, I'm assuming that you can reasonably decide that, say, 10-word phrases are the max you'll respect, which may not be true.

I have no idea whether these are reasonable approaches given your problem domain.

Best
Erick
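Approach 2 can be tried out on plain strings before any index structure is designed. The sketch below is only an illustration under simple assumptions (whitespace-separated words, case-insensitive comparison, hypothetical names): each page is searched together with a few words borrowed from its neighbours, and a match counts only if it touches at least one word that belongs to the page itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Finds the pages on which a phrase occurs, including phrases that span a page break. */
final class PageOverlapSketch {

    static String[] words(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /** Returns the indexes of all pages whose own words take part in a match of the phrase. */
    static List<Integer> pagesMatching(List<String> pages, String phrase, int overlap) {
        List<String> phraseWords = Arrays.asList(words(phrase));
        List<Integer> hits = new ArrayList<>();
        if (phraseWords.isEmpty()) {
            return hits;
        }
        for (int p = 0; p < pages.size(); p++) {
            List<String> own = Arrays.asList(words(pages.get(p)));
            List<String> prev = p > 0 ? Arrays.asList(words(pages.get(p - 1))) : new ArrayList<String>();
            List<String> next = p + 1 < pages.size() ? Arrays.asList(words(pages.get(p + 1))) : new ArrayList<String>();
            int borrowedBefore = Math.min(overlap, prev.size());
            int borrowedAfter = Math.min(overlap, next.size());

            // Window = tail of the previous page + the page itself + head of the next page.
            List<String> window = new ArrayList<>();
            window.addAll(prev.subList(prev.size() - borrowedBefore, prev.size()));
            window.addAll(own);
            window.addAll(next.subList(0, borrowedAfter));
            int ownStart = borrowedBefore;
            int ownEnd = borrowedBefore + own.size(); // exclusive

            for (int s = 0; s + phraseWords.size() <= window.size(); s++) {
                int e = s + phraseWords.size();
                boolean matches = window.subList(s, e).equals(phraseWords);
                boolean touchesOwnWords = s < ownEnd && e > ownStart;
                if (matches && touchesOwnWords) {
                    hits.add(p);
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList("the site is", "unburdened by clutter");
        System.out.println(pagesMatching(pages, "site is unburdened", 2)); // prints [0, 1]
        System.out.println(pagesMatching(pages, "site is", 2));            // prints [0]
    }
}
```

The second call shows the check Erick mentions: on the second page the phrase "site is" lies entirely within the borrowed words, so that page is not reported.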
PhraseQuery - retrieving the fieldname
Hello,

A small problem this time: I would like to retrieve the field name of a PhraseQuery. Could you please tell me the best way to do this?

Thank you,
Mile Rosu