Re: Boost a field in fuzzy query
You could build the query up in your program, or that part of it anyway:

    BooleanQuery bq = new BooleanQuery();
    FuzzyQuery fq = new FuzzyQuery(...);
    fq.setBoost(123f);
    bq.add(fq, BooleanClause.Occur.SHOULD);
    ...

This might be a bug in MultiFieldQueryParser - you could provide a test case or, better, a patch. See https://issues.apache.org/jira/browse/LUCENENET-147. That issue is against Lucene.Net, but the comment there says it "would probably mean Lucene Java also suffers from the same bug".

Presumably you've read the "not very scalable" warning in the javadocs for FuzzyQuery. And you don't say which version of Lucene you are using; if not the latest, try that.

--
Ian.

On Mon, Mar 14, 2011 at 5:33 AM, chhava40 wrote:
> Hi,
> I am using MultiFieldQueryParser to parse queries over multiple fields with
> custom boosts for each field. The issue arises when one of the terms in the
> query is fuzzy, e.g. abc~. For such a term the field boost is not applied.
> If the query is "abc~ xyz" and the fields are f1 & f2 with boosts 10 and 5,
> the parsed query output is:
> (f1:abc~0.5 f2:abc~0.5) (f1:xyz^10 f2:xyz^5).
> Is there any way to apply the field boost factor to fuzzy terms as well?
> Thanks.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
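A minimal sketch of Ian's suggestion, building the per-field boosted fuzzy clauses by hand instead of relying on MultiFieldQueryParser. This assumes the Lucene 3.x API; the field names and boosts are taken from the poster's example:

```java
// Sketch only (Lucene 3.x): build (f1:abc~^10 f2:abc~^5) manually,
// since MultiFieldQueryParser drops the field boosts on fuzzy terms.
BooleanQuery query = new BooleanQuery();

FuzzyQuery f1 = new FuzzyQuery(new Term("f1", "abc"), 0.5f);
f1.setBoost(10f);                         // the boost the parser failed to apply
query.add(f1, BooleanClause.Occur.SHOULD);

FuzzyQuery f2 = new FuzzyQuery(new Term("f2", "abc"), 0.5f);
f2.setBoost(5f);
query.add(f2, BooleanClause.Occur.SHOULD);
```

You could still let MultiFieldQueryParser handle the non-fuzzy terms and combine both queries in an outer BooleanQuery.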
Re: Indexing of multilingual labels
Hello Stephane,

I think a better way is to have a resource file per language and store a pointer in the index to the correct resource file (something like the I18N and L10N approach). Store the internationalised string in the index and all the related localised strings in resource files. This way the index size is reduced (adding to payloads would have a performance impact), which helps performance too. Your total search time would then be (search time + time to retrieve the language-based data).

Hope this helps.
-Vinaya

On Friday 11 March 2011 09:05 PM, Stephane Fellah wrote:

Erick,

I am trying to index multilingual taxonomies such as SKOS, WordNet, and EuroWordNet. Taxonomies are composed of concepts which have preferred and alternative labels in different languages. Some labels have the same lexical form in different languages. I want to be able to index these concepts in Lucene in order to search concepts by their label in one or several languages. I also want to be able to display a concept definition with all the alternative labels in different languages.

My question is: could we use the payload mechanism to store the language assigned to the word (I read somewhere that Google was using payloads to store information such as font, so why not language)? Wouldn't that be a better approach than using one field per language or one index per language?

Regards
Stephane

On Fri, Mar 11, 2011 at 7:52 AM, Erick Erickson wrote:

It's not so much a matter of problems with indexing/searching as it is with search behavior. The reason these strategies are implemented is that using English stemming, say, on other languages will produce "interesting" results. There's no a-priori reason you can't index multiple languages in the same field.

So I don't see what you would accomplish by using payloads to indicate which language the term is in. Could you expand a bit on what you're trying to accomplish here?
Maybe there are better solutions

Best
Erick

On Thu, Mar 10, 2011 at 10:29 PM, Stephane Fellah wrote:

I am trying to index in Lucene a field that could hold labels of concepts in different languages. Most of the approaches I have seen so far are:

- use a single index, where each document has a field for each language it uses, or
- use M indexes, M being the number of languages in the corpus.

Lucene 2.9+ has a feature called payloads that allows attributes to be attached to terms. Has anyone used this mechanism to store language (or other attributes such as datatypes) information? Does this approach work if labels are the same in different languages (does it break the inverted index)? How does performance compare to the two other approaches? Any pointer to source code showing how it is done would help.

Thanks

--
Stephane Fellah, M.Sc, B.Sc
Principal Engineer/Product Manager
smartRealm LLC
201 Loudoun St. SW
Leesburg, VA 20175
Tel: 703 669 5514
Cell: 571 502 8478
Fax: 703 669 5515
Re: Analyzer enquiry
Thanks a lot for your help Erick! About the fields you mentioned: if I don't use stemmers, is there anything else that I have to modify besides the constructor argument for the stop words?

Thanks,
Vicky

Quoting Erick Erickson:

StandardAnalyzer works well for most European languages. The problem will be stemming. Applying stemming via English rules to non-English languages produces...er...interesting results.

You can go ahead and create language-specific fields for each language and use StandardAnalyzer with the appropriate stopwords and stemming with each; this is a common approach. The Snowball stemmer takes a language parameter...

You need to use specific analyzers for Chinese, Japanese and Korean (CJK) documents though.

Hope that helps
Erick

On Sun, Mar 13, 2011 at 7:23 PM, Vasiliki Gkouta wrote:

Hello everybody,

I have an enquiry about StandardAnalyzer. Can I use it for languages other than English? I give the right list of stop words at initialization. Is there anything else inside the class that is by default set to English? I've found the analyzers for other languages too, but they seemed to be deprecated. Moreover, I use English and other languages together in my project, so I would like to ask if there is a way to use either the same analyzer class for all of them, or analyzers of the same functionality for all the languages. Thanks in advance!

Best regards,
Vicky
Re: Analyzer enquiry
I don't understand what you're saying here. If you put a stemmer in the constructor, you *are* using it. If you don't specify any stemmer at all, you still have to define different analyzers to use different stop word lists.

Can you restate your question?

Best
Erick

On Mon, Mar 14, 2011 at 8:21 AM, Vasiliki Gkouta wrote:
> Thanks a lot for your help Erick! About the fields you mentioned: If I don't
> use stemmers, except for the constructor argument related to the stop words,
> is there anything else that I have to modify?
>
> Thanks,
> Vicky
Re: Analyzer enquiry
Sorry for the confusion. I have two analyzers (both StandardAnalyzer) and use no stemmers. To one analyzer I passed a German stop word set in the constructor, and to the other an English stop word set. My question was whether I have to call any other function on the German analyzer for it to be correct.

Thank you.

Quoting Erick Erickson:

I don't understand what you're saying here. If you put a stemmer in the constructor, you *are* using it. If you don't specify any stemmer at all, you still have to define different analyzers to use different stop word lists.

Can you restate your question?

Best
Erick
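Vicky's setup can be sketched as follows, assuming the Lucene 3.0.x API; the German stop word list here is a made-up example, not a recommended set:

```java
// Two StandardAnalyzers that differ only in their stop word sets (Lucene 3.0.x).
// StandardAnalyzer does no stemming, so nothing else needs configuring.
Set<String> germanStops = new HashSet<String>(
        Arrays.asList("der", "die", "das", "und", "oder"));  // example words only

Analyzer german  = new StandardAnalyzer(Version.LUCENE_30, germanStops);
Analyzer english = new StandardAnalyzer(Version.LUCENE_30,
        StandardAnalyzer.STOP_WORDS_SET);                    // built-in English set
```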
Re: Indexing of multilingual labels
Stephane,

I think that you have the freedom to put what you want in the stored value of a field. The simplest would even be to make the fields that you want to use for display stored, preformatted, xml-ished, owl-ified, or json-ized, separate from the indexed fields (where you are only interested in the plain text). Payloads seem to do a job similar to a separate stored, non-indexed field.

The best approach I have had thus far is to use a multiplexing analyzer (which is called for indexed fields only anyway) that recognizes the language by the suffix of the field name.

As to the difference between one index with several fields and one field in many indices, I think it is just a programming difference. The tf and idf are always computed at the term level, so they make no difference. I tend to prefer multiple fields because it's easier to expand a query for, say, Fourrier, sent by a browser that says English but also accepts French and German, into:

- a query for Fourrier in the whitespace-tokenized track (always prefer that one)
- a query for fouri in the French field
- a query for fourier in the English and German fields

My current experience is that many users appear or claim to speak many languages (they do, a little bit).

Hope it helps.

paul

PS: not that my code is ideal, but here are the ones I have:
- i2geo, based on an ontology of concepts in OWL, http://i2geo.net/xwiki/bin/view/About/GeoSkills and http://svn.activemath.org/intergeo/Platform/SearchI2G/
- ActiveMath, fed by XML, http://www.activemath.org/javadoc/org/activemath/omdocjdom/index/package-summary.html

On 11 Mar 2011, at 16:35, Stephane Fellah wrote:

> Erick,
>
> I am trying to index multilingual taxonomies such as SKOS, WordNet, and
> EuroWordNet. Taxonomies are composed of concepts which have preferred and
> alternative labels in different languages. Some labels have the same
> lexical form in different languages.
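Paul's per-language-field scheme is usually wired up with a per-field analyzer. A sketch against the Lucene 3.x API (SnowballAnalyzer lives in contrib; the field names label_ws, label_en, label_fr are illustrative only):

```java
// Route each language-suffixed field to its own analyzer (Lucene 3.x sketch).
PerFieldAnalyzerWrapper analyzer =
        new PerFieldAnalyzerWrapper(new WhitespaceAnalyzer()); // the "whitespace track"

analyzer.addAnalyzer("label_en", new SnowballAnalyzer(Version.LUCENE_30, "English"));
analyzer.addAnalyzer("label_fr", new SnowballAnalyzer(Version.LUCENE_30, "French"));

// One document carries the same concept's labels in every language it has.
Document doc = new Document();
doc.add(new Field("label_ws", "Fourier", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("label_en", "Fourier transform", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("label_fr", "transformée de Fourier", Field.Store.YES, Field.Index.ANALYZED));
```

Passing this wrapper to the IndexWriter means a single index serves all languages while each field still gets language-appropriate analysis.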
lucene3.0.3 | Special character indexing
Hi,

I am creating an index using the Lucene 3.0.3 StandardAnalyzer. When a search is made on the index using a query like C, C# or C++, it gives the same result for all three terms. As I understand it, while creating the index the analyzer ignores special characters and does not index them. I have tried KeywordAnalyzer but it does not fulfill my requirement.

I need to be able to differentiate between "C", "C#" and "C++".

Do I have to create my own analyzer? Or do I have to modify the JFlex grammar? http://osdir.com/ml/java-dev/2009-06/msg00208.html

Please suggest: will any existing analyzer resolve this issue? Any suggestion will be appreciated!

Thanks & Regards,
Ranjit Kumar
Associate Software Engineer

US: +1 408.540.0001
UK: +44 208.099.1660
India: +91 124.474.8100 | +91 124.410.1350
FAX: +1 408.516.9050
http://www.otssolutions.com

===
Private, Confidential and Privileged. This e-mail and any files and attachments transmitted with it are confidential and/or privileged. They are intended solely for the use of the intended recipient. The content of this e-mail and any file or attachment transmitted with it may have been changed or altered without the consent of the author. If you are not the intended recipient, please note that any review, dissemination, disclosure, alteration, printing, circulation or transmission of this e-mail and/or any file or attachment transmitted with it, is prohibited and may be unlawful. If you have received this e-mail or any file or attachment transmitted with it in error please notify OTS Solutions at i...@otssolutions.com
===
Re: lucene3.0.3 | Special character indexing
Google finds http://www.gossamer-threads.com/lists/lucene/java-user/91750 which looks like a good starting point.

--
Ian.

P.S. Plain text emails are preferable.

On Mon, Mar 14, 2011 at 1:48 PM, Ranjit Kumar wrote:
> Hi,
>
> I am creating index using Lucene 3.0.3 StandardAnalyzer.
>
> when searching is made on index using query like C, C# or C++ it gives same
> result for all these three term. [...]
>
> Need to be able to differentiate between "C", "C#" and "C++"
Re: lucene3.0.3 | Special character indexing
Hello Ranjit,

Can you try the latest Luke tool? It has an analyzer section which helps in deciding which analyzer to use based on your input.

Hope this helps
-vinaya

On Monday 14 March 2011 07:18 PM, Ranjit Kumar wrote:
> I am creating index using Lucene 3.0.3 StandardAnalyzer.
> [...]
> Need to be able to differentiate between "C", "C#" and "C++"
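In Lucene the usual fix is a custom tokenizer whose `isTokenChar` also accepts `#` and `+` (e.g. a CharTokenizer subclass in 3.0.x). The character rule itself can be sketched in plain Java; the class name and exact character set here are assumptions, not an existing analyzer:

```java
import java.util.ArrayList;
import java.util.List;

public class TechTokenizerSketch {
    // Token characters: letters, digits, '#' and '+', so "C#" and "C++" survive.
    // In Lucene 3.0.x the same predicate would go into CharTokenizer.isTokenChar().
    static boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c) || c == '#' || c == '+';
    }

    // Split on non-token characters, lowercasing as we go.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (isTokenChar(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) tokens.add(current.toString());
        return tokens;
    }

    public static void main(String[] args) {
        // "C", "C#" and "C++" stay distinct terms: [i, know, c, c#, and, c++]
        System.out.println(tokenize("I know C, C# and C++."));
    }
}
```

Note that `+` and `#` then also survive in places you may not want (e.g. "1+1"), so the accepted character set deserves thought for your corpus.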
no. of documents with hits vs. no. of hits
Hi,

Does Lucene always count the number of documents matching a query, or is it also possible to count the overall number of hits? The two differ whenever a single document contains more than one occurrence of a query term.

Thank you in advance!

Best,
Michael
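For a single-term query, one way to get both counts is to walk the postings yourself. A sketch against the Lucene 3.x API; the field name "body", term "lucene", and `directory` variable are placeholders:

```java
// Count matching documents vs. total term occurrences (Lucene 3.x sketch).
IndexReader reader = IndexReader.open(directory, true);   // read-only reader
TermDocs termDocs = reader.termDocs(new Term("body", "lucene"));

int matchingDocs = 0;
long totalOccurrences = 0;
while (termDocs.next()) {
    matchingDocs++;
    totalOccurrences += termDocs.freq();  // within-document frequency
}
termDocs.close();
reader.close();
```

Hit counts reported by searches are document counts; per-document frequencies only influence scoring unless you sum them like this.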
Issue with disk space on UNIX
Hello All:

Background: I have a text-based search engine implemented in Java using Lucene 3.0. Indexing and re-indexing happens every night at 1 am as a scheduled process. The index size is around 1 gig and it is recreated every night.

Issues:

1. I have a peculiar problem that happens only on my UNIX server. Every night, after deleting the existing indexes and recreating the new ones, the disk loses around 1 gig of space. When I look into the directory, I see a new file created with the same size as the previous one, yet overall space is still lost.

2. There is also an issue with RAM. During indexing the memory occupancy is high, which is understandable. However, it remains just as high even after the indexing process completes, and keeps increasing day by day until the server runs out of memory in a few weeks. This happens on both my Windows and UNIX servers.

Any help or hint on possible solutions to fix the above issues is highly appreciated.

Thanks.

--
View this message in context: http://lucene.472066.n3.nabble.com/Issue-with-disk-space-on-UNIX-tp2676784p2676784.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Re: Analyzer enquiry
Nope, that should do it.

Best
Erick

On Mon, Mar 14, 2011 at 9:35 AM, Vasiliki Gkouta wrote:
> Sorry for the confusion. I have two analyzers (both StandardAnalyzer) and
> use no stemmers. To one analyzer I passed a German stop word set in the
> constructor, and to the other an English stop word set. My question was
> whether I have to call any other function on the German analyzer for it to
> be correct.
>
> Thank you.
Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?
I had exactly the same requirement, to parse and index offline HTML files. I wrote my own HTML scanner using javax.swing.text.html.HTMLEditorKit.Parser. It sounds difficult, but it is pretty simple and straightforward to implement; a simple 40-line Java class did the job for me.

shrinath.m wrote:
>
> On Fri, Mar 11, 2011 at 5:06 PM, Li Li [via Lucene] wrote:
>
>> But I think the parser will mostly be used when crawling. So you can use
>> these parsers when crawling and save only the parsed result.
>
> Consider we have offline HTML pages, with no parsing while crawling; now
> what? Has anyone built a tokenizer for this?
>
> How does Solr do it?
>
> --
> Regards
> Shrinath.M

--
View this message in context: http://lucene.472066.n3.nabble.com/Which-is-the-best-fast-HTML-parser-tokenizer-that-I-can-use-with-Lucene-for-indexing-HTML-content-to-tp2664316p2676832.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
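The scanner the poster describes can be sketched with the JDK's own HTML parser. This is a guess at what such a class looks like, not the poster's actual code; it collects only text content, which is what you would feed to Lucene:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HtmlTextExtractor {
    // Returns the text content of an HTML fragment, with all tags dropped.
    public static String extract(String html) throws IOException {
        final StringBuilder sb = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                sb.append(data).append(' ');   // collect each text run
            }
        };
        // 'true' = ignore the document's charset declaration
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return sb.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        String text = extract("<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>");
        System.out.println(text);
    }
}
```

The Swing parser is lenient with real-world malformed HTML, which is exactly what you want for crawled pages.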
Re: Issue with disk space on UNIX
This sounds like you're not closing your index searchers and the file system is keeping the files around. On the Unix box, does your index space reappear just by restarting the process?

Not using reopen correctly is sometimes the culprit; you need something like this (adapted from the javadocs):

    IndexReader reader = ...
    ...
    IndexReader newReader = reader.reopen();
    if (newReader != reader) {
        // reader was reopened; close the old one
        reader.close();
    }
    reader = newReader;

The mistake is to write something like:

    reader = reader.reopen();

in which case the underlying old reader is never closed.

Best
Erick

On Mon, Mar 14, 2011 at 1:55 PM, Sirish Vadala wrote:
> Hello All:
>
> Background: I have a text based search engine implemented in Java using
> Lucene 3.0. Indexing and re-indexing happens every night at 1 am as a
> scheduled process. The index size is around 1 gig and is recreated every
> night.
>
> 1. Now I have a peculiar problem that happens only on my UNIX server. Every
> night after deleting the existing indexes and recreating the new, the disk
> loses around 1 gig space.
>
> 2. Also, there is an issue with RAM memory. The memory occupancy remains
> high even after completing the indexing process and keeps increasing day by
> day until the server runs out of memory in a few weeks.
Re: Issue with disk space on UNIX
Further to what Erick says, recent versions of lucene hang on to unclosed readers longer than old versions used to. lsof -p pid can be useful here: run it and grep for deleted files still being held open. -- Ian. On Mon, Mar 14, 2011 at 6:13 PM, Erick Erickson wrote: > This sounds like you're not closing your index searchers and the file system > is keeping them around. On the Unix box, does hour index space reappear > just by restarting the process? > > Not using reopen correctly is sometimes the culprit, you need something like > this (taken from the javadocs). > IndexReader reader = ... > ... > IndexReader new = r.reopen(); > if (new != reader) { > ... // reader was reopened > reader.close(); > } > reader = new; > > ** > the mistake is to write something like: > reader = reader.reopen(); > in which case the underlying reader is never closed. > > Best > Erick > > On Mon, Mar 14, 2011 at 1:55 PM, Sirish Vadala wrote: >> Hello All: >> >> Background: >> I have a text based search engine implemented in Java using Lucene 3.0. >> Indexing and re-indexing happens every night at 1 am as a scheduled process. >> The index size is around 1 gig and is recreated every night. >> >> Issues >> 1. Now I have a peculiar problem that happens only on my UNIX server. Every >> night after deleting the existing indexes and recreating the new, the disk >> loses around 1 gig space. When I look into the directory, I see a new file >> created with same size as the previous one, still overall space is lost. >> >> 2. Also, there is an issue with RAM memory. During indexing the memory >> occupancy is high, which is understandable. However, the memory occupancy >> remains the same even after completing the indexing process and this keeps >> increasing day by day until the server runs out of memory in a few weeks. >> This happens both on my Windows and Unix servers. >> >> Any help or hint on possible solutions to fix the above issues is highly >> appreciated. >> >> Thanks. 
>> >> -- >> View this message in context: >> http://lucene.472066.n3.nabble.com/Issue-with-disk-space-on-UNIX-tp2676784p2676784.html >> Sent from the Lucene - Java Users mailing list archive at Nabble.com. >> >> - >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >> > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
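The disk-space symptom Ian and Erick describe is classic Unix behaviour: an unlinked file's blocks are not freed while any process still holds a descriptor open on it. A minimal Python sketch (no Lucene involved; the 1 KB size is arbitrary) demonstrates it:

```python
import os
import tempfile

# Create a file and keep a descriptor open on it.
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * 1024)

os.remove(path)  # unlink it: the directory entry is gone...

st = os.fstat(fd)
print(st.st_nlink)  # → 0    (no links left: the file is "deleted")
print(st.st_size)   # → 1024 (...but its blocks are still allocated)

os.close(fd)  # only now can the kernel reclaim the space
```

This is why the space reappears when the process restarts (or when the stale IndexReader is finally closed), and why `lsof` shows the old index files marked as deleted.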
lucene double metaphone ranking.
Hi guys, Here is my noob question: I'm trying to do a fuzzy search on first name and last name. I'm using the double metaphone analyzer, and I encountered the following problem. For example, when I search for picasso, "paksi" shows up with the same score as the exact spelling "picasso". When I look at the analyzer result for "paksi, picasso", they are both analyzed as "PKS". Why isn't the exact spelling getting a higher score? thanks.
Re: lucene double metaphone ranking.
Merlin, the kind of magic such as "prefer an exact match" still has to be programmed. Searching in a field with a double-metaphone analyzer will only compare tokens by their double-metaphone results. You probably want query expansion: text:picasso to be expanded to: text:picasso^3.0 text.stemmed:picass^1.5 text.phonetic:PKS^1.2 paul On 14 March 2011 at 22:02, merlin.list wrote: > Hi guys, >Here is my noob question: > > I'm trying to do fuzzy search on first name, last name. I'm using double > metaphone analyzer. and i encountered the following problem. > for example, when i search for picasso, "paksi" shows up with the same score > as the spelling of "picasso". when i look at the analyzer result of "paksi, > picasso" they are both analyzed as "PKS". why isn't the exact spelling > getting a higher score? > > thanks. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: lucene double metaphone ranking.
Thank you Paul! I shall try your spell. On Mar 14, 2011 5:10 PM, Paul Libbrecht wrote: Merlin, the kind of magic such as "prefer an exact match" still has to be programmed. Searching in a field with a double-metaphone analyzer will only compare tokens by their double-metaphone results. You probably want query expansion: text:picasso to be expanded to: text:picasso^3.0 text.stemmed:picass^1.5 text.phonetic:PKS^1.2 paul - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
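Paul's expansion can be generated mechanically before the string is handed to the query parser. This is only a sketch: the field names (text.stemmed, text.phonetic) and boost values are illustrative assumptions from Paul's example, and in practice the stemmed/phonetic forms would come from the same analyzers used at index time:

```python
def expand(term, stemmed, phonetic, boosts=(3.0, 1.5, 1.2)):
    """Build the three boosted clauses, in Lucene query syntax."""
    return " ".join([
        f"text:{term}^{boosts[0]}",            # exact form, highest boost
        f"text.stemmed:{stemmed}^{boosts[1]}", # stemmed form
        f"text.phonetic:{phonetic}^{boosts[2]}",  # double-metaphone code
    ])

print(expand("picasso", "picass", "PKS"))
# → text:picasso^3.0 text.stemmed:picass^1.5 text.phonetic:PKS^1.2
```

With this layering, a document containing the literal spelling matches all three clauses and outscores one that only matches phonetically.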
Re: Analyzer enquiry
Thank you for your help! Best Regards, Vicky Quoting Erick Erickson : Nope, that should do it. Best Erick On Mon, Mar 14, 2011 at 9:35 AM, Vasiliki Gkouta wrote: Sorry for the confusion. I have two analyzers (both StandardAnalyzer) and use no stemmers. To one analyzer I passed a German stop word set in the constructor, and to the other an English stop word set. My question was whether I have to call any other function on the German analyzer for it to be correct. Thank you. Quoting Erick Erickson : I don't understand what you're saying here. If you put a stemmer in the constructor, you *are* using it. If you don't specify any stemmer at all, you still have to define different analyzers to use different stop word lists. Can you restate your question? Best Erick On Mon, Mar 14, 2011 at 8:21 AM, Vasiliki Gkouta wrote: Thanks a lot for your help Erick! About the fields you mentioned: if I don't use stemmers, except for the constructor argument related to the stop words, is there anything else that I have to modify? Thanks, Vicky Quoting Erick Erickson : StandardAnalyzer works well for most European languages. The problem will be stemming. Applying stemming via English rules to non-English languages produces...er...interesting results. You can go ahead and create language-specific fields for each language and use StandardAnalyzer with the appropriate stopwords and stemming with each; this is a common approach. The Snowball stemmer takes a language parameter. You need to use specific analyzers for Chinese/Japanese/Korean (CJK) documents though. Hope that helps Erick On Sun, Mar 13, 2011 at 7:23 PM, Vasiliki Gkouta wrote: Hello everybody, I have an enquiry about StandardAnalyzer. Can I use it for other languages besides English? I give the right list of stop words at initialization. Is there anything else inside the class that is by default set to English? I've found the analyzers for other languages too, but they seemed to be deprecated. 
Moreover I use English and other languages together in my project, so I would like to ask if there is a way to use either the same analyzer class for all of them, or analyzers of the same functionality for all the languages. Thanks in advance! Best regards, Vicky - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
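The approach Erick recommends — one field per language, each analyzed with its own stop-word list and no shared stemmer — can be pictured with a toy pipeline. The stop-word sets below are tiny illustrative stand-ins for the real English and German lists one would pass to StandardAnalyzer:

```python
# Per-language stop words: each "analyzer" differs only in this set,
# exactly as in Vicky's two-StandardAnalyzer setup (no stemming).
STOPWORDS = {
    "en": {"the", "a", "and"},
    "de": {"der", "die", "das", "und"},
}

def analyze(text, lang):
    """Lowercase, whitespace-tokenize, and drop the language's stop words."""
    return [t for t in text.lower().split() if t not in STOPWORDS[lang]]

print(analyze("The cat and the dog", "en"))     # → ['cat', 'dog']
print(analyze("Der Hund und die Katze", "de"))  # → ['hund', 'katze']
```

Routing each document's text through the pipeline for its own language (and indexing into a language-specific field) is all that is needed; no extra per-analyzer calls beyond the constructor argument.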
Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?
I started trying out all your suggestions one by one; thanks to all who helped. I used Jericho and found it extremely simple to start with ... Just wanted to clarify one thing though: is there some tool that extracts text from HTML without creating a DOM? -- Regards Shrinath.M -- View this message in context: http://lucene.472066.n3.nabble.com/Which-is-the-best-fast-HTML-parser-tokenizer-that-I-can-use-with-Lucene-for-indexing-HTML-content-to-tp2664316p2680634.html Sent from the Lucene - Java Users mailing list archive at Nabble.com.
Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?
On Mon, Mar 14, 2011 at 11:46 PM, shrinath.m wrote: > I used Jericho and found it extremely simple to start with ... > > Just wanted to clarify one thing though. > Is there some tool that extracts text from HTML without creating the DOM? Looks like Jericho does what you want already: http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/TextExtractor.html --ewh - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?
Earl Hood wrote: > > Looks like Jericho does what you want already: > http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/TextExtractor.html > > --ewh > I went through their feature list and found that out :) http://jericho.htmlparser.net/docs/index.html Thanks Earl :) This is cool :) -- View this message in context: http://lucene.472066.n3.nabble.com/Which-is-the-best-fast-HTML-parser-tokenizer-that-I-can-use-with-Lucene-for-indexing-HTML-content-to-tp2664316p2680665.html Sent from the Lucene - Java Users mailing list archive at Nabble.com. - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
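For readers outside the Java/Jericho world: the streaming (non-DOM) idea is the same event-driven model as Python's standard-library html.parser, so a TextExtractor equivalent is only a few lines. This is a simplified sketch (it skips script/style content and collapses whitespace), not Jericho's actual algorithm:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Pull visible text out of HTML as a token stream; no DOM is built."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

p = TextExtractor()
p.feed("<html><body><h1>Hi</h1><script>var x=1;</script><p>Lucene</p></body></html>")
print(" ".join(p.parts))  # → Hi Lucene
```

Because the parser fires callbacks as it scans, memory stays flat no matter how large the page is — the property that makes this style attractive for bulk indexing.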