Re: Boost a field in fuzzy query

2011-03-14 Thread Ian Lea
You could build the query up in your program, or that part of it anyway.

BooleanQuery bq = new BooleanQuery();
FuzzyQuery fq = new FuzzyQuery(...);       // the fuzzy term for one field
fq.setBoost(123f);                         // apply the per-field boost yourself
bq.add(fq, BooleanClause.Occur.SHOULD);    // BooleanQuery.add() needs an Occur
...
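
As a fuller sketch of the same idea for the multi-field case (field names
and boosts taken from the question below; a minimal illustration, not a
fix for MultiFieldQueryParser itself):

String[] fields = { "f1", "f2" };
float[] boosts = { 10f, 5f };
BooleanQuery outer = new BooleanQuery();
for (int i = 0; i < fields.length; i++) {
    FuzzyQuery fuzzy = new FuzzyQuery(new Term(fields[i], "abc"));
    fuzzy.setBoost(boosts[i]);   // the boost survives, unlike in the parsed query
    outer.add(fuzzy, BooleanClause.Occur.SHOULD);
}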

This might be a bug in MultiFieldQueryParser - you could provide a
test case or, better, a patch.
See https://issues.apache.org/jira/browse/LUCENENET-147. That issue is
against Lucene.Net, but a comment there says it "would probably mean
Lucene Java also suffers from the same bug".

Presumably you've read the "not very scalable" warning in the javadocs
for FuzzyQuery.

And you don't say which version of Lucene you are using.  If it isn't
the latest, try that.


--
Ian.


On Mon, Mar 14, 2011 at 5:33 AM, chhava40  wrote:
> Hi,
> I am using MultiFieldQueryParser to parse query for multiple fields with
> custom boosts for each field.
> The issue arises when one of the terms in the query is fuzzy, e.g. abc~.
> For such a term, the field boost is not applied. If the query is "abc~ xyz"
> and fields are f1 & f2 with boosts 10, 5, the parsed query output is:
> (f1:abc~0.5 f2:abc~0.5) (f1:xyz^10 f2:xyz^5).
> Is there any way to apply the field boost factor to fuzzy terms as well?
> Thanks.
>
> --




Re: Indexing of multilingual labels

2011-03-14 Thread Vinaya Kumar Thimmappa

Hello Stephane,

I think a better way is to have a resource file per language and store a
pointer in the index to the correct resource file (something like the I18N
and L10N approach): store the internationalised string in the index and
all related localised strings in resource files.

This way the index size will be reduced (adding to payloads would have an
impact on performance), which helps performance too.

Your total search time would then be (search time + time to retrieve the
language-based data).
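
A minimal sketch of that layout (Lucene 3.x field API; the concept id,
bundle name, and labels are made-up examples, and writer is an open
IndexWriter):

// Index only the searchable label plus a stored concept id; the localised
// display strings live outside the index in per-language resource bundles.
Document doc = new Document();
doc.add(new Field("label", "bank", Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field("conceptId", "C123", Field.Store.YES, Field.Index.NOT_ANALYZED));
writer.addDocument(doc);

// At display time, resolve the localised label for the user's locale.
ResourceBundle labels = ResourceBundle.getBundle("labels", Locale.FRENCH);
String display = labels.getString("C123");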


Hope this helps.
-Vinaya

On Friday 11 March 2011 09:05 PM, Stephane Fellah wrote:

Erick,

I am trying to index multilingual taxonomies such as SKOS, Wordnet, and
Eurowordnet. Taxonomies are composed of concepts which have preferred and
alternative labels in different languages. Some labels have the same
lexical form in different languages. I want to be able to index these
concepts in Lucene in order to search concepts by their label in one or
several languages. I also want to be able to display a concept definition
with all the alternative labels in different languages. My question is:
could we use the payload mechanism to store the language assigned to the
word (I read somewhere that Google was using payloads to store information
such as font, so why not language)? Wouldn't that be a better approach than
using one field per language or one index per language?

Regards
Stephane

On Fri, Mar 11, 2011 at 7:52 AM, Erick Erickson wrote:


It's not so much a matter of problems with indexing/searching
as it is with search behavior. The reason these strategies
are implemented is that using English stemming, say, on
other languages will produce "interesting" results.

There's no a-priori reason you can't index multiple languages
in the same field.

So I don't see what you would accomplish by using payloads
to indicate which language the term is in. Could you expand
a bit on what you're trying to accomplish here? Maybe there
are better solutions

Best
Erick


On Thu, Mar 10, 2011 at 10:29 PM, Stephane Fellah wrote:

I am trying to index in Lucene a field that could hold labels of concepts
in different languages. Most of the approaches I have seen so far are:

   - use a single index, where each document has a field per language it
     uses, or
   - use M indexes, M being the number of languages in the corpus.

Lucene 2.9+ has a feature called payloads that allows attributes to be
attached to a term. Has anyone used this mechanism to store language (or
other attributes such as datatypes) information? Does this approach work
if labels are the same in different languages (does it break the inverted
index)? How does performance compare to the two other approaches? Any
pointer to source code showing how it is done would help.

Thanks

--
Stephane Fellah, M.Sc, B.Sc
Principal Engineer/Product Manager
smartRealm LLC
201 Loudoun St. SW
Leesburg, VA 20175
Tel: 703 669 5514
Cell: 571 502 8478
Fax: 703 669 5515





Re: Analyzer enquiry

2011-03-14 Thread Vasiliki Gkouta
Thanks a lot for your help, Erick! About the fields you mentioned: if I
don't use stemmers, is there anything else I have to modify besides the
constructor argument for the stop words?


Thanks,
Vicky


Quoting Erick Erickson :


StandardAnalyzer works well for most European languages. The problem will
be stemming. Applying stemming via English rules to non-English languages
produces...er...interesting results.

You can go ahead and create language-specific fields for each language and
use StandardAnalyzer with the appropriate stopwords and stemming for each;
this is a common approach. The Snowball stemmer takes a language
parameter...


You need to use specific analyzers for Chinese, Japanese, and Korean (CJK)
documents, though.

Hope that helps
Erick

On Sun, Mar 13, 2011 at 7:23 PM, Vasiliki Gkouta  wrote:

Hello everybody,

I have an enquiry about StandardAnalyzer. Can I use it for languages other
than English? I give the right list of stop words at initialization. Is
there anything else inside the class that is set to English by default?
I've found the analyzers for other languages too, but they seem to be
deprecated. Moreover, I use English and other languages together in my
project, so I would like to ask whether there is a way to use either the
same analyzer class for all of them, or analyzers with the same
functionality for all the languages. Thanks in advance!

Best regards,
Vicky






Re: Analyzer enquiry

2011-03-14 Thread Erick Erickson
I don't understand what you're saying here. If you put a stemmer in the
constructor, you *are* using it. If you don't specify any stemmer at all, you
still have to define different analyzers to use different stop word lists.

Can you restate your question?

Best
Erick

On Mon, Mar 14, 2011 at 8:21 AM, Vasiliki Gkouta  wrote:
> Thanks a lot for your help, Erick! About the fields you mentioned: if I
> don't use stemmers, is there anything else I have to modify besides the
> constructor argument for the stop words?
>
> Thanks,
> Vicky



Re: Analyzer enquiry

2011-03-14 Thread Vasiliki Gkouta
Sorry for the confusion. I have two analyzers (both StandardAnalyzer) and
use no stemmers. To one analyzer I passed a German stop-word set in the
constructor, and to the other an English stop-word set. My question was
whether I have to call any other function on the German analyzer for it to
be correct.


Thank you.


Quoting Erick Erickson :


I don't understand what you're saying here. If you put a stemmer in the
constructor, you *are* using it. If you don't specify any stemmer at all, you
still have to define different analyzers to use different stop word lists.

Can you restate your question?

Best
Erick




Re: Indexing of multilingual labels

2011-03-14 Thread Paul Libbrecht
Stephane,

I think you have the freedom to put whatever you want in the stored value
of a field.

The simplest approach would even be to store the fields you want to use
for display preformatted (XML-ish, OWL-ified, or JSON-ized), separate from
the indexed fields (where you are only interested in the plain text).
Payloads seem to do a similar job to a separate stored, non-indexed field.

The best approach I have had thus far is to use a multiplexing analyzer
(which is invoked for indexed fields only anyway) that recognizes the
language by the suffix of the field name.
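
A minimal sketch of such a setup with PerFieldAnalyzerWrapper (Lucene 3.x;
the field names and analyzer choices here are illustrative assumptions,
not my actual code):

PerFieldAnalyzerWrapper wrapper =
    new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_30));
wrapper.addAnalyzer("label_en", new SnowballAnalyzer(Version.LUCENE_30, "English"));
wrapper.addAnalyzer("label_fr", new SnowballAnalyzer(Version.LUCENE_30, "French"));
wrapper.addAnalyzer("label_de", new SnowballAnalyzer(Version.LUCENE_30, "German"));
// pass wrapper to the IndexWriter; each label_<lang> field is then
// analyzed by its own language's chain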

As to the difference between several fields in one index and one field in
many indices, I think it is just a programming difference. The tf and idf
are always computed at the term level, so they make no difference.

I tend to prefer multiple fields because it is easier to expand a query
for, say, Fourrier, sent by a browser that prefers English but also
accepts French and German, into:
- a query for Fourrier in the whitespace-tokenized track (always prefer
  that one)
- a query for fouri in the French field
- a query for fourier in the English and German fields
My current experience is that many users appear, or claim, to speak many
languages (they do, a little bit). A sketch of this expansion follows.
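
A hedged sketch of that expansion (the label_* field names follow the
suffix scheme above and are assumptions, as are the stemmed forms):

BooleanQuery expanded = new BooleanQuery();
TermQuery ws = new TermQuery(new Term("label_ws", "Fourrier"));
ws.setBoost(2.0f);   // always prefer the whitespace-tokenized track
expanded.add(ws, BooleanClause.Occur.SHOULD);
expanded.add(new TermQuery(new Term("label_fr", "fouri")), BooleanClause.Occur.SHOULD);
expanded.add(new TermQuery(new Term("label_en", "fourier")), BooleanClause.Occur.SHOULD);
expanded.add(new TermQuery(new Term("label_de", "fourier")), BooleanClause.Occur.SHOULD);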

hope it helps.

paul

PS: not that my code is ideal but here are the ones I have:
 - i2geo, based on an ontology of concepts in OWL, 
http://i2geo.net/xwiki/bin/view/About/GeoSkills
   and http://svn.activemath.org/intergeo/Platform/SearchI2G/
 - ActiveMath, fed by XML, 
http://www.activemath.org/javadoc/org/activemath/omdocjdom/index/package-summary.html
 and 





lucene3.0.3 | Special character indexing

2011-03-14 Thread Ranjit Kumar
Hi,

I am creating an index using Lucene 3.0.3 StandardAnalyzer.

When searching the index with a query like C, C# or C++, it gives the same
result for all three terms. As far as I know, while creating the index the
analyzer ignores special characters and does not index them. I have tried
KeywordAnalyzer but it does not fulfill my requirement.

I need to be able to differentiate between "C", "C#" and "C++".

Do I have to create my own analyzer?

Or do I have to modify the JFlex grammar?
http://osdir.com/ml/java-dev/2009-06/msg00208.html

Please suggest whether any existing analyzer will resolve this issue.

Any suggestion will be appreciated!


Thanks & Regards,
Ranjit Kumar
Associate Software Engineer


US:   +1 408.540.0001
UK:   +44 208.099.1660
India:   +91 124.474.8100 | +91 124.410.1350
FAX: +1 408.516.9050
http://www.otssolutions.com



Re: lucene3.0.3 | Special character indexing

2011-03-14 Thread Ian Lea
Google finds http://www.gossamer-threads.com/lists/lucene/java-user/91750
which looks like a good starting point.
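
One common fix discussed in that direction (my summary and a sketch under
Lucene 3.x assumptions, not anything quoted from that thread) is to bypass
StandardAnalyzer's punctuation stripping with a whitespace-based analyzer:

// Keeps "c++" and "c#" as distinct terms: split on whitespace only and
// lower-case, instead of StandardAnalyzer's grammar-based tokenization.
class CodeAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new LowerCaseFilter(new WhitespaceTokenizer(reader));
    }
}

Remember to use the same analyzer at index and query time.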


--
Ian.

P.S.  Plain text emails are preferable.




Re: lucene3.0.3 | Special character indexing

2011-03-14 Thread Vinaya Kumar Thimmappa

Hello Ranjit,

Can you use the latest Luke tool? It has an analyzer section which helps
in deciding which analyzer to use based on the input.


Hope this helps
-vinaya



no. of documents with hits vs. no. of hits

2011-03-14 Thread Michael Wiegand

Hi,

Does Lucene always count the number of documents matching a query, or is
it also possible to count the overall number of hits? The two differ when
a single document contains more than one hit.
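
For a single term, a hedged illustration of the difference (Lucene 3.x
TermDocs API; the field and term here are placeholders):

TermDocs td = reader.termDocs(new Term("body", "lucene"));
int docCount = 0;   // documents containing the term
int hitCount = 0;   // total occurrences over all documents
while (td.next()) {
    docCount++;
    hitCount += td.freq();   // within-document frequency
}
td.close();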


Thank you in advance!

Best,
Michael





Issue with disk space on UNIX

2011-03-14 Thread Sirish Vadala
Hello All:

Background:
I have a text based search engine implemented in Java using Lucene 3.0.
Indexing and re-indexing happens every night at 1 am as a scheduled process.
The index size is around 1 gig and is recreated every night.

Issues
1. I have a peculiar problem that happens only on my UNIX server. Every
night after deleting the existing indexes and recreating the new ones, the
disk loses around 1 gig of space. When I look into the directory, I see a
new file created with the same size as the previous one, yet overall space
is lost.

2. There is also an issue with RAM. During indexing the memory occupancy
is high, which is understandable. However, the memory occupancy remains
the same even after the indexing process completes, and it keeps
increasing day by day until the server runs out of memory in a few weeks.
This happens on both my Windows and UNIX servers.

Any help or hint on possible solutions to fix the above issues is highly
appreciated.

Thanks.




Re: Analyzer enquiry

2011-03-14 Thread Erick Erickson
Nope, that should do it.
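
For the record, a minimal sketch of that setup (Lucene 3.x; the stop-word
sets shown are tiny stand-ins for real lists):

Set<String> germanStops = new HashSet<String>(Arrays.asList("der", "die", "das"));
Analyzer german = new StandardAnalyzer(Version.LUCENE_30, germanStops);
Analyzer english = new StandardAnalyzer(Version.LUCENE_30,
        StandardAnalyzer.STOP_WORDS_SET);
// nothing else to call: the stop-word set is the only per-language piece here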

Best
Erick

On Mon, Mar 14, 2011 at 9:35 AM, Vasiliki Gkouta  wrote:
> Sorry for the confusion. I have two analyzers (both StandardAnalyzer) and
> use no stemmers. To one analyzer I passed a German stop-word set in the
> constructor, and to the other an English stop-word set. My question was
> whether I have to call any other function on the German analyzer for it
> to be correct.
>
> Thank you.



Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-14 Thread Sirish Vadala
I had exactly the same requirement to parse and index offline HTML files.
I had written my own HTML scanner using
javax.swing.text.html.HTMLEditorKit.Parser. It sounds difficult, but it is
pretty simple and straightforward to implement; a simple 40-line Java
class did the job for me.
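
In that spirit, a condensed sketch of such a scanner (the shape of the
approach, not my actual 40-line class):

final StringBuilder text = new StringBuilder();
HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
    public void handleText(char[] data, int pos) {
        text.append(data).append(' ');   // collect text nodes only
    }
};
new ParserDelegator().parse(new FileReader("page.html"), callback, true);
// text.toString() can now be fed to a Lucene Field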


shrinath.m wrote:
> 
> On Fri, Mar 11, 2011 at 5:06 PM, Li Li [via Lucene] <
> ml-node+2664380-1940163870-376...@n3.nabble.com> wrote:
> 
>>   But I think the parser will mostly be used when crawling. So you can use
>> these parsers when crawling and save only the parsed result.
>>
> 
> Suppose we have offline HTML pages and no parsing while crawling; now what?
> Has anyone built a tokenizer for this?
> 
> 
> How does Solr do it?
> 
> 
> -- 
> Regards
> Shrinath.M
> 





Re: Issue with disk space on UNIX

2011-03-14 Thread Erick Erickson
This sounds like you're not closing your index searchers and the file system
is keeping them around. On the Unix box, does your index space reappear
just by restarting the process?

Not using reopen correctly is sometimes the culprit; you need something like
this (adapted from the javadocs):
 IndexReader reader = ...
 ...
 IndexReader newReader = reader.reopen();
 if (newReader != reader) {
   // the reader was reopened, so close the old one
   reader.close();
 }
 reader = newReader;

**
The mistake is to write something like:
reader = reader.reopen();
in which case the underlying reader is never closed.

Best
Erick




Re: Issue with disk space on UNIX

2011-03-14 Thread Ian Lea
Further to what Erick says, recent versions of Lucene hang on to unclosed
readers longer than old versions used to.

lsof -p <pid> can be useful here: run it and grep for deleted files
still being held open.


--
Ian.





lucene double metaphone ranking.

2011-03-14 Thread merlin.list

Hi guys,
Here is my noob question:

I'm trying to do a fuzzy search on first name and last name, using the
double metaphone analyzer, and I encountered the following problem:
when I search for picasso, "paski" shows up with the same score as the
exact spelling "picasso". When I look at the analyzer result of "paksi,
picasso", they are both analyzed as "PKS". Why isn't the exact spelling
getting a higher score?


thanks.


Re: lucene double metaphone ranking.

2011-03-14 Thread Paul Libbrecht
Merlin,

The kind of magic such as "prefer an exact match" still has to be
programmed: searching in a field with a double-metaphone analyzer will
only compare tokens by their double-metaphone results.
You probably want query expansion:

text:picasso

to be expanded to:

  text:picasso^3.0 text.stemmed:picass^1.5 text.phonetic:PKS^1.2
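
A hedged sketch of building that expansion programmatically (the field
names text, text.stemmed and text.phonetic follow the example above; the
analysis chains behind them are assumed):

BooleanQuery q = new BooleanQuery();
TermQuery exact = new TermQuery(new Term("text", "picasso"));
exact.setBoost(3.0f);
TermQuery stemmed = new TermQuery(new Term("text.stemmed", "picass"));
stemmed.setBoost(1.5f);
TermQuery phonetic = new TermQuery(new Term("text.phonetic", "PKS"));
phonetic.setBoost(1.2f);
q.add(exact, BooleanClause.Occur.SHOULD);
q.add(stemmed, BooleanClause.Occur.SHOULD);
q.add(phonetic, BooleanClause.Occur.SHOULD);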

paul





Re: lucene double metaphone ranking.

2011-03-14 Thread merlin.list

On Mar 14, 2011 5:10 PM, Paul Libbrecht wrote:

Thank you, Paul! I shall try your spell.






Re: Analyzer enquiry

2011-03-14 Thread Vasiliki Gkouta

Thank you for your help!

Best Regards,
Vicky

Quoting Erick Erickson :


Nope, that should do it.

Best
Erick




Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-14 Thread shrinath.m
I started trying out all your suggestions one by one, thanks to all who
helped.

I used Jericho and found it extremely simple to start with ...

Just wanted to clarify one thing though.
Is there some tool that extracts text from HTML without creating the DOM?


-- 
Regards
Shrinath.M



Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-14 Thread Earl Hood
On Mon, Mar 14, 2011 at 11:46 PM, shrinath.m  wrote:
> I used Jericho and found it extremely simple to start with ...
>
> Just wanted to clarify one thing though.
> Is there some tool that extracts text from HTML without creating the DOM?

Looks like Jericho does what you want already:
http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/TextExtractor.html
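
A minimal usage sketch following that javadoc's pattern (my reading of the
Jericho API, so treat the details as an assumption):

Source source = new Source("<p>Hello <b>world</b></p>");
String text = source.getTextExtractor().toString();   // "Hello world"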

--ewh




Re: Which is the +best +fast HTML parser/tokenizer that I can use with Lucene for indexing HTML content today ?

2011-03-14 Thread shrinath.m

Earl Hood wrote:
> 
> Looks like Jericho does what you want already:
> http://jericho.htmlparser.net/docs/javadoc/net/htmlparser/jericho/TextExtractor.html
> 
> --ewh
> 


I went through their feature list and found that out :) 
http://jericho.htmlparser.net/docs/index.html


Thanks Earl :)
This is cool :)
