Tansley, Robert wrote:
Hi all,
DSpace (www.dspace.org) currently uses Lucene to index metadata (Dublin
Core standard) and the extracted full-text content of documents stored
in it. Now that the system is being used globally, it needs to support
multi-language indexing.
I've looked through the mail
Tansley, Robert wrote:
What if we're trying to index multiple languages in the same site? Is it
best to have:
1/ one index for all languages
2/ one index for all languages, with an extra language field so searches
can be constrained to a particular language
3/ separate indices for each language?
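To make the trade-off concrete, here is a toy in-memory model of option 2: every document carries an extra "language" field, and a search can either be constrained to one language or run across all of them. This is a stdlib-only sketch of the idea, not Lucene code, and the field names ("title", "language") are illustrative assumptions, not DSpace's actual schema.

```java
import java.util.*;
import java.util.stream.*;

// Toy model of option 2: one index for all languages, with an extra
// "language" field per document so searches can be constrained.
public class LanguageFieldIndex {
    private final List<Map<String, String>> docs = new ArrayList<>();

    public void add(String title, String language) {
        Map<String, String> doc = new HashMap<>();
        doc.put("title", title);
        doc.put("language", language);
        docs.add(doc);
    }

    // Search titles, optionally constrained to one language (null = all).
    public List<String> search(String term, String language) {
        return docs.stream()
            .filter(d -> language == null || language.equals(d.get("language")))
            .filter(d -> d.get("title").toLowerCase().contains(term.toLowerCase()))
            .map(d -> d.get("title"))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        LanguageFieldIndex idx = new LanguageFieldIndex();
        idx.add("Digital libraries", "en");
        idx.add("Bibliothèques numériques", "fr");
        System.out.println(idx.search("librar", "en"));  // constrained to English
        System.out.println(idx.search("librar", null));  // across all languages
    }
}
```

The appeal of this layout is that one query can serve both the constrained and the unconstrained case, which is harder with option 3's separate indices.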
Sent: Friday, June 03, 2005 14:23
To: java-user@lucene.apache.org
Subject: Re: Indexing multiple languages
Robert,
On June 2, 2005, at 21:42, Tansley, Robert wrote:
It seems that there are even more options --
4/ One index, with a separate Lucene document for each (item, language)
combination, with one field that specifies the language
5/ One index, one Lucene document per item, with field names that
include the language
http://wiki.apache.org/jakarta-lucene/IndexingOtherLanguages
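The difference between options 4 and 5 is easiest to see as data shapes. The sketch below builds both forms from the same input; it is a stdlib illustration only, and the language-suffixed field names (e.g. "title_en") are an assumed convention, not anything mandated by Lucene.

```java
import java.util.*;

// Illustrative document shapes for options 4 and 5.
public class MultilingualDocShapes {

    // Option 4: one document per (item, language) pair, plus a language field.
    public static List<Map<String, String>> option4(String itemId,
                                                    Map<String, String> titlesByLang) {
        List<Map<String, String>> docs = new ArrayList<>();
        for (Map.Entry<String, String> e : titlesByLang.entrySet()) {
            Map<String, String> doc = new LinkedHashMap<>();
            doc.put("item", itemId);
            doc.put("language", e.getKey());
            doc.put("title", e.getValue());
            docs.add(doc);
        }
        return docs;
    }

    // Option 5: one document per item, language encoded in the field name.
    public static Map<String, String> option5(String itemId,
                                              Map<String, String> titlesByLang) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("item", itemId);
        for (Map.Entry<String, String> e : titlesByLang.entrySet()) {
            doc.put("title_" + e.getKey(), e.getValue());
        }
        return doc;
    }
}
```

Option 4 keeps one analyzer decision per document; option 5 keeps one hit per item but needs the query side to know which suffixed fields to search.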
>>> [EMAIL PROTECTED] 6/3/2005 6:03:31 AM >>>
On Jun 2, 2005, at 9:06 PM, Bob Cheung wrote:
Btw, I did try running the Lucene demo (web template) to index the HTML
files after I added one including English and Chinese characters. I was
not able to search for any Chinese in that HTML file (returned no hits).
I wonder whether I need to
On Friday 03 Jun 2005 01:06, Bob Cheung wrote:
Hi Erik,
I am a newcomer to this list, so please allow me to ask a dumb question.
For the StandardAnalyzer, will it have to be modified to accept
different character encodings?
We have customers in China, Taiwan and Hong Kong. Chinese data may come
in 3 different encodings: Big5, GB and UTF-8.
What is the default encoding?
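Part of the answer to the encoding question is that Java strings are Unicode internally, so an analyzer never sees Big5 or GB bytes directly: the conversion happens when the bytes are decoded into a String. A minimal sketch of that step, assuming the source's encoding is known; charset availability depends on the JDK ("Big5" and "GBK" ship with standard Sun JDKs).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Decode raw bytes using the source's declared encoding before handing
// the resulting Unicode String to any analyzer.
public class EncodingAwareIndexing {

    public static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String text = "索引 index";
        // The same text arrives as different byte sequences...
        byte[] asUtf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] asGbk  = text.getBytes(Charset.forName("GBK"));
        // ...but decoding each with its matching charset yields identical Strings.
        System.out.println(decode(asUtf8, "UTF-8").equals(decode(asGbk, "GBK")));
    }
}
```

Once the decode step is right, the same analyzer sees identical Unicode text no matter which of the three encodings the document arrived in.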
From: Paul Libbrecht [mailto:[EMAIL PROTECTED]]
Sent: 01 June 2005 04:10
To: java-user@lucene.apache.org
Subject: Re: Indexing multiple languages
On June 1, 2005, at 01:12, Erik Hatcher wrote:
1/ one index for all languages
2/ one index for all languages, with an extra language field so searches
can be constrained to a particular language
3/ separate indices for each language?
I would vote for option #2 as it gives the most flexibility - y
Hi Erik,
Thanks for your info.
No, I haven't tried it yet. I will give it a try and maybe produce a
Chinese/English text search demo online.
Currently I use Lucene as the indexing engine for Velocity mailing
list search. I have a demo at www.jhsystems.net.
It is yet another mailing list s
Robert,
I'm very likely going to be using DSpace and some related
technologies from the SIMILE project very soon :)
On May 31, 2005, at 5:08 PM, Tansley, Robert wrote:
Hi all,
DSpace (www.dspace.org) currently uses Lucene to index metadata
(Dublin Core standard) and the extracted full-text content of
documents stored in it.
Jian - have you tried Lucene's StandardAnalyzer with Chinese? It
will keep English as-is (removing stop words, lowercasing, and such)
and will also separate CJK characters into individual tokens.
Erik
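A stdlib-only sketch of the behavior Erik describes can make it concrete. This is not Lucene's actual StandardAnalyzer, just an illustration of the splitting rule: English runs become single lowercased tokens, while each CJK character becomes its own token (stop-word removal is omitted for brevity).

```java
import java.util.*;

// Sketch of mixed English/CJK tokenization as described above:
// English words -> one lowercased token each; CJK -> one token per character.
public class MixedCjkTokenizerSketch {

    private static boolean isCjk(char c) {
        return Character.UnicodeBlock.of(c)
                == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS;
    }

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (char c : text.toCharArray()) {
            if (Character.isLetterOrDigit(c) && !isCjk(c)) {
                word.append(Character.toLowerCase(c));
            } else {
                if (word.length() > 0) {          // flush pending English word
                    tokens.add(word.toString());
                    word.setLength(0);
                }
                if (isCjk(c)) {
                    tokens.add(String.valueOf(c)); // one token per CJK char
                }
            }
        }
        if (word.length() > 0) {
            tokens.add(word.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Lucene搜索engine"));
        // [lucene, 搜, 索, engine]
    }
}
```

Note this also gives jian chen's desired behavior below: English text embedded in Chinese text comes out as whole English tokens, not as per-character fragments.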
On May 31, 2005, at 5:49 PM, jian chen wrote:
Hi,
Interesting topic. I thought about this as well. I wanted to index
Chinese text with English, i.e., I want to treat the English text
inside Chinese text as English tokens rather than Chinese text tokens.
Right now I think maybe I have to write a special analyzer that takes
the text input, and