Some questions...

2007-10-01 Thread sandeep chawla
Hi, I want to ask two questions here: 1. Does Lucene provide a tokenizer which can use a string as a delimiter? If not, someone please give me some pointers :) about how to do it. 2. Is there a way I can get the docFreq() of a term for a particular set of documents? I mean, if I have 100 documents and …
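As far as I know, core Lucene doesn't ship a tokenizer that splits on an arbitrary string, but one is easy to write. A minimal sketch against the Lucene 2.x Tokenizer/Token API (the class name is made up, and reading the whole field into memory is a simplification that assumes small fields):

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.Tokenizer;

// Hypothetical sketch, not a stock Lucene class: splits the input on a
// literal delimiter string such as "::".
public class StringDelimiterTokenizer extends Tokenizer {
    private final String text;
    private final String delim;
    private int pos = 0;

    public StringDelimiterTokenizer(Reader in, String delimiter) throws IOException {
        super(in);
        StringBuffer sb = new StringBuffer();
        char[] buf = new char[1024];
        for (int n = in.read(buf); n != -1; n = in.read(buf)) {
            sb.append(buf, 0, n);
        }
        this.text = sb.toString();
        this.delim = delimiter;
    }

    public Token next() throws IOException {
        while (pos < text.length()) {
            int end = text.indexOf(delim, pos);
            if (end == -1) end = text.length();
            int start = pos;
            pos = end + delim.length(); // resume past the delimiter next time
            if (end > start) {          // skip empty pieces between adjacent delimiters
                return new Token(text.substring(start, end), start, end);
            }
        }
        return null; // end of stream
    }
}

Wrap it in an Analyzer whose tokenStream() returns this tokenizer and hand that to your IndexWriter.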

mixing analyzer

2007-10-01 Thread Dino Korah
Hi, I am working on a Lucene email indexing system which can potentially get documents in various languages. Currently I am using StandardAnalyzer, which works for English but not for many of the other languages. One of the requirements for the search interface is that users have to be able to search without …

Re: mixing analyzer

2007-10-01 Thread Erick Erickson
Sure, but there's a time/space tradeoff (isn't there always?). PerFieldAnalyzerWrapper is your friend. It would require that your index be built on a per-language basis, say indexing text from French documents in a field "french_text" and Chinese documents in a field "chinese_text". You'd construct your …
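Roughly what that setup looks like (a sketch; the field names are from the example above, and FrenchAnalyzer/ChineseAnalyzer come from contrib/analyzers, not core):

import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.cn.ChineseAnalyzer;   // contrib/analyzers
import org.apache.lucene.analysis.fr.FrenchAnalyzer;    // contrib/analyzers
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class PerLanguageIndexing {
    public static void main(String[] args) throws Exception {
        // Fall back to StandardAnalyzer for any field not registered below.
        PerFieldAnalyzerWrapper analyzer =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        analyzer.addAnalyzer("french_text", new FrenchAnalyzer());
        analyzer.addAnalyzer("chinese_text", new ChineseAnalyzer());

        // Use the same wrapper at index and query time so terms match up.
        IndexWriter writer = new IndexWriter("/path/to/index", analyzer, true);
        // ... add documents, putting each language's text in its own field ...
        writer.close();
    }
}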

Indexing punctuation and symbols

2007-10-01 Thread John Byrne
Hi, Has anyone written an analyzer that preserves punctuation and symbols ("£", "$", "%" etc.) as tokens? That way we could distinguish between searching for "100" and "100%" or "$100". Does anyone know of a reason why that wouldn't work? I notice that even Google doesn't support that. But …

Re: Some questions...

2007-10-01 Thread Karl Wettin
On 1 Oct 2007, at 14:41, sandeep chawla wrote: 2. Is there a way I can get the docFreq() of a term for a particular set of documents? Using TermDocs or the TermFreqVector. -- karl
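A sketch of the TermDocs route, assuming the "particular set" is known as a set of Lucene doc ids (Lucene 2.x API):

import java.io.IOException;
import java.util.Set;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

public class SubsetDocFreq {
    // Counts how many documents in the given doc-id set contain the term.
    public static int docFreq(IndexReader reader, Term term, Set docIds) throws IOException {
        int count = 0;
        TermDocs td = reader.termDocs(term);
        try {
            while (td.next()) {
                if (docIds.contains(new Integer(td.doc()))) {
                    count++;
                }
            }
        } finally {
            td.close();
        }
        return count;
    }
}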

Re: Indexing punctuation and symbols

2007-10-01 Thread Karl Wettin
On 1 Oct 2007, at 15:33, John Byrne wrote: Has anyone written an analyzer that preserves punctuation and symbols ("£", "$", "%" etc.) as tokens? WhitespaceAnalyzer? You could also extend the lexical rules of StandardAnalyzer. -- karl

Re: Indexing punctuation and symbols

2007-10-01 Thread John Byrne
WhitespaceAnalyzer does preserve those symbols, but not as tokens: it simply leaves them attached to the original term. As an example of what I'm talking about, consider a document that contains (without the quotes) "foo, ". Now, using WhitespaceAnalyzer, I could only get that document by searching …

Re: Indexing punctuation and symbols

2007-10-01 Thread Patrick Turcotte
Hi, I don't know the size of your dataset, but couldn't you index into two fields with PerFieldAnalyzerWrapper, tokenizing with StandardAnalyzer for one field and WhitespaceAnalyzer for the other? Then use a multiple-field query (there is a query parser for that, I just don't remember the name right now). Patrick

Re: Indexing punctuation and symbols

2007-10-01 Thread John Byrne
Well, the size wouldn't be a problem; we could afford the extra field. But it would seem to complicate the search quite a lot: I'd have to run the search terms through both analyzers. It would be much simpler if the characters were indexed as separate tokens.

Re: Indexing punctuation and symbols

2007-10-01 Thread Patrick Turcotte
Of course, it depends on the kind of query you are doing, but (I did find the query parser in the meantime) MultiFieldQueryParser mfqp = new MultiFieldQueryParser(useFields, analyzer, boosts); where analyzer can be a PerFieldAnalyzerWrapper, followed by Query query = mfqp.parse(queryString); would do the trick.
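Fleshed out a little (a sketch; the field names and boost values are illustrative, not from the thread):

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.MultiFieldQueryParser;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.Query;

public class TwoFieldSearch {
    public static Query build(String queryString) throws ParseException {
        String[] useFields = { "body_std", "body_ws" };

        // StandardAnalyzer by default; WhitespaceAnalyzer for the field
        // that preserves punctuation and symbols.
        PerFieldAnalyzerWrapper analyzer =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer());
        analyzer.addAnalyzer("body_ws", new WhitespaceAnalyzer());

        Map boosts = new HashMap();
        boosts.put("body_std", new Float(1.0f));
        boosts.put("body_ws", new Float(2.0f)); // favor exact symbol matches

        MultiFieldQueryParser mfqp =
            new MultiFieldQueryParser(useFields, analyzer, boosts);
        return mfqp.parse(queryString);
    }
}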

Re: a query for a special AND?

2007-10-01 Thread Paul Elschot
As for suggestions on how to do this, I have none other than to make sure that you can create the queries necessary to obtain the required output. Regards, Paul Elschot On Sunday 30 September 2007 09:20, Mohammad Norouzi wrote: > Hi Paul, > thanks, I got your idea, now I am planning to implement …

RE: mixing analyzer

2007-10-01 Thread Dino Korah
Thanks Erick. The PerFieldAnalyzerWrapper could fit in, but in the current world of multilingual-anywhere (even in programming languages… %$£%#@), almost any field in an email document (addresses, subject, body, attachment filenames, …) could be multilingual. I will have a go anyway.

GOMStaxWriter compile error

2007-10-01 Thread Peter Keegan
I've been getting the following compiler error when building the javadocs from the trunk sources: Ant build error: [javac] D:\lucene-2.2.0\contrib\gdata-server\src\gom\src\java\org\apache\lucene\gdata\gom\writer\GOMStaxWriter.java:102: cannot find symbol [javac] symbol : method createXML…

Re: Indexing punctuation and symbols

2007-10-01 Thread Erick Erickson
You might be able to create an analyzer that breaks your stream up (from the example) into tokens "foo" and "," and then (using the same analyzer) search on phrases with a slop of 0. That seems like it'd do what you want. Best Erick
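A sketch of what such an analyzer's tokenizer might look like (hypothetical class, Lucene 2.x Token API): letters and digits are grouped into words, and every other non-whitespace character becomes its own token, so "100%" indexes as "100" followed by "%".

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.Tokenizer;

public class SymbolTokenizer extends Tokenizer {
    private final String text;
    private int pos = 0;

    public SymbolTokenizer(Reader in) throws IOException {
        super(in);
        StringBuffer sb = new StringBuffer();
        char[] buf = new char[1024];
        for (int n = in.read(buf); n != -1; n = in.read(buf)) {
            sb.append(buf, 0, n);
        }
        text = sb.toString();
    }

    public Token next() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) pos++;
        if (pos >= text.length()) return null; // end of stream
        int start = pos;
        if (Character.isLetterOrDigit(text.charAt(pos))) {
            while (pos < text.length() && Character.isLetterOrDigit(text.charAt(pos))) pos++;
        } else {
            pos++; // each symbol character is a token by itself
        }
        return new Token(text.substring(start, pos), start, pos);
    }
}

With both index and query analyzed this way, a PhraseQuery (default slop 0) for the tokens "foo" "," matches only where they are adjacent.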

Re: mixing analyzer

2007-10-01 Thread Erick Erickson
The whole question of multilingual indexing has been discussed at length; you might find some ideas if you search the archive... Erick

Index Dedupe

2007-10-01 Thread Johnny R. Ruiz III
Hi, I can't seem to find a way to delete duplicates in a Lucene index. I have a unique key, so it seems it should be straightforward, but I can't find a simple way to do it except for putting each record in the index into a HashMap. Is there any method in the Lucene package that I could use? Thanks, Johnny

Re: Index Dedupe

2007-10-01 Thread Daniel Noll
On Tuesday 02 October 2007 12:25:47, Johnny R. Ruiz III wrote: > I can't seem to find a way to delete duplicates in a Lucene index. …
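One well-known answer for Lucene 2.1 and later is to prevent duplicates at write time instead of cleaning them up afterwards: IndexWriter.updateDocument(Term, Document) deletes any existing document matching the term and then adds the new one. A sketch, assuming the unique key lives in an untokenized "id" field:

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

public class DedupeAdd {
    // Adds a document, replacing any earlier one with the same unique key.
    public static void addOrReplace(IndexWriter writer, String uniqueKey) throws IOException {
        Document doc = new Document();
        doc.add(new Field("id", uniqueKey, Field.Store.YES, Field.Index.UN_TOKENIZED));
        // ... add the rest of the document's fields here ...
        writer.updateDocument(new Term("id", uniqueKey), doc);
    }
}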