Re: Multi-analyzer ?

2005-04-11 Thread Andy Roberts
On Tuesday 12 Apr 2005 00:53, Eric Chow wrote: > But how about one document contains more than two different languages ?? > > > Eric If you're indexing many documents which contain multiple languages then it's probably just better to use a SimpleAnalyzer, rather than one that does any language s

Re: How to include a multi-word synonym to a word when indexing?

2005-04-11 Thread Chris Hostetter
: You'll need some kind of lookup to know how to split a token like : "cybercafe" into two words - once you've done that it will be easy to : set the position increment of them to zero so that they overlay the : original term. but how would you set the position increment of a multi-word synonym s

RE: How to include a multi-word synonym to a word when indexing?

2005-04-11 Thread Pasha Bizhan
Hi, > From: Erik Hatcher [mailto:[EMAIL PROTECTED] > > My problem is, however, that some words need to have alternatives > > where the word is decomposed / decompounded into two or more words: > > > > "FooBar Corp" or "cybercafe" > > > > should be found when searching for > > > > "Foo Ba*" or

Re: Multi-analyzer ?

2005-04-11 Thread Eric Chow
But how about one document contains more than two different languages ?? Eric On Apr 12, 2005 12:13 AM, Andy Roberts <[EMAIL PROTECTED]> wrote: > On Monday 11 Apr 2005 14:55, Mike Baranczak wrote: > > Your example with Arabic wouldn't work reliably either - there are > > several other languages

Strange sort error

2005-04-11 Thread Bill Tschumy
In my application, by default I display all documents that are in the index. I sort them either using a "time modified" or "time created". If I have a newly created empty index, I find I get an error if I sort by "time modified" but not "time created". In either case there are actually n

Re: Corrupted index

2005-04-11 Thread Doug Cutting
Bill Tschumy wrote: So, did this happen because he copied the data while in an inconsistent state? I'm a bit surprised that an inconsistent index is ever left on disk (except for temporarily while something is being written). Would this happen if there was a Writer that was not closed? An inde

Re: Corrupted index

2005-04-11 Thread Doug Cutting
Daniel Naber wrote: Yes, the *.cfs shows that this is a compound index which has *.fnm files only when it's being modified. When creating a compound segment, a "segments" file is never written that refers to the segment until the .cfs file is created and the .fnm files are removed. The real pro

Re: Lucene Search Result with Line Numbers?

2005-04-11 Thread Doug Cutting
cerberus yao wrote: Does anyone know how to add line numbers from the original source content to Lucene search results? When you display each hit, first scan the text and build an array containing the positions of each newline. Then use the highlighter (in contrib/highlighter) to find fragment
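The newline-array idea above can be sketched in plain Java. This is only an illustration under stated assumptions: the class name LineLocator and the method lineOf are hypothetical, and the character offset is assumed to come from wherever the hit fragment was located in the text (the highlighter call itself is not shown).

```java
import java.util.Arrays;

// Hypothetical helper: maps a character offset in a text to a 1-based line number.
class LineLocator {
    // Offsets of every '\n' in the text, in ascending order.
    private final int[] newlineOffsets;

    LineLocator(String text) {
        int[] buf = new int[text.length()];
        int n = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '\n') {
                buf[n++] = i;
            }
        }
        newlineOffsets = Arrays.copyOf(buf, n);
    }

    // Returns the 1-based line containing the given offset.
    // A newline character is counted as part of the line it terminates.
    int lineOf(int offset) {
        int idx = Arrays.binarySearch(newlineOffsets, offset);
        // When the offset is not itself a newline, binarySearch returns
        // -(insertionPoint) - 1, and the insertion point equals the number
        // of newlines that precede the offset.
        int newlinesBefore = idx >= 0 ? idx : -(idx + 1);
        return newlinesBefore + 1;
    }

    public static void main(String[] args) {
        LineLocator loc = new LineLocator("int a;\nint b;\ncrash();\n");
        System.out.println(loc.lineOf(14)); // offset of "crash" -> prints 3
    }
}
```

Building the array once per displayed hit keeps each lookup at O(log n), which matters when a document yields several fragments.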

Lucene for Geo

2005-04-11 Thread Martin May
Hi everybody, I have some questions concerning using Lucene for Geo-searching. I have a bunch of documents (> 100,000) in the index that all have a latitude and longitude associated with them. I wanted to be able to search within a certain radius of a point of origin, which I accomplished by app

Re: How to include a multi-word synonym to a word when indexing?

2005-04-11 Thread Erik Hatcher
On Apr 11, 2005, at 9:36 AM, Peter Hotm. Nørregaard wrote: According to "Lucene in Action" it is possible to get synonyms indexed together with a word by putting multiple words with the same position-id in the term vector. My problem is, however, that some words need to have alternatives where

Re: Multi-analyzer ?

2005-04-11 Thread Andy Roberts
On Monday 11 Apr 2005 14:55, Mike Baranczak wrote: > Your example with Arabic wouldn't work reliably either - there are > several other languages that use the Arabic script (Persian for > example). Good point. Although you could try a simple approach to test for the additional characters that exi

Re: Lucene Search Result with Line Numbers?

2005-04-11 Thread Karl Øie
Oh, I forgot your last question. That's why the field "line" has to be stored: upon query you have to get the "line" number from the document that represents the line, and in "forward" / "back" actions you will have to sort the result set by line value and print only chunks of that result. Regards, Karl Øi

Re: Multi-analyzer ?

2005-04-11 Thread Mike Baranczak
Your example with Arabic wouldn't work reliably either - there are several other languages that use the Arabic script (Persian for example). You could also try to pick out characters that are unique to a particular language - for example, Ą or Ł only occur in Polish (as far as I know...). Of c

Re: Lucene Search Result with Line Numbers?

2005-04-11 Thread Karl Øie
Yes, the biggest drawback is text spanning lines: L1 - it was the best of times, L2 - it was the worst of times will return no hits for the search "it was the best of times, it was the worst of times" (with quotes), because no single Lucene document contains the whole text alone. I would be inte

Re: Lucene Search Result with Line Numbers?

2005-04-11 Thread cerberus yao
But "crash.java" is physically just a single document. Is there any drawback if we treat each line in "crash.java" as a document? Another question: if we need to present the search result with the hit lines plus n lines forward and backward, how can I do this if the lines are separated in

How to include a multi-word synonym to a word when indexing?

2005-04-11 Thread Peter Hotm. Nørregaard
According to "Lucene in Action" it is possible to get synonyms indexed together with a word by putting multiple words with the same position-id in the term vector. My problem is, however, that some words need to have alternatives where the word is decomposed / decompounded into two or more wor

Re: Corrupted index

2005-04-11 Thread Bill Tschumy
Daniel, Thanks for responding on this thread. I doubt the copy was made while the index was being updated and I don't see any indication of a crash. Just for my clarification, if I update the index, but don't close the IndexWriter (because I may need it again soon), can the index on disk be le

Re: Terms & Postion from Hits ...

2005-04-11 Thread Maik Schreiber
> Now, I would like to obtain the List of all Terms (and their corresponding > position) from each document (hits.doc(i)). Try IndexReader.getTermFreqVector(), which will return an instance of TermPositionVector when the corresponding field has been indexed with storeTermVector==true. -- Maik Sc

Re: Terms & Postion from Hits ...

2005-04-11 Thread Andy Roberts
I've managed something like this from a slightly different perspective. IndexReader ir = IndexReader.open(yourIndex); String searchTerm = "word"; TermPositions tp = ir.termPositions(new Term("contents", searchTerm)); tp.next(); int termFreq = tp.freq(); System.out.print(currentTerm.text());

Re: Terms & Postion from Hits ...

2005-04-11 Thread Erik Hatcher
On Apr 10, 2005, at 11:52 AM, Patricio Galeas wrote: Hello, I am new to Lucene and have the following problem. When I execute a search I receive the list of document Hits. I can also get the content of the documents without problem: for (int i = 0; i < hits.length(); i++) { Document doc = hits.doc(i)

Re: RangeQuery doesn't override equals() or hashCode() - intentional?

2005-04-11 Thread Erik Hatcher
On Apr 11, 2005, at 4:48 AM, Chris Lamprecht wrote: I was attempting to cache QueryFilters in a Map using the Query as the key (a BooleanQuery instance containing two RangeQueries), and I discovered that my BooleanQueries' equals() methods would always return false, even when the queries were equiv

Re: Lucene Search Result with Line Numbers?

2005-04-11 Thread Karl Øie
Most indexing creates a Lucene document for each source document. What you would need is to create a Lucene document for each line. String src_doc = "crash.java"; int line_number = 0; String line; while ((line = reader.readLine()) != null) { Document ld = new Document(); ld.add(ne

Re: Multi-analyzer ?

2005-04-11 Thread Andy Roberts
Can you not provide the user with an option list to specify their input language? Language identification can be a pretty tricky field. There are some tricks you can do with Unicode to identify language, e.g., \u0600 - \u06FF contains the Arabic characters, so if your input contains lots of ch
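As a rough illustration of the Unicode-range trick, the sketch below measures what fraction of the letters in an input fall inside \u0600 - \u06FF. The class and method names are hypothetical, and any cut-off applied to the ratio is an assumption to be tuned. It identifies the script rather than the language, so text in other languages written in Arabic script (Persian, for example) will score just as high.

```java
// Hypothetical helper: estimates how much of an input is written in Arabic script.
class ScriptGuess {
    // Fraction of letters whose code point lies in the Arabic block U+0600..U+06FF.
    // Returns 0.0 when the input contains no letters at all.
    static double arabicRatio(String s) {
        int letters = 0;
        int arabic = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (!Character.isLetter(cp)) {
                continue; // skip digits, spaces and punctuation
            }
            letters++;
            if (cp >= 0x0600 && cp <= 0x06FF) {
                arabic++;
            }
        }
        return letters == 0 ? 0.0 : (double) arabic / letters;
    }

    public static void main(String[] args) {
        System.out.println(arabicRatio("hello world"));
        System.out.println(arabicRatio("\u0645\u0631\u062D\u0628\u0627"));
    }
}
```

A caller might route the query to an Arabic-script analyzer only when the ratio exceeds some threshold, and fall back to a language-neutral analyzer otherwise.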

Terms & Postion from Hits ...

2005-04-11 Thread Patricio Galeas
Hello, I am new to Lucene and have the following problem. When I execute a search I receive the list of document Hits. I can also get the content of the documents without problem: for (int i = 0; i < hits.length(); i++) { Document doc = hits.doc(i); System.out.println(doc.get("content")); } N

Re: Multi-analyzer ?

2005-04-11 Thread Karl Øie
I don't think you can figure out the language from the input box value alone; I can't see any way to select the correct language analyzer at this point. What you can do is to put Chinese, Japanese, English and Dutch content in separate indexes and use MultiSearcher to search in all of them, and

Re: Urgent, please help Index/Search in UTF-8 ???

2005-04-11 Thread Zilverline info
For instance look at http://www.zilverline.org/zilverlineweb/space/faq Michael Karl Øie wrote: If you use a servlet and an HTML form to feed queries to the QueryParser, take good care of all configurations around the servlet container. If you, like me, use Tomcat you might have to recode the query

Re: Urgent, please help Index/Search in UTF-8 ???

2005-04-11 Thread Karl Øie
If you use a servlet and an HTML form to feed queries to the QueryParser, take good care of all configurations around the servlet container. If you, like me, use Tomcat you might have to recode the query into internal Java form (UTF-8) before you pass it to Lucene. Read this: http://www.crazysqui
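The recoding step might look like the sketch below. It assumes the browser submitted the form as UTF-8 but the container decoded the parameter bytes as ISO-8859-1; whether that applies depends on the container configuration. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: repairs a query string that was decoded with the wrong charset.
class QueryRecoder {
    // Recovers the original bytes by re-encoding as ISO-8859-1 (lossless for
    // characters U+0000..U+00FF), then decodes those bytes as UTF-8.
    static String recode(String raw) {
        byte[] originalBytes = raw.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Simulate what the container hands over: UTF-8 bytes read as ISO-8859-1.
        String mangled = new String("\u00D8ie".getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.ISO_8859_1);
        System.out.println(recode(mangled).equals("\u00D8ie"));
    }
}
```

Applying this to a parameter that was already decoded correctly would corrupt it, so it belongs only where the mismatch is known to occur. Calling request.setCharacterEncoding("UTF-8") before the first parameter is read is the other common route, though it only covers the request body.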

Urgent, please help, index/search in UTF-8 ???

2005-04-11 Thread Eric Chow
Hello, I am a beginner in using Lucene. My files contain different languages (English, Chinese, Portuguese, Japanese and some other Asian, non-Latin languages). The languages always appear together in one file. Therefore, I have to use UTF-8 to save the contents. I am now developing a web-based search en

Urgent, please help Index/Search in UTF-8 ???

2005-04-11 Thread Eric Chow
Hello, I am a beginner in using Lucene. My files contain different languages (English, Chinese, Portuguese, Japanese and some other Asian, non-Latin languages). The languages always appear together in one file. Therefore, I have to use UTF-8 to save the contents. I am now developing a web-based search

Lucene Search Result with Line Numbers?

2005-04-11 Thread cerberus yao
Hi, Lucene users: Does anyone know how to add line numbers from the original source content to Lucene search results? For example: I have a file "Test.java" which is indexed by Lucene. When I search inside the index, how can I enhance the search results with line numbers in Test.java?

RangeQuery doesn't override equals() or hashCode() - intentional?

2005-04-11 Thread Chris Lamprecht
I was attempting to cache QueryFilters in a Map using the Query as the key (a BooleanQuery instance containing two RangeQueries), and I discovered that my BooleanQueries' equals() methods would always return false, even when the queries were equivalent. The culprit was RangeQuery - it doesn't impl
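To illustrate why a Map lookup fails without these overrides, here is a sketch using a hypothetical stand-in class, RangeKey, with assumed fields. It is not Lucene's RangeQuery, only a minimal value class showing the equals()/hashCode() contract that a cache key relies on.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for a range-style query used as a cache key.
final class RangeKey {
    private final String field;
    private final String lower;
    private final String upper;
    private final boolean inclusive;

    RangeKey(String field, String lower, String upper, boolean inclusive) {
        this.field = field;
        this.lower = lower;
        this.upper = upper;
        this.inclusive = inclusive;
    }

    // Two keys are equal when every defining field matches.
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof RangeKey)) return false;
        RangeKey other = (RangeKey) o;
        return inclusive == other.inclusive
                && Objects.equals(field, other.field)
                && Objects.equals(lower, other.lower)
                && Objects.equals(upper, other.upper);
    }

    // Must be consistent with equals(): equal keys produce equal hash codes.
    @Override
    public int hashCode() {
        return Objects.hash(field, lower, upper, inclusive);
    }

    public static void main(String[] args) {
        Map<RangeKey, String> cache = new HashMap<>();
        cache.put(new RangeKey("date", "20050101", "20050411", true), "cached filter");
        // A separately constructed but equivalent key finds the cached entry.
        System.out.println(cache.get(new RangeKey("date", "20050101", "20050411", true)));
    }
}
```

Without the two overrides, the default identity-based Object.equals() makes every freshly built key a cache miss, and any composite query that delegates equality to its parts would inherit the same behaviour.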