Re: n-gram indexing

2005-07-18 Thread Andy Roberts
On Monday 18 Jul 2005 22:06, Rajesh Munavalli wrote: > Intuition behind adding n-grams is to boost naturally occurring larger > phrases versus using phrase queries. For example, if I am searching for > "united states of america", I want the search results to return the > documents ordered as follows

Re: n-gram indexing

2005-07-18 Thread Andy Roberts
On Monday 18 Jul 2005 21:27, Rajesh Munavalli wrote: > At what point do I add n-grams? Does the order in which I add n-grams > affect exact phrase queries later? My questions are > > (1) Should I add all the 1-grams followed by 2-grams followed by > 3-grams..etc sentence by sentence OR > > (2) Add

Re: Hypenated word

2005-06-13 Thread Andy Roberts
On Monday 13 Jun 2005 14:52, Markus Wiederkehr wrote: > On 6/13/05, Andy Roberts <[EMAIL PROTECTED]> wrote: > > On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote: > > > I see, the list of exceptions makes this a lot more complicated than I > > > thought... Tha

Re: Hypenated word

2005-06-13 Thread Andy Roberts
On Monday 13 Jun 2005 13:18, Markus Wiederkehr wrote: > I see, the list of exceptions makes this a lot more complicated than I > thought... Thanks a lot, Erik! > I expect you'll need to do some pre-processing. Read in your text into a buffer, line-by-line. If a given line ends with a hyphen, you

Re: Indexing multiple languages

2005-06-03 Thread Andy Roberts
On Friday 03 Jun 2005 01:06, Bob Cheung wrote: > For the StandardAnalyzer, will it have to be modified to accept > different character encodings. > > We have customers in China, Taiwan and Hong Kong. Chinese data may come > in 3 different encoding: Big5, GB and UTF8. > > What is the default encod

Re: Retrieve all terms

2005-05-19 Thread Andy Roberts
On Thursday 19 May 2005 06:53, Morus Walter wrote: > I think he doesn't want the contents but a term list for these contents. > Something like > 1 1 > 4 1 > content 2 > document 2 > for his sample, where the number is the frequency of the term. > > I don't think that you can ea

Digester and simple XML files

2005-04-22 Thread Andy Roberts
Hi all, Just been playing with Digester after reading chapter 7 in LIA. Seems to fit my needs as I have a relatively simple XML structure. some sente

Re: Best way to purposely corrupt an index?

2005-04-21 Thread Andy Roberts
On Wednesday 20 Apr 2005 12:52, Kevin L. Cobb wrote: > My policy on this type of exception handling is to only bite off what > you can chew. If you catch an IOException, then you simply report to the > user that an unexpected error has occurred and the search engine is > unobtainable at the moment.

Re: Best way to purposely corrupt an index?

2005-04-20 Thread Andy Roberts
On Wednesday 20 Apr 2005 08:27, Maik Schreiber wrote: > > As the index is rather critical to my program, I just wanted to make it > > really robust, and able to cope should a problem occur with the index > > itself. Otherwise, the user will be left with a non-functioning program > > with no explana

Re: Best way to purposely corrupt an index?

2005-04-20 Thread Andy Roberts
exReader. > > This relies on UNIX file handling semantics. (Can't say a word about > Windows). Don't know if this applies at all to our situation, but it > works for us. > > /D > > Andy Roberts wrote: > >Hi, > > > >Seems like an odd request I'm s

Best way to purposely corrupt an index?

2005-04-19 Thread Andy Roberts
Hi, Seems like an odd request I'm sure. However, my application relies on an index, and should the index become unusable for some unfortunate reason, I'd like my app to gracefully cope with this situation. Firstly, I need to know how to detect a broken index. Opening an IndexReader can potentiall

Re: getting the number of occurrences within a document

2005-04-14 Thread Andy Roberts
On Thursday 14 Apr 2005 15:15, Pablo Gomes Ludermir wrote: > Hello all, > > I would like to get the following information from the index: > > 1. Given a term, how many times the term occurs in each document. > Something like a triple: > < Term, Doc1, Freq> , , , ... > > Is possible to do that? > >
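
For illustration only, here is a minimal sketch of the <Term, Doc, Freq> triples the question describes. It is plain Python over an in-memory dictionary of documents, not Lucene's API; the names `term_doc_freqs` and `triples`, the document ids, and the simple `\w+` tokenisation are all assumptions made for the example.

```python
import re
from collections import Counter


def term_doc_freqs(docs):
    """Map each term to {doc_id: frequency} for a dict of doc_id -> text.

    Hypothetical helper: tokenisation is a plain lowercase \\w+ split,
    which is an assumption and not what any particular analyzer does.
    """
    index = {}
    for doc_id, text in docs.items():
        counts = Counter(re.findall(r"\w+", text.lower()))
        for term, freq in counts.items():
            index.setdefault(term, {})[doc_id] = freq
    return index


def triples(index, term):
    """Return (term, doc_id, freq) triples for one term, sorted by doc id."""
    postings = index.get(term, {})
    return [(term, doc_id, freq) for doc_id, freq in sorted(postings.items())]


if __name__ == "__main__":
    docs = {"Doc1": "lucene index lucene", "Doc2": "index"}
    for triple in triples(term_doc_freqs(docs), "lucene"):
        print(triple)
```

A real index already stores this information per term, so in practice one would read it from the index rather than recount it; the sketch only shows the shape of the data being asked for.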

Re: Multi-analyzer ?

2005-04-11 Thread Andy Roberts
On Tuesday 12 Apr 2005 00:53, Eric Chow wrote: > But how about one document contains more than two different languages ?? > > > Eric If you're indexing many documents which contain multiple languages then it's probably just better to use a SimpleAnalyzer, rather than one that does any language s

Re: Multi-analyzer ?

2005-04-11 Thread Andy Roberts
ser to specify their input language because otherwise, results will be poor. Andy Roberts > -MB > > On Apr 11, 2005, at 6:02 AM, Andy Roberts wrote: > > Can you not provide the user with a option list to specify their input > > language? > > > > Language identificat

Re: Terms & Postion from Hits ...

2005-04-11 Thread Andy Roberts
5 in the field "contents" of the index ir. HTH, Andy Roberts On Sunday 10 Apr 2005 15:52, Patricio Galeas wrote: > Hello, > I am new with Lucene. I have following problem. > When I execute a search I receive the list of document Hits. > I get without problem the

Re: Multi-analyzer ?

2005-04-11 Thread Andy Roberts
Can you not provide the user with an option list to specify their input language? Language identification can be a pretty tricky field. There are some tricks you can do with unicode to identify language, e.g., \u0600 - \u06FF contains the Arabic characters, so if your input contains lots of ch
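
The Unicode-range trick mentioned above could be sketched roughly as follows. This is an assumption-laden illustration in plain Python: the function names `arabic_ratio` and `looks_arabic` and the 0.5 threshold are hypothetical, and only the \u0600 - \u06FF range comes from the message itself.

```python
def arabic_ratio(text):
    """Fraction of non-whitespace characters in the range U+0600..U+06FF."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    arabic = sum(1 for c in chars if "\u0600" <= c <= "\u06ff")
    return arabic / len(chars)


def looks_arabic(text, threshold=0.5):
    """Guess that the input is Arabic-script if enough characters fall in the block.

    The threshold is an arbitrary assumption; tune it for real data.
    """
    return arabic_ratio(text) >= threshold


if __name__ == "__main__":
    print(looks_arabic("\u0645\u0631\u062d\u0628\u0627"))
    print(looks_arabic("hello world"))
```

A check like this only identifies the script, not the language: several languages share the Arabic block, so it is a coarse first filter at best.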

Re: Escaping special characters

2005-04-07 Thread Andy Roberts
On Thursday 07 Apr 2005 06:38, Chuck Williams wrote: > Mufaddal Khumri writes (4/6/2005 11:21 PM): > >Hi, > > > >Am new to Lucene. I found the following page: > >http://lucene.apache.org/java/docs/queryparsersyntax.html. At the bottom > >of the page there is a section that in order to escape specia

Highlighter compile error

2005-03-10 Thread Andy Roberts
successfully built the code in the lucene-1.4.2-dev branch, but that doesn't contain that class either! Any hints? Google didn't shed any light, btw. Cheers, Andy Roberts

Obtaining the contexts of hits

2005-03-09 Thread Andy Roberts
Hi, I've been using Lucene for a few months now, although not in a typical "building a search engine" kind of way*. Basically, I have some large documents. I would like a system whereby I search for a term, and then I receive a hit for each match, with its context, e.g., ten words either side
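
As a rough illustration of the "hit with its context" idea described above, here is a keyword-in-context sketch in plain Python. It works on raw text, not on a Lucene index; the function name `contexts`, the `window` parameter, and the `\w+` tokenisation are assumptions made for the example.

```python
import re


def contexts(text, term, window=10):
    """Return (left, match, right) for every occurrence of term in text.

    left and right hold up to `window` words on either side of the match.
    Matching is case-insensitive on whole words (an assumption).
    """
    words = re.findall(r"\w+", text)
    target = term.lower()
    hits = []
    for i, word in enumerate(words):
        if word.lower() == target:
            left = words[max(0, i - window):i]
            right = words[i + 1:i + 1 + window]
            hits.append((" ".join(left), word, " ".join(right)))
    return hits


if __name__ == "__main__":
    for left, match, right in contexts("the cat sat on the mat", "the", window=2):
        print(f"{left!r} [{match}] {right!r}")
```

Scanning the raw text like this is linear in document length for every query, so for large documents it would more likely be applied only to documents an index has already identified as matching.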