Re: TokenStream per Field instance?

2006-05-26 Thread karl wettin
On Sun, 2006-05-21 at 04:46 +0200, karl wettin wrote: > Do I have any alternatives? > > > What I really want is: > { > for (Classification c : myDocument) { > doc.add(new Field(c.getFieldName(), c.tokenStreamFactory()... > } > indexWriter.add(doc, perFieldsAnalyzer); > } Patch now in J

Re: sitegeist

2006-05-26 Thread karl wettin
I did it like below, but used the lucene score instead. Will report back with results in a month or so. On Thu, 2006-05-25 at 11:51 +0200, karl wettin wrote: > Did anyone write some neat tool for statistical analysis of hits over > time? I need one. And it must be fast. Was thinking something lik

Re: Joining searches on multiple indexes

2006-05-26 Thread Chris Hostetter
There is nothing special in lucene to help you do this ... it would have to be done in your own code. : If I have user info in 2 different sources (index)and want to search for : fields on both, but the search should : join the resulting records using a common field (user id for example). Is : th
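As a rough illustration of doing the join "in your own code", the sketch below merges two sets of hits on a shared field. It is plain Java and makes no Lucene calls: each hit is assumed to have already been copied into a Map of field name to value. The class name HitJoin, the methods joinOn and row, and the field names userId, name and lastLogin are all invented for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: an in-memory inner join of two hit lists.
class HitJoin {

    /** Joins two lists of field maps on a shared key field (inner join). */
    static List<Map<String, String>> joinOn(String key,
                                            List<Map<String, String>> left,
                                            List<Map<String, String>> right) {
        // Group the right-hand hits by the value of the join field.
        Map<String, List<Map<String, String>>> byKey = new HashMap<>();
        for (Map<String, String> r : right) {
            String k = r.get(key);
            if (k != null) {
                byKey.computeIfAbsent(k, x -> new ArrayList<>()).add(r);
            }
        }
        // For each left-hand hit, emit one merged record per matching right-hand hit.
        List<Map<String, String>> joined = new ArrayList<>();
        for (Map<String, String> l : left) {
            String k = l.get(key);
            if (k == null) {
                continue;
            }
            List<Map<String, String>> matches = byKey.get(k);
            if (matches == null) {
                matches = Collections.emptyList();
            }
            for (Map<String, String> r : matches) {
                Map<String, String> merged = new HashMap<>(l);
                merged.putAll(r); // on a field-name clash the right-hand value wins
                joined.add(merged);
            }
        }
        return joined;
    }

    /** Builds a field map from alternating name/value arguments. */
    static Map<String, String> row(String... kv) {
        Map<String, String> m = new HashMap<>();
        for (int i = 0; i + 1 < kv.length; i += 2) {
            m.put(kv[i], kv[i + 1]);
        }
        return m;
    }

    public static void main(String[] args) {
        List<Map<String, String>> profiles = new ArrayList<>();
        profiles.add(row("userId", "1", "name", "ann"));
        profiles.add(row("userId", "2", "name", "bob"));

        List<Map<String, String>> logins = new ArrayList<>();
        logins.add(row("userId", "2", "lastLogin", "2006-05-26"));

        // Only userId 2 appears in both lists, so one merged record is printed.
        for (Map<String, String> hit : joinOn("userId", profiles, logins)) {
            System.out.println(hit);
        }
    }
}
```

In a real application the two lists would be filled from the hits of each index's searcher. With large result sets it is probably cheaper to search one index first and then query the other for the matching ids only.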

Re: Index evolution

2006-05-26 Thread Chris Hostetter
: How easy is to add new fields to the documents in the index? : Suppose that today I can search for book title and decide that including the : author in the search would be a good idea. How easy is to do that with : lucene? very. whenever you add a document, you specify what fields that docume

Re: Index evolution

2006-05-26 Thread karl wettin
On Fri, 2006-05-26 at 17:50 -0300, Leandro Saad wrote: > Hi all. I'm very new to lucene. All I have done is read some docs about how > it works, which brings to the question: > > How easy is to add new fields to the documents in the index? > Suppose that today I can search for book title and decid

Re: Question about special characters

2006-05-26 Thread Chris Hostetter
: Thanks for the reply, but I don't know how to do this change in : ISOLatin1AccentFilter. : Can you give me some advice in this action? I've never really looked at the internals of ISOLatin1AccentFilter, but the basic idea is to subclass it with a new TokenFilter that maintains a one token "buffer"

RE: Seeing what's occupying all the space in the index

2006-05-26 Thread Rob Staveley (Tom)
Luke shows the total index size the same, and yes, it appears to list all the files. There are 997 of them which are tough to count using that interface with Cygwin/X. > Also, you may want to see if you have any stale locks or the like that is preventing you from doing an optimize. No lock files,

Joining searches on multiple indexes

2006-05-26 Thread Leandro Saad
My second question is: can I join the results of multiple indexes using a common field? If I have user info in 2 different sources (indexes) and want to search for fields on both, but the search should join the resulting records using a common field (user id for example). Is this possible? -- Leandr

Index evolution

2006-05-26 Thread Leandro Saad
Hi all. I'm very new to lucene. All I have done is read some docs about how it works, which brings to the question: How easy is to add new fields to the documents in the index? Suppose that today I can search for book title and decide that including the author in the search would be a good idea.

Re: sorting issues

2006-05-26 Thread Daniel Naber
On Freitag 26 Mai 2006 17:46, Mike Richmond wrote: > I am then storing this in a stored, untokenized field named "date". From the API docs: The field must be indexed, but should not be tokenized, and does not need to be stored (unless you happen to want it back with the rest of your document d

Re: Seeing what's occupying all the space in the index

2006-05-26 Thread Grant Ingersoll
It kind of sounds like those files are corrupted, but I can't say for sure. When you look in Luke at your index (the one with all the files, not the new one) do you see all the documents you would expect to see with values that seem reasonable? Also, in Luke, you can see a listing of all the

RE: Seeing what's occupying all the space in the index

2006-05-26 Thread Rob Staveley (Tom)
Indexing 55648 documents in a new clean directory, I see only .cfs files (+ deletable + segments). Disk usage is 65K for all of these, which means that each message takes ~1K of index space rather than > 10K as it does in my 99GB index. Bearing in mind that the large index has > 5 million Lucene

Re: Get index name from a hit

2006-05-26 Thread Erik Hatcher
I'm running out the door, so only a quick reply... yes you can. Look at the subSearcher(?) method - that'll give you the index. Your application will need to keep track of which indexes correspond to which indices. Check the archives for the answer too, sorry for the short reply.

RE: Seeing what's occupying all the space in the index

2006-05-26 Thread Rob Staveley (Tom)
> Note that IndexReader has a main() that will list the contents of compound index files. It looks like some of my index is compound and some isn't. My not very well informed guess is that an optimize() got interrupted somewhere along the line. If I try to optimize the index now, it throws except

RE: Seeing what's occupying all the space in the index

2006-05-26 Thread Rob Staveley (Tom)
I just tried to optimise my index, using the lucli command line client, and got: 8< lucli> optimize Starting to optimize index. java.io.IOException: Cannot overwrite: /mnt/sdb1/lucene-index/index-1/_2lhqi.fnm at org.apache.lucene.store.FSDirectory.createOutput(FSDirectory.j

Re: BufferedIndexInput.readByte performance; skipping

2006-05-26 Thread Paul Elschot
On Friday 26 May 2006 19:13, Ken Krugler wrote: > >On Friday 26 May 2006 16:14, Michael Chan wrote: > >> Hi, > >> > > > I have a 5gb index containing 2mil documents and am trying to run > >> 1mil+ queries against it. Most of the queries are SpanQueries and it > >> occurs to me that the search p

Get index name from a hit

2006-05-26 Thread Mike Richmond
When using a MultiSearcher, is there any way to get the name of the index that a hit came from? One way would be to add the index name as a field to each document, but I am hoping to avoid this. Thanks, Mike - To unsubscribe, e

Re: Seeing what's occupying all the space in the index

2006-05-26 Thread Doug Cutting
Rob Staveley (Tom) wrote: Is there a tool I can use to see how much of the index is occupied by the different fields I am indexing? Note that IndexReader has a main() that will list the contents of compound index files. Doug --

RE: Seeing what's occupying all the space in the index

2006-05-26 Thread Rob Staveley (Tom)
Interesting. I am explicitly turning on the compound file format when I start my application, but I am suspicious about my optimizing thread. It *ought* to be optimising every 30 minutes, using thread synchronisation to prevent the writer from trying to write while optimisation takes place, but it

Re: Seeing what's occupying all the space in the index

2006-05-26 Thread Grant Ingersoll
It seems odd to me that, if you are using the CFS format, you would have the .fdt, .frq and .prx files in addition to the .cfs files. My understanding is all files (except deletable and segments) get put inside of the CFS file. Looking at my indices, I only have the CFS file. Are you optim

RE: Seeing what's occupying all the space in the index

2006-05-26 Thread Rob Staveley (Tom)
That's a really good idea, but I've got a total of 38 fields only. It is true that some of them are empty, but that can't account for the bulk. -Original Message- From: Chris Hostetter [mailto:[EMAIL PROTECTED] Sent: 26 May 2006 17:50 To: java-user@lucene.apache.org Subject: RE: Seeing wh

Re: BufferedIndexInput.readByte performance

2006-05-26 Thread Ken Krugler
On Friday 26 May 2006 16:14, Michael Chan wrote: Hi, > I have a 5gb index containing 2mil documents and am trying to run 1mil+ queries against it. Most of the queries are SpanQueries and it occurs to me that the search performance is quite slow when using 2, 3 SpanOrQueries nested inside

RE: Seeing what's occupying all the space in the index

2006-05-26 Thread Rob Staveley (Tom)
> Is there anything I can learn from the index directory's file listing? Running this nasty little BASH one-liner... $ for i in `ls * | perl -nle 'if (/^.+(\..+)/) {print $1;}' | sort | uniq`;do ls -l *$i | awk '{SUM = SUM + $5} END {if (SUM > 1e10) {print "'$i': ", SUM}}'; done ... I see
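For readers who would rather not decode the shell pipeline, here is a small Java sketch of the same idea: total the file sizes in the index directory, grouped by file extension. It uses only the standard library, not Lucene. The class name IndexFileSizes and the method bytesByExtension are invented for the example. Unlike the one-liner, it prints every extension instead of only those whose total exceeds 1e10 bytes.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: sums file sizes per extension.
class IndexFileSizes {

    /** Returns total bytes per file extension for the regular files directly in dir. */
    static Map<String, Long> bytesByExtension(Path dir) throws IOException {
        Map<String, Long> totals = new TreeMap<>();
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path f : files) {
                if (!Files.isRegularFile(f)) {
                    continue; // skip subdirectories and other non-files
                }
                String name = f.getFileName().toString();
                int dot = name.lastIndexOf('.');
                // Files without an extension (e.g. "segments") share one bucket.
                String ext = dot > 0 ? name.substring(dot) : "(none)";
                totals.merge(ext, Files.size(f), Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) throws IOException {
        // Pass the index directory as the first argument; defaults to the current directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        for (Map.Entry<String, Long> e : bytesByExtension(dir).entrySet()) {
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

Sorting the output by size, or filtering it to the largest buckets as the one-liner does, is a small change to the loop in main.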

Re: Seeing what's occupying all the space in the index

2006-05-26 Thread Chris Hostetter
: PS: I am a newbie to the mailing list - I hope I've got the etiquette right you may have figured this out already, but please CC email to multiple lucene mailing lists -- in this particular case, [EMAIL PROTECTED] is just a legacy alias that points at [EMAIL PROTECTED] -- so there's *really* no

RE: Seeing what's occupying all the space in the index

2006-05-26 Thread Chris Hostetter
are you by any chance using different field names for each document -- or do you have a wide range of field names that aren't the same for each document? ... you mentioned indexing emails, email has a very loose header structure that allows MTAs to add arbitrary "X" headers, are you converting eve

Re: BufferedIndexInput.readByte performance

2006-05-26 Thread Paul Elschot
On Friday 26 May 2006 16:14, Michael Chan wrote: > Hi, > > I have a 5gb index containing 2mil documents and am trying to run > 1mil+ queries against it. Most of the queries are SpanQueries and it > occurs to me that the search performance is quite slow when using 2, 3 > SpanOrQueries nested inside

Re: sorting issues

2006-05-26 Thread Mike Richmond
I'm running into similar sort issues when I try to sort my results on a date field that was created using the DateTools class as follows: DateTools.dateToString(dateObj, DateTools.Resolution.SECOND); I am then storing this in a stored, untokenized field named "date". When I sort the results by d

Re[2]: Implemented subclasses of Similarity class in Lucene

2006-05-26 Thread Charlie
Hi Edgar, Are there any technical reports explaining your design and implementation of LM on Lucene? Or what source files are exactly "LM extension"? -- Best regards, Charlie --- Friday, May 26, 2006, 7:36:14 AM, you wrote: > Hi Edgar, > While doing the integration/updating for Lucene 1.9, c

Re: Run-Time Error

2006-05-26 Thread Andrzej Bialecki
Dennis Kubes wrote: The server is headless (i.e. no X-Windows). I've tried lucli, but that doesn't have Luke's whistles and bells. Does Luke have a non-GUI equivalent, Grant? You can tunnel your X session through ssh. If that's not possible, AND you are familiar with Lucene API, then you ca

RE: Seeing what's occupying all the space in the index

2006-05-26 Thread Rob Staveley (Tom)
> I can't see how Luke is going to show me what's occupying most of my index. I do however notice that none of my stored fields are stored compressed. Presumably Field.Store.COMPRESS is something that is new in Lucene 1.9 and wasn't available in 1.4.3?? However, it is still hard to see what's c

BufferedIndexInput.readByte performance

2006-05-26 Thread Michael Chan
Hi, I have a 5gb index containing 2mil documents and am trying to run 1mil+ queries against it. Most of the queries are SpanQueries and it occurs to me that the search performance is quite slow when using 2, 3 SpanOrQueries nested inside a SpanNearQuery, which in turn is nested inside another Spa

RE: Seeing what's occupying all the space in the index

2006-05-26 Thread Rob Staveley (Tom)
Luke is working nicely with a XWin32 demo server, I just downloaded from StarNet, with a bit of SSH tunnelling :-) [I couldn't immediately figure out how to do it with Cygwin/X.] However, I can't see how Luke is going to show me what's occupying most of my index. -Original Message- From

Re: Seeing what's occupying all the space in the index

2006-05-26 Thread Karel Tejnora
Or you can use ssh -X for X11 forwarding. I don't know how well it works on Windows (with some X client app), but it works great on Linux(es) with high bandwidth.

Re: Seeing what's occupying all the space in the index

2006-05-26 Thread Grant Ingersoll
I don't believe it does. Is there any way you can mount the drive where the index lives? Can you copy the index to someplace that allows you to run Luke? Otherwise, you could write a simple standalone program that dumps the terms and their freqs from the command line. I don't think it would

Re: Implemented subclasses of Similarity class in Lucene

2006-05-26 Thread Murat Yakici
Hi Edgar, While doing the integration/updating for Lucene 1.9, could you be more open and clear about the design so that people can 1) understand it, 2) extend it. Just a recommendation. Cheers, Murat Edgar Meij wrote: Hi Ganesh, We have developed a Language Modeling extension to Lucene at

RE: Seeing what's occupying all the space in the index

2006-05-26 Thread Rob Staveley (Tom)
The server is headless (i.e. no X-Windows). I've tried lucli, but that doesn't have Luke's whistles and bells. Does Luke have a non-GUI equivalent, Grant? -Original Message- From: Grant Ingersoll [mailto:[EMAIL PROTECTED] Sent: 26 May 2006 12:41 To: java-user@lucene.apache.org Subject: Re

Re: Seeing what's occupying all the space in the index

2006-05-26 Thread Grant Ingersoll
Give Luke a try. Google for "Luke Lucene" and you should find it. Otherwise check the Lucene website for a reference. Rob Staveley (Tom) wrote: In my index of e-mail message parts, it looks like 23K is being used up for each indexed message part, which is way more than I'd expect. I have a

RE: Seeing what's occupying all the space in the index

2006-05-26 Thread Rob Staveley (Tom)
In my index of e-mail message parts, it looks like 23K is being used up for each indexed message part, which is way more than I'd expect. I have a total of 37 fields per message part. I tokenize, index and do not store message part bodies. I store a <= 300 character synopsis of each message part.

Re: Implemented subclasses of Similarity class in Lucene

2006-05-26 Thread Edgar Meij
Hi Ganesh, We have developed a Language Modeling extension to Lucene at the University of Amsterdam. It can be found here: http://ilps.science.uva.nl/Resources/#lm-lucen It was built around Lucene 1.4.3, so it isn't source compatible with the latest Lucene version. We are currently working on i

Seeing what's occupying all the space in the index

2006-05-26 Thread Rob Staveley (Tom)
I am indexing e-mail in a compound index and for e-mail which is stored in ~60G (in Bzip2 compressed form), I have an index which is now 80G. Is there a tool I can use to see how much of the index is occupied by the different fields I am indexing? PS: I am a newbie to the mailing list - I hope I'

Re: Question about special characters

2006-05-26 Thread Dan Wiggin
Thanks for the reply, but I don't know how to do this change in ISOLatin1AccentFilter. Can you give me some advice in this action? 2006/5/25, Chris Hostetter <[EMAIL PROTECTED]>: I think I'm missing something here. the whole point of the ISOLatin1AccentFilter is to replace accented characters wit