API that return the amount of terms indexed

2010-10-15 Thread APOLO_11
Hey, is there an API that returns the number of terms indexed? I found the API that returns the number of documents indexed (IndexWriter.docCount) but can't find an API for the number of terms in the index. Any ideas? Thanks, d. -- View this message in context: http://lucene.472066.n3.nabble.com/API-

Re: Writing an Analyzer for storing and retrieving a payload (was: Storing additional Metadata with Fields)

2010-10-15 Thread Christoph Hermann
On Friday, 15 October 2010, 20:13:17, Erick Erickson wrote: Hello, > Have you seen: > http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/ Sure. There is also http://www.lucidimagination.com/blog/2010/04/18/refresh-getting-started-with-payloads/ http://sujitpal.b

Re: Tokenizing XML

2010-10-15 Thread Erick Erickson
Well, it's hard to say what "correctly" would be. Remove all XML? Preserve attributes? Preserve tags? Put the attributes and values into fields in the document? My point is that there's no obviously "correct" parsing. But if you just want to strip out all the <>, it seems like PatternTokenizer
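A minimal sketch of the "just strip the markup" option mentioned in the reply above, written against plain java.util.regex rather than Lucene's PatternTokenizer. The class name XmlTagStripper, the method name tokenize, and the sample input are hypothetical and chosen only for illustration; attributes, entities, comments and CDATA are not handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: drops anything that looks like a tag
// and splits the remaining text on whitespace.
class XmlTagStripper {
    // '<', then any run of characters other than '>', then '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static List<String> tokenize(String xml) {
        // Replace each tag with a space so words on either side stay separate.
        String text = TAG.matcher(xml).replaceAll(" ");
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Sample input is an assumption, not taken from the thread.
        System.out.println(tokenize("<doc>this is <b>example</b> text.</doc>"));
    }
}
```

Whether dropping the tags is the "correct" behaviour depends on the questions raised in the reply: if tags or attributes need to be preserved or routed into separate fields, a real XML parser is probably a better starting point than a regular expression.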

Re: Writing an Analyzer for storing and retrieving a payload (was: Storing additional Metadata with Fields)

2010-10-15 Thread Erick Erickson
Have you seen: http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/ ? And I don't think payloads are added unless they're specified in the term. And even if they are, is your index big enough

Re: Use of Lucene to store data from RSS feeds

2010-10-15 Thread Erick Erickson
How many documents Lucene creates and when is entirely up to you. Your code calls IndexWriter.addDocument after all. You can add multiple values to a field in a document if you want, just call Document.add() repeatedly... HTH Erick On Fri, Oct 15, 2010 at 10:35 AM, Martin O'Shea wrote: > @Pul

Tokenizing XML

2010-10-15 Thread Christoph Hermann
Hi, is there a Tokenizer in Lucene that tokenizes XML correctly? I.e. that one gets from the following XML: this is exampletext. Tokens (or similar): | this | is | | example | | text. | Or would I need to write such a Tokenizer myself? regards Christoph Hermann -- Christoph Hermann Inst

RE: Use of Lucene to store data from RSS feeds

2010-10-15 Thread Martin O'Shea
@Pulkit Singhal: Thanks for the reply. Just to clarify my post yesterday, I'm not sure if each row in the database table would form a document or not because I do not know if Lucene works in this manner. In my case, each row of the table represents a single polling of an RSS feed to retrieve any

RE: determining the type of a term - retrieving a payload

2010-10-15 Thread Sykes, Derek
Hi David, nextPosition() was indeed the missing link. Thanks very much! Cheers, Derek -Original Message- From: David Causse [mailto:dcau...@spotter.com] Sent: 15 October 2010 09:34 To: java-user@lucene.apache.org Subject: Re: determining the type of a term - retrieving a payload On We

Writing an Analyzer for storing and retrieving a payload (was: Storing additional Metadata with Fields)

2010-10-15 Thread Christoph Hermann
On Thursday, 14 October 2010, 14:43:41, Christoph Hermann wrote: Hello, > It seems a payload gets added to > every term in the index, so in my case I would store the x, y and page > values for every word and increase the index size much more than I'd need. > Any approach for preventing this? > > An

Re: Overriding DefaultScore

2010-10-15 Thread Ian Lea
http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/ sounds a good place to start. A much simpler alternative, although without exact control, would be to use query boosting. There is also CustomScoreQuery - complex but powerful. -- Ian. On Fri, Oct 15, 2010 at 2:19

Re: Overriding DefaultScore

2010-10-15 Thread Zaharije Pasalic
Can anybody explain or point me to a couple of links where I can find more info about payloads? Thx On Fri, Oct 15, 2010 at 11:09 AM, Danil ŢORIN wrote: > You could encode term score as payload while indexing, and use those > payloads on search time. > > On Fri, Oct 15, 2010 at 11:30, Zaharije Pas

Re: Use of Lucene to store data from RSS feeds

2010-10-15 Thread Pulkit Singhal
When you ask: a) will each feed form a Lucene document, or b) will each database row form a Lucene document, I'm inclined to say that really depends on what type of aggregation tool or logic you are using. I don't know if "Tika" does it but if there is a tool out there that can be point

Re: Overriding DefaultScore

2010-10-15 Thread Danil ŢORIN
You could encode the term score as a payload while indexing, and use those payloads at search time. On Fri, Oct 15, 2010 at 11:30, Zaharije Pasalic wrote: > Hi > > my original problem is to index a large number of documents which > contain 360 integers in the range 0-90K. Searching is a little bit > c
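To illustrate the suggestion above, here is a rough sketch of only the byte-level part: turning a float score into the bytes a payload would carry and reading it back. It uses plain java.nio and deliberately leaves out how the bytes are attached to a term or read during scoring, since that depends on the Lucene version. The class name ScoreBytes, the method names, and the 4-byte big-endian layout are assumptions made for this example.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not part of Lucene: converts a per-term score
// to and from a fixed 4-byte representation.
class ScoreBytes {
    // Encode the score as 4 bytes (ByteBuffer's default big-endian order).
    static byte[] encode(float score) {
        return ByteBuffer.allocate(Float.BYTES).putFloat(score).array();
    }

    // Decode the first 4 bytes back into the original float.
    static float decode(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getFloat();
    }

    public static void main(String[] args) {
        byte[] payload = encode(0.75f);
        System.out.println(payload.length + " bytes, decoded: " + decode(payload));
    }
}
```

A fixed-width encoding keeps every payload the same size, which makes the index growth easy to estimate: 4 extra bytes per term position that carries a score.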

Re: IndexSearch very slow after reopening the index

2010-10-15 Thread Ian Lea
I'm a bit confused about what exactly you are timing. Is the 46 ms for one search on one term with one hit, or for 100 similar searches or what? Perhaps a minimal self-contained search program demonstrating exactly what you are doing would help, with evidence of where it is spending time. -- Ia

Re: determining the type of a term - retrieving a payload

2010-10-15 Thread David Causse
On Wed, Oct 13, 2010 at 04:37:37PM +0100, Sykes, Derek wrote: > Hi there, > > I'm currently trying to work out how I can determine the type > (string/number/date/etc) of a term. I've not seen any off the shelf way to do > it so am trying to store a payload against each term that records the type

Overriding DefaultScore

2010-10-15 Thread Zaharije Pasalic
Hi, my original problem is to index a large number of documents which contain 360 integers in the range 0-90K. Searching is a little bit complicated: I need to find the most similar documents, where the query data is also 360 numbers in the range 0-90K. But (there is always a 'but') I need to create a score with