RE: indexing rss feeds in multiple languages

2007-03-21 Thread Melanie Langlois
Well, thanks, that sounds like the best option to me. Does anybody use PerFieldAnalyzerWrapper? I'm just curious to know whether using different analyzers has any impact on performance. Mélanie -Original Message- From: Doron Cohen [mailto:[EMAIL PROTECTED] Sent: Thursday, Ma

Re: indexing rss feeds in multiple languages

2007-03-21 Thread Doron Cohen
If language is known also at search time, PerFieldAnalyzerWrapper seems a nice third option: single document per feed, with a separate field for each language, additional field(s) for the common data; using PerFieldAnalyzerWrapper at both indexing and search; using FieldSelector at search to retr
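The per-field idea described above can be sketched in plain Java. This is only an illustration of the dispatch pattern (one analyzer per field name, with a default for everything else), not Lucene's PerFieldAnalyzerWrapper API. The class name, the toy analyzers, and the field names `title_fr` and `source` are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Plain-Java illustration of per-field analyzer dispatch; NOT Lucene's API. */
final class PerFieldDispatchSketch {
    // Analyzer used for any field without an explicit mapping.
    private final Function<String, List<String>> defaultAnalyzer;
    // Field name -> analyzer override (hypothetical field names).
    private final Map<String, Function<String, List<String>>> byField = new HashMap<>();

    PerFieldDispatchSketch(Function<String, List<String>> defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    void addAnalyzer(String field, Function<String, List<String>> analyzer) {
        byField.put(field, analyzer);
    }

    /** The same lookup must be used at index time and at search time. */
    List<String> analyze(String field, String text) {
        return byField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    /** Toy analyzer: lowercase with the given locale, split on whitespace. */
    static List<String> lowercaseWords(String text, Locale locale) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(locale).split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Toy analyzer: keep the whole value as one token (for common data). */
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }
}
```

With this sketch, a language-specific field would be registered via `addAnalyzer`, while fields holding the common data fall through to the default analyzer.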

Re: indexing rss feeds in multiple languages

2007-03-21 Thread aslam bari
Oops! Sorry, my last message was posted here by mistake. It was meant for someone else; it is just a silly mistake. Sorry, people. - Original Message From: aslam bari <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Thursday, 22 March, 2007 12:12:57 PM Subject: Re: indexing rss feed

Re: indexing rss feeds in multiple languages

2007-03-21 Thread aslam bari
Hi, Have a look at my resume attached to this mail. If it suits you, let me know. Thanks... - Original Message From: Melanie Langlois <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Thursday, 22 March, 2007 11:33:03 AM Subject: indexing rss feeds in multiple languages Hi,

Re: Spelt, for better spelling correction

2007-03-21 Thread Otis Gospodnetic
Martin, This sounds like the spellchecker dictionary needs to be built in parallel with the main Lucene index. Is it possible to create a dictionary out of an existing (and no longer modified) Lucene index? Otis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Simpy -- http://www.si

Re: Issue while parsing XML files due to control characters, help appreciated.

2007-03-21 Thread Doron Cohen
Lokeya <[EMAIL PROTECTED]> wrote on 21/03/2007 22:09:06: > > Initially I was writing into the Index 7,00,000 times. I changed the code to > now write only 70 times which means I am putting a lot of data in an array > list and add to doc and index at one shot. This is where the improvement > came from

indexing rss feeds in multiple languages

2007-03-21 Thread Melanie Langlois
Hi, I saw that there are many posts on the mailing list about indexing in multiple languages, so I will try not to post a duplicate question. In my case, I want to index rss feeds, so one feed contains several items in different languages, and some common data for all the items (date, source..).

Combining score from two or more hits

2007-03-21 Thread Antony Bowesman
I have indexed objects that contain one or more attachments. Each attachment is indexed as a separate Document along with the object metadata. When I make a search, I may get hits in more than one Document that refer to the same object. I have a HitCollector which knows if the object has alre
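One way to picture combining hits from several Documents that belong to the same object is a per-object accumulator. The sketch below is plain Java, not Lucene's HitCollector. The class name and the object identifier strings are hypothetical, and keeping both the maximum and the sum is only an assumption about which combinations might be of interest.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Plain-Java sketch: fold per-attachment scores into per-object scores. */
final class ObjectScoreCombiner {
    // Best single-attachment score seen so far for each object.
    private final Map<String, Float> best = new LinkedHashMap<>();
    // Sum of all attachment scores seen so far for each object.
    private final Map<String, Float> total = new LinkedHashMap<>();

    /** Call once per hit, with the id of the object the hit belongs to. */
    void collect(String objectId, float score) {
        best.merge(objectId, score, Float::max);
        total.merge(objectId, score, Float::sum);
    }

    /** Highest score among the object's hits, or 0 if it had none. */
    float maxScore(String objectId) {
        return best.getOrDefault(objectId, 0f);
    }

    /** Sum of the scores of the object's hits, or 0 if it had none. */
    float summedScore(String objectId) {
        return total.getOrDefault(objectId, 0f);
    }

    /** Number of distinct objects that received at least one hit. */
    int objectCount() {
        return best.size();
    }
}
```

Whether the maximum, the sum, or some other function is appropriate depends on how the combined score is meant to rank objects; this sketch does not settle that.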

Re: Issue while parsing XML files due to control characters, help appreciated.

2007-03-21 Thread Lokeya
Initially I was writing into the Index 7,00,000 times. I changed the code to now write only 70 times which means I am putting a lot of data in an array list and add to doc and index at one shot. This is where the improvement came from. To be precise, IndexWriter is now adding a document 70 times vs. 7,0

Re: retrieving matched slop

2007-03-21 Thread Chris Hostetter
there are a few options... you can define a custom Similarity that makes the score based entirely on the sloppyFreq ... it's not trivial, but it's certainly possible. the other option is to call SpanQuery.getSpans directly, and then iterate over it and compare end() - start() for each span. : Dat
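The second option, comparing end() - start() for each span, can be illustrated in plain Java. The `Span` class below is a hypothetical stand-in for the positions a span iterator would report, not a Lucene type. The sketch assumes `end` is one past the last matched position and that every clause matches exactly one token, so the slop actually used is the span width minus the clause count.

```java
import java.util.List;

/** Plain-Java sketch of measuring span widths; Span is a hypothetical stand-in. */
final class SpanWidthSketch {
    /** A matched span; end is assumed to be exclusive. */
    static final class Span {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Number of token positions the span covers. */
    static int width(Span span) {
        return span.end - span.start;
    }

    /** Positions beyond what the clauses themselves occupy (assumes one token per clause). */
    static int usedSlop(Span span, int clauseCount) {
        return Math.max(0, width(span) - clauseCount);
    }

    /** Returns the narrowest span, or null for an empty list. */
    static Span tightest(List<Span> spans) {
        Span bestSpan = null;
        for (Span span : spans) {
            if (bestSpan == null || width(span) < width(bestSpan)) {
                bestSpan = span;
            }
        }
        return bestSpan;
    }
}
```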

Re: Lazy field loading in

2007-03-21 Thread Chris Hostetter
: IndexReader class for lazy field loading, the search API in IndexSearcher : does not contain such facilities. Hence, the Documents I get from the : Hits.doc() would not benefit from the mentioned feature. Lazy loading stored fields is really about performance tweaking ... if you are that conce

Re: Thank you...

2007-03-21 Thread Cass Costello
You all rock. I'm clearing the semi-official legal hurdle with my CTO and our head counsel to full (or something close to full) disclosure of some of the architectural details, so stay tuned for as much as I'm allowed to share (and btw, for any of you that live/work/vacation in the SF Bay area, I

Lazy field loading in

2007-03-21 Thread jafarim
Hi, I am looking to make use of the latest lazy field loading in Lucene 2.1. I store the original bytes of a document, say a PDF file for example, in a special untokenized field in the index. Though there are enough facilities in IndexReader class for lazy field loading, the search API in IndexSea

Re: Thank you...

2007-03-21 Thread Chris Hostetter
: Care to write up a Use Case when you have a few spare cycles? : http://wiki.apache.org/lucene-java/UseCases Oh, Oh OH! ... competing requests for wiki submissions: Can you add some of the info about the performance numbers you are seeing to... http://wiki.apache.org/solr/SolrPerforman

Re: Thank you...

2007-03-21 Thread Grant Ingersoll
Hi Cass, Care to write up a Use Case when you have a few spare cycles? http://wiki.apache.org/lucene-java/UseCases -Grant On Mar 20, 2007, at 4:49 PM, Cass Costello wrote: Heh - it used to be in my sig ... my bad. Thanks, all. :) http://www.stubhub.com On 3/20/07, bruce <[EMAIL PROTEC

Re: Spelt, for better spelling correction

2007-03-21 Thread Martin Haye
The dictionary is generated from the corpus, with the result that a larger corpus gives better results. Words are queued up during an index run, and at the end are munged to create an optimized dictionary. It also supports incremental building, though the overhead would be too much for those appl

Re: Lucene search performance: linear?

2007-03-21 Thread Yonik Seeley
On 3/21/07, Peter Keegan <[EMAIL PROTECTED]> wrote: On a similar topic, has anybody measured query performance as a function of index size? Well, I did and the results surprised me. I measured query throughput on 8 indexes that varied in size from 55,000 to 4.4 million documents. When plotted on

Re: Lucene search performance: linear?

2007-03-21 Thread Peter Keegan
On a similar topic, has anybody measured query performance as a function of index size? Well, I did and the results surprised me. I measured query throughput on 8 indexes that varied in size from 55,000 to 4.4 million documents. When plotted on a graph, there is a distinct hyperbolic curve (1/x).

Re: TextMining.org Word extractor

2007-03-21 Thread Ryan Ackley
Sorry, I don't think there is any POI in my future :-) Long story. Maybe I'll blog about it or something. Stay tuned. I have another project that I'm interested in spending time on. Not sure if it's going to be open source at this point but it will utilize the textmining.org library so I plan on

Re: TextMining.org Word extractor

2007-03-21 Thread Grant Ingersoll
Last I remember, it was being voted on by the Incubator committee. Good to hear TextMining is back in action! Does that mean you are back on POI Word again too? -Grant On Mar 20, 2007, at 10:35 PM, Ryan Ackley wrote: Someone pointed me there already. Looks interesting. Is there a mailing

Re: Querying fragments of a tree structure

2007-03-21 Thread Erick Erickson
Is it a fair restatement of your problem that you want to generate a list of all children of a node? That's what I'm reading. Would it work for you to store the complete ancestry in each node? By that I mean (from your example), NOTE: it's no problem in Lucene to store different values for t
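The suggestion to store the complete ancestry in each node can be sketched in plain Java. This is an illustration of the data layout only, not Lucene code. The class name, the node identifiers, and the `parentOf` map are hypothetical, and the structure is assumed to be a proper tree with no cycles.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

/** Plain-Java sketch: each node carries the ids of all of its ancestors. */
final class AncestrySketch {
    /**
     * Walks the child -> parent map upward and returns the ancestors of a node,
     * nearest first. These are the values a node would store as its ancestry.
     * Assumes the map describes a tree (no cycles).
     */
    static List<String> ancestors(String node, Map<String, String> parentOf) {
        List<String> result = new ArrayList<>();
        String parent = parentOf.get(node);
        while (parent != null) {
            result.add(parent);
            parent = parentOf.get(parent);
        }
        return result;
    }

    /**
     * Finds every node whose stored ancestry contains the given id, in sorted
     * order. With ancestry stored per node, this is a single membership test
     * per node rather than a recursive walk down the tree.
     */
    static List<String> descendants(String ancestor, Map<String, String> parentOf) {
        TreeSet<String> result = new TreeSet<>();
        for (String node : parentOf.keySet()) {
            if (ancestors(node, parentOf).contains(ancestor)) {
                result.add(node);
            }
        }
        return new ArrayList<>(result);
    }
}
```

The trade-off in this layout is that moving a subtree means rewriting the stored ancestry of every node beneath it.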

Querying fragments of a tree structure

2007-03-21 Thread Emanuel Schleussinger
Hi, first, thanks for this great resource, and sorry if I am oversimplifying a few things, I am still rather new to Lucene. I have been thinking about how to integrate my app with Lucene - it is a CMS type system that has documents organized in a tree-style layout. A few facts about the system: -