How to get the intersection of two Hits?

2007-03-28 Thread is_maximum
Hi, suppose we have two Hits objects. We need the documents that exist in both of them, ignoring the others. Is there any workaround? Thanks. Regards, Mohammad
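One possible approach to the question above, sketched under assumptions: if the document numbers of each result set have already been copied into a java.util.BitSet (for example from a HitCollector), the intersection is a single and() call. The class and method names below are hypothetical and are not part of Lucene; only java.util.BitSet is used.

```java
import java.util.BitSet;

// Hypothetical helper, not part of Lucene: works on plain document numbers.
class HitIntersection {

    // Marks each document number as a set bit.
    static BitSet toBitSet(int[] docIds) {
        BitSet bits = new BitSet();
        for (int id : docIds) {
            bits.set(id);
        }
        return bits;
    }

    // Returns a new BitSet holding only the documents present in both inputs.
    // The inputs are left unchanged because and() is applied to a copy.
    static BitSet intersect(BitSet first, BitSet second) {
        BitSet both = (BitSet) first.clone();
        both.and(second);
        return both;
    }

    public static void main(String[] args) {
        BitSet a = toBitSet(new int[] {1, 4, 7, 9});
        BitSet b = toBitSet(new int[] {4, 5, 9, 12});
        System.out.println(intersect(a, b)); // prints {4, 9}
    }
}
```

This only makes sense when both result sets come from the same index, so that a given document number refers to the same document in each.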

Re: why Apache doesnt create a nice forum like the others???

2007-03-28 Thread Mohammad Norouzi
I registered on Nabble, but to post a message you have to subscribe to the Lucene mailing list, and if you subscribe to the mailing list your inbox fills up with messages. This is very bad!!! On 3/28/07, John Haxby <[EMAIL PROTECTED]> wrote: Grant Ingersoll wrote: > I like the mailing list approach m

question about HitCollector

2007-03-28 Thread is_maximum
Hello. A question about the collect(int doc, float score) method: which ID does the doc argument refer to? The original ID in the index, or the ID of the search result (the position of the document in the search result)? I ask this because I implemented a HitCollector that collects the IDs in a BitSet and it was ver

Re: TF-IDF API

2007-03-28 Thread Sengly Heng
Thank you very much for your time. Here is a sample of a vector of terms : v1 = {"sad", "john", "intelligent", "news", "USA", "disneyland", "MIT", "cambridge", "marry", ...} I'll try out your method. Best regards, Sengly On 3/28/07, karl wettin <[EMAIL PROTECTED]> wrote: 28 mar 2007 kl.

Re: index file size threshold affecting search performance?

2007-03-28 Thread Mike Klaas
On 3/28/07, Scott Oshima <[EMAIL PROTECTED]> wrote: So I assumed a linear decay of performance as an index got bigger. For some reason, going from an index size of 1.89 to 1.95 gigs dramatically increased CPU usage across all of our servers. I was thinking of splitting the 1.95 index into 2 separ

Re: index file size threshold affecting search performance?

2007-03-28 Thread Erick Erickson
Well, if you're adding data, you must be doing something with it later. Are you sure the problem is in the index growing and not how you use the data afterwards? The reason I ask is that I found almost a 10-fold increase in my app's performance when I used FieldSelector (Lucene 2.1) to only load t

Re: Start/end offsets in analyzers

2007-03-28 Thread Antony Bowesman
Thanks Erik. For our purposes it seems more generally useful to use the original start/end offsets. Antony Erik Hatcher wrote: They aren't used implicitly by anything in Lucene, but can be very handy for efficient highlighting. Where you set the offsets really all depends on how you plan

RE: index file size threshold affecting search performance?

2007-03-28 Thread Oshima, Scott
Yeah, it might be a hardware issue. With a slightly smaller index with less stored data, the performance is what we want it to be. Just adding 5% more stored data (unindexed, of course) pushes us over some sort of threshold, causing performance to tank. -Original Message- From: Erik Hatc

RE: Using Lucene to apply LSI

2007-03-28 Thread José Ramón Pérez Agüera
You need to use JAMA combined with Lucene, using the vectors that are built by Lucene to compute the SVD with JAMA: http://math.nist.gov/javanumerics/jama/ Best, jose José Ramón Pérez Agüera Dept. de Ingeniería del Software e Inteligencia Artificial Despacho 411 tlf. 913947599 Facultad de Inform

Re: Using Lucene to apply LSI

2007-03-28 Thread José Ramón Pérez Agüera
You need to use JAMA combined with Lucene, using the vectors that are built by Lucene to compute the SVD with JAMA: http://math.nist.gov/javanumerics/jama/ Best, jose On 3/28/07, Mark Stiner <[EMAIL PROTECTED]> wrote: Hi, I have a research project where I want to implement

Re: index file size threshold affecting search performance?

2007-03-28 Thread Erik Hatcher
I've just built a 9.3G index (admittedly tons of stored data in there, 3.3M documents) and performance is amazing (through Solr). Erik On Mar 28, 2007, at 3:11 PM, Erick Erickson wrote: This surprises me, I'm currently working with a 4G index, and the improvement from when it was a

Using Lucene to apply LSI

2007-03-28 Thread Mark Stiner
Hi, I have a research project where I want to implement the LSI technique. The scenario is as follows: search news sites for local, event-based news, then cluster similar news items together, for example a hurricane in New York City. We want to apply basic LSI a

Re: index file size threshold affecting search performance?

2007-03-28 Thread Erick Erickson
This surprises me, I'm currently working with a 4G index, and the improvement from when it was an 8G index was only 10% or so. And it's plenty speedy. Are you hitting hardware limitations and perhaps swapping like crazy? In which case, unless you split things across several machines, I doubt it w

index file size threshold affecting search performance?

2007-03-28 Thread Scott Oshima
So I assumed a linear decay of performance as an index got bigger. For some reason, going from an index size of 1.89 to 1.95 gigs dramatically increased CPU usage across all of our servers. I was thinking of splitting the 1.95 index into 2 separate indexes and using a multisearcher on those parts

integration of lucene into opencms

2007-03-28 Thread mohamed hadj taieb
I have integrated Lucene to search in files. I tried it with the jsp-examples like this. 1st step: indexing the JSP-examples files. C:\Program Files\Apache Software Foundation\Tomcat 5.5\webapps\luceneweb\WEB-INF\lib>java org.apache.lucene.demo.IndexFiles "C:\Program Files\Apache Software

Re: failure while indexing files

2007-03-28 Thread mohamed hadj taieb
thx a 2007/3/28, Arun M.A. <[EMAIL PROTECTED]>: I used to have this classpath issue when I set the classpath for Java. Try using the DOS-based directory path instead of the Windows name, e.g. use DOCUME~1 instead of Documents and Settings. Use dir /x to find the DOS name of the directory. On 3/28/07,

Re: Custom Analyzer Help please

2007-03-28 Thread Grant Ingersoll
OK, gotcha. I now see what you mean. StandardAnalyzer uses the StandardTokenizer, whereas StopAnalyzer uses the LowerCaseTokenizer, which divides text at non-letters. What you most likely will need to do is create a Tokenizer that outputs the original token, and outputs the parts of it

Re: Custom Analyzer Help please

2007-03-28 Thread TimF
Grant, Thanks for your reply and the pointer to the custom code sample. I will be checking into that today. I did delve into the src for the OOTB analyzers and was aware of what they did. Still, the StandardAnalyzer does not do what I want. The real difference between my needs and the results of t

Re: TF-IDF API

2007-03-28 Thread karl wettin
28 mar 2007 kl. 15.24 skrev Sengly Heng: Thank you, but I still have no clue how to do that using Weka after taking a look at its API. Let me reformulate my problem: I have a collection of vectors of terms (actually each vector of terms represents the list of tokens extracted from

Unison index handling (was: Reverse search)

2007-03-28 Thread karl wettin
28 mar 2007 kl. 13.22 skrev mark harwood: Odd. I'm sure it used to have a getReader method somewhere. Still, you can use MemoryIndex.createSearcher().getIndexReader() I've wrapped MemoryIndex in the unison index facade of LUCENE-550, just as I did with all the other index implementations

Re: why Apache doesnt create a nice forum like the others???

2007-03-28 Thread John Haxby
Grant Ingersoll wrote: I like the mailing list approach much better. With a good set of rules and folders in place (which takes about 15 minutes to set up), one can easily manage large volumes of mail w/o batting an eye, whereas forums require large amounts of navigation, IMO. Glad I'm not th

Re: TF-IDF API

2007-03-28 Thread Sengly Heng
Thank you, but I still have no clue how to do that using Weka after taking a look at its API. Let me reformulate my problem: I have a collection of vectors of terms (actually each vector of terms represents the list of tokens extracted from a file) and I do not have the original files.

Re: TF-IDF API

2007-03-28 Thread karl wettin
28 mar 2007 kl. 10.36 skrev Sengly Heng: Does any of you know a Java API that directly handles this problem? Or do I have to implement it from scratch? You can also try weka.filters.unsupervised.attribute.StringToWordVector, it has many neat features you might be interested in. And if app

Re: Start/end offsets in analyzers

2007-03-28 Thread Erik Hatcher
On Mar 28, 2007, at 3:51 AM, Antony Bowesman wrote: I'm fiddling with custom analyzers to analyze email addresses to store the full email address and the component parts. It's based on Solr's analyzer framework, so I have a StandardTokenizerFactory followed by an EmailFilterFactory. It p

Re: why Apache doesnt create a nice forum like the others???

2007-03-28 Thread Grant Ingersoll
I like the mailing list approach much better. With a good set of rules and folders in place (which takes about 15 minutes to set up), one can easily manage large volumes of mail w/o batting an eye, whereas forums require large amounts of navigation, IMO. On Mar 28, 2007, at 7:53 AM, Ted Hu

Re: TF-IDF API

2007-03-28 Thread Grant Ingersoll
You can pass in a String or a Reader to Field when indexing. There is nothing file specific about Lucene when it comes to indexing. Take a look at the Field class for the various constructors. On Mar 28, 2007, at 8:20 AM, Sengly Heng wrote: Thanks but in my case I do not have the files. W

Re: why Apache doesnt create a nice forum like the others???

2007-03-28 Thread Ted Husted
Since there are so many ASF projects, launching new infrastructure initiatives can be difficult. Infrastructure wants everything to scale, so that everyone can use it. We also want to be sure that there are volunteers with the interest and ability to keep the product up and running. To get produc

Re: TF-IDF API

2007-03-28 Thread Sengly Heng
Thanks, but in my case I do not have the files. What I have is just a collection of vectors of terms. Does Lucene provide any means to index each vector of terms as a file? Or is there a better way to do it? Thanks, everyone, once again. Regards, Sengly On 3/28/07, thomas arni <[EMAIL PROTECTED]> wr

Re: Reverse search

2007-03-28 Thread mark harwood
Odd. I'm sure it used to have a getReader method somewhere. Still, you can use MemoryIndex.createSearcher().getIndexReader() - Original Message From: Melanie Langlois <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Wednesday, 28 March, 2007 8:38:24 AM Subject: RE: Reverse searc

Re: failure while indexing files

2007-03-28 Thread Arun M.A.
I used to have this classpath issue when I set the classpath for Java. Try using the DOS-based directory path instead of the Windows name, e.g. use DOCUME~1 instead of Documents and Settings. Use dir /x to find the DOS name of the directory. On 3/28/07, mohamed hadj taieb <[EMAIL PROTECTED]> wrote: Hi

failure while indexing files

2007-03-28 Thread mohamed hadj taieb
Hi, I have downloaded the lucene-2.1.0.zip file and tried to integrate it into my web application. I added the 2 jar files to the classpath in the environment variables like this: classpath: .;C:\Program Files\Apache Software Foundation\Tomcat 5.5\webapps\luceneweb\WEB-INF\lib\lucene-demos-2.1.

Re: why Apache doesnt create a nice forum like the others???

2007-03-28 Thread John Haxby
karl wettin wrote: The way I see it (and probably many others do too), mailing lists are superior in many ways, especially when following multiple forums. It's true. Any forum that I need to subscribe to I find an RSS feed for so that I can get mail messages. Forums are a pain in the neck once you'

Re: index word files ( doc )

2007-03-28 Thread John Haxby
Daniel Noll wrote: The only screenshots I can see look like plain text to me, and I'm currently working on something which needs to convert Word to HTML, which is why I ask. wvWare, which I mentioned earlier, can convert word to HTML and does a pretty good job of maintaining formatting. abiwor

Re: TF-IDF API

2007-03-28 Thread thomas arni
Have a look at the "TermDocs" interface in the API. You can get the term frequency with an open IndexReader: TermDocs termDocs = reader.termDocs(term); where "term" represents the current Term. Now you can call termDocs.freq() to get the frequency of the term within the current document. For th

TF-IDF API

2007-03-28 Thread Sengly Heng
Hello Luceners, I have a collection of vectors of terms (tokens) that I extracted from files. I am looking for ways to calculate the TF/IDF of each term. I wanted to use Lucene to do this, but Lucene is made for collections of files and in my case I have already extracted those files into vector of te
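For the situation described above, where only the term vectors are available, a minimal sketch in plain Java might look like the following. It uses no Lucene classes; the class name, method names, and the weighting tf * ln(N / df) are assumptions chosen for illustration, not Lucene's own scoring formula, and the sample terms are made up.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; none of these names come from Lucene.
class TfIdfSketch {

    // Raw count of each term inside one vector.
    static Map<String, Integer> termFrequencies(List<String> vector) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : vector) {
            tf.merge(term, 1, Integer::sum);
        }
        return tf;
    }

    // Number of vectors in which each term occurs at least once.
    static Map<String, Integer> documentFrequencies(List<List<String>> vectors) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> vector : vectors) {
            // De-duplicate so a term counts once per vector.
            for (String term : new HashSet<>(vector)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Assumed weighting: raw term frequency times ln(N / df).
    static double tfIdf(int tf, int df, int numVectors) {
        return tf * Math.log((double) numVectors / df);
    }

    public static void main(String[] args) {
        List<List<String>> vectors = Arrays.asList(
                Arrays.asList("sad", "john", "news", "john"),
                Arrays.asList("news", "USA"));
        Map<String, Integer> df = documentFrequencies(vectors);
        for (List<String> vector : vectors) {
            for (Map.Entry<String, Integer> e : termFrequencies(vector).entrySet()) {
                double weight = tfIdf(e.getValue(), df.get(e.getKey()), vectors.size());
                System.out.println(e.getKey() + " -> " + weight);
            }
        }
    }
}
```

With this weighting, a term that appears in every vector gets a weight of zero, which is the usual intent of the IDF factor; other variants (smoothing, log-scaled TF) can be swapped into tfIdf without touching the counting code.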