Re: a faster way to addDocument and get the ID just added?

2011-03-29 Thread Li Li
A merge will also change doc IDs; every segment's doc IDs begin at 0. 2011/3/30 Trejkaz : > On Tue, Mar 29, 2011 at 11:21 PM, Erick Erickson > wrote: >> I'm always skeptical of storing the doc IDs since they can >> change out from underneath you (just delete even a single >> document and optimize). > > W

Re: a faster way to addDocument and get the ID just added?

2011-03-29 Thread Trejkaz
On Tue, Mar 29, 2011 at 11:21 PM, Erick Erickson wrote: > I'm always skeptical of storing the doc IDs since they can > change out from underneath you (just delete even a single > document and optimize). We never delete documents. Even when a feature request came in to update documents (i.e. dele

Re: Best practice for stemming and exact matching

2011-03-29 Thread Robert Muir
On Tue, Mar 29, 2011 at 6:56 PM, Christopher Condit wrote: > Ideally I'd like to have the parser use the > custom analyzer for everything unless it's going to parse a clause into > a PhraseQuery or a MultiPhraseQuery, in which case it uses the > SimpleAnalyzer and looks in the _exact field - but

Best practice for stemming and exact matching

2011-03-29 Thread Christopher Condit
I have Lucene indexes built using a shingled, stemmed custom analyzer. I have a new requirement that exact searches match correctly, i.e. bar AND "nachos" will only fetch results with plural nachos. Right now, with the stemming, singular nacho results are returned as well. I realize that I'm going t

Re: Filter to retrieve random documents without specific terms ?

2011-03-29 Thread Patrick Diviacco
One last thing, how do I check if the random document does not contain the term ? In other words, I cannot just pass the TermsFilter but I need to check if the retrieved random document is valid or not to know if I have enough. Any code example is appreciated.. so far I have this one, to retrieve

Re: Filter to retrieve random documents without specific terms ?

2011-03-29 Thread Ian Lea
> Plan A sounds better because I don't want to consider the entire collection > and then remove results from it. Fine, your choice. > However, the same code has to work with 2 different collections. The first > one has 30.000 docs the other one 90.000. No problem. The number of docs is irreleva

Re: Filter to retrieve random documents without specific terms ?

2011-03-29 Thread Patrick Diviacco
Plan A sounds better because I don't want to consider the entire collection and then remove results from it. However, the same code has to work with 2 different collections. The first one has 30.000 docs the other one 90.000. How can I get the total amount of docs from a collection ? thanks O

Re: Filter to retrieve random documents without specific terms ?

2011-03-29 Thread Ian Lea
Here are a couple of ideas. Plan A. Think of a number, say 10, retrieve n * 10 docids in your search and then loop round java.util.Random.nextInt(n * 10) until you've got enough. Plan B. Reverse your MUST NOT search to get a list of docids that you don't want, then loop round Random.nextInt(ind
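
Plan A above can be sketched in plain Java. This is only an illustration under stated assumptions: the class name RandomDocSampler and its sample method are hypothetical, and candidateDocIds stands in for the n * 10 docids that the search has already returned. No Lucene calls are shown.

```java
import java.util.LinkedHashSet;
import java.util.Random;
import java.util.Set;

// Hypothetical helper, not part of Lucene: picks n distinct docids at random
// from an array of candidates that a search has already produced.
class RandomDocSampler {

    static int[] sample(int[] candidateDocIds, int n, Random random) {
        // Never ask for more documents than there are candidates,
        // otherwise the loop below could not terminate.
        int wanted = Math.min(n, candidateDocIds.length);

        // Loop round nextInt(...) until enough distinct positions are chosen;
        // the set silently ignores positions that were already picked.
        Set<Integer> chosenPositions = new LinkedHashSet<>();
        while (chosenPositions.size() < wanted) {
            chosenPositions.add(random.nextInt(candidateDocIds.length));
        }

        // Map the chosen positions back to the docids they refer to.
        int[] result = new int[wanted];
        int i = 0;
        for (int position : chosenPositions) {
            result[i++] = candidateDocIds[position];
        }
        return result;
    }
}
```

With 100 candidates and n = 10, a call such as RandomDocSampler.sample(candidates, 10, new Random()) returns 10 distinct docids. Oversampling by a factor of 10 keeps the retry loop short, because collisions are rare when the pool is much larger than n.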

RE: cannot find org.apache.lucene.search.TermsFilter

2011-03-29 Thread Uwe Schindler
It is in the zip/tar.gz file from Hudson under contrib! Alternatively load the maven artifacts from Lucene's Maven Build in Hudson: https://builds.apache.org/hudson/job/Lucene-Solr-Maven-trunk/lastSuccessfulBuild/artifact/maven_artifacts/org/apache/lucene/ Uwe - Uwe Schindler H.-H.-Meier-All

Re: Filter to retrieve random documents without specific terms ?

2011-03-29 Thread Patrick Diviacco
Ok I've solved the first part of the problem. I'm now selecting all documents that do not contain a given term with a BooleanFilter and FilterClause, MUST NOT. I still have to understand how to retrieve random documents and limit the number of retrieved docs to N. thanks On 29 March 2011 20:40,

Filter to retrieve random documents without specific terms ?

2011-03-29 Thread Patrick Diviacco
Is there a Filter to get a limited number of random collection docs from the index which DO NOT contain a specific term ? i.e. term="pizza" I want to run the query against 10 random documents of the collection that do not contain the term "pizza". thanks

Re: cannot find org.apache.lucene.search.TermsFilter

2011-03-29 Thread Patrick Diviacco
Never mind, I've compiled it using ant. Solved, thanks. On 29 March 2011 17:41, Patrick Diviacco wrote: > Ok, the svn repository I can only find the source files. Should I build the > jar by myself or is there a packaged jar to download ? > > thanks > > > On 29 March 2011 16:00, Uwe Schindler wrot

Re: cannot find org.apache.lucene.search.TermsFilter

2011-03-29 Thread Patrick Diviacco
OK, in the svn repository I can only find the source files. Should I build the jar myself or is there a packaged jar to download? Thanks. On 29 March 2011 16:00, Uwe Schindler wrote: > Hi, > > The TermsFilter is not in Lucene Core. You have to use one of the contrib > JARS (I think it is contri

[infomercial] Lucene Refcard at DZone

2011-03-29 Thread Erik Hatcher
I've written an "Understanding Lucene" refcard that has just been published at DZone. See here for details: http://www.lucidimagination.com/blog/2011/03/28/understanding-lucene-by-erik-hatcher-free-dzone-refcard-now-available/ If you're new to Lucene or Solr, this refcard will be a nice gro

RE: cannot find org.apache.lucene.search.TermsFilter

2011-03-29 Thread Uwe Schindler
Hi, The TermsFilter is not in Lucene Core. You have to use one of the contrib JARS (I think it is contrib-queries, so should be lucene-queries.jar). Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: P

Re: Definition Extraction

2011-03-29 Thread vineet yadav
If you are using machine learning techniques for concept learning then you can use the Mahout library. Mahout has plenty of clustering algorithms (https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms) which are useful for concept learning. Thanks Vineet Yadav On Tue, Mar 29, 2011 at 12:42 PM, h

Re: Definition Extraction

2011-03-29 Thread Grant Ingersoll
Java codes to do what? You might ask machine learning questions over on u...@mahout.apache.org, but please provide more details on what you are doing. On Mar 29, 2011, at 3:12 AM, henok sahilu wrote: > Hello All > Recently, I am trying to develop an automatic definition extraction system >

Re: cannot find org.apache.lucene.search.TermsFilter

2011-03-29 Thread Patrick Diviacco
I get that in response to this: import org.apache.lucene.search.TermsFilter; well I'm only using this jar: lucene-core-4.0-20110304.141738-1.jar and for example this line of my code compiles correctly: booleanQuery.add(new QueryParser(org.apache.lucene.util.Version.LUCENE_40, "tags", new Whitesp

Re: cannot find org.apache.lucene.search.TermsFilter

2011-03-29 Thread Erick Erickson
You get this in response to doing what? Are you sure you've unpackaged the nightly build and aren't inadvertently getting older jars? Best Erick On Tue, Mar 29, 2011 at 7:21 AM, Patrick Diviacco wrote: > I've downloaded the nightly build of Lucene (TRUNK) and I'm referring to the > following doc

Re: a faster way to addDocument and get the ID just added?

2011-03-29 Thread Erick Erickson
I'm always skeptical of storing the doc IDs since they can change out from underneath you (just delete even a single document and optimize). What is it you're doing with the doc ID that you couldn't do with the guid? If your "guid list" were ordered, I can imagine building filters quite quickly fro
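
The concern about doc IDs shifting can be illustrated with a toy model. This is not Lucene code: ToyIndex and its methods are hypothetical, and the only assumption borrowed from the thread is that a doc ID is a position that is renumbered once deleted documents are dropped (as on optimize or merge), whereas a guid travels with the document.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only, not Lucene: a document's "doc ID" is its position in a list.
class ToyIndex {
    // Each slot holds the guid of one document; null marks a deleted document.
    private final List<String> slots = new ArrayList<>();

    // Adds a document and returns the position it currently occupies.
    int add(String guid) {
        slots.add(guid);
        return slots.size() - 1;
    }

    // Marks a document as deleted without renumbering anything yet.
    void delete(int docId) {
        slots.set(docId, null);
    }

    // Drops deleted documents, which shifts every later document down.
    void compact() {
        slots.removeIf(guid -> guid == null);
    }

    // Looks a document up by its stable key; returns -1 if it is gone.
    int docIdOf(String guid) {
        return slots.indexOf(guid);
    }
}
```

After adding "a", "b" and "c", deleting position 0 and compacting, docIdOf("c") changes from 2 to 1. A stored doc ID of 2 would then point at nothing, while a lookup by guid still finds the document.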

cannot find org.apache.lucene.search.TermsFilter

2011-03-29 Thread Patrick Diviacco
I've downloaded the nightly build of Lucene (TRUNK) and I'm referring to the following documentation: https://hudson.apache.org/hudson/view/G-L/view/Lucene/job/Lucene-trunk/javadoc/all/index.html But I get: cannot find symbol symbol : class TermsFilter location: package org.apache.lucene.search

Re: Can I run a query against few specific docs of the collection only ?

2011-03-29 Thread Ian Lea
http://wiki.apache.org/lucene-java/LuceneFAQ#How_do_I_restrict_searches_to_only_return_results_from_a_limited_subset_of_documents_in_the_index_.28e.g._for_privacy_reasons.29.3F_What_is_the_best_way_to_approach_this.3F -- Ian. On Tue, Mar 29, 2011 at 11:54 AM, Patrick Diviacco wrote: > hi, > >

Can I run a query against few specific docs of the collection only ?

2011-03-29 Thread Patrick Diviacco
hi, Can I run a query against few specific docs of the collection only ? Can I filter the built collection according to documents fields content ? For example I would like to query over documents having field2 = "abc". thanks

Re: should I import the XML file into a mysql dataset ?

2011-03-29 Thread Ian Lea
> 1 - I'm using commons Digester as xml parser, how can I find the bottleneck > ? Should I run the code and comment out the Lucene queries part and just > leave the xml parsing ? That is what I was suggesting. > 2 - I actually also wanted to know the following: how much does it take to > run a 10

Re: should I import the XML file into a mysql dataset ?

2011-03-29 Thread Patrick Diviacco
1 - I'm using commons Digester as xml parser, how can I find the bottleneck ? Should I run the code and comment out the Lucene queries part and just leave the xml parsing ? 2 - I actually also wanted to know the following: how much does it take to run a 100MB queries text file against each single

Re: should I import the XML file into a mysql dataset ?

2011-03-29 Thread Patrick Diviacco
My machine is Intel Dual Duo Core with 4GB ram.. is there something wrong here ? On 29 March 2011 11:22, Patrick Diviacco wrote: > hi, > > I performing multiple queries (stored in a 100MB XML file) against a > collection (indexed with lucene, and it was stored before in a 100MB XML > file). > >

Re: should I import the XML file into a mysql dataset ?

2011-03-29 Thread Ian Lea
You need to figure out what is taking the time, for example by reading the XML file without making any lucene queries. What XML parsing process are you using? Some are faster than others. A google search should find loads of info. If it turns out that it is lucene searching taking most of the t
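
One way to find out what is taking the time is to wrap each stage in a timer and compare totals. The sketch below is a minimal illustration: StageTimer and the stage names are hypothetical, and the Runnable bodies stand in for the real XML parsing and Lucene search code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: accumulates wall-clock time per named stage.
class StageTimer {
    private final Map<String, Long> totalNanos = new LinkedHashMap<>();

    // Runs the work and adds its elapsed time to the stage's running total,
    // even if the work throws.
    void time(String stage, Runnable work) {
        long start = System.nanoTime();
        try {
            work.run();
        } finally {
            totalNanos.merge(stage, System.nanoTime() - start, Long::sum);
        }
    }

    // Total nanoseconds recorded for a stage, or 0 if it never ran.
    long nanos(String stage) {
        return totalNanos.getOrDefault(stage, 0L);
    }

    // Name of the stage with the largest total, or null if nothing was timed.
    String slowest() {
        String worst = null;
        long worstNanos = -1L;
        for (Map.Entry<String, Long> entry : totalNanos.entrySet()) {
            if (entry.getValue() > worstNanos) {
                worst = entry.getKey();
                worstNanos = entry.getValue();
            }
        }
        return worst;
    }
}
```

In use, the parsing code would go inside timer.time("parse", ...) and each query inside timer.time("search", ...). Running once with the search calls commented out, as suggested above, gives the same information more crudely.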

should I import the XML file into a mysql dataset ?

2011-03-29 Thread Patrick Diviacco
Hi, I am performing multiple queries (stored in a 100MB XML file) against a collection (indexed with Lucene; it was previously stored in a 100MB XML file). The process seems pretty long on my machine (more than 2 hours), so I was wondering if importing the 100MB queries XML file into a mysql dataset

Re: comparing lucene scores across queries

2011-03-29 Thread Patrick Diviacco
hey Uwe, so from your last answer, I understand I'm done.. no need to do anything, I can already compare the queries. However there is actually a misunderstanding: my booleanqueries have variable number of boolean clauses because the fields are fixed but the terms per field are not. So, for exampl

RE: comparing lucene scores across queries

2011-03-29 Thread Uwe Schindler
> thanks for your reply. I thought I've solved the issue according to Uwe, the > queries without coord function were reasonably comparable, but now you > actually reopened it. > > So, I need to be sure I'm making them comparable and I would like to ask the > following. > > My BooleanQueries have

Re: comparing lucene scores across queries

2011-03-29 Thread Patrick Diviacco
hey Hoss, thanks for your reply. I thought I've solved the issue according to Uwe, the queries without coord function were reasonably comparable, but now you actually reopened it. So, I need to be sure I'm making them comparable and I would like to ask the following. My BooleanQueries have simil

Definition Extraction

2011-03-29 Thread henok sahilu
Hello All Recently, I am trying to develop an automatic definition extraction system for Amharic Language - using machine learning technique (Version Space learning). Can anyone suggest me some java codes to start with? Thank You Henok