Re: best practice: 1.4 billions documents

2010-11-25 Thread Robert Muir
On Thu, Nov 25, 2010 at 2:58 AM, Uwe Schindler wrote: > ParallelMultiSearcher as subclass of MultiSearcher has the same problems. > These are not crashes, but more that some queries do not return correctly > scored results. This affects especially all MultiTermQueries > (TermRang

Re: best practice: 1.4 billions documents

2010-11-25 Thread Ganesh
Thanks for the input. My results are sorted by date and I am not much bothered about score. Will I still be in trouble? Regards Ganesh - Original Message - From: "Robert Muir" To: Sent: Thursday, November 25, 2010 1:45 PM Subject: Re: best practice: 1.4 billions documents On Thu,

RE: best practice: 1.4 billions documents

2010-11-25 Thread Uwe Schindler
You are in trouble if you use MultiTermQuery subclasses as a negative clause in a BooleanQuery, e.g. a range like "-[A TO B]" or even NumericRanges or Wildcards. The query will then return incorrect results. - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetap

Re: custom attributs in tokens

2010-11-25 Thread Simon Willnauer
Hi Jan, On Wed, Nov 24, 2010 at 9:12 AM, wrote: > Of course: > > We are trying to search in documents that contain text in several languages. > We are also investigating other approaches*, so this is not about finding > other variants. > the goal is to only match tokens from 1 or more given la

Retrieve found keywords from document

2010-11-25 Thread Claudia Grieco
Hi guys, I have this problem: I'm using Lucene to create a search engine on people profiles. I have a set of hobbies (let's say {"reading" , "singing"} for example) and I want to find people who have at least one of these hobbies AND which of these hobbies they have. Currently I search for eac

Re: Retrieve found keywords from document

2010-11-25 Thread Ian Lea
Can't you just store the hobbies as standard stored fields (Field.Store.YES), or as a single field, call doc.get("hobbies") and do what you want with them? This sounds rather like faceting - if so you might want to consider using Solr. http://wiki.apache.org/solr/SolrFacetingOverview -- Ian. O

IndexWriter Class

2010-11-25 Thread McGibbney, Lewis John
Hello List, Lucene 3.0.1 Windows Vista Premium Home Edition I am currently attempting to configure my IndexFiles.java file. My intention is to add the following functionality to the code as I require input text to be further analyzed than what the default analyzer does. IndexWriter writer = n

not indexing analyzed field

2010-11-25 Thread Bernd Fehling
I used KeywordAnalyzer and KeywordTokenizer as templates for a new analyzer. The analyzer works fine but the result never reaches the index. My analyzer is called in "DocInverterPerField.processFields" with "stream.incrementToken()". ... try { boolean hasMoreTokens = stream.incrementToken();

R: Retrieve found keywords from document

2010-11-25 Thread Claudia Grieco
What I call "profile" is free text (extracted from a pdf) and not the result of the user listing hobbies in a form. So to store hobbies in a field called "hobbies" I have to extract hobbies from text first... is it possible to do it using Lucene? -Original message- From: Ian Lea [mailto:ian

Re: IndexWriter Class

2010-11-25 Thread Ian Lea
The normal technique is to write your own analyzer. See http://wiki.apache.org/lucene-java/LuceneFAQ#How_do_I_write_my_own_Analyzer.3F. Then pass that to IndexWriter - and be sure to use the same analyzer when you are searching, unless you're doing clever things. -- Ian. On Thu, Nov 25, 2010 a

Re: Retrieve found keywords from document

2010-11-25 Thread Ian Lea
You could parse the output from the lucene analyzer that you are using to get hold of a list of terms and pick the ones that are hobbies. Or do it outside lucene using whatever string parsing technique you like. Or take a look at the recent thread on this list on a similar topic: "High frequency
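A minimal sketch of the "pick the terms that are hobbies" idea, in plain Java. The lowercase-and-split step is only a stand-in for whatever Lucene analyzer is actually in use (an assumption, not the analyzer's real behaviour), and the class and method names (HobbyMatcher, foundHobbies) are hypothetical:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

public class HobbyMatcher {

    // Returns the hobbies that occur as single tokens in the profile text,
    // in order of first appearance.
    // ASSUMPTION: lowercasing and splitting on non-letter/non-digit characters
    // approximates the analyzer output; a real analyzer may also stem or drop stop words.
    static Set<String> foundHobbies(String profileText, Set<String> hobbies) {
        Set<String> found = new LinkedHashSet<>();
        String[] tokens = profileText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+");
        for (String token : tokens) {
            if (hobbies.contains(token)) {
                found.add(token);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        Set<String> hobbies = new HashSet<>(Arrays.asList("reading", "singing"));
        System.out.println(foundHobbies("I enjoy reading and Singing in a choir.", hobbies));
    }
}
```

The hobby set must be written in the same normalized form the tokenizer produces (lowercase here), otherwise nothing matches. Multi-word hobbies would need phrase handling that this sketch does not attempt.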

Re: custom attributs in tokens

2010-11-25 Thread Jan Kurella
Hi Simon, On 25.11.2010 10:40, ext Simon Willnauer wrote: Hi Jan, On Wed, Nov 24, 2010 at 9:12 AM, wrote: Of course: We are trying to search in documents that contain text in several languages. We are also investigating other approaches*, so this is not about finding other variants. the go

Re: custom attributs in tokens

2010-11-25 Thread Simon Willnauer
On Thu, Nov 25, 2010 at 3:25 PM, Jan Kurella wrote: > Hi Simon, > > On 25.11.2010 10:40, ext Simon Willnauer wrote: >> >> Hi Jan, >> >> On Wed, Nov 24, 2010 at 9:12 AM,  wrote: >>> >>> Of course: >>> >>> We are trying to search in documents that contain text in several >>> languages. We are also i

R: Retrieve found keywords from document

2010-11-25 Thread Claudia Grieco
Thanks a lot. I used the lucene analyzer to parse the profile and everything works :) -Original message- From: Ian Lea [mailto:ian@gmail.com] Sent: Thursday, 25 November 2010 14:52 To: java-user@lucene.apache.org Subject: Re: Retrieve found keywords from document You could parse the

Re: not indexing analyzed field

2010-11-25 Thread Erick Erickson
What is your evidence that "the result never reaches the index?" Are you sure: 1> you commit afterwards 2> you reopen the underlying reader to see 3> if you don't store the value for the field, how are you sure? 4> If you search and don't find it, did you index it? First, I'd be sure the value in

RE: not indexing analyzed field

2010-11-25 Thread Uwe Schindler
field.fieldsData is used for the stored field contents and so only *stored* in index, of course not analyzed (why should I analyze a stored field). The indexed tokens go of course through your analyzer and the returned tokens are indexed as terms. Where is the problem? - Uwe Schindler H.-H.-Me

Re: not indexing analyzed field

2010-11-25 Thread Bernd Fehling
Hi Erick, my evidence is that I load a single document into an empty index with a field "id" and a second field "dcdocid". The field "dcdocid" has the word "foo". This goes through my analyzer, which changes it to an MD5 string, "acbd18db4cc2f85cedef654fccc4a4d8". After indexing and commit a se
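For anyone wanting to check the expected term value outside Lucene, the MD5 hex string quoted in the message can be reproduced with the JDK alone. This is only a sketch of the hashing step, not the poster's analyzer; the class and method names (Md5Demo, md5Hex) are hypothetical, and UTF-8 input encoding is an assumption:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5Demo {

    // Returns the lowercase hex MD5 digest of the input string.
    // ASSUMPTION: the input is encoded as UTF-8 before hashing.
    static String md5Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b)); // two hex digits per byte
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform implementation
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(md5Hex("foo")); // prints acbd18db4cc2f85cedef654fccc4a4d8
    }
}
```

If the analyzer emits this string as its only token, a term lookup on "dcdocid" for that exact value is the direct way to confirm whether it reached the index.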

Re: not indexing analyzed field

2010-11-25 Thread Bernd Fehling
Hi Uwe, my fieldType and fields are as follows: So the field dcdocid has the attribute *stored* which I can also see in the debugger. Why should I analyze a stored field? I don't know if I need to analyze it, I also tried a filter but also no success. My understanding is to send somet