Re: Is there a way to check for field "uniqueness" when indexing?

2009-08-20 Thread Yonik Seeley
On Fri, Aug 21, 2009 at 12:49 AM, Chris Hostetter wrote:
> : But in that case, I assume Solr does a commit per document added.
>
> not at all ... it computes a signature and then uses that as a unique key. IndexWriter.updateDocument does all the hard work.

Right - Solr used to do that hard work ...

Re: Is there a way to check for field "uniqueness" when indexing?

2009-08-20 Thread Chris Hostetter
: But in that case, I assume Solr does a commit per document added.

not at all ... it computes a signature and then uses that as a unique key. IndexWriter.updateDocument does all the hard work.

-Hoss
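For illustration, a minimal sketch of that signature-based dedup pattern against the Lucene 2.4-era API (the hashing helper, field names, and index path are invented here; this is not Solr's actual implementation):

    import java.security.MessageDigest;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.FSDirectory;

    public class SignatureDedupSketch {
        // Hypothetical helper: hash the fields that define "duplicate".
        static String signature(String... fields) throws Exception {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (String f : fields) {
                md5.update(f.getBytes("UTF-8"));
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }

        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(FSDirectory.getDirectory("/tmp/index"),
                    new StandardAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);

            String title = "Some title";
            String body = "Some body text";
            String sig = signature(title, body);

            Document doc = new Document();
            doc.add(new Field("sig", sig, Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("body", body, Field.Store.NO, Field.Index.ANALYZED));

            // updateDocument deletes any existing document whose "sig" term matches
            // and adds the new one -- no per-document commit is required.
            writer.updateDocument(new Term("sig", sig), doc);

            writer.close();
        }
    }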

Re: Possible to invoke same Lucene query on a String?

2009-08-20 Thread ohaya
Paul Cowan wrote:
> oh...@cox.net wrote:
> > - I'd have to create a (very small) index, for each sub-document, where I do the Document.add() with just the (for example) two terms, then
> > - Run a query against the 1-entry index, which
> > - Would either give me a "yes" or "no" (for that sub-document) ...

Re: Possible to invoke same Lucene query on a String?

2009-08-20 Thread Paul Cowan
oh...@cox.net wrote:
> - I'd have to create a (very small) index, for each sub-document, where I do the Document.add() with just the (for example) two terms, then
> - Run a query against the 1-entry index, which
> - Would either give me a "yes" or "no" (for that sub-document)
>
> As I said, I'm concerned ...

Re: Possible to invoke same Lucene query on a String?

2009-08-20 Thread ohaya
Paul Cowan wrote:
> oh...@cox.net wrote:
> > Document1
> >   subdoc1 term1 term2
> >   subdoc2 term1a term2a
> >   subdoc3 term1b term2b
> >
> > However, I've now been asked to implement the ability to query the sub-documents ...

Re: Possible to invoke same Lucene query on a String?

2009-08-20 Thread Paul Cowan
oh...@cox.net wrote:
> Document1
>   subdoc1 term1 term2
>   subdoc2 term1a term2a
>   subdoc3 term1b term2b
>
> However, I've now been asked to implement the ability to query the sub-documents. In other words, rather than ...

Re: Possible to invoke same Lucene query on a String?

2009-08-20 Thread ohaya
Hi,

I guess that, in short, what I'm really trying to find out is: if I construct a Lucene query, can I (somehow) use that to query a String object that I have, rather than querying against a Lucene index?

Thanks,
Jim

oh...@cox.net wrote:
> Hi,
>
> This question is going to be a little complicated to explain ...
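One way to run a Query directly against an in-memory String (rather than an on-disk index) is the contrib MemoryIndex class. A rough sketch, with the field name and query made up, and not necessarily the approach the thread settled on:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.memory.MemoryIndex;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    public class QueryAgainstStringSketch {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer();

            // The sub-document content, held as a plain String.
            String subDocText = "term1 term2 some other words";

            // Throwaway one-document index, built entirely in memory.
            MemoryIndex index = new MemoryIndex();
            index.addField("contents", subDocText, analyzer);

            // The same Lucene query that would be run against the main index.
            Query query = new QueryParser("contents", analyzer).parse("+term1 +term2");

            // A score greater than 0 means the String matches the query.
            float score = index.search(query);
            System.out.println(score > 0.0f ? "yes" : "no");
        }
    }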

Possible to invoke same Lucene query on a String?

2009-08-20 Thread ohaya
Hi, This question is going to be a little complicated to explain, but let me try. I have implemented an indexer app based on the demo IndexFiles app, and a web app based on the luceneweb web app for the searching. In my case, the "Documents" that I'm indexing are a proprietary file type, and e

Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

2009-08-20 Thread Simon Willnauer
Valery, have you tried to use WhitespaceTokenizer / CharTokenizer and do any further processing in a custom TokenFilter?!

simon

On Thu, Aug 20, 2009 at 8:48 PM, Robert Muir wrote:
> Valery, I think it all depends on how you want your search to work.
>
> when I say this, I mean for example: if a document only contains "C++" do you want searches on just "C" to match or not? ...
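As a rough sketch of that suggestion against the pre-2.9 TokenStream API (the punctuation-stripping rule in the filter is invented purely for illustration, not something proposed in the thread):

    import java.io.IOException;
    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    public class CodeTermAnalyzer extends Analyzer {

        /** Hypothetical filter: whitespace tokenization keeps "c++", "c#" and
         *  ".net" intact, so this only trims sentence punctuation left attached
         *  to a token (e.g. "c++," becomes "c++", "foo." becomes "foo"). */
        static final class TrailingPunctuationFilter extends TokenFilter {
            TrailingPunctuationFilter(TokenStream in) {
                super(in);
            }

            public Token next(Token reusableToken) throws IOException {
                for (Token token = input.next(reusableToken); token != null;
                         token = input.next(reusableToken)) {
                    String stripped = token.term().replaceAll("[,.;:!?]+$", "");
                    if (stripped.length() == 0) {
                        continue;   // token was pure punctuation, drop it
                    }
                    token.setTermBuffer(stripped);
                    return token;
                }
                return null;
            }
        }

        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream stream = new WhitespaceTokenizer(reader);
            stream = new LowerCaseFilter(stream);
            stream = new TrailingPunctuationFilter(stream);
            return stream;
        }
    }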

Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

2009-08-20 Thread Robert Muir
Valery, I think it all depends on how you want your search to work. when I say this, I mean for example: if a document only contains "C++" do you want searches on just "C" to match or not? another thing I would suggest is to take a look at the capabilities of Solr: it has some analysis stuff that

Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

2009-08-20 Thread Valery
Hi Robert,

so, would you expect a Tokenizer to consider '/' in both cases as a separate Token? Personally, I see no problem if the Tokenizer did the following job: "C/C++" ==> TokenStream of { "C", "/", "C", "+", "+" }, and came up with "C" and "C++" tokens after processing through the downstream ...

Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

2009-08-20 Thread Valery
Hi Ken,

thanks for the comments. Well, Terence's ANTLR was and is a good piece of work. Do you mean that you use ANTLR to generate a Tokenizer (a lexer), or did you even proceed further and use ANTLR to generate higher-level parsers to replace Lucene's TokenFilters? Or maybe even both ...

Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

2009-08-20 Thread Ken Krugler
Hi Valery,

From our experience at Krugle, we wound up having to create our own tokenizers (actually a kind of specialized parser) for the different languages. It didn't seem like a good option to try to twist one of the existing tokenizers into something that would work well enough. We would ...

Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

2009-08-20 Thread Robert Muir
Valery, oh I think there might be other ways to solve this. But you provided some examples such as C/C++ and SAP R/3. In these two examples you want the "/" to behave differently depending upon context, so my first thought was that a grammar might be a good way to ensure it does what you want. On

Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

2009-08-20 Thread Valery
Hi Robert,

thanks for the hint. Indeed, a natural way to go, especially if one builds a Tokenizer with a level of quality like StandardTokenizer's. OTOH, you mean that the out-of-the-box stuff is indeed not customizable for this task?..

regards
Valery

Robert Muir wrote:
> Valery, ...

Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

2009-08-20 Thread Robert Muir
Valery,

One thing you could try would be to create a JFlex-based tokenizer, specifying a grammar with the rules you want. You could use the source code & grammar of StandardTokenizer as a starting point.

On Thu, Aug 20, 2009 at 10:28 AM, Valery wrote:
> Hi all,
>
> I am trying to tune Lucene to respect tokens like C++, C#, .NET ...

Any Tokenizator friendly to C++, C#, .NET, etc ?

2009-08-20 Thread Valery
Hi all,

I am trying to tune Lucene to respect tokens like C++, C#, .NET. The task is known to the Lucene community, but surprisingly I can't google up much good info on it. Of course, I tried to re-use Lucene's building blocks for the Tokenizer. Here we go:

1) StandardTokenizer -- oh, th...

Re: custom scorer

2009-08-20 Thread Simon Willnauer
You could simply call Similarity.setDefault(yourSimilarity) to make sure it is used all over the place.

Simon

On Thu, Aug 20, 2009 at 3:25 PM, Chris Salem wrote:
> No, I take it I have to use it for both? Is there anything else I should have to do?
> Sincerely,
> Chris Salem
> ...
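A small sketch of the wiring being suggested (the particular overrides in the Similarity subclass are invented here just as examples of what a custom scorer might change):

    import org.apache.lucene.search.DefaultSimilarity;
    import org.apache.lucene.search.Similarity;

    public class CustomSimilaritySketch {

        /** Hypothetical scoring tweaks: no length normalization, and a term
         *  counts only once per document regardless of frequency. */
        static class MySimilarity extends DefaultSimilarity {
            public float lengthNorm(String fieldName, int numTokens) {
                return 1.0f;
            }
            public float tf(float freq) {
                return freq > 0 ? 1.0f : 0.0f;
            }
        }

        public static void main(String[] args) {
            // Install it globally so both indexing (norms) and searching use it,
            // without calling setSimilarity(...) on each IndexWriter/Searcher.
            Similarity.setDefault(new MySimilarity());
            // ... build the index and run searches as usual ...
        }
    }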

Re: custom scorer

2009-08-20 Thread Chris Salem
No, I take it I have to use it for both? Is there anything else I should have to do?

Sincerely,
Chris Salem

- Original Message -
To: java-user@lucene.apache.org
From: Grant Ingersoll
Sent: 8/19/2009 7:17:45 PM
Subject: Re: custom scorer

Are you setting the Similarity before indexing ...

Extending Sort/FieldCache

2009-08-20 Thread Shai Erera
Hi I'd like to extend Lucene's FieldCache such that it will read native values from a different place (in my case, payloads). That is, instead of iterating on a field's terms and parsing each String to long (for example), I'd like to iterate over one term (sort:long, again - an example) and decode
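For reference, pulling per-document values out of payloads in roughly that style could look like the sketch below (field/term names, the 8-byte big-endian encoding, and the index path are all assumptions; hooking this into FieldCache or a sort comparator is the part being discussed, and is not shown):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermPositions;
    import org.apache.lucene.store.FSDirectory;

    public class PayloadValuesSketch {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open(FSDirectory.getDirectory("/tmp/index"));
            long[] values = new long[reader.maxDoc()];

            // Every document is assumed to carry the single term sort:long,
            // with its sort value stored in that posting's payload.
            TermPositions tp = reader.termPositions(new Term("sort", "long"));
            while (tp.next()) {
                tp.nextPosition();                    // payloads hang off positions
                if (tp.isPayloadAvailable()) {
                    byte[] payload = tp.getPayload(new byte[8], 0);
                    values[tp.doc()] = decodeLong(payload);
                }
            }
            tp.close();
            reader.close();
        }

        // Assumed encoding: big-endian 8-byte long.
        static long decodeLong(byte[] b) {
            long v = 0;
            for (int i = 0; i < 8; i++) {
                v = (v << 8) | (b[i] & 0xFF);
            }
            return v;
        }
    }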

RE: Merge Exception in Lucene 2.4

2009-08-20 Thread Sumanta Bhowmik

Re: Merge Exception in Lucene 2.4

2009-08-20 Thread Michael McCandless
You should definitely upgrade to the latest JDK 1.6 to get the fix for the JRE bug in LUCENE-1282, but I don't think you are hitting that bug (read past EOF during merge is a different exception). Can you describe in more detail how you merge 6 IndexWriters?

Mike

On Thu, Aug 20, 2009 at 5:21 AM ...

RE: Merge Exception in Lucene 2.4

2009-08-20 Thread Sumanta Bhowmik
I checked at http://issues.apache.org/jira/browse/LUCENE-1282. SegmentMerger.java has this code:

  TermFreqVector[] vectors = reader.getTermFreqVectors(docNum);
  termVectorsWriter.addAllDocVectors(vectors);

so this issue appears in spite of this fix. I am using java version "1.6.0_07". Is it fixed in ...

Merge Exception in Lucene 2.4

2009-08-20 Thread Sumanta Bhowmik
Hi,

I am getting this issue in Lucene 2.4 when I try to merge multiple IndexWriters (generally 6):

  sh-3.2# Exception in thread "Lucene Merge Thread #5" org.apache.lucene.index.MergePolicy$MergeException: java.io.IOException: read past EOF
      at org.apache.lucene.index.ConcurrentMergeScheduler...
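For context, combining several separately built indexes under Lucene 2.4 is typically done by feeding their Directories to one target writer, roughly as below (paths are made up, and this is not necessarily how the poster's code performs the merge):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class MergeIndexesSketch {
        public static void main(String[] args) throws Exception {
            // Target index that receives the contents of the six source indexes.
            IndexWriter writer = new IndexWriter(FSDirectory.getDirectory("/tmp/merged"),
                    new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);

            Directory[] sources = new Directory[6];
            for (int i = 0; i < sources.length; i++) {
                sources[i] = FSDirectory.getDirectory("/tmp/part-" + i);
            }

            // Each source index should be closed by its own writer before this call.
            writer.addIndexesNoOptimize(sources);
            writer.optimize();   // optional: merge everything down to one segment
            writer.close();
        }
    }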