Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

2009-08-21 Thread John Byrne
Valery wrote: Hi John, (aren't you the same John Byrne who is a key contributor to the great OpenSSI project?) Nope, never heard of him! But with a great name like that I'm sure he'll go a long way :) John Byrne-3 wrote: I'm inclined to disagree with the idea tha

Re: Any Tokenizator friendly to C++, C#, .NET, etc ?

2009-08-21 Thread John Byrne
Hi Valery, I'm inclined to disagree with the idea that a token should not be split again downstream. I think that is actually a much easier way to handle it. I would have the tokenizer return the longest match, and then split it in a token filter. In fact I have done this before and it has wo

Re: Tokenizer question: how can I force ? and ! to be separate tokens?

2009-07-17 Thread John Byrne
Yes, you could even use the WhitespaceTokenizer and then look for the symbols in a token filter. You would get [you?] as a single token; your job in the token filter is then to store the [?] and return the [you]. The next time the token filter is called for the next token, you return the [?] th
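
A minimal sketch of the buffering filter described above, written against the old Token-returning TokenStream API of the Lucene 2.x line (the class name and the choice to handle only '?' and '!' are illustrative, not from the thread):

    import java.io.IOException;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;

    public class PunctuationSplitFilter extends TokenFilter {
        private Token pending; // the held-back [?] or [!] token, if any

        public PunctuationSplitFilter(TokenStream input) {
            super(input);
        }

        public Token next() throws IOException {
            if (pending != null) {        // emit the stored symbol first
                Token symbol = pending;
                pending = null;
                return symbol;
            }
            Token token = input.next();
            if (token == null) return null;
            String text = token.termText();
            if (text.length() > 1 && (text.endsWith("?") || text.endsWith("!"))) {
                int cut = text.length() - 1;
                // hold back the symbol; it is returned on the next call
                pending = new Token(text.substring(cut),
                                    token.startOffset() + cut, token.endOffset());
                return new Token(text.substring(0, cut),
                                 token.startOffset(), token.startOffset() + cut);
            }
            return token;
        }
    }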

Re: strange issues with IRISH

2009-07-13 Thread John Byrne
Hi, "suspect that [an] is still ignored as a stop word for some reason" Yes, "an" is still a stop word in English of course! (eg. 'an apple') Your custom analyzer should work; are you making sure to do both your indexing *and* your searching with the new analyzer? I think making a list of Ir

Re: How to create a new index

2009-05-20 Thread John Byrne
nd gives you full control of whatever you are doing. I've been trying to automate the creation of new Solr cores for the last two days without any luck. Finally today I moved to Lucene and it fixed my problem very quickly. Thank you all, and special thanks to the Lucene guys. Thanks, KK. On Wed, May 20,

Re: How to create a new index

2009-05-20 Thread John Byrne
index name; as mentioned by John, you could simply use the timestamp as the index name. -- Anshum Gupta Naukri Labs! http://ai-cafe.blogspot.com The facts expressed here belong to everybody, the opinions to me. The distinction is yours to draw On Wed, May 20, 2009 at 3:23

Re: How to create a new index

2009-05-20 Thread John Byrne
You can do this with pure Java. Create a File object with the path you want, check if it exists, and if not, create it: File newIndexDir = new File("/foo/bar"); if (!newIndexDir.exists()) { newIndexDir.mkdirs(); } The 'mkdirs()' method creates any necessary parent directories. If you want t

[Fwd: Vector space implementation]

2009-04-09 Thread John Byrne
with documents represented as vectors, in which the elements of each vector are the TF weights ...) Please, please help me on this; contact me if you need any further info via andykan1...@yahoo.com Many many thanks --- On Thu, 4/9/09, John Byrne wrote: From: John Byrne Subject: Re: query c++ To: java

Re: query c++

2009-04-09 Thread John Byrne
Hi, This came up before, a while ago: http://www.nabble.com/searching-for-C%2B%2B-to18093942.html#a18093942 I don't think there is an easier way than modifying the standard analyzer. As I suggested in that earlier thread, I would make the analyzer recognize token patterns that consist of wor
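
The earlier thread isn't reproduced here, so as a rough plain-Java illustration of the kind of token pattern meant (word characters followed by trailing symbols), before baking an equivalent rule into the JavaCC/JFlex grammar:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class TokenPatternDemo {
        public static void main(String[] args) {
            // Illustrative pattern: word chars (and dots) plus trailing
            // symbols, so "c++", "c#" and ".net" survive as single tokens.
            Pattern p = Pattern.compile("[\\w.]+[+#]*");
            Matcher m = p.matcher("writing c++ and c# on .net");
            while (m.find()) {
                System.out.println(m.group());
            }
        }
    }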

Re: HeapedScorerDoc using all my memory

2009-04-03 Thread John Byrne
I've put a new in the wrong place by mistake a time or two) But I confess that I have no idea what happens under the covers when indexing, so I'll have to leave any real insights to the folks who know. Erick On Fri, Apr 3, 2009 at 8:41 AM, John Byrne wrote: The maximum J

Re: HeapedScorerDoc using all my memory

2009-04-03 Thread John Byrne
eMB? Best Erick On Fri, Apr 3, 2009 at 7:13 AM, John Byrne wrote: Hi, I'm having a problem where the JVM runs out of memory while indexing a large number of files. An analysis of the heapdump shows that most of the memory was taken up with "org/apache/lucene/util/ScorerDocQueue$Heaped

HeapedScorerDoc using all my memory

2009-04-03 Thread John Byrne
Hi, I'm having a problem where the JVM runs out of memory while indexing a large number of files. An analysis of the heapdump shows that most of the memory was taken up with "org/apache/lucene/util/ScorerDocQueue$HeapedScorerDoc". I can't find any leaks in my code so far, and I was wondering,

Re: waaaay too many files in the index!

2009-02-04 Thread John Byrne
ewhere? Lucene runs merges with background threads (by default), and if those threads hit unhandled exceptions it's possible they are logged somewhere you wouldn't normally look? Mike John Byrne wrote: MergeFactor and MergeDocs are left at default values. The indexing is increment

Re: waaaay too many files in the index!

2009-02-04 Thread John Byrne
. Thanks, John Erick Erickson wrote: What are your IndexWriter MergFactor and MergeDocs set to? Also, are the dates on all these files indicative of all being create during the same indexing run? Finally, how many documents are you indexing? Best Erick On Tue, Feb 3, 2009 at 10:26 AM, John By

waaaay too many files in the index!

2009-02-03 Thread John Byrne
Hi, I've got a weird problem with a Lucene index, using 2.3.1. The index contains 6660 files. I don't know how this happened. Maybe someone can tell me something about the files themselves? (examples below) On one day, between 10 and 40 of these files were being created every minute. The index

Re: How to search documents taking in account the dates ???

2008-12-18 Thread John Byrne
Hi, I think this should do it... SortField dateSortField = new SortField("year", false); // the second argument reverses the sort direction if set to true SortField scoreSortField = new SortField(null, SortField.SCORE, false); // value of null for field, since 'score' is not really a field
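
Putting the two together (a sketch against the Lucene 2.x-era API; the "year" field is assumed to be indexed untokenized, and "searcher"/"query" set up elsewhere):

    import java.io.IOException;
    import org.apache.lucene.search.*;

    // Sort by year first, then by relevance within each year
    Hits searchByYearThenScore(Searcher searcher, Query query) throws IOException {
        SortField byYear  = new SortField("year", false); // false = ascending
        SortField byScore = new SortField(null, SortField.SCORE, false);
        Sort sort = new Sort(new SortField[] { byYear, byScore });
        return searcher.search(query, sort);
    }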

Re: Nested Proximity searches

2008-07-01 Thread John Byrne
The QueryParser syntax does not support what you're trying to do. However, it is possible to do with the API. If you can construct your Query programmatically, you can use the SpanQuery API to do what you need. Proximity searching is achieved using the SpanNearQuery; the constructor of this obj
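
A sketch of the construction (the field name "contents", the slop of 5 and the terms are illustrative):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.spans.SpanNearQuery;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;

    // Match "cats" within 5 positions of "dogs", in either order
    SpanQuery catsNearDogs() {
        SpanQuery cats = new SpanTermQuery(new Term("contents", "cats"));
        SpanQuery dogs = new SpanTermQuery(new Term("contents", "dogs"));
        return new SpanNearQuery(new SpanQuery[] { cats, dogs }, 5, false);
    }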

Re: Preventing index corruption

2008-06-27 Thread John Byrne
Hi, Rather than disabling the merging, have you considered putting the documents in a separate index, possibly in memory, and then deciding when to merge them with the main index yourself? That way, you can change your mind and simply not merge the new documents if you want. To do this, yo
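
A sketch of the staging idea, assuming the Lucene 2.x-era IndexWriter API (the path and analyzer choice are illustrative):

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    void mergeStagedDocuments() throws IOException {
        // Stage new documents in memory first
        Directory staging = new RAMDirectory();
        IndexWriter stagingWriter =
            new IndexWriter(staging, new StandardAnalyzer(), true);
        // ... add the new documents to stagingWriter ...
        stagingWriter.close();

        // Only merge if you still want the batch; otherwise just drop it
        IndexWriter mainWriter =
            new IndexWriter("/path/to/main/index", new StandardAnalyzer(), false);
        mainWriter.addIndexes(new Directory[] { staging });
        mainWriter.close();
    }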

Re: case insensitivity

2008-06-26 Thread John Byrne
Chris Hostetter wrote: the enumeration is in lexicographical order, so "Dell" is nowhere near "dell" in the enumeration. Even if we added a boolean property to Term indicating that it's a case-insensitive Term, the "seeking" along that enumeration would be ... less optimal ... than it can be now.

Re: case insensitivity

2008-06-25 Thread John Byrne
h to say "let's assume", but I suspect that whatever solution satisfied your example will have its own problems that are far worse than just lower-casing things. Best Erick On Wed, Jun 25, 2008 at 5:37 AM, John Byrne <[EMAIL PROTECTED]> wrote: Hi, I know that case-insensi

case insensitivity

2008-06-25 Thread John Byrne
Hi, I know that case-insensitive searching is normally done by creating an all-lower-case version of the documents, and turning the search terms into lower case whenever this field is searched, but this approach has its disadvantages. Let's say, for example, you want to find "Dell" (with a
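
One common workaround, sketched with Lucene 2.x-era constants (field names are illustrative): index the original value and a lowercased copy in separate fields, and pick the field at query time.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    Document makeDoc(String value) {  // e.g. value = "Dell"
        Document doc = new Document();
        // exact-case field for case-sensitive queries
        doc.add(new Field("name", value,
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
        // lowercased field for case-insensitive queries
        doc.add(new Field("nameLower", value.toLowerCase(),
                          Field.Store.NO, Field.Index.UN_TOKENIZED));
        return doc;
    }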

Re: searching for C++

2008-06-24 Thread John Byrne
I don't think there is a simpler way. I think you will have to modify the tokenizer. Once you go beyond basic human-readable text, you always end up having to do that. I have modified the JavaCC version of StandardTokenizer to allow symbols to pass through, but I've never used the JFlex ve

Re: number of hits per document

2008-06-10 Thread John Byrne
ust a Query, so the traditional way of Querying still applies, i.e. you get back a list of matching documents. Beyond that, if you just want to operate on the spans, just keep track of how often the doc() method changes. HTH, Grant On Jun 9, 2008, at 11:21 AM, John Byrne wrote: Hi, Is there
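
A sketch of that bookkeeping against the Lucene 2.x-era Spans API (raw collections to match the Java of the day):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.spans.SpanQuery;
    import org.apache.lucene.search.spans.Spans;

    // Walk the Spans enumeration; each step is one match, so counting
    // steps per value of doc() gives hits per document.
    Map countSpansPerDoc(SpanQuery query, IndexReader reader) throws IOException {
        Map counts = new HashMap(); // doc id -> match count
        Spans spans = query.getSpans(reader);
        while (spans.next()) {
            Integer doc = new Integer(spans.doc());
            Integer c = (Integer) counts.get(doc);
            counts.put(doc, new Integer(c == null ? 1 : c.intValue() + 1));
        }
        return counts;
    }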

number of hits per document

2008-06-09 Thread John Byrne
Hi, Is there an easy way to find out the number of hits per document for a Query, rather than just for a Term? Let's say, for example, I have a document like this: "here is cats near dogs and here is cats a long long way from dogs" and I use a SpanNearQuery to find "cats" near "dogs" with a

Re: Wild carded phrases

2008-05-09 Thread John Byrne
Hi, Here's a searchable mailing list archive: http://www.gossamer-threads.com/lists/lucene/java-user/ As regards the wildcard phrase queries, here's one way I think you could do it, but it's a bit of extra work. If you're using QueryParser, you'd have to override the "getFieldQuery" method t
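
The reply is cut off before the details; one way to finish the thought (not necessarily the one intended) is a MultiPhraseQuery whose wildcard position is expanded from the index's own terms. A sketch, with illustrative field name and prefix:

    import java.io.IOException;
    import java.util.ArrayList;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.search.MultiPhraseQuery;

    // Build the phrase "some wild*": a fixed term plus all terms with
    // the prefix "wild" that actually occur in the index.
    MultiPhraseQuery wildcardPhrase(IndexReader reader) throws IOException {
        MultiPhraseQuery q = new MultiPhraseQuery();
        q.add(new Term("contents", "some"));
        ArrayList expansions = new ArrayList();
        TermEnum te = reader.terms(new Term("contents", "wild"));
        try {
            do {
                Term t = te.term();
                if (t == null || !t.field().equals("contents")
                              || !t.text().startsWith("wild")) {
                    break; // past the prefix range
                }
                expansions.add(t);
            } while (te.next());
        } finally {
            te.close();
        }
        // NB: if no term matched the prefix, the phrase can't match at all
        q.add((Term[]) expansions.toArray(new Term[expansions.size()]));
        return q;
    }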

Re: storing position - keyword

2008-03-06 Thread John Byrne
"To confuse matters more, it is not really a matter of synonyms, as the orginal term is discarded from the index and there is only one mapped term" I'm not sure I fully understand this: am I right in thinking that you will be searching using these controlled volcabulary words, and that the sea

Re: bigram analysis

2008-03-03 Thread John Byrne
Yes, this makes sense to me. I think I'll just keep all words, including stop words, and if performance ever becomes an issue, I'll look at bigrams again. But I think there's a good chance that I'll never see significant impact either way. Thanks guys! Grant Ingersoll wrote: Yep, still good r

bigram analysis

2008-03-03 Thread John Byrne
Hi, I need to use stop-word bigrams, like the Nutch analyzer, as described in LIA 4.8 (Nutch Analysis). What I don't understand is, why does it keep the original stop word intact? I can see a great advantage to being able to search for a combination of stop word + real word, but I don't see th

Re: design: merging resultset from RDBMS with lucene search results

2008-02-13 Thread John Byrne
Hi, You might consider avoiding this problem altogether by simply adding the metadata to your Lucene index. Lucene can handle untokenized fields, which is ideal for metadata. It might not be as quick as the RDB, but you could perhaps optimize by only searching in the RDB when you only need
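
A sketch of the dual-purpose document (Lucene 2.x-era constants; the field names and values are illustrative):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    Document withMetadata(String bodyText) {
        Document doc = new Document();
        // full-text body, analyzed as usual
        doc.add(new Field("body", bodyText,
                          Field.Store.NO, Field.Index.TOKENIZED));
        // metadata kept as a single exact token
        doc.add(new Field("author", "smith",
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
        return doc;
    }

    // Later, the metadata can be matched exactly:
    Query byAuthor() {
        return new TermQuery(new Term("author", "smith"));
    }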

Re: Highlighting with wildcards?

2008-01-18 Thread John Byrne
I think the way to do this is to run the 'rewrite()' method on the wildcard query; this turns it into a boolean collection of term queries, with a term for each match for the wildcard. That way, you're just highlighting a normal term query. I think that would also work for fuzzy queries. Hope th
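
In code, assuming the contrib Highlighter of that era and an open IndexReader:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.QueryScorer;

    // rewrite() expands the wildcard into concrete terms the scorer can see
    Highlighter makeHighlighter(Query wildcardQuery, IndexReader reader)
            throws IOException {
        Query rewritten = wildcardQuery.rewrite(reader);
        return new Highlighter(new QueryScorer(rewritten));
    }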

Highlighting marked up documents

2008-01-17 Thread John Byrne
Hi, Has anyone found a way to use search term highlighting in a marked-up document, such as HTML or .DOC? My problem is, the Lucene highlighter works on plain text, the limitation being that you have to use the text you indexed for highlighting, so your tags are gone by then. Although it's po

Re: BooleanQuery TooManyClauses in wildcard search

2007-11-30 Thread John Byrne
Hi, Your problem is that when you do a wildcard search, Lucene expands the wildcard term into all possible terms. So, searching for "stat*" produces a list of terms like "state", "states", "stating" etc. (It only uses terms that actually occur in your index, however.) These terms are all adde
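
One common workaround (the 10,000 figure is arbitrary, and the usual memory caveat applies): raise the limit from its default of 1024, or catch the exception and narrow the wildcard.

    import org.apache.lucene.search.BooleanQuery;

    // Raise the global clause limit before running the expanded query
    BooleanQuery.setMaxClauseCount(10000);

    // ... or handle the failure case explicitly:
    // try { hits = searcher.search(query); }
    // catch (BooleanQuery.TooManyClauses e) { /* narrow the wildcard */ }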

Re: Looking for "Exact match but no other terms"... how to express it?

2007-10-30 Thread John Byrne
Tobias Hill wrote: I want to match on the exact phrase "foo bar dot" on a specific field on my set of documents. I only want results where that field has exactly "foo bar dot" and no more terms. I.e. a document with "foo bar dot alu" should not match. A phrase query with slop 0 seems reasonable

Re: Queries spanning paragraphs

2007-10-22 Thread John Byrne
Thanks for that, that's exactly what I needed. Actually, I hadn't heard of qsol, but it seems to solve a few other problems I have as well - correct highlighting, configurable operators, sentence recognition. Is it distributed under the Apache license? And is it currently stable enough to use o

Queries spanning paragraphs

2007-10-22 Thread John Byrne
Hi all, I need the ability to match documents that have two terms that occur within n paragraphs of each other. I had a look through the archives, and although many people have explained ways to implement per-sentence or per-paragraph indexing & searching, no one seems to have tackled this one y

Re: Querying the Query object

2007-10-08 Thread John Byrne
Yes, that sounds like what I was looking for! Thanks. Chris Hostetter wrote: : Is there any way to find out if an instance of Query has any terms within it? : I have a custom parser (QueryParser does not do everything I need) and it : sometimes creates empty BooleanQuerys. (This happens as a side
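
The suggested method is cut off above; one way to make the check (not necessarily the one Chris suggested) is Query.extractTerms, which only works on rewritten/primitive queries - wildcard and fuzzy queries must be rewrite()n against a reader first:

    import java.util.HashSet;
    import java.util.Set;
    import org.apache.lucene.search.Query;

    // True if the (rewritten) query contains at least one term
    boolean hasTerms(Query query) {
        Set terms = new HashSet();
        query.extractTerms(terms);
        return !terms.isEmpty();
    }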

Querying the Query object

2007-10-05 Thread John Byrne
Hi, Is there any way to find out if an instance of Query has any terms within it? I have a custom parser (QueryParser does not do everything I need) and it sometimes creates empty BooleanQuerys. (This happens as a side effect of recursive parsing - even if there are no terms for a query, I sti

Re: Indexing punctuation and symbols

2007-10-01 Thread John Byrne
ne it before- it makes me suspect that there's some reason I haven't seen yet that makes it impossible or impractical. Karl Wettin wrote: 1 okt 2007 kl. 15.33 skrev John Byrne: Has anyone written an analyzer that preserves punctuation and symbols ("£", "

Re: Indexing punctuation and symbols

2007-10-01 Thread John Byrne
eason I haven't seen yet that makes it impossible or impractical. Karl Wettin wrote: 1 okt 2007 kl. 15.33 skrev John Byrne: Has anyone written an analyzer that preserves punctuation and symbols ("£", "$", "%"

Indexing punctuation and symbols

2007-10-01 Thread John Byrne
Hi, Has anyone written an analyzer that preserves punctuation and symbols ("£", "$", "%" etc.) as tokens? That way we could distinguish between searching for "100" and "100%" or "$100". Does anyone know of a reason why that wouldn't work? I notice that even Google doesn't support that. But
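
As a starting point (a sketch against the Lucene 2.x-era API): whitespace-only tokenization already keeps the symbols attached, at the cost of also keeping trailing commas and the like.

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;

    public class SymbolTokensDemo {
        public static void main(String[] args) throws Exception {
            // "100", "100%" and "$100" come out as three distinct tokens
            TokenStream ts = new WhitespaceAnalyzer()
                    .tokenStream("f", new StringReader("100 100% $100"));
            Token t;
            while ((t = ts.next()) != null) {
                System.out.println(t.termText()); // 100, 100%, $100
            }
        }
    }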

Re: Wildcard vs Term query

2007-09-26 Thread John Byrne
ntrol the upper limit on terms produced by Wildcard/Fuzzy Queries. If this limit is exceeded (e.g when searching for something like "a*" ) then an exception is thrown. Cheers Mark - Original Message From: John Byrne <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: We

Wildcard vs Term query

2007-09-26 Thread John Byrne
Hi, I'm working my way through the Lucene in Action book, and there is one thing I need explained that I didn't find there: while wildcard queries are potentially slower than ordinary term queries, are they slower even if they don't contain a wildcard? Significantly slower? The reason I a