Valery wrote:
Hi John,
(aren't you the same John Byrne who is a key contributor to the great
OpenSSI project?)
Nope, never heard of him! But with a great name like that I'm sure he'll
go a long way :)
John Byrne-3 wrote:
I'm inclined to disagree with the idea tha
Hi Valery,
I'm inclined to disagree with the idea that a token should not be split
again downstream. I think that is actually a much easier way to handle
it. I would have the tokenizer return the longest match, and then split
it in a token filter. In fact I have done this before and it has wo
Yes, you could even use the WhitespaceTokenizer and then look for the
symbols in a token filter. You would get [you?] as a single token; your
job in the token filter is then to store the [?] and return the [you].
The next time the token filter is called for the next token, you return
the [?] th
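A rough sketch of that kind of filter, written against the old TokenStream API where next() returns a Token (the class and field names here are made up):

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// Hypothetical filter: splits a trailing "?" off a token and emits it
// as a separate token on the following call to next().
public class TrailingSymbolFilter extends TokenFilter {
    private Token pending; // symbol held back from the previous token

    public TrailingSymbolFilter(TokenStream input) {
        super(input);
    }

    public Token next() throws IOException {
        if (pending != null) {
            Token held = pending;
            pending = null;
            return held;
        }
        Token token = input.next();
        if (token == null) {
            return null;
        }
        String text = token.termText();
        if (text.length() > 1 && text.endsWith("?")) {
            // hold back the "?" and return the rest of the token first
            pending = new Token("?", token.endOffset() - 1, token.endOffset());
            return new Token(text.substring(0, text.length() - 1),
                    token.startOffset(), token.endOffset() - 1);
        }
        return token;
    }
}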
Hi,
"suspect that [an] is still ignored as a stop word for some reason"
Yes, "an" is still a stop word in English of course! (eg. 'an apple')
Your custom analyzer should work; are you making sure to do both your
indexing *and* your searching with the new analyzer?
I think making a list of Ir
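In case it helps, here is a minimal sketch of using the same analyzer on both sides (the analyzer class, directory and query text are made up):

Analyzer analyzer = new MyCustomAnalyzer(); // hypothetical custom analyzer
IndexWriter writer = new IndexWriter(indexDir, analyzer, true);
// ... add documents with writer ...
QueryParser parser = new QueryParser("contents", analyzer);
Query query = parser.parse("an apple");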
nd gives you full control over whatever you are
doing. I've been trying to automate the creation of new Solr cores for the last
two days without any luck. Finally today I moved to Lucene and it fixed my
problem very quickly. Thank you all, and special thanks to the Lucene guys.
Thanks,
KK.
On Wed, May 20,
index name, as
mentioned by John you could simply use the timestamp as the index name.
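For example (just a sketch, with a made-up base path):

// Name each new index directory after the time it was created
String indexName = "index-" + System.currentTimeMillis();
File newIndexDir = new File("/path/to/indexes", indexName);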
--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com
The facts expressed here belong to everybody, the opinions to me. The
distinction is yours to draw
On Wed, May 20, 2009 at 3:23
You can do this with pure Java. Create a file object with the path you
want, check if it exists, and if not, create it:
File newIndexDir = new File("/foo/bar");
if (!newIndexDir.exists()) {
    newIndexDir.mkdirs();
}
The 'mkdirs()' method creates any necessary parent directories.
If you want t
with documents presented as vectors, in
which the elements of each vector are the TF weights ...)
Please Please help me on this
contact me if you need any further info via andykan1...@yahoo.com
Many Many thanks
--- On Thu, 4/9/09, John Byrne wrote:
From: John Byrne
Subject: Re: query c++
To: java
Hi,
This came up before, a while ago:
http://www.nabble.com/searching-for-C%2B%2B-to18093942.html#a18093942
I don't think there is an easier way than modifying the standard
analyzer. As I suggested in that earlier thread, I would make the
analyzer recognize token patterns that consist of wor
I've put a new in the wrong place by mistake a time or two)
But I confess that I have no idea what happens under the covers when
indexing,
so I'll have to leave any real insights to the folks who know.
Erick
On Fri, Apr 3, 2009 at 8:41 AM, John Byrne wrote:
The maximum J
eMB?
Best
Erick
On Fri, Apr 3, 2009 at 7:13 AM, John Byrne wrote:
Hi, I'm having a problem where the JVM runs out of memory while indexing a
large number of files. An analysis of the heapdump shows that most of the
memory was taken up with
"org/apache/lucene/util/ScorerDocQueue$Heaped
Hi, I'm having a problem where the JVM runs out of memory while indexing
a large number of files. An analysis of the heapdump shows that most of
the memory was taken up with
"org/apache/lucene/util/ScorerDocQueue$HeapedScorerDoc".
I can't find any leaks in my code so far, and I was wondering,
ewhere? Lucene
runs merges with background threads (by default), and if those threads
hit unhandled exceptions it's possible they are logged somewhere you
wouldn't normally look?
Mike
John Byrne wrote:
MergeFactor and MergeDocs are left at default values. The indexing is
increment
Thanks,
John
Erick Erickson wrote:
What are your IndexWriter MergeFactor and MergeDocs set to? Also, are
the dates on all these files indicative of all being created during the same
indexing run?
Finally, how many documents are you indexing?
Best
Erick
On Tue, Feb 3, 2009 at 10:26 AM, John By
Hi,
I've got a weird problem with a lucene index, using 2.3.1. The index
contains 6660 files. I don't know how this happened. Maybe someone can
tell me something about the files themselves? (examples below)
On one day, between 10 and 40 of these files were being created every
minute. The index
Hi,
I think this should do it...
SortField dateSortField = new SortField("year", false); // the second argument reverses the sort direction if set to true
SortField scoreSortField = new SortField(null, SortField.SCORE, false); // value of null for field, since 'score' is not really a field
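Something like this might then tie them together (a sketch, assuming the 2.x Searcher API and a query you already have):

// Sort by year first, then by relevance score within the same year
Sort sort = new Sort(new SortField[] { dateSortField, scoreSortField });
Hits hits = searcher.search(query, sort);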
The QueryParser syntax does not support what you're trying to do.
However, it is possible to do with the API. If you can construct your
Query programmatically, you can use the SpanQuery API to do what you need.
Proximity searching is achieved using the SpanNearQuery; the constructor
of this obj
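For example (the field and terms are made up):

// "cats" within 5 positions of "dogs", in either order
SpanQuery cats = new SpanTermQuery(new Term("contents", "cats"));
SpanQuery dogs = new SpanTermQuery(new Term("contents", "dogs"));
SpanNearQuery near = new SpanNearQuery(new SpanQuery[] { cats, dogs }, 5, false);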
Hi,
Rather than disabling the merging, have you considered putting the
documents in a separate index, possibly in memory, and then deciding
when to merge them with the main index yourself?
That way, you can change your mind and simply not merge the new
documents if you want.
To do this, yo
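A rough sketch of that approach (2.x API, assuming the usual Lucene imports and an existing mainWriter on the main index):

// Buffer new documents in an in-memory index first
Directory ramDir = new RAMDirectory();
IndexWriter ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);
// ... add documents to ramWriter ...
ramWriter.close();
// Later, only if you decide to keep them, fold them into the main index
mainWriter.addIndexes(new Directory[] { ramDir });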
Chris Hostetter wrote:
the enumeration is in lexicographical order, so "Dell" is nowhere near
"dell" in the enumeration. Even if we added a boolean property to Terms
indicating that it's a case-insensitive Term, the "seeking" along that
enumeration would be ... less optimal ... than it can be now.
h to say "let's assume", but I suspect that
whatever solution satisfied your example will have its own problems
that are far worse than just lower-casing things.
Best
Erick
On Wed, Jun 25, 2008 at 5:37 AM, John Byrne <[EMAIL PROTECTED]> wrote:
Hi,
I know that case-insensi
Hi,
I know that case-insensitive searching is normally done by creating an
all-lower-case version of the documents, and turning the search terms
into lower case whenever this field is searched, but this approach has
its disadvantages.
Let's say, for example, you want to find "Dell" (with a
I don't think there is a simpler way. I think you will have to modify
the tokenizer. Once you go beyond basic human-readable text, you always
end up having to do that. I have modified the JavaCC version of
StandardTokenizer to allow symbols to pass through, but I've never
used the JFlex ve
ust a Query, so the traditional way of Querying still
applies, i.e. you get back a list of matching documents. Beyond that,
if you just want to operate on the spans, just keep track of how often
the doc() method changes.
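For example, something along these lines (a sketch against the 2.x Spans API; spanQuery and reader are assumed to exist already):

// Count span matches per document by watching doc() change
Spans spans = spanQuery.getSpans(reader);
int currentDoc = -1;
int matchesInDoc = 0;
while (spans.next()) {
    if (spans.doc() != currentDoc) {
        if (currentDoc != -1) {
            System.out.println("doc " + currentDoc + ": " + matchesInDoc);
        }
        currentDoc = spans.doc();
        matchesInDoc = 0;
    }
    matchesInDoc++;
}
if (currentDoc != -1) {
    System.out.println("doc " + currentDoc + ": " + matchesInDoc);
}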
HTH,
Grant
On Jun 9, 2008, at 11:21 AM, John Byrne wrote:
Hi,
Is there
Hi,
Is there an easy way to find out the number of hits per document for a
Query, rather than just for a Term?
Let's say, for example, I have a document like this:
"here is cats near dogs and here is cats a long long way from dogs"
and I use a SpanNearQuery to find "cats" near "dogs" with a
Hi,
Here's a searchable mailing list archive:
http://www.gossamer-threads.com/lists/lucene/java-user/
As regards the wildcard phrase queries, here's one way I think you could
do it, but it's a bit of extra work. If you're using QueryParser, you'd
have to override the "getFieldQuery" method t
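One way the expansion step might look (just a sketch using MultiPhraseQuery; the field name, prefix and reader are assumptions):

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.MultiPhraseQuery;

// Phrase "quick bro*": a fixed term at position 0, the expanded wildcard at position 1
MultiPhraseQuery query = new MultiPhraseQuery();
query.add(new Term("body", "quick"));

List expanded = new ArrayList();
TermEnum terms = reader.terms(new Term("body", "bro"));
try {
    do {
        Term t = terms.term();
        if (t == null || !t.field().equals("body") || !t.text().startsWith("bro")) {
            break;
        }
        expanded.add(t);
    } while (terms.next());
} finally {
    terms.close();
}
query.add((Term[]) expanded.toArray(new Term[expanded.size()]));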
"To confuse matters more, it is not really a matter of synonyms, as the
original term is discarded from the index and there is only one mapped term"
I'm not sure I fully understand this: am I right in thinking that you
will be searching using these controlled vocabulary words, and that the
sea
Yes, this makes sense to me. I think I'll just keep all words, including
stop words, and if performance ever becomes an issue, I'll look at
bigrams again. But I think there's a good chance that I'll never see
significant impact either way.
Thanks guys!
Grant Ingersoll wrote:
Yep, still good r
Hi,
I need to use stop-word bigrams, like the Nutch analyzer, as described
in LIA 4.8 (Nutch Analysis). What I don't understand is, why does it
keep the original stop word intact? I can see great advantage to being
able to search for a combination of stop word + real word, but I don't
see th
Hi,
You might consider avoiding this problem altogether, by simply adding
the meta data to your Lucene index. Lucene can handle untokenized
fields, which is ideal for meta data. It might not be as quick as the
RDB, but you could perhaps optimize by only searching in the RDB when
you only need
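For instance (the field name and value are made up; 2.x Field API):

Document doc = new Document();
// Store the metadata value as-is, without running it through the analyzer
doc.add(new Field("category", "quarterly-report", Field.Store.YES, Field.Index.UN_TOKENIZED));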
I think the way to do this is to run the 'rewrite()' method on the
wildcard query; this turns it into a boolean collection of term queries,
with a term for each match for the wildcard. That way, you're just
highlighting a normal term query. I think that would also work for fuzzy
queries. Hope th
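Roughly like this (a sketch; wildcardQuery, reader, analyzer and the field/text are assumed to exist):

// Expand the wildcard into the actual terms it matches,
// so the highlighter only sees plain term queries
Query rewritten = wildcardQuery.rewrite(reader);
Highlighter highlighter = new Highlighter(new QueryScorer(rewritten));
String fragment = highlighter.getBestFragment(analyzer, "contents", text);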
Hi,
Has anyone found a way to use search term highlighting in a marked up
document, such as HTML or .DOC? My problem is, the lucene highlighter
works on plain text, the limitation being that you have to use the text
you indexed for highlighting, so your tags are gone by then. Although
it's po
Hi,
Your problem is that when you do a wildcard search, Lucene expands the
wildcard term into all possible terms. So, searching for "stat*"
produces a list of terms like "state", "states", "stating" etc. (It only
uses terms that actually occur in your index, however). These terms are
all adde
Tobias Hill wrote:
I want to match on the exact phrase "foo bar dot" on a
specific field on my set of documents.
I only want results where that field has exactly "foo bar dot"
and no more terms. I.e. A document with "foo bar dot alu"
should not match.
A phrase query with slop 0 seems reasonable
Thanks for that, that's exactly what I needed.
Actually, I hadn't heard of qsol, but it seems to solve a few other
problems I have as well - correct highlighting, configurable operators,
sentence recognition. Is it distributed under the Apache license? And is
it currently stable enough to use o
Hi all,
I need the ability to match documents that have two terms that occur
within n paragraphs of each other. I had a look through the archives,
and although many people have explained ways to implement per-sentence
or per-paragraph indexing & searching, no one seems to have tackled this
one y
Yes, that sounds like what I was looking for! Thanks.
Chris Hostetter wrote:
: Is there any way to find out if an instance of Query has any terms within it?
: I have a custom parser (QueryParser does not do everything I need) and it
: sometimes creates empty BooleanQuerys. (This happens as a side
Hi,
Is there any way to find out if an instance of Query has any terms
within it? I have a custom parser (QueryParser does not do everything I
need) and it sometimes creates empty BooleanQuerys. (This happens as a
side effect of recursive parsing - even if there are no terms for a
query, I sti
ne it before- it
makes me suspect that there's some reason I haven't seen yet that makes
it impossible or impractical.
Karl Wettin wrote:
On 1 Oct 2007, at 15:33, John Byrne wrote:
Has anyone written an analyzer that preserves punctuation and
symbols ("£", "
Hi,
Has anyone written an analyzer that preserves punctuation and symbols
("£", "$", "%" etc.) as tokens?
That way we could distinguish between searching for "100" and "100%" or
"$100".
Does anyone know of a reason why that wouldn't work? I notice that even
Google doesn't support that. But
ntrol the upper limit on terms
produced by Wildcard/Fuzzy Queries.
If this limit is exceeded (e.g. when searching for something like "a*") then an
exception is thrown.
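If you do hit that, the limit can be raised (at some memory cost), e.g.:

// Default is 1024; raising it allows broader wildcard/fuzzy expansion
BooleanQuery.setMaxClauseCount(4096);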
Cheers
Mark
- Original Message
From: John Byrne <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: We
Hi,
I'm working my way through the Lucene In Action book, and there is one
thing I need explained that I didn't find there;
While wildcard queries are potentially slower than ordinary term
queries, are they slower even if they don't contain a wildcard?
Significantly slower?
The reason I a