Doron Cohen wrote:
From the StandardAnalyzer JavaCC grammar:
// floating point, serial, model numbers, ip addresses, etc.
// every other segment must have at least one digit
<#P: ("_"|"-"|"/"|"."|",") >
My understanding of this: a non-whitespace sequence is broken
at eithe
Doron Cohen wrote:
I think it splits by hyphens unless the no-hyphen
part has digits, so:
np-pandock-a7
becomes
np
pandock-a7
This is for the indexing part.
Wow! Do you know the thinking behind that, i.e. why a number in a
hyphenated expression prevents the split?
--MDC
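For reference, one can see what the grammar does by printing the tokens directly. A minimal sketch against the 1.9/2.0-era TokenStream API (the field name "f" is arbitrary):

    import java.io.StringReader;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class ShowTokens {
        public static void main(String[] args) throws Exception {
            // Print each token StandardAnalyzer emits for a hyphenated input.
            TokenStream ts = new StandardAnalyzer()
                    .tokenStream("f", new StringReader("np-pandock-a7"));
            for (Token t = ts.next(); t != null; t = ts.next()) {
                System.out.println(t.termText());
            }
        }
    }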
John Powers wrote:
Np-pandock
Np-pandock-1
Np-pandock-2
Np-pandock-L
Np-pandock-L1
Np-pandock-L2
I'm not positive, but I think StandardAnalyzer splits this input at the
hyphens. That is, it gives the terms "Np", "pandock", "1", "2", "L",
"L1", and "L2", but NOT "Np-pandoc", etc.
--MD
david m wrote:
A couple of reasons that lead to the merge approach:
- Source documents are written to archive media and retrieval is
relatively slow. Add to that our processing pipeline (including
text extraction)... Retrieving and merging minis is faster than
re-processing and re-indexing f
d m wrote:
I'd like to share index merge performance data and have a couple
of questions about it...
We (AXS-One, www.axsone.com) build one "master" index per day.
For backup and recovery purposes, we also build many individual
"mini" indexes from the docs added to the master index.
Should one
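For the merge step itself, the era's IndexWriter can fold prebuilt mini indexes into the master in one call. A sketch (the paths and helper are invented):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    class MergeMinis {
        // Fold several "mini" indexes into the master in one call;
        // addIndexes() also optimizes as part of the merge in this era.
        static void merge(String masterPath, String[] miniPaths) throws Exception {
            IndexWriter writer = new IndexWriter(masterPath, new StandardAnalyzer(), false);
            Directory[] minis = new Directory[miniPaths.length];
            for (int i = 0; i < miniPaths.length; i++) {
                minis[i] = FSDirectory.getDirectory(miniPaths[i], false);
            }
            writer.addIndexes(minis);
            writer.close();
        }
    }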
Jonathan O'Connor wrote:
I have a database of a million documents and about 100 users. The documents
can have an access control list, and there is a complex, recursive
algorithm to say if a particular user can see a particular document.
My problem is that my search algorithm is to first do a st
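One common pattern for this (a sketch, not necessarily Jonathan's design): push the ACL check into a Filter so only permitted documents reach scoring. Here canSee() and the "docKey" field are hypothetical stand-ins for the recursive ACL algorithm and a stored document key; with only ~100 users, the resulting BitSets could also be cached per user (e.g. via CachingWrapperFilter).

    import java.io.IOException;
    import java.util.BitSet;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Filter;

    // Restrict search results to documents this user may see.
    public class AclFilter extends Filter {
        private final String user;
        public AclFilter(String user) { this.user = user; }

        public BitSet bits(IndexReader reader) throws IOException {
            BitSet ok = new BitSet(reader.maxDoc());
            for (int i = 0; i < reader.maxDoc(); i++) {
                if (reader.isDeleted(i)) continue;
                String key = reader.document(i).get("docKey"); // hypothetical stored key
                if (canSee(user, key)) ok.set(i);              // hypothetical ACL check
            }
            return ok;
        }

        private boolean canSee(String user, String key) {
            return true; // stand-in for the recursive ACL algorithm
        }
    }

Usage would then be searcher.search(query, new AclFilter(user)), so Lucene never returns documents the user cannot see.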
<[EMAIL PROTECTED]> wrote on 08/03/2007 12:56:33:
I have to index many documents with the same fields (only one or two
fields are different). Can I add a field (Field instance) to many
documents? It seems to work but I'm not sure if this is the right way...
What does "many" mean in this context?
Is your disk almost full? Under Linux, when you reach about 90% used on
a file system, only the superuser can allocate more space (e.g. create
files, add data to files, etc.).
--MDC
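On the original question: the conventional pattern in this era is a fresh Field object per Document, even when the value repeats across documents. A sketch (field names and values invented):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    class RepeatedField {
        // Create a fresh Field per Document, even for a value shared by
        // many documents; Field objects are cheap relative to indexing.
        static Document build(String text) {
            Document doc = new Document();
            doc.add(new Field("source", "batch-42", Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("contents", text, Field.Store.NO, Field.Index.TOKENIZED));
            return doc;
        }
    }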
Felix Litman wrote:
We want to be able to return a result regardless of whether users use a colon in the query. So the queries 'work:' and 'work' should return the same result.
With the current parser, if a user enters 'work:' with a ":", Lucene does not return anything :-(. It seems to me the
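One workaround sketch: sanitize the raw query string before it reaches QueryParser. Stripping a trailing colon matches the stated goal; QueryParser.escape() is the alternative if the colon should instead be treated as a literal character. The field name "contents" is an assumption:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    class ColonSafeParse {
        static Query parse(String userInput) throws Exception {
            String raw = userInput.trim();
            if (raw.endsWith(":")) {                      // 'work:' -> 'work'
                raw = raw.substring(0, raw.length() - 1);
            }
            return new QueryParser("contents", new StandardAnalyzer()).parse(raw);
        }
    }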
Somnath Banerjee wrote:
Thanks for the reply. Good guess I think.
DB (Index) is basically a collection of encyclopedia documents. Queries are
also a collection of documents but of various domains. My task is to find
out for each "query document" the top 100 matching encyclopedia contents.
I tried b
Somnath Banerjee wrote:
I have created an 8GB index of almost 2 million documents. My requirement is to run nearly 0.72 million queries against this index. Each query consists of 200 - 400 words. I have created a BooleanQuery by ORing these words. But each query is taking nearly 5 - 10 secon
Phil Rosen wrote:
I would like to get the sum of frequency counts for each term in the fields
I specify across the search results. I can just iterate through the
documents and use getTermFreqVector() for each desired field on each
document, then sum that; but this seems slow to me.
It seems
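For reference, the iterate-and-sum loop the poster describes would look roughly like this (a sketch; it assumes the field was indexed with term vectors and that docIds holds Lucene doc numbers taken from the hits):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.TermFreqVector;

    class TermFreqSum {
        // Sum one term's frequency over a set of result docs using
        // stored term vectors.
        static int sum(IndexReader reader, int[] docIds, String field, String term)
                throws IOException {
            int total = 0;
            for (int i = 0; i < docIds.length; i++) {
                TermFreqVector tfv = reader.getTermFreqVector(docIds[i], field);
                if (tfv == null) continue;
                String[] terms = tfv.getTerms();
                int[] freqs = tfv.getTermFrequencies();
                for (int j = 0; j < terms.length; j++) {
                    if (terms[j].equals(term)) total += freqs[j];
                }
            }
            return total;
        }
    }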
Phil Rosen wrote:
I am building an application that requires I index a set of documents on
the scale of hundreds of thousands.
A document can have a varying number of attribute fields with an unknown
set of potential values. I realize that just indexing a blob of fields
would be much faster, ho
Erick Erickson wrote:
Arbitrary restrictions by IT on the space the indexes can take up.
Actually, I won't say categorically that I *can't* make this happen, but in order to use this option, I need to be able to present a convincing case. And I can't do that until I've exhausted my options/creativity.
Erick Erickson wrote:
Here's my problem:
We're indexing books. I need to
a> return books ordered by relevancy
b> for any single book, return the number of hits in each chapter (which, of course, may be many pages).
1> If I index each page as a document, creating the relevance on a book basis
On Wed, 2006-10-18 at 19:05 +1300, Paul Waite wrote:
No, they don't want that. They just want a small number. What happens is they enter some silly query, like searching for all stories with a single common non-stop-word in them, and with the usual sort criterion of by date (i.e. a field) descendi
If your question is why are the queries
'(field:software field:engineer)'
and
'(+field:software +field:engineer)'
returning the same results, it could be because none of your documents have
*only* "software" *or* "engineer", i.e. they all have both words or neither.
You could tes
Furash Gary wrote:
I'm sure this is just a design point that I'm missing, but is there a
way to have my document objects know more about themselves?
At the time I create my document, I know a bit about how information is
being stored in it (e.g., this field represents a SOUNDEX copy, etc.),
yet
Daniel Naber wrote:
Hi,
as some of you may have noticed, Lucene prefers shorter documents over
longer ones, i.e. shorter documents get a higher ranking, even if the
ratio "matched terms / total terms in document" is the same.
For example, take these two artificial documents:
doc1: x 2 3 4 5
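The length preference comes from Similarity's lengthNorm(). A sketch of the usual counter-measure, which makes all documents score as if they had equal length; note norms are baked in at index time, so this needs a reindex plus setSimilarity() on both the IndexWriter and the Searcher:

    import org.apache.lucene.search.DefaultSimilarity;

    // Ignore document length entirely when scoring.
    public class FlatLengthSimilarity extends DefaultSimilarity {
        public float lengthNorm(String fieldName, int numTerms) {
            return 1.0f;
        }
    }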
Erick Erickson wrote:
I'm pretty sure your problem is
Query q = new BooleanQuery...
should be
BooleanQuery q = new BooleanQuery...
Good catch! That's why the add() method is barfing. The parse() thing,
though, is probably a change to the QueryParser interface for 2.0.
--MDC
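Put together, a hedged sketch of the 2.0-style code (field names and terms invented):

    import org.apache.lucene.analysis.StopAnalyzer;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    class Lucene20Fixes {
        static Query build() throws Exception {
            // Declare as BooleanQuery, not Query, so add() is available.
            BooleanQuery q = new BooleanQuery();
            q.add(new TermQuery(new Term("field1", "silly")), BooleanClause.Occur.MUST);

            // The 2.0 QueryParser is instantiated; the old static
            // parse(String, String, Analyzer) is gone.
            QueryParser parser = new QueryParser("field1", new StopAnalyzer());
            q.add(parser.parse("example"), BooleanClause.Occur.MUST);
            return q;
        }
    }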
[EMAIL PROTECTED] wrote:
When I try this ( using Lucene 2.0 API ) I get the error:
"..The method parse(String) is not applicable for arguments (String,
String, StopAnalyzer)
...
When I try this I get the error:
The method add(TermQuery, BooleanClause.Occur) is undefined for type
Query
Could th
[EMAIL PROTECTED] wrote:
I am using Lucene 2.0 and trying to use the MultiFieldQueryParser
in my search.
I want to limit my search to documents which have "silly"
in "field1" ...within that subset of documents, I want documents which
have
"example" in "field2" OR "field3"
The code fragment b
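One way to express that restriction, since MultiFieldQueryParser alone applies each term to every field, is a nested BooleanQuery. A sketch using the poster's field names:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    class SillyExample {
        static BooleanQuery build() {
            // "example" must appear in field2 OR field3...
            BooleanQuery either = new BooleanQuery();
            either.add(new TermQuery(new Term("field2", "example")), BooleanClause.Occur.SHOULD);
            either.add(new TermQuery(new Term("field3", "example")), BooleanClause.Occur.SHOULD);

            // ...and "silly" must appear in field1.
            BooleanQuery q = new BooleanQuery();
            q.add(new TermQuery(new Term("field1", "silly")), BooleanClause.Occur.MUST);
            q.add(either, BooleanClause.Occur.MUST);
            return q;
        }
    }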
Van Nguyen wrote:
I just want results that have:
ID: 1234 OR 2344 OR 2323
LOCATION: A1
LANGUAGE: ENU
This query returns everything from my index. How would I create a query that will only return results that must have LOCATION and LANGUAGE and have only those three IDs?
I think you'll ne
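Presumably the missing piece is required clauses around an OR'd group of IDs; in QueryParser syntax that could look like the following (an assumption about the intended answer, using the poster's field names and assuming the values survive analysis as-is):

    +(ID:1234 ID:2344 ID:2323) +LOCATION:A1 +LANGUAGE:ENU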
Alexander Mashtakov wrote:
But the database is going to be big enough, and the list of IDs returned by Lucene too. This may cause high memory usage and slow SQL query speed (for instance, 1000 IDs in an "IN (id1, id2 ...)" SQL filter).
For this part, I recommend using a working table to hold the
Dominik Bruhn wrote:
Hi,
I use Lucene to index an SQL table which contains three fields: an index field, the text to search in, and another field. When adding a Lucene document I let Lucene index the search field and also save the id along with it in the Lucene index.
Upon searching I collect
Rob Staveley (Tom) wrote:
I guess the most expensive thing I'm doing from the perspective of Boolean
clauses is heavily using PrefixQuery.
I want my user to be able to find e-mail to, cc or from [EMAIL PROTECTED], so
I opted for PrefixQuery on James. Bearing in mind that this is causing me
grie
[EMAIL PROTECTED] wrote:
Hi,
I have a field indexed as follows:
new Field(name, value, Store.YES, Index.UN_TOKENIZED)
I would like to search this field for exact match of
the query term. Thus if, for instance in the above
code snippet:
String name="PROJECT";
String value="Apache Lucene";
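Since the field is UN_TOKENIZED, the whole value is stored as a single term; a TermQuery matches it exactly, whereas QueryParser would analyze "Apache Lucene" into separate tokens and miss it. A sketch:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    class ExactMatch {
        // Match the exact stored string of an UN_TOKENIZED field.
        static Query build() {
            return new TermQuery(new Term("PROJECT", "Apache Lucene"));
        }
    }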
Nadav Har'El wrote:
Otis Gospodnetic <[EMAIL PROTECTED]> wrote on 12/06/2006 04:36:45 PM:
Nadav,
Look up one of my onjava.com Lucene articles, where I talk about
this. You may also want to tell Lucene to merge segments on disk
less frequently, which is what mergeFactor does.
Thanks. Can
Nadav Har'El wrote:
What I couldn't figure out how to use, however, was the abundant memory (2
GB) that this machine has.
I tried playing with IndexWriter.setMaxBufferedDocs(), and noticed that
there is no speed gain after I set it to 1000, at which point the running
Lucene takes up just 70 MB
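For context, the two knobs discussed here sit on IndexWriter; a sketch with illustrative values (not tuned recommendations):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    class TuneIndexing {
        static IndexWriter open(String path) throws Exception {
            IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), true);
            writer.setMaxBufferedDocs(1000); // buffer more docs in RAM before flushing
            writer.setMergeFactor(50);       // merge segments on disk less often
            return writer;
        }
    }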
Not sure if I understand exactly what you want to do, but would the "field:term" syntax that QueryParser understands work for you? That is, you could send query text like
f1:foo f2:foo f3:foo
to search for "foo" in any of the 3 fields. If you need boolean capabilities
you can use parentheses, li
David Trattnig wrote:
Hello LuceneList,
I've got at least the following fields in my index:
AREA = "home news business"
CONTENTS = "... hello world ..."
If I submit the query
query-string: "hello area:home"
Lucene should only search those documents which have the matching area. Actually Lucene s
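If the unwanted behavior is that the bare term becomes optional under QueryParser's default OR operator, one hedged fix is to make AND the default, so "hello area:home" means +contents:hello +area:home. A sketch against the 2.0-era API (default field name assumed):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;

    class AreaRestrictedParse {
        static Query parse(String input) throws Exception {
            QueryParser parser = new QueryParser("contents", new StandardAnalyzer());
            // Make bare terms required rather than optional.
            parser.setDefaultOperator(QueryParser.AND_OPERATOR);
            return parser.parse(input);
        }
    }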
Sharad Agarwal wrote:
I am a newbie in the Lucene space and am trying to understand Lucene search result caching; I am facing a weird issue.
After creating the IndexReader from a file system directory, I
rename/remove the index directory; but still I am able to search the
index and able to get the
Jeremy Hanna wrote:
I would use a database function to force the ordering like the one you provided that works in Oracle, but it doesn't look like MySQL 5 supports that. If anyone else knows of a way to force the ordering using MySQL 5 queries, please respond. I think I'll just resort th
Terenzio Treccani wrote:
You're both right, this doesn't sound like Lucene at all...
But the problem of such SQL tables is their size: speaking of millions of customers and thousands of news items, the many-to-many (CustArt) table would end up containing BILLIONS of lines. A bit too big
This doesn't sound like a Lucene problem, at least the way you've described
it. For example, Lucene can't search on any field that isn't indexed (and
most of yours aren't indexed).
Given that, it seems like your option (c) is the way to go. Seems like a
simple RDBMS schema with 3 tables woul
Anton Potehin wrote:
I have a problem.
There is an index which contains about 6,000,000 records (soon to be 15,000,000); the size is 4GB. The index is optimized and consists of only one segment. This index stores products. Each product has a brand, a price, and about 10 more additional fields. I
Volodymyr Bychkoviak wrote:
This is not the case.
maxClauseCount limits the number of terms that can fit into a BooleanQuery during some query rewriting. And the default value is 1024, not 32.
32 required/prohibited clauses is a limitation of Lucene 1.4.3, due to its usage of bit patterns to mask required/
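If a legitimate prefix or wildcard expansion does exceed the rewrite limit, it can be raised via a static setter (the 4096 is illustrative; bigger limits cost RAM per expanded query):

    import org.apache.lucene.search.BooleanQuery;

    class RaiseClauseLimit {
        static void apply() {
            // Default is 1024; raise it for large but legitimate expansions.
            BooleanQuery.setMaxClauseCount(4096);
        }
    }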
Thomas Papke wrote:
What is the disadvantage of doing that?
Besides being a bit awkward, it can use more RAM and time to compile queries. That's because a query like "foo*" gets expanded into all the terms from the index that begin with "foo", ORed together, like so:
"foo" OR "fo
Thomas Papke wrote:
i am a "newby" in usage of Apache Lucene. If have a relativly big
database indexed by lucene (about 300MB File). Up to now - all users
could search over the hole index. How to restrict the resultset? I have
tried it with adding some BooleanQuerys to restrict entries. But wi
Hugh Ross wrote:
I need to find all distinct values for a keyword field in a Lucene index.
I think the IndexReader.terms() method will do what you want. Good luck!
--MDC
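A sketch of that term-dictionary walk (terms are sorted by field, then text, so the loop stops at the first term from another field):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    class DistinctValues {
        // Print every distinct value of a keyword field.
        static void dump(IndexReader reader, String field) throws IOException {
            TermEnum terms = reader.terms(new Term(field, ""));
            try {
                do {
                    Term t = terms.term();
                    if (t == null || !t.field().equals(field)) break;
                    System.out.println(t.text());
                } while (terms.next());
            } finally {
                terms.close();
            }
        }
    }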
SOME ONE wrote:
Hi,
Yes, my queries are like the first case. And as there have been no other suggestions for doing it in a single search operation, I will have to do it the way you suggested. This technique will do the job particularly because the title's text is always in the body as well. So finally I
I'm not sure you can do what you want in a single search. But, I'm not sure I
actually understand what your queries look like, either. I *think* you want
to search like
(title:a OR body:a) AND (title:b OR body:b) AND (title:c OR body:c)
not something like
(title:a OR title:b OR title:c) AND
SOME ONE wrote:
Yes, I could run two searches, but that means running two searches for each request from the user, and that, I think, doubles the job, taking double the time. Any suggestions to do it more efficiently, please?
I think it would only take double time if the sets of hit documents have
substa
SOME ONE wrote:
Hi,
I am using MultiFieldQueryParser (Lucene 1.9) to search the title and body fields in the documents. The requirement is that documents with a title match should be returned before documents with a body match.
Using the default scoring, title matches do come before the body matche
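One common approach (an assumption about what fits here, not the only option): boost the title field at index time, or equivalently boost at query time with title:foo^4. A sketch with an illustrative 4.0f:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    class BoostedDoc {
        static Document build(String title, String body) {
            Document doc = new Document();
            Field titleField = new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED);
            titleField.setBoost(4.0f); // weight title matches above body matches
            doc.add(titleField);
            doc.add(new Field("body", body, Field.Store.NO, Field.Index.TOKENIZED));
            return doc;
        }
    }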
Java Programmer wrote:
Hi,
Maybe this question is trivial but I need to ask it. I have some problems with indexing a large number of documents, and I seek a better solution. The task is to index about 33GB of CSV text data (each record about 30kB); it is possible of course to index these data but I'm not ver
Chandramohan wrote:
perform such a cull again, you might make several distinct indexes (one per day, per week, per whatever) during that reindexing so the next time will be much easier.
How would you search and consolidate the results across multiple indexes? Hits from each index will have
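MultiSearcher is the usual answer here: it searches the sub-indexes as one and merges the hits into a single result set with comparable scores. A sketch (paths invented):

    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searchable;

    class SearchAll {
        // Search several per-period indexes as if they were one index.
        static Hits search(Query q) throws Exception {
            Searchable[] parts = {
                new IndexSearcher("/indexes/2005"),  // hypothetical paths
                new IndexSearcher("/indexes/2006")
            };
            return new MultiSearcher(parts).search(q);
        }
    }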
Greg Gershman wrote:
No problem; this is not meant to be a regular
operation, rather it's a (hopefully) one-time thing
till the index can be restructured.
The data is chronological in nature, and I'm deleting everything before a specific point in time. The index is optimized, so is it possible to remo
Greg Gershman wrote:
I'm trying to delete a large number of documents (~15 million) from a large index (30+ million documents). I've started with an optimized index, and a list of docIds (our own unique identifier for a document, not a Lucene doc number) to pass to the IndexReader.delete(Term
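A hedged sketch of that loop, using the 2.0 method name deleteDocuments() (the quoted call is the older delete()); "docId" stands in for the poster's own identifier field:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    class BulkDelete {
        // Mark documents deleted by our own unique-id field; space is
        // only reclaimed later, when the index is optimized.
        static void delete(String indexPath, String[] ids) throws IOException {
            IndexReader reader = IndexReader.open(indexPath);
            try {
                for (int i = 0; i < ids.length; i++) {
                    reader.deleteDocuments(new Term("docId", ids[i]));
                }
            } finally {
                reader.close();
            }
        }
    }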
Otis Gospodnetic wrote:
Michael,
Actually, one more thing - you said you changed the
store/BufferedIndexOutput.BUFFER_SIZE from 1024 to 4096 and that turned out to
yield the fastest indexing. Does your FS block size also happen to be 4k
(dumpe2fs output) on that FC3 box? If so, I wonder if
Otis Gospodnetic wrote:
Hi,
I'm wondering if anyone has tested Lucene indexing/search performance with
different file system block sizes?
I just realized one of the servers where I run a lot of Lucene indexing and
searching has an FS with blocks of only 1K in size (typically they are 4k or
Daniel Noll wrote:
Is it possible to customise the QueryParser so that it returns Query
instances that have no relationship to the text index whatsoever?
The syntax that Lucene's QueryParser supports isn't very complicated.
I'm sure you could write your own parser from scratch, perhaps with s
Dmitry Goldenberg wrote:
a) if I index "function()" as "function()" rather than "function", does that mean that if
I search for "function", then it won't be found? -- the problem is that in some cases, the user will want to
find function(), and in some cases just function -- can I accommodate
Dmitry Goldenberg wrote:
Hi,
I'm trying to figure out a way to locate tokens which include special characters. The actual text in the file being indexed is something like "function() { statement1; statement2; }"
The query I'm using is "function\()" since I want to locate precisely "function
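One hedged approach for both cases: index the text twice, once analyzed normally for word searches and once through WhitespaceAnalyzer, which keeps "function()" intact as a single token. The "code" field name is invented; PerFieldAnalyzerWrapper wires the two analyzers together:

    import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    class CodeAwareAnalyzer {
        // "code" keeps punctuation, so function() survives as one token;
        // every other field gets normal StandardAnalyzer treatment.
        static PerFieldAnalyzerWrapper build() {
            PerFieldAnalyzerWrapper wrapper =
                    new PerFieldAnalyzerWrapper(new StandardAnalyzer());
            wrapper.addAnalyzer("code", new WhitespaceAnalyzer());
            return wrapper;
        }
    }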
Michael Pickard wrote:
Can anyone help me with the formation of a BooleanQuery ?
I want a query of the form:
x AND ( a OR b OR c OR d)
You're going to need 2 BooleanQuery objects: one for the OR'd expression in parentheses, and another for the ANDed expression. Something like this:
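The original example is cut off; a hedged reconstruction of the two-query shape it describes (single field name assumed):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    class XAndAnyOfAbcd {
        static BooleanQuery build(String field) {
            // (a OR b OR c OR d)
            BooleanQuery any = new BooleanQuery();
            any.add(new TermQuery(new Term(field, "a")), BooleanClause.Occur.SHOULD);
            any.add(new TermQuery(new Term(field, "b")), BooleanClause.Occur.SHOULD);
            any.add(new TermQuery(new Term(field, "c")), BooleanClause.Occur.SHOULD);
            any.add(new TermQuery(new Term(field, "d")), BooleanClause.Occur.SHOULD);

            // x AND (a OR b OR c OR d)
            BooleanQuery q = new BooleanQuery();
            q.add(new TermQuery(new Term(field, "x")), BooleanClause.Occur.MUST);
            q.add(any, BooleanClause.Occur.MUST);
            return q;
        }
    }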
Hi Ori,
Before taking drastic rehosting measures, and introducing the associated software complexity of splitting your application into pieces running
on separate machines, I'd recommend looking at the way your document
data is distributed and the way you're searching them. Here are some
qu
Steve Rajavuori wrote:
I am periodically getting a "Too many open files" error when searching. Currently there are over 500 files in my Lucene directory. I am attempting to run optimize() to reduce the number of files. However, optimize never finishes because whenever I run it, it quits with a
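Besides raising the OS open-file limit (e.g. ulimit -n), the compound-file format cuts the handle count sharply, since each segment lives in a single .cfs file. A sketch:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    class CompoundFormat {
        static void enable(String indexPath) throws Exception {
            IndexWriter writer = new IndexWriter(indexPath, new StandardAnalyzer(), false);
            writer.setUseCompoundFile(true); // segments become single .cfs files
            writer.optimize();               // rewrite existing segments
            writer.close();
        }
    }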
Plat wrote:
Basically, pretend I do a regular search for "category:fiction". After
stemming/etc, this would match any Document with a category of
"fiction", "non-fiction", "fictitious", etc. All 900+ of them.
BUT as far as the results are concerned, I'm not actually interested
in each Document
Mr Plate wrote:
This puzzle has been bugging me for a while; I'm hoping there's an
elegant way to handle it in Lucene.
DATA DESCRIPTION:
I've got an index of over 100,000 Documents. In addition to other
fields, each of these Documents has 0 or more "category" field values.
There are over
Mordo, Aviran (EXP N-NANNATEK) wrote:
Optimization also purges the deleted documents, thus reducing the size (in bytes) of the index. Until you optimize, documents stay in the index, only marked as deleted.
Deleted documents' space is reclaimed during optimization, 'tis true.
But it can also be
John Powers wrote:
Hello,
Lucene only lets you use a wildcard after a term, not before, correct?
What work arounds are there for that?
If I have an item 108585-123
And another 332323-123
How can I look for all the -123 family of items?
Classic indexing problem. Here are a couple simple ideas
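One classic idea along those lines (the "partRev" field name is invented): index the value reversed, so the suffix search becomes a cheap PrefixQuery on the reversed field.

    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.PrefixQuery;
    import org.apache.lucene.search.Query;

    class ReversedSuffixSearch {
        // At index time: store the part number reversed.
        static Field reversedField(String part) {
            String rev = new StringBuffer(part).reverse().toString(); // "321-585801"
            return new Field("partRev", rev, Field.Store.NO, Field.Index.UN_TOKENIZED);
        }

        // At search time: reverse the suffix and prefix-search it.
        static Query suffixQuery(String suffix) {
            String rev = new StringBuffer(suffix).reverse().toString(); // "-123" -> "321-"
            return new PrefixQuery(new Term("partRev", rev));
        }
    }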
Richard Jones wrote:
If you're willing to continue subsetting / summarizing the data out into
Lucene, how about subsetting it out into a dedicated MySQL instance for
this purpose? 100 artists * 1M profiles * 2 ints * 4 bytes/int =
roughly 1 GB of data, which would easily fit into RAM. Queries
Richard Jones wrote:
The data I'm dealing with is stored over a few MySQL DBs on different machines, horizontally partitioned so each user is assigned to a single DB. The queries I'm doing can be done in SQL in parallel over all machines and then combined, which I've tested - it's unacceptably slo
Richard Jones wrote:
Hi,
I'm using Lucene (which rocks, btw ;) behind the scenes at www.last.fm for various things, and I've run into a situation that seems somewhat inelegant regarding populating fields for which I already know the term vector.
I'm creating a document for each user (last.fm t
tcorbet wrote:
I have an index over the titles to .mp3 songs.
It is not unreasonable for the user to want to
see the results from: "Show me Everything".
I understand that title:* is not a valid wildcard query.
I understand that title:[a* TO z*] is a valid wildcard query.
What I cannot underst
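For "Show me Everything", MatchAllDocsQuery (added in Lucene 1.9) matches every document without any wildcard expansion at all. A sketch:

    import java.io.IOException;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MatchAllDocsQuery;

    class ShowEverything {
        // Return every document in the index, no wildcards needed.
        static Hits all(IndexSearcher searcher) throws IOException {
            return searcher.search(new MatchAllDocsQuery());
        }
    }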
Oren Shir wrote:
My documents contain a field called SORT_ID, which contains an int that
increases with every document added to the index. I want my results to be
sorted by it.
Which approach will prove the best performance:
1) Zero-pad the SORT_ID field and sort by it as plain text.
2) Sort using
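Option 2 presumably refers to the numeric sort; a sketch (the int sort keeps an int[] in the FieldCache, which is smaller than the String cache a text sort needs, and it skips zero-padding entirely):

    import java.io.IOException;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;

    class SortBySortId {
        static Hits search(IndexSearcher searcher, Query q) throws IOException {
            return searcher.search(q, new Sort(new SortField("SORT_ID", SortField.INT)));
        }
    }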