Jamie,
How often are you calling getReader? Is it only these files?
Jason
On Tue, Jan 26, 2010 at 12:58 PM, Jamie wrote:
> Ok. I spoke too soon. The problem is not solved. I am still seeing these
> file handles lying around. Is this something I should be worried about?
> We are no
Is there an analyzer that easily strips non-alphanumeric characters from the end
of a token?
wrote:
> Hi Jason,
>
> Solr's PatternReplaceFilter(ts, "\\P{Alnum}+$", "", false) should work,
> chained after an appropriate tokenizer.
>
> Steve
>
> On 02/04/2010 at 12:18 PM, Jason Rutherglen wrote:
>> Is there an anal
Answering my own question... PatternReplaceFilter doesn't output
multiple tokens...
Which means messing with capture state...
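For the simple single-token case, a rough sketch of the chain Steve suggested (tokenizer choice, class wrapper, and imports are my assumptions; PatternReplaceFilter is Solr's filter, not core Lucene):

import java.io.StringReader;
import java.util.regex.Pattern;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.solr.analysis.PatternReplaceFilter;

public class StripTrailingPunct {
    public static TokenStream build(String text) {
        // Tokenize on whitespace, then strip trailing non-alphanumerics from each token.
        TokenStream ts = new WhitespaceTokenizer(new StringReader(text));
        return new PatternReplaceFilter(ts, Pattern.compile("\\P{Alnum}+$"), "", false);
    }
}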
On Thu, Feb 4, 2010 at 2:16 PM, Jason Rutherglen
wrote:
> Transferred partially to solr-user...
>
> Steven, thanks for the reply!
>
> I wonder if
Peter,
Perhaps other concurrent operations?
Jason
On Tue, Feb 23, 2010 at 10:43 AM, Peter Keegan wrote:
> Using Lucene 2.9.1, I have the following pseudocode which gets repeated at
> regular intervals:
>
> 1. FSDirectory dir = FSDirectory.open(java.io.File);
> 2. dir.set
long - whatever
> happened to CSF? That feature is so 2006, and we still
> don't have it? I'm completely disturbed about the whole situation myself.
>
> Who the heck is in charge here?
>
> On 02/25/2010 12:51 PM, Jason Rutherglen wrote:
>>
>> It'd be great to
indexSearcher = new IndexSearcher(ir.reopen(true));
if (ir != indexSearcher.getIndexReader()) {
    ir.close();
}
Is the if (ir != indexSearcher.getIndexReader()) check needed?
Thanks,
Jason Tesser
dotCMS Lead Development Manager
1-305-858-1422
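For reference, a sketch of the reopen pattern being asked about (the wrapper class and names are made up, and it ignores searches still in flight on the old reader). reopen() may return the very same reader when nothing changed, which is exactly why the identity check matters:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

public class SearcherRefresher {
    // Only close the old reader (and build a new searcher) when reopen()
    // actually returned a different instance.
    public static IndexSearcher refresh(IndexSearcher current) throws IOException {
        IndexReader oldReader = current.getIndexReader();
        IndexReader newReader = oldReader.reopen(true);
        if (newReader == oldReader) {
            return current;        // nothing changed; keep the old searcher
        }
        oldReader.close();         // a new reader replaced it
        return new IndexSearcher(newReader);
    }
}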
keen to
avoid that option if possible.
Is there a quick way to discover this information? All I need is a
list of terms (simple strings would be fine);
I don't care how many were found, or their positions, or anything else,
just which ones matched.
thoug
Thanks for the ref - didn't know about Pig before.
The language and approach look useful, so now I'm wondering whether it
could be used
across Lucene over Hadoop too. If data was indexed in Lucene and Pig knew that,
then it could make for an interesting alternate Lucene query language.
could this w
BTW, I am using Hibernate Search, but have the ability to do pure Lucene...
Thanks.
Jason.
Last term, field, TermEnum
On Tue, May 13, 2008 at 12:34 PM, Erick Erickson <[EMAIL PROTECTED]>
wrote:
> Find the last term of what? Document? Field in an index? Query?
>
> Best
> Erick
>
> On Tue, May 13, 2008 at 12:28 PM, Jason Rutherglen <
> [EMAIL PROTECTED]>
It is easy to find the first term using TermEnum. Is there a way to find
the last term without using StringIndex and binary search? Are there plans
to offer this functionality?
https://issues.apache.org/jira/browse/LUCENE-1278 solves this problem
On Tue, May 20, 2008 at 1:32 AM, Anshum <[EMAIL PROTECTED]> wrote:
> Hey Alex,
> I guess you haven't tried warming up the engine before putting it to use.
> Though it is one of the simpler implementations, you could try warming up the
That is an interesting problem.
https://issues.apache.org/jira/browse/LUCENE-1292 will build a tag index
that uses a ParallelReader to allow tag fields to be searchable. The tag
index does not use the usual IndexWriter but uses a specialized realtime
updateable index built for tags. Depending on
It would be interesting to see the results of using a custom IndexReader
that implements
http://dsiutils.dsi.unimi.it/docs/it/unimi/dsi/util/ImmutableExternalPrefixMap.html or
something like it. The only problem right now would be hooking into
the
Lucene SegmentMerger to merge other indices such as
Query time boosting has no bottlenecks. Storing will not affect
performance. You will probably want to use PrefixFilter and
ConstantScoreRangeQuery. Solr has ConstantScorePrefixQuery. This simply means that
if the document contains the term, the result will show; the scoring will
not be quite the same be
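A small sketch of the PrefixFilter-wrapped-in-ConstantScoreQuery approach mentioned above (the field name, prefix, and wrapper class are made up for illustration):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.ConstantScoreQuery;
import org.apache.lucene.search.PrefixFilter;
import org.apache.lucene.search.Query;

public class PrefixExample {
    // Every matching doc gets the same constant score; no scoring per term.
    public static Query titlePrefix(String prefix) {
        return new ConstantScoreQuery(new PrefixFilter(new Term("title", prefix)));
    }
}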
There needs to be a solution to that problem. I noticed it several years
ago, which is why, ever since, I have designed systems using MultiSearcher
concepts. There should only be one instance of deleted docs per IndexReader
now that there is reopen. Editing the live deleted docs does not seem like
so
Seeing strange behavior with RAMDirectory. Is a file designed to support
an IndexOutput being open concurrently with an IndexInput? I open an IndexInput
with the IndexOutput still open, with data written to the file previously, and the
IndexInput reports a file length of 0, while Directory.fileLength()
rep
ROTECTED]> wrote:
> Did you try calling flush() on the IndexOutput before opening the
> IndexInput?
>
> -Yonik
>
> On Thu, Jun 19, 2008 at 12:13 PM, Jason Rutherglen
> <[EMAIL PROTECTED]> wrote:
> > Seeing strange behavior with RAMDirectory. Is a file design
output.writeBytes(bytes, bytes.length);
output.flush();
System.out.println("fileLength: "+ramDirectory.fileLength("test"));
output = ramDirectory.createOutput("test");
IndexInput input = ramDirectory.openInput("test");
System.out.println("input l
oblem here).
>
> -Yonik
>
> On Thu, Jun 19, 2008 at 3:10 PM, Jason Rutherglen
> <[EMAIL PROTECTED]> wrote:
> > public void testMain() throws IOException {
> >RAMDirectory ramDirectory = new RAMDirectory();
> >IndexOutput output = ramDirectory.
Created a RAMDirectory like directory class that uses
ByteArrayRandomAccessIO from http://reader.imagero.com/uio/ to allow
concurrent random file access.
On Thu, Jun 19, 2008 at 3:33 PM, Jason Rutherglen <
[EMAIL PROTECTED]> wrote:
> Looks like it cannot be used for a log system t
I looked heavily at this. It requires a customization of TermInfosReader
whereby the tii (term dictionary) SegmentTermEnum is traversed looking for
the last term with a particular field. Once found, from that position the
tis SegmentTermEnum would need to be traversed again for the last term
w
Is there a class to do this?
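Absent such a class, a brute-force sketch (names are illustrative) that just walks the TermEnum to the last term of a field; linear in the number of terms, so slow for very large term dictionaries:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class LastTerm {
    // Walk every term of `field` and remember the last one seen.
    public static Term lastTerm(IndexReader reader, String field) throws IOException {
        Term last = null;
        TermEnum terms = reader.terms(new Term(field, ""));
        try {
            do {
                Term t = terms.term();
                if (t == null || !t.field().equals(field)) {
                    break;    // ran past the end of this field's terms
                }
                last = t;
            } while (terms.next());
        } finally {
            terms.close();
        }
        return last;
    }
}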
The scaling per machine should be linear. The overhead from the network is
minimal because the Lucene object sizes are not significant. Google mentions
in one of their early white papers on scaling
http://labs.google.com/papers/googlecluster-ieee.pdf that they have sub-indexes
which are now popular
could be done with indexes that are updated often; however, it would
seem to require a lot of work with possibly little to gain, unless you want
to offer the user 0.05 second response times.
On Fri, Jul 18, 2008 at 3:49 AM, Eric Bowman <[EMAIL PROTECTED]> wrote:
> Jason Rutherglen wrote:
A possible open source solution using a page based database would be to
store the documents in http://jdbm.sourceforge.net/ which offers BTree,
Hash, and raw page based access. One would use a primary key type of
persistent ID to look up the document data from JDBM.
Would be a good Lucene project
The contrib realtime search patch enables the functionality you described.
https://issues.apache.org/jira/browse/LUCENE-1313
On Wed, Aug 6, 2008 at 7:45 PM, Alex Wang <[EMAIL PROTECTED]> wrote:
>
> Hi all,
>
> To allow multiple users to concurrently add and delete docs and at the same time
> search the
Renaud, one optimization you can do on this is to try the first 10kb and
see if it finds text worth highlighting; if not, with a slight overlap
try the next 9.9kb - 19.9kb, or just 9.9kb -> end if you're feeling lazy.
This assumes that most good matches are at the start of the document.
Most queries should have results that meet that criterion.
Renaud Waldura wrote:
Jason:
Interesting idea, thanks. But how do you know whether the highlighting is
any good? I thought the highlighter implemented some kind of strategy to find
the best fragment.
Say my q
If you store a hash code of the word rather than the actual word you
should be able to search for stuff but not be able to actually retrieve
it; you can trade precision for "security" based on the number of bits
in the hash code (e.g. 32 or 64 bits). I'd think a 64-bit hash would be
a reasonab
documents
places on them and how much effort he thinks that a hacker might be
prepared to put into recovering the text.
The best you're ever going to do is to protect the index as well as
you do the original documents.
jch
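A rough sketch of that idea (the hash choice, width, and class name are my own assumptions, and this is not a security recommendation). Query terms have to be run through the same function before searching:

import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class TokenHasher {
    // Index the hash instead of the word: terms stay searchable (hash the
    // query terms the same way) but cannot be read back as plain text.
    public static String hash64(String token) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(token.toLowerCase().getBytes(Charset.forName("UTF-8")));
        long h = 0;
        for (int i = 0; i < 8; i++) {
            h = (h << 8) | (digest[i] & 0xFF);   // keep the first 64 bits
        }
        return Long.toHexString(h);
    }
}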
----
't
possibly score higher than my #10 result right now. In this situation
the idea of supplying a page start/end does become valuable in reducing
load and does not require maintaining state inside the engine.
Jason
Erick Erickson wrote:
Efficient in your situation, maybe. Good for everybody? Pro
--
Jason Pump
Technical Architect
Healthline
660 Third Street, Ste. 100
San Francisco, CA 94107
direct dial 415.281.3133
cell 510.812.1784
www.healthline.com
09 F9 11 02 9D 74 E3 5B D8 41 5
You're not using any type of phrase search. Try ->
( (title:"John Bush"^4.0) OR (body:"John Bush") ) AND ( (title:John^4.0
body:John) AND (title:Bush^4.0 body:Bush) )
or maybe
( (title:"John Bush"~4^4.0) OR (body:"John Bush"~4) ) AND (
(title:John^4.0 body:John) AND (title:Bush^4.0 body:Bush
returned
docs to remove dups using the guid field in the index.
This works fine when the results are under about 5,000 documents, but when
there is a large number of results a search takes way too long.
Does anyone know of a better and more efficient way t
when
I search multiple indexes.
--Jason
> You probably want to build a Filter.
>
> I've been planning to do exactly this on our own system, only our
> duplicates are indicated by documents having the same value in an MD5
> digest field, instead of a GUID field.
>
> For
There is also an open source Java anti-spam API which does a Bayesian scan of
email content (plus other stuff).
You could retrofit it to work with raw text.
www.jasen.org
(get the latest HEAD from CVS as the current release is a bit old... a new
version is imminent)
- Original Message -
From:
I just wrote some simple code to test this.
For my test I ran the test with 3 queries:
- A 3 term boolean
- A single term query with over 5000 hits
- A single term query with 0 hits
For each query I ran 4 tests of 10,000 searches:
1) using hits.length to get the counts and the standard si
I think the best way to tokenize/stem is to use the analyzer directly. For
example:
TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
Token token = null;
while ((token = ts.next()) != null) {
    Term newTerm = new Term(field, token.termText());  // termText() is the old (pre-2.9) Token API
}
I would think what you want to do is index on the stem, and rank on the
stem and the original form. After all, if you match exactly, then you
had better also match on the stem.
Robert Haycock wrote:
Hi,
I started using the EnglishStemmer and noticed that only the stem gets
added to the index. I woul
It's a string comparison. Making the "5" a "05" would be a simple workaround.
Jason
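A minimal sketch of that workaround, assuming the Lucene 2.x-era RangeFilter constructor; the field name, padding width, and class are made up, and negative values would need an offset scheme:

import org.apache.lucene.search.RangeFilter;

public class PaddedNumbers {
    private static final int WIDTH = 10;   // assumed maximum number of digits

    // Left-pad so lexicographic term order matches numeric order:
    // "0000000005" < "0000000012", whereas "5" > "12" as plain strings.
    public static String pad(long value) {
        return String.format("%0" + WIDTH + "d", value);
    }

    public static RangeFilter priceBetween(long low, long high) {
        return new RangeFilter("price", pad(low), pad(high), true, true);
    }
}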
Peter W. wrote:
Hello,
I'm trying to do a numerical search for a property in Lucene using
RangeFilter.Less
without using both RangeQuery and test cases.
Here's the code
e any standard way to do this?
--Jason
All,
I sent this the other day, but didn't get any responses. I'm hoping that it
was just missed, so I'm trying again.
There has to be a better way to insert a document into an index than
reindexing everything.
--Jason
On Wednesday 05 July 2006 5:06 pm, Jason Calabre
> When you say you keep your documents ordered alphabetically, it's confusing
> to me. Are you saying that you pre-sort all your documents then insert them
> one after another so that automatically-generated internal Lucene ID maps
> exactly to the alphabetical ordering? That is, for any document I
We only display 10 hits at a time, so we don't need to iterate through all
the hits.
It feels like there should be a way to pull a document out of one index and stick
it into another, bringing all the unstored fields along with it.
On Friday 07 July 2006 12:52, Erick Erickson wrote:
> Did you
One way to make an alphabetic sort very fast is to presort your docs
before adding them to the index. If you do this you can then just sort by
index order. We are using this for a large index (1 million+ docs) and it
works very well, and seems even slightly faster than relevance sorting.
Hello all,
I am experiencing some performance problems indexing large(ish) amounts of
text using the IndexField.Store.COMPRESS option when creating a Field in
Lucene.
I have a sample document which has about 4.5MB of text to be stored as
compressed data within the field, and the indexing of this
Thanks for the Jira issue...
one question on your synchronization comment...
I have "assumed" I can't have two threads writing to the index concurrently,
so have implemented my own read/write locking system. Are you saying I
don't need to bother with this? My reading of the doco suggests that y
Are you storing the contents of the fields in the index? That is,
specifying Field.Store.YES when creating the field?
In my experience fields which are not stored are not recoverable from the
index (well.. they can be reconstructed but it's a lossy process). So when
you retrieve the document,
I can share the data.. but it would be quicker for you to just pull out some
random text from anywhere you like.
The issue is that the text was in an email, which was one of about 2,000 and
I don't know which one. I got the 4.5MB figure from the number of bytes in
the byte array reported in the
the index. Lucene works
best when the index is light-weight. My recommendation is to think
carefully about the "role" of the index, vs the role of your data storage
approach.
On 8/11/06, Deepan Chakravarthy <[EMAIL PROTECTED]> wrote:
On Fri, 2006-08-11 at 01:58 +1000, Jason Po
Yes, you could use Lucene for this, but it may be overkill for your
requirement. If I understand you correctly, all you need to do is find
documents which match "any" of the words in your list? Do you need to rank
the results? If not, it's probably easier just to create your own inverted
index of
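A toy sketch of what that roll-your-own inverted index could look like (purely illustrative, no ranking; all names are made up):

import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class SimpleInvertedIndex {
    // word -> ids of the documents containing it
    private final Map<String, Set<Integer>> postings = new HashMap<String, Set<Integer>>();

    public void add(int docId, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            Set<Integer> docs = postings.get(word);
            if (docs == null) {
                docs = new HashSet<Integer>();
                postings.put(word, docs);
            }
            docs.add(docId);
        }
    }

    // "Match any word in the list": the union of the posting sets.
    public Set<Integer> matchAny(Collection<String> words) {
        Set<Integer> result = new HashSet<Integer>();
        for (String w : words) {
            Set<Integer> docs = postings.get(w);
            if (docs != null) {
                result.addAll(docs);
            }
        }
        return result;
    }
}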
Maybe I'm not understanding your requirement, but this should be fairly
simple in Lucene.
Each document in your document management system would be represented by a
single Lucene document in the index. Each lucene document will then have
several fields, each field representing the values of the
IMO you should avoid storing any data in the index that you don't need for
display. Lucene is an index (and a damn good one), not a database. If you
find yourself storing large amounts of data in the index, this could be an
indication that you may need to re-think your architecture.
In its simp
Sounds like you're a bit frustrated. Cheer up, the simple fact is that
engineering and business rarely see eye-to-eye. Just focus on the fact that
what you have learnt from the process will help you, and they paid for it ;)
On the issue at hand...Lucene should scale to this level, but you need
ync you should be ok.
On 8/11/06, Karel Tejnora <[EMAIL PROTECTED]> wrote:
Jason is right. I think (even though I'm not an expert on Lucene either) your newly
added document can't recreate terms for the field with the analyzer, because
the field text is empty.
There is a very hairy solution - hack an IndexRead
My advice would be the "back-to-basics" approach. Create a test case which
creates a simple index with a few documents, verify the index is as you
expect, then re-create the index and verify again. Run this test case on
your production environment (if you are able). This will determine once and
fferent threads accessing the
index. This would also explain why you see the problem in production and
not testing.
On 8/15/06, Jason Polites <[EMAIL PROTECTED]> wrote:
My advice would be the "back-to-basics" approach. Create a test case
which creates a simple index with a few do
, Shaghayegh Sahebie <[EMAIL PROTECTED]> wrote:
thanks Jason and Steve;
maybe I didn't understand your solution well, but in this system a
document is referred to many times (we have a refer description which we should
also index) and each time a document is referred to I should update
I'm not sure about the solution in the referenced thread. It will work, but
doesn't it run the risk of breaching the transaction isolation of the
database write?
The issue is when the index is notified of a database update. If it is
notified prior to the transaction commit, and the commit fails
Hi all,
When indexing with multiple threads, and under heavy load, I get the
following exception:
java.io.IOException: Access is denied
at java.io.WinNTFileSystem.createFileExclusively(Native Method)
at java.io.File.createNewFile(File.java:850)
at org.apache.lucene.store.FSDirectory$1.o
On 8/26/06, Michael McCandless <[EMAIL PROTECTED]> wrote:
Are you also running searchers against this index? Are they re-init'ing
frequently or being opened and then held open?
No searches running in my initial test, although I can't be certain what is
happening under the Compass hood.
This
due to any reason can be thought of as the same
thing, regardless of the reason (so long as it's logged).
Seems like the simplest solution too.
On 8/28/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
On 8/26/06, Jason Polites <[EMAIL PROTECTED]> wrote:
> Synchronization at this
]> wrote:
Doron Cohen wrote:
> "Jason Polites" <[EMAIL PROTECTED]> wrote on 27/08/2006 09:36:07:
>
>> I would have thought that simultaneous cross-JVM access to an index was
>> outside of scope of the core Lucene API (although it would be great),
but
Not sure what the desired end result is here, but you shouldn't need to
update the document just to give it a boost factor. This can be done in the
query string used to search the index.
As for updating affecting search order, I don't think you can assume any
guarantees in this regard. You're pr
Yeah.. I had a think about this, and I now remember why I originally came to
the conclusion about cross-JVM access.
When I was adding documents to the index, and searching at the same time
(from a different JVM) I would get the occasional (but regular)
FileNotFoundException.
I don't recall the
ound.. if that
helps.
On 8/28/06, Michael McCandless <[EMAIL PROTECTED]> wrote:
Jason Polites wrote:
> Yeah.. I had a think about this, and I now remember why I originally
> came to
> the conclusion about cross-JVM access.
>
> When I was adding documents to the index, and searc
Have you looked at the MoreLikeThis class in the similarity package?
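In case it helps, a minimal sketch of using it (it lives in the contrib "queries"/"similar" package; the field name and wrapper are assumptions):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.similar.MoreLikeThis;

public class SimilarDocs {
    // Build a "find more like this" query from the terms of an existing document.
    public static Query likeDoc(IndexReader reader, int docId) throws IOException {
        MoreLikeThis mlt = new MoreLikeThis(reader);
        mlt.setFieldNames(new String[] { "contents" });
        return mlt.like(docId);
    }
}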
On 8/30/06, Winton Davies <[EMAIL PROTECTED]> wrote:
Hi All,
I'm scratching my head - can someone tell me which class implements
an efficient multiple term TF.IDF Cosine similarity scoring mechanism?
There is clearly the sin
Hi all,
I understand that it is possible to "re-create" fields which are indexed but
not stored (as is done by Luke), and that this is a lossy process; however, I
am wondering whether the indexed version of this remains consistent.
That is, if I re-create a non-stored field, then re-index this fi
Is there a large list of words and their frequencies in the English
language? Obviously it would differ by corpus, but I would like to see
what's already available.
Thanks Boris,
Jason
Boris Aleksandrovsky wrote:
Jason,
You can look here:
http://www.cs.ualberta.ca/~lindek/downloads.htm
for word frequency counts from a 1.5B word corpus (TREC disks 1-5 and the
Reuters corpus <http://about.reuters.com/researchandstandards/corpus/>). The
word
Hey all,
I am using the StandardAnalyzer with my own list of stop words (which is
more comprehensive than the default list), and my expectation was that this
would omit these stop words from the index when data is indexed using this
analyzer. However, I am seeing stop words in the term vector fo
Original Message
From: Jason Polites <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Saturday, September 2, 2006 9:05:27 AM
Subject: Stop words in index
Hey all,
I am using the StandardAnalyzer with my own list of stop words (which is
more comprehensive than the default list), and m
", but not "on".
This is fine, and if the user searches for:
Disney on Ice
They will get a match. But, it seems that a search for:
"Disney on Ice"
With the quotations indicating the desire for an "exact match", the absence
of stop words in the index means this
e" which is what should have
gone in your doc when it was indexed using that analyzer.
:
: On 9/3/06, Jason Polites <[EMAIL PROTECTED]> wrote:
: >
: > Roger that. I'll double check my code.
: >
: > Thanks.
: >
: >
: > On 9/3/06, Otis Gospodnetic <[EMAIL PROT
I've also seen FileNotFound exceptions when attempting a search on an index
while it's being updated, and the searcher is in a different JVM. This is
supposed to be supported, but on Windows seems to regularly fail (for me
anyway).
The simplest solution to this would be a service oriented approa
ED]:false + [EMAIL PROTECTED]:"2004" + [EMAIL
PROTECTED]:"February" +
[EMAIL PROTECTED]:"Council"
can anyone tell me if this has been fixed somewhere or whether this was by
design? (I
cannot imagine that it is)
I know I have AND set by default but this should stil
if ((indexFile = new File(indexDir)).exists() &&
        indexFile.isDirectory())
{
    exists = false;
Isn't this backwards?
Couldn't you just do:
indexFile = new File(indexDir);
exists = (indexFile.exists() && indexFile.isDirectory());
-Original Message-
From: bib_lucene bib [mailto:
You could do it asynchronously. That is, separate the actual Lucene
search off into a different thread which does the search; the calling
thread simply waits up to a maximum time for the search thread
to complete, then queries the status of the search thread to get the
results obtained
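A rough sketch of that pattern with java.util.concurrent (the class, timeout, and hit count are my assumptions, and the search call assumes the TopDocs-returning Searcher API):

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

public class TimedSearch {
    private final ExecutorService pool = Executors.newCachedThreadPool();

    // Run the search on a worker thread and wait at most `millis` for it.
    public TopDocs search(final IndexSearcher searcher, final Query query, long millis)
            throws Exception {
        Future<TopDocs> result = pool.submit(new Callable<TopDocs>() {
            public TopDocs call() throws Exception {
                return searcher.search(query, null, 10);   // top 10 hits, no filter
            }
        });
        try {
            return result.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            result.cancel(true);    // give up; the caller decides what a timeout means
            return null;
        }
    }
}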
To add to other comments:
This functionality should also look at how common a term is in the corpus.
Using the corpus as the "correct" set of terms to search on isn't always what
you want if the corpus is unclean (misspellings, etc.)
I believe this is why if you search on an uncommon term, Google w
I had to do something similar, but I plan on re-writing it into something
more elegant. I hope this helps give you some ideas.
1. Create a QueryFilter on only those items that matched the criteria (have
a required clause in your boolean query)
2. Create a BitFilter which takes a BitSet from step
On 9/15/05, James Huang <[EMAIL PROTECTED]> wrote:
>
> Suppose I have a book index with field="publisher", field="title", etc.
> I want to search for books only from "Manning", do I have to do anything
> special? how?
>
add new BooleanClause(new TermQuery(new Term("publisher","Manning")), true,
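In current API terms (the required/prohibited booleans were replaced by BooleanClause.Occur), a sketch along the same lines; the title clause and wrapper class are purely illustrative:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class PublisherQuery {
    public static BooleanQuery build() {
        BooleanQuery query = new BooleanQuery();
        // publisher is required; the title clause only contributes to scoring
        query.add(new TermQuery(new Term("publisher", "Manning")), Occur.MUST);
        query.add(new TermQuery(new Term("title", "lucene")), Occur.SHOULD);
        return query;
    }
}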
the latest postings on this
topic were a few years old, I am wondering if there have been any changes in
Lucene query syntax to support searching for empty fields. Has anyone been
successfully searched for empty fields with recent Lucene releases?
Thanks
Jason
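Not a syntax change, but one common workaround is to match everything and exclude documents that have any value in the field (or, better, index an explicit "empty" marker at index time). A sketch, with the field name and class assumed; note this is expensive on large indexes:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.WildcardQuery;

public class EmptyFieldQuery {
    // Match every document, then exclude those with any value in the field.
    public static BooleanQuery missing(String field) {
        BooleanQuery q = new BooleanQuery();
        q.add(new MatchAllDocsQuery(), Occur.MUST);
        q.add(new WildcardQuery(new Term(field, "*")), Occur.MUST_NOT);
        return q;
    }
}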
ts of resources. Perhaps 8 GB of memory
is just simply not enough to handle an index of 600 million documents. But
before telling management that they must get more memory, I'd like to see if there
might be other ways to accomplish this.
Thanks in advance.
Jason
ified Index".
In this implementation, we have only one index file to manage.
I just want to get information as to how I am going to implement it in an
optimal way.
Any suggestion would be perfect! :)
Thanks!
Mark Jason Nacional
Junior Software Engineer
y to look for those that match the pattern.
Br.
Jason Jiao
>-Original Message-
>From: ext Daniel Noll [mailto:[EMAIL PROTECTED]
>Sent: Tuesday, August 26, 2008 10:50 AM
>To: java-user@lucene.apache.org
>Subject: Re: How to search
>
>Venkata Subbarayudu wrote:
>
not contain any
new features, API or file format changes, which makes it fully
compatible to 2.3.0 and 2.3.1".
Any hints?
Thanks in advance.
Jason Jiao