Looking for case studies for 'Lucene and Solr: The Definitive Guide' from O'Reilly

2012-12-17 Thread Jason Rutherglen
Cloud * Hadoop integration Thanks, Jason Rutherglen, Jack Krupansky, and Ryan Tabora http://shop.oreilly.com/product/0636920028765.do - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-ma

Re: RAMDirectory unexpectedly slows

2012-06-04 Thread Jason Rutherglen
t. Is that right? > > What about the ByteBufferDirectory? Can this specific directory utilize the > 2GB memory I grant to the app? > > On Mon, Jun 4, 2012 at 10:58 PM, Jason Rutherglen < > jason.rutherg...@gmail.com> wrote: > >> If you want the index to be stored

Re: RAMDirectory unexpectedly slows

2012-06-04 Thread Jason Rutherglen
If you want the index to be stored completely in RAM, there is the ByteBuffer directory [1]. Though I do not see the point in putting an index in RAM, it will be cached in RAM regardless in the OS system IO cache. 1. https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/ap

Re: frequent keyword computation within a search ( and timeinterval )

2012-01-05 Thread Jason Rutherglen
red > SUM, stats would do it. > > Erick > > On Thu, Jan 5, 2012 at 7:23 PM, Jason Rutherglen > wrote: >>> Short answer is that no, there isn't an aggregate >>> function. And you shouldn't even try >> >> If that is the case why does a 'st

Re: frequent keyword computation within a search ( and timeinterval )

2012-01-05 Thread Jason Rutherglen
> Short answer is that no, there isn't an aggregate > function. And you shouldn't even try If that is the case why does a 'stats' component exist for Solr with the SUM function built in? http://wiki.apache.org/solr/StatsComponent On Thu, Jan 5, 2012 at 1:37 PM, Erick Erickson wrote: > You will

BigInteger usage in numeric Trie range queries

2011-11-28 Thread Jason Rutherglen
Even though the NumericRangeQuery.new* methods do not support BigInteger, the underlying recursive algorithm supports any sized number. Has this been explored?

Re: ElasticSearch

2011-11-16 Thread Jason Rutherglen
The docs are slim on examples. On Wed, Nov 16, 2011 at 3:35 PM, Peter Karich wrote: > >>> even high complexity as ES supports lucene-like query nesting via JSON >> That sounds interesting.  Where is it described in the ES docs?  Thanks. > > "Think of the Query DSL as an AST of queries" > http://w

Re: ElasticSearch

2011-11-16 Thread Jason Rutherglen
> even high complexity as ES supports lucene-like query nesting via JSON That sounds interesting. Where is it described in the ES docs? Thanks. On Wed, Nov 16, 2011 at 1:36 PM, Peter Karich wrote: >  Hi, > > its not really fair to compare NRT of Solr to ElasticSearch. > ElasticSearch provides

Re: Index size and performance degradation

2011-06-13 Thread Jason Rutherglen
> deletions made by readers merely mark it for > deletion, and once a doc has been marked for deletions it is deleted for all > intents and purposes, right? There's the point-in-timeness of a reader to consider. > Does the N in NRT represent only the cost of reopening a searcher? Aptly put, and

Re: Index size and performance degradation

2011-06-13 Thread Jason Rutherglen
> I don't think we'd do the post-filtering solution, but instead maybe > resolve the deletes "live" and store them in a transactional data I think Michael B. aptly described the sequence ID approach for 'live' deletes? On Mon, Jun 13, 2011 at 3:00 PM, Michael McCandless wrote: > Yes, adding dele

Lucene Util question

2011-04-08 Thread Jason Rutherglen
Is http://code.google.com/a/apache-extras.org/p/luceneutil/ designed to replace or augment the contrib benchmark? For example it looks like SearchPerfTest would be useful for executing queries over a pre-built index. Though there's no indexing tool in the code tree?

Re: DocIdSet to represent small numberr of hits in large Document set

2011-04-05 Thread Jason Rutherglen
I think Solr has a HashDocSet implementation? On Tue, Apr 5, 2011 at 3:19 AM, Michael McCandless wrote: > Can we simply factor out (poach!) those useful-sounding classes from > Nutch into Lucene? > > Mike > > http://blog.mikemccandless.com > > On Tue, Apr 5, 2011 at 2:24 AM, Antony Bowesman > w

Append Codec random testing

2011-03-21 Thread Jason Rutherglen
I'm seeing an error when using the misc Append codec. java.lang.AssertionError at org.apache.lucene.store.ByteArrayDataInput.readBytes(ByteArrayDataInput.java:107) at org.apache.lucene.index.codecs.BlockTermsReader$FieldReader$SegmentTermsEnum._next(BlockTermsReader.java:661) at org.apache.luce

Is ConcurrentMergeScheduler useful for multiple running IndexWriter's?

2011-03-04 Thread Jason Rutherglen
ConcurrentMergeScheduler is tied to a specific IndexWriter; however, if we're running in an environment with several writers (such as Solr's multiple cores, and other similar scenarios) then we'd have a CMS per IW. I think this effectively disables CMS's max thread merge throttling feature?

Re: Last/max term in Lucene 4.x

2011-02-21 Thread Jason Rutherglen
ordered IDs stored in the index, so that remaining documents (that lets say were left in RAM prior to process termination) can be indexed. It's an inferred transaction checkpoint. On Mon, Feb 21, 2011 at 5:31 AM, Michael McCandless wrote: > On Sun, Feb 20, 2011 at 8:47 PM, Jason Rutherglen &

Re: Last/max term in Lucene 4.x

2011-02-20 Thread Jason Rutherglen
rd. How would I seek to the last term in the index using VarGaps? Or do I need to interact directly with the FST class (and if so I'm not sure what to do there either). Thanks Mike. On Sun, Feb 20, 2011 at 2:51 PM, Michael McCandless wrote: > On Sat, Feb 19, 2011 at 8:42 AM, Jason Rutherg

Re: Last/max term in Lucene 4.x

2011-02-19 Thread Jason Rutherglen
that supports ord (eg FixedGap). > > Mike > > On Fri, Feb 18, 2011 at 9:24 PM, Jason Rutherglen > wrote: >> This could be a rhetorical question.  The way to find the last/max >> term that is a unique per document is to use TermsEnum to seek to the >> first term of a

Last/max term in Lucene 4.x

2011-02-18 Thread Jason Rutherglen
This could be a rhetorical question. The way to find the last/max term that is unique per document is to use TermsEnum to seek to the first term of a field, then call seek to the docFreq-1 for the last ord, then get the term, or is there a better/faster way?

Re: Storing an ID alongside a document

2011-02-03 Thread Jason Rutherglen
> there is a entire RAM resident part and a Iterator API that reads / > streams data directly from disk. > look at DocValuesEnum vs, Source Nice, thanks! On Thu, Feb 3, 2011 at 12:20 AM, Simon Willnauer wrote: > On Thu, Feb 3, 2011 at 3:23 AM, Jason Rutherglen > wrote: >>

Re: Storing an ID alongside a document

2011-02-02 Thread Jason Rutherglen
s branch) > > -Yonik > http://lucidimagination.com > > > On Wed, Feb 2, 2011 at 1:03 PM, Jason Rutherglen > wrote: > >> I'm curious if there's a new way (using flex or term states) to store >> IDs alongside a document and retrieve the IDs of the top N resul

Storing an ID alongside a document

2011-02-02 Thread Jason Rutherglen
I'm curious if there's a new way (using flex or term states) to store IDs alongside a document and retrieve the IDs of the top N results? The goal would be to minimize HD seeks, and not use field caches (because they consume too much heap space) or the doc stores (which require two seeks). One pos

Re: API access to in-memory tii file (3.x not flex).

2010-11-10 Thread Jason Rutherglen
Yeah that's customizing the Lucene source. :) I should have gone into more detail, I will next time. On Wed, Nov 10, 2010 at 2:10 PM, Michael McCandless wrote: > Actually, the .tii file pre-flex (3.x) is nearly identical to the .tis > file, just that it only contains every 128th term. > > If you

Re: API access to in-memory tii file (3.x not flex).

2010-11-10 Thread Jason Rutherglen
In a word, no. You'd need to customize the Lucene source to accomplish this. On Wed, Nov 10, 2010 at 1:02 PM, Burton-West, Tom wrote: > Hello all, > > We have an extremely large number of terms in our indexes.  I want to be able > to extract a sample of the terms, say something like every 128th

Re: Recreate segment infos

2010-10-05 Thread Jason Rutherglen
egment is given the same name as the first segment that > shares it.  However, unfortunately, because of merging, it's possible > that this mapping is not easy (maybe not possible, depending on the > merge policy...) to reconstruct.  I think this'll be the hardest part > :) > &

Recreate segment infos

2010-10-04 Thread Jason Rutherglen
Let's say the segment infos file is missing. I'm aware of CheckIndex, but is there a tool to recreate a segment infos file?

Re: Last Call: Lucene Revolution CFP Closes Tomorrow Wednesday, June 23, 2010, 12 Midnight PDT

2010-06-22 Thread Jason Rutherglen
Grant, I can probably do the 3 billion document one from Prague, or a realtime search one... I spaced on submitting for ApacheCon. Are there cool places in the Carolinas to hang? Cheers bro, Jason On Tue, Jun 22, 2010 at 10:51 AM, Grant Ingersoll wrote: > Lucene Revolution Call For Particip

Monitoring low level IO

2010-06-03 Thread Jason Rutherglen
This is more of a unix related question than Lucene specific however because Lucene is being used, I'm asking here as perhaps other people have run into a similar issue. On an Amazon EC2 merge, read, and write operations are possibly blocking due to underlying IO. Is there a tool that you have use

Re: If you could have one feature in Lucene...

2010-02-25 Thread Jason Rutherglen
long - whatever > happened to CSF? That feature is so 2006, and we still > don't have it? I'm completely disturbed about the whole situation myself. > > Who the heck is in charge here? > > On 02/25/2010 12:51 PM, Jason Rutherglen wrote: >> >> It'd be great to

Re: IndexWriter.getReader.getVersion behavior

2010-02-22 Thread Jason Rutherglen
Peter, Perhaps other concurrent operations? Jason On Tue, Feb 23, 2010 at 10:43 AM, Peter Keegan wrote: > Using Lucene 2.9.1, I have the following pseudocode which gets repeated at > regular intervals: > > 1. FSDirectory dir = FSDirectory.open(java.io.File); > 2. dir.setLockFactory(new SingleIn

Re: Analyzer for stripping non alpha-numeric characters?

2010-02-04 Thread Jason Rutherglen
Answering my own question... PatternReplaceFilter doesn't output multiple tokens... Which means messing with capture state... On Thu, Feb 4, 2010 at 2:16 PM, Jason Rutherglen wrote: > Transferred partially to solr-user... > > Steven, thanks for the reply! > > I wonder if

Re: Analyzer for stripping non alpha-numeric characters?

2010-02-04 Thread Jason Rutherglen
wrote: > Hi Jason, > > Solr's PatternReplaceFilter(ts, "\\P{Alnum}+$", "", false) should work, > chained after an appropriate tokenizer. > > Steve > > On 02/04/2010 at 12:18 PM, Jason Rutherglen wrote: >> Is there an anal
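The pattern suggested in this reply can be tried outside Solr with plain `java.util.regex`. The following is a minimal sketch, not Solr's `PatternReplaceFilter` itself; the class and method names are hypothetical and only the regular expression comes from the message.

```java
import java.util.regex.Pattern;

// Hypothetical helper: applies the regex from the reply to a single token string.
final class TrailingNonAlnum {
    // One or more non-alphanumeric characters anchored at the end of the token.
    private static final Pattern TRAILING = Pattern.compile("\\P{Alnum}+$");

    static String strip(String token) {
        return TRAILING.matcher(token).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("hello!!"));
        System.out.println(strip("foo-bar."));
        System.out.println(strip("abc"));
    }
}
```

Only trailing characters are removed, so an inner hyphen such as the one in `foo-bar.` is kept. Note that `\p{Alnum}` in Java is ASCII-only unless `Pattern.UNICODE_CHARACTER_CLASS` is set, so accented trailing letters would also be stripped by this sketch.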

Analyzer for stripping non alpha-numeric characters?

2010-02-04 Thread Jason Rutherglen
Is there an analyzer that easily strips non-alphanumeric characters from the end of a token?

Re: file open handles?

2010-01-26 Thread Jason Rutherglen
246663 /var/index/vol201001/_5q5.cfs (deleted) > > On 2010/01/26 10:09 PM, Jamie wrote: >> >> HI Jason >> >> Thanks a ton. Problem solved. No more stray file handles! >> >> Jamie >> >> On 2010/01/26 10:03 PM, Jason Rutherglen wrote: >>>

Re: file open handles?

2010-01-26 Thread Jason Rutherglen
st switched over to > using the the writer.getReader() method and was worried if I closed the > Reader that the Writer would be closed too. Is this misguided? > > Jamie > > > On 2010/01/26 09:40 PM, Jason Rutherglen wrote: >> >> Jamie, >> >> Are you calling c

Re: file open handles?

2010-01-26 Thread Jason Rutherglen
Jamie, Are you calling close on the reader? Jason On Tue, Jan 26, 2010 at 11:23 AM, Jamie wrote: > Hi Erick > > Our app is a long running server. Is it a problem if indexes are never > closed? Our searchers > do see the latest snapshot as we use writer.getReader() method for fast > searches. >

Re: Tag Index patch (LUCENE-1292) status?

2010-01-21 Thread Jason Rutherglen
E-1879 stuff), then do I need to manually create two indexes, one > for my static fields and one for my tags? (I would need to be careful > about how I coordinated these indexes, so I could use a ParallelReader > with them.) Or is there only one index, and the tag fields are > updat

Re: Tag Index patch (LUCENE-1292) status?

2010-01-19 Thread Jason Rutherglen
Hi Chris, It's not actively being worked on. Are you interested in working on it? Jason On Tue, Jan 19, 2010 at 4:42 PM, Chris Harris wrote: > I'm interested in the Tag Index patch (LUCENE-1292), in particular > because of how it enables you to modify certain fields without > reindexing a whol

Re: Max Segmentation Size when Optimizing Index

2010-01-13 Thread Jason Rutherglen
m/ -- Solr - Lucene - Nutch > > > > > ____ > From: Jason Rutherglen > To: java-user@lucene.apache.org > Sent: Wed, January 13, 2010 5:54:38 PM > Subject: Re: Max Segmentation Size when Optimizing Index > > Yes... You could hack LogMergePolicy to do something else.

Re: Max Segmentation Size when Optimizing Index

2010-01-13 Thread Jason Rutherglen
______ > From: Jason Rutherglen > To: java-user@lucene.apache.org > Sent: Wed, January 13, 2010 5:54:38 PM > Subject: Re: Max Segmentation Size when Optimizing Index > > Yes... You could hack LogMergePolicy to do something else. > > I use optimise(numse

Re: Max Segmentation Size when Optimizing Index

2010-01-13 Thread Jason Rutherglen
Chavalittumrong wrote: > Seems like optimize() only cares about final number of segments rather than > the size of the segment. Is it so? > > On Wed, Jan 13, 2010 at 2:35 PM, Jason Rutherglen < > jason.rutherg...@gmail.com> wrote: > >> There's a different method

Re: Max Segmentation Size when Optimizing Index

2010-01-13 Thread Jason Rutherglen
is only used during index time and will be ignored > by the Optimize() process? > > > On Wed, Jan 13, 2010 at 1:57 PM, Jason Rutherglen < > jason.rutherg...@gmail.com> wrote: > >> Oh ok, you're asking about optimizing... I think that's a different >&g

Re: Max Segmentation Size when Optimizing Index

2010-01-13 Thread Jason Rutherglen
olicy.setMaxMergeMB(100) > will prevent > merging of two segments that is larger than 100 Mb each at the optimizing > time? > > If so, why do think would I still see segment that is larger than 200 MB? > > > > On Wed, Jan 13, 2010 at 1:43 PM, Jason Rutherglen <

Re: Max Segmentation Size when Optimizing Index

2010-01-13 Thread Jason Rutherglen
Hi Trin, There was recently a discussion about this, the max size is for the before merge segments, rather than the resultant merged segment (if that makes sense). It'd be great if we had a merge policy that limited the resultant merged segment, though that'd be a rough approximation at best. Jas

Re: Term Frequency for phrases

2010-01-08 Thread Jason Rutherglen
I'm not going to go into too much code level detail, however I'd index the phrases using tri-gram shingles, and as uni-grams. I think this'll give you the results you're looking for. You'll be able to quickly recall the count of a given phrase aka tri-gram such as "blue_shorts_burough" On Fri, J
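The idea of indexing both uni-grams and tri-gram shingles can be illustrated without Lucene. This is a rough sketch under stated assumptions: whitespace tokenization, lowercasing, and an underscore separator to match the `blue_shorts_burough` notation in the message; the class and method names are hypothetical and this is not Lucene's shingle filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: count uni-grams and tri-gram shingles in a text.
final class ShingleCounts {
    // Joins every run of n consecutive tokens with '_' (separator is an assumption).
    static List<String> shingles(String[] tokens, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= tokens.length; i++) {
            out.add(String.join("_", Arrays.copyOfRange(tokens, i, i + n)));
        }
        return out;
    }

    // Counts both uni-grams and tri-grams, as the message suggests indexing both.
    static Map<String, Integer> count(String text) {
        String[] tokens = text.toLowerCase().trim().split("\\s+");
        Map<String, Integer> counts = new HashMap<>();
        for (String s : shingles(tokens, 1)) counts.merge(s, 1, Integer::sum);
        for (String s : shingles(tokens, 3)) counts.merge(s, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = count("blue shorts burough blue shorts burough");
        System.out.println(counts.get("blue_shorts_burough"));
    }
}
```

With the phrase stored as a single term, its frequency becomes a plain term lookup rather than a positional phrase match.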

Re: Is there a way to limit the size of an index?

2010-01-07 Thread Jason Rutherglen
The naming is unclear, when I looked at this I had to thumb through the code a fair bit before discerning if it was the input segments or the output segment of a merge (it's the former). Though I find the current functionality somewhat odd because it will inherently exceed the given size with a mer
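A toy model may make the point concrete: if the size limit is checked against each input segment, the merged output can still exceed it. This sketch is not LogMergePolicy's actual selection logic; the class name, the `<=` comparison, and the assumption that the merged size equals the sum of the inputs (ignoring deletes) are all simplifications made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical toy model of a size limit applied to merge inputs only.
final class MergeInputLimit {
    // Keeps the segments whose own size is within the limit (inputs, not output).
    static List<Long> eligibleInputs(List<Long> segmentSizesMb, long maxMergeMb) {
        List<Long> out = new ArrayList<>();
        for (long size : segmentSizesMb) {
            if (size <= maxMergeMb) out.add(size);
        }
        return out;
    }

    // Simplification: merged size is taken as the sum of the input sizes.
    static long mergedSize(List<Long> inputs) {
        long total = 0;
        for (long s : inputs) total += s;
        return total;
    }

    public static void main(String[] args) {
        List<Long> inputs = eligibleInputs(Arrays.asList(90L, 90L, 150L), 100L);
        System.out.println(inputs + " -> merged " + mergedSize(inputs) + " MB");
    }
}
```

Two 90 MB segments both pass a 100 MB input limit, yet their merge is roughly 180 MB, which is how a segment larger than the configured size can appear.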

CJKAnalyzer phrase slop?

2009-12-13 Thread Jason Rutherglen
Does CJK support phrase slop? (I'm assuming no)

Re: NearSpansUnordered payloads not returning all the time

2009-12-09 Thread Jason Rutherglen
f it already fell on a prior span. > > Mike > > On Wed, Dec 9, 2009 at 11:25 AM, Jason Rutherglen > wrote: >> Right we're getting the spans, however it's just the payloads that are >> missing, randomly... >> >> On Wed, Dec 9, 2009 at 2:23 AM, Michae

Re: NearSpansUnordered payloads not returning all the time

2009-12-09 Thread Jason Rutherglen
if that included sometimes > missing payloads... > > Mike > > On Tue, Dec 8, 2009 at 7:34 PM, Jason Rutherglen > wrote: >> Howdy, >> >> I am wondering if anyone has seen >> NearSpansUnordered.getPayload() not return payloads that are >> verifiably ac

NearSpansUnordered payloads not returning all the time

2009-12-08 Thread Jason Rutherglen
Howdy, I am wondering if anyone has seen NearSpansUnordered.getPayload() not return payloads that are verifiably accessible via IR.termPositions? It's a bit confusing because most of the time they're returned properly. I suspect the payload logic gets tripped up in NearSpansUnordered. I'll put to

Re: heap memory issues when sorting by a string field

2009-12-07 Thread Jason Rutherglen
m I mentioned above (but >> I haven't looked at the code yet). >> >> It's an apache license - but you mentioned something about no third party >> libraries. Is that a policy for Lucene? >> >> Thanks, >> >> Tom >> >> >> On Mon, Dec 7

Re: heap memory issues when sorting by a string field

2009-12-07 Thread Jason Rutherglen
> Thanks, > > Tom > > > On Mon, Dec 7, 2009 at 4:44 PM, Jason Rutherglen > wrote: > >> I wonder if Google Collections (even though we don't use third party >> libraries) concurrent map, which supports weak keys, handles the >> removal of weakly referenc

Re: IndexWriter creates multiple .cfs files

2009-12-07 Thread Jason Rutherglen
RB, That's expected behavior, each .cfs corresponds to all of a segment's files. You could write your own directory implementation that underneath writes to a single file. It's usually good to present what you're trying to accomplish (i.e. the why). Jason On Mon, Dec 7, 2009 at 10:25 PM, Cool Th

Re: heap memory issues when sorting by a string field

2009-12-07 Thread Jason Rutherglen
I wonder if Google Collections (even though we don't use third party libraries) concurrent map, which supports weak keys, handles the removal of weakly referenced keys in a more elegant way than Java's WeakHashMap? On Mon, Dec 7, 2009 at 4:38 PM, Tom Hill wrote: > Hi - > > If I understand correct

Re: Disk full while optimizing an index

2009-11-30 Thread Jason Rutherglen
Siraj, You could estimate the maximum size used during optimization at 2.5 (a sort of rough maximum) times your current index size, and not optimize if your index (at 2.5 times) would exceed your allowable disk space. Jason On Mon, Nov 30, 2009 at 2:50 PM, Siraj Haider wrote: > Index optimizati
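The rule of thumb in this reply can be written as a small pre-check. This is a sketch only: the 2.5 factor is the rough maximum quoted in the message, while the class name, method names, and the way the index size is measured (summing regular files in one directory) are assumptions made for illustration.

```java
import java.io.File;

// Hypothetical pre-check before optimizing: compare 2.5x index size to a disk budget.
final class OptimizeSpaceCheck {
    // Rough peak-usage factor quoted in the message.
    static final double PEAK_FACTOR = 2.5;

    // Sums the sizes of the regular files directly inside the index directory.
    static long indexSizeBytes(File indexDir) {
        long total = 0;
        File[] files = indexDir.listFiles();
        if (files == null) return 0; // not a directory or not readable
        for (File f : files) {
            if (f.isFile()) total += f.length();
        }
        return total;
    }

    // True when the estimated peak size stays within the allowed disk budget.
    static boolean safeToOptimize(long indexBytes, long allowedBytes) {
        return indexBytes * PEAK_FACTOR <= allowedBytes;
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        long size = indexSizeBytes(dir);
        long budget = size + dir.getUsableSpace(); // current index plus free space
        System.out.println("safe to optimize: " + safeToOptimize(size, budget));
    }
}
```

Here the budget is taken as the current index size plus the free space on the same volume; a stricter quota could be passed instead.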

Re: NearSpansUnordered payloads

2009-11-25 Thread Jason Rutherglen
I don't mind adding the "positions" of the payloads in them. However, maybe we can be little more clear in the javadocs what's going on underneath? On Wed, Nov 25, 2009 at 5:36 AM, Mark Miller wrote: > Grant Ingersoll wrote: >> On Nov 20, 2009, at 6:49 PM, Jason Ru

Re: Is Lucene a good choice for PB scale mailbox search?

2009-11-23 Thread Jason Rutherglen
A sharded architecture (i.e. smaller indexes) used by Google for example and implemented by open source in the Katta project may be best for scaling to sizable levels. Katta is also useful for redundancy and fault tolerance. On Mon, Nov 23, 2009 at 6:35 PM, fulin tang wrote: > We are going to ad

Re: ConcurrentMergeScheduler, Exception and transaction

2009-11-20 Thread Jason Rutherglen
Teruhiko, The index remains consistent even when a background merge fails, meaning commit truly represents a valid index after it's called. You can share merge schedulers, though in practice it's not going to improve anything. Jason 2009/11/20 Teruhiko Kurosaka : > I was experimenting how Lucene

NearSpansUnordered payloads

2009-11-20 Thread Jason Rutherglen
I'm interested in getting the payload information from the matching span, however it's unclear from the javadocs why NearSpansUnordered is different than NearSpansOrdered in this regard. NearSpansUnordered returns payloads in a hash set that's computed each method call by iterating over the SpanCe

Re: Verbose logging via ant, get an OOM

2009-11-12 Thread Jason Rutherglen
> Raise -Xmx, there is a setting in common-build.xml or build.xml >> >> - >> Uwe Schindler >> H.-H.-Meier-Allee 63, D-28213 Bremen >> http://www.thetaphi.de >> eMail: u...@thetaphi.de >> >>> -Original Message- >>> From: Jason R

Verbose logging via ant, get an OOM

2009-11-12 Thread Jason Rutherglen
Is there a setting to fix this? [junit] Exception in thread "main" java.lang.OutOfMemoryError: Java heap space [junit] at java.util.Arrays.copyOf(Arrays.java:2882) [junit] at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100) [junit] at java.lang

Re: IndexWriter.close() no longer seems to close everything

2009-11-12 Thread Jason Rutherglen
If there's a bug you're seeing, it's helpful to open an issue and post code reproducing it. On Wed, Nov 11, 2009 at 3:41 AM, Albert Juhe wrote: > > I think that this is the best way to proceed. > > thank you Mike > > > > Michael McCandless-2 wrote: >> >> Can you narrow the leak down to a small se

Re: Realtime search best practices

2009-10-12 Thread Jason Rutherglen
Hi Cedric, There is a wiki page on NRT at: http://wiki.apache.org/lucene-java/NearRealtimeSearch Feel free to ask questions if there's not enough information. -J On Mon, Oct 12, 2009 at 2:24 AM, melix wrote: > > Hi, > > I'm going to replace an old reader/writer synchronization mechanism we had

Re: Realtime & distributed

2009-10-10 Thread Jason Rutherglen
ust plain > disappointing.* > >        Thanks Jake for the clarification, and Eric, let me know if you to > know more in detail with how we are dealing with realtime indexing/search > with Zoie here at linkedin in a production environment powering a real > internet company with real

Re: Realtime & distributed

2009-10-09 Thread Jason Rutherglen
variety of configurations. The best way to go about >> this is to post benchmarks that others may run in their >> environment which can then be tweaked for their unique edge >> cases. I wish I had more time to work on it. >> >> -J >> >> On Thu, Oct 8, 2009

Re: Realtime & distributed

2009-10-09 Thread Jason Rutherglen
on it. -J On Thu, Oct 8, 2009 at 8:18 PM, Jake Mannix wrote: > Jason, > > On Thu, Oct 8, 2009 at 7:56 PM, Jason Rutherglen > wrote: > >> Today near realtime search (with or without SSDs) comes at a >> price, that is reduced indexing speed due to continued in RAM >&g

Re: Realtime & distributed

2009-10-08 Thread Jason Rutherglen
Eric, Katta doesn't require HDFS which would be slow to search on, though Katta can be used to copy indexes out of HDFS onto local servers. The best bet is hardware that uses SSDs because merges and update latency will greatly decrease and there won't be a synchronous IO issue as there is with har

Re: Reverse stemmer?

2009-10-08 Thread Jason Rutherglen
Out of curiosity and perhaps for practical purposes, how does one handle mixed language documents? I suppose one could extract the words of a particular language and place it in a lang specific field? Are there libraries to perform this (yet)? On Thu, Oct 8, 2009 at 6:32 AM, Christian Reuschling

Index splitter

2009-10-07 Thread Jason Rutherglen
We have a way to merge indexes together with IW.addIndexes, however not the opposite, split up an index with multiple segments. I think I can simply manufacture a new segmentinfos in a new directory, copy over the segments files from those segments, delete the copied segments from the source, and

Re: Best strategy for reindexing large amount of data

2009-10-07 Thread Jason Rutherglen
Maarten, Depending on the hardware available you can use a Hadoop cluster to reindex more quickly. With Amazon EC2 one can spin up several nodes, reindex, then tear them down when they're no longer needed. Also you can simply update in place the existing documents in the index, though you'd need t

Re: How to setup a scalable deployment?

2009-10-06 Thread Jason Rutherglen
Chris, It sounds like you're on the right track. Have you looked at Solr which uses the rsync/Java replication method you mentioned? Replication and near realtime in Solr aren't quite there yet, however it wouldn't be too hard to add it. -J On Tue, Oct 6, 2009 at 3:57 PM, Chris Were wrote: > Hi

Re: Efficiently reopening remotely-distributed indexes in 2.9?

2009-10-05 Thread Jason Rutherglen
I'm not sure I understand the question. You're trying to reopen the segments that you're replicated and you're wondering what's changed in Lucene? On Mon, Oct 5, 2009 at 5:30 PM, Nigel wrote: > Anyone have any ideas here?  I imagine a lot of other people will have a > similar question when trying

Re: Concurrent Indexing and Searching

2009-09-25 Thread Jason Rutherglen
It depends on whether or not the commit completes before the reopen. Lucene 2.9 adds an IndexWriter.getReader method that will always return with the latest modifications to your index. So if you're adding many documents, you can at anytime, call IW.getReader and you will be able to search the cha

Re: Index docstore flush problem

2009-09-10 Thread Jason Rutherglen
he fdx file > size is 3748 (= 4 + 468*8), yet the file size is far larger than that > (298404). > > How repeatable is it?  Can you turn on infoStream, get the exception > to happen, then post the resulting output? > > Mike > > On Thu, Sep 10, 2009 at 7:19 PM, Jason Ruther
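The arithmetic quoted in this reply can be checked directly. The formula 4 + 8 per document is taken from the message; reading the 4 as a fixed header and the 8 as one entry per document is an assumption, and the class and method names are hypothetical.

```java
// Hypothetical check of the size formula quoted in the reply: 4 + numDocs * 8.
final class FdxSizeCheck {
    // Expected .fdx length in bytes for a given document count, per the quoted formula.
    static long expectedFdxBytes(int numDocs) {
        return 4L + 8L * numDocs;
    }

    // Inverse of the same formula: how many entries a given file length would imply.
    static long impliedEntries(long fileBytes) {
        return (fileBytes - 4L) / 8L;
    }

    public static void main(String[] args) {
        System.out.println(expectedFdxBytes(468));
        System.out.println(impliedEntries(298404L));
    }
}
```

Under the same formula, the observed length of 298404 bytes would correspond to 37300 entries rather than 468, which gives a sense of how large the mismatch is.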

Index docstore flush problem

2009-09-10 Thread Jason Rutherglen
I'm seeing a strange exception when indexing using the latest Solr rev on EC2. org.apache.solr.client.solrj.SolrServerException: org.apache.solr.client.solrj.SolrServerException: java.lang.RuntimeException: after flush: fdx size mismatch: 468 docs vs 298404 length in bytes of _0.fdx at or

Re: Extending Sort/FieldCache

2009-09-10 Thread Jason Rutherglen
I think CSF hasn't been implemented because it's only marginally useful yet requires fairly significant rewrites of core code (i.e. SegmentMerger) so no one's picked it up including myself. An interim solution that fulfills the same function (quickly loading field cache values) using what works rel

Re: JVM bug?

2009-08-28 Thread Jason Rutherglen
> - Mark > > http://www.lucidimagination.com > > > > Jason Rutherglen wrote: >> While indexing with the latest nightly build of Solr on Amazon EC2 the >> following JVM bug has occurred twice on two different servers. >> >> Post the log to a Jira issue? >>

JVM bug?

2009-08-28 Thread Jason Rutherglen
While indexing with the latest nightly build of Solr on Amazon EC2 the following JVM bug has occurred twice on two different servers. Post the log to a Jira issue? java version "1.6.0_07" Java(TM) SE Runtime Environment (build 1.6.0_07-b06) Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed

Re: Is there a way to check for field "uniqueness" when indexing?

2009-08-26 Thread Jason Rutherglen
Daniel, You may want to look at SOLR-1375 which enables ID checking using a BloomFilter (with a specified error rate of false positives). Otherwise for what you're trying to do, you'd need to create a hash map? -J On Thu, Aug 13, 2009 at 7:33 AM, Daniel Shane wrote: > Hi all! > > I'm currently ru
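For readers unfamiliar with the data structure, here is a generic Bloom filter over string IDs. It is a minimal sketch and not the SOLR-1375 implementation; the class name, the double-hashing scheme, and the sizing in `main` are assumptions chosen for brevity.

```java
import java.util.BitSet;

// Hypothetical minimal Bloom filter for "have I seen this ID before?" checks.
final class IdBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    IdBloomFilter(int numBits, int numHashes) {
        if (numBits <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("numBits and numHashes must be positive");
        }
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Double hashing: position_i = h1 + i * h2 (mod numBits), with h2 forced odd.
    private int position(String id, int i) {
        int h1 = id.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1;
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(String id) {
        for (int i = 0; i < numHashes; i++) bits.set(position(id, i));
    }

    // false means definitely not added; true means possibly added (false positives occur).
    boolean mightContain(String id) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(id, i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        IdBloomFilter seen = new IdBloomFilter(1 << 16, 3);
        seen.add("doc-42");
        System.out.println(seen.mightContain("doc-42"));
    }
}
```

A negative answer is definitive, so only IDs that the filter reports as possibly present need a confirming lookup against the index; the false-positive rate depends on the bit count, hash count, and number of IDs added.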

Re: Lucene SORT does a sort on entire index..how do I filter SORT?

2009-08-21 Thread Jason Rutherglen
even hits. > > Is there no way to limit the sorting to only the documents that were found > in the query? > > Thanks > > > > Jason Rutherglen-2 wrote: >> >> Take a look at contrib/spatial. >> >> On Fri, Aug 21, 2009 at 7:00 AM, javaguy44 wrot

Re: Lucene SORT does a sort on entire index..how do I filter SORT?

2009-08-21 Thread Jason Rutherglen
Take a look at contrib/spatial. On Fri, Aug 21, 2009 at 7:00 AM, javaguy44 wrote: > > Hi, > > I'm currently looking at sorting in lucene, and to get started I took a look > at the distance sorting example from the Lucene in Action book. > > Working through the test DistanceSortingTest, I've notice

Re: Bizarre indexing issue where thousands of files get created

2009-08-18 Thread Jason Rutherglen
Micah, If you can post some of your code, it may be easier to identify the problem you're experiencing. -J On Tue, Aug 18, 2009 at 9:55 AM, Micah Jaffe wrote: > Hi, thanks for the response!  The (custom) searchers that are falling out of > cache are indeed calling close on their IndexReader in f

Complexity of PhraseQuery slop?

2009-08-12 Thread Jason Rutherglen
In trying to calculate the cost of various slop settings for phrase queries, what's the time complexity? O(n) or O(n^2)?

New more affordable and performant Intel SSDs

2009-07-22 Thread Jason Rutherglen
http://arstechnica.com/hardware/news/2009/07/intels-new-34nm-ssds-cut-prices-by-60-percent-boost-speed.ars For me the price on the 80GB is now within reason for a $1300 SuperMicro quad-core 12GB RAM type of server.

Re: speed of BooleanQueries on 2.9

2009-07-16 Thread Jason Rutherglen
be honest, I do not know is anyone today runs high volume search from disk > (maybe SSD), even than, significant portion has to be in RAM... > > One day we could throw many CPUs at Query... but this is not an easy one... > > > > > > - Original Message >> F

Re: speed of BooleanQueries on 2.9

2009-07-16 Thread Jason Rutherglen
Do we think that we'll be able to support indexing stop words using PFOR (with relaxation on the compression to gain performance?) Today it seems like the best approach to indexing stop words is to use shingles? However this blows up the term dict because shingles concatenates phrases together. On

Anyone used org.apache.lucene.analysis.compound.hyphenation.TernaryTree?

2009-07-14 Thread Jason Rutherglen
Just wondering if it works and if it's a good fit for autosuggest?

Re: Optimizing unordered queries

2009-07-07 Thread Jason Rutherglen
Ah ok, I was thinking we'd wait for the new flex indexing patch. I had started working along these lines before and will take it on as a project (which is I believe reducing the memory consumption of the term dictionary). I plan to segue it into the tag index at some point. On Tue, Jul 7, 2009 at

Re: Delete by docId in IndexWriter

2009-06-28 Thread Jason Rutherglen
This requires tracking the genealogy of docids as they are merged inside IndexWriter. It's doable, so if you're particularly interested feel free to open a jira issue. On Sun, Jun 28, 2009 at 2:21 AM, Shay Banon wrote: > > Hi, > > I have a case where deleting documents by doc id make sense (I

Re: caching an indexreader

2009-06-19 Thread Jason Rutherglen
On the topic of RAM consumption, it seems like field caches could return estimated RAM usage (given they're arrays of standard Java types)? There's methods of calculating per platform (I believe relatively accurately). On Fri, Jun 19, 2009 at 12:11 PM, Michael McCandless < luc...@mikemccandless.co

Re: caching an indexreader

2009-06-19 Thread Jason Rutherglen
> As I understand it, the user won't see any changes to the index until a new Searcher is created. Correct. > How much memory will caching the searcher cost? Are there other tradeoff's I need to consider? If you're updating the index frequently (every N seconds) and the searcher/reader is closed

Re: Lucene memory usage

2009-06-10 Thread Jason Rutherglen
d >terms, and is slurped into the arrays on init. > > This is a sizable RAM savings over what's done now because you save 2 > objects, 3 pointers, 2 longs, 2 ints (I think), per indexed term. > > Mike > > On Wed, Jun 10, 2009 at 2:02 PM, Jason > Rutherglen wrote: &

Re: Lucene memory usage

2009-06-10 Thread Jason Rutherglen
> LUCENE-1458 (flexible indexing) has these improvements, Mike, can you explain how it's different? I looked through the code once but yeah, it's in with a lot of other changes. On Wed, Jun 10, 2009 at 5:40 AM, Michael McCandless < luc...@mikemccandless.com> wrote: > This (very large number of

Bay Area Lucene Group?

2009-05-19 Thread Jason Rutherglen
On the topic of user groups, is there a Bay Area Lucene users group?

Re: is there a way to control when merges happen?

2009-05-15 Thread Jason Rutherglen
Hi Dan, You are looking to throttle the merging? I'd recommend setting ConcurrentMergeScheduler.setMaxThreadCount(1). This way IW.addDocument doesn't wait while a merge occurs (like SerialMergeScheduler) however it should not use as much CPU as only one merge will occur at a time. In regards to

Re: Getting an IndexReader from a committed IndexWriter

2009-05-14 Thread Jason Rutherglen
Hi Shay, I think IndexWriter.getReader from LUCENE-1516 in trunk is what you're talking about? It pools readers internally so there's no need to call IndexReader.reopen, one simply calls IW.getReader to get new readers containing recent updates. -J BTW I replied to the message on java-u...@lucen

Re: IndexWriter.deleteDocuments(Query query)

2009-04-01 Thread Jason Rutherglen
John, We looked at implementing delete by doc id for LUCENE-1516, however it seemed to be something that if enough people wanted we could implement it as a later patch. The implementation involves maintaining a genealogy of SegmentReaders within IndexWriter so that deletes to a reader that has

Re: Assertion Error in TermsHashPerField.comparePostings - Lucene 2.4

2009-03-26 Thread Jason Rutherglen
e segments with enough deletes need to be merged away in 1-2 hours. Meaning optimizing may not be best as it requires later large merges. Also an interleaving system that does not perform merges if a flush is occurring could be useful for minimizing disk thrash. On Wed, Mar 25, 2009 at 3:39 PM, J

Re: Assertion Error in TermsHashPerField.comparePostings - Lucene 2.4

2009-03-25 Thread Jason Rutherglen
LuceneError when executed should reproduce the failure. The contrib/benchmark libraries are required. MultiThreadDocAdd is a multithreaded indexing utility class. On Wed, Mar 25, 2009 at 1:06 PM, Jason Rutherglen < jason.rutherg...@gmail.com> wrote: > Each document is being created in

Re: Assertion Error in TermsHashPerField.comparePostings - Lucene 2.4

2009-03-25 Thread Jason Rutherglen
It looks like you are reusing a Field (the f.setValue(...) calls); are > you sure you're not changing a Document/Field while another thread is > adding it to the index? > > If you can post the full code, then I can try to run it on my > wikipedia dump locally. > > Mi

Re: Assertion Error in TermsHashPerField.comparePostings - Lucene 2.4

2009-03-24 Thread Jason Rutherglen
12:25 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > H. > > Jason is this easily/compactly repeated? EG, try to index the N docs > before that one. > > If you remove the SinglePayloadTokenStream field, does the exception > still happen? > > Mike
