ingly impossible... but this is
a separate issue.
On Wed, Sep 10, 2014 at 6:35 PM, Robert Muir wrote:
> Yes, there is also a safety check, but IMO it should be removed.
>
> See the patch on the issue, the test passes now.
>
> On Wed, Sep 10, 2014 at 9:31 PM, Vitaly Funstein
> wrote:
I think you should be able to do this, we just have to
> add the hasDeletions check to #2
>
> On Wed, Sep 10, 2014 at 7:46 PM, Vitaly Funstein
> wrote:
> > One other observation - if instead of a reader opened at a later commit
> > point (T1), I pass in an NRT reader *w
new segment files, as well... unfortunately, our system can't make
either assumption.
On Wed, Sep 10, 2014 at 4:30 PM, Vitaly Funstein
wrote:
> Normally, reopens only go forwards in time, so if you could ensure
>> that when you reopen one reader to another, the 2nd one is always
>
>
> Normally, reopens only go forwards in time, so if you could ensure
> that when you reopen one reader to another, the 2nd one is always
> "newer", then I think you should never hit this issue
Mike, I'm not sure if I fully understand your suggestion. In a nutshell,
the use case here is as follo
er versions.
> >
> > But that being said, I think the bug is real: if you try to reopen
> > from a newer NRT reader down to an older (commit point) reader then
> > you can hit this.
> >
> > Can you open an issue and maybe post a test case showing it? Thanks.
> >
ince merge timings aren't deterministic.
On Mon, Sep 8, 2014 at 11:45 AM, Vitaly Funstein
wrote:
> UPDATE:
>
> After making the changes we discussed to enable sharing of SegmentReaders
> between the NRT reader and a commit point reader, specifically calling
> through to DirectoryReader.o
> On Thu, Aug 28, 2014 at 5:38 PM, Vitaly Funstein
> wrote:
> > On Thu, Aug 28, 2014 at 1:25 PM, Michael McCandless <
> > luc...@mikemccandless.com> wrote:
> >
> >>
> >> The segments_N file can be different, that's fine: after that, we then
>
On Thu, Aug 28, 2014 at 2:38 PM, Vitaly Funstein
wrote:
>
> Looks like this is used inside Lucene41PostingsFormat, which simply passes
> in those defaults - so you are effectively saying the minimum (and
> therefore, maximum) block size can be raised to reduce the size of the terms
>
On Thu, Aug 28, 2014 at 1:25 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:
>
> The segments_N file can be different, that's fine: after that, we then
> re-use SegmentReaders when they are in common between the two commit
> points. Each segments_N file refers to many segments...
>
>
Y
block sizes used by the terms index (see
> BlockTreeTermsWriter). Larger blocks = smaller terms index (FST) but
> possibly slower searches, especially MultiTermQueries ...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Aug 28, 2014 at 2:50 PM, Vitaly Fun
there
> are 88 fields totaling ~46 MB so ~0.5 MB per indexed field ...
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Aug 28, 2014 at 1:56 PM, Vitaly Funstein
> wrote:
> > Here's the link:
> >
> https://drive.google.com/file/d/0B5e
> N commit points that you have readers open for, they will be sharing
> SegmentReaders for segments they have in common.
>
> How many unique fields are you adding?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, Aug 27, 2014 at 7:41 PM, Vitaly Fun
que fields?
>
> Can you post screen shots of the heap usage?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Tue, Aug 26, 2014 at 3:53 PM, Vitaly Funstein
> wrote:
> > This is a follow up to the earlier thread I started to understand memory
> >
This is a follow up to the earlier thread I started to understand memory
usage patterns of SegmentReader instances, but I decided to create a
separate post since this issue is much more serious than the heap overhead
created by use of stored field compression.
Here is the use case, once again. The
Is it reasonable to assume that using stored field compression with a lot
of stored fields per document in a very large index (100+ GB) could
potentially lead to a significant heap utilization? If I am reading the
code in CompressingStoredFieldsIndexReader correctly, there's a non-trivial
accounti
y sure calling indexWriterConfig.clone() in the middle
> of indexing documents used to work for my code (same Lucene 4.7). It is
> only since I recently added faceted indexing that this problem
> started to emerge. Is it related?
>
>
> On Mon, Aug 11, 2014 at 11:31 PM, Vit
mean whenever the indexWriter gets called for
> commit/prepareCommit, etc., the corresponding indexWriterConfig object
> cannot be called with .clone() at all?
>
>
> On Mon, Aug 11, 2014 at 9:52 PM, Vitaly Funstein
> wrote:
>
> > Looks like you have to clone it
Looks like you have to clone it prior to using with any IndexWriter
instances.
On Mon, Aug 11, 2014 at 2:49 PM, Sheng wrote:
> I tried to create a clone of IndexWriterConfig with
> "indexWriterConfig.clone()" for re-creating a new IndexWriter, but then I
> got this very annoying illegalstateex
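For reference, a minimal sketch of the cloning approach described above (Lucene 4.x API; the analyzer and Directory variables here are assumptions, not from the original posts):
IndexWriterConfig baseConfig = new IndexWriterConfig(Version.LUCENE_47, analyzer);
// clone before handing the config to each writer, so no two writers share one instance
IndexWriter firstWriter = new IndexWriter(firstDirectory, baseConfig.clone());
IndexWriter secondWriter = new IndexWriter(secondDirectory, baseConfig.clone());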
As a compromise, you can base your custom sort function on values of stored
fields in the same index - as opposed to fetching them from an external
data store, or relying on internal sorting implementation in Lucene. It
will still be relatively slow, but not nearly as slow as going out to a
DB... t
on. The same version maintained in lucene as one document. During
> startup these numbers define what has to be synced up. Unfortunately Lucene
> is used in a webapp, so this happens "only" during a Jetty restart.
>
> - Vidhya
>
>
> > On 21-Jun-2014, at 11:08 am, "
so verify if the time taken for commit() is longer when more
> data has piled up to commit. But it should definitely be better than committing
> for every thread...
>
> Will post back after tests.
>
> - Vidhya
>
>
> > On 21-Jun-2014, at 10:28 am, "Vitaly Funstein"
the auto commit parameters appropriately, do I still need the
> committer thread? Because its job is to call commit. Anyway
> add/updateDocument is already done in my writer threads.
>
> Thanks for your time and your suggestions!
>
> - Vidhya
>
>
> > On 21-Jun-2014, a
If you are using stored fields in your index, consider playing with
compression settings, or perhaps turning stored field compression off
altogether. Ways to do this have been discussed in this forum on numerous
occasions. This is highly use case dependent though, as your indexing
performance may o
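A hedged sketch of the "playing with compression settings" idea via a custom codec (Lucene 4.x; the codec name, chunk size, and delegate codec below are illustrative assumptions):
public final class TunedStoredFieldsCodec extends FilterCodec {
  // smaller chunks and the FAST mode trade compression ratio for speed
  private final StoredFieldsFormat storedFields =
      new CompressingStoredFieldsFormat("TunedStoredFields", CompressionMode.FAST, 1 << 14);

  public TunedStoredFieldsCodec() {
    super("TunedStoredFieldsCodec", new Lucene46Codec());
  }

  @Override
  public StoredFieldsFormat storedFieldsFormat() {
    return storedFields;
  }
}
// register the class in META-INF/services/org.apache.lucene.codecs.Codec so the index
// can be read back, then: indexWriterConfig.setCodec(new TunedStoredFieldsCodec());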
You could just avoid calling commit() altogether if your application's
semantics allow this (i.e. it's non-transactional in nature). This way,
Lucene will do commits when appropriate, based on the buffering settings
you chose. It's generally unnecessary and undesirable to call commit at the
end of
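A sketch of the buffering settings alluded to above (Lucene 4.x; the analyzer and the 64 MB figure are just placeholders):
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_47, analyzer);
config.setRAMBufferSizeMB(64.0); // flush a new segment once this much indexing RAM is used
config.setMaxBufferedDocs(IndexWriterConfig.DISABLE_AUTO_FLUSH); // flush by RAM, not by doc count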
wrote:
> Vitaly
>
> See below:
>
>
> On 2014/06/03, 12:09 PM, Vitaly Funstein wrote:
>
>> A couple of questions.
>>
>> 1. What are you trying to achieve by setting the current thread's priority
>> to max possible value? Is it grabbing as much CPU t
A couple of questions.
1. What are you trying to achieve by setting the current thread's priority
to max possible value? Is it grabbing as much CPU time as possible? In my
experience, mucking with thread priorities like this is at best futile, and
at worst quite detrimental to responsiveness and o
Something doesn't quite add up.
TopFieldCollector fieldCollector = TopFieldCollector.create(sort, max, true,
> false, false, true);
>
> We use pagination, so only returning 1000 documents or so at a time.
>
>
You say you are using pagination, yet the API you are using to create your
collector isn't
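For context, a sketch of page-by-page retrieval with searchAfter rather than re-collecting the full result set each time (Lucene 4.x; searcher, query, sort, and the page size of 1000 are assumptions):
TopFieldDocs firstPage = searcher.search(query, 1000, sort);
// remember the last hit of the page and continue from it on the next request
ScoreDoc lastHit = firstPage.scoreDocs[firstPage.scoreDocs.length - 1];
TopDocs nextPage = searcher.searchAfter(lastHit, query, 1000, sort);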
At the risk of sounding overly critical here, I would say you need to scrap
your entire approach of building one small index per request, and just
build your entire searchable data store in Lucene/Solr. This is the
simplest and probably most maintainable and scalable solution. Even if your
index co
Does Lucene have support for queries that operate on fields that match a
specific name pattern?
Let's say that I am modeling an indexed field that can have a collection of
values, but don't want the default behavior of having these values appended
together within the field, for the purposes of search. So
e
> filesystem cache that likely contains other fields' values that you
> are not interested in.
>
>
>
> On Sat, Apr 5, 2014 at 12:23 AM, Vitaly Funstein
> wrote:
> > I use stored fields to load values for the following use cases:
> > - to return per-document v
Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
> > -Original Message-
> > From: Vitaly Funstein [mailto:vfunst...@gmail.com]
> > Sent: Friday, April 04, 2014 9:44 PM
> > To: java-user@lucene.apache.org
> > Subject: Stored fields and OS file caching
>
I have heard here that stored fields don't work well with OS file caching.
Could someone elaborate on why that is? I am using Lucene 4.6 and we do use
stored fields but not doc values; it appears most of the benefit from the
latter comes as improvement in sorting performance, and I don't actually
u
ne between two snapshots, of course
> more files can change, because smaller segments may get combined with other
> ones in newer snapshots.
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
I have a usage pattern where I need to package up and store away all files
from an index referenced by multiple commit points. To that end, I
basically call IndexWriter.commit(), followed by
SnapshotDeletionPolicy.snapshot(), followed by something like this:
List files = new ArrayList(dir.li
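One hedged way to gather the files belonging to a snapshotted commit point (Lucene 4.4+ API; not necessarily what the original code did - writer and snapshotPolicy are assumed):
writer.commit();
IndexCommit commit = snapshotPolicy.snapshot();      // pins this commit point against deletion
Collection<String> filesToPackage = commit.getFileNames();
// ... copy filesToPackage out of the Directory, then release the snapshot:
snapshotPolicy.release(commit);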
Suppose I have an IndexReader instance obtained with this API:
DirectoryReader.open(IndexWriter, boolean);
(I actually use a ReaderManager in front of it, but that's beside the
point).
There is no manual commit happening prior to this call. Now, I would like
to keep this reader around until no l
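For reference, the usual acquire/release pattern with ReaderManager looks roughly like this (Lucene 4.x sketch; writer is assumed):
ReaderManager manager = new ReaderManager(writer, true); // true = apply deletes on open
DirectoryReader reader = manager.acquire();
try {
  // ... run searches against this reader ...
} finally {
  manager.release(reader); // reference counting keeps the reader valid until all acquirers release it
}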
I see that SnapshotDeletionPolicy no longer supports snapshotting by an
app-supplied string id, as of Lucene 4.4. However, my use case relies on
the policy's ability to maintain multiple snapshots simultaneously to
provide index versioning semantics, of sorts. What is the new recommended
way of doi
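A possible workaround sketch: keep an application-level map from your own string ids to the IndexCommit objects returned by the 4.4+ snapshot() call (the map, ids, and snapshotPolicy variable below are hypothetical):
Map<String, IndexCommit> namedSnapshots = new HashMap<String, IndexCommit>();

// take a snapshot and remember it under an application-supplied id
IndexCommit commit = snapshotPolicy.snapshot();
namedSnapshots.put("version-42", commit);

// later, drop that version
snapshotPolicy.release(namedSnapshots.remove("version-42"));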
, Oct 10, 2013 at 7:01 PM, Vitaly Funstein wrote:
> Hello,
>
> I am trying to weigh some ideas for implementing paged search
> functionality in our system, which has these basic requirements:
>
>- Using Solr is not an option (at the moment).
>- Any Lucene 4.x version can b
Or FieldValueFilter - that's probably easier to use.
> -Original Message-
> From: Michael McCandless [mailto:luc...@mikemccandless.com]
> Sent: Monday, November 04, 2013 4:37 AM
> To: Lucene Users
> Subject: Re: Lucene Empty Non-empty Fields
>
> You can also use FieldCache.getDocsWithFiel
Hello,
I am trying to weigh some ideas for implementing paged search functionality
in our system, which has these basic requirements:
- Using Solr is not an option (at the moment).
- Any Lucene 4.x version can be used.
- Result pagination is driven by the user application code.
- User
I don't think you want to load indexes of this size into a RAMDirectory.
The reasons have been listed multiple times here... in short, just use
MMapDirectory.
On Wed, Oct 9, 2013 at 3:17 PM, Igor Shalyminov
wrote:
> Hello!
>
> I need to perform an experiment of loading the entire index in RAM an
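A minimal sketch of the MMapDirectory alternative suggested above (Lucene 4.x; the path is illustrative):
Directory dir = new MMapDirectory(new File("/path/to/index"));
DirectoryReader reader = DirectoryReader.open(dir); // the OS page cache keeps hot parts of the index in RAM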
to focus on parallelism within
> queries rather than across many queries. Batch processing performance is
> still important, but we cannot sacrifice quick "online" responses. It would
> be much easier to avoid this whole mess, but we cannot meet our performance
> requirements w
Matt,
I think you are mostly on track with suspecting thread pool task overload
as the possible culprit here. First, the old school (prior to Java 7)
ThreadPoolExecutor only accepts a BlockingQueue to use internally for
worker tasks, instead of a concurrent variant (not sure why). So this
internal
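One common way to keep task overload bounded (plain Java, not from the original thread; pool and queue sizes are arbitrary):
ThreadPoolExecutor searchPool = new ThreadPoolExecutor(
    8, 8,                                        // fixed-size pool of search workers
    0L, TimeUnit.MILLISECONDS,
    new ArrayBlockingQueue<Runnable>(100),       // bounded queue instead of an unbounded backlog
    new ThreadPoolExecutor.CallerRunsPolicy());  // overflow work runs on the submitting thread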
n, might be to list the
> > directory: if there is only one file, segments_1, then it's a corrupt
> > first commit and you can recreate the index.
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> > On Wed, May 29, 2013 at 8:09 PM, Vita
I have encountered a strange issue, that appears to be pretty hard to hit,
but is still a serious problem when it does occur. It seems that if the JVM
crashes at just the wrong moment during IndexWriter instantiation, the index may
be left in an inconsistent state. An attempt to reload such an index on
r
ustered in the codec
> manager by adding META-INF files to your JAR and not using anonymous
> subclasses.
>
>
>
> Vitaly Funstein wrote:
>
> >Uwe,
> >
> >I may not be doing this correctly, but I tried to see what would happen
> >if
> >I were to a reo
remen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
> > -Original Message-
> > From: Vitaly Funstein [mailto:vfunst...@gmail.com]
> > Sent: Wednesday, May 15, 2013 11:36 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Toggling compression
nothing to do with "reindexing" as you are just changing
> the encoding of the exact same data on disk.
>
> Uwe
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
> > -Orig
Is it possible to have a mix of compressed and uncompressed documents
within a single index? That is, can I load an index created with Lucene 4.0
into 4.1 and defer the decision of whether or not to use
CompressingStoredFieldsFormat until a later time, or even go back and forth
between compressed a
Something like this will work:
BooleanQuery query = new BooleanQuery();
query.add(new MatchAllDocsQuery(), Occur.MUST);
query.add(new BooleanClause(termQuery, Occur.MUST_NOT));
On Fri, Mar 29, 2013 at 1:06 PM, Paul Bell wrote:
> Hi,
>
> I've done a few experiments in Lucene 4.2 wit
This is probably a pretty general inquiry, but I'm just exploring this as
an option at the moment.
It seems that Lucene 4 adds some freedom to define how data is actually
written to underlying storage by exposing the codec API. However, I find
the learning curve for understanding what bits to chan
I know that general questions about aggregate functions have been asked
here before a number of times, but I would like to figure out how to solve
at least one specific subset of this issue. Namely, given a specific
indexed field, how do I efficiently get the min/max value of the field in
the index
If you don't need to support case-sensitive search in your application,
then you may be able to get away with adding string fields to your
documents twice - lowercase version for indexing only, and verbatim to
store. For example (this is Lucene 4 code, but same idea),
// indexed - not stored
d
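A sketch of that two-field approach (Lucene 4.x; the field names here are made up):
// indexed for case-insensitive matching, not stored
doc.add(new StringField("name_lc", value.toLowerCase(Locale.ROOT), Field.Store.NO));
// stored verbatim for retrieval, not meant for searching
doc.add(new StoredField("name", value));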
this API and new users have a steeper
> learning curve because of it.
>
>
> Igal
>
>
>
> On 11/5/2012 11:38 AM, Vitaly Funstein wrote:
>
>> Are you critiquing CharTermAttribute in particular, or Lucene in general?
>> It appears CharTermAttribute is DSL-style builder
Are you critiquing CharTermAttribute in particular, or Lucene in general?
It appears CharTermAttribute is a DSL-style builder API, just like its
superinterface Appendable - does that not appear intentional and
self-explanatory? Further, I believe Term instances are meant to be
immutable hence no dire
One thing to keep in mind is that the default merge policy has changed in
3.6 from 2.3.2 (I'm almost certain of that). So it's just a hunch but you
may have some unmerged segments left over at the end. Try calling
IndexWriter.close(true) after you're done indexing.
On Fri, Oct 26, 2012 at 10:50 AM
ld be "OR (*:* -allergies:[* TO *])" in
> Lucene/Solr.
>
> -- Jack Krupansky
>
> -Original Message- From: Vitaly Funstein
> Sent: Thursday, October 25, 2012 8:25 PM
> To: java-user@lucene.apache.org
> Subject: Re: query for documents WITHOUT a field?
>
>
Sorry for resurrecting an old thread, but how would one go about writing a
Lucene query similar to this?
SELECT * FROM patient WHERE first_name = 'Zed' OR allergies IS NULL
An AND case would be easy since one would just use a simple TermQuery with
a FieldValueFilter added, but what about other bo
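A hedged sketch of that OR case using the FieldValueFilter mentioned elsewhere in this thread (Lucene 4.x; field names follow the SQL example):
// matches documents that have no value at all for "allergies" (negate = true)
Query missingAllergies = new ConstantScoreQuery(new FieldValueFilter("allergies", true));

BooleanQuery query = new BooleanQuery();
query.add(new TermQuery(new Term("first_name", "Zed")), Occur.SHOULD);
query.add(missingAllergies, Occur.SHOULD);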
Just curious - why not take your search feature offline during the
reindexing? That would seem sensible from an operational perspective, I
think.
On Tue, Oct 23, 2012 at 2:03 PM, Raghavan Parthasarathy <
raghavan8...@gmail.com> wrote:
> Hi,
>
> We are using Lucene-core and we reindex once a day a
You have probably figured it out by now, but my suggestion would be to use
SearcherManager the way it is documented for maintaining a searcher backed
by an NRT reader.
On Sun, Aug 26, 2012 at 2:03 PM, Mossaab Bagdouri wrote:
> Thanks for the quick reply.
>
> I've changed my code to the following
How tolerant is your project of decreased search and indexing performance?
You could probably write a simple test that compares search and write
speeds of local and NFS-mounted indexes and make the decision based on the
results.
On Mon, Oct 1, 2012 at 3:06 PM, Jong Kim wrote:
> Hi,
>
> According
at an older 4.x/5.x version? We recently
> removed declaration of this (unchecked) exception... (LUCENE-4172).
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Fri, Jul 20, 2012 at 11:26 PM, Vitaly Funstein
> wrote:
> > This probably belongs in the JIRA, an
This probably belongs in the JIRA, and is related to
https://issues.apache.org/jira/browse/LUCENE-4025, but
java.util.concurrent.locks.Lock.lock() doesn't throw anything. I believe the author of the
change originally meant to use lockInterruptibly() inside but forgot to
adjust the method sig after changing it back
I was referring to *RAMDirectory*.
On Wed, Jul 18, 2012 at 11:04 PM, Lance Norskog wrote:
>> You do not want to store 30 G of data in the JVM heap, no matter what
library does this.
> MMapDirectory does not store data in the JVM heap. It lets the
> operating system manage the disk buffer space. E
You do not want to store 30 GB of data in the JVM heap, no matter what
library does this.
On Wed, Jul 18, 2012 at 10:44 AM, Paul Jakubik wrote:
> If only 30GB, go with RAM and MMapDirectory (as long as you have the budget
> for that hardware).
>
> My understanding is that RAMDirectory is intended
Have you tried sharding your data? Since you have a fast multi-core
box, why not split your indices N-ways, say the smaller one into 4,
and the larger into 8. Then you can have a pool of dedicated search
threads, executing the same query against separate physical indices
within each "logical" one i
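One possible shape for the "logical index over shards" idea (Lucene 4.x sketch; dirs and executor are assumed):
List<IndexReader> shardReaders = new ArrayList<IndexReader>();
for (Directory shardDir : dirs) {
  shardReaders.add(DirectoryReader.open(shardDir));
}
IndexReader logicalIndex = new MultiReader(shardReaders.toArray(new IndexReader[0]));
// the executor lets a single query be evaluated against the underlying segments in parallel
IndexSearcher searcher = new IndexSearcher(logicalIndex, executor);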
, Jul 12, 2012 at 2:34 PM, Lance Norskog wrote:
> You can choose another directory implementation.
>
> On Thu, Jul 12, 2012 at 1:42 PM, Vitaly Funstein wrote:
>> Just thought I'd bump this. To clarify - for reasons outside my
>> control, I can't just run the JVM hos
parameter has to be fairly close to the actual size
used by the app (padded for Lucene and possibly other consumers).
On Mon, Jul 9, 2012 at 7:59 PM, Vitaly Funstein wrote:
>
> Hello,
>
> I have recently run into the situation when there was not a sufficient amount
> of di
Hello,
I have recently run into the situation when there was not a sufficient
amount of direct memory available for IndexWriter to work. This was
essentially caused by the embedding application making heavy use of JVM's
direct memory buffers and not leaving enough headroom for NIOFSDirectory to
op
, Michael McCandless <
luc...@mikemccandless.com> wrote:
> On Fri, Jun 1, 2012 at 8:09 PM, Vitaly Funstein
> wrote:
> > Yes, I am only calling IndexWriter.addDocument()
>
> OK.
>
> > Interestingly, relative performance of either approach seems to greatly
> >
indexing
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Tue, May 29, 2012 at 9:42 PM, Vitaly Funstein
wrote:
>> Hello,
>>
>> I am trying to optimize the process of "warming up" an index prior to
>> using the search subsystem, i.e. it is guarant
Any takers on this one or is my inquiry a bit too broad? I can post my
test code if that helps...
On Tue, May 29, 2012 at 6:42 PM, Vitaly Funstein wrote:
> Hello,
>
> I am trying to optimize the process of "warming up" an index prior to
> using the search subsystem, i.e. it
Hello,
I am trying to optimize the process of "warming up" an index prior to
using the search subsystem, i.e. it is guaranteed that no other writes
or searches can take place in parallel with the warmup. To that
end, I have been toying with the idea of turning off segment merging
altogether u
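A sketch of switching merging off for the duration of such a warm-up (Lucene 3.5-era API; the analyzer and version constant are assumptions):
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);
config.setMergePolicy(NoMergePolicy.NO_COMPOUND_FILES); // never selects any merges
config.setMergeScheduler(NoMergeScheduler.INSTANCE);    // never runs any merges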
Hello,
I am currently experimenting with tuning of max merged segment MB
parameter on TieredMergePolicy in Lucene 3.5, and seeing significant
gains in index writing speed from values dramatically lower than the
default (5 GB). For instance, when setting it to 5 or 10 MB, I can see
my writing tests
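For reference, the setting being tuned (Lucene 3.5; the 10 MB value is just the example from the post, and indexWriterConfig is assumed):
TieredMergePolicy mergePolicy = new TieredMergePolicy();
mergePolicy.setMaxMergedSegmentMB(10.0); // default is roughly 5 GB
indexWriterConfig.setMergePolicy(mergePolicy);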
olicy by default. Can you try to use the same merge policy
> on both 3.0.3 and 3.5 and report back? i.e. LogByteSizeMergePolicy or
> whatever you are using...
>
> simon
>
> On Thu, Feb 9, 2012 at 5:28 AM, Vitaly Funstein wrote:
>> Hello,
>>
>> I am currently evalua
Hello,
I am currently evaluating Lucene 3.5.0 for upgrading from 3.0.3, and
in the context of my usage, the most important parameter is index
writing throughput. To that end, I have been running various tests,
but seeing some contradictory results from different setups, which
hopefully someone wit