Re: SortingMergePolicy for already sorted segments

2014-06-17 Thread Ravikumar Govindarajan
> > These lookups are expensive and will be done millions of times (each term, > each DV field, each .. everything). Yes, I think you have described the issue correctly. There is no way we can achieve speed-ups without a DocMap, especially for repeated lookups/merge IndexWriter relies on this i

Re: Lucene QueryParser/Analyzer inconsistency

2014-06-17 Thread Jack Krupansky
Yeah, this is kind of tricky and confusing! Here's what happens: 1. The query parser "parses" the input string into individual source terms, each delimited by white space. The escape is removed in this process, but... no analyzer has been called at this stage. 2. The query parser (generator)
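
The two stages described above can be modeled with plain string operations. This is only a toy sketch, not Lucene's QueryParser or Analyzer code: the class and method names (ParseThenAnalyze, splitSourceTerms, analyze, tokensPerSourceTerm) are hypothetical, and the "analyzer" is assumed to do nothing more than map '/' to a space and split on white space. It shows that one white-space-delimited source term can yield several tokens once the escape is gone; what the parser then does with those tokens is not modeled here.

```java
import java.util.ArrayList;
import java.util.List;

public class ParseThenAnalyze {
    /** Stage 1 (toy parser): split on white space, then drop backslash escapes. No analysis yet. */
    public static List<String> splitSourceTerms(String query) {
        List<String> terms = new ArrayList<>();
        for (String raw : query.trim().split("\\s+")) {
            if (!raw.isEmpty()) {
                // Regex \\(.) matches a backslash plus the escaped character; keep only the character.
                terms.add(raw.replaceAll("\\\\(.)", "$1"));
            }
        }
        return terms;
    }

    /** Stage 2 (toy analyzer, an assumption): map '/' to a space, then tokenize on white space. */
    public static List<String> analyze(String sourceTerm) {
        List<String> tokens = new ArrayList<>();
        for (String token : sourceTerm.replace('/', ' ').trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Runs the toy analyzer once per source term, mirroring the order of the two stages. */
    public static List<List<String>> tokensPerSourceTerm(String query) {
        List<List<String>> result = new ArrayList<>();
        for (String term : splitSourceTerms(query)) {
            result.add(analyze(term));
        }
        return result;
    }

    public static void main(String[] args) {
        // The Java literal "foo\\/bar baz" is the query text: foo\/bar baz
        System.out.println(tokensPerSourceTerm("foo\\/bar baz"));
    }
}
```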

Re: Facets in Lucene 4.7.2

2014-06-17 Thread Shai Erera
You can get the size of the taxonomy by calling taxoReader.getSize(). What does the 28K of the $facets field denote - the number of terms (drill-down)? If so, that sounds like your taxonomy is of that size. And indeed, this is a tiny taxonomy ... How many facets do you record per document? This a

Re: Facets in Lucene 4.7.2

2014-06-17 Thread Sandeep Khanzode
If I am counting correctly, the $facets field in the index shows a count of approx. 28k. That does not sound like much, I guess. All my facets are flat and the FacetsConfig only defines a couple of them to be multi-valued. Let me know if I am not counting the taxonomy size correctly. The taxoRe

Lucene QueryParser/Analyzer inconsistency

2014-06-17 Thread Luis Pureza
Hi, I'm experiencing a puzzling behaviour with the QueryParser and was hoping someone around here can help me. I have a very simple Analyzer that tries to replace forward slashes (/) with spaces. Because QueryParser forces me to escape strings with slashes before parsing, I added a MappingCharFilter

Re: Facets in Lucene 4.7.2

2014-06-17 Thread Shai Erera
Nothing suspicious ... code looks fine. The call to FastTaxoFacetCounts actually computes the counts ... that's the expensive part of faceted search. How big is your taxonomy (number of categories)? Is it hierarchical (i.e. are your dimensions flat, or deep like A/1/2/3/)? What does your FacetsConfig

Re: Facets in Lucene 4.7.2

2014-06-17 Thread Sandeep Khanzode
Hi, Thanks for your response. It does sound pretty bad which is why I am not sure whether there is an issue with the code, the index, the searcher, or just the machine, as you say.  I will try with another machine just to make sure and post the results. Meanwhile, can you tell me if there is an

Re: SortingMergePolicy for already sorted segments

2014-06-17 Thread Shai Erera
That said... if we generate the global DocMap up front, there's no reason to not execute the merge of the segments more efficiently, i.e. without wrapping them in a SlowCompositeReaderWrapper. But that's not work for SortingMergePolicy, it's either a special SortingAtomicReader which wraps a group

Re: SortingMergePolicy for already sorted segments

2014-06-17 Thread Shai Erera
OK I think I now understand what you're asking :). It's unrelated though to SortingMergePolicy. You propose to do the "merge" part of a merge-sort, since we know the indexes are already sorted, right? This is something we've considered in the past, but it is very tricky (see below) and we went wit

Re: Facets in Lucene 4.7.2

2014-06-17 Thread Shai Erera
Hi, 40 seconds for faceted search is ... crazy. Also, note how the times don't differ much even though the number of hits is much higher (29K vs 15.1M) ... That, together with what you say about subsequent queries being much faster (a few seconds), suggests that something is seriously messed up with your environment.

Re: Indexing size increase 20% after switching from lucene 4.4 to 4.5 or 4.8 with BinaryDocValuesField

2014-06-17 Thread Robert Muir
Again, because merging is based on byte size, you have to be careful how you measure (hint: use LogDocMergePolicy). Otherwise you are comparing apples and oranges. Separately, your configuration is using experimental codecs like "disk"/"memory" which aren't as heavily benchmarked etc. as the defaul

Indexing size increase 20% after switching from lucene 4.4 to 4.5 or 4.8 with BinaryDocValuesField

2014-06-17 Thread Zhao, Gang
I used Lucene 4.4 to create an index for some documents. One of the indexed fields is a BinaryDocValuesField. After I changed the dependency to Lucene 4.5, the index size for 1 million documents increased from 293MB to 357MB. If I do not use BinaryDocValuesField, the index size increases only about

Re: Facet migration 4.6.1 to > 4.7.0

2014-06-17 Thread Shai Erera
> > - we are extending FacetResultsHandler to change the order of the facet > results (i.e. date facets ordered by date instead of count). How can I > achieve this now? > Now everything is a Facets. In your case, since you use the taxonomy, it's TaxonomyFacets. You can check the class-hierarchy, w

Facet migration 4.6.1 to > 4.7.0

2014-06-17 Thread Nicola Buso
Hi, I'm migrating from lucene 4.6.1 to 4.8.1 and I noticed some Facet API changes in 4.7.0, probably mostly related to this ticket: http://issues.apache.org/jira/browse/LUCENE-5339 Here are a few questions about some customizations/extensions we did that seem not to have a direct counterpart/ext

Re: Facets in Lucene 4.7.2

2014-06-17 Thread Sandeep Khanzode
Hi, Thanks again! This time, I have indexed data with the following specs. I run into > 40 seconds for the FastTaxonomyFacetCounts to create all the facets. Is this as per your measurements? Subsequent runs fare much better probably because of the Windows file system cache. How can I speed thi

Re: SortingMergePolicy for already sorted segments

2014-06-17 Thread Ravikumar Govindarajan
> > Therefore the DocMap is initialized only when the > merge actually executes ... what is there more to postpone? Agreed. However, what I am asking is, if there is an alternative to DocMap, will that be better? Plz read-on And besides, if the segments are already sorted, you should return a n

Re: SortingMergePolicy for already sorted segments

2014-06-17 Thread Shai Erera
> > I am afraid the DocMap still maintains doc-id mappings till merge and I am > trying to avoid it... > What do you mean 'till merge'? The method OneMerge.getMergeReaders() is called only when the merge is executed, not when the MergePolicy decided to merge those segments. Therefore the DocMap is

Re: SortingMergePolicy for already sorted segments

2014-06-17 Thread Ravikumar Govindarajan
I am afraid the DocMap still maintains doc-id mappings till merge and I am trying to avoid it... I think lucene itself has a MergeIterator in o.a.l.util package. A MergePolicy can wrap a simple MergeIterator for iterating docs across different AtomicReaders in correct sort-order for a given field
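
The idea of walking several already-sorted readers in global sort order can be sketched as a k-way merge. This is a minimal illustration only: it uses java.util.PriorityQueue rather than the Lucene MergeIterator mentioned above, represents each segment as a hypothetical ascending array of long sort keys indexed by doc ID, and the names (SortedSegmentMerge, mergeOrder, Cursor) are made up for the example. It only produces the iteration order; it does not address the doc-ID remapping that the rest of the thread says the merge relies on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class SortedSegmentMerge {
    /** Position inside one already-sorted segment. */
    private static final class Cursor {
        final int segment;
        int doc; // current doc ID within the segment, starts at 0

        Cursor(int segment) {
            this.segment = segment;
        }
    }

    /**
     * Takes one ascending key array per segment (index = doc ID) and returns
     * {segment, docId} pairs in global sort order. Equal keys are ordered by segment index.
     */
    public static List<int[]> mergeOrder(long[][] segmentKeys) {
        PriorityQueue<Cursor> queue = new PriorityQueue<>((a, b) -> {
            int cmp = Long.compare(segmentKeys[a.segment][a.doc], segmentKeys[b.segment][b.doc]);
            return cmp != 0 ? cmp : Integer.compare(a.segment, b.segment);
        });
        // Seed the queue with the first doc of every non-empty segment.
        for (int s = 0; s < segmentKeys.length; s++) {
            if (segmentKeys[s].length > 0) {
                queue.add(new Cursor(s));
            }
        }
        List<int[]> order = new ArrayList<>();
        while (!queue.isEmpty()) {
            Cursor top = queue.poll();
            order.add(new int[] {top.segment, top.doc});
            // Advance the cursor and re-insert it while its segment has docs left.
            if (++top.doc < segmentKeys[top.segment].length) {
                queue.add(top);
            }
        }
        return order;
    }

    public static void main(String[] args) {
        long[][] keys = {{1L, 4L, 7L}, {2L, 3L, 8L}};
        for (int[] pair : mergeOrder(keys)) {
            System.out.println("segment=" + pair[0] + " doc=" + pair[1]);
        }
    }
}
```

Each step costs O(log k) for k segments, and no segment needs to be re-sorted, which is the appeal of the approach when the inputs are known to be sorted.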

Search degradation on Windows when upgrading from lucene 3.6 to lucene 4.7.2

2014-06-17 Thread Shlomit Rosen
Hi, We are in the process of upgrading from lucene 3.6.0 to lucene 4.7.2, and our tests show a significant search degradation on the Windows platform. While trying to figure this out, we noticed a couple of points. Any suggestions/thoughts will be greatly appreciated. Thanks! 1) Running sea

Re: SortingMergePolicy for already sorted segments

2014-06-17 Thread Shai Erera
loadSortTerm is your method right? In the current Sorter.sort implementation, I see this code: boolean sorted = true; for (int i = 1; i < maxDoc; ++i) { if (comparator.compare(i-1, i) > 0) { sorted = false; break; } } if (sorted) { return null;
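
The adjacent-pair check quoted above can be tried out in isolation. The sketch below is a standalone restatement, not the Lucene Sorter class: the doc comparator is modeled as a java.util.function.IntBinaryOperator over doc IDs, and the names SortedCheck and isAlreadySorted are hypothetical. It covers only the part of the snippet that is visible in the message, namely detecting that no reordering is needed.

```java
import java.util.function.IntBinaryOperator;

public class SortedCheck {
    /**
     * Returns true when no adjacent pair (i-1, i) compares as out of order,
     * i.e. the docs 0..maxDoc-1 are already in sort order.
     */
    public static boolean isAlreadySorted(int maxDoc, IntBinaryOperator comparator) {
        for (int i = 1; i < maxDoc; ++i) {
            if (comparator.applyAsInt(i - 1, i) > 0) {
                return false; // first inversion found, stop early
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long[] sortedKeys = {1L, 3L, 3L, 9L};   // hypothetical per-doc sort values
        long[] unsortedKeys = {1L, 9L, 3L};
        System.out.println(isAlreadySorted(sortedKeys.length,
                (a, b) -> Long.compare(sortedKeys[a], sortedKeys[b])));
        System.out.println(isAlreadySorted(unsortedKeys.length,
                (a, b) -> Long.compare(unsortedKeys[a], unsortedKeys[b])));
    }
}
```

The check is a single linear pass with an early exit, so for an already-sorted segment it costs maxDoc - 1 comparisons and nothing more.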

RE: Lucene Upgrade from 2.9.x to 4.7.x

2014-06-17 Thread Uwe Schindler
Hi, > Thanks Uwe. I tried this path and I do not find any .cfs files. Lucene 3 and Lucene 4 indexes do not necessarily always contain CFS files, especially not if they are optimized. This depends on the merge policy. The index upgrader uses the default one, which creates no CFS files for the la