Indexing multiple fields with one document position

2013-01-21 Thread Igor Shalyminov
Hello!

When indexing text with position data, one just adds a field to a document in
the form of its name and value, and the indexer assigns it a unique position
in the index.
I wonder, if I have an entry with two attributes, say:

cat,

How do I store in the index two fields, "pos" and "number", with their
values pointing to the same position in the document?

-- 
Best Regards,
Igor Shalyminov

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Inner join in lucene

2013-01-21 Thread Ramprakash Ramamoorthy
On Fri, Jan 18, 2013 at 9:05 PM, Apostolis Xekoukoulotakis <
xekou...@gmail.com> wrote:

> You can put those fields as a DocValue type of field. They are optimized
> for use during search(or join in this case).
>
> Then create a collector that collects the documents which have the same
> value in those fields.
>
> Have other more experienced comment though before you start implementing
> it.
>
> Thank you Apostolis,

  That definitely gives me a head start. I will try this out and
update the thread with my findings.
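
A minimal plain-Java sketch of the approach Apostolis suggests. This is not the Lucene Collector or DocValues API — it only illustrates the join logic a custom collector would implement (the class and method names are illustrative): read each document's join value from a per-document column (which is what DocValues gives you cheaply at search time), then keep only the values that occur in every "table".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class ValueJoin {

    // values[t][d] = join value of document d in "table" t
    public static List<String> commonValues(String[][] values) {
        Map<String, Integer> tablesSeenIn = new HashMap<String, Integer>();
        for (String[] table : values) {
            // a HashSet so each table is counted at most once per value
            for (String v : new HashSet<String>(Arrays.asList(table))) {
                Integer n = tablesSeenIn.get(v);
                tablesSeenIn.put(v, n == null ? 1 : n + 1);
            }
        }
        // keep the values seen in every table
        List<String> common = new ArrayList<String>();
        for (Map.Entry<String, Integer> e : tablesSeenIn.entrySet()) {
            if (e.getValue() == values.length) {
                common.add(e.getKey());
            }
        }
        return common;
    }

    public static void main(String[] args) {
        String[][] tables = {
            {"a", "b", "c"},  // table1.field1, one value per document
            {"b", "c", "d"},  // table2.field3
            {"c", "e"}        // table3.field2
        };
        System.out.println(commonValues(tables)); // prints [c]
    }
}
```

In a real collector you would replace the `String[][]` with per-segment DocValues lookups, but the grouping step is the same.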

>
>
2013/1/18 Ramprakash Ramamoorthy 
>
> > Dear all,
> >
> >  I know Lucene is no relational database, but spare me. I need
> > to run a search across an index and find fields that have a common
> > equal value, where the common value is unknown (to be determined at
> > run time).
> >
> >  An outright sql query would be *SELECT * from table1 where
> > table1.field1=table2.field3=table3.field2; *
> >
> >  How do I proceed with this? Is getting distinct values for a
> > field (if that is at all possible) and iterating the query across
> > fields the only solution? Or is there a more elegant way?
> >
> >  Please help. Thanks in advance.
> >
> > --
> > With thanks and regards,
> > Ramprakash Ramamoorthy,
> > SASTRA University,
> > India.
> >
>
>
>
> --
>
>
> Sincerely yours,
>
>  Apostolis Xekoukoulotakis
>


Re: Tool for Lucene storage recovery

2013-01-21 Thread Michał Brzezicki
I don't think it is possible to simply compile it as a jar, since you need
to implement the handling of recovered documents.

-- 
Michał

2013/1/19 Simon Willnauer 

> hey,
>
> do you wanna open a jira issue for this and attach your code? this
> might help others too, and if the shit hits the fan it's good to have
> something in the lucene jar that can bring some data back.
>
> simon
> On Fri, Jan 18, 2013 at 6:37 PM, Michał Brzezicki 
> wrote:
> > in lucene (*.fdt). Code is available here http://pastebin.com/nmF0j4npyou
>
>
>


FacetedSearch and MultiReader

2013-01-21 Thread Nicola Buso
Hi all,

I'm trying to develop faceted search using lucene 4.0 faceting
framework.
In our project we are searching on multiple indexes using lucene
MultiReader. How should we use the faceted framework to obtain
FacetResults starting from a MultiReader? All the examples I see use
a "single" IndexReader.



Nicola.





RE: FacetedSearch and MultiReader

2013-01-21 Thread Uwe Schindler
Just use MultiReader, it extends IndexReader, so you can pass it anywhere where 
IndexReader can be passed.

-
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de

> -Original Message-
> From: Nicola Buso [mailto:nb...@ebi.ac.uk]
> Sent: Monday, January 21, 2013 3:59 PM
> To: java-user@lucene.apache.org
> Subject: FacetedSearch and MultiReader
> 
> Hi all,
> 
> I'm trying to develop faceted search using lucene 4.0 faceting framework.
> In our project we are searching on multiple indexes using lucene
> MultiReader. How should we use the faceted framework to obtain
> FacetResults starting from a MultiReader? all the example I see are using a
> "single" IndexReader.
> 
> 
> 
> Nicola.
> 
> 





Re: FacetedSearch and MultiReader

2013-01-21 Thread Nicola Buso
Thanks for the reply Uwe,

we can currently search with a MultiReader over all the indexes we have.
Now I want to add faceted search, so I created a categories index for
every index I currently have.
To accumulate the faceted results, I now have a MultiReader pointing to
all the indexes, and I can create a TaxonomyReader for every categories
index I have. The ways I see to obtain FacetResults are:
1 - FacetsCollector
2 - a FacetsAccumulator implementation

Suppose I use the second option. I should:
- search as usual using the MultiReader
- then try to collect all the FacetResults by iterating over my
TaxonomyReaders; at every iteration:
  - I create a FacetsAccumulator using the MultiReader and a
TaxonomyReader
  - I get a list of FacetResult from the accumulator.
- when I finish, I should somehow merge all the List<FacetResult>
instances I have.

I think this solution is not correct, because the docids from the search
point into the MultiReader, while each TaxonomyReader points to the
categories index of a single reader.
I also don't like having to merge all the Lists of FacetResult I
retrieve from the accumulators.

Probably I'm missing something; can somebody clarify how I should
collect the facets in this case?


Nicola.

 

On Mon, 2013-01-21 at 16:22 +0100, Uwe Schindler wrote:
> Just use MultiReader, it extends IndexReader, so you can pass it anywhere 
> where IndexReader can be passed.
> 
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
> 
> > -Original Message-
> > From: Nicola Buso [mailto:nb...@ebi.ac.uk]
> > Sent: Monday, January 21, 2013 3:59 PM
> > To: java-user@lucene.apache.org
> > Subject: FacetedSearch and MultiReader
> > 
> > Hi all,
> > 
> > I'm trying to develop faceted search using lucene 4.0 faceting framework.
> > In our project we are searching on multiple indexes using lucene
> > MultiReader. How should we use the faceted framework to obtain
> > FacetResults starting from a MultiReader? all the example I see are using a
> > "single" IndexReader.
> > 
> > 
> > 
> > Nicola.
> > 
> > 
> 






Re: Indexing multiple fields with one document position

2013-01-21 Thread Jack Krupansky
Send the same input text to two different analyzers for two separate fields. 
The first analyzer emits only the first attribute. The second analyzer emits 
only the second attribute. The document position in one will correspond to 
the document position in the other.
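
A plain-Java sketch of this idea (the field and attribute names are illustrative, and this is not the Lucene analyzer API): both "analyzers" consume the same annotated token sequence and emit exactly one term per input token, so term i in the "pos" field and term i in the "number" field describe the same word and share position i.

```java
import java.util.ArrayList;
import java.util.List;

public class ParallelFields {

    // an annotated token: the word plus its two attributes
    static class Token {
        final String word, pos, number;
        Token(String word, String pos, String number) {
            this.word = word;
            this.pos = pos;
            this.number = number;
        }
    }

    // "analyzer" for the pos field: one term per input token
    static List<String> posField(List<Token> input) {
        List<String> out = new ArrayList<String>();
        for (Token t : input) {
            out.add(t.pos);
        }
        return out;
    }

    // "analyzer" for the number field: one term per input token
    static List<String> numberField(List<Token> input) {
        List<String> out = new ArrayList<String>();
        for (Token t : input) {
            out.add(t.number);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> sentence = new ArrayList<Token>();
        sentence.add(new Token("cats", "noun", "plural"));
        sentence.add(new Token("sleep", "verb", "plural"));
        List<String> pos = posField(sentence);
        List<String> number = numberField(sentence);
        // position 0 in both fields describes the same word, "cats"
        System.out.println(pos.get(0) + " / " + number.get(0)); // prints noun / plural
    }
}
```

In Lucene terms, you would index the same text into both fields, each with its own analyzer, and the positions line up as long as both analyzers tokenize identically.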


-- Jack Krupansky

-Original Message- 
From: Igor Shalyminov

Sent: Monday, January 21, 2013 3:04 AM
To: java-user@lucene.apache.org
Subject: Indexing multiple fields with one document position

Hello!

When indexing text with position data, one just adds field do a document in 
the form of its name and value, and the indexer assigns it unique position 
in the index.

I wonder, if I have an entry with two attributes, say:

cat,

How do I store in the index two fields, "pos" and "number" with its values, 
pointing to the same position in the document?


--
Best Regards,
Igor Shalyminov







Re: FacetedSearch and MultiReader

2013-01-21 Thread Shai Erera
Hi Nicola,

I think that what you're describing corresponds to distributed faceted
search. I.e., you have N content indexes, alongside N taxonomy indexes.
The information that's indexed in each of those sub-indexes does not
correlate with the other ones.
For example, say that you index the category "Movie/Drama", it may receive
ordinal 12 in index1 and 23 in index2.
If you try to count ordinals using the MultiReader, you'll just mess
everything up.

If you can share a single taxonomy index for all N content indexes, then
you'll be in a super-simple position:
1) Open one TaxonomyReader
2) Execute search with MultiReader and FacetsCollector

It doesn't get simpler than that ! :)

Before I go into great length describing what you should do if you cannot
share the taxonomy, let me know if that's not an option for you.

Shai
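
The ordinal clash Shai describes above can be shown in a few lines of plain Java (the numbers and array sizes are made up; this is not the Lucene API): the same category receives different ordinals in different taxonomy indexes, so counts must be remapped into a merged taxonomy's ordinal space before they can be summed.

```java
public class OrdinalMismatch {

    // counts[globalOrd]++ for every hit, after mapping each index's
    // local ordinal through that index's ordinal map
    public static int[] countWithRemap(int[][] hitsPerIndex, int[][] ordMaps, int size) {
        int[] counts = new int[size];
        for (int idx = 0; idx < hitsPerIndex.length; idx++) {
            for (int localOrd : hitsPerIndex[idx]) {
                counts[ordMaps[idx][localOrd]]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // "Movie/Drama" is ordinal 12 in taxonomy 1 but 23 in taxonomy 2,
        // so naively summing counts per ordinal across a MultiReader would
        // mix unrelated categories. After merging the taxonomies, both map
        // to the same global ordinal (5 here).
        int[] ordMap1 = new int[32];
        ordMap1[12] = 5;
        int[] ordMap2 = new int[32];
        ordMap2[23] = 5;

        // local ordinals observed per hit, per index
        int[][] hits = { {12, 12, 12}, {23, 23, 23, 23} };
        int[] counts = countWithRemap(hits, new int[][] { ordMap1, ordMap2 }, 32);
        System.out.println(counts[5]); // prints 7
    }
}
```

With a single shared taxonomy there is only one ordinal space, so no remapping is needed at all — which is why the shared-taxonomy option is so much simpler.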


On Mon, Jan 21, 2013 at 5:39 PM, Nicola Buso  wrote:

> Thanks for the reply Uwe,
>
> we currently can search with MultiReader over all the indexes we have.
> Now I want to add the faceting search, than I created a categories index
> for every index I currently have.
> To accumulate the faceted results now I have a MultiReader pointing all
> the indexes and I can create a TaxonomyReader for every categories index
> I have; all the way I see to obtain FacetResults are:
> 1 - FacetsCollector
> 2 - a FacetsAccumulator implementation
>
> suppose I use the second option. I should:
> - search as usual using the MultiReader
> - than try to collect all the facetresults iterating over my
> TaxonomyReaders; at every iteration:
>   - I create a FacetsAccumulator using the MultiReader and a
> TaxonomyReader
>   - I get a list of FacetResult from the accumulator.
> - as I finish I should in some way merge all the List I
> have.
>
> I think this solution is not correct because the docsids from the search
> are pointing the multireader instead the taxonomyreader is pointing to
> the categories index of a single reader.
> I neither like to merge all the List of FacetResult I retrieve from the
> Accumulators.
>
> Probably I'm missing something, can somebody clarify to me how I should
> collect the facets in this case?
>
>
> Nicola.
>
>
>
> On Mon, 2013-01-21 at 16:22 +0100, Uwe Schindler wrote:
> > Just use MultiReader, it extends IndexReader, so you can pass it
> anywhere where IndexReader can be passed.
> >
> > -
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: u...@thetaphi.de
> >
> > > -Original Message-
> > > From: Nicola Buso [mailto:nb...@ebi.ac.uk]
> > > Sent: Monday, January 21, 2013 3:59 PM
> > > To: java-user@lucene.apache.org
> > > Subject: FacetedSearch and MultiReader
> > >
> > > Hi all,
> > >
> > > I'm trying to develop faceted search using lucene 4.0 faceting
> framework.
> > > In our project we are searching on multiple indexes using lucene
> > > MultiReader. How should we use the faceted framework to obtain
> > > FacetResults starting from a MultiReader? all the example I see are
> using a
> > > "single" IndexReader.
> > >
> > >
> > >
> > > Nicola.
> > >
> > >
> >
>
>
>
>
>


Re: FacetedSearch and MultiReader

2013-01-21 Thread Nicola Buso
Hi Shai,

I was thinking of that too, but I'm building all the indexes in a custom
distributed environment, so at the moment I can't have a single
categories index for all the content indexes at indexing time.
A solution could be to merge all the categories indexes into one index
and use your approach, but the merge code I see in the examples also
merges the content indexes, and I can't do that.

I could share the taxonomy if merging is possible (I see the resulting
categories indexes are currently not that big), but I would prefer a
solution where I can collect the facets over multiple categories
indexes; that way I can be sure the solution will scale better.


Nicola.


On Mon, 2013-01-21 at 17:54 +0200, Shai Erera wrote:
> Hi Nicola,
> 
> 
> I think that what you're describing corresponds to distributed faceted
> search. I.e., you have N content indexes, alongside N taxonomy
> indexes.
> 
> The information that's indexed in each of those sub-indexes does not
> correlate with the other ones.
> For example, say that you index the category "Movie/Drama", it may
> receive ordinal 12 in index1 and 23 in index2.
> 
> If you'll try to count ordinals using MultiReader, you'll just mess up
> everything.
> 
> 
> If you can share a single taxonomy index for all N content indexes,
> then you'll be in a super-simple position:
> 
> 1) Open one TaxonomyReader
> 
> 2) Execute search with MultiReader and FacetsCollector
> 
> 
> 
> It doesn't get simpler than that ! :)
> 
> 
> Before I go into great length describing what you should do if you
> cannot share the taxonomy, let me know if that's not an option for
> you.
> 
> Shai
> 
> 
> 
> On Mon, Jan 21, 2013 at 5:39 PM, Nicola Buso  wrote:
> Thanks for the reply Uwe,
> 
> we currently can search with MultiReader over all the indexes
> we have.
> Now I want to add the faceting search, than I created a
> categories index
> for every index I currently have.
> To accumulate the faceted results now I have a MultiReader
> pointing all
> the indexes and I can create a TaxonomyReader for every
> categories index
> I have; all the way I see to obtain FacetResults are:
> 1 - FacetsCollector
> 2 - a FacetsAccumulator implementation
> 
> suppose I use the second option. I should:
> - search as usual using the MultiReader
> - than try to collect all the facetresults iterating over my
> TaxonomyReaders; at every iteration:
>   - I create a FacetsAccumulator using the MultiReader and a
> TaxonomyReader
>   - I get a list of FacetResult from the accumulator.
> - as I finish I should in some way merge all the
> List I
> have.
> 
> I think this solution is not correct because the docsids from
> the search
> are pointing the multireader instead the taxonomyreader is
> pointing to
> the categories index of a single reader.
> I neither like to merge all the List of FacetResult I retrieve
> from the
> Accumulators.
> 
> Probably I'm missing something, can somebody clarify to me how
> I should
> collect the facets in this case?
> 
> 
> Nicola.
> 
> 
> 
> On Mon, 2013-01-21 at 16:22 +0100, Uwe Schindler wrote:
> > Just use MultiReader, it extends IndexReader, so you can
> pass it anywhere where IndexReader can be passed.
> >
> > -
> > Uwe Schindler
> > H.-H.-Meier-Allee 63, D-28213 Bremen
> > http://www.thetaphi.de
> > eMail: u...@thetaphi.de
> >
> > > -Original Message-
> > > From: Nicola Buso [mailto:nb...@ebi.ac.uk]
> > > Sent: Monday, January 21, 2013 3:59 PM
> > > To: java-user@lucene.apache.org
> > > Subject: FacetedSearch and MultiReader
> > >
> > > Hi all,
> > >
> > > I'm trying to develop faceted search using lucene 4.0
> faceting framework.
> > > In our project we are searching on multiple indexes using
> lucene
> > > MultiReader. How should we use the faceted framework to
> obtain
> > > FacetResults starting from a MultiReader? all the example
> I see are using a
> > > "single" IndexReader.
> > >
> > >
> > >
> > > Nicola.
> > >
> > >
> > >
> >
> 
> 
> 

[Fwd: Re: FacetedSearch and MultiReader]

2013-01-21 Thread Nicola Buso

--- Begin Message ---
Hi,

your proposal is not clear to me.

On Mon, 2013-01-21 at 18:21 +0200, Shai Erera wrote:
> Hi
> 
> 
> First, if it's a one time operation, you can merge the taxonomy
> indexes into one, without merging the content indexes too (but you'll
> need to re-map the ordinals in each content index, by e.g. adding it
> to itself). Not a cheap solution.
I don't think I can do that; I have terabytes of indexes, and it
doesn't seem feasible to me.

> 
> Another option is to merge all taxonomy indexes into one, and obtain
> the OrdinalMap per content index.
How does this solution differ from the previous one? I merge the
taxonomies without touching the content indexes?
Is there some documentation explaining how the OrdinalMaps are used with
facets, just so I'm aware of what I'm doing?
Suppose I merge several taxonomy indexes in memory (say I want facets on
a subset of all my indexes, since otherwise it would be too heavy).
I iterate over the taxonomy indexes and merge them with
TaxonomyWriter.addTaxonomy(directory, map);
how do I obtain the OrdinalMap[M]? (I suppose this is the map
corresponding to my MultiReader.)


> Then run the search w/ MultiReader, and when asked to count ordinal M,
> you count ordMap[M] instead.
> 
> You can do so by creating your own Aggregator, and override
> CountFacetRequest.createAggregator().

In CountFacetRequest, where should I use the new OrdinalMap? Should I
use OrdinalMap.getMap() to construct the CountingAggregator?

  public Aggregator createAggregator(boolean useComplements,
                                     FacetArrays arrays,
                                     IndexReader reader,
                                     TaxonomyReader taxonomy) {
    // we rely on that, if needed, result is cleared by arrays!
    int[] a = arrays.getIntArray();
    if (useComplements) {
      return new ComplementCountingAggregator(a);
    }
    return new CountingAggregator(a);
  }
> 
> 
> If that's also not an option, then you'll need to do a form of
> distributed search. You'll need to run the search against each
> content/taxonomy index pair, then collect the top-K and merge the
> categories' weights (counts).
> Note though that in this process you may lose some categories that
> should be in the top-K.
I think I can merge in memory at indexing time.
Can you elaborate a bit more on the solution consisting of merging the
taxonomy indexes?


Nicola


> 
> E.g. imagine that categories A(3) and B(2) are returned from index1
> and A(4) and C(3) are returned from index2 (for top-2, numbers in
> parenthesis denote counts).
> 
> And say that category B appears in index2 with count 2. Then it should
> be among the top 2 categories: A(7), B(4), but instead you'll return
> A(7), C(3).
> 
> You can somewhat overcome that by requesting to count c*K, where 'c'
> is an over-counting factor (say 5), and hopefully the true top-K will
> be in the top-5*K of all indexes.
> 
> That too can break under some extreme circumstances, but we've tested
> it once and c=2 was enough for a rather large index.
> However, since your searches are run locally (i.e. you don't transmit
> intermediate results over the wire), you can use a larger 'c'.
> 
> HTH,
> Shai


--- End Message ---


Re: Tool for Lucene storage recovery

2013-01-21 Thread Erick Erickson
Maybe do the handling as an overridable method and make it abstract?
That would give the skeleton of all the recovery stuff, but then
require the user to implement the actual recovery?

Just a thought
Erick

On Mon, Jan 21, 2013 at 9:06 AM, Michał Brzezicki  wrote:
> I don't think it is possible to simply compile it as jar since you need to
> implement handling of recovered documents.
>
> --
> Michał
>
> 2013/1/19 Simon Willnauer 
>
>> hey,
>>
>> do you wanna open a jira issue for this and attach your code? this
>> might help others too and if the shit hits the fan its good to have
>> something in the lucene jar that can bring some data back.
>>
>> simon
>> On Fri, Jan 18, 2013 at 6:37 PM, Michał Brzezicki 
>> wrote:
>> > in lucene (*.fdt). Code is available here http://pastebin.com/nmF0j4npyou
>>
>>
>>




Re: Tool for Lucene storage recovery

2013-01-21 Thread Erick Erickson
P.S. Or just attach the code without your customized doc recovery
stuff with a note about how to carry it forward? That way someone
could pick it up if interested and generalize it.

Best
Erick

On Mon, Jan 21, 2013 at 12:37 PM, Erick Erickson
 wrote:
> Maybe do the handling as an overridable method and make it abstract?
> That would give the skeleton of all the recovery stuff, but then
> require the user to implement the actual recovery?
>
> Just a thought
> Erick
>
> On Mon, Jan 21, 2013 at 9:06 AM, Michał Brzezicki  
> wrote:
>> I don't think it is possible to simply compile it as jar since you need to
>> implement handling of recovered documents.
>>
>> --
>> Michał
>>
>> 2013/1/19 Simon Willnauer 
>>
>>> hey,
>>>
>>> do you wanna open a jira issue for this and attach your code? this
>>> might help others too and if the shit hits the fan its good to have
>>> something in the lucene jar that can bring some data back.
>>>
>>> simon
>>> On Fri, Jan 18, 2013 at 6:37 PM, Michał Brzezicki 
>>> wrote:
>>> > in lucene (*.fdt). Code is available here http://pastebin.com/nmF0j4npyou
>>>
>>>
>>>




Re: FieldCacheTermsFilter performance

2013-01-21 Thread emmanuel Gosse
Hi,

We have about 120 filters; half of them are selective, but some filters
are "boolean".

It's easy to see where the difference comes from.

binarySearchLookup in DocTermsIndexImpl versus StringIndex :

In StringIndex, just a comparaison between Strings  :
int cmp = lookup[mid].compareTo(key);

In DocTermsIndexImpl, the BytesRef has to be retrieved :

public BytesRef lookup(int ord, BytesRef ret) {
  return bytes.fill(ret, termOrdToBytesOffset.get(ord));
}


Emmanuel


2013/1/20 Uwe Schindler 

> Hi,
>
> in Lucene 4.0 I would recommend to use TermsFilter (from queries module),
> not FieldCacheTermsFilter, because the term dictionary is much faster and
> it is in this case better to use the posting lists, instead of scanning all
> documents (which FCTermsCache does). How many filter terms do you have? Is
> the filter selective? To further improve, use CachingWrapperFilter, too
> (this will cache filter results, which is useful if you have a set of
> Filters/terms that are used quite often).
> The problem with FCTermsFilter is: It scans all documents from beginning
> to end and looks them up the terms cache. In Lucene 4.0 the structure of
> the FieldCache changed to be more memory efficient (which does not hurt the
> primary use-case of sorting), but scanning all documents and resolving all
> terms is not always the best option (this also heavily relies on your index
> structure, FCTermsFilter may still be faster under some circumstances).
>
> Uwe
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
> > -Original Message-
> > From: emmanuel Gosse [mailto:emmanuel.go...@gmail.com]
> > Sent: Saturday, January 19, 2013 10:58 PM
> > To: java-user@lucene.apache.org
> > Subject: FieldCacheTermsFilter performance
> >
> > Hi,
> >
> > I would like to share a performance problem about FieldCacheTermsFilter
> > between 3.0.3 and 4.0.0 Lucene versions.
> >
> > I've made tests with the same application with 3.0.3 (my production
> > version) and 4.0.0.
> > And I found a "big" difference of response time.
> >
> > I run a "real life" injection of 400 000 queries and measure the
> > average response time.
> > I regularly run this type of test to validate that we have no
> > performance regression.
> >
> > So I made other tests to find out where this difference comes from:
> > deactivating faceting, changing the Directory used, and more...
> >
> > And for one test, I deactivated the filters (I use only
> > FieldCacheTermsFilter) and obtained the same average response time.
> > To give some data :
> > 20 millions of documents
> > 3 indexes under a multireader
> > no indexations, only searcher (indexation is not implemented in this app)
> > 400 000 queries with jmeter
> >
> > Test :
> >
> > 3.0.3 or 4.0.0
> > Queries without filters : 60ms (average of time response)
> >
> > Queries with filters:
> > 3.0.3 : 150ms
> > 4.0.0 : 400ms
> >
> > The code difference of my application is only the required one to plug
> with
> > each Lucene version.
> >
> > The fields used to filter are not stored and in 4.0.0 version, are
> stringfield.
> > I checked that caches of fieldCache dont move for the test.
> >
> > I have no more ideas to seek. Maybe I've not understood which type of
> field
> > I should use.
> >
> > Emmanuel
> >
> > ---
> > Emmanuel Gosse
> > Fnac.Com 
>
>
>
>


-- 
Emmanuel Gosse
06 65 26 96 71


Re: Is LogByteSizeMergePolicy deterministic?

2013-01-21 Thread Denis Bazhenov
Can you explain in more detail why that is? We have in-house replication for a 
Lucene 3.6 index and use the default IndexWriter settings. Everything works fine, 
except that sometimes (just after optimization, in fact) the index cannot be opened 
(a segment file is missing on the filesystem). We tolerate this issue by replicating 
the index one more time in case of failure. I guess it's somehow related to the 
issue discussed here.

On Jan 19, 2013, at 5:16 AM, Michael McCandless  
wrote:

> You must also use only a single indexing thread.
> 
> And you must use SerialMergeScheduler.
> 
> If you do that, I think it will be deterministic.
> 
> But don't rely on this ... this is runtime behavior and can suddenly
> change between releases ...
> 
> Mike McCandless
> 
> http://blog.mikemccandless.com
> 
> On Fri, Jan 18, 2013 at 10:39 AM, Apostolis Xekoukoulotakis
>  wrote:
>> I want to replicate an index from multiple replicas at the same time.
>> 
>> Those replicas have been given the same documents and at the same order.
>> 
>> Will the files be the same across all replicas?
>> 
>> 
>> 
>> --
>> 
>> 
>> Sincerely yours,
>> 
>> Apostolis Xekoukoulotakis
> 
> 

---
Denis Bazhenov 









Re: FacetedSearch and MultiReader

2013-01-21 Thread Denis Bazhenov
We have a similar distributed search system, and we ended up with the 
following scheme. Search replicas (the machines where the index resides) build 
FacetResults based on their index chunk (the top N categories with document 
counts). Later on, the results are merged "by hand", summing the counts of 
matching categories from the different replicas.
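
The "merge by hand" step Denis describes can be sketched in plain Java (illustrative names, not a Lucene API): each replica returns its top-N categories with counts, the merger sums the counts per category and re-sorts. As Shai noted earlier, asking each replica for more than K results reduces the chance of missing a true top-K category.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FacetMerge {

    // sum per-category counts across replicas, then keep the top k
    public static LinkedHashMap<String, Integer> merge(
            List<Map<String, Integer>> perReplicaTopN, int k) {
        Map<String, Integer> totals = new HashMap<String, Integer>();
        for (Map<String, Integer> replica : perReplicaTopN) {
            for (Map.Entry<String, Integer> e : replica.entrySet()) {
                Integer n = totals.get(e.getKey());
                totals.put(e.getKey(), (n == null ? 0 : n) + e.getValue());
            }
        }
        // sort by merged count, highest first
        List<Map.Entry<String, Integer>> sorted =
            new ArrayList<Map.Entry<String, Integer>>(totals.entrySet());
        Collections.sort(sorted, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a,
                               Map.Entry<String, Integer> b) {
                return b.getValue() - a.getValue();
            }
        });
        LinkedHashMap<String, Integer> topK = new LinkedHashMap<String, Integer>();
        for (int i = 0; i < Math.min(k, sorted.size()); i++) {
            topK.put(sorted.get(i).getKey(), sorted.get(i).getValue());
        }
        return topK;
    }

    public static void main(String[] args) {
        Map<String, Integer> r1 = new HashMap<String, Integer>();
        r1.put("Drama", 3);
        r1.put("Comedy", 2);
        Map<String, Integer> r2 = new HashMap<String, Integer>();
        r2.put("Drama", 4);
        r2.put("Action", 3);
        List<Map<String, Integer>> replicas = new ArrayList<Map<String, Integer>>();
        replicas.add(r1);
        replicas.add(r2);
        System.out.println(merge(replicas, 2)); // prints {Drama=7, Action=3}
    }
}
```

Note the caveat from earlier in the thread: a category just below each replica's top-N cutoff can be missing from the merged result even when its summed count belongs in the global top-K.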

On Jan 22, 2013, at 3:08 AM, Nicola Buso  wrote:

> Hi Shai,
> 
> I was thinking to that too, but I'm indexing all indexes in a custom
> distributed environment than I can't in this moment have a single
> categories index for all the content indexes at indexing time.
> A solution should be to merge all the categories indexes in one only
> index and use your solution but the merge code I see in the examples
> merge also the content index and I can't do that.
> 
> I should share the taxonomy if is possible to merge (I see the resulting
> categories indexes are not that big currently), but I would prefer to
> have a solution where I can collect the facets over multiple categories
> indexes in this way I will be sure the solution will scale better.
> 
> 
> Nicola.
> 
> 
> On Mon, 2013-01-21 at 17:54 +0200, Shai Erera wrote:
>> Hi Nicola,
>> 
>> 
>> I think that what you're describing corresponds to distributed faceted
>> search. I.e., you have N content indexes, alongside N taxonomy
>> indexes.
>> 
>> The information that's indexed in each of those sub-indexes does not
>> correlate with the other ones.
>> For example, say that you index the category "Movie/Drama", it may
>> receive ordinal 12 in index1 and 23 in index2.
>> 
>> If you'll try to count ordinals using MultiReader, you'll just mess up
>> everything.
>> 
>> 
>> If you can share a single taxonomy index for all N content indexes,
>> then you'll be in a super-simple position:
>> 
>> 1) Open one TaxonomyReader
>> 
>> 2) Execute search with MultiReader and FacetsCollector
>> 
>> 
>> 
>> It doesn't get simpler than that ! :)
>> 
>> 
>> Before I go into great length describing what you should do if you
>> cannot share the taxonomy, let me know if that's not an option for
>> you.
>> 
>> Shai
>> 
>> 
>> 
>> On Mon, Jan 21, 2013 at 5:39 PM, Nicola Buso  wrote:
>>Thanks for the reply Uwe,
>> 
>>we currently can search with MultiReader over all the indexes
>>we have.
>>Now I want to add the faceting search, than I created a
>>categories index
>>for every index I currently have.
>>To accumulate the faceted results now I have a MultiReader
>>pointing all
>>the indexes and I can create a TaxonomyReader for every
>>categories index
>>I have; all the way I see to obtain FacetResults are:
>>1 - FacetsCollector
>>2 - a FacetsAccumulator implementation
>> 
>>suppose I use the second option. I should:
>>- search as usual using the MultiReader
>>- than try to collect all the facetresults iterating over my
>>TaxonomyReaders; at every iteration:
>>  - I create a FacetsAccumulator using the MultiReader and a
>>TaxonomyReader
>>  - I get a list of FacetResult from the accumulator.
>>- as I finish I should in some way merge all the
>>List I
>>have.
>> 
>>I think this solution is not correct because the docsids from
>>the search
>>are pointing the multireader instead the taxonomyreader is
>>pointing to
>>the categories index of a single reader.
>>I neither like to merge all the List of FacetResult I retrieve
>>from the
>>Accumulators.
>> 
>>Probably I'm missing something, can somebody clarify to me how
>>I should
>>collect the facets in this case?
>> 
>> 
>>Nicola.
>> 
>> 
>> 
>>On Mon, 2013-01-21 at 16:22 +0100, Uwe Schindler wrote:
>>> Just use MultiReader, it extends IndexReader, so you can
>>pass it anywhere where IndexReader can be passed.
>>> 
>>> -
>>> Uwe Schindler
>>> H.-H.-Meier-Allee 63, D-28213 Bremen
>>> http://www.thetaphi.de
>>> eMail: u...@thetaphi.de
>>> 
 -Original Message-
 From: Nicola Buso [mailto:nb...@ebi.ac.uk]
 Sent: Monday, January 21, 2013 3:59 PM
 To: java-user@lucene.apache.org
 Subject: FacetedSearch and MultiReader
 
 Hi all,
 
 I'm trying to develop faceted search using lucene 4.0
>>faceting framework.
 In our project we are searching on multiple indexes using
>>lucene
 MultiReader. How should we use the faceted framework to
>>obtain
 FacetResults starting from a MultiReader? all the example
>>I see are using a
 "single" IndexReader.
 
 
 
 Nicola.
 
 
 
>>> 
>> 
>> 
>> 

Re: FacetedSearch and MultiReader

2013-01-21 Thread Shai Erera
Hi Nicola,

What I had in mind is something similar to this, which is possible starting
with Lucene 4.1, due to changes done to facets (per-segment faceting):

DirTaxoWriter master = new DirTaxoWriter(masterDir);
Directory[] origTaxoDirs = new Directory[numTaxoDirs]; // open Directories
and store in that array
OrdinalMap[] ordinalMaps = new OrdinalMap[numTaxoDirs]; // initialize
OrdinalMap and store in that array

// now do the merge
for (int i = 0; i < origTaxoDirs.length; i++) {
  master.addTaxonomy(origTaxoDir[i], ordinalMaps[i]);
}

// now open your readers, and create the important map
Map<AtomicReader, OrdinalMap> readerOrdinals =
    new HashMap<AtomicReader, OrdinalMap>();
DirectoryReader[] readers = new DirectoryReader[origTaxoDirs.length];
for (int i = 0; i < origTaxoDirs.length; i++) {
  DirectoryReader r = DirectoryReader.open(contentDirectories[i]);
  readers[i] = r;
  OrdinalMap ordMap = ordinalMaps[i];
  for (AtomicReaderContext ctx : r.leaves()) {
    readerOrdinals.put(ctx.reader(), ordMap);
  }
}

MultiReader mr = new MultiReader(readers);

// create your FacetRequest (CountFacetRequest) with a custom Aggregator
FacetRequest fr = new CountFacetRequest(cp, topK) {
  @Override
  public Aggregator createAggregator(...) {
    return new OrdinalMappingAggregator() {
      int[] ordMap;

      @Override
      public void setNextReader(AtomicReaderContext context) {
        ordMap = readerOrdinals.get(context.reader()).getMap();
      }

      @Override
      public void aggregate(int docID, float score, IntsRef ordinals) {
        int upto = ordinals.offset + ordinals.length;
        for (int i = ordinals.offset; i < upto; i++) {
          // original ordinal read for the AtomicReader given to setNextReader
          int ordinal = ordinals.ints[i];
          // mapped ordinal, following the taxonomy merge
          int mappedOrdinal = ordMap[ordinal];
          // count the mapped ordinal instead, so all AtomicReaders count it
          counts[mappedOrdinal]++;
        }
      }
    };
  }
};

While it may look like I wrote actual code to do it, I didn't :). So I
guess it should work, but I haven't tried it.
That way, you don't touch the content indexes at all, just the taxonomy
ones.

Note however that you'll need to do this step every time the taxonomy index
is updated, and you refresh the TaxoReader instance.
Also, this will only work if all your indexes are opened in the same JVM
(which I assume that's the case, since you use MultiReader).

If you still don't want to do that, then what Dennis wrote above is another
way to do distributed faceted search, either inside the same JVM or across
multiple JVMs.
You obtain the FacetResult from each search and merge the results
(unfortunately, there's still no tool in Lucene to do that for you).
Just make sure to ask for a larger K, to ensure that the correct top-K is
returned (see my previous notes).
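
Since there is no tool in Lucene to merge FacetResult objects across
sub-indexes, here is a minimal sketch of doing it "by hand", assuming each
replica's top-N categories have already been extracted into a plain
Map of label to count (the FacetMerger class, mergeTopK method and the map
representation are all hypothetical, not Lucene API):

```java
import java.util.*;

// Hypothetical sketch: merge per-replica facet counts and keep the global
// top-K. Extracting label/count pairs from FacetResultNode is omitted.
public class FacetMerger {

    // Sum counts per label across replicas, then keep the top-K by count.
    public static List<Map.Entry<String, Integer>> mergeTopK(
            List<Map<String, Integer>> perReplica, int k) {
        Map<String, Integer> totals = new HashMap<String, Integer>();
        for (Map<String, Integer> replica : perReplica) {
            for (Map.Entry<String, Integer> e : replica.entrySet()) {
                Integer prev = totals.get(e.getKey());
                totals.put(e.getKey(),
                        (prev == null ? 0 : prev) + e.getValue());
            }
        }
        List<Map.Entry<String, Integer>> merged =
                new ArrayList<Map.Entry<String, Integer>>(totals.entrySet());
        Collections.sort(merged, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a,
                               Map.Entry<String, Integer> b) {
                return b.getValue() - a.getValue(); // descending by count
            }
        });
        return merged.subList(0, Math.min(k, merged.size()));
    }

    public static void main(String[] args) {
        Map<String, Integer> r1 = new HashMap<String, Integer>();
        r1.put("Movie/Drama", 10);
        r1.put("Movie/Comedy", 4);
        Map<String, Integer> r2 = new HashMap<String, Integer>();
        r2.put("Movie/Drama", 3);
        r2.put("Movie/Action", 6);
        for (Map.Entry<String, Integer> e
                : mergeTopK(Arrays.asList(r1, r2), 2)) {
            System.out.println(e.getKey() + "=" + e.getValue());
        }
        // prints Movie/Drama=13 then Movie/Action=6
    }
}
```

As noted above, ask each replica for a larger K than you need, otherwise a
category that is just below the cutoff on every replica can be missing from
the merged totals.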

Shai




On Tue, Jan 22, 2013 at 4:32 AM, Denis Bazhenov  wrote:

> We have a similar distributed search system and we ended up with the
> following scheme. Search replicas (machines where the index resides) build
> FacetResults based on their index chunk (top N categories with document
> counts). Later the results are merged "by hand", summing relevant
> categories from different replicas.
>
> On Jan 22, 2013, at 3:08 AM, Nicola Buso  wrote:
>
> > Hi Shai,
> >
> > I was thinking to that too, but I'm indexing all indexes in a custom
> > distributed environment than I can't in this moment have a single
> > categories index for all the content indexes at indexing time.
> > A solution should be to merge all the categories indexes in one only
> > index and use your solution but the merge code I see in the examples
> > merge also the content index and I can't do that.
> >
> > I should share the taxonomy if is possible to merge (I see the resulting
> > categories indexes are not that big currently), but I would prefer to
> > have a solution where I can collect the facets over multiple categories
> > indexes in this way I will be sure the solution will scale better.
> >
> >
> > Nicola.
> >
> >
> > On Mon, 2013-01-21 at 17:54 +0200, Shai Erera wrote:
> >> Hi Nicola,
> >>
> >>
> >> I think that what you're describing corresponds to distributed faceted
> >> search. I.e., you have N content indexes, alongside N taxonomy
> >> indexes.
> >>
> >> The information that's indexed in each of those sub-indexes does not
> >> correlate with the other ones.
> >> For example, say that you index the category "Movie/Drama", it may
> >> receive ordinal 12 in index1 and 23 in index2.
> >>
> >> If you'll try to count ordinals using MultiReader, you'll just mess up
> >> everything.
> >>
> >>
> >> If you can share a single taxonomy index for all N content indexes,
> >> then you'll be in a super-simple position:
> >>
> >> 1) Open one TaxonomyReader
> >>
> >> 2) Execute search with MultiReader and FacetsCollector
> >>
> >>
> >>
> >> It doesn't get simpler than that ! :)
> >>
> >>
> >> Before I go into great length describing what you should do if you
> >> cannot share the taxonomy, let me know if th