Re: Comparing two indexes for equality - Finding non stored fieldNames per document
> How about the quickest solution: dump the content of both indexes to a
> document-per-line text

That would work (and is the plan), but so far I can only get the stored
field per document and no other data on a per-document basis. What other
data can we get per document using the Lucene API?

Chetan Mehrotra

On Tue, Jan 2, 2018 at 1:03 PM, Dawid Weiss wrote:
> How about the quickest solution: dump the content of both indexes to a
> document-per-line text file, sort, diff?
>
> Even if your indexes are large, if you have a large spare disk, this
> will be super fast.
>
> Dawid
>
> On Tue, Jan 2, 2018 at 7:33 AM, Chetan Mehrotra wrote:
>> Hi,
>>
>> We use Lucene for indexing in Jackrabbit Oak [2]. Recently we
>> implemented a new indexing approach [1] which traverses the data to be
>> indexed in a different way compared to the traversal approach we have
>> been using so far. The new approach is faster and produces an index
>> with the same number of documents.
>>
>> Some notes around the index:
>>
>> - The Lucene index only has one stored field, for the ':path' of the
>>   node in the repository.
>> - Content being indexed is unstructured, so the presence of fields may
>>   differ.
>> - Lucene version 4.7.x.
>> - Both approaches index a given node in the same way; it's just the
>>   traversal order which differs.
>>
>> Now we need to compare the index produced by the earlier approach with
>> the newer one to determine if the generated index is the "same". As the
>> indexed data is traversed in a different order, the documentId would
>> differ between the two indexes, and hence the final size differs to
>> some extent.
>>
>> So I would like to implement logic which can logically compare the two
>> indexes. One way could be to check whether a document with a given path
>> has the same fieldNames in both indexes. However, as fields are not
>> stored, it's not possible to determine the fieldNames per document.
>>
>> Questions
>> --
>>
>> 1. Is there any way to map the field names (not the values) associated
>>    with a given document?
>> 2. Is there any other way to logically compare the index data between
>>    two indexes which are generated using different approaches but index
>>    the same content?
>>
>> Chetan Mehrotra
>> [1] https://issues.apache.org/jira/browse/OAK-6353
>> [2] http://jackrabbit.apache.org/oak/docs/query/lucene.html
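For reference, a minimal sketch (not from the thread) of the document-per-line dump being discussed, written against the Lucene 4.x API mentioned above and limited to what is reachable per document without any uninversion: the stored ':path' field. The class name and command-line arguments are illustrative placeholders.

    import java.io.File;
    import java.io.PrintWriter;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.MultiFields;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Bits;

    public class DumpStoredPaths {
      public static void main(String[] args) throws Exception {
        // args[0] = index directory, args[1] = output file (placeholders)
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(new File(args[0])));
             PrintWriter out = new PrintWriter(args[1], "UTF-8")) {
          Bits liveDocs = MultiFields.getLiveDocs(reader); // null when the index has no deletions
          for (int docId = 0; docId < reader.maxDoc(); docId++) {
            if (liveDocs != null && !liveDocs.get(docId)) {
              continue;                            // skip deleted documents
            }
            Document doc = reader.document(docId); // loads stored fields only
            out.println(doc.get(":path"));         // the single stored field in this index
          }
        }
      }
    }

Sorting and diffing the two output files compares the document sets by path, but says nothing about the indexed field names, which is the gap the rest of the thread is about.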
Re: Comparing two indexes for equality - Finding non stored fieldNames per document
Only stored fields are kept for each document. If you need to dump
internal data structures (terms, positions, offsets, payloads, you name
it) you'll need to dive into the API and traverse all segments, then dump
the above (and note that document IDs are per-segment and will have to be
somehow consolidated back to your document IDs).

I don't quite understand the motive here -- the indexes should behave
identically regardless of the order of input documents; what's the point
of dumping all this information?

Dawid
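A rough illustration, under the same Lucene 4.x assumption, of the per-segment traversal and doc-ID consolidation described above. Since segment-local doc IDs are not comparable across indexes, the sketch resolves each one to the stored ':path' key (docBase gives an index-wide ID, which is still only meaningful within one index); the class and method names are placeholders.

    import java.io.IOException;
    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.index.DirectoryReader;

    public class SegmentWalk {
      static void walk(DirectoryReader reader) throws IOException {
        for (AtomicReaderContext ctx : reader.leaves()) {     // one context per segment
          AtomicReader segment = ctx.reader();
          for (int local = 0; local < segment.maxDoc(); local++) {
            if (segment.getLiveDocs() != null && !segment.getLiveDocs().get(local)) {
              continue;                                       // deleted in this segment
            }
            int global = ctx.docBase + local;                 // index-wide doc ID
            String path = segment.document(local).get(":path"); // stable key across both indexes
            // ... dump whatever per-segment data is extracted, keyed by path rather than doc ID
          }
        }
      }
    }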
Re: Comparing two indexes for equality - Finding non stored fieldNames per document
> Only stored fields are kept for each document. If you need to dump
> internal data structures (terms, positions, offsets, payloads, you
> name it) you'll need to dive into the API and traverse all segments,
> then dump the above (and note that document IDs are per-segment and
> will have to be somehow consolidated back to your document IDs).

Okie. So this would require a deeper understanding of the index format.
Would have a look. To start with I was just looking for a way to dump the
indexed field names per document and nothing more:

/foo/bar|status, lastModified
/foo/baz|status, type

Here the path is the stored field (primary key) and the rest are the
sorted field names. Such a file can then be generated for both indexes
and a diff can be done after sorting.

> I don't quite understand the motive here -- the indexes should behave
> identically regardless of the order of input documents; what's the
> point of dumping all this information?

This is because of the way the indexing logic is given access to the node
hierarchy. Let me try to provide a brief explanation.

Jackrabbit Oak provides hierarchical storage in a tree form where
subtrees can be of a specific type:

/content/dam/assets/december/banner.png
  - jcr:primaryType = "app:Asset"
  + jcr:content
    - jcr:primaryType = "app:AssetContent"
    + metadata
      - status = "published"
      - jcr:lastModified = "2009-10-9T21:52:31"
      - app:tags = ["properties:orientation/landscape", "marketing:interest/product"]
      - comment = "Image for december launch"
      - jcr:title = "December Banner"
      + xmpMM:History
        + 1
          - softwareAgent = "Adobe Photoshop"
          - author = "David"
    + renditions (nt:folder)
      + original (nt:file)
        + jcr:content
          - jcr:data = ...

To access this content Oak provides a NodeStore/NodeState API [1] which
provides a way to access the children. The default indexing logic uses
this API to read the content to be indexed, and uses index rules which
allow content to be indexed via relative paths. For example, it would
create a Lucene field "status" which maps to jcr:content/metadata/@status
(for an index rule for nodes of type app:Asset).

This mode of access proved to be slow over remote storage like Mongo,
especially for the full reindexing case. So we implemented a newer
approach where all content is dumped to a flat file (1 node per line),
the file is sorted, and a NodeState impl is then layered over this flat
file. This changes the way relative paths work, and thus there may be
some potential bugs in the newer implementation.

Hence we need to validate that indexing using the new API produces the
same index as using the stable API. In such a case both indexes would
have a document for "/content/dam/assets/december/banner.png", but if the
newer impl had a bug then it may not have indexed the "status" field.

So I am looking for a way to map all fieldNames for a given document. The
actual indexed content would be the same if both indexes have the
"status" field indexed, so we only need to validate the fieldNames per
document. Something like

Thanks for reading all this if you have read so far :)

Chetan Mehrotra
[1] https://github.com/apache/jackrabbit-oak/blob/trunk/oak-store-spi/src/main/java/org/apache/jackrabbit/oak/spi/state/NodeState.java
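A hedged sketch, assuming Lucene 4.7.x, of one way to produce the "path|sorted field names" file described above: walk the postings of every field in every segment (the "uninversion" Dawid's next reply explains), and record the field name against the stored ':path' of every document in which any of its terms occurs. The class name and I/O handling are illustrative; loading the stored document inside the postings loop is simple but slow, so a real run would cache the local-docID-to-path mapping per segment.

    import java.io.File;
    import java.io.PrintWriter;
    import java.util.Map;
    import java.util.SortedSet;
    import java.util.TreeMap;
    import java.util.TreeSet;
    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.DocsEnum;
    import org.apache.lucene.index.Fields;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.store.FSDirectory;

    public class FieldNamesPerDocument {
      public static void main(String[] args) throws Exception {
        // args[0] = index directory, args[1] = output file (placeholders)
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(new File(args[0])));
             PrintWriter out = new PrintWriter(args[1], "UTF-8")) {
          // path -> sorted set of indexed field names present on that document
          Map<String, SortedSet<String>> fieldsByPath = new TreeMap<String, SortedSet<String>>();
          for (AtomicReaderContext ctx : reader.leaves()) {
            AtomicReader segment = ctx.reader();
            Fields fields = segment.fields();
            if (fields == null) {
              continue;                              // segment without postings
            }
            for (String field : fields) {            // every indexed field in this segment
              Terms terms = segment.terms(field);
              if (terms == null) {
                continue;
              }
              TermsEnum termsEnum = terms.iterator(null);
              DocsEnum docs = null;
              while (termsEnum.next() != null) {     // every term of the field ...
                docs = termsEnum.docs(segment.getLiveDocs(), docs, DocsEnum.FLAG_NONE);
                int doc;
                while ((doc = docs.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                  // ... marks the field as present on each document it occurs in
                  String path = segment.document(doc).get(":path");
                  SortedSet<String> names = fieldsByPath.get(path);
                  if (names == null) {
                    names = new TreeSet<String>();
                    fieldsByPath.put(path, names);
                  }
                  names.add(field);
                }
              }
            }
          }
          for (Map.Entry<String, SortedSet<String>> e : fieldsByPath.entrySet()) {
            StringBuilder line = new StringBuilder(e.getKey()).append('|');
            String sep = "";
            for (String name : e.getValue()) {
              line.append(sep).append(name);
              sep = ", ";
            }
            out.println(line);
          }
        }
      }
    }

Run against both indexes, the outputs (already sorted by path via the TreeMap) can be diffed directly; any document whose set of indexed field names differs, e.g. a missing "status", shows up as a changed line.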
Re: Comparing two indexes for equality - Finding non stored fieldNames per document
Ok. I think you should look at the Java API -- this will give you more
clarity about what is actually stored in the index and how to extract it.
The thing (I think) you're missing is that an inverted index points in
the "other" direction (from a given value to all documents that contained
it). So unless you "store" that value with the document as a stored
field, you'll have to "uninvert" the index yourself.

Dawid
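To make the direction Dawid describes concrete: a segment's FieldInfos lists which fields exist somewhere in that segment, but per-document presence is only recoverable by walking each field's postings, as in the sketch earlier in the thread. A small illustrative fragment against the 4.x API (class and method names assumed):

    import org.apache.lucene.index.AtomicReader;
    import org.apache.lucene.index.FieldInfo;

    public class ListSegmentFields {
      // Field names known to a segment -- a per-segment, not per-document, view.
      static void printFieldNames(AtomicReader segment) {
        for (FieldInfo fi : segment.getFieldInfos()) {
          System.out.println(fi.name + " indexed=" + fi.isIndexed());
        }
        // Which *documents* carry each field is the inverse of what the postings
        // store (field + term -> docs), hence the need to "uninvert" by iterating
        // each field's terms and their DocsEnum.
      }
    }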
Re: Comparing two indexes for equality - Finding non stored fieldNames per document
Luke has some capabilities to look at the index at a low level; perhaps
that could give you some pointers. I think you can pull the older branch
from here:

https://github.com/DmitryKey/luke

or:

https://code.google.com/archive/p/luke/

NOTE: This is not a part of Lucene, but an independent project, so it
won't have the same labels.

Best,
Erick