Sorting BigInteger

2016-08-21 Thread Cristian Lorenzetto
I took a look at BigIntegerPoint, but I didn't see any reference to
sorting, and SortedNumericDocValuesField accepts long values, not
BigInteger.


I thought to sort like this:

BigInteger bi = (BigInteger) o;
byte[] b = new byte[BigIntegerPoint.BYTES]; // fixed-width 16-byte buffer for the sortable encoding
NumericUtils.bigIntToSortableBytes(bi, BigIntegerPoint.BYTES, b, 0);
doc.add(new SortedSetDocValuesField(key, new BytesRef(b)));

Is this correct, and is it the best practice?
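
A minimal end-to-end sketch of this approach, assuming Lucene 6.x with the
sandbox module (for BigIntegerPoint) on the classpath; the field name "big"
and the helper method are illustrative, and the value must fit in 16 bytes
or bigIntToSortableBytes will throw:

import java.math.BigInteger;
import org.apache.lucene.document.BigIntegerPoint; // lucene-sandbox
import org.apache.lucene.document.Document;
import org.apache.lucene.document.SortedSetDocValuesField;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortedSetSortField;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.NumericUtils;

static void addBigIntegerSortField(Document doc, String field, BigInteger value) {
  // Encode into the fixed-width 16-byte sortable representation.
  byte[] encoded = new byte[BigIntegerPoint.BYTES];
  NumericUtils.bigIntToSortableBytes(value, BigIntegerPoint.BYTES, encoded, 0);
  doc.add(new SortedSetDocValuesField(field, new BytesRef(encoded)));
}

// At search time the encoded bytes compare in numeric order, so a plain
// byte-wise sort on the doc values yields numeric ordering:
Sort sort = new Sort(new SortedSetSortField("big", /*reverse=*/false));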


searchAfter behavior after reindexing

2016-08-21 Thread Rajnish kamboj
Hi Team

What is the searchAfter behavior if the index is continuously being updated?
Document numbers change when the index is updated, and they also change on
segment merges.

Now, suppose:
- I am holding a ScoreDoc from before an index update.
- The index is updated (document numbers change).
  (A document number may no longer be relevant in the context of the
  searchAfter query.)
- I pass this ScoreDoc to searchAfter.

Will searchAfter start after the ScoreDoc's document number, or will it
search from scratch?
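
For reference, a minimal deep-paging sketch, assuming Lucene 6.x and that
the same IndexSearcher (i.e. the same point-in-time reader) is reused
across pages; searcher and query are assumed to be set up already:

ScoreDoc after = null;
while (true) {
  TopDocs page = (after == null)
      ? searcher.search(query, 100)
      : searcher.searchAfter(after, query, 100);
  if (page.scoreDocs.length == 0) break;
  // ... process page.scoreDocs ...
  // The last hit becomes the cursor for the next page.
  after = page.scoreDocs[page.scoreDocs.length - 1];
}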


Regards
Rajnish


Re: docid is just a signed int32

2016-08-21 Thread Cristian Lorenzetto
I am looking over TopDocs.merge.

What is the difference between using multiple IndexSearchers and merging
with TopDocs.merge, versus using a MultiReader?
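
For concreteness, a minimal sketch of the merge approach, assuming Lucene
6.x with one IndexSearcher per shard index (searcher0, searcher1, and
query are illustrative):

// Run the same query independently on each shard, then merge the
// per-shard top hits into a single ranked top-10.
TopDocs shard0 = searcher0.search(query, 10);
TopDocs shard1 = searcher1.search(query, 10);
TopDocs merged = TopDocs.merge(10, new TopDocs[] { shard0, shard1 });
// Doc IDs in merged.scoreDocs stay shard-local; ScoreDoc.shardIndex
// records which shard each hit came from.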

2016-08-21 2:28 GMT+02:00 Cristian Lorenzetto :

> In my opinion, this study doesn't tell us anything new. Obviously, if you
> try to retrieve everything in the store with a single query, performance
> will not be good. Lucene is fantastic, but it is not magic: the laws of
> physics still apply to it. Queries are designed for retrieving a small
> part of a big store, not the whole store. In addition, I think the time
> would be just as bad if you didn't sort the documents; using a persisted
> sorted linked list, I don't see relevant delays. Sincerely, I also don't
> understand the GC memory limit with Lucene's algorithms: the amount of
> memory used is not proportional to the datastore size, otherwise Lucene
> would not be scalable. For me, the problem to analyze is a different one:
> considering the trend of big data to grow in recent years, the typical
> maximum size of the databases we know, and whether or not Lucene can scale
> up its sharding across dynamically defined arrays, we can evaluate whether
> this refactoring makes sense or not.
>
> Sent from my iPad
>
> > On 19 Aug 2016, at 05:50, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> > OK, I'm a little out of my league here, but I'll plow on anyway
> >
> > bq: There are use cases out there where >2^31 does make sense in a
> single index
> >
> > Ok, let's put some definition to this and define the use-case
> > specifically rather than
> > be vague. I've just run an experiment for instance where I had 200M
> > docs in a single
> > shard (very small docs) and tried to sort by a date on all of them.
> > Performance on the order of
> > 5 seconds. 3B is what, 75 seconds? Does the use-case involve sorting?
> > Faceting? If
> > so the performance will probably be poor.
> >
> > This would be huge surgery I believe, and there hasn't been a
> > compelling use-case
> > in the search world for it. Unless and until that case is made I
> > suspect this idea will
> > meet with a lot of resistance.
> >
> > That said, I do understand that this is somewhat akin to "Nobody will
> > ever need more
> > than 64K of ram", meaning that some limits are assumed and eventually
> become
> > outmoded. But given Java's issues with memory and GC I suspect that
> > it'll be really
> > hard to justify the work this would take.
> >
> > FWIW,
> > Erick
> >
> >
> >> On Thu, Aug 18, 2016 at 6:31 PM, Trejkaz wrote:
> >>> On Thu, Aug 18, 2016 at 11:55 PM, Adrien Grand wrote:
> >>> No, IndexWriter enforces that the number of documents cannot go over
> >>> IndexWriter.MAX_DOCS (which is a bit less than 2^31), and
> >>> BaseCompositeReader computes the number of documents in a long
> >>> variable and ensures it is less than 2^31, so you cannot have indexes
> >>> that contain more than 2^31 documents.
> >>>
> >>> Larger collections should be written to multiple shards and use
> >>> TopDocs.merge to merge results.
> >>
> >> But hang on:
> >> * TopDocs#merge still returns a TopDocs.
> >> * TopDocs still uses an array of ScoreDoc.
> >> * ScoreDoc still uses an int doc ID.
> >>
> >> Looks like you're still screwed.
> >>
> >> I wish IndexReader would use long IDs too, because one IndexReader can
> >> be across multiple shards too - it doesn't make much sense to me that
> >> this is restricted, although "it's hard to fix in a
> >> backwards-compatible way" is certainly a good reason. :D
> >>
> >> TX
> >>
>


Re: docid is just a signed int32

2016-08-21 Thread Cristian Lorenzetto
Maybe with TopDocs.merge you can run the same query on multiple indexes,
while with MultiReader you can also perform join operations across
different indexes.
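
For contrast with the merge sketch above, a minimal MultiReader sketch
under the same assumptions (reader0 and reader1 are the per-index
readers):

// One logical index over several physical ones: doc IDs are mapped into
// a single global space, so one searcher and one query see everything.
MultiReader multi = new MultiReader(reader0, reader1);
IndexSearcher searcher = new IndexSearcher(multi);
TopDocs hits = searcher.search(query, 10);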

2016-08-21 19:31 GMT+02:00 Cristian Lorenzetto <
cristian.lorenze...@gmail.com>:

> I am looking over TopDocs.merge.
>
> What is the difference between using multiple IndexSearchers and merging
> with TopDocs.merge, versus using a MultiReader?
>


Re: Using Lucene's Multi Dimensional Space Search for Air traffic handling.

2016-08-21 Thread Janaka Thilakarathna
Hi Michael,

I started playing with Lucene's LatLonPoint and Geo3D points. I have a
question about one of the constructors of Geo3DPoint:
Geo3DPoint(String name, double x, double y, double z).

How can I map latitude, longitude, and altitude into x, y, z? If we use
x, y, z there must be an axis system, so for example:

   - Where do those axes point?
   - What are the units (km or m)?

If you can give me an idea about that, it would be really helpful. :-)
Thank you.

Regards,

Janaka.
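
For what it's worth, a hedged sketch of how the Cartesian form relates to
lat/lon, assuming Lucene 6.x spatial3d: the x, y, z coordinates lie on the
planet model's ellipsoid with the mean earth radius normalized to roughly
1.0 (so the units are fractions of the earth radius, not km or m); the x
axis points at lat=0/lon=0 and the z axis at the north pole. Expressing
altitude as an offset in those same radius units is an assumption on my
part:

import org.apache.lucene.spatial3d.Geo3DPoint;
import org.apache.lucene.spatial3d.geom.GeoPoint;
import org.apache.lucene.spatial3d.geom.PlanetModel;

// Unit-ellipsoid coordinates of a surface position; GeoPoint takes radians.
GeoPoint p = new GeoPoint(PlanetModel.WGS84,
    Math.toRadians(45.0), Math.toRadians(7.5));
doc.add(new Geo3DPoint("location", p.x, p.y, p.z));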

On Thu, Aug 18, 2016 at 9:55 AM, Janaka Thilakarathna <
bjchathura...@gmail.com> wrote:

> Hi Michael,
>
> Sorry for the late reply, and thank you very much for your quick response.
> :-)
>
> Yeah, it looks like an interesting data set to play with, but it is really
> large to start with. :D
> I will try some simple projects and get back to you if I find any trouble.
>
> Janaka.
>
> On Tue, Aug 16, 2016 at 2:49 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> What a fun use case for dimensional points!  I just saw NASA announce
>> this data set recently:
>> https://plus.google.com/+MichaelMcCandless/posts/h8eUtkhizKG
>>
>> And I was wondering how to play with it... 36 TB of airplane flight
>> routes :)
>>
>> You can easily index your data (3 spatial dims + 1 time dim) using e.g.
>> DoublePoint but then the only way to query those points currently is the
>> PointRangeQuery (4D boxes); maybe you can use that to find the "interesting
>> area" traversals?
>>
>> For "minimum distance between two air-planes", you might be able to start
>> with LatLonPoint.nearest (KNN search implementation) but generalize it a
>> bit to N dims not just the 2 (lat, lon) that it supports today?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Tue, Aug 16, 2016 at 4:29 AM, Janaka Thilakarathna <
>> bjchathura...@gmail.com> wrote:
>>
>>> Hi everyone,
>>>
>>> I am from the University of Moratuwa. I have a good understanding of
>>> Lucene's text search, Geo3DPoint, and LatLonPoint, but I have never
>>> used Lucene for multi-dimensional space search.
>>>
>>> The idea is to use multi-dimensional space search on a 4-dimensional
>>> space (3 physical dimensions plus time as the fourth) to calculate
>>> results for the following queries:
>>>
>>>    - the minimum distance between two airplanes
>>>    - whether an airplane goes through an interesting area (for example,
>>>      a forbidden airspace)
>>>
>>> Paths of airplanes can be represented by arrays of 4D points
>>> (time, x, y, z); in other words, I have different x, y, z coordinates
>>> for different time values. My idea is to index these points and query
>>> for the results above.
>>>
>>> Since there are not many tutorials on this new feature in Lucene 6, I am
>>> quite confused about where to start the project. I would be really glad
>>> if someone could help me with this.
>>>
>>> I just want to know whether I can use Lucene for this use case. Further,
>>> if you can point me to a place to start developing, it would be really
>>> helpful.
>>>
>>> Thank you!
>>>
>>> Regards
>>>
>>
>>
>
>
>




Lucene commit

2016-08-21 Thread Paul Masurel
Hi,

If I understand correctly, Lucene indexing threads each work on their own
individual segment. When a thread has enough documents in its segment, it
flushes it to disk and starts a new one. But segments only become
searchable when they are committed.

Now my question is, wouldn't it be nice to be able to set up Lucene so that
segments are made searchable as soon as they are flushed?

Commit would still play the role of a "checkpoint" in a hardware-failure
scenario; in that sense this is different from the old "autocommit"
feature.

Of course, such a "searchable yet not committed" flushed segment leads to
the following odd behavior:
- documents can become searchable and then, in case of failure, become
unsearchable again (and eventually searchable once more if the client does
its job properly and reindexes the rolled-back documents);
- one document can become searchable after another one even though it was
added before it.

The benefit would be to reduce the average latency for a document
to become searchable, without hurting throughput by calling commit() too
frequently.

Regards,

Paul


Re: Lucene commit

2016-08-21 Thread Christoph Kaser

Hello Paul,

This is already possible using
DirectoryReader.openIfChanged(indexReader, indexWriter). It will give you
an IndexReader that already "sees" all changes made by the writer (up to
that point), even though the changes have not yet been committed:

https://lucene.apache.org/core/6_1_0/core/org/apache/lucene/index/DirectoryReader.html#openIfChanged-org.apache.lucene.index.DirectoryReader-org.apache.lucene.index.IndexWriter-
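
A minimal sketch of that pattern, assuming Lucene 6.x (writer is the live
IndexWriter):

// Open a near-real-time reader from the writer: it sees flushed and
// in-memory changes that have not been committed yet.
DirectoryReader reader = DirectoryReader.open(writer);

// Later, after more documents were added, refresh cheaply:
DirectoryReader newer = DirectoryReader.openIfChanged(reader, writer);
if (newer != null) {   // null means nothing changed
  reader.close();
  reader = newer;
}
IndexSearcher searcher = new IndexSearcher(reader);

In practice SearcherManager wraps essentially this refresh-and-swap loop.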

Regards,
Christoph
