Re: dv field is too large
I agree, I'll improve the docs about this limit. Thanks Sheng.

Mike McCandless

http://blog.mikemccandless.com

On Wed, Jul 6, 2016 at 10:59 PM, Sheng wrote:

> I agree. That said, wouldn't it also make sense to clearly point it out
> by adding comments to the corresponding classes? This is not the first
> time I have run into this kind of "magic number" pitfall when using
> Lucene (e.g., the 1024 limit for token length in early versions of
> Lucene). Generally speaking, the documentation is pretty good and
> helpful. But without documenting subtle issues like this, they may only
> manifest themselves in production when the real data come in and they
> are "big".
>
> On Wednesday, July 6, 2016, Erick Erickson wrote:
>
>> Well, if you must sort on a 32K single value (although I think this is
>> extremely silly; _nobody_ will notice that two docs are out of order
>> because they were identical up until the 30,000th character but the
>> 30,001st character isn't sorted correctly), do as Mike suggests and
>> chop it off before sending it to Lucene.
>>
>> Best,
>> Erick
>>
>> On Wed, Jul 6, 2016 at 3:53 PM, Sheng wrote:
>>
>>> You misunderstand. I have many fields, and unfortunately a few of
>>> them are quite big, i.e. exceeding the 32K limit. In order to make
>>> these "big" fields sortable, they have to be stored as a
>>> SortedDocValuesField. Or is that wrong? Can one actually sort the
>>> search result by a "big" field without indexing it as a
>>> SortedDocValuesField? Suggestions?
>>>
>>> On Wednesday, July 6, 2016, Erick Erickson wrote:
>>>
>>>> bq: In this case, we have to index a particular data structure which
>>>> has a bunch of fields and each of them is promised to be searchable
>>>> and search-sortable to the user
>>>>
>>>> If I'm reading this right, you have some structure, and you say
>>>> "each of them is promised to be searchable and search-sortable". It
>>>> _sounds_ like what you want to do is break these fields out into
>>>> separate fields, each of which is searchable and sortable
>>>> independently. But from what you've described, putting the entire
>>>> thing into a single DV field isn't useful.
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>> On Wed, Jul 6, 2016 at 3:10 PM, Sheng wrote:
>>>>
>>>>> To be clear, the "field" is indeed tokenized, and is accompanied by
>>>>> a SortedDocValuesField so that it is sortable too. Am I making the
>>>>> wrong assumption here?
>>>>>
>>>>> On Wednesday, July 6, 2016, Sheng wrote:
>>>>>
>>>>>> Hi Erick,
>>>>>>
>>>>>> I am refactoring a legacy system. One of the most annoying things
>>>>>> is that I have to keep the old feature even though it makes little
>>>>>> sense. In this case, we have to index a particular data structure
>>>>>> which has a bunch of fields, and each of them is promised to be
>>>>>> searchable and search-sortable to the user. It turns out one field
>>>>>> is notoriously large. I think the old implementation uses some
>>>>>> quite clumsy way to make it happen. But since we decided to
>>>>>> refactor the system with all the goodies from Lucene, we want to
>>>>>> do the sorting right, and here we are at this issue... :-(
>>>>>>
>>>>>> On Wednesday, July 6, 2016, Erick Erickson
>>>>>> <erickerick...@gmail.com> wrote:
>>>>>>
>>>>>>> Is this an "XY" problem? Meaning, why do you need DV fields
>>>>>>> larger than 32K?
>>>>>>>
>>>>>>> You can't search it as text as it's not tokenized. Faceting and
>>>>>>> sorting by a 32K field doesn't seem very useful. You may have a
>>>>>>> perfectly valid reason, but it's not obvious what use-case you're
>>>>>>> serving from this thread so far.
>>>>>>>
>>>>>>> Nobody has yet put forth a compelling use-case for such large
>>>>>>> fields; perhaps this would be one.
>>>>>>>
>>>>>>> Best,
>>>>>>> Erick
>>>>>>>
>>>>>>> On Wed, Jul 6, 2016 at 2:24 PM, Sheng wrote:
>>>>>>>
>>>>>>>> Mike - thanks for the prompt response. Is there a way to bypass
>>>>>>>> this constraint for SortedDocValuesField? Or do we have to live
>>>>>>>> with it, meaning no fix even in a future release?
>>>>>>>>
>>>>>>>> On Wednesday, July 6, 2016, Michael McCandless <
>>>>>>>> luc...@mikemccandless.com> wrote:
>>>>>>>>
>>>>>>>>> I believe only binary DVs can be larger than 32K bytes.
>>>>>>>>>
>>>>>>>>> Mike McCandless
>>>>>>>>>
>>>>>>>>> http://blog.mikemccandless.com
>>>>>>>>>
>>>>>>>>> On Wed, Jul 6, 2016 at 10:31 AM, Sheng wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I am getting an IAE indicating one of the SortedDocValuesField
>>>>>>>>>> is too large,
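A minimal sketch of Erick's "chop it off" suggestion (field names are illustrative; the hard cap for non-binary doc values is 32766 bytes):

    import java.nio.charset.StandardCharsets;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.SortedDocValuesField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.util.BytesRef;

    public class BigFieldIndexing {
      // Stay well under the 32766-byte doc-values limit; tune to taste.
      static final int MAX_SORT_BYTES = 1024;

      static void addBigField(Document doc, String name, String value) {
        // Full value stays tokenized and searchable.
        doc.add(new TextField(name, value, Field.Store.NO));
        // Truncated copy makes the field sortable without tripping the limit.
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);
        int len = Math.min(utf8.length, MAX_SORT_BYTES);
        doc.add(new SortedDocValuesField(name + "_sort", new BytesRef(utf8, 0, len)));
      }
    }

Sorting then targets the "_sort" field; as Erick notes, two docs identical through the truncated prefix may tie, which is rarely noticeable in practice.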
Re: lucene index reader performance
Any suggestions pls?

On Mon, Jul 4, 2016 at 3:37 PM, Tarun Kumar wrote:

> Hey Michael,
>
> docIds from multiple indices (from multiple machines) need to be
> aggregated and sorted, and the first few thousand need to be queried.
> These few thousand docs can be distributed among multiple machines.
> Each machine will search the docs which are in its own indices. So,
> pushing the sorting to the server side won't suffice for the use case.
> Is there an alternative to get documents for given docIds faster?
>
> Thanks
> Tarun
>
> On Mon, Jul 4, 2016 at 3:17 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> Why not ask Lucene to do the sort on your time field, instead of
>> pulling millions of docids to the client and having it sort? You could
>> even do index-time sorting by the time field if you want, which makes
>> early termination possible (faster sorted searches).
>>
>> But if, even when Lucene does the sort, you still need to load
>> millions of documents per search request, you are in trouble: you need
>> to re-formulate your use case somehow to take advantage of what Lucene
>> is good for (getting top results for a search).
>>
>> Maybe you can use faceting to do whatever aggregation you are
>> currently doing after retrieving those millions of documents.
>>
>> Maybe you could make a custom collector, and use doc values, to do
>> your own custom aggregation.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Mon, Jul 4, 2016 at 1:39 AM, Tarun Kumar wrote:
>>
>>> Thanks for the reply Michael! In my application, I need to get
>>> millions of documents per search.
>>>
>>> The use case is the following: return documents in increasing order
>>> of the field "time". The client (caller) can't hold more than a few
>>> thousand docs at a time, so it gets all docIds and the corresponding
>>> time field for each doc, sorts them on time, and gets n docs at a
>>> time. To support this use case, I am:
>>>
>>> - getting all docIds first,
>>> - sorting docIds on the time field,
>>> - querying n docIds at a time from the client, which makes an
>>>   indexReader.document(docId) call for all n docs at the server,
>>>   combines the docs and returns them.
>>>
>>> indexReader.document(docId) is creating bottlenecks. What
>>> alternatives do you suggest?
>>>
>>> On Wed, Jun 29, 2016 at 4:00 AM, Michael McCandless <
>>> luc...@mikemccandless.com> wrote:
>>>
>>>> Are you maybe trying to load too many documents for each search
>>>> request?
>>>>
>>>> The IR.document API is designed to be used to load just a few hits,
>>>> like a page worth or ~10 documents, per search.
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>> On Tue, Jun 28, 2016 at 7:05 AM, Tarun Kumar wrote:
>>>>
>>>>> I am running Lucene 4.6.1. I am trying to get documents
>>>>> corresponding to docIds. All threads get stuck (don't get stuck
>>>>> exactly, but spend a LOT of time) at:
>>>>>
>>>>>   java.lang.Thread.State: RUNNABLE
>>>>>     at sun.nio.ch.FileDispatcherImpl.pread0(Native Method)
>>>>>     at sun.nio.ch.FileDispatcherImpl.pread(FileDispatcherImpl.java:52)
>>>>>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:220)
>>>>>     at sun.nio.ch.IOUtil.read(IOUtil.java:197)
>>>>>     at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:731)
>>>>>     at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:716)
>>>>>     at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:169)
>>>>>     at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:271)
>>>>>     at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:51)
>>>>>     at org.apache.lucene.store.DataInput.readVInt(DataInput.java:108)
>>>>>     at org.apache.lucene.store.BufferedIndexInput.readVInt(BufferedIndexInput.java:218)
>>>>>     at org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.visitDocument(CompressingStoredFieldsReader.java:232)
>>>>>     at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:277)
>>>>>     at org.apache.lucene.index.BaseCompositeReader.document(BaseCompositeReader.java:110)
>>>>>     at org.apache.lucene.index.IndexReader.document(IndexReader.java:440)
>>>>>
>>>>> There is no disk throttling. What can result in this?
>>>>>
>>>>> Thanks
>>>>> Tarun
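Sketched below is the server-side sort Mike suggests, paged with searchAfter so the client never holds millions of docIds. The "time" field name comes from the thread; the searcher and query variables are assumed, and "time" must be indexed as a numeric field:

    // Assumes an IndexSearcher "searcher" over the index and a Query "query".
    Sort byTime = new Sort(new SortField("time", SortField.Type.LONG));
    TopDocs page = searcher.search(query, 1000, byTime);
    // ... send page.scoreDocs to the client, then page deeper server-side
    // instead of re-sorting everything on the client:
    ScoreDoc last = page.scoreDocs[page.scoreDocs.length - 1];
    TopDocs next = searcher.searchAfter(last, query, 1000, byTime);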
Re: lucene index reader performance
Somehow you need to get the sorting server-side ... that's really the only
way to do your use case efficiently.

Why can't you sort each request to your N shards, and then do a merge sort
on the client side, to get the top hits?

Mike McCandless

http://blog.mikemccandless.com

On Thu, Jul 7, 2016 at 5:48 AM, Tarun Kumar wrote:

> Any suggestions pls?
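Lucene ships this merge sort as TopDocs.merge: each shard returns its own sorted top N, and the client merges. A hedged sketch, where searchShard is a hypothetical remote call:

    Sort byTime = new Sort(new SortField("time", SortField.Type.LONG));
    TopDocs[] shardHits = new TopDocs[nShards];
    for (int i = 0; i < nShards; i++) {
      // hypothetical remote call: each shard runs the same query, sorted by time
      shardHits[i] = searchShard(i, query, 1000, byTime);
    }
    // Client-side merge sort of the per-shard tops:
    TopDocs top = TopDocs.merge(byTime, 1000, shardHits);
    // Each merged ScoreDoc's shardIndex says which shard holds that document,
    // so only the globally top 1000 ever need to be fetched.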
Re: IndexWriter and IndexReader in a shared environment
The API is pretty simple. Create an IndexWriter and leave it open forever,
using it to index/delete documents, and periodically call IW.commit when
you need durability.

Create a SearcherManager, passing it the IndexWriter, and use it per-search
to acquire/release the searcher. Periodically (ideally from a separate
thread) call SM.maybeRefresh so the searcher sees the latest indexing
changes.

Mike McCandless

http://blog.mikemccandless.com

On Tue, Jul 5, 2016 at 9:36 AM, Desteny Child wrote:

> Hi,
>
> In my Spring Boot application I have implemented 2 API endpoints - one
> for Lucene (I use 5.2.1) document indexing and another one for
> searching.
>
> Right now I open an IndexWriter and IndexReader on every request. With a
> big index this works pretty slowly.
>
> I know that there is a possibility to use a single IndexWriter and
> IndexReader in a shared environment. There is SearcherManager or
> something like this for this purpose.
>
> I can't find a good example of this for Lucene 5. Could you please share
> such an example?
>
> Thanks in advance,
> Alex
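A minimal sketch of that pattern (Lucene 5.x API; directory, analyzer, and query setup omitted):

    IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
    SearcherManager manager = new SearcherManager(writer, true, new SearcherFactory());

    // Per search request:
    IndexSearcher searcher = manager.acquire();
    try {
      TopDocs hits = searcher.search(query, 10);
      // ... render hits ...
    } finally {
      manager.release(searcher);  // never touch the searcher after release
    }

    // From a background thread, e.g. once per second:
    manager.maybeRefresh();

    // Whenever durability is needed:
    writer.commit();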
Re: Lucene cluster with NFS or synchronization tool such as rsync
Alas, there are no more docs than the classes themselves, in the
lucene/replicator module, under the oal.replicator.nrt package.

Essentially, you create a PrimaryNode (equivalent of IndexWriter) for
indexing documents, in a JVM on machine 1, and a ReplicaNode in a JVM on
machine 2, but you must subclass these classes to handle sending files
across the wire. The test cases give simplistic examples
(thread-per-socket-connection) of how to do this.

Mike McCandless

http://blog.mikemccandless.com

On Mon, Jul 4, 2016 at 8:10 AM, Desteny Child wrote:

> Hi Mike,
>
> Thank you very much for your response.
>
> I would be really grateful if you could point me to information (maybe
> with examples) about the new near-real-time replication.
>
> Thanks,
> Alex
>
> 2016-07-04 12:57 GMT+03:00 Michael McCandless:
>
>> NFS is dangerous if different nodes may take turns writing to the
>> shared index.
>>
>> Locking sometimes doesn't work correctly, client-side metadata caching
>> (e.g. the directory entry) can cause problems, and NFS doesn't support
>> the "delete on final close" semantics that Lucene relies on.
>>
>> rsync-like behavior can work with IndexWriter if you use
>> SnapshotDeletionPolicy to hold a point-in-time view of the index open
>> for copying ... this is also how to take a live backup of a
>> still-writing index, and it's how Lucene's replication module works.
>>
>> You could also try the new near-real-time replication, which copies
>> just the newly written segment files without requiring a full commit
>> (fsync) on the source index.
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>> On Sun, Jul 3, 2016 at 2:09 PM, Desteny Child wrote:
>>
>>> I need to organize a cluster for my stateless application based on
>>> Lucene 5.2.1. Right now I'm looking for a solution to share a Lucene
>>> index via NFS or rsync between different Lucene nodes.
>>>
>>> Is it a good idea to use NFS for this purpose, and if so, will it be
>>> possible to read/write from different nodes to the same shared index?
>>>
>>> Also I read that the rsync tool can be used for this purpose (to
>>> synchronize index files across all nodes), but I can't find any
>>> success story for rsync + Lucene. Right now I have a lot of
>>> questions, one of them: is it safe to use rsync at any time,
>>> especially when an IndexWriter is in progress (not closed) and
>>> actively indexing documents?
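For the rsync route mentioned earlier in the thread, a hedged sketch of the SnapshotDeletionPolicy approach (5.x API; directory and analyzer setup assumed):

    SnapshotDeletionPolicy snapshotter =
        new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());
    IndexWriterConfig iwc = new IndexWriterConfig(analyzer)
        .setIndexDeletionPolicy(snapshotter);
    IndexWriter writer = new IndexWriter(dir, iwc);

    IndexCommit snapshot = snapshotter.snapshot();  // freeze the current commit
    try {
      for (String file : snapshot.getFileNames()) {
        // rsync/copy each file; while the snapshot is held, these files
        // will not be deleted or modified by the still-writing IndexWriter
      }
    } finally {
      snapshotter.release(snapshot);   // allow normal deletions again
      writer.deleteUnusedFiles();      // optionally reclaim them right away
    }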
Re: Hierarchical Facets need duplicated counts
Any hint on how to calculate these values without requesting the whole
facet hierarchy and counting them? Is there a specific point in the code
where I can check for this distinct count, and maybe modify the code?

Nicola

On Wed, 2016-07-06 at 13:42 +0100, Nicola Buso wrote:

> Hello everyone,
>
> we are using hierarchical facets (from
> org.apache.lucene.facet.taxonomy); in our case one entry can have
> several values referencing more leaves in the hierarchical facet.
>
> At search time we are noticing that if we search for exactly 1 entry we
> have count = 1 in the hierarchical facet root, and navigating the
> hierarchical tree there is more than 1 leaf with count = 1. We presume
> this is because a distinct count is calculated in the collector.
>
> Is it possible to also have the duplicated count? What we would like to
> achieve is to understand how many leaves the search is reaching, and
> maybe have this "duplicated count" summed up in the parent nodes.
>
> Do you have any hints on how to achieve it?
>
> Regards,
>
> Nicola
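No built-in switch does this, as far as I know, but one hedged sketch: count raw leaf-ordinal occurrences from the FacetsCollector's matching docs yourself, then sum them up through the taxonomy's parents() array. Class names are from the 5.x/6.x facet module; treat this as an untested starting point, not a drop-in answer:

    FacetsCollector fc = new FacetsCollector();
    searcher.search(query, fc);

    int[] counts = new int[taxoReader.getSize()];
    int[] parents = taxoReader.getParallelTaxonomyArrays().parents();
    int[] children = taxoReader.getParallelTaxonomyArrays().children();
    OrdinalsReader ordsReader =
        new DocValuesOrdinalsReader(FacetsConfig.DEFAULT_INDEX_FIELD_NAME);

    IntsRef ords = new IntsRef();
    for (FacetsCollector.MatchingDocs hits : fc.getMatchingDocs()) {
      OrdinalsReader.OrdinalsSegmentReader segOrds = ordsReader.getReader(hits.context);
      DocIdSetIterator it = hits.bits.iterator();
      for (int doc = it.nextDoc(); doc != DocIdSetIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
        segOrds.get(doc, ords);
        for (int i = 0; i < ords.length; i++) {
          int ord = ords.ints[ords.offset + i];
          // Count leaves only: hierarchical dims also store parent ordinals
          // per doc, and counting those would double-count after the rollup.
          if (children[ord] == TaxonomyReader.INVALID_ORDINAL) {
            counts[ord]++;
          }
        }
      }
    }
    // Sum each subtree into its parent. Child ordinals are always larger
    // than their parent's, so one backward pass suffices.
    for (int ord = counts.length - 1; ord > 0; ord--) {
      counts[parents[ord]] += counts[ord];
    }
    // counts[0] (the root) now holds the total leaf hits, duplicates included.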
Re: Document retrieval, performance, and DocValues
You should do the MultiDocValues.getBinaryValues(indexReader, "pos_id")
call once up front, not per hit.

You could operate per-segment instead by making a custom Collector.

Are you sorting by your pos_id field? If so, the value is already
available in each FieldDoc and you don't need to separately look it up.

How many hits are you collecting for each search?

Mike McCandless

http://blog.mikemccandless.com

On Tue, Jul 5, 2016 at 11:20 AM, Randy Tidd wrote:

> My Lucene index has about 3 million documents and result sets can be
> large, often 1000's and sometimes as many as 100,000. I am expecting
> the index size to grow 5-10x as the system matures.
>
> I index 5 fields and, per recommendations I've read, store minimal data
> in Lucene, currently just a 12-byte numeric identifier (a Mongo
> ObjectId) per document. I store the rest of the data separately and use
> the id I get from Lucene to look it up there.
>
> In my load testing, a search like this:
>
>   TopDocs docs = indexSearcher.search(query, maxResults, sort);
>
> takes about 50-75 msec, which is good. Retrieving documents with a loop
> like this:
>
>   for (int i = 0; i < docs.scoreDocs.length; i++) {
>     ScoreDoc sdoc = docs.scoreDocs[i];
>     String id = indexReader.document(sdoc.doc,
>         Collections.singleton("pos_id")).getField("pos_id").stringValue();
>     // ... retrieve data with id ...
>   }
>
> takes around 350-400 msec, sometimes as long as 800 msec. I'm looking
> for ways to decrease this time if possible.
>
> I've read up on DocValues and am not sure if that is intended to help
> with this. I understand that it is a separate store/mapping of Lucene's
> internal document ids to my "pos_id", which sounds like it may help,
> but I am not sure. I tried getting the ids from my reader like this:
>
>   String id = MultiDocValues.getBinaryValues(indexReader,
>       "pos_id").get(sdoc.doc).utf8ToString();
>
> But performance was no better. However, I saw in the docs for
> MultiDocValues that I may get better performance using the "atomic
> leaves and then operate per-LeafReader". I searched around and could
> not find documentation on how to do that. I see some examples using
> leaf readers in the Solr projects, but they were just examples and I
> don't think they were written specifically to optimize performance. It
> would be great to find an explanation of why there are multiple leaf
> readers per reader and how to use them.
>
> So my questions are: 1) are DocValues a possibility for improving my
> document retrieval performance, and 2) if so, where can I find an
> example of this written for best performance?
>
> Thanks in advance!
>
> Randy
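Both suggestions as hedged sketches, using the thread's "pos_id" field and the pre-7.x BinaryDocValues API that matches the code above:

    // 1) Hoist the lookup out of the loop; the multi-reader view does a
    // binary search over segments on every get(), so build it once per search.
    BinaryDocValues ids = MultiDocValues.getBinaryValues(indexReader, "pos_id");
    for (ScoreDoc sdoc : docs.scoreDocs) {
      String id = ids.get(sdoc.doc).utf8ToString();
      // ... retrieve data with id ...
    }

    // 2) Per-segment variant: resolve each hit to its leaf and use the
    // leaf-local doc values (cache leafIds per segment in real code).
    List<LeafReaderContext> leaves = indexReader.leaves();
    for (ScoreDoc sdoc : docs.scoreDocs) {
      LeafReaderContext leaf = leaves.get(ReaderUtil.subIndex(sdoc.doc, leaves));
      BinaryDocValues leafIds = leaf.reader().getBinaryDocValues("pos_id");
      String id = leafIds.get(sdoc.doc - leaf.docBase).utf8ToString();
    }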
Query with modulo function in Lucene without Solr?
I would like to use a mod() function in a query to, for example, fetch
every 10th or 100th matching document, or to return documents that produce
a certain result from the mod() function on a numeric field.

I know this question has come up in the past, and I have seen answers that
suggest using Solr's function query parsing to implement it. But I am
using Lucene 6.1.0 without Solr and would like to implement this with just
lucene-core and lucene-queries.

Probably a quick answer, but I can't seem to find the class(es) in Lucene
that implement this - guessing FunctionQuery / ValueSource, which is now
in the lucene-queries module? If anyone could point me to a sample
implementation that would be very helpful.

Thanks,

Randy
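A hedged sketch using only lucene-queries (6.1-era function APIs): subclass DualFloatFunction to compute the modulo, then wrap it in a FunctionRangeQuery to keep docs where the remainder is a given value. The "seq" field name is an assumption, and it must be indexed as a numeric doc-values field:

    ValueSource field = new LongFieldSource("seq");  // numeric field to take mod of
    ValueSource divisor = new ConstValueSource(10f);
    ValueSource mod = new DualFloatFunction(field, divisor) {
      @Override
      protected String name() { return "mod"; }
      @Override
      protected float func(int doc, FunctionValues aVals, FunctionValues bVals) {
        return aVals.floatVal(doc) % bVals.floatVal(doc);
      }
    };
    // Match only docs where seq % 10 == 0, i.e. "every 10th" by field value:
    Query everyTenth = new FunctionRangeQuery(mod, 0f, 0f, true, true);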
Similarity Implementation
We are in the process of upgrading from 2.x to 6.x. In 2.x we implemented
our own similarity where all the functions return 1.0f. How can we
implement such a thing in 6.x? Is there an implementation already there
that we can use and get the same results?

--
Regards
-Siraj Haider
(212) 306-0154

This electronic mail message and any attachments may contain information
which is privileged, sensitive and/or otherwise exempt from disclosure
under applicable law. The information is intended only for the use of the
individual or entity named as the addressee above. If you are not the
intended recipient, you are hereby notified that any disclosure, copying,
distribution (electronic or otherwise) or forwarding of, or the taking of
any action in reliance on, the contents of this transmission is strictly
prohibited. If you have received this electronic transmission in error,
please notify us by telephone, facsimile, or e-mail as noted above to
arrange for the return of any electronic mail or attachments. Thank You.
Port of Custom value source from v4.10.3 to v6.1.0
Hi all,

I wrote some time ago a ValueSourceParser + ValueSource to allow using
results produced by an external system as a facet query:

- in solrconfig.xml : added my parser :

[configuration snippet lost in the archive; full message:
http://lucene.472066.n3.nabble.com/Port-of-Custom-value-source-from-v4-10-3-to-v6-1-0-tp4286236.html]
Re: Similarity Implementation
Hi Siraj,

I think
https://lucene.apache.org/core/6_1_0/core/index.html?org/apache/lucene/search/ConstantScoreQuery.html
should be good enough.

On Fri, Jul 8, 2016 at 12:27 AM Siraj Haider wrote:

> We are in the process of upgrading from 2.x to 6.x. In 2.x we
> implemented our own similarity where all the functions return 1.0f. How
> can we implement such a thing in 6.x? Is there an implementation
> already there that we can use and get the same results?
>
> --
> Regards
> -Siraj Haider
> (212) 306-0154
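For example, a minimal sketch (the field and term are illustrative); every hit then scores 1.0f regardless of the Similarity in effect:

    Query q = new TermQuery(new Term("body", "lucene"));
    TopDocs hits = searcher.search(new ConstantScoreQuery(q), 10);  // all scores are 1.0f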