Re: Storing payloads without term-position and frequency
Hello Grant, I am currently storing the payload on the first term instance only, because I index each token of an article just once. What I want to achieve is an index for versioned document collections like Wikipedia (see this paper: http://www.cis.poly.edu/suel/papers/archive.pdf). In detail, on the first level (Lucene) I create one document per Wikipedia article containing all distinct terms of its versions. On the second level (payloads) I store the frequency information corresponding to each article version and its terms. If I search now, I can find an article by its term, and through the term and its payload I receive information about the individual versions and how often a token occurred (in my case, with one term, the payload position is always 1!). So I look on the first level and pick only the information from the second level that I need. This way I avoid storing information several times, because most Wikipedia versions are very similar (in term content). This is working so far; I just want to reduce my index size, but I don't know how much I can save by disabling term freqs/positions. I hope I could explain the problem a little bit. If not, just tell me and I will try to explain it again. :) Best regards Alex PS: I am currently looking for a bedroom in New York, Brooklyn (Park Slope or near NYU Poly). Maybe somebody rents out a room from 15 Feb until 15 April. :) On Thursday, 03.02.2011, 12:38 -0500, Grant Ingersoll wrote: > Payloads only make sense in terms of specific positions in the index, so I > don't think there is a way to hack Lucene for it. You could, I suppose, just > store the payload for the first instance of the term. > > Also, what's the use case you are trying to solve here? Why store term > frequency as a payload when Lucene already does it (and it probably does it > more efficiently)? > > -Grant > > On Feb 2, 2011, at 2:35 PM, Alex vB wrote: > > Hello everybody, > > I am currently using Lucene 3.0.2 with payloads. I store extra information > > in the payloads about the term, like frequencies, and therefore I don't need > > the frequencies and term positions normally stored by Lucene. I would like to > > set f.setOmitTermFreqAndPositions(true) but then I am not able to retrieve > > payloads. Would it be hard to "hack" Lucene for my requirements? Also, I only > > store one payload per term, if that makes things easier. > > > > Best regards > > Alex > > -- > Grant Ingersoll > http://www.lucidimagination.com
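A sketch of how the per-version frequency vector could be packed into a single payload (illustrative only; the class name and the variable-length encoding are my own, not what the index above actually uses):

import java.io.ByteArrayOutputStream;
import java.io.IOException;

class FreqVectorCodec {
    // e.g. freqs = {1, 3, 0, 2}: one count per article version, in version order
    static byte[] encode(int[] freqs) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int f : freqs) {
            // variable-length encoding: 7 bits per byte, high bit means "more bytes follow"
            while ((f & ~0x7F) != 0) {
                out.write((f & 0x7F) | 0x80);
                f >>>= 7;
            }
            out.write(f);
        }
        return out.toByteArray();
    }
}

Since most counts are small, this keeps the payload near one byte per version; the resulting byte[] is what would be handed to new Payload(...) on the first position of the term.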
which parser to use?
Hi all, I need to create an analyzer and have to choose which parser generator to use. Can anyone recommend one of these: JFlex, JavaCC, ANTLR? Thanks.
Query and language conversion
Hi, I am new to Lucene, so excuse me if this is a trivial question. I have data that I index in a given language (English). My users will come from different countries and my search screen will be internationalized, so my users will probably query things in their own language. Is it possible to look up items that were indexed in a different language? To make things a bit clearer: my "Business" object has a "type" attribute, and in Lucene a "type" field is created. The Business object for "Doctor Smuck" will be indexed with the "type" field set to "medical doctor" or something similar. A German user will query in German, trying to find a doctor using "Arzt" or maybe "Mediziner" as a query. Is Lucene able to match the query to the value that was indexed in another language? Is there an analyzer for that? By the way, I can provide the probable input language, based on the client's search page language, as a parameter if that helps (it probably will). Many thanks for your thoughts!
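Lucene itself does no translation at query time, so the simplest workaround is a query-side dictionary that maps the localized term to the index language before the query is built. A toy sketch (the dictionary, field name and lowercasing are assumptions; a real solution would need proper cross-language retrieval, see the follow-up reply):

Map<String, String> deToEn = new HashMap<String, String>();
deToEn.put("arzt", "medical doctor");
deToEn.put("mediziner", "medical doctor");

String userTerm = "Arzt";
String translated = deToEn.get(userTerm.toLowerCase());
String lookup = (translated != null) ? translated : userTerm.toLowerCase();
PhraseQuery query = new PhraseQuery();
for (String word : lookup.split(" ")) {
    query.add(new Term("type", word));   // terms must match the analyzed (lowercased) form
}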
Re: Query and language conversion
Many thanks Steve for all that information. I understand by your answer that cross-lingual search doesn't come "out-of -the-box" in Lucene. Cheers. Alex On Tue, Sep 1, 2009 at 6:46 PM, Steven A Rowe wrote: > Hi Alex, > > What you want to do is commonly referred to as "Cross Language Information > Retrieval". Doug Oard at the University of Maryland has a page of CLIR > resources here: > > http://terpconnect.umd.edu/~dlrg/clir/<http://terpconnect.umd.edu/%7Edlrg/clir/> > > Grant Ingersoll responded to a similar question a couple of years ago on > this list: > > < > http://search.lucidimagination.com/search/document/e1398067af353a49/cross_lingual_ir#e1398067af353a49 > > > > Here's another recent thread with lots of good info, from the solr-user > mailing list, on the same topic: > > < > http://search.lucidimagination.com/search/document/f7c17dc516c89bf6/preparing_the_ground_for_a_real_multilang_index#797001daa3f73e17 > > > > Here's a paper written by a group that put together a Greek-English > cross-language retrieval system using Lucene: > > http://www.springerlink.com/content/n172420t1346q683/ > > And here's another paper written by a group that made a Hindi and Telugu to > English cross-language retrieval system using Lucene, from the CLEF 2006 > conference proceedings: > > http://www.iiit.ac.in/techreports/2008_76.pdf > > Steve > > > -Original Message- > > From: Alex [mailto:azli...@gmail.com] > > Sent: Tuesday, September 01, 2009 10:30 AM > > To: java-user@lucene.apache.org > > Subject: Query and language conversion > > > > Hi, > > > > I am new to Lucene so excuse me if this is a trivial question .. > > > > > > I have data that I Index in a given language (English). My users will > > come from different countries and my search screen will be > > internationalized. My users will then probably query thing in their > > own language. Is it possible too lookup for Items that were indexed > > in a different language. > > > > To make thing a bit more clear. > > > > My "Business" object has a "type" attribute. In lucene the "type" field > > is created. The Business object for "Doctor Smuck" will be indexed with > > the "type" field as "medical doctor" or anything similar. My German > > users will query using german languange. He tries to find a Doctor > > using "Arzt" or maybe "Mediziner" as a query. Is Lucene able to match > > the query to the value that was indexed in another language ? > > Is there an analyser for that ? > > > > By the way : I can provide the probable input language, based on the > > client's search page language, as a parameter if that helps (it > > probably will) . > > > > Many thanks for your thoughts ! > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
Filtering query results based on relevance/accuracy
Hi, I'm a total newbie with Lucene and trying to understand how to achieve my (complicated) goals, so what I'm doing is still totally experimental for me but is probably extremely trivial for the experts on this list :) I use Lucene and Hibernate Search to index locations by their name, type, etc. The LocationType is an object that has its "name" field indexed both tokenized and untokenized. The following LocationType names are indexed: "Restaurant", "Mexican Restaurant", "Chinese Restaurant", "Greek Restaurant", etc. Considering the query "Mexican Restaurant", I systematically get all the entries as a result, most certainly because the "Restaurant" keyword is present in all of them. I'm trying to get a finer-grained result set. Obviously for "Mexican Restaurant" I want the "Mexican Restaurant" entry as a result but NOT "Chinese Restaurant" nor "Greek Restaurant", as they are irrelevant. But maybe "Restaurant" itself should be returned with a lower weight/score, or maybe it shouldn't; I'm not sure about this one. 1) How can I do that? Here is the code I use for querying:

String[] typeFields = {"name", "tokenized_name"};
Map<String, Float> boostPerField = new HashMap<String, Float>(2);
boostPerField.put("name", (float) 4);
boostPerField.put("tokenized_name", (float) 2);
QueryParser parser = new MultiFieldQueryParser(typeFields, new StandardAnalyzer(), boostPerField);
org.apache.lucene.search.Query luceneQuery;
try {
    luceneQuery = parser.parse(queryString);
} catch (ParseException e) {
    throw new RuntimeException("Unable to parse query: " + queryString, e);
}

I guess there is a way to filter out results that have a score below a given threshold, or a way to filter out results based on a score gap, or something similar. But I have no idea how to do this... What is the best way to achieve what I want? Thank you for your help! Cheers, Alex
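There is no built-in absolute score cut-off (Lucene scores are not normalized across queries), but a common trick is a relative cut-off against the best hit. A hedged sketch, reusing searcher and luceneQuery from above (the 50-hit limit and the 0.5 ratio are arbitrary):

TopDocs topDocs = searcher.search(luceneQuery, null, 50);
if (topDocs.scoreDocs.length > 0) {
    float best = topDocs.scoreDocs[0].score;
    for (ScoreDoc sd : topDocs.scoreDocs) {
        if (sd.score < 0.5f * best) {
            break;                      // results are sorted by score, so stop here
        }
        Document doc = searcher.doc(sd.doc);
        // doc is "relevant enough": add it to the result list
    }
}

A score-gap heuristic (stop when the score drops sharply between consecutive hits) works the same way, since the hits come back sorted by score.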
Re: Filtering query results based on relevance/accuracy
Hi Otis, and thank you for helping me out. Sorry for the late reply. Although a PhraseQuery or TermQuery would be perfectly suited in some cases, this will not work in my case. Basically my application's search feature is a single field "à la Google" and the user can be looking for a lot of different things... For example the user can search for "Chinese Restaurant in New York USA", or maybe just "Chinese Restaurant" (which should be understood as "nearby Chinese Restaurant"), or maybe "Chinese Restaurant at 12 Main St. New York", or "1223 Main Street New York". So basically I will get many different query structures depending on the user's intent/meaning/logic, and I think I need to figure out a good analysis algorithm to resolve locations as accurately as possible. As a first step in my algorithm I am trying to isolate/identify a potential LocationType from the query string. So my idea was to separate the words and use them to query my index for the LocationTypes that best match what's included in the query. I could then get the best matching LocationTypes based on how they scored against the Lucene query, and then move on to the next step of my algorithm, which would try to find another potential feature of the query such as the presence of a country name or city name etc. That's why a phrase query would not be appropriate here, as this would mean that the entire query string would be used and would most of the time return no relevant LocationTypes. Once I have analysed the query string and isolated the various features (LocationType, city name, country name, address), I could maybe create a BooleanQuery where I would use all that was fetched earlier. So basically I'm not sure which feature of Lucene I should use here in the first step of the algorithm to only find the most relevant LocationTypes and filter out the ones that are not relevant enough. Any help and any thoughts on my approach greatly appreciated. Thanks in advance. Cheers, Alex.
Re: Filtering query results based on relevance/accuracy
anybody can help ? On Sat, Sep 26, 2009 at 11:22 PM, Alex wrote: > Hi Otis and thank your for helping me out. > > Sorry for the late reply. > > > > Although a Phrase query or TermQuery would be perfectly suited in some > cases, this will not work in my case. > > Basically my application's search feature is a single field "à la Google" > and the user can be looking for a lot of different things... > > For example the user can search for > "Chinese Restaurant in New York USA" > or maybe just > "Chinese Restaurant" (which should be understood as "nearby Chinese > Restaurant" > or maybe > "Chinese Retaurant at 12 Main St. New York" > or > "1223 Main Street New York" > > > > So basically I will get many different query structures depending on the > user's intent/meaning/logic and I think I need to figure out a good analysis > algorithm to get Locations as acurately as possible. > > As a first step in my algo I am trying to isolate/identify a potential > LocationType from the query string. > So my idea was to separate each words and use them to query my Index for > LocationTypes that would best match what's included in the query. > I could then get the best matching LocationTypes based on how it scored > against the luicene query and then move on to the next step of my algo which > would try to find another potential feature of the query such as the > presence of a Country name or City name etc > > That's why a phrase query would not be appropriate here as this would mean > that the entire query string would be used and would most of the times > return no relevant LocationTypes. > > Once I have analysed the query string and isolated the various features > (LocationType, City Name, Country Name , Address ) I could maybe create > a Boolean Query where I would use all that was fetched earlier > > > So basically I'm not sure what feature of Lucene I should use here in the > first step of the algo to only find the most relevant LocationTypes and > filter out the ones that are not relevant enough. > > > Any help and any thoughts on my approach greatly appreciated. > > > Thanks in advance. > > Cheers, > > Alex. > > >
Document category identification in query
Hi, I am trying to expand user queries to figure out potential document categories implied in the query. I wanted to know the best way to figure out the document category that is most relevant to the query. Let me explain further: I have created categories that are applied to the documents I want to index. Some example categories are: Hotel, Restaurant, Fast Food, Chinese Restaurant, Church, Bank, Gas Station. I am also trying to create category aliases, so that "Chinese Food" can also be named "Chinese Restaurant" with the same unique category ID. The documents I index have 1 primary category and 1...N secondary categories. For example: McDonalds will be categorized under Fast Food as its primary category but also under Restaurant as a secondary category. The London Pub at the corner of my street will be categorized as Pub as its primary category and also as Bar, Food and Beverages, Restaurant, and Fast Food (since they also have takeaway burgers ;). This all gives me a set of categories that are quite clearly identified, as well as a set of category aliases, even though I'm aware that I can't figure out all the possible aliases of all my categories. At least I have the most obvious ones. Now with all this, I wanted to know, with the help of Lucene (or any other efficient method), how I could figure out the most relevant category (if any) behind a user query. For example: if my user looks for "Chang's chinese restaurant" the obvious category should be "Chinese Restaurant", but if my user looks for "chines restauran" (misspelled) the category should also be "Chinese Restaurant" (as Google is capable of doing), OR "chinese bistro" should probably also return the category "Chinese Restaurant" since bistro is a concept very similar to "Restaurant"... Once the category is identified I can then query the index for the documents that match that category best. What is the proper way to identify the most relevant category in a user query based on the above? Should I consider another, better approach? Any help appreciated. Many thanks Alex.
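One hedged building block for the misspelling part: index the category names (and aliases) in their own small index and query it with a combination of exact and fuzzy term matches. The field name, minimum similarity and searcher below are assumptions, and this only covers typos, not synonyms like "bistro":

BooleanQuery categoryQuery = new BooleanQuery();
for (String token : userQuery.toLowerCase().split("\\s+")) {
    categoryQuery.add(new TermQuery(new Term("categoryName", token)), BooleanClause.Occur.SHOULD);
    categoryQuery.add(new FuzzyQuery(new Term("categoryName", token), 0.7f), BooleanClause.Occur.SHOULD);
}
TopDocs candidates = categorySearcher.search(categoryQuery, null, 5);   // top candidate categories

Synonym-style aliases ("bistro" -> "restaurant") would be handled by the alias list itself or by a synonym filter at index time on the category index.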
Re: Document category identification in query
Can anybody help me, or maybe point me to relevant resources I could learn from? Thanks.
Re: Document category identification in query
Hi! Many thanks to both of you for your suggestions and answers! What Weiwei Wang suggests is part of the solution I am willing to implement. I will definitely use the suggest-as-you-type approach in the query form, as it allows for pre-emptive disambiguation and, I believe, will give very satisfying results. However, search users are wild beasts and I can't count on them to always use the given suggestions. All I can count on is very erratic, sparse and ambiguous queries :) So I need an almost foolproof solution. To answer your question "BTW, I do not understand why you need to know the category of user input": I am trying to understand the user intent behind the query to filter out results based on a given category of locations. If a user queries "Fast Food in Nanjing" I don't want to return all the documents that contain the words "Fast" and "Food" and "Nanjing". I use a custom algorithm to figure out the intended location first. Then, using the spatial contrib, I filter out the results based on the area that was identified earlier. Finally I sort the results according to distance from the location point/centroid found earlier. Identifying the category allows me three things: 1) Filter out irrelevant results: I don't want my result set to include a supermarket in Nanjing where the "food" is fresh and service is "fast" just because the query words were included in the description of the location. Since I am using custom, distance-based sorting of the results, I can't afford to have the supermarket be the top result just because it is closest to the location centroid identified earlier. The user intent was clearly "fast food" and not a supermarket! 2) Understand user intent to provide targeted advertising. 3) Understanding the category of location a user is looking for also lets me calculate the bounding box more accurately, i.e. the maximum distance at which a location can lie and still be relevant to the user. A user looking for pizza in New York expects results within a radius of at most 1 or 2 miles; if he is looking for a theme park he will probably be willing to go further away to find it. So identifying the category of the location the user is looking for lets me calculate the distance radius more accurately. Fei Liu, thanks a lot for the papers you pointed me to. I came across them earlier in my research and re-reading them gave me new insights. However, I believe the two-step approach you are recommending is not very viable under heavy load, as it requires two passes over the index. I do believe that identifying the dominant category(ies) of the result set, when no category could be clearly identified from the query alone, can be very valuable if sent back to the user as information and a category link! Now, what I think I will do to pre-emptively identify the location category(ies) implied in the query: 1 - use my own custom category set and index the category names using the synonym analyzer provided with Lucene, plus some sort of normalization such as stemming, maybe also using the Snowball analyzer. 2 - break the query into shingles (word-based n-grams), analyze each shingle using the analyzers from (1), then query Lucene with these analyzed shingles against the category index built earlier (a rough sketch of this step follows below). Hopefully the category with the highest Lucene score will be the one intended by the user. Later on, I also intend to use some sort of training-based approach using search queries that have been tagged with the relevant location categories. What do you guys think? Would this be a viable approach? Thanks for all! Cheers Alex
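A rough sketch of step 2 with a Lucene 2.9-style analysis chain (ShingleFilter lives in contrib/analyzers; the analyzer chain, the "categoryName" field, assumed to be indexed untokenized and stored per category entry, and the searcher are all placeholders):

TokenStream ts = new ShingleFilter(
        new LowerCaseFilter(new WhitespaceTokenizer(new StringReader(userQuery))), 3);
TermAttribute termAtt = ts.addAttribute(TermAttribute.class);
String bestCategory = null;
float bestScore = 0f;
while (ts.incrementToken()) {
    String shingle = termAtt.term();                        // e.g. "fast food"
    TopDocs hits = categorySearcher.search(
            new TermQuery(new Term("categoryName", shingle)), null, 1);
    if (hits.scoreDocs.length > 0 && hits.scoreDocs[0].score > bestScore) {
        bestScore = hits.scoreDocs[0].score;
        bestCategory = categorySearcher.doc(hits.scoreDocs[0].doc).get("categoryName");
    }
}
ts.close();

In a real version the shingles would of course go through the same synonym/stemming chain as the category index, not just lowercasing.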
slow FieldCacheImpl.createValue
Hi, I have a ValueSourceQuery that makes use of a stored field. The field contains roughly 27.27 million untokenized terms; the average length of each term is 8 digits. The first search always takes around 5 minutes, and it is due to the createValue function in FieldCacheImpl. The search is executed on a RAID5 array of 15k rpm disks. Any hints on making the FieldCache createValue faster? I have tried a bigger buffer size for BufferedIndexInput (8 KB or more), but the time it takes for createValue to execute is still in the realm of 4 to 5 minutes. Thanks
RE: slow FieldCacheImpl.createValue
Hi, thanks for the reply. Yes, after the first slow search, subsequent searches have good performance. I guess the issue is why exactly createValue takes so long, and whether it should take so long (4 to 5 minutes). Given roughly 27 million terms, each roughly 8 characters long plus a few other bytes for the TermInfo record, a modern disk can easily read over that portion of the index (the .frq portion) in a few seconds. Also, when I use tools like dstat, I see a bunch of 1 KB reads initiated while running createValue. > Date: Tue, 20 May 2008 11:02:38 +0530 > From: [EMAIL PROTECTED] > To: java-user@lucene.apache.org > Subject: Re: slow FieldCacheImpl.createValue > > Hey Alex, > I guess you haven't tried warming up the engine before putting it to use. > Though one of the simpler implementations, you could try warming up the > engine first by sending a few searches and then put it to use (put it into > the serving machine loop). You could also do a little bit of preprocessing > while initializing the daemon rather than waiting for the search to hit it. > I hope I understood the problem correctly here, else I would have to look into > it. > > -- > Anshum
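If the cost cannot be reduced, it can at least be moved to startup: FieldCache is keyed by reader and field, so populating it once right after opening the searcher means the first real query finds it hot. A sketch (the field name and the choice of getInts() vs. getStrings() depend on what the ValueSourceQuery actually reads):

IndexReader reader = searcher.getIndexReader();
long start = System.currentTimeMillis();
int[] values = FieldCache.DEFAULT.getInts(reader, "myNumericField");   // forces createValue now
System.out.println("FieldCache warmed in " + (System.currentTimeMillis() - start)
        + " ms for " + values.length + " docs");

Note the cache entry is lost whenever the IndexReader is reopened, so the warm-up has to be repeated after every reopen.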
lucene memory consumption
Hi, other than the in-memory terms (.tii) and the few kilobytes of open file buffers, what are the other sources of significant memory consumption when searching on a large index (> 100 GB)? The queries are just normal term queries.
RE: lucene memory consumption
Currently, searching on our index consumes around 2.5GB of ram. This is just a single term query, nothing that requires the in memory cache like in the FieldScoreQuery. Alex > Date: Thu, 29 May 2008 15:25:43 -0700 > From: [EMAIL PROTECTED] > To: java-user@lucene.apache.org > Subject: Re: lucene memory consumption > > Not that I can think about. But, if you have any cached field data, > norms array, that could be huge. > > Would be interested in knowing from others regarding this topic as well. > > Jian > > On 5/29/08, Alex wrote: >> >> Hi, >> other than the in memory terms (.tii), and the few kilobytes of opened file >> buffer, where are some other sources of significant memory consumption >> when searching on a large index ? (> 100GB). The queries are just normal >> term queries.
RE: lucene memory consumption
I believe we have around 346 million documents Alex > Date: Thu, 29 May 2008 18:39:31 -0400 > From: [EMAIL PROTECTED] > To: java-user@lucene.apache.org > Subject: Re: lucene memory consumption > > Alex wrote: >> Currently, searching on our index consumes around 2.5GB of ram. >> This is just a single term query, nothing that requires the in memory cache >> like in >> the FieldScoreQuery. >> >> Alex >> > That seems rather high. You have 10/15 million + docs?
RE: Is it possible to get only one Field from a Document?
If you have many terms across the fields, you might want to invoke IndexReader's setTermInfosIndexDivisor() method, which reduces the in-memory term infos used to look up terms, at the cost of a (slightly) slower search. > From: [EMAIL PROTECTED] > To: java-user@lucene.apache.org > Subject: Re: Is it possible to get only one Field from a Document? > Date: Wed, 11 Jun 2008 08:22:22 -0400 > > For the record, Hits.id(int i) returns the document number. Note, > though, that Hits is now deprecated, as pointed out by the link to > 1290, so going the TopDocs route is probably better anyway. > > -Grant > > On Jun 11, 2008, at 7:43 AM, Daan de Wit wrote: > > > This is possible, you need to provide a FieldSelector to > > IndexReader#document(docId, selector). This won't work with Hits > > though, because Hits does not expose the document number, so you > > need to roll your own solution using TopDocs or HitCollector; for > > information see the discussion in this issue: > > https://issues.apache.org/jira/browse/LUCENE-1290 > > > > Kind regards, > > Daan de Wit > > > > -----Original Message----- > > From: Marcelo Schneider [mailto:[EMAIL PROTECTED] > > Sent: Wednesday, June 11, 2008 13:29 > > To: java-user@lucene.apache.org > > Subject: Is it possible to get only one Field from a Document? > > > > I have an environment where we have indexed a DB with about 6 million entries > > with Lucene, and each row has 25 columns. 20 cols have integer codes > > used as filters (indexed/unstored), and the other 5 have (very) large > > texts (also indexed/unstored). Currently the search I'm doing is like this: > > > > Hits hits = searcher.search(query); > > for (int i = 0; i < this.hits.length(); i++) { > >    Document doc = this.hits.doc(i); > >    String s = doc.get("fieldWanted"); > >    // does everything with the result, etc > > } > > > > We are trying to reduce memory usage, however. Is it possible to return > > a Document object with just the Fields I really need? In the example, > > each Document has 25 fields, and I just need one... would this > > theoretically make any difference? > > > > -- > > Marcelo Frantz Schneider > > SIC - TCO - Tecnologia em Engenharia do Conhecimento > > DÍGITRO TECNOLOGIA > > E-mail: [EMAIL PROTECTED] > > Site: www.digitro.com > > -- > Grant Ingersoll > http://www.lucidimagination.com > > Lucene Helpful Hints: > http://wiki.apache.org/lucene-java/BasicsOfPerformance > http://wiki.apache.org/lucene-java/LuceneFAQ
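A sketch of the FieldSelector route from the quoted thread, combined with TopDocs instead of the deprecated Hits (the field name and hit count come from the example above; everything else is illustrative):

FieldSelector onlyWanted = new MapFieldSelector(new String[] {"fieldWanted"});
TopDocs topDocs = searcher.search(query, null, 100);
for (ScoreDoc sd : topDocs.scoreDocs) {
    Document doc = searcher.getIndexReader().document(sd.doc, onlyWanted);
    String s = doc.get("fieldWanted");   // only this stored field was loaded
    // does everything with the result, etc
}

setTermInfosIndexDivisor addresses a different cost (the in-memory term dictionary), so the two measures can be combined independently.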
RE: huge tii files
You can invoke IndexReader.setTermInfosIndexDivisor prior to any search to control what fraction of the .tii file is read into memory.
Add a document in a single pass?
Hi, I have a stream-based document parser that extracts content (as a character stream) as well as document metadata (as strings) from a file, in a single pass. From these data I want to create a Lucene document. The problem is that the metadata is not available until the complete document has been parsed, i.e. not until IndexWriter.addDocument has already consumed the content stream. Is there a way to influence the order in which the document fields are processed in IndexWriter.addDocument, or another way to build the index efficiently? Thanks in advance Alex
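A minimal sketch of the simplest workaround, if the content fits in memory: run the parser to completion first, buffering the text, and only then build the Document. "MyStreamParser" and its methods are invented names, and the Field flags assume a 2.9/3.x-style API:

MyStreamParser parser = new MyStreamParser(file);
StringBuilder content = new StringBuilder();
char[] buf = new char[8192];
int n;
while ((n = parser.getContentReader().read(buf)) != -1) {
    content.append(buf, 0, n);                   // drive the parser to the end
}
Map<String, String> meta = parser.getMetadata();  // only available now

Document doc = new Document();
doc.add(new Field("contents", content.toString(), Field.Store.NO, Field.Index.ANALYZED));
for (Map.Entry<String, String> e : meta.entrySet()) {
    doc.add(new Field(e.getKey(), e.getValue(), Field.Store.YES, Field.Index.NOT_ANALYZED));
}
writer.addDocument(doc);

This trades the streaming behaviour for a second copy of the text in memory; if that is too expensive, the alternative is to parse the file twice (metadata pass first, content pass during indexing).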
Detailed file handling on hard disk
Hello everybody, I read the paper http://www2008.org/papers/pdf/p387-zhangA.pdf "Performance of Compressed Inverted List Caching in Search Engines" and now I am unsure how Lucene organizes its structures on the hard disk. I am using Windows as OS and therefore use FSDirectory, which is based on java.io.RandomAccessFile. How is the skipping in the .tis file realized? Is there metadata at the beginning of each block, like in the paper mentioned above on page 388 (in the paper, the metadata stores information about how many inverted lists are in the block and where they start)? http://lucene.472066.n3.nabble.com/file/n1413062/Block_assignment.jpg I read in another article that I can seek to the correct position on the hard drive by byte address using java.io.RandomAccessFile (which I can read from the .tii file in "IndexDelta"?). How do I find the correct position/location for my posting list/document? Do I need information/metadata about the blocks from the underlying file system? Or where can I find further information about this stuff? :) Best regards Alex
Implementing indexing of Versioned Document Collections
Hello everybody, I would like to implement the paper "Compact Full-Text Indexing of Versioned Document Collections" [1] by Torsten Suel in Lucene for my diploma thesis. The basic idea is to create a two-level index structure. On the first level a document is identified by document ID, with a posting list entry if the term exists in at least one version. For every posting on the first level with term t we have a bitvector on the second level. These bitvectors contain as many bits as there are versions of the document, and bit i is set to 1 if version i contains term t, otherwise it remains 0. http://lucene.472066.n3.nabble.com/file/n1872701/Unbenannt_1.jpg This little picture is just for demonstration purposes. It shows a posting list for the term "car" composed of 4 document IDs. If a hit is found in document 6, another look-up is needed on the second level to get the corresponding versions (versions 1, 5, 7, 8, 9, 10 out of 10 versions in total). At the moment I am using Wikipedia (simplewiki dump) as the source, with a SAX parser, and can resolve each document with all its versions from the XML file (fields are Title, ID, and Content, separated per version). My problem is that I am unsure how to connect the second level with the first one and how to store it. The key points needed are: information from posting list creation to create the bitvector (term -> doc -> versions), storing the bitvectors, and implementing search on the second level. For the first steps I disabled term frequencies and positions because the paper doesn't handle them. I would be happy to get any running version at all. :) At the moment I can create bitvectors for the documents. I realized this with a HashMap in TermsHashPerField, where I grab the current term in add() (I hope this is the correct location for retrieving the inverted list's terms). Anyway, I can create the correct bitvectors and write them into a text file. Excerpt of bitvectors from the article "April": april : 110110111011, never : 0010, ayriway : 010110111011, inclusive : 1000. The next step would be storing all bitvectors in the index. At first glance I would like to use an extra field to store the created bitvectors permanently in the index. It seems to be the easiest way for a first implementation without touching the low-level functions of Lucene. Can I add a field after I have already started writing the document through IndexWriter? How would I do this? Or are there other suggestions for storing? Another idea is to extend the index format of Lucene, but this seems a little bit too difficult for me. Maybe I could write this information into my own file. Could anybody point me in the right direction? :) Currently I am focusing on storing and will try to extend Lucene's search after that step. Thanks in advance & best regards Alex [1] http://cis.poly.edu/suel/
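For the bookkeeping itself (term -> doc -> versions), a plain second analysis pass outside the indexing chain is often the least intrusive option. A sketch with Lucene 3.0-style attributes, where "versions" is the list of version texts of one article and all other names are invented:

Map<String, BitSet> termBits = new HashMap<String, BitSet>();
for (int v = 0; v < versions.size(); v++) {
    TokenStream ts = analyzer.tokenStream("content", new StringReader(versions.get(v)));
    TermAttribute termAtt = ts.addAttribute(TermAttribute.class);
    while (ts.incrementToken()) {
        BitSet bits = termBits.get(termAtt.term());
        if (bits == null) {
            bits = new BitSet(versions.size());
            termBits.put(termAtt.term(), bits);
        }
        bits.set(v);                  // version v contains this term
    }
    ts.close();
}
// termBits now maps e.g. "april" -> 110110111011 for a 12-version article

The same map can then be serialized per document (own file, stored field, or payloads as discussed in the follow-ups) without having to hook into TermsHashPerField.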
Re: Implementing indexing of Versioned Document Collections
Hello Pulkit, thank you for your answer and excuse my late reply. I am currently working on the payload stuff and have implemented my own Analyzer and TokenFilter for adding custom payloads. As far as I understand, I can add a payload for every term occurrence, and it is written into the posting list. My posting list now looks like this: car -> DocID 1, [Payload 1], DocID 2, [Payload 2], ..., DocID N, [Payload N], where each payload is a BitSet over the versions of a document. I must admit that the index is getting really big at the moment, because I am adding around 8 to 16 bytes with each payload. I have to find a good compression for the bitvectors. Furthermore, I always get the error org.apache.lucene.index.CorruptIndexException: checksum mismatch in segments file if I use my own Analyzer. After I comment out the checksum test everything works fine. Even Luke isn't giving me an error. Any ideas? Another problem is the bitvector creation during tokenization. I run through all versions during the tokenizing step to create my bitvectors (stored in a HashMap). So my bitvectors are only completely created after the last field has been analyzed (I added every Wikipedia version as its own field). Therefore I need to add the payload after the tokenizing step. Is this possible? What happens if I add a payload for the current term and add another payload for the same term later? Is it overwritten or appended? Greetings Alex
Re: Implementing indexing of Versioned Document Collections
Hi again, my payloads are working fine, as I figured out now (I hadn't seen the nextPosition method). I really have problems with adding the bitvectors, though. Currently I create them during tokenization. Therefore, as already mentioned, they are only completely created once all fields are tokenized, because I add every new term occurrence into a HashMap and create/update the linked bitvector during this analysis process. I read in another post that changing or updating already written payloads isn't possible. Furthermore, I need to store the payload only ONCE per term, and not on every term position. For example, in the Wikipedia article for April I would have around 5000 term occurrences for the term "April"! Storing it once would save a lot of memory. 1) Is it possible to pre-analyze fields? Maybe analyzing twice: the first time for building the bitvectors (without writing them!) and the second time for normal index writing with bitvector payloads. 2) Alternatively I could still add the bitvectors during tokenization if I were able to set the current term in my custom filter (extends TokenFilter). In my HashMap I have pairs of term and bitvector, and I could iterate over all term keys. Is it possible to manually set the current term and the corresponding payload? I tried something like this after all fields and streams had been tokenized (without success):

for (Map.Entry e : map.entrySet()) {
    key = e.getKey();
    value = e.getValue();
    termAtt.setTermBuffer(key);
    bitvectorPayload = new Payload(toByteArray(value));
    payloadAttr.setPayload(bitvectorPayload);
}

3) Can I use payloads without term positions? If my questions are unclear, please tell me! :) Best regards Alex
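Regarding 2), the payload has to be set from inside incrementToken() while the stream is being consumed; it cannot be pushed in afterwards. What does work is attaching the payload only on the first occurrence of each term. A minimal sketch of such a filter (Lucene 3.0-style attributes; "bitvectors" is assumed to be the term-to-encoded-bitvector map built beforehand, imports omitted):

public final class FirstOccurrencePayloadFilter extends TokenFilter {
    private final Map<String, byte[]> bitvectors;
    private final Set<String> seen = new HashSet<String>();
    private final TermAttribute termAtt = addAttribute(TermAttribute.class);
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

    public FirstOccurrencePayloadFilter(TokenStream in, Map<String, byte[]> bitvectors) {
        super(in);
        this.bitvectors = bitvectors;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        String term = termAtt.term();
        if (seen.add(term) && bitvectors.containsKey(term)) {
            payloadAtt.setPayload(new Payload(bitvectors.get(term)));   // first occurrence only
        } else {
            payloadAtt.setPayload(null);                                // no payload on later positions
        }
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        seen.clear();   // start over when the filter is reused for the next field/document
    }
}

This presupposes that the bitvector map is complete before the field is handed to IndexWriter, i.e. the statistics pass (question 1) has to happen first.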
Indexing large XML dumps
Hello everybody, I am currently indexing Wikipedia dumps and create an index for versioned document collections. So far everything is working fine, but I never thought that single Wikipedia articles would reach a size of around 2 GB! One article, for example, has 2 versions with an average length of 6 characters each (HUGE in memory!). This means I need a heap of around 4 GB to perform indexing, and I would like to decrease my memory consumption ;). At the moment I load every Wikipedia article completely into memory, containing all versions. Then I collect some statistical data about the article in order to store extra information about term occurrences, which is written into the index as payloads. The statistics are created during a separate tokenization run which happens before the document is written to the index. This means I am analyzing my documents twice! :( I know there is a CachingTokenFilter, but I haven't found out how and where to implement it exactly (I tried it in my Analyzer, but stream.reset() seems not to work). Does somebody have a nice example? 1) Can I somehow avoid loading one complete article to get my statistics? 2) Is it possible to index large files without completely loading them into a field? 3) How can I avoid parsing an article twice? Best regards Alex
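For 3), a sketch of how CachingTokenFilter is usually wired in (Lucene 3.0-style; the analyzer, field name and statistics step are placeholders): consume the cached stream once for the statistics, reset() it, and then hand the same instance to the Field so IndexWriter replays the cached tokens instead of re-tokenizing.

TokenStream raw = analyzer.tokenStream("content", new StringReader(versionText));
CachingTokenFilter cached = new CachingTokenFilter(raw);
TermAttribute termAtt = cached.addAttribute(TermAttribute.class);
while (cached.incrementToken()) {
    // first pass: collect the per-version statistics here
}
cached.reset();                             // rewind to the cached tokens

Document doc = new Document();
doc.add(new Field("content", cached));      // TokenStream-valued field; tokens are taken from the stream, not re-analyzed
writer.addDocument(doc);

The important detail is that reset() only replays the cache once the stream has actually been consumed; calling it before any tokens have been pulled is a no-op, which may be why it appeared not to work inside the Analyzer. Note this avoids the second tokenization but not the memory for the cached token states.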
Could not find implementing class
Hello everybody, I used a small indexing example from "Lucene in Action" and can compile and run the program under Eclipse. If I compile and run it from the console I get this error:

java.lang.IllegalArgumentException: Could not find implementing class for org.apache.lucene.analysis.tokenattributes.TermAttribute
    at org.apache.lucene.util.AttributeSource$AttributeFactory$DefaultAttributeFactory.getClassForInterface(AttributeSource.java:87)
    at org.apache.lucene.util.AttributeSource$AttributeFactory$DefaultAttributeFactory.createAttributeInstance(AttributeSource.java:66)
    at org.apache.lucene.util.AttributeSource.addAttribute(AttributeSource.java:245)
    at org.apache.lucene.index.DocInverterPerThread$SingleTokenAttributeSource.(DocInverterPerThread.java:41)
    at org.apache.lucene.index.DocInverterPerThread$SingleTokenAttributeSource.(DocInverterPerThread.java:36)
    at org.apache.lucene.index.DocInverterPerThread.(DocInverterPerThread.java:34)
    at org.apache.lucene.index.DocInverter.addThread(DocInverter.java:95)
    at org.apache.lucene.index.DocFieldProcessorPerThread.(DocFieldProcessorPerThread.java:62)
    at org.apache.lucene.index.DocFieldProcessor.addThread(DocFieldProcessor.java:88)
    at org.apache.lucene.index.DocumentsWriterThreadState.(DocumentsWriterThreadState.java:43)
    at org.apache.lucene.index.DocumentsWriter.getThreadState(DocumentsWriter.java:739)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:814)
    at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:802)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1998)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1972)
    at Demo.setUp(Demo.java:86)
    at Demo.main(Demo.java:46)

I compile with javac -cp Demo.java, which finishes without errors, but running the program isn't possible. What am I missing?? Basically I am just creating a directory, getting an IndexWriter with an Analyzer, etc. Line 86 in Demo.java is writer.addDocument(doc);. Greetings Alex
Re: Could not find implementing class
Hello Alexander, isn't it enough to add the classpath through -cp? If I don't use -cp I can't compile my project. I thought that after compiling without errors all sources were correctly added. In Eclipse I added the Lucene sources the same way (which works), and I also tried using the jar file. So all classes seem to be found, but the error message doesn't give me a clue. The error is thrown by the Lucene class DefaultAttributeFactory in org.apache.lucene.util.AttributeSource. I work under Ubuntu and configured Java with: sudo update-alternatives --config java, and sudo update-java-alternatives -java-6-sun. Greetings Alex
RE: Could not find implementing class
Hello Uwe, I recompiled some classes manually in the Lucene sources. Now it's running fine! Something went wrong there. Thank you very much! Best regards Alex
Storing payloads without term-position and frequency
Hello everybody, I am currently using Lucene 3.0.2 with payloads. I store extra information in the payloads about the term, like frequencies, and therefore I don't need the frequencies and term positions normally stored by Lucene. I would like to set f.setOmitTermFreqAndPositions(true), but then I am not able to retrieve payloads. Would it be hard to "hack" Lucene for my requirements? Also, I only store one payload per term, if that makes things easier. Best regards Alex
How are stored Fields/Payloads loaded
Hello everybody, I am currently unsure how stored data is written to and loaded from the index. I want to store some binary data for every term of a document, but only once and not for every position! Therefore I am not sure whether payloads or stored fields are the better solution (or the not yet implemented column stride fields feature). As far as I know, all stored fields of a document are loaded by Lucene when the document is retrieved during search. With large stored fields this can be time consuming, and therefore there is the possibility to load specific fields with a FieldSelector. Maybe I could create a stored field for each term (up to several thousand fields!) and read those fields depending on the query term. Is this a common approach? The other possibility (as I have implemented it at the moment) is to store one payload per term, but only on the first term position. Payloads are loaded only if I retrieve them from a hit, right? So my current posting list looks like this: http://lucene.472066.n3.nabble.com/file/n2598739/Payload.png (picture adapted from M. McCandless, "Fun with Flex"). How will the column stride fields feature (or per-document field) work? It's not clear to me what "per document" exactly means for the posting list entries. I think (hope :P) it works like this: http://lucene.472066.n3.nabble.com/file/n2598739/CSD.png (picture adapted from M. McCandless, "Fun with Flex"). Do I understand column stride fields correctly? What would give me the best performance (stored field, payload, CSF)? Are there other ways to retrieve payloads during search than SpanQuery (I would like to use a normal query here)? Regards Alex
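One more option besides SpanQuery: the payload on the first position can be read directly through the TermPositions API, independently of how the query itself was run. A sketch for Lucene 3.0 (field and term are placeholders):

TermPositions tp = reader.termPositions(new Term("content", "car"));
while (tp.next()) {
    tp.nextPosition();                        // payloads hang off positions, so advance to the first one
    if (tp.isPayloadAvailable()) {
        byte[] data = tp.getPayload(new byte[tp.getPayloadLength()], 0);
        // decode the per-version data for document tp.doc() here
    }
}
tp.close();

This way a normal query can produce the doc IDs, and the payloads are fetched afterwards only for the hits that survive.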
Early Termination
Hi, is Lucene capable of any early-termination techniques during query processing? On the forum I only found some information about TimeLimitingCollector. Are there more implementations? Regards Alex
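For reference, a sketch of how that one built-in mechanism is used in 3.0/3.1: wrap any Collector in a TimeLimitingCollector and catch the exception when the budget runs out (the 200 ms budget and hit count are arbitrary):

TopScoreDocCollector tdc = TopScoreDocCollector.create(10, true);
Collector collector = new TimeLimitingCollector(tdc, 200);   // stop collecting after 200 ms
try {
    searcher.search(query, collector);
} catch (TimeLimitingCollector.TimeExceededException e) {
    // partial results collected so far are still usable
}
TopDocs hits = tdc.topDocs();

This terminates on time, not on score; score-based early termination (impact-sorted or tiered indexes) is not built into Lucene at this point and would need a custom Collector or index organization.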
Lucene 4.0 Payloads
Hello everybody, I am currently experimenting with Lucene 4.0 and would like to add payloads. The payload should only be added once per term, on its first position. My current code looks like this:

public final boolean incrementToken() throws java.io.IOException {
    if (!input.incrementToken()) {
        return false;
    }
    String term = characterAttr.toString();
    // hmh contains all terms for one document
    if (hmh.checkKey(term)) {                                       // check if the HashMap contains the term
        Payload payload = new Payload(hmh.getCompressedData(term)); // get payload data
        payloadAttr.setPayload(payload);                            // add payload
        hmh.removeFromIndexingMap(term);                            // remove term so the payload is added only once
    } else {
        payloadAttr.setPayload(null);                               // clear the attribute for later occurrences
    }
    return true;
}

Is this a correct way of adding payloads in Lucene 4.0? When I try to retrieve the payloads I am not getting a payload on the first position. For reading payloads I use this:

DocsAndPositionsEnum tp = MultiFields.getTermPositionsEnum(ir, MultiFields.getDeletedDocs(ir), fieldName, new BytesRef(searchString));
while (tp.nextDoc() != DocsAndPositionsEnum.NO_MORE_DOCS) {
    tp.nextPosition();                          // payloads are per position, so advance to the first position
    if (tp.hasPayload() && counter < 10) {
        Document doc = ir.document(tp.docID());
        BytesRef br = tp.getPayload();
        System.out.println("Found payload \"" + br.utf8ToString() + "\" for document " + tp.docID()
                + " and query " + searchString + " in country " + doc.get("country"));
    }
}

As far as I know there are two possibilities to use payloads: 1) during similarity scoring, 2) during search. Is there a better/faster way to retrieve payloads during search? Is it possible to run a normal query and read the payloads from the hits? Is 1 or 2 the faster way to use payloads? Can I find example code somewhere for Lucene and loading payloads? Regards Alex
lucene-snowball 3.1.0 packages are missing?
Hello, I'm trying to upgrade Lucene in my project to the 3.1.0 release, but there is no lucene-snowball 3.1.0 package on Maven Central. Is this intended behaviour? Should I continue to use 3.0.3 for the snowball package? -- With best wishes, Alex Ott http://alexott.blogspot.com/ http://alexott.net/ http://alexott-ru.blogspot.com/ Skype: alex.ott
New codecs keep Freq skip/omit Pos
Hello everybody, I am currently testing several new Lucene 4.0 codec implementations to compare against my own solution. The difference is that I only index frequencies and not positions, and I would like to have the same for the other codecs. I know there was already a post on this topic: http://lucene.472066.n3.nabble.com/Omit-positions-but-not-TF-td599710.html. I just wanted to ask whether anything has changed, especially for the new codecs. I had a look at FixedPostingWriterImpl and PostingsConsumer; are those the right places for adapting freq/position handling? What would happen if I just skip writing positions/payloads? Would it mess up the index? The written files have different endings like .pyl, .skp, .pos, .doc etc. Does "not counting" the .pos file give me a correct index size estimate for "with freqs, without positions"? Or where exactly are term positions written? Regards Alex

PS: Some results with the current codecs, if anyone is interested. I indexed 10% of the English Wikipedia; each version is indexed as a document.

Docs: 240179
Versions: 8467927
Distinct terms: 3501214
Total terms: 1520008204
Avg. versions per doc: 35.25
Avg. terms per version: 179.50
Avg. terms per doc: 6328.65

PforDelta W Freq W Pos: 20.6 GB
PforDelta W/O Freq W/O Pos: 1.6 GB
Standard 4.0 W Freq W Pos: 28.1 GB
Standard 4.0 W/O Freq W/O Pos: 6.2 GB
Pfor W Freq W Pos: 22 GB
Pfor W/O Freq W/O Pos: 3.1 GB

Performance follows ;)
Re: New codecs keep Freq skip/omit Pos
Hello Robert, thank you for the answers! :) Currently I use PatchedFrameOfRef and PatchedFrameOfRef2, so both implementations are PForDelta variants; sorry, my mistake. PatchedFrameOfRef2: PforDelta W/O Freq W/O Pos 1.6 GB. PatchedFrameOfRef: Pfor W/O Freq W/O Pos 3.1 GB. Here are some numbers:

PatchedFrameOfRef2 w/o POS w/o FREQ:
segments.gen 20 bytes
_43.fdt 8.1 MB
_43.fdx 64.4 MB
_43.fnm 20 bytes
_43_0.skp 182.6 MB
_43_0.tib 32.3 MB
_43_0.tiv 1.0 MB
segments_2 268 bytes
_43_0.doc 1.3 GB

PatchedFrameOfRef w/o POS w/o FREQ:
segments.gen 20 bytes
_43.fdt 8.1 MB
_43.fdx 64.4 MB
_43.fnm 20 bytes
_43_0.skp 182.6 MB
_43_0.tib 32.3 MB
_43_0.tiv 1.1 MB
segments_2 267 bytes
_43_0.doc 2.8 GB

During indexing I use StandardAnalyzer (StandardFilter, LowerCaseFilter, StopFilter). Can I find more information on codec creation somewhere, or is it just digging through the code? My own implementation needs 2.8 GB of space including frequencies but not positions. This is why I am asking; I want to compare the results somehow. Compared to 20 GB it is very nice, and compared to 1.6 GB it is very bad ;). Regards Alex
Re: New codecs keep Freq skip/omit Pos
I also indexed once with Lucene 3.0. Are these sizes really exactly the same?

Standard 4.0 W Freq W Pos: 28.1 GB
Standard 4.0 W/O Freq W/O Pos: 6.2 GB
Standard 3.0 W Freq W Pos: 28.1 GB
Standard 3.0 W/O Freq W/O Pos: 6.2 GB

Regards Alex
Re: New codecs keep Freq skip/omit Pos
Wow, cool, I will give that a try! Thank you!! Alex
Re: New codecs keep Freq skip/omit Pos
Hi Robert, the adapted codec is running, but it seems to be incredibly slow, so it will take some time ;) Here are some performance results:

Indexing scheme / Index size / Avg. query time / Max. query time
PforDelta2 W Freq W Pos: 20.6 GB (3.3 GB w/o .pos), 81.97 ms, 1295 ms
PforDelta2 W/O Freq W/O Pos: 1.6 GB, 63.33 ms, 766 ms
Standard 4.0 W Freq W Pos: 28.1 GB (8.1 GB w/o .prx), 77.71 ms, 978 ms
Standard 4.0 W/O Freq W/O Pos: 6.2 GB, 59.93 ms, 718 ms
Standard 3.0 W Freq W Pos: 28.1 GB (8.1 GB w/o .prx), 71.41 ms, 978 ms
Standard 3.0 W/O Freq W/O Pos: 6.2 GB, 72.72 ms, 845 ms
PforDelta W Freq W Pos: 22 GB (5 GB w/o .pos), 67.98 ms, 783 ms
PforDelta W/O Freq W/O Pos: 3.1 GB, 56.08 ms, 596 ms
Huffman BL10 W Freq W/O Pos: 2.6 GB, 216.29 ms (Mem 14 ms), 1338 ms

I am a little bit curious about the Lucene 3.0 performance results, because the larger index seems to be faster?! I have already run the test several times. Are my results realistic at all? I thought PForDelta/2 would outperform the standard index implementations in query processing. The last row is my own implementation. I am still trying to get it smaller because I think I can improve compression further. For indexing I use PForDelta2 in combination with payloads; those are causing the higher runtimes. In memory it looks nice. The gap between my solution and PForDelta is already 700 MB, which I would call an improvement. :D I will have a look at it again after I get an index with your adapted implementation. I still have another question. The basic idea of my implementation is to create a "two-level" index structure specialized for versioned document collections. On the first level I create a posting list entry for a document whenever a term occurs in one or more of its versions. The second level holds the corresponding term frequency information. Is it possible to build such a structure by creating a codec? For query processing it should filter per boolean query on the first level and only fetch information from the second level when the document is in the intersection of the first level. At the moment I use payloads to "simulate" a two-level structure. Normally all payloads corresponding to a query get fetched, right? If this structure were possible, there are several more implementations with promising results (Two-Level Diff/MSA in this paper: http://cis.poly.edu/suel/papers/version.pdf). Regards Alex
Re: New codecs keep Freq skip/omit Pos
> it depends upon the type of query.. what queries are you using for > this benchmarking and how are you benchmarking? > FYI: for benchmarking standard query types with wikipedia you might be > interested in http://code.google.com/a/apache-extras.org/p/luceneutil/ I have 1 queries from a AOL data set where the followed link lead to wikipedia. I benchmark by warming up the indexSearcher with 5000 and perform the test with the remaining 5000 queries. I just measure the time needed to execute the queries. I use QueryParser. > wait, you are indexing payloads for your tests with these other codecs > when it says "W POS" ? No only my last implementation uses payloads. All others not. Therefore I use a payload aware query for Huffman. > keep in mind that even adding a single payload to your index slows > down the decompression of the positions tremendously, because payload > lengths are intertwined with the positions. For block codecs payloads > really need to be done differently so that blocks of positions are > really just blocks of positions. This hasn't yet been fixed for the > sep nor the fixed layouts, so if you add any payloads, and then > benchmark positional queries then the results are not realistic. Oh I know that payloads slow down query processing but I wasn't aware of the block codec problem. I suggest you mean with not realistic they will be slower? Some numbers for Huffman: 20 Bytes segements.gen 234.6 KB fdt 1.8 MB fdx 20 bytes fnm 626.1 MB pos 1.7 GB pyl 17.8 MB skp 39.8 MB tib 2028.5 KB tiv 268 Bytes Segments_2 214.6 MB doc I used here for query processing PayloadQueryParser and adapt the similarity according to my payloads. > No they do not, only if you use a payload based query such as > PayloadTermQuery. Normal non-positional queries like TermQuery and > even normal positional queries like PhraseQuery don't fetch payloads > at all... Sorry my question was misleading. I already focused on a payload aware query. When I use one how exactly are the payload informations fetched from disk? For example if a query needs to read two posting lists. Are all payloads fetched for them directly or is Lucene at first making a boolean intersection and then retrieves the payloads for documents within that intersection? > From the description of what you are doing I don't understand how > payloads fit in because they are per-position? But, I haven't had the > time to digest the paper you sent yet. I will try to summarize it and how I adapted it to Lucene. I already mentioned the idea of two levels for versioned document collections. When I parse Wikipedia I unite for one article all terms of all versions. From this word bag I extract each distinct term and index it with Lucene into one document. Frequency information is now "lost" for the first level but will be stored on the second. This is what I meant with " The first level contains a posting for a document when a term occurs at least in one version". For example if an article has two versions like version1: "a b b" and version2: "a a a c c" only 'a','b' and 'c' are indexed. For the second level I collected term frequency information during my parsing step. Those frequencies are stored as a vector in version order. For the above example the frequency vector for 'a' would be [1,3]. I store these vectors as payloads which I see as the "second level". Every distinct term on first level receives a single frequency vector on its first position. So I somehow abuse payloads. For query processing I now need to retrieve the docs and payloads. 
It would be optimal to process the posting lists first, ignoring payloads, and then fetch the payloads (frequency information) only for the remaining docs. The term frequency is then used for ranking purposes. At the moment I pick for ranking the highest value from the frequency vector, which corresponds to the best matching version.

Regards
Alex
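To make the two-level scheme above concrete, here is a minimal sketch of a token stream that emits each distinct term of an article exactly once and attaches its per-version frequency vector as a payload. It is written against the current Lucene payload API rather than the 3.0.x API used in this thread, and the class name, the map of term to frequency vector, and the one-byte-per-version encoding are all made-up placeholders:

    import java.util.Iterator;
    import java.util.Map;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
    import org.apache.lucene.util.BytesRef;

    // Emits one token per distinct term; the payload carries the term's
    // frequency in every version of the article, in version order.
    final class TermWithFreqVectorStream extends TokenStream {
      private final Iterator<Map.Entry<String, int[]>> entries;
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
      private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

      TermWithFreqVectorStream(Map<String, int[]> termToFreqVector) {
        this.entries = termToFreqVector.entrySet().iterator();
      }

      @Override
      public boolean incrementToken() {
        if (!entries.hasNext()) {
          return false;
        }
        clearAttributes();
        Map.Entry<String, int[]> entry = entries.next();
        termAtt.append(entry.getKey());
        payloadAtt.setPayload(encode(entry.getValue()));
        return true;
      }

      // Toy encoding: one byte per version, capped at 255 occurrences.
      private static BytesRef encode(int[] freqs) {
        byte[] bytes = new byte[freqs.length];
        for (int i = 0; i < freqs.length; i++) {
          bytes[i] = (byte) Math.min(freqs[i], 255);
        }
        return new BytesRef(bytes);
      }
    }

The field indexed with such a stream has to keep positions (payloads are stored per position), which matches the observation above that each term's frequency vector sits on its single position.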
Lucene query processing
Hello everybody,

As far as I know, Lucene processes documents DAAT. Depending on the query, either the intersection or the union is calculated. For the intersection, only documents occurring in all posting lists are scored. In the union case every document is scored, which makes it a more expensive operation. Lucene stores its index in several files, and depending on the query, different files might be accessed for scoring. For example, a payload query needs to read payloads from .pos. What is not clear to me is how term frequencies and payloads are processed. Assuming I store term frequencies, I need to set setOmitTermFreqAndPositions(false).

1) Which queries include term frequencies? I assume all queries, if term frequencies are stored?
2) Why is fetching payloads so much more expensive than getting term frequencies? Both are stored in separate files and therefore demand a disk seek.
3) What value does tf contain if I set setOmitTermFreqAndPositions(true)? Always 1?
4) How are term freqs and payloads read from disk? In bulk for all remaining docs at once, or every time a document gets scored?

Regards
Alex
Similarity class and searchPayloads
Hello everybody,

I am just curious about the following case. Currently, I create a boolean AND query which loads payloads. In some cases Lucene loads payloads but does not return hits. Therefore, I assume that payloads are loaded directly with each doc ID from the posting list, before the boolean filter is applied. Is that right? Is it possible to filter documents first and then load the payloads? For example, with three terms I would check in every posting list whether the current doc ID is available, and only then load the payload. Or can anybody tell me where exactly Lucene loads payloads in the code?

Regards
Alex
Question about the CompoundWordTokenFilterBase
Hi, While trying to play with the CompoundWordTokenFilterBase I noticed that the behavior is to include the original token together with the new sub-tokens. I assume this is expected (I haven't found any relevant docs on this), but I was wondering whether it's a hard requirement or whether I can propose a small change to skip the original token (controlled by a flag)? If there's interest I can put this in a JIRA issue and we can continue the discussion there. The patch is not too complicated, but I haven't run any of the tests yet :) thanks, alex
Performance issues with the default field compression
Hi,

I was investigating some performance issues, and during profiling I noticed that a significant amount of time is being spent decompressing fields which are unrelated to the actual field I'm trying to load from the Lucene documents. In our benchmark, doing mostly a simple full-text search, 40% of the time was lost in these parts.

My code does the following: reader.document(id, Set(":path")).get(":path"), and this is where the fun begins :) I noticed 2 things; please excuse the ignorance if some of the things I write here are not 100% correct:

- All the fields in the document are being decompressed prior to applying the field filter. We noticed this because we have a lot of content stored in the index, so a significant amount of time is lost decompressing junk. At one point I tried adding the field first, thinking this would save some work, but it doesn't look like it's doing much. Reference code: the visitor is only used at the very end. [0]

- Second, and probably of smaller impact, would be to have the DocumentStoredFieldVisitor return STOP when there are no more fields in the visitor to visit. I only have one, and it looks like it will #skip through a bunch of other stuff before finishing a document. [1]

thanks in advance,
alex

[0] https://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/codecs/compressing/CompressingStoredFieldsReader.java?view=markup#l364
[1] https://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/document/DocumentStoredFieldVisitor.java?view=markup#l100
Re: Performance issues with the default field compression
Hi Adrien, Thanks for clarifying! We're going to go the custom codec & custom visitor route. best, alex On Wed, Apr 9, 2014 at 10:38 PM, Adrien Grand wrote: > Hi Alex, > > Indeed, one or several (the number depends on the size of your > documents) documents need to be fully decompressed in order to read a > single field of a single document. > > Regarding the stored fields visitor, the default one doesn't return > STOP when the field has been found because other fields with the same > name might be stored further in the stream of stored fields (in case > of a multivalued field). If you know that you have a single field > value, you can write your own field visitor that will return STOP > after the first value has been read. As you noted, this probably has > less impact on performance than the first point that you raised. > > The default stored fields visitor is rather targeted at large indices > where compression helps save disk space and can also make stored > fields retrieval faster since a larger portion of the stored fields > can fit in the filesystem cache. However, if your index is small and > fully fits in the filesystem cache, this stored fields format might > indeed have non-negligible overhead. > > > On Wed, Apr 9, 2014 at 9:17 PM, Alex Parvulescu > wrote: > > Hi, > > > > I was investigating some performance issues and during profiling I > noticed > > that there is a significant amount of time being spent decompressing > fields > > which are unrelated to the actual field I'm trying to load from the > lucene > > documents. In our benchmark doing mostly a simple full-test search, 40% > of > > the time was lost in these parts. > > > > My code does the following: reader.document(id, > Set(":path")).get(":path"), > > and this is where the fun begins :) > > I noticed 2 things, please excuse the ignorance if some of the things I > > write here are not 100% correct: > > > > - all the fields in the document are being decompressed prior to > applying > > the field filter. We've noticed this because we have a lot of content > > stored in the index, so there is an important time lost around > > decompressing junk. At one point I tried adding the field first, thinking > > this will save some work, but it doesn't look like it's doing much. > > Reference code, the visitor is only used at the very end. [0] > > > > - second, and probably of a smaller impact would be to have the > > DocumentStoredFieldVisitor return STOP when there are no more fields in > the > > visitor to visit. I only have one, and it looks like it will #skip > through > > a bunch of other stuff before finishing a document. [1] > > > > thanks in advance, > > alex > > > > > > [0] > > > https://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/codecs/compressing/CompressingStoredFieldsReader.java?view=markup#l364 > > > > [1] > > > https://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/document/DocumentStoredFieldVisitor.java?view=markup#l100 > > > > -- > Adrien > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
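Following Adrien's suggestion, here is a minimal sketch of such a single-field visitor, written against the Lucene 4.x-era StoredFieldVisitor API (where string fields are delivered as Java Strings); the class name is made up and the ":path" field from the thread is only an example:

    import org.apache.lucene.index.FieldInfo;
    import org.apache.lucene.index.StoredFieldVisitor;

    // Visits only the requested field and stops as soon as its single value
    // has been read, instead of skipping through the rest of the document.
    public class SingleStringFieldVisitor extends StoredFieldVisitor {
      private final String fieldName;
      private String value;

      public SingleStringFieldVisitor(String fieldName) {
        this.fieldName = fieldName;
      }

      @Override
      public Status needsField(FieldInfo fieldInfo) {
        if (value != null) {
          return Status.STOP;   // single-valued field: nothing left to read
        }
        return fieldName.equals(fieldInfo.name) ? Status.YES : Status.NO;
      }

      @Override
      public void stringField(FieldInfo fieldInfo, String stringValue) {
        this.value = stringValue;
      }

      public String getValue() {
        return value;
      }
    }

Usage would be along the lines of reader.document(docId, visitor) followed by visitor.getValue(); as discussed above, this assumes the field holds exactly one value per document.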
Lucene 5.2.0 global ordinal based query time join on multiple indexes
Hi,

Does the Global Ordinal based query time join support joining on multiple indexes?

From my testing on 2 indexes with a common join field, the document ids I get back from the ScoreDoc[] when searching are incorrect, though the number of results is the same as if I use the older join query.

For the parent (to) index, the value of the join field is unique to each document. For the child (from) index, multiple documents can have the same value for the join field, which must be found in the parent index. Both indexes have a join field indexed with SortedDocValuesField. The parent index had 7 segments and the child index had 3 segments.

The ordinal map is built with:

    SortedDocValues[] values = new SortedDocValues[searcher1.getIndexReader().leaves().size()];
    for (LeafReaderContext leadContext : searcher1.getIndexReader().leaves()) {
      values[leadContext.ord] = DocValues.getSorted(leadContext.reader(), "join_field");
    }
    MultiDocValues.OrdinalMap ordinalMap = MultiDocValues.OrdinalMap.build(
        searcher1.getIndexReader().getCoreCacheKey(), values, PackedInts.DEFAULT);

Join query:

    joinQuery = JoinUtil.createJoinQuery("join_field", fromQuery,
        new TermQuery(new Term("type", "to")), searcher2,
        ScoreMode.Max, ordinalMap);

Thanks,
Alex
Re: Lucene 5.2.0 global ordinal based query time join on multiple indexes
Seems if I create a MultiReader from my index searchers and create the ordinal map from that MultiReader (and use an IndexSearcher created from the MultiReader in the createJoinQuery), then the correct results are found. On Mon, Jul 20, 2015 at 5:48 PM, Alex Pang wrote: > Hi, > > > > Does the Global Ordinal based query time join support joining on multiple > indexes? > > > > From my testing on 2 indexes with a common join field, the document ids I > get back from the ScoreDoc[] when searching are incorrect, though the > number of results is the same as if I use the older join query. > > > For the parent (to) index, the value of the join field is unique to each > document. > > For the child (from) index, multiple documents can have the same value for > the join field, which must be found in the parent index. > > Both indexes have a join field indexed with SortedDocValuesField. > > > The parent index had 7 segments and child index had 3 segments. > > > Ordinal map is built with: > > SortedDocValues[] values = new SortedDocValues[searcher1 > > .getIndexReader().leaves().size()]; > > for (LeafReaderContext leadContext : searcher1.getIndexReader() > > .leaves()) { > > values[leadContext.ord] = DocValues.getSorted(leadContext.reader(), > > "join_field"); > > } > > MultiDocValues.OrdinalMap ordinalMap = null; > > ordinalMap = MultiDocValues.OrdinalMap.build(searcher1.getIndexReader() > > .getCoreCacheKey(), values, PackedInts.DEFAULT); > > > Join Query: > > joinQuery = JoinUtil.createJoinQuery("join_field", > > fromQuery, > > new TermQuery(new Term("type", "to")), searcher2, > > ScoreMode.Max, ordinalMap); > > > > Thanks, > > Alex >
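A minimal sketch of that MultiReader-based setup, using the Lucene 5.2-era APIs from the thread (reader1 and reader2 stand for the two index readers; fromQuery, the join field and the type filter are the placeholders already used above):

    // Wrap both indexes in one MultiReader and build the ordinal map over it,
    // so the ordinals and the searcher used by the join agree with each other.
    MultiReader multiReader = new MultiReader(reader1, reader2);
    IndexSearcher multiSearcher = new IndexSearcher(multiReader);

    SortedDocValues[] values = new SortedDocValues[multiReader.leaves().size()];
    for (LeafReaderContext context : multiReader.leaves()) {
      values[context.ord] = DocValues.getSorted(context.reader(), "join_field");
    }
    MultiDocValues.OrdinalMap ordinalMap = MultiDocValues.OrdinalMap.build(
        multiReader.getCoreCacheKey(), values, PackedInts.DEFAULT);

    Query joinQuery = JoinUtil.createJoinQuery("join_field", fromQuery,
        new TermQuery(new Term("type", "to")), multiSearcher,
        ScoreMode.Max, ordinalMap);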
LUCENE-8396 performance result?
LUCENE-8396 looks pretty good for LBS use cases. Do we have performance results for this approach? It appears to me that it would greatly reduce the number of terms needed to index a polygon; how about search performance? Does it also perform well for complex polygons with hundreds or more coordinates?
Legacy filter strategy in Lucene 6.0
As FilteredQuery was removed in Lucene 6.0, we should use a BooleanQuery to do the filtering. What about the legacy filter strategies such as LEAP_FROG_FILTER_FIRST_STRATEGY or QUERY_FIRST_FILTER_STRATEGY? What is the current filter strategy? Thanks,
Re: Legacy filter strategy in Lucene 6.0
Thanks Adrien. I want to filter out docs based on conditions stored in doc values (those conditions are unselective ranges which are not appropriate to put into the inverted index), so I plan to use some selective term conditions for a first-round search and then filter in a second phase. I see there is a two-phase iterator, but I did not find out how to use it. Is this an appropriate scenario for the two-phase iterator, or is it better to do it in a collector? Is there any guide for the two-phase iterator? Best Regards On Wed, 08 Aug 2018 16:08:39 +0800 Adrien Grand wrote: Hi Alex, These strategies still exist internally, but BooleanQuery decides which one to use automatically based on the cost API (cheaper clauses run first) and whether sub clauses produce bitset-based or postings-based iterators. On Wed, 8 Aug 2018 at 09:46, alex stark wrote: > As FilteredQuery are removed in Lucene 6.0, we should use boolean query to > do the filtering. How about the legacy filter strategy such as > LEAP_FROG_FILTER_FIRST_STRATEGY or QUERY_FIRST_FILTER_STRATEGY? What is the > current filter strategy? Thanks,
RE: Legacy filter strategy in Lucene 6.0
Thanks Uwe, I think you are recommending IndexOrDocValuesQuery/DocValuesRangeQuery, and the articles by Adrien, https://www.elastic.co/blog/better-query-planning-for-range-queries-in-elasticsearch It looks promising for my requirement, I will try on that. On Thu, 09 Aug 2018 16:04:27 +0800 Uwe Schindler wrote Hi, IMHO: I'd split the whole code into a BooleanQuery with two filter clauses. The reverse index based condition (term condition, e.g., TermInSetQuery) gets added as a Occur.FILTER and the DocValues condition is a separate Occur.FILTER. If Lucene executes such a query, it would use the more specific condition (based on cost) to lead the execution, which should be the terms condition. The docvalues condition is then only checked for matches of the first. But you can still go and implement the two-phase iterator, but I'd not do that. Uwe - Uwe Schindler Achterdiek 19, D-28357 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de > -Original Message- > From: alex stark > Sent: Thursday, August 9, 2018 9:12 AM > To: java-user > Cc: java-user@lucene.apache.org > Subject: Re: Legacy filter strategy in Lucene 6.0 > > Thanks Adrien, I want to filter out docs base on conditions which stored in > doc values (those conditions are unselective ranges which is not appropriate > to put into reverse index), so I plan to use some selective term conditions to > do first round search and then filter in second phase. I see there is two > phase iterator, but I did not find how to use it. Is it a appropriate scenario to > use two phase iterator? or It is better to do it in a collector? Is there any > guide of two phase iterator? Best Regards On Wed, 08 Aug 2018 > 16:08:39 +0800 Adrien Grand wrote Hi Alex, These > strategies still exist internally, but BooleanQuery decides which one to use > automatically based on the cost API (cheaper clauses run first) and whether > sub clauses produce bitset-based or postings-based iterators. Le mer. 8 août > 2018 à 09:46, alex stark a écrit : > As FilteredQuery > are removed in Lucene 6.0, we should use boolean query to > do the > filtering. How about the legacy filter strategy such as > > LEAP_FROG_FILTER_FIRST_STRATEGY or QUERY_FIRST_FILTER_STRATEGY? > What is the > current filter strategy? Thanks, - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
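For reference, a minimal sketch of the BooleanQuery Uwe describes, written against Lucene 7.x-style APIs (TermInSetQuery and SortedNumericDocValuesField.newSlowRangeQuery; on 6.x the rough equivalents would be TermsQuery and DocValuesRangeQuery). The field names and values are made up for illustration:

    // Selective condition from the inverted index, unselective range from doc values.
    Query termsFilter = new TermInSetQuery("category",
        new BytesRef("books"), new BytesRef("music"));
    Query docValuesFilter = SortedNumericDocValuesField.newSlowRangeQuery("price", 10L, 100L);

    // Both are non-scoring FILTER clauses; Lucene leads with the cheaper
    // (more selective) clause based on the cost API and verifies the other.
    Query query = new BooleanQuery.Builder()
        .add(termsFilter, BooleanClause.Occur.FILTER)
        .add(docValuesFilter, BooleanClause.Occur.FILTER)
        .build();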
Replacement of CollapsingTopDocsCollector
In Lucene 7.x, CollapsingTopDocsCollector has been removed; is there any replacement for it?
Any way to improve document fetching performance?
Hello experts, I am wondering whether there is any way to improve document fetching performance; it appears to me that reading from stored fields is quite slow. I simply tested using IndexSearcher.doc() to fetch 2000 documents, which takes 50ms. Are there any ideas for improving that?
Re: Any way to improve document fetching performance?
Quite small, just several simple short text stored fields. The total index size is around 1 GB (2m docs). On Mon, 27 Aug 2018 22:12:07 +0800 wrote: Alex, how big are those docs? Best regards On 8/27/18 10:09 AM, alex stark wrote: > Hello experts, I am wondering is there any way to improve document fetching performance, it appears to me that visiting from store field is quite slow. I simply tested to use indexsearch.doc() to get 2000 document which takes 50ms. Is there any idea to improve that?
Re: Any way to improve document fetching performance?
On the same machine, no network latency. When I reduce the limit to 500, it takes 20ms, which is also slower than I expected. Btw, indexing is stopped. On Mon, 27 Aug 2018 22:17:41 +0800 wrote: yes, it should be less than a ms actually for those type of files. index and search on the same machine? no net latency in between? Best On 8/27/18 10:14 AM, alex stark wrote: > quite small, just serveral simple short text store fields. The total index size is around 1 GB (2m doc). On Mon, 27 Aug 2018 22:12:07 +0800 wrote ---- Alex,- how big are those docs? Best regards On 8/27/18 10:09 AM, alex stark wrote: > Hello experts, I am wondering is there any way to improve document fetching performance, it appears to me that visiting from store field is quite slow. I simply tested to use indexsearch.doc() to get 2000 document which takes 50ms. Is there any idea to improve that?
Re: Any way to improve document fetching performance?
I simply tried MultiDocValues.getBinaryValues to fetch results by doc values, and it improves things a lot: 2000 results take only 5 ms. I even encoded all the returnable fields into binary doc values and then decoded them, and the results are also good enough. It seems stored fields do not perform well here. In our scenario (which I think is more common nowadays), the search phase should return as many results as possible so that the rank phase can re-sort the results with a machine learning algorithm (on other clusters), so fetching performance is also important. On Tue, 28 Aug 2018 00:11:40 +0800 Erick Erickson wrote: Don't use that call. You're exactly right, it goes out to disk, reads the doc, decompresses it (16K blocks minimum per doc IIUC) all just to get the field. 2,000 in 50ms actually isn't bad for all that work ;). This sounds like an XY problem. You're asking how to speed up fetching docs, but not telling us anything about _why_ you want to do this. Fetching 2,000 docs is not generally what Solr was built for, it's built for returning the top N where N is usually < 100, most frequently < 20. If you want to return lots of documents' data you should seriously look at putting the fields you want in docValues=true fields and pulling from there. The entire Streaming functionality is built on this and is quite fast. Best, Erick On Mon, Aug 27, 2018 at 7:35 AM wrote: > > can you post your query string? > > Best > > > On 8/27/18 10:33 AM, alex stark wrote: > > In same machine, no net latency. When I reduce to 500 limit, it takes 20ms, which is also slower than I expected. btw, indexing is stopped. On Mon, 27 Aug 2018 22:17:41 +0800 wrote yes, it should be less than a ms actually for those type of files. index and search on the same machine? no net latency in between? Best On 8/27/18 10:14 AM, alex stark wrote: > quite small, just serveral simple short text store fields. The total index size is around 1 GB (2m doc). On Mon, 27 Aug 2018 22:12:07 +0800 wrote Alex,- how big are those docs? Best regards On 8/27/18 10:09 AM, alex stark wrote: > Hello experts, I am wondering is there any way to improve document fetching performance, it appears to me that visiting from store field is quite slow. I simply tested to use indexsearch.doc() to get 2000 document which takes 50ms. Is there any idea to improve that?
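For readers following along, a minimal sketch of the doc-values-based fetching described above (Lucene 7.x-era iterator API; the "title_dv" field name and the UTF-8 decoding are assumptions for illustration):

    // Doc values iterators must be advanced in increasing doc id order,
    // so sort the hits by doc id before reading from them.
    ScoreDoc[] hits = topDocs.scoreDocs.clone();
    Arrays.sort(hits, Comparator.comparingInt(d -> d.doc));

    BinaryDocValues titles = MultiDocValues.getBinaryValues(indexSearcher.getIndexReader(), "title_dv");
    for (ScoreDoc hit : hits) {
      if (titles.advanceExact(hit.doc)) {
        String title = titles.binaryValue().utf8ToString();  // field was encoded as UTF-8 text
        // ... collect the value for the response ...
      }
    }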
Lucene coreClosedListeners memory issues
Hi experts,

I recently ran into memory issues with Lucene. Checking a heap dump, most of the heap is occupied by SegmentCoreReaders.coreClosedListeners, which accounts for nearly half of it.

Dominator Tree
num  retain size(bytes)  percent  percent(live)  class Name
0    14,024,859,136      21.76%   28.23%   com.elesearch.activity.core.engine.lucene.LuceneIndex
     |
     10,259,490,504      15.92%   20.65%   --org.apache.lucene.index.SegmentCoreReaders
     |
     10,258,783,280      15.92%   20.65%   --[field coreClosedListeners] java.util.Collections$SynchronizedSet
     |
     10,258,783,248      15.92%   20.65%   --[field c] java.util.LinkedHashSet
     |
     10,258,783,224      15.92%   20.65%   --[field map] java.util.LinkedHashMap
1    11,865,993,448      18.41%   23.89%   com.elesearch.activity.core.engine.lucene.LuceneIndex
2    11,815,171,240      18.33%   23.79%   com.elesearch.activity.core.engine.lucene.LuceneIndex
3    6,504,382,648       10.09%   13.09%   com.elesearch.activity.core.engine.lucene.LuceneIndex
     |
     5,050,933,760       7.84%    10.17%   --org.apache.lucene.index.SegmentCoreReaders
     |
     5,050,256,008       7.84%    10.17%   --[field coreClosedListeners] java.util.Collections$SynchronizedSet
     |
     5,050,255,976       7.84%    10.17%   --[field c] java.util.LinkedHashSet
     |
     5,050,255,952       7.84%    10.17%   --[field map] java.util.LinkedHashMap
4    2,798,684,240       4.34%    5.63%    com.elesearch.activity.core.engine.lucene.LuceneIndex

thread stack

histogram
num  instances   #bytes          percent  class Name
0    497,527     38,955,989,888  60.44%   long[]
1    18,489,470  7,355,741,784   11.41%   short[]
2    18,680,799  3,903,937,088   6.06%    byte[]
3    35,643,993  3,775,822,640   5.86%    char[]
4    4,017,462   1,851,518,792   2.87%    int[]
5    7,788,280   962,103,784     1.49%    java.lang.Object[]
6    5,256,391   618,467,640     0.96%    java.lang.String[]
7    14,974,224  479,175,168     0.74%    java.lang.String
8    9,585,494   460,103,712     0.71%    java.util.HashMap$Node
9    18,133,885  435,213,240     0.68%    org.apache.lucene.util.RoaringDocIdSet$ShortArrayDocIdSet
10   1,559,661   351,465,624     0.55%    java.util.HashMap$Node[]
11   4,132,738   264,495,232     0.41%    java.util.HashMap
12   1,519,178   243,068,480     0.38%    java.lang.reflect.Method
13   4,068,400   195,283,200     0.30%    com.sun.org.apache.xerces.internal.xni.QName
14   1,181,106   183,932,704     0.29%    org.apache.lucene.search.DocIdSet[]
15   5,721,339   183,082,848     0.28%    java.lang.StringBuilder
16   1,515,804   181,896,480     0.28%    java.lang.reflect.Field
17   348,720     134,652,416     0.21%    com.sun.org.apache.xerces.internal.xni.QName[]
18   3,358,251   134,330,040     0.21%    java.util.ArrayList
19   2,775,517   88,816,544      0.14%    org.apache.lucene.util.BytesRef
total 232,140,701 64,452,630,104

We used an LRUQueryCache with maxSize 1000 and maxRamBytesUsed 64MB.

The coreClosedListeners occupy much more heap than I expected; is there any reason for that?
Re: Lucene coreClosedListeners memory issues
Hi Adrien, I didn't directly open readers. It is controlled by searcher manager. On Mon, 03 Jun 2019 16:32:06 +0800 Adrien Grand wrote It looks like you are leaking readers. On Mon, Jun 3, 2019 at 9:46 AM alex stark <mailto:alex.st...@zoho.com.invalid> wrote: > > Hi experts, > > > > I recently have memory issues on Lucene. By checking heap dump, most of them > are occupied by SegmentCoreReaders.coreClosedListeners which is about nearly > half of all. > > > > > > Dominator Tree > > num retain size(bytes) percent percent(live) class Name > > > > 0 14,024,859,136 21.76% 28.23% > com.elesearch.activity.core.engine.lucene.LuceneIndex > > | > > 10,259,490,504 15.92% 20.65% > --org.apache.lucene.index.SegmentCoreReaders > > | > > 10,258,783,280 15.92% 20.65% --[field coreClosedListeners] > java.util.Collections$SynchronizedSet > > | > > 10,258,783,248 15.92% 20.65% --[field c] java.util.LinkedHashSet > > | > > 10,258,783,224 15.92% 20.65% --[field map] > java.util.LinkedHashMap > > > > 1 11,865,993,448 18.41% 23.89% > com.elesearch.activity.core.engine.lucene.LuceneIndex > > > > 2 11,815,171,240 18.33% 23.79% > com.elesearch.activity.core.engine.lucene.LuceneIndex > > > > 36,504,382,648 10.09% 13.09% > com.elesearch.activity.core.engine.lucene.LuceneIndex > > | > > 5,050,933,760 7.84% 10.17% > --org.apache.lucene.index.SegmentCoreReaders > > | > > 5,050,256,008 7.84% 10.17% --[field coreClosedListeners] > java.util.Collections$SynchronizedSet > > | > > 5,050,255,976 7.84% 10.17% --[field c] java.util.LinkedHashSet > > | > > 5,050,255,952 7.84% 10.17% --[field map] > java.util.LinkedHashMap > > > > 42,798,684,240 4.34% 5.63% > com.elesearch.activity.core.engine.lucene.LuceneIndex > > > > thread stack > > > > histogram > > num instances #bytes percent class Name > > > > 0 497,527 38,955,989,888 60.44% long[] > > 1 18,489,470 7,355,741,784 11.41% short[] > > 2 18,680,799 3,903,937,088 6.06% byte[] > > 3 35,643,993 3,775,822,640 5.86% char[] > > 4 4,017,462 1,851,518,792 2.87% int[] > > 5 7,788,280 962,103,784 1.49% java.lang.Object[] > > 6 5,256,391 618,467,640 0.96% java.lang.String[] > > 7 14,974,224 479,175,168 0.74% java.lang.String > > 8 9,585,494 460,103,712 0.71% java.util.HashMap$Node > > 9 18,133,885 435,213,240 0.68% > org.apache.lucene.util.RoaringDocIdSet$ShortArrayDocIdSet > > 10 1,559,661 351,465,624 0.55% java.util.HashMap$Node[] > > 11 4,132,738 264,495,232 0.41% java.util.HashMap > > 12 1,519,178 243,068,480 0.38% java.lang.reflect.Method > > 13 4,068,400 195,283,200 0.30% > com.sun.org.apache.xerces.internal.xni.QName > > 14 1,181,106 183,932,704 0.29% > org.apache.lucene.search.DocIdSet[] > > 15 5,721,339 183,082,848 0.28% java.lang.StringBuilder > > 16 1,515,804 181,896,480 0.28% java.lang.reflect.Field > > 17 348,720 134,652,416 0.21% > com.sun.org.apache.xerces.internal.xni.QName[] > > 18 3,358,251 134,330,040 0.21% java.util.ArrayList > > 19 2,775,517 88,816,544 0.14% org.apache.lucene.util.BytesRef > > total 232,140,701 64,452,630,104 > > > We used LRUQueryCache with maxSize 1000 and maxRamBytesUsed 64MB. > > > > The coreClosedListeners occupied too much heap than I expected, is there any > reason for that? -- Adrien
Re: Lucene coreClosedListeners memory issues
Thanks Adrien. I double checked on all the acquire, and it all correctly released in finally. What does SegmentCoreReaders.coreClosedListeners do? It seems to close caches. GC log indicates it is highly possible a leak issue as old gen is keeping increasing without dropping while CMS. Why coreClosedListeners increased to such high number in a single day? On Mon, 03 Jun 2019 18:21:34 +0800 Adrien Grand wrote And do you call release on every searcher that you acquire? On Mon, Jun 3, 2019 at 11:47 AM alex stark <mailto:alex.st...@zoho.com> wrote: > > Hi Adrien, > > I didn't directly open readers. It is controlled by searcher manager. > > > > On Mon, 03 Jun 2019 16:32:06 +0800 Adrien Grand > <mailto:jpou...@gmail.com> wrote > > It looks like you are leaking readers. > > On Mon, Jun 3, 2019 at 9:46 AM alex stark > <mailto:alex.st...@zoho.com.invalid> wrote: > > > > Hi experts, > > > > > > > > I recently have memory issues on Lucene. By checking heap dump, most of > > them are occupied by SegmentCoreReaders.coreClosedListeners which is about > > nearly half of all. > > > > > > > > > > > > Dominator Tree > > > > num retain size(bytes) percent percent(live) class Name > > > > > > > > 0 14,024,859,136 21.76% 28.23% > > com.elesearch.activity.core.engine.lucene.LuceneIndex > > > > | > > > > 10,259,490,504 15.92% 20.65% --org.apache.lucene.index.SegmentCoreReaders > > > > | > > > > 10,258,783,280 15.92% 20.65% --[field coreClosedListeners] > > java.util.Collections$SynchronizedSet > > > > | > > > > 10,258,783,248 15.92% 20.65% --[field c] java.util.LinkedHashSet > > > > | > > > > 10,258,783,224 15.92% 20.65% --[field map] java.util.LinkedHashMap > > > > > > > > 1 11,865,993,448 18.41% 23.89% > > com.elesearch.activity.core.engine.lucene.LuceneIndex > > > > > > > > 2 11,815,171,240 18.33% 23.79% > > com.elesearch.activity.core.engine.lucene.LuceneIndex > > > > > > > > 3 6,504,382,648 10.09% 13.09% > > com.elesearch.activity.core.engine.lucene.LuceneIndex > > > > | > > > > 5,050,933,760 7.84% 10.17% --org.apache.lucene.index.SegmentCoreReaders > > > > | > > > > 5,050,256,008 7.84% 10.17% --[field coreClosedListeners] > > java.util.Collections$SynchronizedSet > > > > | > > > > 5,050,255,976 7.84% 10.17% --[field c] java.util.LinkedHashSet > > > > | > > > > 5,050,255,952 7.84% 10.17% --[field map] java.util.LinkedHashMap > > > > > > > > 4 2,798,684,240 4.34% 5.63% > > com.elesearch.activity.core.engine.lucene.LuceneIndex > > > > > > > > thread stack > > > > > > > > histogram > > > > num instances #bytes percent class Name > > > > > > > > 0 497,527 38,955,989,888 60.44% long[] > > > > 1 18,489,470 7,355,741,784 11.41% short[] > > > > 2 18,680,799 3,903,937,088 6.06% byte[] > > > > 3 35,643,993 3,775,822,640 5.86% char[] > > > > 4 4,017,462 1,851,518,792 2.87% int[] > > > > 5 7,788,280 962,103,784 1.49% java.lang.Object[] > > > > 6 5,256,391 618,467,640 0.96% java.lang.String[] > > > > 7 14,974,224 479,175,168 0.74% java.lang.String > > > > 8 9,585,494 460,103,712 0.71% java.util.HashMap$Node > > > > 9 18,133,885 435,213,240 0.68% > > org.apache.lucene.util.RoaringDocIdSet$ShortArrayDocIdSet > > > > 10 1,559,661 351,465,624 0.55% java.util.HashMap$Node[] > > > > 11 4,132,738 264,495,232 0.41% java.util.HashMap > > > > 12 1,519,178 243,068,480 0.38% java.lang.reflect.Method > > > > 13 4,068,400 195,283,200 0.30% com.sun.org.apache.xerces.internal.xni.QName > > > > 14 1,181,106 183,932,704 0.29% org.apache.lucene.search.DocIdSet[] > > > > 15 5,721,339 183,082,848 0.28% java.lang.StringBuilder > > > > 16 
1,515,804 181,896,480 0.28% java.lang.reflect.Field > > > > 17 348,720 134,652,416 0.21% com.sun.org.apache.xerces.internal.xni.QName[] > > > > 18 3,358,251 134,330,040 0.21% java.util.ArrayList > > > > 19 2,775,517 88,816,544 0.14% org.apache.lucene.util.BytesRef > > > > total 232,140,701 64,452,630,104 > > > > > > We used LRUQueryCache with maxSize 1000 and maxRamBytesUsed 64MB. > > > > > > > > The coreClosedListeners occupied too much heap than I expected, is there > > any reason for that? > > > > -- > Adrien > > > -- Adrien - To unsubscribe, e-mail: mailto:java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: mailto:java-user-h...@lucene.apache.org
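Since the discussion above hinges on whether every acquired searcher is released, here is the usual SearcherManager pattern for reference, as a minimal sketch assuming an already-open IndexWriter:

    SearcherManager manager = new SearcherManager(writer, new SearcherFactory());

    IndexSearcher searcher = manager.acquire();
    try {
      // run queries with this searcher
    } finally {
      manager.release(searcher);   // every acquire() needs exactly one release()
      searcher = null;             // never use the searcher after releasing it
    }

    // after index changes, refresh instead of opening new readers yourself
    manager.maybeRefresh();

If a searcher is acquired but never released, the segment's core readers stay reachable and their closed listeners never fire, which would match Adrien's suspicion of leaked readers above.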
Optimizing a boolean query for 100s of term clauses
Hello all,

I'm working on an Elasticsearch plugin (using Lucene internally) that allows users to index numerical vectors and run exact and approximate k-nearest-neighbors similarity queries. I'd like to get some feedback about my usage of BooleanQueries and TermQueries, and see if there are any optimizations or performance tricks for my use case.

An example use case for the plugin is reverse image search. A user can store vectors representing images and run a nearest-neighbors query to retrieve the 10 vectors with the smallest L2 distance to a query vector. More detailed documentation here: http://elastiknn.klibisz.com/

The main method for indexing the vectors is based on Locality Sensitive Hashing <https://en.wikipedia.org/wiki/Locality-sensitive_hashing>. The general pattern is:

1. When indexing a vector, apply a hash function to it, producing a set of discrete hashes. Usually there are anywhere from 100 to 1000 hashes. Similar vectors are more likely to share hashes (i.e., similar vectors produce hash collisions).
2. Convert each hash to a byte array and store the byte array as a Lucene Term at a specific field.
3. Store the complete vector (i.e. floating point numbers) in a binary doc values field.

In other words, I'm converting each vector into a bag of words, though the words have no semantic meaning.

A query works as follows:

1. Given a query vector, apply the same hash function to produce a set of hashes.
2. Convert each hash to a byte array and create a Term.
3. Build and run a BooleanQuery with a clause for each Term. Each clause looks like this: `new BooleanClause(new ConstantScoreQuery(new TermQuery(new Term(field, new BytesRef(hashValue.toByteArray)))), BooleanClause.Occur.SHOULD)`.
4. As the BooleanQuery produces results, maintain a fixed-size heap of its scores. For any score exceeding the min in the heap, load its vector from the binary doc values, compute the exact similarity, and update the heap. Otherwise the vector gets a score of 0.

When profiling my benchmarks with VisualVM, I've found the Elasticsearch search threads spend > 50% of the runtime in these two methods:

- org.apache.lucene.search.DisiPriorityQueue.downHeap (~58% of runtime)
- org.apache.lucene.search.DisjunctionDISIApproximation.nextDoc (~8% of runtime)

So the time seems to be dominated by collecting and ordering the results produced by the BooleanQuery from step 3 above. The exact similarity computation is only about 15% of the runtime. If I disable it entirely, I still see the same bottlenecks in VisualVM. Reducing the number of hashes yields roughly linear scaling (i.e., 400 hashes take ~2x longer than 200 hashes).

The use case seems different from text search in that there's no semantic meaning to the terms, their length, their ordering, their stems, etc. I basically just need the index to be a rudimentary HashMap, and I only care about the scores for the top k results. With that in mind, I've made the following optimizations:

- Disabled tokenization on the FieldType (setTokenized(false))
- Disabled norms on the FieldType (setOmitNorms(true))
- Set similarity to BooleanSimilarity on the Elasticsearch MappedFieldType
- Set index options to IndexOptions.DOCS
- Used the MoreLikeThis heuristic to pick a subset of terms. This understandably only yields a speedup proportional to the number of discarded terms.

I'm using Elasticsearch version 7.6.2 with Lucene 8.4.0.
The main query implementation is here <https://github.com/alexklibisz/elastiknn/blob/c951cf562ab0f911ee760c8be47c19aba98504b9/plugin/src/main/scala/com/klibisz/elastiknn/query/LshQuery.scala>. The actual query that gets executed by Elasticsearch is instantiated on line 98 <https://github.com/alexklibisz/elastiknn/blob/c951cf562ab0f911ee760c8be47c19aba98504b9/plugin/src/main/scala/com/klibisz/elastiknn/query/LshQuery.scala#L98>. It's in Scala, but all of the Java query classes should look familiar.

Maybe there are some settings that I'm not aware of? Maybe I could optimize this by implementing a custom query or scorer? Maybe there's just no way to speed this up? I appreciate any input, examples, links, etc. :) Also, let me know if I can provide any additional details.

Thanks,
Alex Klibisz
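A condensed sketch of the indexing options and the query construction described above (plain Lucene, outside Elasticsearch; `field` and `hashes` are placeholders for the hash field name and the query vector's hash values):

    // Hash field: not tokenized, no norms, docs-only postings.
    FieldType hashFieldType = new FieldType();
    hashFieldType.setTokenized(false);
    hashFieldType.setOmitNorms(true);
    hashFieldType.setIndexOptions(IndexOptions.DOCS);
    hashFieldType.freeze();

    // One constant-score SHOULD clause per hash; the BooleanQuery score is
    // then simply the number of hashes a document shares with the query.
    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    for (byte[] hash : hashes) {
      Query clause = new ConstantScoreQuery(new TermQuery(new Term(field, new BytesRef(hash))));
      builder.add(clause, BooleanClause.Occur.SHOULD);
    }
    Query query = builder.build();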
Re: Optimizing a boolean query for 100s of term clauses
Hi Michael, Thanks for the quick response! I will look into the TermInSetQuery. My usage of "heap" might've been confusing. I'm using a FunctionScoreQuery from Elasticsearch. This gets instantiated with a Lucene query, in this case the boolean query as I described it, as well as a custom ScoreFunction object. The ScoreFunction exposes a single method that takes a doc id and the BooleanQuery score for that doc id, and returns another score. In that method I use a MinMaxPriorityQueue from the Guava library to maintain a fixed-capacity subset of the highest-scoring docs and evaluate exact similarity on them. Once the queue is at capacity, I just return 0 for any docs that had a boolean query score smaller than the min in the queue. But you can actually forget entirely that this ScoreFunction exists. It only contributes ~6% of the runtime. Even if I only use the BooleanQuery by itself, I still see the same behavior and bottlenecks. Thanks - AK On Tue, Jun 23, 2020 at 2:06 PM Michael Sokolov wrote: > You might consider using a TermInSetQuery in place of a BooleanQuery > for the hashes (since they are all in the same field). > > I don't really understand why you are seeing so much cost in the heap > - it's sounds as if you have a single heap with mixed scores - those > generated by the BooleanQuery and those generated by the vector > scoring operation. Maybe you comment a little more on the interaction > there - are there really two heaps? Do you override the standard > collector? > > On Tue, Jun 23, 2020 at 9:51 AM Alex K wrote: > > > > Hello all, > > > > I'm working on an Elasticsearch plugin (using Lucene internally) that > > allows users to index numerical vectors and run exact and approximate > > k-nearest-neighbors similarity queries. > > I'd like to get some feedback about my usage of BooleanQueries and > > TermQueries, and see if there are any optimizations or performance tricks > > for my use case. > > > > An example use case for the plugin is reverse image search. A user can > > store vectors representing images and run a nearest-neighbors query to > > retrieve the 10 vectors with the smallest L2 distance to a query vector. > > More detailed documentation here: http://elastiknn.klibisz.com/ > > > > The main method for indexing the vectors is based on Locality Sensitive > > Hashing <https://en.wikipedia.org/wiki/Locality-sensitive_hashing>. > > The general pattern is: > > > >1. When indexing a vector, apply a hash function to it, producing a > set > >of discrete hashes. Usually there are anywhere from 100 to 1000 > hashes. > >Similar vectors are more likely to share hashes (i.e., similar vectors > >produce hash collisions). > >2. Convert each hash to a byte array and store the byte array as a > >Lucene Term at a specific field. > >3. Store the complete vector (i.e. floating point numbers) in a binary > >doc values field. > > > > In other words, I'm converting each vector into a bag of words, though > the > > words have no semantic meaning. > > > > A query works as follows: > > > >1. Given a query vector, apply the same hash function to produce a set > >of hashes. > >2. Convert each hash to a byte array and create a Term. > >3. Build and run a BooleanQuery with a clause for each Term. Each > clause > >looks like this: `new BooleanClause(new ConstantScoreQuery(new > >TermQuery(new Term(field, new BytesRef(hashValue.toByteArray))), > >BooleanClause.Occur.SHOULD))`. > >4. As the BooleanQuery produces results, maintain a fixed-size heap of > >its scores. 
For any score exceeding the min in the heap, load its > vector > >from the binary doc values, compute the exact similarity, and update > the > >heap. Otherwise the vector gets a score of 0. > > > > When profiling my benchmarks with VisualVM, I've found the Elasticsearch > > search threads spend > 50% of the runtime in these two methods: > > > >- org.apache.lucene.search.DisiPriorityQueue.downHeap (~58% of > runtime) > >- org.apache.lucene.search.DisjunctionDISIApproximation.nextDoc (~8% > of > >runtime) > > > > So the time seems to be dominated by collecting and ordering the results > > produced by the BooleanQuery from step 3 above. > > The exact similarity computation is only about 15% of the runtime. If I > > disable it entirely, I still see the same bottlenecks in VisualVM. > > Reducing the number of hashes yields roughly linear scaling (i.e., 400 > > hashes take ~2x longer than 200 hashes). > >
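A hypothetical sketch of the fixed-capacity logic described above, using a plain java.util.PriorityQueue in place of Guava's MinMaxPriorityQueue; `capacity` and `exactSimilarity(int)` are assumed members of the surrounding score-function class:

    // Min-heap of the highest BooleanQuery scores seen so far; only docs whose
    // score makes it into the heap get the (expensive) exact similarity.
    private final PriorityQueue<Float> topScores = new PriorityQueue<>();

    float score(int docId, float boolScore) {
      if (topScores.size() < capacity) {
        topScores.offer(boolScore);
        return exactSimilarity(docId);
      }
      if (boolScore > topScores.peek()) {
        topScores.poll();
        topScores.offer(boolScore);
        return exactSimilarity(docId);
      }
      return 0f;   // below the current cut-off: not a candidate
    }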
Re: Optimizing a boolean query for 100s of term clauses
The TermsInSetQuery is definitely faster. Unfortunately it doesn't seem to return the number of terms that matched in a given document. Rather it just returns the boost value. I'll look into copying/modifying the internals to return the number of matched terms. Thanks - AK On Tue, Jun 23, 2020 at 3:17 PM Alex K wrote: > Hi Michael, > Thanks for the quick response! > > I will look into the TermInSetQuery. > > My usage of "heap" might've been confusing. > I'm using a FunctionScoreQuery from Elasticsearch. > This gets instantiated with a Lucene query, in this case the boolean query > as I described it, as well as a custom ScoreFunction object. > The ScoreFunction exposes a single method that takes a doc id and the > BooleanQuery score for that doc id, and returns another score. > In that method I use a MinMaxPriorityQueue from the Guava library to > maintain a fixed-capacity subset of the highest-scoring docs and evaluate > exact similarity on them. > Once the queue is at capacity, I just return 0 for any docs that had a > boolean query score smaller than the min in the queue. > > But you can actually forget entirely that this ScoreFunction exists. It > only contributes ~6% of the runtime. > Even if I only use the BooleanQuery by itself, I still see the same > behavior and bottlenecks. > > Thanks > - AK > > > On Tue, Jun 23, 2020 at 2:06 PM Michael Sokolov > wrote: > >> You might consider using a TermInSetQuery in place of a BooleanQuery >> for the hashes (since they are all in the same field). >> >> I don't really understand why you are seeing so much cost in the heap >> - it's sounds as if you have a single heap with mixed scores - those >> generated by the BooleanQuery and those generated by the vector >> scoring operation. Maybe you comment a little more on the interaction >> there - are there really two heaps? Do you override the standard >> collector? >> >> On Tue, Jun 23, 2020 at 9:51 AM Alex K wrote: >> > >> > Hello all, >> > >> > I'm working on an Elasticsearch plugin (using Lucene internally) that >> > allows users to index numerical vectors and run exact and approximate >> > k-nearest-neighbors similarity queries. >> > I'd like to get some feedback about my usage of BooleanQueries and >> > TermQueries, and see if there are any optimizations or performance >> tricks >> > for my use case. >> > >> > An example use case for the plugin is reverse image search. A user can >> > store vectors representing images and run a nearest-neighbors query to >> > retrieve the 10 vectors with the smallest L2 distance to a query vector. >> > More detailed documentation here: http://elastiknn.klibisz.com/ >> > >> > The main method for indexing the vectors is based on Locality Sensitive >> > Hashing <https://en.wikipedia.org/wiki/Locality-sensitive_hashing>. >> > The general pattern is: >> > >> >1. When indexing a vector, apply a hash function to it, producing a >> set >> >of discrete hashes. Usually there are anywhere from 100 to 1000 >> hashes. >> >Similar vectors are more likely to share hashes (i.e., similar >> vectors >> >produce hash collisions). >> >2. Convert each hash to a byte array and store the byte array as a >> >Lucene Term at a specific field. >> >3. Store the complete vector (i.e. floating point numbers) in a >> binary >> >doc values field. >> > >> > In other words, I'm converting each vector into a bag of words, though >> the >> > words have no semantic meaning. >> > >> > A query works as follows: >> > >> >1. 
Given a query vector, apply the same hash function to produce a >> set >> >of hashes. >> >2. Convert each hash to a byte array and create a Term. >> >3. Build and run a BooleanQuery with a clause for each Term. Each >> clause >> >looks like this: `new BooleanClause(new ConstantScoreQuery(new >> >TermQuery(new Term(field, new BytesRef(hashValue.toByteArray))), >> >BooleanClause.Occur.SHOULD))`. >> >4. As the BooleanQuery produces results, maintain a fixed-size heap >> of >> >its scores. For any score exceeding the min in the heap, load its >> vector >> >from the binary doc values, compute the exact similarity, and update >> the >> >heap. Otherwise the vector gets a score of 0. >> > >> > When profiling my benchmarks with Visual
Re: Optimizing a boolean query for 100s of term clauses
Thanks Michael. I managed to translate the TermInSetQuery into Scala yesterday so now I can modify it in my codebase. This seems promising so far. Fingers crossed there's a way to maintain scores without basically converging to the BooleanQuery implementation. - AK On Wed, Jun 24, 2020 at 8:40 AM Michael Sokolov wrote: > Yeah that will require some changes since what it does currently is to > maintain a bitset, and or into it repeatedly (once for each term's > docs). To maintain counts, you'd need a counter per doc (rather than a > bit), and you might lose some of the speed... > > On Tue, Jun 23, 2020 at 8:52 PM Alex K wrote: > > > > The TermsInSetQuery is definitely faster. Unfortunately it doesn't seem > to > > return the number of terms that matched in a given document. Rather it > just > > returns the boost value. I'll look into copying/modifying the internals > to > > return the number of matched terms. > > > > Thanks > > - AK > > > > On Tue, Jun 23, 2020 at 3:17 PM Alex K wrote: > > > > > Hi Michael, > > > Thanks for the quick response! > > > > > > I will look into the TermInSetQuery. > > > > > > My usage of "heap" might've been confusing. > > > I'm using a FunctionScoreQuery from Elasticsearch. > > > This gets instantiated with a Lucene query, in this case the boolean > query > > > as I described it, as well as a custom ScoreFunction object. > > > The ScoreFunction exposes a single method that takes a doc id and the > > > BooleanQuery score for that doc id, and returns another score. > > > In that method I use a MinMaxPriorityQueue from the Guava library to > > > maintain a fixed-capacity subset of the highest-scoring docs and > evaluate > > > exact similarity on them. > > > Once the queue is at capacity, I just return 0 for any docs that had a > > > boolean query score smaller than the min in the queue. > > > > > > But you can actually forget entirely that this ScoreFunction exists. It > > > only contributes ~6% of the runtime. > > > Even if I only use the BooleanQuery by itself, I still see the same > > > behavior and bottlenecks. > > > > > > Thanks > > > - AK > > > > > > > > > On Tue, Jun 23, 2020 at 2:06 PM Michael Sokolov > > > wrote: > > > > > >> You might consider using a TermInSetQuery in place of a BooleanQuery > > >> for the hashes (since they are all in the same field). > > >> > > >> I don't really understand why you are seeing so much cost in the heap > > >> - it's sounds as if you have a single heap with mixed scores - those > > >> generated by the BooleanQuery and those generated by the vector > > >> scoring operation. Maybe you comment a little more on the interaction > > >> there - are there really two heaps? Do you override the standard > > >> collector? > > >> > > >> On Tue, Jun 23, 2020 at 9:51 AM Alex K wrote: > > >> > > > >> > Hello all, > > >> > > > >> > I'm working on an Elasticsearch plugin (using Lucene internally) > that > > >> > allows users to index numerical vectors and run exact and > approximate > > >> > k-nearest-neighbors similarity queries. > > >> > I'd like to get some feedback about my usage of BooleanQueries and > > >> > TermQueries, and see if there are any optimizations or performance > > >> tricks > > >> > for my use case. > > >> > > > >> > An example use case for the plugin is reverse image search. A user > can > > >> > store vectors representing images and run a nearest-neighbors query > to > > >> > retrieve the 10 vectors with the smallest L2 distance to a query > vector. 
> > >> > More detailed documentation here: http://elastiknn.klibisz.com/ > > >> > > > >> > The main method for indexing the vectors is based on Locality > Sensitive > > >> > Hashing <https://en.wikipedia.org/wiki/Locality-sensitive_hashing>. > > >> > The general pattern is: > > >> > > > >> >1. When indexing a vector, apply a hash function to it, > producing a > > >> set > > >> >of discrete hashes. Usually there are anywhere from 100 to 1000 > > >> hashes. > > >> >Similar vectors are more likely to share hashes (i.e., similar > >
Re: Optimizing a boolean query for 100s of term clauses
Hi Toke. Indeed a nice coincidence. It's an interesting and fun problem space! My implementation isn't specific to any particular dataset or access pattern (i.e. infinite vs. subset). So far the plugin supports exact L1, L2, Jaccard, Hamming, and Angular similarities with LSH variants for all but L1. My exact implementation is generally faster than the approximate LSH implementation, hence the thread. You make a good point that this is valuable by itself if you're able to filter down to a small subset of docs. I put a lot of work into optimizing the vector serialization speed and the exact query execution. I imagine with my current implementation there is some breaking point where LSH becomes faster than exact, but so far I've tested with ~1.2M ~300-dimensional vectors and exact is still faster, especially when parallelized across many shards. So speeding up LSH is the current engineering challenge. Are you using Elasticsearch or Lucene directly? If you're using ES and have the time, I'd love some feedback on my plugin. It sounds like you want to compute hamming similarity on your bitmaps? If so that's currently supported. There's an example here: http://demo.elastiknn.klibisz.com/dataset/mnist-hamming?queryId=64121 Also I've compiled a small literature review on some related research here: https://docs.google.com/document/d/14Z7ZKk9dq29bGeDDmBH6Bsy92h7NvlHoiGhbKTB0YJs/edit *Fast and Exact NNS in Hamming Space on Full-Text Search Engines* describes some clever tricks to speed up Hamming similarity. *Large Scale Image Retrieval with Elasticsearch* describes the idea of using the largest absolute magnitude values instead of the full vector. Perhaps you've already read them but I figured I'd share. Cheers - AK On Wed, Jun 24, 2020 at 8:44 AM Toke Eskildsen wrote: > On Tue, 2020-06-23 at 09:50 -0400, Alex K wrote: > > I'm working on an Elasticsearch plugin (using Lucene internally) that > > allows users to index numerical vectors and run exact and approximate > > k-nearest-neighbors similarity queries. > > Quite a coincidence. I'm looking into the same thing :-) > > > 1. When indexing a vector, apply a hash function to it, producing > > a set of discrete hashes. Usually there are anywhere from 100 to 1000 > > hashes. > > Is it important to have "infinite" scaling with inverted index or is it > acceptable to have a (fast) sequential scan through all documents? If > the use case is to combine the nearest neighbour search with other > filters, so that the effective search-space is relatively small, you > could go directly to computing the Euclidian distance (or whatever you > use to calculate the exact similarity score). > > > 4. As the BooleanQuery produces results, maintain a fixed-size > > heap of its scores. For any score exceeding the min in the heap, load > > its vector from the binary doc values, compute the exact similarity, > > and update the heap. > > I did something quite similar for a non-Lucene bases proof of concept, > except that I delayed the exact similarity calculation and over- > collected on the heap. > > Fleshing that out: Instead of producing similarity hashes, I extracted > the top-X strongest signals (entries in the vector) and stored them as > indexes from the raw vector, so the top-3 signals from [10, 3, 6, 12, > 1, 20] are [0, 3, 5]. The query was similar to your "match as many as > possible", just with indexes instead of hashes. > > >- org.apache.lucene.search.DisiPriorityQueue.downHeap (~58% of > > runtime) > > This sounds strange. How large is your queue? 
Object-based priority > queues tend to become slow when they get large (100K+ values). > > > Maybe I could optimize this by implementing a custom query or scorer? > > My plan for a better implementation is to use an autoencoder to produce > a condensed representation of the raw vector for a document. In order > to do so, a network must be trained on (ideally) the full corpus, so it > will require a bootstrap process and will probably work poorly if > incoming vectors differ substantially in nature from the existing ones > (at least until the autoencoder is retrained and the condensed > representations are reindexed). As our domain is an always growing > image collection with fairly defines types of images (portraits, line > drawings, maps...) and since new types are introduced rarely, this is > acceptable for us. > > Back to Lucene, the condensed representation is expected to be a bitmap > where the (coarse) similarity between two representations is simply the > number of set bits at the same locations: An AND and a POPCNT of the > bitmaps. > > Th
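For concreteness, the coarse bitmap similarity Toke describes (an AND plus a POPCNT) boils down to the following minimal sketch, assuming the condensed representations are stored as equal-length long[] bitmaps:

    // Coarse similarity between two condensed representations: the number of
    // bit positions set in both bitmaps.
    static int coarseSimilarity(long[] a, long[] b) {
      int shared = 0;
      for (int i = 0; i < a.length; i++) {
        shared += Long.bitCount(a[i] & b[i]);   // AND, then population count
      }
      return shared;
    }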
Re: Optimizing a boolean query for 100s of term clauses
Hi Tommaso, thanks for the input and links! I'll add your paper to my literature review. So far I've seen very promising results from modifying the TermInSetQuery. It was pretty simple to keep a map of `doc id -> matched term count` and then only evaluate the exact similarity on the top k doc ids. On a small benchmark, I was able to drop the time for 1000 queries from 45 seconds to 14 seconds. Now the bottleneck is back in my own code, which I'm happy with because I can optimize that more easily. Hopefully I can merge these changes in the next couple days, and I'll post the diff when I do. - AK On Thu, Jun 25, 2020 at 5:07 AM Tommaso Teofili wrote: > hi Alex, > > I had worked on a similar problem directly on Lucene (within Anserini > toolkit) using LSH fingerprints of tokenized feature vector values. > You can find code at [1] and some information on the Anserini documentation > page [2] and in a short preprint [3]. > As a side note my current thinking is that it would be very cool if we > could leverage Lucene N dimensional point support by properly reducing the > dimensionality of the original vectors, however that is hard to do without > losing important information. > > My 2 cents, > Tommaso > > [1] : > > https://github.com/castorini/anserini/tree/master/src/main/java/io/anserini/ann > [2] : > > https://github.com/castorini/anserini/blob/master/docs/approximate-nearestneighbor.md > [3] : https://arxiv.org/abs/1910.10208 > > > > > > On Wed, 24 Jun 2020 at 19:47, Alex K wrote: > > > Hi Toke. Indeed a nice coincidence. It's an interesting and fun problem > > space! > > > > My implementation isn't specific to any particular dataset or access > > pattern (i.e. infinite vs. subset). > > So far the plugin supports exact L1, L2, Jaccard, Hamming, and Angular > > similarities with LSH variants for all but L1. > > My exact implementation is generally faster than the approximate LSH > > implementation, hence the thread. > > You make a good point that this is valuable by itself if you're able to > > filter down to a small subset of docs. > > I put a lot of work into optimizing the vector serialization speed and > the > > exact query execution. > > I imagine with my current implementation there is some breaking point > where > > LSH becomes faster than exact, but so far I've tested with ~1.2M > > ~300-dimensional vectors and exact is still faster, especially when > > parallelized across many shards. > > So speeding up LSH is the current engineering challenge. > > > > Are you using Elasticsearch or Lucene directly? > > If you're using ES and have the time, I'd love some feedback on my > plugin. > > It sounds like you want to compute hamming similarity on your bitmaps? > > If so that's currently supported. > > There's an example here: > > http://demo.elastiknn.klibisz.com/dataset/mnist-hamming?queryId=64121 > > > > Also I've compiled a small literature review on some related research > here: > > > > > https://docs.google.com/document/d/14Z7ZKk9dq29bGeDDmBH6Bsy92h7NvlHoiGhbKTB0YJs/edit > > *Fast and Exact NNS in Hamming Space on Full-Text Search Engines* > describes > > some clever tricks to speed up Hamming similarity. > > *Large Scale Image Retrieval with Elasticsearch* describes the idea of > > using the largest absolute magnitude values instead of the full vector. > > Perhaps you've already read them but I figured I'd share. 
> > > > Cheers > > - AK > > > > > > > > On Wed, Jun 24, 2020 at 8:44 AM Toke Eskildsen wrote: > > > > > On Tue, 2020-06-23 at 09:50 -0400, Alex K wrote: > > > > I'm working on an Elasticsearch plugin (using Lucene internally) that > > > > allows users to index numerical vectors and run exact and approximate > > > > k-nearest-neighbors similarity queries. > > > > > > Quite a coincidence. I'm looking into the same thing :-) > > > > > > > 1. When indexing a vector, apply a hash function to it, producing > > > > a set of discrete hashes. Usually there are anywhere from 100 to 1000 > > > > hashes. > > > > > > Is it important to have "infinite" scaling with inverted index or is it > > > acceptable to have a (fast) sequential scan through all documents? If > > > the use case is to combine the nearest neighbour search with other > > > filters, so that the effective search-space is relatively small, you > > > could go directly to computing the Euclidian distance (or whatever you > >
Re: ANN search current state
Hi Mikhail, I'm not sure about the state of ANN in lucene proper. Very interested to see the response from others. I've been doing some work on ANN for an Elasticsearch plugin: http://elastiknn.klibisz.com/ I think it's possible to extract my custom queries and modeling code so that it's elasticsearch-agnostic and can be used directly in Lucene apps. However I'm much more familiar with Elasticsearch's APIs and usage/testing patterns than I am with raw Lucene, so I'd likely need to get some help from the Lucene community. Please LMK if that sounds interesting to anyone. - Alex On Wed, Jul 15, 2020 at 11:11 AM Mikhail wrote: > > Hi, > >I want to incorporate semantic search in my project, which uses > Lucene. I want to use sentence embeddings and ANN (approximate nearest > neighbor) search. I found the related Lucene issues: > https://issues.apache.org/jira/browse/LUCENE-9004 , > https://issues.apache.org/jira/browse/LUCENE-9136 , > https://issues.apache.org/jira/browse/LUCENE-9322 . I see that there > are some related work and related PRs. What is the current state of this > functionality? > > -- > Thanks, > Mikhail > >
Optimizing term-occurrence counting (code included)
Hi all, I am working on a query that takes a set of terms, finds all documents containing at least one of those terms, computes a subset of candidate docs with the most matching terms, and applies a user-provided scoring function to each of the candidate docs.

Simple example of the query:
- query terms ("aaa", "bbb")
- indexed docs with terms:
  docId 0 has terms ("aaa", "bbb")
  docId 1 has terms ("aaa", "ccc")
- number of top candidates = 1
- simple scoring function score(docId) = docId + 10

The query first builds a count array [2, 1], because docId 0 contains two matching terms and docId 1 contains 1 matching term. Then it picks docId 0 as the candidate subset. Then it applies the scoring function, returning a score of 10 for docId 0.

The main bottleneck right now is doing the initial counting, i.e. the part that returns the [2, 1] array.

I first started by using a BoolQuery containing a Should clause for every Term, so the returned score was the count. This was simple but very slow. Then I got a substantial speedup by copying and modifying the TermInSetQuery so that it tracks the number of times each docId contains a query term. The main construct here seems to be PrefixCodedTerms.

At this point I'm not sure if there's any faster construct, or perhaps a more optimal way to use PrefixCodedTerms?

Here is the specific query, highlighting some specific parts of the code:
- Build the PrefixCodedTerms (in my case the terms are called 'hashes'):
https://github.com/alexklibisz/elastiknn/blob/c75b23f/plugin/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L27-L33
- Count the matching terms in a segment (this is the main bottleneck in my query):
https://github.com/alexklibisz/elastiknn/blob/c75b23f/plugin/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L54-L73

I appreciate any suggestions you might have.

- Alex
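For readers following along, the counting step described above looks roughly like the simplified sketch below. It is a stand-in for the linked MatchHashesAndScoreQuery code, not a copy of it: it counts each query term at most once per matching doc and ignores term frequency.

    import java.io.IOException;
    import java.util.Collection;
    import org.apache.lucene.index.LeafReader;
    import org.apache.lucene.index.PostingsEnum;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.util.BytesRef;

    final class TermCounting {
      // For one segment: counts[docId] = number of query terms that occur in docId.
      static short[] countMatches(LeafReader reader, String field, Collection<BytesRef> queryTerms)
          throws IOException {
        short[] counts = new short[reader.maxDoc()];
        Terms terms = reader.terms(field);
        if (terms == null) return counts;           // field absent from this segment
        TermsEnum termsEnum = terms.iterator();
        PostingsEnum postings = null;
        for (BytesRef term : queryTerms) {
          if (!termsEnum.seekExact(term)) continue; // term absent from this segment
          postings = termsEnum.postings(postings, PostingsEnum.NONE);
          for (int doc = postings.nextDoc();
               doc != DocIdSetIterator.NO_MORE_DOCS;
               doc = postings.nextDoc()) {
            counts[doc]++;                          // one increment per matching query term
          }
        }
        return counts;
      }
    }

Sorting the query terms before seeking tends to help, since the terms dictionary is then traversed in order.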
Re: Optimizing term-occurrence counting (code included)
Thanks Ali. I don't think that will work in this case, since the data I'm counting is managed by lucene, but that looks like an interesting project. -Alex On Fri, Jul 24, 2020, 00:15 Ali Akhtar wrote: > I'm new to lucene so I'm not sure what the best way of speeding this up in > Lucene is, but I've previously used https://github.com/npgall/cqengine for > similar stuff. It provided really good performance, especially if you're > just counting things. > > On Fri, Jul 24, 2020 at 6:55 AM Alex K wrote: > > > Hi all, > > > > I am working on a query that takes a set of terms, finds all documents > > containing at least one of those terms, computes a subset of candidate > docs > > with the most matching terms, and applies a user-provided scoring > function > > to each of the candidate docs > > > > Simple example of the query: > > - query terms ("aaa", "bbb") > > - indexed docs with terms: > > docId 0 has terms ("aaa", "bbb") > > docId 1 has terms ("aaa", "ccc") > > - number of top candidates = 1 > > - simple scoring function score(docId) = docId + 10 > > The query first builds a count array [2, 1], because docId 0 contains two > > matching terms and docId 1 contains 1 matching term. > > Then it picks docId 0 as the candidate subset. > > Then it applies the scoring function, returning a score of 10 for docId > 0. > > > > The main bottleneck right now is doing the initial counting, i.e. the > part > > that returns the [2, 1] array. > > > > I first started by using a BoolQuery containing a Should clause for every > > Term, so the returned score was the count. This was simple but very slow. > > Then I got a substantial speedup by copying and modifying the > > TermInSetQuery so that it tracks the number of times each docId contains > a > > query term. The main construct here seems to be PrefixCodedTerms. > > > > At this point I'm not sure if there's any faster construct, or perhaps a > > more optimal way to use PrefixCodedTerms? > > > > Here is the specific query, highlighting some specific parts of the code: > > - Build the PrefixCodedTerms (in my case the terms are called 'hashes'): > > > > > https://github.com/alexklibisz/elastiknn/blob/c75b23f/plugin/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L27-L33 > > - Count the matching terms in a segment (this is the main bottleneck in > my > > query): > > > > > https://github.com/alexklibisz/elastiknn/blob/c75b23f/plugin/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L54-L73 > > > > I appreciate any suggestions you might have. > > > > - Alex > > >
Re: TermsEnum.seekExact degraded performance somewhere between Lucene 7.7.0 and 8.5.1.
Hi, Also have a look here: https://issues.apache.org/jira/plugins/servlet/mobile#issue/LUCENE-9378 Seems it might be related. - Alex On Sun, Jul 26, 2020, 23:31 Trejkaz wrote: > Hi all. > > I've been tracking down slow seeking performance in TermsEnum after > updating to Lucene 8.5.1. > > On 8.5.1: > > SegmentTermsEnum.seekExact: 33,829 ms (70.2%) (remaining time in our > code) > SegmentTermsEnumFrame.loadBlock: 29,104 ms (60.4%) > CompressionAlgorithm$2.read: 25,789 ms (53.5%) > LowercaseAsciiCompression.decompress: 25,789 ms (53.5%) > DataInput.readVInt: 24,690 ms (51.2%) > SegmentTermsEnumFrame.scanToTerm: 2,921 ms (6.1%) > > On 7.7.0 (previous version we were using): > > SegmentTermsEnum.seekExact: 5,897 ms (43.7%) (remaining time in our > code) > SegmentTermsEnumFrame.loadBlock: 3,499 ms (25.9%) > BufferedIndexInput.readBytes: 1,500 ms (11.1%) > DataInput.readVInt: 1,108 (8.2%) > SegmentTermsEnumFrame.scanToTerm: 1,501 ms (11.1%) > > So on the surface it sort of looks like the new version spends less > time scanning and much more time loading blocks to decompress? > > Looking for some clues to what might have changed here, and whether > it's something we can avoid, but currently LUCENE-4702 looks like it > may be related. > > TX > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
Re: Simultaneous Indexing and searching
FWIW, I agree with Michael: this is not a simple problem and there's been a lot of effort in Elasticsearch and Solr to solve it in a robust way. If you can't use ES/solr, I believe there are some posts on the ES blog about how they write/delete/merge shards (Lucene indices). On Tue, Sep 1, 2020 at 11:40 AM Michael Sokolov wrote: > So ... this is a fairly complex topic I can't really cover it in depth > here; how to architect a distributed search engine service. Most > people opt to use Solr or Elasticsearch since they solve that problem > for you. Those systems work best when the indexes are local to the > service that is accessing them, and build systems to distribute data > internally; distributing via NFS is generally not a *good idea* (tm), > although it may work most of the time. In your case, have you > considered building a search service that runs on the same box as your > indexer and responds to queries from the web server(s)? > > On Tue, Sep 1, 2020 at 11:13 AM Richard So > wrote: > > > > Hi there, > > > > I am beginner for using Lucene especially in the area of Indexing and > searching simultaneously. > > > > Our environment is that we have several webserver for the search > front-end that submit search request and also a backend server that do the > full text indexing; whereas the index files are stored in a NFS volume such > that both the indexing and searchs are pointing to this same NFS volume. > The indexing may happen whenever something new documents comes in or get > updated. > > > > Our project requires that both indexing and searching can be happened at > the same time (or the blocking should be as short as possible, e.g. under a > second) > > > > We have search through the Internet and found something like this > references: > > > http://blog.mikemccandless.com/2011/09/lucenes-searchermanager-simplifies.html > > > http://blog.mikemccandless.com/2011/11/near-real-time-readers-with-lucenes.html > > > > but seems those only apply to indexing and search in the same server > (correct me if I am wrong). > > > > Could somebody tell me how to implement such system, e.g. what Lucene > classes to be used and the caveat, or how to setup ,etc? > > > > Regards > > Richard > > > > > > > > > > > > > > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
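For reference, the same-process pattern from the SearcherManager post linked above looks roughly like this. It assumes the IndexWriter and the searchers live in the same JVM (which is Michael's point), and constructor signatures vary slightly across Lucene versions.

    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.SearcherFactory;
    import org.apache.lucene.search.SearcherManager;
    import org.apache.lucene.search.TopDocs;

    final class NrtSearchExample {
      private final SearcherManager manager;

      NrtSearchExample(IndexWriter writer) throws Exception {
        // Near-real-time: readers are opened from the writer, not re-opened from the directory.
        this.manager = new SearcherManager(writer, new SearcherFactory());
      }

      // Call periodically (e.g. once per second) or after a batch of updates
      // to make recent writes visible to new searches.
      void refresh() throws Exception {
        manager.maybeRefresh();
      }

      TopDocs search(Query query, int n) throws Exception {
        IndexSearcher searcher = manager.acquire();
        try {
          return searcher.search(query, n);
        } finally {
          manager.release(searcher); // always release; the manager reference-counts readers
        }
      }
    }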
Re: Optimizing term-occurrence counting (code included)
Hi all, I'm still a bit stuck on this particular issue.I posted an issue on the Elastiknn repo outlining some measurements and thoughts on potential solutions: https://github.com/alexklibisz/elastiknn/issues/160 To restate the question: Is there a known optimal way to find and count docs matching 10s to 100s of terms? It seems the bottleneck is in the PostingsFormat implementation. Perhaps there is a PostingsFormat better suited for this usecase? Thanks, Alex On Fri, Jul 24, 2020 at 7:59 AM Alex K wrote: > Thanks Ali. I don't think that will work in this case, since the data I'm > counting is managed by lucene, but that looks like an interesting project. > -Alex > > On Fri, Jul 24, 2020, 00:15 Ali Akhtar wrote: > >> I'm new to lucene so I'm not sure what the best way of speeding this up in >> Lucene is, but I've previously used https://github.com/npgall/cqengine >> for >> similar stuff. It provided really good performance, especially if you're >> just counting things. >> >> On Fri, Jul 24, 2020 at 6:55 AM Alex K wrote: >> >> > Hi all, >> > >> > I am working on a query that takes a set of terms, finds all documents >> > containing at least one of those terms, computes a subset of candidate >> docs >> > with the most matching terms, and applies a user-provided scoring >> function >> > to each of the candidate docs >> > >> > Simple example of the query: >> > - query terms ("aaa", "bbb") >> > - indexed docs with terms: >> > docId 0 has terms ("aaa", "bbb") >> > docId 1 has terms ("aaa", "ccc") >> > - number of top candidates = 1 >> > - simple scoring function score(docId) = docId + 10 >> > The query first builds a count array [2, 1], because docId 0 contains >> two >> > matching terms and docId 1 contains 1 matching term. >> > Then it picks docId 0 as the candidate subset. >> > Then it applies the scoring function, returning a score of 10 for docId >> 0. >> > >> > The main bottleneck right now is doing the initial counting, i.e. the >> part >> > that returns the [2, 1] array. >> > >> > I first started by using a BoolQuery containing a Should clause for >> every >> > Term, so the returned score was the count. This was simple but very >> slow. >> > Then I got a substantial speedup by copying and modifying the >> > TermInSetQuery so that it tracks the number of times each docId >> contains a >> > query term. The main construct here seems to be PrefixCodedTerms. >> > >> > At this point I'm not sure if there's any faster construct, or perhaps a >> > more optimal way to use PrefixCodedTerms? >> > >> > Here is the specific query, highlighting some specific parts of the >> code: >> > - Build the PrefixCodedTerms (in my case the terms are called 'hashes'): >> > >> > >> https://github.com/alexklibisz/elastiknn/blob/c75b23f/plugin/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L27-L33 >> > - Count the matching terms in a segment (this is the main bottleneck in >> my >> > query): >> > >> > >> https://github.com/alexklibisz/elastiknn/blob/c75b23f/plugin/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L54-L73 >> > >> > I appreciate any suggestions you might have. >> > >> > - Alex >> > >> >
How to access block-max metadata?
Hi all, There was some fairly recent work in Lucene to introduce Block-Max WAND Scoring ( https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf , https://issues.apache.org/jira/browse/LUCENE-8135). I've been working on a use-case where I need very efficient top-k scoring for 100s of query terms (usually between 300 and 600 terms, k between 100 and 1, each term contributes a simple TF-IDF score). There's some discussion here: https://github.com/alexklibisz/elastiknn/issues/160. Now that block-based metadata are presumably available in Lucene, how would I access this metadata? I've read the WANDScorer.java code, but I couldn't quite understand how exactly it is leveraging a block-max codec or block-based statistics. In my own code, I'm exploring some ways to prune low-quality docs, and I figured there might be some block-max metadata that I can access to improve the pruning. I'm iterating over the docs matching each term using the .advance() and .nextDoc() methods on a PostingsEnum. I don't see any block-related methods on the PostingsEnum interface. I feel like I'm missing something.. hopefully something simple! I appreciate any tips or examples! Thanks, Alex
Re: How to access block-max metadata?
Thanks Adrien. Very helpful. The doc for ImpactSource.advanceShallow says it's more efficient than DocIDSetIterator.advance. Is that because advanceShallow is skipping entire blocks at a time, whereas advance is not? One possible optimization I've explored involves skipping pruned docIDs. I tried this using .advance() instead of .nextDoc(), but found the improvement was negligible. I'm thinking maybe advanceShallow() would let me get that speedup. - AK On Mon, Oct 12, 2020 at 2:59 AM Adrien Grand wrote: > Hi Alex, > > The entry point for block-max metadata is TermsEnum#impacts ( > > https://lucene.apache.org/core/8_6_0/core/org/apache/lucene/index/TermsEnum.html#impacts(int) > ) > which returns a view of the postings lists that includes block-max > metadata. In particular, see documentation for ImpactsSource#advanceShallow > and ImpactsSource#getImpacts ( > > https://lucene.apache.org/core/8_6_0/core/org/apache/lucene/index/ImpactsSource.html > ). > > You can look at ImpactsDISI to see how this metadata is leveraged in > practice to turn this metadata into score upper bounds, which is in-turn > used to skip irrelevant documents. > > On Mon, Oct 12, 2020 at 2:45 AM Alex K wrote: > > > Hi all, > > There was some fairly recent work in Lucene to introduce Block-Max WAND > > Scoring ( > > > > > https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf > > , https://issues.apache.org/jira/browse/LUCENE-8135). > > > > I've been working on a use-case where I need very efficient top-k scoring > > for 100s of query terms (usually between 300 and 600 terms, k between 100 > > and 1, each term contributes a simple TF-IDF score). There's some > > discussion here: https://github.com/alexklibisz/elastiknn/issues/160. > > > > Now that block-based metadata are presumably available in Lucene, how > would > > I access this metadata? > > > > I've read the WANDScorer.java code, but I couldn't quite understand how > > exactly it is leveraging a block-max codec or block-based statistics. In > my > > own code, I'm exploring some ways to prune low-quality docs, and I > figured > > there might be some block-max metadata that I can access to improve the > > pruning. I'm iterating over the docs matching each term using the > > .advance() and .nextDoc() methods on a PostingsEnum. I don't see any > > block-related methods on the PostingsEnum interface. I feel like I'm > > missing something.. hopefully something simple! > > > > I appreciate any tips or examples! > > > > Thanks, > > Alex > > > > > -- > Adrien >
Re: How to access block-max metadata?
I see. So I'm most likely rarely skipping a block's worth of docs, so using advance() vs nextDoc() doesn't make much of a difference. All good to know. Thank you. On Mon, Oct 12, 2020 at 11:42 AM Adrien Grand wrote: > advanceShallow is indeed faster than advance because it does less: > advanceShallow only advances the cursor for block-max metadata, this allows > reasoning about maximum scores without actually advancing the doc ID. > advanceShallow is implicitly called via advance. > > If your optimization rarely helps skip entire blocks, then it's expected > that advance doesn't help much over nextDoc. advanceShallow is rarely a > drop-in replacement for advance since it's unable to tell whether a > document matches or not, it can only be used to reason about maximum scores > for a range of doc IDs when combined with ImpactsSource#getImpacts. > > On Mon, Oct 12, 2020 at 5:21 PM Alex K wrote: > > > Thanks Adrien. Very helpful. > > The doc for ImpactSource.advanceShallow says it's more efficient than > > DocIDSetIterator.advance. > > Is that because advanceShallow is skipping entire blocks at a time, > whereas > > advance is not? > > One possible optimization I've explored involves skipping pruned docIDs. > I > > tried this using .advance() instead of .nextDoc(), but found the > > improvement was negligible. I'm thinking maybe advanceShallow() would let > > me get that speedup. > > - AK > > > > On Mon, Oct 12, 2020 at 2:59 AM Adrien Grand wrote: > > > > > Hi Alex, > > > > > > The entry point for block-max metadata is TermsEnum#impacts ( > > > > > > > > > https://lucene.apache.org/core/8_6_0/core/org/apache/lucene/index/TermsEnum.html#impacts(int) > > > ) > > > which returns a view of the postings lists that includes block-max > > > metadata. In particular, see documentation for > > ImpactsSource#advanceShallow > > > and ImpactsSource#getImpacts ( > > > > > > > > > https://lucene.apache.org/core/8_6_0/core/org/apache/lucene/index/ImpactsSource.html > > > ). > > > > > > You can look at ImpactsDISI to see how this metadata is leveraged in > > > practice to turn this metadata into score upper bounds, which is > in-turn > > > used to skip irrelevant documents. > > > > > > On Mon, Oct 12, 2020 at 2:45 AM Alex K wrote: > > > > > > > Hi all, > > > > There was some fairly recent work in Lucene to introduce Block-Max > WAND > > > > Scoring ( > > > > > > > > > > > > > > https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf > > > > , https://issues.apache.org/jira/browse/LUCENE-8135). > > > > > > > > I've been working on a use-case where I need very efficient top-k > > scoring > > > > for 100s of query terms (usually between 300 and 600 terms, k between > > 100 > > > > and 1, each term contributes a simple TF-IDF score). There's some > > > > discussion here: https://github.com/alexklibisz/elastiknn/issues/160 > . > > > > > > > > Now that block-based metadata are presumably available in Lucene, how > > > would > > > > I access this metadata? > > > > > > > > I've read the WANDScorer.java code, but I couldn't quite understand > how > > > > exactly it is leveraging a block-max codec or block-based statistics. > > In > > > my > > > > own code, I'm exploring some ways to prune low-quality docs, and I > > > figured > > > > there might be some block-max metadata that I can access to improve > the > > > > pruning. I'm iterating over the docs matching each term using the > > > > .advance() and .nextDoc() methods on a PostingsEnum. 
I don't see any > > > > block-related methods on the PostingsEnum interface. I feel like I'm > > > > missing something.. hopefully something simple! > > > > > > > > I appreciate any tips or examples! > > > > > > > > Thanks, > > > > Alex > > > > > > > > > > > > > -- > > > Adrien > > > > > > > > -- > Adrien >
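To make the entry points above concrete, here is a minimal sketch (not taken from the thread) of reading block-max impacts for one term. It assumes a Lucene 8.x+ index; the field and term are placeholders.

    import java.io.IOException;
    import org.apache.lucene.index.Impact;
    import org.apache.lucene.index.Impacts;
    import org.apache.lucene.index.ImpactsEnum;
    import org.apache.lucene.index.LeafReader;
    import org.apache.lucene.index.PostingsEnum;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;

    final class ImpactsExample {
      // Max term frequency among the competitive (freq, norm) pairs recorded for the
      // postings at/after targetDoc, or -1 if the term is absent from the segment.
      static int maxFreqNear(LeafReader reader, String field, BytesRef term, int targetDoc)
          throws IOException {
        Terms terms = reader.terms(field);
        if (terms == null) return -1;
        TermsEnum termsEnum = terms.iterator();
        if (!termsEnum.seekExact(term)) return -1;
        ImpactsEnum impactsEnum = termsEnum.impacts(PostingsEnum.FREQS);
        impactsEnum.advanceShallow(targetDoc); // moves the block metadata only, not the doc cursor
        Impacts impacts = impactsEnum.getImpacts();
        int maxFreq = 0;
        // Level 0 covers the smallest doc id range; higher levels cover progressively larger ranges.
        for (Impact impact : impacts.getImpacts(0)) {
          maxFreq = Math.max(maxFreq, impact.freq);
        }
        return maxFreq;
      }
    }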
Re: Lucene/Solr and BERT
There were a couple additions recently merged into lucene but not yet released: - A first-class vector codec - An implementation of HNSW for approximate nearest neighbor search They are however available in the snapshot releases. I started on a small project to get the HNSW implementation into the ann-benchmarks project, but had to set it aside. Here's the code: https://github.com/alexklibisz/ann-benchmarks-lucene. There are some test suites that index and search Glove vectors. My first impression was that indexing seems surprisingly slow, but it's entirely possible I'm doing something wrong. On Wed, Apr 21, 2021 at 9:31 AM Michael Wechner wrote: > Hi > > I recently found the following articles re Lucene/Solr and BERT > > https://dmitry-kan.medium.com/neural-search-with-bert-and-solr-ea5ead060b28 > > https://medium.com/swlh/fun-with-apache-lucene-and-bert-embeddings-c2c496baa559 > > and would like to ask whether there might be more recent developments > within the Lucene/Solr community re BERT integration? > > Also how these developments relate to > > https://sbert.net/ > > ? > > Thanks very much for your insights! > > Michael > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
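For readers arriving later: once these features shipped in Lucene 9.x, the entry points look roughly like the sketch below. This is an illustration rather than the snapshot API discussed above, and class names have shifted between releases (KnnVectorField was later renamed to KnnFloatVectorField, for example).

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.KnnVectorField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.VectorSimilarityFunction;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.KnnVectorQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class KnnExample {
      public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
          Document doc = new Document();
          // One fixed-dimension float vector per document; the HNSW graph is built per segment.
          doc.add(new KnnVectorField("vec", new float[] {0.1f, 0.2f, 0.3f},
              VectorSimilarityFunction.EUCLIDEAN));
          writer.addDocument(doc);
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
          IndexSearcher searcher = new IndexSearcher(reader);
          // Approximate k-nearest-neighbor search over the "vec" field.
          TopDocs top = searcher.search(
              new KnnVectorQuery("vec", new float[] {0.1f, 0.2f, 0.3f}, 10), 10);
          System.out.println("hits: " + top.totalHits);
        }
      }
    }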
Re: Lucene/Solr and BERT
Hi Michael and others, Sorry just now getting back to you. For your three original questions: - Yes, I was referring to the Lucene90Hnsw* classes. Michael S. had a thorough response. - As far as I know Opendistro is calling out to a C/C++ binary to run the actual HNSW algorithm and store the HNSW part of the index. When they implemented it about a year ago, Lucene did not have this yet. I assume the Lucene HNSW implementation is solid, but would not be surprised if it's slower than the C/C++ based implementation, given the JVM has some disadvantages for these kinds of CPU-bound/number crunching algos. - I just haven't had much time to invest into my benchmark recently. In particular, I got stuck on why indexing was taking extremely long. Just indexing the vectors would have easily exceeded the current time limitations in the ANN-benchmarks project. Maybe I had some naive mistake in my implementation, but I profiled and dug pretty deep to make it fast. I'm assuming you want to use Lucene, but not necessarily via Elasticsearch? If so, another option you might try for ANN is the elastiknn-models and elastiknn-lucene packages. elastiknn-models contains the Locality Sensitive Hashing implementations of ANN used by Elastiknn, and elastiknn-lucene contains the Lucene queries used by Elastiknn.The Lucene query is the MatchHashesAndScoreQuery <https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L18-L22>. There are a couple of scala test suites that show how to use it: MatchHashesAndScoreQuerySuite <https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-testing/src/test/scala/com/klibisz/elastiknn/query/MatchHashesAndScoreQuerySuite.scala>. MatchHashesAndScoreQueryPerformanceSuite <https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-testing/src/test/scala/com/klibisz/elastiknn/query/MatchHashesAndScoreQueryPerformanceSuite.scala>. This is all designed to work independently from Elasticsearch and is published on Maven: com.klibisz.elastiknn / lucene <https://search.maven.org/artifact/com.klibisz.elastiknn/lucene/7.12.1.0/jar> and com.klibisz.elastiknn / models <https://search.maven.org/artifact/com.klibisz.elastiknn/models/7.12.1.0/jar>. The tests are Scala but all of the implementation is in Java. Thanks, Alex On Mon, May 24, 2021 at 3:06 AM Michael Wechner wrote: > Hi Russ > > I would like to use it for detecting duplicated questions, whereas I am > currently using the project sbert.net you mention below to do the > embedding with a size of 768 for indexing and querying. 
> > sbert has an example listed using "util.pytorch_cos_sim(A,B) as a > brute-force approach > > https://sbert.net/docs/usage/semantic_textual_similarity.html > > and "paraphrase mining" approach for larger document collections > > https://sbert.net/examples/applications/paraphrase-mining/README.html > > Re the Lucene ANN implementation(s) I think it would be very interesting > to participate in the ANN benchmarking challenge which Julie mentioned > on the dev list > > > http://mail-archives.apache.org/mod_mbox/lucene-dev/202105.mbox/%3CCAKDq9%3D4rSuuczoK%2BcVg_N6Lwvh42E%2BXUoSGQ6m7BgqzuDvACew%40mail.gmail.com%3E > > > https://medium.com/big-ann-benchmarks/neurips-2021-announcement-the-billion-scale-approximate-nearest-neighbor-search-challenge-72858f768f69 > > Thanks > > Michael > > > > Am 24.05.21 um 05:31 schrieb Russell Jurney: > > For practical search using BERT on any reasonable sized dataset, they're > > going to need ANN, which Lucene recently added. This won't work in > practice > > if the query and document are of a different size, which is where > sentence > > transformers see a lot of use for documents up to 500 words. > > > > https://issues.apache.org/jira/plugins/servlet/mobile#issue/LUCENE-9004 > > > > https://github.com/UKPLab/sentence-transformers > > > > Russ > > > > On Sun, May 23, 2021 at 8:23 PM Michael Sokolov > wrote: > > > >> Hi Michael, that is fully-functional in the sense that Lucene will > >> build an HNSW graph for a vector-valued field and you can then use the > >> VectorReader.search method to do KNN-based search. Next steps may > >> include some integration with lexical, inverted-index type search so > >> that you can retrieve N-closest constrained by other constraints. > >> Today you can approximate that by oversampling and filtering. There is > >> also interest in pursuing other KNN search algorithms, and we have > >> been working to make sure the VectorFormat API (might still get > >> renamed due to confusion with other kinds of vectors existing in > >>
Re: Lucene/Solr and BERT
Thanks Michael. IIRC, the thing that was taking so long was merging into a single segment. Is there already benchmarking code for HNSW available somewhere? I feel like I remember someone posting benchmarking results on one of the Jira tickets. Thanks, Alex On Wed, May 26, 2021 at 3:41 PM Michael Sokolov wrote: > This java implementation will be slower than the C implementation. I > believe the algorithm is essentially the same, however this is new and > there may be bugs! I (and I think Julie had similar results IIRC) > measured something like 8x slower than hnswlib (using ann-benchmarks). > It is also surprising (to me) though how this varies with > differently-learned vectors so YMMV. I still think there is value > here, and look forward to improved performance, especially as JDK16 > has some improved support for vectorized instructions. > > Please also understand that the HNSW algorithm interacts with Lucene's > segmented architecture in a tricky way. Because we built a graph > *per-segment* when flushing/merging, these must be rebuilt whenever > segments are merged. So your indexing performance can be heavily > influenced by how often you flush, as well as by your merge policy > settings. Also, when searching, there is a bigger than usual benefit > for searching across fewer segments, since the cost of searching an > HNSW graph scales more or less with log N (so searching a single large > graph is cheaper than searching the same documents divided among > smaller graphs). So I do recommend using a multithreaded collector in > order to get best latency with HNSW-based search. To get the best > indexing, and searching, performance, you should generally index as > large a number of documents as possible before flushing. > > -Mike > > On Wed, May 26, 2021 at 9:43 AM Michael Wechner > wrote: > > > > Hi Alex > > > > Thank you very much for your feedback and the various insights! > > > > Am 26.05.21 um 04:41 schrieb Alex K: > > > Hi Michael and others, > > > > > > Sorry just now getting back to you. For your three original questions: > > > > > > - Yes, I was referring to the Lucene90Hnsw* classes. Michael S. had a > > > thorough response. > > > - As far as I know Opendistro is calling out to a C/C++ binary to run > the > > > actual HNSW algorithm and store the HNSW part of the index. When they > > > implemented it about a year ago, Lucene did not have this yet. I > assume the > > > Lucene HNSW implementation is solid, but would not be surprised if it's > > > slower than the C/C++ based implementation, given the JVM has some > > > disadvantages for these kinds of CPU-bound/number crunching algos. > > > - I just haven't had much time to invest into my benchmark recently. In > > > particular, I got stuck on why indexing was taking extremely long. Just > > > indexing the vectors would have easily exceeded the current time > > > limitations in the ANN-benchmarks project. Maybe I had some naive > mistake > > > in my implementation, but I profiled and dug pretty deep to make it > fast. > > > > I am trying to get Julie's branch running > > > > https://github.com/jtibshirani/lucene/tree/hnsw-bench > > > > Maybe this will help and is comparable > > > > > > > > > > I'm assuming you want to use Lucene, but not necessarily via > Elasticsearch? > > > > Yes, for more simple setups I would like to use Lucene standalone, but > > for setups which have to scale I would use either Elasticsearch or Solr. 
> > > > Thanks > > > > Michael > > > > > > > > > If so, another option you might try for ANN is the elastiknn-models > > > and elastiknn-lucene packages. elastiknn-models contains the Locality > > > Sensitive Hashing implementations of ANN used by Elastiknn, and > > > elastiknn-lucene contains the Lucene queries used by Elastiknn.The > Lucene > > > query is the MatchHashesAndScoreQuery > > > < > https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L18-L22 > >. > > > There are a couple of scala test suites that show how to use it: > > > MatchHashesAndScoreQuerySuite > > > < > https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-testing/src/test/scala/com/klibisz/elastiknn/query/MatchHashesAndScoreQuerySuite.scala > >. > > > MatchHashesAndScoreQueryPerformanceSuite > > > < > https://github.com/alexklibisz/elastiknn/blob/master/ela
Does Lucene have anything like a covering index as an alternative to DocValues?
Hi all, I am curious if there is anything in Lucene that resembles a covering index (from the relational database world) as an alternative to DocValues for commonly-accessed values? Consider the following use-case: I'm indexing docs in a Lucene index. Each doc has some terms, which are not stored. Each doc also has a UUID corresponding to some other system, which is stored using DocValues. When I run a query, I get back the TopDocs and use the doc ID to fetch the UUID from DocValues. I know that I will *always* need to go fetch this UUID. Is there any way to have the UUID stored in the actual index, rather than using DocValues? Thanks in advance for any tips Alex Klibisz
Control the number of segments without using forceMerge.
Hi all, I'm trying to figure out if there is a way to control the number of segments in an index without explicitly calling forceMerge. My use-case looks like this: I need to index a static dataset of ~1 billion documents. I know the exact number of docs before indexing starts. I know the VM where this index is searched has 64 threads. I'd like to end up with exactly 64 segments, so I can search them in a parallelized fashion. I know that I could call forceMerge(64), but this takes an extremely long time. Is there a straightforward way to ensure that I end up with 64 segments without force-merging after adding all of the documents? Thanks in advance for any tips Alex Klibisz
Re: Does Lucene have anything like a covering index as an alternative to DocValues?
Hi Uwe, Thanks for clarifying. That makes sense. Thanks, Alex Klibisz On Mon, Jul 5, 2021 at 9:22 AM Uwe Schindler wrote: > Hi, > > Sorry I misunderstood you question, you want to lookup the UUID in another > system! > Then the approach you are doing is correct. Either store as stored field > or as docvalue. An inverted index cannot store additional data, because it > *is* inverted, it is focused around *terms* not documents. The posting list > of each term can only store internal, numeric lucene doc ids. Those have > then to be used to lookup the actual contents from e.g. stored fields > (possibility A) or DocValues (possibility B). We can't store UUIDs in the > highly compressed posting list. > > Uwe > > - > Uwe Schindler > Achterdiek 19, D-28357 Bremen > https://www.thetaphi.de > eMail: u...@thetaphi.de > > > -Original Message- > > From: Uwe Schindler > > Sent: Monday, July 5, 2021 3:10 PM > > To: java-user@lucene.apache.org > > Subject: RE: Does Lucene have anything like a covering index as an > alternative > > to DocValues? > > > > You need to index the UUID as a standard indexed StringField. Then you > can do > > a lookup using TermQuery. That's how all systems like Solr or > Elasticsearch > > handle document identifiers. > > > > DocValues are for facetting and sorting, but looking up by ID is a > typical use > > case for an inverted index. If you still need to store it as DocValues > field, just > > add it with both types. > > > > Uwe > > > > - > > Uwe Schindler > > Achterdiek 19, D-28357 Bremen > > https://www.thetaphi.de > > eMail: u...@thetaphi.de > > > > > -Original Message- > > > From: Alex K > > > Sent: Monday, July 5, 2021 2:30 AM > > > To: java-user@lucene.apache.org > > > Subject: Does Lucene have anything like a covering index as an > alternative to > > > DocValues? > > > > > > Hi all, > > > > > > I am curious if there is anything in Lucene that resembles a covering > index > > > (from the relational database world) as an alternative to DocValues for > > > commonly-accessed values? > > > > > > Consider the following use-case: I'm indexing docs in a Lucene index. > Each > > > doc has some terms, which are not stored. Each doc also has a UUID > > > corresponding to some other system, which is stored using DocValues. > When I > > > run a query, I get back the TopDocs and use the doc ID to fetch the > UUID > > > from DocValues. I know that I will *always* need to go fetch this > UUID. Is > > > there any way to have the UUID stored in the actual index, rather than > > > using DocValues? > > > > > > Thanks in advance for any tips > > > > > > Alex Klibisz > > > > > > - > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > > For additional commands, e-mail: java-user-h...@lucene.apache.org > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
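A minimal sketch of the stored-field approach described above (Uwe's "possibility A"); the field name "uuid" is only an example.

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    final class UuidFieldExample {
      static Document buildDoc(String uuid) {
        Document doc = new Document();
        // Indexed as a single keyword term (for TermQuery lookups) and stored,
        // so the value can be read straight off a hit without DocValues.
        doc.add(new StringField("uuid", uuid, Field.Store.YES));
        // ... add the other, unstored term fields here ...
        return doc;
      }

      // After a search, read the stored UUID for each hit.
      static void printUuids(IndexSearcher searcher, TopDocs topDocs) throws IOException {
        for (ScoreDoc hit : topDocs.scoreDocs) {
          System.out.println(hit.doc + " -> " + searcher.doc(hit.doc).get("uuid"));
        }
      }
    }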
Re: Control the number of segments without using forceMerge.
Ok, so it sounds like if you want a very specific number of segments you have to do a forceMerge at some point? Is there some simple summary on how segments are formed in the first place? Something like, "one segment is created every time you flush from an IndexWriter"? Based on some experimenting and reading the code, it seems to be quite complicated, especially once you start calling addDocument from several threads in parallel. It's good to learn about the MultiReader. I'll look into that some more. Thanks, Alex On Mon, Jul 5, 2021 at 9:14 AM Uwe Schindler wrote: > If you want an exact number of segments, create 64 indexes, each > forceMerged to one segment. > After that use MultiReader to create a view on all separate indexes. > MultiReaders's contents are always flattened to a list of those 64 indexes. > > But keep in mind that this should only ever be done with *static* indexes. > As soon as you have updates, this is a bad idea (forceMerge in general) and > also splitting indexes like this. Parallelization should normally come from > multiple queries running in parallel, but you shouldn't force Lucene to run > a single query over so many indexes. > > Uwe > > - > Uwe Schindler > Achterdiek 19, D-28357 Bremen > https://www.thetaphi.de > eMail: u...@thetaphi.de > > > -Original Message- > > From: Alex K > > Sent: Monday, July 5, 2021 4:04 AM > > To: java-user@lucene.apache.org > > Subject: Control the number of segments without using forceMerge. > > > > Hi all, > > > > I'm trying to figure out if there is a way to control the number of > > segments in an index without explicitly calling forceMerge. > > > > My use-case looks like this: I need to index a static dataset of ~1 > > billion documents. I know the exact number of docs before indexing > starts. > > I know the VM where this index is searched has 64 threads. I'd like to > end > > up with exactly 64 segments, so I can search them in a parallelized > fashion. > > > > I know that I could call forceMerge(64), but this takes an extremely long > > time. > > > > Is there a straightforward way to ensure that I end up with 64 threads > > without force-merging after adding all of the documents? > > > > Thanks in advance for any tips > > > > Alex Klibisz > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
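A minimal sketch of the MultiReader suggestion above, assuming a static dataset already split into 64 single-segment indexes on disk (the index-0 ... index-63 layout is hypothetical); executor shutdown and error handling are omitted.

    import java.nio.file.Paths;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    final class MultiReaderExample {
      // Open N static indexes (index-0 ... index-N-1) as one logical index.
      static IndexSearcher openSearcher(String baseDir, int numIndexes) throws Exception {
        IndexReader[] readers = new IndexReader[numIndexes];
        for (int i = 0; i < numIndexes; i++) {
          readers[i] = DirectoryReader.open(FSDirectory.open(Paths.get(baseDir, "index-" + i)));
        }
        IndexReader combined = new MultiReader(readers);  // flat view over all sub-indexes
        // With an executor, IndexSearcher can search the leaves in parallel.
        ExecutorService executor = Executors.newFixedThreadPool(numIndexes);
        return new IndexSearcher(combined, executor);
      }
    }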
Re: Control the number of segments without using forceMerge.
After some more reading, the NoMergePolicy seems to mostly solve my problem. I've configured my IndexWriterConfig with: .setMaxBufferedDocs(Integer.MAX_VALUE) .setRAMBufferSizeMB(Double.MAX_VALUE) .setMergePolicy(NoMergePolicy.INSTANCE) With this config I consistently end up with a number of segments that is a multiple of the number of processors on the indexing VM. I don't have to force merge at all. This also makes the indexing job faster overall. I think I was previously confused by the behavior of the ConcurrentMergeScheduler. I'm sure it's great for most use-cases, but I really need to just move as many docs as possible as fast as possible to a predictable number of segments, so the NoMergePolicy seems to be a good choice for my use-case. Also, I learned a lot from Uwe's recent talk at Berlin Buzzwords <https://2021.berlinbuzzwords.de/sites/berlinbuzzwords.de/files/2021-06/The%20future%20of%20Lucene%27s%20MMapDirectory.pdf>, and his great post about MMapDirectory from a few years ago <https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html>. Definitely recommended for others. Thanks, Alex On Mon, Jul 5, 2021 at 1:53 PM Alex K wrote: > Ok, so it sounds like if you want a very specific number of segments you > have to do a forceMerge at some point? > > Is there some simple summary on how segments are formed in the first > place? Something like, "one segment is created every time you flush from an > IndexWriter"? Based on some experimenting and reading the code, it seems to > be quite complicated, especially once you start calling addDocument from > several threads in parallel. > > It's good to learn about the MultiReader. I'll look into that some more. > > Thanks, > Alex > > On Mon, Jul 5, 2021 at 9:14 AM Uwe Schindler wrote: > >> If you want an exact number of segments, create 64 indexes, each >> forceMerged to one segment. >> After that use MultiReader to create a view on all separate indexes. >> MultiReaders's contents are always flattened to a list of those 64 indexes. >> >> But keep in mind that this should only ever be done with *static* >> indexes. As soon as you have updates, this is a bad idea (forceMerge in >> general) and also splitting indexes like this. Parallelization should >> normally come from multiple queries running in parallel, but you shouldn't >> force Lucene to run a single query over so many indexes. >> >> Uwe >> >> - >> Uwe Schindler >> Achterdiek 19, D-28357 Bremen >> https://www.thetaphi.de >> eMail: u...@thetaphi.de >> >> > -Original Message- >> > From: Alex K >> > Sent: Monday, July 5, 2021 4:04 AM >> > To: java-user@lucene.apache.org >> > Subject: Control the number of segments without using forceMerge. >> > >> > Hi all, >> > >> > I'm trying to figure out if there is a way to control the number of >> > segments in an index without explicitly calling forceMerge. >> > >> > My use-case looks like this: I need to index a static dataset of ~1 >> > billion documents. I know the exact number of docs before indexing >> starts. >> > I know the VM where this index is searched has 64 threads. I'd like to >> end >> > up with exactly 64 segments, so I can search them in a parallelized >> fashion. >> > >> > I know that I could call forceMerge(64), but this takes an extremely >> long >> > time. >> > >> > Is there a straightforward way to ensure that I end up with 64 threads >> > without force-merging after adding all of the documents? 
>> > >> > Thanks in advance for any tips >> > >> > Alex Klibisz >> >> >> - >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org >> For additional commands, e-mail: java-user-h...@lucene.apache.org >> >>
Using setIndexSort on a binary field
Hi all, Could someone point me to an example of using the IndexWriterConfig.setIndexSort for a field containing binary values? To be specific, the fields are constructed using the Field(String name, byte[] value, IndexableFieldType type) constructor, and I'd like to try using the java.util.Arrays.compareUnsigned method to sort the fields. Thanks, Alex
Re: Using setIndexSort on a binary field
Thanks Adrien. This makes me think I might not be understanding the use case for index sorting correctly. I basically want to make it so that my terms are sorted across segments. For example, let's say I have integer terms 1 to 100 and 10 segments. I'd like terms 1 to 10 to occur in segment 1, terms 11 to 20 in segment 2, terms 21 to 30 in segment 3, and so on. With default indexing settings, I see terms duplicated across segments. I thought index sorting was the way to achieve this, but the use of doc values makes me think it might actually be used for something else? Is something like what I described possible? Any clarification would be great. Thanks, Alex On Fri, Oct 15, 2021 at 12:43 PM Adrien Grand wrote: > Hi Alex, > > You need to use a BinaryDocValuesField so that the field is indexed with > doc values. > > `Field` is not going to work because it only indexes the data while index > sorting requires doc values. > > On Fri, Oct 15, 2021 at 6:40 PM Alex K wrote: > > > Hi all, > > > > Could someone point me to an example of using the > > IndexWriterConfig.setIndexSort for a field containing binary values? > > > > To be specific, the fields are constructed using the Field(String name, > > byte[] value, IndexableFieldType type) constructor, and I'd like to try > > using the java.util.Arrays.compareUnsigned method to sort the fields. > > > > Thanks, > > Alex > > > > > -- > Adrien >
Re: Using setIndexSort on a binary field
Thanks Michael. Totally agree this is a contrived setup. It's mostly for benchmarking purposes right now. I was actually able to rephrase my problem in a way that made more sense for the existing setIndexSort API using float doc values and saw an appreciable speedup in searches. The IndexRearranger is also good to know about. Cheers, Alex On Sun, Oct 17, 2021 at 9:32 AM Michael Sokolov wrote: > Yeah, index sorting doesn't do that -- it sorts *within* each segment > so that when documents are iterated (within that segment) by any of > the many DocIdSetIterators that underlie the Lucene search API, they > are retrieved in the order specified (which is then also docid order). > > To achieve what you want you would have to tightly control the > indexing process. For example you could configure a NoMergePolicy to > prevent the segments you manually create from being merged, set a very > large RAM buffer size on the index writer so it doesn't unexpectedly > flush a segment while you're indexing, and then index documents in the > sequence you want to group them by, committing after each block of > documents. But this is a very artificial setup; it wouldn't survive > any normal indexing workflow where merges are allowed, documents may > be updated, etc. > > For testing purposes we've recently added the ability to rearrange the > index (IndexRearranger) according to a specific assignment of docids > to segments - you could apply this to an existing index. But again, > this is not really intended for use in a production on-line index that > receives updates. > > On Fri, Oct 15, 2021 at 1:27 PM Alex K wrote: > > > > Thanks Adrien. This makes me think I might not be understanding the use > > case for index sorting correctly. I basically want to make it so that my > > terms are sorted across segments. For example, let's say I have integer > > terms 1 to 100 and 10 segments. I'd like terms 1 to 10 to occur in > segment > > 1, terms 11 to 20 in segment 2, terms 21 to 30 in segment 3, and so on. > > With default indexing settings, I see terms duplicated across segments. I > > thought index sorting was the way to achieve this, but the use of doc > > values makes me think it might actually be used for something else? Is > > something like what I described possible? Any clarification would be > great. > > Thanks, > > Alex > > > > > > On Fri, Oct 15, 2021 at 12:43 PM Adrien Grand wrote: > > > > > Hi Alex, > > > > > > You need to use a BinaryDocValuesField so that the field is indexed > with > > > doc values. > > > > > > `Field` is not going to work because it only indexes the data while > index > > > sorting requires doc values. > > > > > > On Fri, Oct 15, 2021 at 6:40 PM Alex K wrote: > > > > > > > Hi all, > > > > > > > > Could someone point me to an example of using the > > > > IndexWriterConfig.setIndexSort for a field containing binary values? > > > > > > > > To be specific, the fields are constructed using the Field(String > name, > > > > byte[] value, IndexableFieldType type) constructor, and I'd like to > try > > > > using the java.util.Arrays.compareUnsigned method to sort the fields. > > > > > > > > Thanks, > > > > Alex > > > > > > > > > > > > > -- > > > Adrien > > > > > - > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >
lucene source code changes
Hello, I have a need to implement a custom inverted index in Lucene. I have files like the ones I have attached here. The files have words and scores assigned to each word. There will be 100's of such files. Each file will have at least 5 such name-value pairs. Note: Currently the file only shows 10s of such name-value pairs. But my real production data will have 5 plus name-value pairs per file. Currently I index the data using Lucene's inverted index. The query that is being executed against the index has 100 words. When the query is executed against the index the result is returned in 100 milliseconds or so. Problem: Once I have the results of the query, I have to go through each file (for ex. attached file one). Then for each word in the user input query, I have to compute the total score. Doing this against 100's of files and 100's of keywords is causing the score computation to be slow, i.e. about 3-5 seconds. I need help resolving the above problem so that score computation takes less than 200 milliseconds or so. One resolution I was thinking of is modifying the Lucene source code for creating the inverted index. In this index we store the score in the index itself. When the results of the query are returned, we will get the scores along with the file names, thereby eliminating the need to search the file for the keyword and corresponding score. I need to compute the total of all scores that belong to one single file. I am also open to any other ideas that you may have. Any suggestions regarding this will be very helpful. Thanks, Abhilasha - To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
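One way to "store the score in the index itself" is a per-term payload that is read back and summed at query time. The sketch below shows only the read side and is written against a current Lucene API rather than the version in use at the time; it assumes each word was indexed with a 4-byte float payload on its first position (which requires a custom TokenFilter setting a PayloadAttribute, not shown), and all names are placeholders.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import org.apache.lucene.index.LeafReader;
    import org.apache.lucene.index.PostingsEnum;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.util.BytesRef;

    final class PayloadScoreSum {
      // Adds each query word's stored score into totals[docId], for one index segment.
      static void accumulate(LeafReader reader, String field, BytesRef[] queryWords, float[] totals)
          throws IOException {
        Terms terms = reader.terms(field);
        if (terms == null) return;
        TermsEnum termsEnum = terms.iterator();
        PostingsEnum postings = null;
        for (BytesRef word : queryWords) {
          if (!termsEnum.seekExact(word)) continue;
          postings = termsEnum.postings(postings, PostingsEnum.PAYLOADS);
          for (int doc = postings.nextDoc();
               doc != DocIdSetIterator.NO_MORE_DOCS;
               doc = postings.nextDoc()) {
            postings.nextPosition();                 // payloads are attached to positions
            BytesRef payload = postings.getPayload();
            if (payload != null) {
              totals[doc] += ByteBuffer.wrap(payload.bytes, payload.offset, payload.length).getFloat();
            }
          }
        }
      }
    }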
cannot retrieve the values of a field that is not stored in the index
Hi, Is there a way I can retrieve the value of a field that is not stored in the index?

    private static void indexFile(IndexWriter writer, File f) throws IOException {
        if (f.isHidden() || !f.exists() || !f.canRead()) {
            return;
        }
        System.out.println("Indexing " + f.getCanonicalPath());
        Document doc = new Document();
        // add contents of file
        FileReader fr = new FileReader(f);
        doc.add(new Field("contents", fr));
        // add a second field which contains the path of the file
        doc.add(new Field("path", f.getCanonicalPath(), Field.Store.NO, Field.Index.NOT_ANALYZED));
        // note: the document still needs to be added to the writer
        writer.addDocument(doc);
    }

Is there a way I can access the value of the field "path" from the document hits? Thanks, a
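A field that is not stored cannot be read back from search hits; only its indexed terms remain in the index. If re-indexing is an option, storing the path is enough. Here is a minimal sketch using the current field API (Field.Store.YES plays the same role in the older constructor used above); names are illustrative.

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;

    final class StoredPathExample {
      // Indexing: keep the path both indexed (as a single, un-analyzed token) and stored.
      static void addPath(Document doc, String path) {
        doc.add(new StringField("path", path, Field.Store.YES));
      }

      // Searching: stored values can be read back from each hit.
      static void printPaths(IndexSearcher searcher, Query query) throws Exception {
        TopDocs topDocs = searcher.search(query, 10);
        for (ScoreDoc sd : topDocs.scoreDocs) {
          System.out.println(searcher.doc(sd.doc).get("path")); // null if the field was not stored
        }
      }
    }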
Lucene sorting case-sensitive by default?
Hi All, I was searching my index with sorting on a field called "Label" which is not tokenized, here is what came back:

Extended Sites Catalog Asset Store
Extended Sites Catalog Asset Store SALES
Print Catalog 2
Print catalog test
Test Print Catalog
Test refresh catalog
print test 3
test catalog 1

Looks like Lucene is separating upper case and lower case while sorting. Can someone shed some light on this as to why this is happening and how to fix it? Thanks in advance for your help! Alex
RE: Lucene sorting case-sensitive by default?
Thanks everyone for your replies! Guess I did not fully understand the meaning of "natural order" in the Lucene Java doc. To add another all-lower-case field for each sortable field in my index is a little too much, since the app requires sorting on pretty much all fields (over 100). Toke, you mentioned "Using a Collator works but does take a fair amount of memory", can you please elaborate a little more on that. Thanks. Alex -Original Message- From: Toke Eskildsen [mailto:[EMAIL PROTECTED] Sent: Monday, January 14, 2008 3:13 AM To: java-user@lucene.apache.org Subject: Re: Lucene sorting case-sensitive by default? On Fri, 2008-01-11 at 11:40 -0500, Alex Wang wrote: > Looks like Lucene is separating upper case and lower case while sorting. As Tom points out, default sorting uses natural order. It's worth noting that this implies that default sorting does not produce usable results as soon as you use non-ASCII characters. Using a Collator works but does take a fair amount of memory. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Lucene sorting case-sensitive by default?
Thanks a lot Erik for the great tip! I do need to display all the fields and allow the users to sort by each field as they wish. My index is currently about 200 mb. Your suggestion about storing (but not index) the cased version, and indexing (but not store) the lower-case version is an excellent solution for me. Is it possible to do it in the same field or do I have to do it in 2 separate fields? If I do it in one field, what are the Lucene class/methods I need to overwrite? Thanks again for your help! Alex This message may contain confidential and/or privileged information. If you are not the addressee or authorized to receive this for the addressee, you must not use, copy, disclose, or take any action based on this message or any information herein. If you have received this message in error, please advise the sender immediately by reply e-mail and delete this message. Thank you for your cooperation. -Original Message- From: Erick Erickson [mailto:[EMAIL PROTECTED] Sent: Monday, January 14, 2008 11:24 AM To: java-user@lucene.apache.org Subject: Re: Lucene sorting case-sensitive by default? Several things: 1> do you need to display all the fields? Would just storing them lower-case work? The only time I've needed to store fields case- sensitive is when I'm showing them to the user. If the user is just searching on them, I can store them any way I want and she'll never know. 2> You might very well be surprised at how little extra it takes to index (but not store) the lower-case version. How big is your index anyway? And be warned that the size increase is not linear, so just comparing the index sizes for, say, 10 document is misleading. If your index is 10M, there's no reason at all not to store twice. If it's 10G 3> You could store (but not index) the cased version. You could index (but not store) the lower-case version. The total size of your index is (I believe) about the same as indexing AND storing the fields. That gives you a way to search caselessly and display case-sensitively. Best Erick On Jan 14, 2008 10:58 AM, Alex Wang <[EMAIL PROTECTED]> wrote: > Thanks everyone for your replies! Guess I did not fully understand the > meaning of "natural order" in the Lucene Java doc. > > To add another all-lower-case field for each sortable field in my index > is a little too much, since the app requires sorting on pretty much all > fields (over 100). > > Toke, you mentioned "Using a Collator works but does take a fair amount > of memory", can you please elaborate a little more on that. Thanks. > > Alex > > -Original Message- > From: Toke Eskildsen [mailto:[EMAIL PROTECTED] > Sent: Monday, January 14, 2008 3:13 AM > To: java-user@lucene.apache.org > Subject: Re: Lucene sorting case-sensitive by default? > > On Fri, 2008-01-11 at 11:40 -0500, Alex Wang wrote: > > Looks like Lucene is separating upper case and lower case while > sorting. > > As Tom points out, default sorting uses natural order. It's worth noting > that this implies that default sorting does not produce usable results > as soon as you use non-ASCII characters. Using a Collator works but does > take a fair amount of memory. > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: Lucene sorting case-sensitive by default?
No problem Erick. Thanks for clarifying it. Alex -Original Message- From: Erick Erickson [mailto:[EMAIL PROTECTED] Sent: Monday, January 14, 2008 12:35 PM To: java-user@lucene.apache.org Subject: Re: Lucene sorting case-sensitive by default? Sorry, I was confused about this for the longest time (and it shows!). You don't actually have to store two separate fields. Field.Store.YES stores the input exactly as is, without passing it through anything. So you really only have to store your field. I still think of it conceptually as two entirely different things, but it's not. This code: public static void main(String[] args) throws Exception { try { RAMDirectory dir = new RAMDirectory(); IndexWriter iw = new IndexWriter( dir, new StandardAnalyzer(Collections.emptySet()), true); Document doc = new Document(); doc.add( new Field( "f", "This is Some Mixed, case Junk($*%& With Ugly SYmbols", Field.Store.YES, Field.Index.TOKENIZED)); iw.addDocument(doc); iw.close(); IndexReader ir = IndexReader.open(dir); Document d = ir.document(0); System.out.println(d.get("f")); } catch (Exception e) { e.printStackTrace(); } System.out.println("done"); } prints "This is Some Mixed, case Junk($*%& With Ugly SYmbols" yet still finds the document with a search for "junk" using StandardAnalyzer. Sorry for the confusion! Erick On Jan 14, 2008 11:48 AM, Alex Wang <[EMAIL PROTECTED]> wrote: > Thanks a lot Erik for the great tip! I do need to display all the fields > and allow the users to sort by each field as they wish. My index is > currently about 200 mb. > > Your suggestion about storing (but not index) the cased version, and > indexing (but not store) the lower-case version is an excellent solution > for me. > > Is it possible to do it in the same field or do I have to do it in 2 > separate fields? If I do it in one field, what are the Lucene > class/methods I need to overwrite? > > Thanks again for your help! > > Alex > > > This message may contain confidential and/or privileged information. If > you are not the addressee or authorized to receive this for the > addressee, you must not use, copy, disclose, or take any action based on > this message or any information herein. If you have received this > message in error, please advise the sender immediately by reply e-mail > and delete this message. Thank you for your cooperation. > > > -Original Message- > From: Erick Erickson [mailto:[EMAIL PROTECTED] > Sent: Monday, January 14, 2008 11:24 AM > To: java-user@lucene.apache.org > Subject: Re: Lucene sorting case-sensitive by default? > > Several things: > > 1> do you need to display all the fields? Would just storing them > lower-case work? The only time I've needed to store fields case- > sensitive is when I'm showing them to the user. If the user is just > searching on them, I can store them any way I want and she'll never > know. > > 2> You might very well be surprised at how little extra it takes to > index (but not store) the lower-case version. How big is your index > anyway? And be warned that the size increase is not linear, so > just comparing the index sizes for, say, 10 document is misleading. > If your index is 10M, there's no reason at all not to store twice. If > it's > 10G > > 3> You could store (but not index) the cased version. You could > index (but not store) the lower-case version. The total size of > your index is (I believe) about the same as indexing AND storing > the fields. That gives you a way to search caselessly and display > case-sensitively. 
>
> Best
> Erick
>
> On Jan 14, 2008 10:58 AM, Alex Wang <[EMAIL PROTECTED]> wrote:
> > Thanks everyone for your replies! Guess I did not fully understand the
> > meaning of "natural order" in the Lucene Java doc.
> >
> > To add another all-lower-case field for each sortable field in my index
> > is a little too much, since the app requires sorting on pretty much all
> > fields (over 100).
> >
> > Toke, you mentioned "Using a Collator works but does take a fair amount
> > of memory", can you please elaborate a little more on that. Thanks.
> >
> > Alex
> >
> > -Original Message-
> > From: Toke Eskildsen [mailto:[EMAIL PROTECTED]
> > Sent: Mon
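To make option 3 above concrete, here is a minimal sketch using the same 2.x-era Field API as the code earlier in this thread. The class and field names ("title"/"titleSort") are placeholders for illustration, not anything from the original messages:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

public class CaseInsensitiveSortSketch {

    // Adds one display copy and one sortable copy of the same value.
    static Document makeDoc(String title) {
        Document doc = new Document();
        // Stored but not indexed: the original casing, for display only.
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.NO));
        // Indexed but not stored: a lower-cased, untokenized copy used for sorting.
        doc.add(new Field("titleSort", title.toLowerCase(),
                Field.Store.NO, Field.Index.UN_TOKENIZED));
        return doc;
    }

    // At query time, sort on the lower-cased field.
    static Sort titleSort() {
        return new Sort(new SortField("titleSort"));
    }
}

Because the sortable copy is a single untokenized lower-cased term, sorting ignores case without a Collator, at the cost of one extra indexed (but not stored) field per sortable field.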
Can I use HDFS in Lucene 2.3.1?
Hi, Does anybody have experience building a distributed application with Lucene and Hadoop/HDFS? Lucene 2.3.1 does not seem to expose an HDFS-backed Directory implementation. Any advice would be appreciated. Regards, Alex
searching for C++
Hello: I have a problem where I need to search for the term "C++". If I use StandardAnalyzer, the "+" characters are removed and the search is done on just the "c" character, which is not what is intended. Yet, I need to use the standard analyzer for the other benefits it provides. I think I need to write a specialized tokenizer (and an accompanying analyzer) that lets the "+" characters pass. I would take the JFlex-provided one, modify it, and add it to my project. My question is: is there any simpler way to accomplish the same? Best regards, Alex Soto [EMAIL PROTECTED] - Amicus Plato, sed magis amica veritas. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: searching for C++
Thanks everyone. I appreciate the help. I think I will write my own tokenizer, because I do not have a predefined list of words with symbols. I will modify the grammar by defining a SYMBOL token as John suggested and redefine ALPHANUM to include it.

Regards,
Alex Soto

On Tue, Jun 24, 2008 at 12:12 PM, N. Hira <[EMAIL PROTECTED]> wrote:
> This isn't ideal, but if you have a defined list of such terms, you may find
> it easier to filter these terms out into a separate field for indexing.
>
> -h
> --
> Hira, N.R.
> Solutions Architect
> Cognocys, Inc.
> (773) 251-7453
>
> On 24-Jun-2008, at 11:03 AM, John Byrne wrote:
>
>> I don't think there is a simpler way. I think you will have to modify the
>> tokenizer. Once you go beyond basic human-readable text, you always end up
>> having to do that. I have modified the JavaCC version of StandardTokenizer
>> to allow symbols to pass through, but I've never used the JFlex version -
>> I don't know anything about JFlex, I'm afraid!
>>
>> A good strategy might be to make a new type of lexical token called
>> "SYMBOL" and try to catch as many symbols as you can think of; then maybe
>> create new token types which are ALPHANUM types that can have prefixed or
>> postfixed symbols.
>>
>> That way, you'll be able to catch things like "c++" in a TokenFilter, and
>> you can choose to pass it through as a single token, split it up into two
>> tokens, or whatever you want.
>>
>> Hope that helps.
>>
>> Regards,
>> JB
>>
>> Alex Soto wrote:
>>> Hello:
>>>
>>> I have a problem where I need to search for the term "C++".
>>> If I use StandardAnalyzer, the "+" characters are removed and the
>>> search is done on just the "c" character, which is not what is
>>> intended.
>>> Yet, I need to use the standard analyzer for the other benefits it provides.
>>>
>>> I think I need to write a specialized tokenizer (and accompanying
>>> analyzer) that lets the "+" characters pass.
>>> I would use the JFlex-provided one, modify it and add it to my project.
>>>
>>> My question is:
>>>
>>> Is there any simpler way to accomplish the same?
>>>
>>> Best regards,
>>> Alex Soto

--
Alex Soto
[EMAIL PROTECTED]

-
Amicus Plato, sed magis amica veritas.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
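As a simpler (if cruder) alternative to modifying the JFlex grammar, a CharTokenizer subclass can treat "+" as an ordinary token character so that "c++" survives as a single term. This is only a sketch with made-up class names, and it gives up StandardTokenizer's special handling of acronyms, e-mail addresses, and so on, which the original question wanted to keep:

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.TokenStream;

public class SymbolAnalyzer extends Analyzer {

    static class SymbolTokenizer extends CharTokenizer {
        SymbolTokenizer(Reader in) {
            super(in);
        }

        // Keep letters, digits and a few programming-language symbols together,
        // so "c++" and "c#" come out as single tokens.
        protected boolean isTokenChar(char c) {
            return Character.isLetterOrDigit(c) || c == '+' || c == '#';
        }

        // Lower-case while tokenizing, like LowerCaseTokenizer does.
        protected char normalize(char c) {
            return Character.toLowerCase(c);
        }
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new SymbolTokenizer(reader);
    }
}

The same analyzer has to be used at both index and query time. If StandardTokenizer's other features are needed, modifying the JFlex grammar with a SYMBOL token as discussed above remains the better route.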
IndexDeletionPolicy to delete commits after N minutes
Hi, what is the correct way to instruct the IndexWriter (or some other class?) to delete old commit points after N minutes? I tried to write a customized IndexDeletionPolicy that uses its parameters to schedule future jobs to perform the file deletion. However, I am only getting file names through the parameters, not absolute paths. Thanks. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
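A rough sketch of such a policy against the 2.3-era IndexDeletionPolicy API is below. Two points worth noting: absolute paths are not needed, because calling delete() on a commit point only marks it and the IndexWriter removes the underlying files itself once it is safe to do so; and the check only runs when the policy is consulted (i.e. at the next commit), not on a background timer. The class name and the timing scheme are made up for illustration:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.index.IndexCommitPoint;
import org.apache.lucene.index.IndexDeletionPolicy;

// Keeps every commit point for at least N minutes; after that it is marked
// for deletion the next time the policy is consulted. The newest commit is
// always kept.
public class KeepForNMinutesDeletionPolicy implements IndexDeletionPolicy {

    private final long keepMillis;
    // When each commit was first seen, keyed by its segments file name.
    private final Map firstSeen = new HashMap();

    public KeepForNMinutesDeletionPolicy(int minutes) {
        this.keepMillis = minutes * 60L * 1000L;
    }

    public void onInit(List commits) {
        onCommit(commits);
    }

    public void onCommit(List commits) {
        long now = System.currentTimeMillis();
        // Commits are ordered oldest first; never delete the most recent one.
        for (int i = 0; i < commits.size() - 1; i++) {
            IndexCommitPoint commit = (IndexCommitPoint) commits.get(i);
            String name = commit.getSegmentsFileName();
            Long seen = (Long) firstSeen.get(name);
            if (seen == null) {
                firstSeen.put(name, new Long(now));
            } else if (now - seen.longValue() > keepMillis) {
                // No absolute paths required: the writer deletes the files itself.
                commit.delete();
                firstSeen.remove(name);
            }
        }
    }
}

An instance would be passed to one of the IndexWriter constructors that accepts an IndexDeletionPolicy.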
Re: How Lucene Search
The debugger that comes with Eclipse is pretty good for this purpose. You can create a small project and then attach the Lucene source for debugging. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
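For example, a tiny self-contained program like this sketch (class and field names made up; it uses the 2.x-era Hits API) gives a convenient place to set a breakpoint and step into IndexSearcher.search() with the Lucene source attached:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class DebugSearch {
    public static void main(String[] args) throws Exception {
        // Build a one-document index in RAM so there is something to search.
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(new Field("body", "hello lucene", Field.Store.YES, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();

        // Set a breakpoint on the search() call below and step in to see how
        // query rewriting, scoring and hit collection work.
        IndexSearcher searcher = new IndexSearcher(dir);
        Hits hits = searcher.search(new TermQuery(new Term("body", "lucene")));
        System.out.println("hits: " + hits.length());
        searcher.close();
    }
}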
Urgent Help Please: "Resource Tempararily Unavailable"
Hi Everyone, We have an application built using Lucene 1.9. The app allows incremental updating to the index while other users are searching the same index. Today, some search suddenly returns nothing when we know it should return some hits. This does not happen all the time. Sometimes the search succeeded. When checking the logs, I found the following error during searching: Parameter[0]: java.io.IOException: Resource temporarily unavailable When this error occurred, there were 2 other users deleting documents from the same index. The deletions seemed to succeed, but the search failed. I have no clue what could have caused such error. Unfortunately there is no further info in the logs. Can someone please shed some light on this? Thanks. Alex
Urgent Help Please: "Resource temporarily unavailable"
Hi Everyone, We have an application built using Lucene 1.9. The app allows incremental updates to the index while other users are searching the same index. Today, some searches suddenly returned nothing when we know they should return some hits. This does not happen all the time; sometimes the search succeeds. When checking the logs, I found the following error during searching: Parameter[0]: java.io.IOException: Resource temporarily unavailable When this error occurred, there were two other users deleting documents from the same index. The deletions seemed to succeed, but the search failed. I have no clue what could have caused such an error. Unfortunately there is no further info in the logs. Can someone please shed some light on this? Thanks. Alex