Re: Storing payloads without term-position and frequency

2011-02-03 Thread Alex
Hello Grant,

I am currently storing the payload on the first term instance only, because I
index each token for an article just once. What I want to achieve is an index
for versioned document collections like Wikipedia (see this paper:
http://www.cis.poly.edu/suel/papers/archive.pdf).

In detail: on the first level (Lucene) I create one document per Wikipedia
article containing all distinct terms across its versions. On the second level
(payloads) I store the frequency information corresponding to each article
version and its terms. When I search, I can find an article by a term, and
through the term and its payload I get information about the individual
versions and how often the token occurred (in my case the payload position is
always 1, since each term is indexed only once!). So I look at the first level
and pick only the information from the second level that I need. This way I
avoid storing information several times, because most Wikipedia versions are
very similar in their term content.

This is working so far, and I just want to reduce my index size, but I don't
know how much I can save by disabling term freqs/positions.
I hope this explains the problem a little. If not, just tell me and I'll try
to explain it again. :)

Best regards
Alex

PS: I am currently looking for a bedroom in New York, Brooklyn (Park
Slope or near NYU Poly). Maybe somebody rents a room from 15 Feb until
15 April. :)

On Thursday, 2011-02-03 at 12:38 -0500, Grant Ingersoll wrote:
> Payloads only make sense in terms of specific positions in the index, so I 
> don't think there is a way to hack Lucene for it.  You could, I suppose, just 
> store the payload for the first instance of the term.
> 
> Also, what's the use case you are trying to solve here?  Why store term 
> frequency as a payload when Lucene already does it (and it probably does it 
> more efficiently)
> 
> -Grant
> 
> On Feb 2, 2011, at 2:35 PM, Alex vB wrote:
> 
> > 
> > Hello everybody,
> > 
> > I am currently using Lucene 3.0.2 with payloads. I store extra information
> > in the payloads about the term like frequencies and therefore I don't need
> > frequencies and term positions stored normally by Lucene. I would like to
> > set f.setOmitTermFreqAndPositions(true) but then I am not able to retrieve
> > payloads. Would it be hard to "hack" Lucene for my requests? Anymore I only
> > store one payload per term if that information makes it easier.
> > 
> > Best regards
> > Alex
> > -- 
> > View this message in context: 
> > http://lucene.472066.n3.nabble.com/Storing-payloads-without-term-position-and-frequency-tp2408094p2408094.html
> > Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> > 
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
> > 
> 
> --
> Grant Ingersoll
> http://www.lucidimagination.com
> 
> 
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
> 



-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



which parser to use?

2011-09-22 Thread alex

hi all,
I need to create an analyzer and have to choose which parser to use.
Can anyone recommend one?

JFlex
javacc
antlr

thanks.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Query and language conversion

2009-09-01 Thread Alex
Hi,

I am new to Lucene, so excuse me if this is a trivial question.


I have data that I index in a given language (English). My users will come
from different countries and my search screen will be internationalized. My
users will then probably query things in their own language. Is it possible
to look up items that were indexed in a different language?


To make things a bit clearer:

My "Business" object has a "type" attribute. In Lucene the "type" field is
created. The Business object for "Doctor Smuck" will be indexed with the
"type" field as "medical doctor" or something similar.
My German users will query using the German language. A user might try to find
a doctor using "Arzt" or maybe "Mediziner" as a query.
Is Lucene able to match the query to the value that was indexed in another
language?
Is there an analyzer for that?

By the way: I can provide the probable input language, based on the client's
search page language, as a parameter if that helps (it probably will).


Many thanks for your thoughts !


Re: Query and language conversion

2009-09-01 Thread Alex
Many thanks Steve for all that information.

I understand from your answer that cross-lingual search doesn't come
"out-of-the-box" in Lucene.

Cheers.

Alex

On Tue, Sep 1, 2009 at 6:46 PM, Steven A Rowe  wrote:

> Hi Alex,
>
> What you want to do is commonly referred to as "Cross Language Information
> Retrieval".  Doug Oard at the University of Maryland has a page of CLIR
> resources here:
>
> http://terpconnect.umd.edu/~dlrg/clir/<http://terpconnect.umd.edu/%7Edlrg/clir/>
>
> Grant Ingersoll responded to a similar question a couple of years ago on
> this list:
>
> <
> http://search.lucidimagination.com/search/document/e1398067af353a49/cross_lingual_ir#e1398067af353a49
> >
>
> Here's another recent thread with lots of good info, from the solr-user
> mailing list, on the same topic:
>
> <
> http://search.lucidimagination.com/search/document/f7c17dc516c89bf6/preparing_the_ground_for_a_real_multilang_index#797001daa3f73e17
> >
>
> Here's a paper written by a group that put together a Greek-English
> cross-language retrieval system using Lucene:
>
> http://www.springerlink.com/content/n172420t1346q683/
>
> And here's another paper written by a group that made a Hindi and Telugu to
> English cross-language retrieval system using Lucene, from the CLEF 2006
> conference proceedings:
>
> http://www.iiit.ac.in/techreports/2008_76.pdf
>
> Steve
>
> > -Original Message-
> > From: Alex [mailto:azli...@gmail.com]
> > Sent: Tuesday, September 01, 2009 10:30 AM
> > To: java-user@lucene.apache.org
> > Subject: Query and language conversion
> >
> > Hi,
> >
> > I am new to Lucene so excuse me if this is a trivial question ..
> >
> >
> > I have data that I Index in a given language (English). My users will
> > come from different countries and my search screen will be
> > internationalized. My users will then probably query thing in their
> > own language. Is it possible too lookup for Items that were indexed
> > in a different language.
> >
> > To make thing a bit more clear.
> >
> > My "Business" object has a "type" attribute. In lucene the "type" field
> > is created. The Business object for  "Doctor Smuck" will be indexed with
> > the "type" field as  "medical doctor" or anything similar. My German
> > users will query using german languange. He tries to find a Doctor
> > using "Arzt" or maybe "Mediziner" as a query. Is Lucene able to match
> > the query to the value that was indexed in another language ?
> > Is there an analyser for that ?
> >
> > By the way : I can provide the probable input language, based on the
> > client's search page language,  as a parameter if that helps (it
> > probably will) .
> >
> > Many thanks for your thoughts !
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Filtering query results based on relevance/accuracy

2009-09-21 Thread Alex
Hi,

I'm a total newbie with Lucene and trying to understand how to achieve my
(complicated) goals. So what I'm doing is still totally experimental for me,
but is probably extremely trivial for the experts on this list :)

I use Lucene and Hibernate Search to index locations by their name, type,
etc...
The LocationType is an object that has its "name" field indexed both
tokenized and untokenized.

The following LocationType names are indexed
"Restaurant"
"Mexican Restaurant"
"Chinese Restaurant"
"Greek Restaurant"
etc...

Considering the following query  :

"Mexican Restaurant"

I systematically get all the entries as a result, most certainly because the
"Restaurant" keyword is present in all of them.
I'm trying to have a finer-grained result set.
Obviously for "Mexican Restaurant" I want the "Mexican Restaurant" entry as
a result but NOT "Chinese Restaurant" nor "Greek Restaurant", as they are
irrelevant. But maybe "Restaurant" itself should be returned with a lower
weight/score, or maybe it shouldn't... I'm not sure about this one.

1)
How can I do that ?

Here is the code I use for querying :


String[] typeFields = {"name", "tokenized_name"};
Map<String, Float> boostPerField = new HashMap<String, Float>(2);
boostPerField.put("name", 4f);
boostPerField.put("tokenized_name", 2f);

QueryParser parser = new MultiFieldQueryParser(
        typeFields,
        new StandardAnalyzer(),
        boostPerField);

org.apache.lucene.search.Query luceneQuery;

try {
    luceneQuery = parser.parse(queryString);
}
catch (ParseException e) {
    throw new RuntimeException("Unable to parse query: " + queryString, e);
}





I guess there is a way to filter out results that have a score below a given
threshold, or a way to filter out results based on a score gap, or something
similar, but I have no idea how to do this...
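
(To make the question more concrete: I imagine something roughly like the
sketch below, though I don't know whether it is the right approach. It is
untested; the 50% cutoff, the result size of 10 and the use of the "name"
field are just made-up placeholders.)

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

// Rough sketch: keep only hits whose score is at least half of the best score.
public static void printRelevantHits(IndexSearcher searcher, Query luceneQuery)
        throws IOException {
    TopDocs topDocs = searcher.search(luceneQuery, 10);
    if (topDocs.scoreDocs.length == 0) {
        return;
    }
    float best = topDocs.scoreDocs[0].score;      // hits come back sorted by score
    for (ScoreDoc hit : topDocs.scoreDocs) {
        if (hit.score < 0.5f * best) {
            break;                                // remaining hits are weaker still
        }
        Document doc = searcher.doc(hit.doc);
        System.out.println(doc.get("name") + "  score=" + hit.score);
    }
}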


What is the best way to achieve what I want?

Thank you for your help !

Cheers,

Alex


Re: Filtering query results based on relevance/accuracy

2009-09-26 Thread Alex
Hi Otis, and thank you for helping me out.

Sorry for the late reply.



Although a PhraseQuery or TermQuery would be perfectly suited in some
cases, this will not work in my case.

Basically my application's search feature is a single field "à la Google"
and the user can be looking for a lot of different things...

For example the user can search for
"Chinese Restaurant in New York USA"
or maybe just
"Chinese Restaurant" (which should be understood as "nearby Chinese
Restaurant")
or maybe
"Chinese Retaurant at 12 Main St. New York"
or
"1223 Main Street New York"



So basically I will get many different query structures depending on the
user's intent/meaning/logic, and I think I need to figure out a good analysis
algorithm to identify Locations as accurately as possible.

As a first step in my algo I am trying to isolate/identify a potential
LocationType from the query string.
So my idea was to separate the individual words and use them to query my index
for LocationTypes that best match what's included in the query.
I could then get the best-matching LocationTypes based on how they scored
against the Lucene query, and then move on to the next step of my algo, which
would try to find another potential feature of the query such as the
presence of a country name or city name etc.

That's why a phrase query would not be appropriate here, as this would mean
that the entire query string would be used, which would most of the time
return no relevant LocationTypes.

Once I have analysed the query string and isolated the various features
(LocationType, city name, country name, address, ...) I could maybe create
a BooleanQuery using everything that was fetched earlier.


So basically I'm not sure which feature of Lucene I should use here in the
first step of the algo to find only the most relevant LocationTypes and
filter out the ones that are not relevant enough.


Any help and any thoughts on my approach greatly appreciated.


Thanks in advance.

Cheers,

Alex.


Re: Filtering query results based on relevance/accuracy

2009-09-29 Thread Alex
Can anybody help?

On Sat, Sep 26, 2009 at 11:22 PM, Alex  wrote:

> Hi Otis and thank your for helping me out.
>
> Sorry for the late reply.
>
>
>
> Although a Phrase query or TermQuery  would be perfectly suited in some
> cases, this will not work in my case.
>
> Basically my application's search feature is a single field "à la Google"
> and the user can be looking for a lot of different things...
>
> For example the user can search for
> "Chinese Restaurant in New York USA"
> or maybe just
> "Chinese Restaurant"  (which should be understood as "nearby Chinese
> Restaurant"
> or maybe
> "Chinese Retaurant at 12 Main St. New York"
> or
> "1223 Main Street New York"
>
>
>
> So basically I will get many different query structures depending on the
> user's intent/meaning/logic and I think I need to figure out a good analysis
> algorithm to get Locations as acurately as possible.
>
> As a first step in my algo I am trying to isolate/identify a potential
> LocationType from the query string.
> So my idea was to separate each words and use them to query my Index for
> LocationTypes that would best match what's included in the query.
> I could then get the best matching LocationTypes based on how it scored
> against the luicene query and then move on to the next step of my algo which
> would try to find another potential feature of the query such as the
> presence of a Country name or City name etc 
>
> That's why a phrase query would not be appropriate here as this would mean
> that the entire query string would be used and would most of the times
> return no relevant LocationTypes.
>
> Once I have analysed the query string and isolated the various features
> (LocationType, City Name, Country Name , Address  ) I could maybe create
> a Boolean Query where I would use all that was fetched earlier
>
>
> So basically I'm not sure what feature of Lucene I should use here in the
> first step of the algo to only find the most relevant LocationTypes and
> filter out the ones that are not relevant enough.
>
>
> Any help and any thoughts on my approach greatly appreciated.
>
>
> Thanks in advance.
>
> Cheers,
>
> Alex.
>
>
>


Document category identification in query

2009-12-14 Thread Alex
Hi,

I am trying to expand user queries to figure out potential document
categories implied in the query.
I wanted to know the best way to figure out the document category that is
the most relevant to the query.

Let me explain further:
I have created categories that are applied to documents I want to index.
Some example categories are :

Hotel
Restaurant
Fast Food
Chinese Restaurant
Church
Bank
Gas station


I am also trying to create category aliases, so that for example "Chinese
food" can also be used to name "Chinese Restaurant" with the same unique
category ID.


The documents I index have 1 primary category and 1...N secondary
categories, for example:

McDonalds will be categorized under Fast Food as a primary category but also
under Restaurant as a secondary category.
The London Pub at the corner of my street will be categorized as Pub as its
primary category and also as Bar, Food and Beverages, Restaurant, and Fast
Food (since they also have takeaway burgers ;).

This all gives me a set of categories that are quite clearly identified, as
well as a set of category aliases, even though I'm aware that I can't figure
out all the possible aliases of all my categories. At least I have the most
obvious ones.


Now with all this, I wanted to know how, with the help of Lucene (or any
other efficient method), I could figure out the most relevant category (if
any) behind a user query.


For example :

If my user looks for
"Chang's chinese restaurant", the obvious category should be "Chinese
Restaurant";
but if my user looks for
"chines restauran" (misspelled), the category should also be "Chinese
Restaurant" (such as Google is capable of doing);
OR
"chinese bistro" should probably also return the category "Chinese
Restaurant", since bistro is a concept very similar to "Restaurant"...


Once the category is identified I can then query the index for documents
that match that category best.

What is the proper way to identify the most relevant category in a user
query based on the above?
Should I consider any other, better approach?


Any help appreciated.


Many thanks

Alex.


Re: Document category identification in query

2009-12-15 Thread Alex
Can anybody help me or maybe point me to relevant resources I could learn
from ?

Thanks.


Re: Document category identification in query

2009-12-20 Thread Alex
Hi !

Many thanks to both of you for your suggestions and answers!

What Weiwei Wang suggests is part of the solution I am willing to
implement. I will definitely use the suggest-as-you-type approach in the
query form, as it will allow for pre-emptive disambiguation and, I believe,
will give very satisfying results.

However, search users are wild beasts and I can't count on them to always
use the given suggestions. All I can count on is very erratic, sparse and
ambiguous queries :) So I need an almost foolproof solution.

To answer your question:
"BTW, I do not understand why you need to know the category of user input"
I am trying to understand the user intent behind the query to filter out
results based on a given category of locations. If a user queries "Fast Food
in Nanjing" I don't want to return all the documents that contain the words
"Fast" and "Food" and "Nanjing". I use a custom algorithm to figure out the
intended location first. Then, using the Spatial contrib, I filter out the
results based on the area that was identified earlier. Finally I sort
the results according to distance from the location point / centroid found
earlier.

Identifying the category gives me several things:

1) Filter out irrelevant results: I don't want my result set to include a
supermarket in "Nanjing" where the "food" is fresh and the service is "fast"
just because the query words were included in the description of the
location. Since I am using custom, distance-based sorting of the results, I
can't afford to have the supermarket be the top result just because it is the
closest to the location centroid identified earlier. The user intent was
clearly "fast food" and not a supermarket!

2) Understand user intent to provide targeted advertising.

3) Understanding the category of location a user is looking for also allows
me to calculate the bounding box more accurately, i.e. the maximum distance
at which a location can lie and still be relevant to the user. A user
looking for pizza in New York expects to have his results within a
radius of at most 1 or 2 miles. If he is looking for a theme park he
will probably be willing to go further away to find it. So identifying the
category of the location the user is looking for lets me calculate the
distance radius more accurately.




Fei Liu,

Thanks a lot for the papers you pointed me to. I came across them earlier in
my research, and re-reading them gave me new insights. However, I believe that
the two-step approach you are recommending is not very viable under heavy
load, as it requires two passes on the index.
I believe, however, that identifying the dominant category(ies) of the
result set, when no category could be clearly identified from the query
alone, can be very valuable if sent back to the user as information and a
category link!

Now, what I think I will do to pre-emptively identify the location
category(ies) implied in the query:

1 - use my own custom category set and index the category names using the
synonym analyzer provided with Lucene, and also use some sort of normalization
such as stemming, maybe also using the Snowball analyzer.
2 - break the query into shingles (word-based grams) and analyze each
shingle using the analyzers that were used in (1), then query Lucene with
these analyzed shingles against the category index built earlier (see the
rough sketch below).

Hopefully the category with the highest Lucene score should be the one
intended by the user
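
Roughly, I imagine step 2 looking something like the following (untested
sketch; the field name "category_name", the analyzer and the use of a plain
PhraseQuery per shingle are placeholders for whatever I end up using):

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TopDocs;

// Sketch: break the raw user query into analyzed 1- and 2-word shingles and
// score each shingle against the category index; keep the best-scoring category.
public static String bestCategory(IndexSearcher categorySearcher, Analyzer analyzer,
                                  String userQuery) throws IOException {
    // 1) collect shingles from the analyzed query (stopword gaps may appear as "_")
    List<String> shingles = new ArrayList<String>();
    TokenStream ts = new ShingleFilter(
            analyzer.tokenStream("category_name", new StringReader(userQuery)), 2);
    TermAttribute termAtt = ts.addAttribute(TermAttribute.class);
    while (ts.incrementToken()) {
        shingles.add(termAtt.term());
    }
    // 2) run each shingle as a phrase against the tokenized category-name field
    String best = null;
    float bestScore = 0f;
    for (String shingle : shingles) {
        PhraseQuery q = new PhraseQuery();
        for (String word : shingle.split(" ")) {
            q.add(new Term("category_name", word));   // terms are already analyzed
        }
        TopDocs hits = categorySearcher.search(q, 1);
        if (hits.scoreDocs.length > 0 && hits.scoreDocs[0].score > bestScore) {
            bestScore = hits.scoreDocs[0].score;
            best = categorySearcher.doc(hits.scoreDocs[0].doc).get("category_name");
        }
    }
    return best;   // null when nothing matched at all
}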

Later on, I also intend to use some sort of training based approach using
search queries that would have been tagged with the relevant location
categories.

What do you guys think ?

Would this be a viable approach ?

Thanks for all !


Cheers

Alex


slow FieldCacheImpl.createValue

2008-05-19 Thread Alex
hi,
I have a ValueSourceQuery that makes use of a stored field. The field contains
roughly 27.27 million untokenized terms.
The average length of each term is 8 digits.
The first search always takes around 5 minutes, and it is due to the
createValue function in FieldCacheImpl.
The search is executed on a RAID5 array of 15k rpm disks.


Any hints on how to make the FieldCache's createValue faster? I have tried a
bigger cache size for BufferedIndexReader (8 KB or more),
but the time it takes createValue to execute is still in the realm of 4-5
minutes.
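
(For context, the only naive workaround I can think of is to pay that cost
explicitly at application startup rather than on the first query; roughly, as
an untested sketch with placeholder names, and using the same IndexReader
instance the searcher uses, since the FieldCache is keyed by reader. I would
still like to understand why the fill itself is so slow.)

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.FieldCache;

// Fill the FieldCache up front; "myField" is a placeholder, and the getXxx()
// variant should match whatever type the ValueSourceQuery actually reads.
static void warmFieldCache(IndexReader reader) throws IOException {
    FieldCache.DEFAULT.getInts(reader, "myField");
}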


thanks


RE: slow FieldCacheImpl.createValue

2008-05-19 Thread Alex
Hi,
thanks for the reply. Yes, after the first slow search, subsequent searches
have good performance.

I guess the issue is why exactly createValue takes so long, and whether it
should take so long (4-5 minutes).
Given roughly 27 million terms, each roughly 8 characters long plus a few
other bytes for the TermInfo record,
a modern disk can easily read over that portion of the index (the .frq
portion) in a few seconds. Also,
when I use tools like dstat, I see a bunch of 1 KB reads initiated while
running createValue.




> Date: Tue, 20 May 2008 11:02:38 +0530
> From: [EMAIL PROTECTED]
> To: java-user@lucene.apache.org
> Subject: Re: slow FieldCacheImpl.createValue
> 
> Hey Alex,
> I guess you haven't tried warming up the engine before putting it to use.
> Though one of the simpler implementation, you could try warming up the
> engine first by sending a few searches and then put it to use (put it into
> the serving machine loop). You could also do a little bit of preprocessing
> while initializing the daemon rather than waiting for the search to hit it.
> I hope I understood the problem correctly here, else would have to look into
> it.
> 
> --
> Anshum



lucene memory consumption

2008-05-29 Thread Alex

Hi,
other than the in-memory terms (.tii) and the few kilobytes of open file
buffers, what are the other sources of significant memory consumption
when searching on a large index (> 100 GB)? The queries are just normal term
queries.



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: lucene memory consumption

2008-05-29 Thread Alex

Currently, searching on our index consumes around 2.5 GB of RAM.
This is just a single term query, nothing that requires the in-memory cache
like the FieldScoreQuery does.


Alex


> Date: Thu, 29 May 2008 15:25:43 -0700
> From: [EMAIL PROTECTED]
> To: java-user@lucene.apache.org
> Subject: Re: lucene memory consumption
> 
> Not that I can think about. But, if you have any cached field data,
> norms array, that could be huge.
> 
> Would be interested in knowing from others regarding this topic as well.
> 
> Jian
> 
> On 5/29/08, Alex  wrote:
>>
>> Hi,
>> other than the in memory terms (.tii), and the few kilobytes of opened file
>> buffer, where are some other sources of significant memory consumption
>> when searching on a large index ?  (> 100GB). The queries are just normal
>> term queries.
>>
>>
>> _
>> 隨身的 Windows Live Messenger 和 Hotmail,不限時地掌握資訊盡在指間 — Windows Live for Mobile
>> http://www.msn.com.tw/msnmobile/
>>
>> -
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: lucene memory consumption

2008-05-29 Thread Alex

I believe we have around 346 million documents


Alex


> Date: Thu, 29 May 2008 18:39:31 -0400
> From: [EMAIL PROTECTED]
> To: java-user@lucene.apache.org
> Subject: Re: lucene memory consumption
> 
> Alex wrote:
>> Currently, searching on our index consumes around 2.5GB of ram. 
>> This is just a single term query, nothing that requires the in memory cache 
>> like in
>> the FieldScoreQuery. 
>>
>>
>> Alex
>>
>> 
>>   
> That seems rather high. You have 10/15 million + docs?
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Is it possible to get only one Field from a Document?

2008-06-11 Thread Alex
If you have many terms across the fields, you might want to invoke
IndexReader's setTermInfosIndexDivisor() method, which reduces the
in-memory term infos used to look up terms, at the cost of a (slightly)
slower search.
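
For example (untested sketch; the path and the divisor value are placeholders,
and the divisor has to be set before the first search loads the term index):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

static IndexSearcher openSearcher() throws IOException {
    IndexReader reader = IndexReader.open("/path/to/index");
    reader.setTermInfosIndexDivisor(4);  // keep only every 4th indexed term in RAM
    return new IndexSearcher(reader);
}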




> From: [EMAIL PROTECTED]
> To: java-user@lucene.apache.org
> Subject: Re: Is it possible to get only one Field from a Document?
> Date: Wed, 11 Jun 2008 08:22:22 -0400
> 
> For the record, Hits.id(int i) returns the document number.  Note,  
> though, that Hits is now deprecated, as pointed out by the link to  
> 1290, so going the TopDocs route is probably better anyway.
> 
> -Grant
> 
> On Jun 11, 2008, at 7:43 AM, Daan de Wit wrote:
> 
> > This is possible, you need to provider a FieldSelector to  
> > IndexReader#document(docId, selector). This won't work with Hits  
> > though, because Hits does not expose the document number, so you  
> > need to roll your own solution using TopDocs or HitCollector, for  
> > information see the discussion in this issue: 
> > https://issues.apache.org/jira/browse/LUCENE-1290
> >
> > Kind regards,
> > Daan de Wit
> >
> > -Original Message-
> > From: Marcelo Schneider [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, June 11, 2008 13:29
> > To: java-user@lucene.apache.org
> > Subject: Is it possible to get only one Field from a Document?
> >
> > I have a environment where we have indexed a DB with about 6mil  
> > entries
> > with Lucene, and each row has 25 columns. 20 cols have integer codes
> > used as filters (indexed/unstored), and the other 5 have (very) large
> > texts (also indexed/unstored). Currently the search I'm doing is  
> > like this:
> >
> > Hits hits = searcher.search(query);
> > for (int i = 0; i < this.hits.length(); i++) {
> >Document doc = this.hits.doc(i);
> >String s = doc.get("fieldWanted");
> > // does everything with the result, etc
> > }
> >
> > We are trying to reduce memory usage, however. Is it possible to  
> > return
> > a Document object with just the Fields I really need? In the example,
> > each Document have 25 fields, and I just need one... would this
> > theoretically make any difference?
> >
> >
> >
> >
> > -- 
> >
> > Marcelo Frantz Schneider
> > SIC - TCO - Tecnologia em Engenharia do Conhecimento
> > DÍGITRO TECNOLOGIA
> > E-mail: [EMAIL PROTECTED]
> > Site: www.digitro.com
> >
> >
> > -- 
> > Esta mensagem foi verificada pelo sistema de antivírus da Dígitro e
> > acredita-se estar livre de perigo.
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> >
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> >
> 
> --
> Grant Ingersoll
> http://www.lucidimagination.com
> 
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
> 
> 
> 
> 
> 
> 
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 


RE: huge tii files

2008-06-17 Thread Alex

You can invoke IndexReader.setTermInfosIndexDivisor prior to any search to
control the fraction of the .tii file read into memory.



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Add a document in a single pass?

2006-11-18 Thread alex

Hi,

I have a stream-based document parser that extracts content (as a character
stream) as well as document metadata (as strings) from a file, in a single
pass. From these data I want to create a Lucene document. The problem is
that the metadata are not available until the complete document has been
parsed, i.e. after IndexWriter.addDocument has returned.

Is there a way to affect the order in which the document fields are processed
in IndexWriter.addDocument, or another way to build the index efficiently?
Thanks in advance
Alex


Detailed file handling on hard disk

2010-09-03 Thread Alex vB

Hello everybody,

I read the paper http://www2008.org/papers/pdf/p387-zhangA.pdf (Performance
of Compressed Inverted List Caching in Search Engines) and now I am unsure
how Lucene implements its structure on the hard disk. I am using Windows as
the OS and therefore I implemented FSDirectory based on
java.io.RandomAccessFile.

How is the skipping in the .tis file realized? Is metadata used at the
beginning of each block too, like in the paper mentioned above on page 388
(in the paper the metadata stores information about how many inverted lists
are in the block and where they start)?

http://lucene.472066.n3.nabble.com/file/n1413062/Block_assignment.jpg 

I ask because I read in another article that I can seek to the correct
position on the hard drive by byte address using java.io.RandomAccessFile
(which I can read from the .tii file in "IndexDelta"?).

How do I find the correct position/location for my posting list/document?
Do I need information/metadata about the blocks from the underlying file
system?
And where can I find further information about this stuff? :)

Best regards
Alex
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Detailed-file-handling-on-hard-disk-tp1413062p1413062.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Implementing indexing of Versioned Document Collections

2010-11-09 Thread Alex vB

Hello everybody,

I would like to implement the paper "Compact Full-Text Indexing of Versioned
Document Collections" [1] by Torsten Suel for my diploma thesis in Lucene.
The basic idea is to create a two-level index structure. On the first level
a document is identified by document ID, with a posting list entry if the
term exists in at least one version. For every posting on the first level
with term t we have a bitvector on the second one. These bitvectors contain
as many bits as there are versions of the document, and bit i is set to 1
if version i contains term t; otherwise it remains 0.

http://lucene.472066.n3.nabble.com/file/n1872701/Unbenannt_1.jpg 

This little picture is just for demonstration purposes. It shows a posting
list for the term car, composed of 4 document IDs. If a hit is found
in document 6, another look-up is needed on the second level to get the
corresponding versions (versions 1, 5, 7, 8, 9, 10 out of 10 versions in
total).

At the moment I am using Wikipedia (the simplewiki dump) as a source with a
SAXParser and can resolve each document with all its versions from the XML
file (fields are Title, ID, Content (separated for each version)). My problem
is that I am unsure how to connect the second level with the first one and
how to store it. The key points that are needed:
- information from posting list creation to build the bitvector (term ->
doc -> versions)
- storing the bitvectors
- implementing search on the second level

For the first steps I disabled term frequencies and positions because the
paper doesn't handle them. I would be happy to get any running version at
all. :)
At the moment I can create the bitvectors for the documents. I realized this
with a HashMap in TermsHashPerField, where I grab the current
term in add() (I hope this is the correct location for retrieving the
inverted list's terms). Anyway, I can create the correct bitvectors and write
them into a text file.
Excerpt of bitvectors from the article "April":
april     : 110110111011
never     : 0010
ayriway   : 010110111011
inclusive : 1000
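
Conceptually the bitvector construction looks like this (a simplified,
illustrative sketch, not my actual TermsHashPerField code; bit i of a term's
BitSet is set when version i of the article contains that term):

import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// tokensPerVersion holds the analyzed tokens of each version, in version order.
static Map<String, BitSet> buildVersionBitvectors(List<List<String>> tokensPerVersion) {
    Map<String, BitSet> termVersionBits = new HashMap<String, BitSet>();
    int numVersions = tokensPerVersion.size();
    for (int v = 0; v < numVersions; v++) {
        for (String term : tokensPerVersion.get(v)) {
            BitSet bits = termVersionBits.get(term);
            if (bits == null) {
                bits = new BitSet(numVersions);
                termVersionBits.put(term, bits);
            }
            bits.set(v);                 // term occurs in version v
        }
    }
    return termVersionBits;
}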
 

The next step would be storing all bitvectors in the index. At first glance I
would like to use an extra field to store the created bitvectors permanently
in the index. It seems to be the easiest way for a first implementation
without touching the low-level functions of Lucene. Can I add a field after I
have already started writing the document through IndexWriter? How would I do
this? Or are there any other suggestions for storing? Another idea is to
extend the index format of Lucene, but this seems a little bit too difficult
for me. Maybe I could write this information into my own file. Could
anybody point me in the right direction? :)

Currently I am focusing on storing, and will try to extend Lucene's search
after that step.

THX in advance & best regards 
Alex

[1] http://cis.poly.edu/suel/
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Implementing-indexing-of-Versioned-Document-Collections-tp1872701p1872701.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Implementing indexing of Versioned Document Collections

2010-11-16 Thread Alex vB

Hello Pulkit,

thank you for your answer, and excuse me for my late reply. I am currently
working on the payload stuff and have implemented my own Analyzer and
TokenFilter for adding custom payloads. As far as I understand, I can add a
payload for every term occurrence and write it into the posting list. My
posting list now looks like this:

car -> DocID 1, [Payload 1], DocID 2, [Payload 2], ..., DocID N, [Payload N]

where each payload is a BitSet depending on the versions of a document. I
must admit that the index is getting really big at the moment because I am
adding around 8 to 16 bytes with each payload. I have to find a good
compression for the bitvectors.
Furthermore, I always get the error
org.apache.lucene.index.CorruptIndexException: checksum mismatch in segments
file if I use my own Analyzer. After I disable the checksum test,
everything works fine. Even Luke isn't giving me an error. Any ideas?
Another problem is the bitvector creation during tokenization. I run
through all versions during the tokenizing step to create my bitvectors
(stored in a HashMap), so my bitvectors are only completely built after the
last field is analyzed (I added every Wikipedia version as its own field).
Therefore I need to add the payload after the tokenizing step. Is this
possible? What happens if I add a payload for the current term and then add
another payload for the same term later? Is it overwritten or appended?

Greetings
Alex
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Implementing-indexing-of-Versioned-Document-Collections-tp1872701p1910449.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Implementing indexing of Versioned Document Collections

2010-11-16 Thread Alex vB

Hi again,

my Payloads are working fine as I figured out now (haven't seen the
nextPosition method). I really have problems with adding the bitvectors.
Currently I am creating them during tokenization. Therefore, as already
mentioned, they are only completely created when all fields are tokenized
because I add every new term occurence into HashMap and create/update the
linked bitvector during this analysis process. I read in another post that
changing or updating already set payloads isn't possible. Furthermore I need
to store payload only ONCE for a term and not on every term position. For
example on the wiki article for April I would have around 5000 term
occurrences for the term "April"! This would save a lot of memory.

1) Is it possible to pre-analyze fields? Maybe analyzing twice: a first time
for getting the bitvectors (without writing them!) and a second time for
normal index writing with bitvector payloads (see also the sketch after this
list).
2) Alternatively, I could still add the bitvectors during tokenization if I
were able to set the current term in my custom filter (extends
TokenFilter). In my HashMap I have (term, bitvector) pairs and I could
iterate over all term keys. Is it possible to manually set the current term
and the corresponding payload? I tried something like this after all fields
and streams have been tokenized (without success):

for (Map.Entry<String, BitSet> e : map.entrySet()) {
    String key = e.getKey();
    BitSet value = e.getValue();

    termAtt.setTermBuffer(key);
    Payload bitvectorPayload = new Payload(toByteArray(value));
    payloadAttr.setPayload(bitvectorPayload);
}

3) Can I use payloads without term positions? 
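
For what it's worth, the direction I am considering for question 1 is a
filter that attaches the payload only to the first occurrence of each term
during a second, normal analysis pass, assuming the per-term byte[] bitvector
map has already been filled in a first pass. A rough, untested sketch against
the 3.0-style API (names are mine):

import java.io.IOException;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.index.Payload;

// Attaches a payload only to the first occurrence of each term in the stream.
public final class FirstOccurrencePayloadFilter extends TokenFilter {
    private final Map<String, byte[]> payloadPerTerm;     // precomputed bitvectors
    private final Set<String> seen = new HashSet<String>();
    private final TermAttribute termAtt = addAttribute(TermAttribute.class);
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

    public FirstOccurrencePayloadFilter(TokenStream input, Map<String, byte[]> payloadPerTerm) {
        super(input);
        this.payloadPerTerm = payloadPerTerm;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        String term = termAtt.term();
        byte[] data = payloadPerTerm.get(term);
        if (data != null && seen.add(term)) {
            payloadAtt.setPayload(new Payload(data));     // first occurrence only
        } else {
            payloadAtt.setPayload(null);                  // no payload on later positions
        }
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        seen.clear();                                     // start fresh for the next field
    }
}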

If my questions are unclear please tell me! :)

Best regards
Alex



-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Implementing-indexing-of-Versioned-Document-Collections-tp1872701p1913140.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Indexing large XML dumps

2011-01-03 Thread Alex vB

Hello everybody,

I am currently indexing Wikipedia dumps and creating an index for versioned
document collections. So far everything is working fine, but I never
thought that single Wikipedia articles would reach a size of around 2 GB!
One article for example has 2 versions with an average length of 6
characters each (HUGE in memory!). This means I need a heap space of
around 4 GB to perform indexing, and I would like to decrease my memory
consumption ;).

At the moment I load every Wikipedia article, containing all versions,
completely into memory. Then I collect some statistical data about the
article to store extra information about term occurrences, which is written
into the index as payloads. The statistics are created during a separate
tokenization run which happens before the document is written to the index.
This means I am analyzing my documents twice! :( I know there is a
CachingTokenFilter, but I haven't found how and where to implement it exactly
(I tried it in my Analyzer but stream.reset() seems not to work). Does
somebody have a nice example? (A rough sketch of what I mean is below, after
the questions.)
1) Can I somehow avoid loading one complete article to get my statistics?
2) Is it possible to index large files without completely loading them into a
field?
3) How can I avoid parsing an article twice?
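
The pattern I have in mind for the CachingTokenFilter (question 3) is roughly
the following untested sketch; the method, the "content" field name and the
statistics step are placeholders:

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CachingTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Analyze once: walk the cached tokens for statistics, rewind the cache,
// then hand the same cached stream to the field for indexing.
static void addWithSingleAnalysisPass(IndexWriter writer, Analyzer analyzer,
                                      String versionText) throws IOException {
    TokenStream stream = analyzer.tokenStream("content", new StringReader(versionText));
    CachingTokenFilter cached = new CachingTokenFilter(stream);

    TermAttribute termAtt = cached.addAttribute(TermAttribute.class);
    while (cached.incrementToken()) {
        String term = termAtt.term();
        // ... update per-term statistics here ...
    }
    cached.reset();                        // rewinds the cache, not the underlying reader

    Document doc = new Document();
    doc.add(new Field("content", cached)); // indexed from the already-analyzed tokens
    writer.addDocument(doc);
}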

Best regards 
Alex


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Indexing-large-XML-dumps-tp2185926p2185926.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Could not find implementing class

2011-01-25 Thread Alex vB

Hello everybody,

I used a small indexing example from "Lucene in Action" and can compile and
run the program under Eclipse. If I compile and run it from the
console I get this error:


java.lang.IllegalArgumentException: Could not find implementing class for
org.apache.lucene.analysis.tokenattributes.TermAttribute
at
org.apache.lucene.util.AttributeSource$AttributeFactory$DefaultAttributeFactory.getClassForInterface(AttributeSource.java:87)
at
org.apache.lucene.util.AttributeSource$AttributeFactory$DefaultAttributeFactory.createAttributeInstance(AttributeSource.java:66)
at
org.apache.lucene.util.AttributeSource.addAttribute(AttributeSource.java:245)
at
org.apache.lucene.index.DocInverterPerThread$SingleTokenAttributeSource.(DocInverterPerThread.java:41)
at
org.apache.lucene.index.DocInverterPerThread$SingleTokenAttributeSource.(DocInverterPerThread.java:36)
at
org.apache.lucene.index.DocInverterPerThread.(DocInverterPerThread.java:34)
at org.apache.lucene.index.DocInverter.addThread(DocInverter.java:95)
at
org.apache.lucene.index.DocFieldProcessorPerThread.(DocFieldProcessorPerThread.java:62)
at
org.apache.lucene.index.DocFieldProcessor.addThread(DocFieldProcessor.java:88)
at
org.apache.lucene.index.DocumentsWriterThreadState.(DocumentsWriterThreadState.java:43)
at
org.apache.lucene.index.DocumentsWriter.getThreadState(DocumentsWriter.java:739)
at
org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:814)
at
org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:802)
at 
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1998)
at 
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1972)
at Demo.setUp(Demo.java:86)
at Demo.main(Demo.java:46)

I compile with javac -cp  Demo.java, which
finishes without errors, but running the program isn't possible. What am I
missing? Basically I am just creating a directory, getting an IndexWriter
with an analyzer, etc. Line 86 in Demo.java is writer.addDocument(doc);.

Greetings Alex
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Could-not-find-implementing-class-tp2330598p2330598.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Could not find implementing class

2011-01-25 Thread Alex vB

Hello Alexander,

isn't it enough to add the classpath through -cp? If I don't use -cp I can't
compile my project. I thought that after compiling without errors all sources
were correctly added. In Eclipse I added the Lucene sources the same way
(which works), and I also tried using the jar file. So all classes seem to be
found, but I can't make sense of the error message. This error message is
thrown by the Lucene class DefaultAttributeFactory in
org.apache.lucene.util.AttributeSource. I work under Ubuntu and configured
Java with

- sudo update-alternatives --config java
- sudo update-java-alternatives -java-6-sun

Greetings
Alex


-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Could-not-find-implementing-class-tp2330598p2331617.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



RE: Could not find implementing class

2011-01-25 Thread Alex vB

Hello Uwe,

I recompiled some classes in the Lucene sources manually. Now it's running
fine! Something had gone wrong there.

Thank you very much!

Best regards
Alex
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Could-not-find-implementing-class-tp2330598p2332141.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Storing payloads without term-position and frequency

2011-02-02 Thread Alex vB

Hello everybody,

I am currently using Lucene 3.0.2 with payloads. I store extra information
about the term in the payloads, like frequencies, and therefore I don't need
the frequencies and term positions normally stored by Lucene. I would like to
set f.setOmitTermFreqAndPositions(true), but then I am not able to retrieve
payloads. Would it be hard to "hack" Lucene for my needs? Moreover, I only
store one payload per term, if that information makes it easier.

Best regards
Alex
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Storing-payloads-without-term-position-and-frequency-tp2408094p2408094.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



How are stored Fields/Payloads loaded

2011-02-28 Thread Alex vB
Hello everybody,

I am currently unsure how stored data is written to and loaded from the index.
I want to store some binary data for every term of a document, but only once
and not for every position!
Therefore I am not sure whether payloads or stored fields are the better
solution (or the not-yet-implemented Column Stride Fields feature).

As far as I know, all fields of a document are loaded by Lucene during
search. With large stored fields this can be time consuming, and therefore
there is the possibility to load specific fields with a FieldSelector. Maybe I
could create a stored field for each term (up to several thousand fields!)
and read those fields depending on the query term. Is this a common
approach?
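
For the stored-field variant I imagine something like this per hit (untested
sketch; the "freq_" + term naming is only meant to illustrate the
one-field-per-term idea):

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.MapFieldSelector;
import org.apache.lucene.index.IndexReader;

// Load a single stored field for one hit instead of the whole document.
static byte[] loadTermData(IndexReader reader, int docId, String term) throws IOException {
    FieldSelector onlyOne = new MapFieldSelector(new String[] { "freq_" + term });
    Document doc = reader.document(docId, onlyOne);
    return doc.getBinaryValue("freq_" + term);   // null if the field is absent
}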
The other possibility (the one I have implemented at the moment) is to store
one payload per term, but only on the first term position. Payloads are
loaded only if I retrieve them from a hit, right? So my current posting list
looks like this:
http://lucene.472066.n3.nabble.com/file/n2598739/Payload.png
Picture adapted from M. McCandless, "Fun with Flex"

How will the Column Stride Fields feature (or per-document field) work? It's
not clear to me what "per document" exactly means for the posting list
entries. I think (hope :P) it works like this:
http://lucene.472066.n3.nabble.com/file/n2598739/CSD.png
Picture adapted from M. McCandless, "Fun with Flex"


Do I understand the Column Stride Field correctly? What would give me the
best performance (stored field, payload, CSD)? Are there other ways to
retrieve payloads during search than SpanQuery (I would like to use a normal
query here)?

Regards
Alex

-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/How-are-stored-Fields-Payloads-loaded-tp2598739p2598739.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Early Termination

2011-03-15 Thread Alex vB
Hi,

is Lucene capable of any early termination techniques during query
processing?
On the forum I only found some information about TimeLimitedCollector. Are
there more implementations?
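
The only technique I have found so far is the time-based one; used roughly
like this (untested sketch, 3.x-style API, where the class is nowadays called
TimeLimitingCollector; the result size and time budget are placeholders):

import java.io.IOException;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TimeLimitingCollector;
import org.apache.lucene.search.TopScoreDocCollector;

// Abort collection after a time budget; partial results stay available.
static ScoreDoc[] searchWithTimeout(IndexSearcher searcher, Query query) throws IOException {
    TopScoreDocCollector top = TopScoreDocCollector.create(10, true);
    Collector limited = new TimeLimitingCollector(top, 1000);   // 1000 ms budget
    try {
        searcher.search(query, limited);
    } catch (TimeLimitingCollector.TimeExceededException e) {
        // ran out of time; whatever was collected so far is still in "top"
    }
    return top.topDocs().scoreDocs;
}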

Regards
Alex

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Early-Termination-tp2684557p2684557.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Lucene 4.0 Payloads

2011-03-17 Thread Alex vB
Hello everybody,

I am currently experimenting with Lucene 4.0 and would like to add payloads.
A payload should only be added once per term, on its first position. My
current code looks like this:
public final boolean incrementToken() throws java.io.IOException {
    String term = characterAttr.toString();

    if (!input.incrementToken()) {
        return false;
    }

    // hmh contains all terms for one document
    if (hmh.checkKey(term)) {                                       // check if the hashmap contains the term
        Payload payload = new Payload(hmh.getCompressedData(term)); // get payload data
        payloadAttr.setPayload(payload);                            // add payload
        hmh.removeFromIndexingMap(term);                            // remove term from hashmap
    }

    return true;
}

Is this a correct way of adding payloads in Lucene 4.0? When I try to
retrieve the payloads, I am not getting a payload on the first position. For
reading payloads I use this:
DocsAndPositionsEnum tp = MultiFields.getTermPositionsEnum(ir,
        MultiFields.getDeletedDocs(ir), fieldName, new BytesRef(searchString));

while (tp.nextDoc() != tp.NO_MORE_DOCS) {
    if (tp.hasPayload() && counter < 10) {
        Document doc = ir.document(tp.docID());
        BytesRef br = tp.getPayload();
        System.out.println("Found payload \"" + br.utf8ToString() + "\" for document "
                + tp.docID() + " and query " + searchString
                + " in country " + doc.get("country"));
    }
}

As far as I know there are two possibilities for using payloads:
1) during similarity scoring
2) during search

Is there a better/faster way to retrieve payloads during search? Is it
possible to run a normal query and read the payloads from the hits? Is 1 or 2
the faster way to use payloads? Can I find example code somewhere for Lucene
and loading payloads?

Regards
Alex

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Lucene-4-0-Payloads-tp2695817p2695817.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



lucene-snowball 3.1.0 packages are missing?

2011-04-03 Thread Alex Ott
Hello

I'm trying to upgrade Lucene in my project to the 3.1.0 release, but there is
no lucene-snowball 3.1.0 package on Maven Central. Is this intended
behaviour? Should I continue to use 3.0.3 for the snowball package?


-- 
With best wishes, Alex Ott
http://alexott.blogspot.com/http://alexott.net/
http://alexott-ru.blogspot.com/
Skype: alex.ott

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



New codecs keep Freq skip/omit Pos

2011-04-21 Thread Alex vB
Hello everybody,

I am currently testing several new Lucene 4.0 codec implementations to
compare them with my own solution.
The difference is that I am only indexing frequencies and not positions. I
would like to have this for the other codecs too. I know there was already a
post on this topic:
http://lucene.472066.n3.nabble.com/Omit-positions-but-not-TF-td599710.html.

I just wanted to ask whether something has changed, especially for the new
codecs.
I had a look at the FixedPostingWriterImpl and PostingsConsumer. Are those
the right places for adapting pos/freq handling? What would happen if I
just skipped writing positions/payloads? Would it mess up the index?

The written files have different endings like .pyl, .skp, .pos, .doc etc. Does
"not counting" the .pos file give me a correct index size estimate for
with-freqs/without-positions? Or where exactly are term positions written?

Regards
Alex

PS: Some results with the current codecs, if someone is interested. I indexed
10% of Wikipedia (English).
Each version is indexed as a document.

Docs                    240179
Versions                8467927
Distinct Terms          3501214
Total Terms             1520008204
Avg. Versions           35.25
Avg. Terms per Version  179.50
Avg. Terms per Doc      6328.65

PforDelta W Freq W Pos          20.6 GB
PforDelta W/O Freq W/O Pos       1.6 GB
Standard 4.0 W Freq W Pos       28.1 GB
Standard 4.0 W/O Freq W/O Pos    6.2 GB
Pfor W Freq W Pos                 22 GB
Pfor W/O Freq W/O Pos            3.1 GB

Performance follows ;)


--
View this message in context: 
http://lucene.472066.n3.nabble.com/New-codecs-keep-Freq-skip-omit-Pos-tp2849776p2849776.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: New codecs keep Freq skip/omit Pos

2011-04-22 Thread Alex vB
Hello Robert,

thank you for the answers! :)
I used PatchedFrameOfRef and PatchedFrameOfRef2, so both
implementations are in fact PForDelta variants. Sorry, my mistake.

PatchedFrameOfRef2: PforDelta W/O Freq W/O Pos   1.6 GB 
PatchedFrameOfRef :  Pfor W/O Freq W/O Pos  3.1 GB 

Here are some numbers:
PatchedFrameOfRef2 w/o POS w/o FREQ
segements.gen  20 Bytes
_43.fdt  8,1 MB
_43.fdx  64,4 MB
_43.fnm  20 Bytes
_43_0.skp  182,6 MB
_43_0.tib  32,3 MB
_43_0.tiv  1,0 MB
segements_2  268 Bytes
_43_0.doc  1,3 GB

PatchedFrameOfRef w/o POS w/o FREQ
segements.gen  20 Bytes
_43.fdt  8,1 MB
_43.fdx  64,4 MB
_43.fnm  20 Bytes
_43_0.skp  182,6 MB
_43_0.tib  32,3 MB
_43_0.tiv  1,1 MB
segements_2  267 Bytes
_43_0.doc  2,8 GB

During indexing I use StandardAnalyzer (StandardFilter, LowerCaseFilter,
StopFilter).
Is there somewhere I can get more information about codec creation, or is it
just a matter of digging through the code?

My own implementation needs 2.8 GB of space including freqs but not positions.
This is why I am asking: I want to somehow compare the results. Compared to
20 GB it is very nice, and compared to 1.6 GB it is very bad ;).

Regards
Alex


--
View this message in context: 
http://lucene.472066.n3.nabble.com/New-codecs-keep-Freq-skip-omit-Pos-tp2849776p2851809.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: New codecs keep Freq skip/omit Pos

2011-04-22 Thread Alex vB
I also indexed one time with Lucene 3.0. Are those sizes really completely
the same?

Standard 4.0 W Freq W Pos   28.1 GB
Standard 4.0 W/O Freq W/O Pos   6.2 GB
Standard 3.0 W Freq W Pos   28.1 GB
Standard 3.0 WO Freq WO Pos 6.2 GB

Regards
Alex


--
View this message in context: 
http://lucene.472066.n3.nabble.com/New-codecs-keep-Freq-skip-omit-Pos-tp2849776p2851898.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: New codecs keep Freq skip/omit Pos

2011-04-22 Thread Alex vB
Wow cool ,

I will give that a try!

Thank you!!

Alex

--
View this message in context: 
http://lucene.472066.n3.nabble.com/New-codecs-keep-Freq-skip-omit-Pos-tp2849776p2852370.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: New codecs keep Freq skip/omit Pos

2011-04-23 Thread Alex vB
Hi Robert,


the adapted codec is running, but it seems to be incredibly slow, so this will
take some time ;)
Here are some performance results:

Indexing scheme                  Index Size                   Avg. Query Perf.        Max. Query Perf.
PforDelta2 W Freq W Pos          20.6 GB (3.3 GB w/o .pos)    81.97 ms                1295 ms
PforDelta2 W/O Freq W/O Pos      1.6 GB                       63.33 ms                766 ms
Standard 4.0 W Freq W Pos        28.1 GB (8.1 GB w/o .prx)    77.71 ms                978 ms
Standard 4.0 W/O Freq W/O Pos    6.2 GB                       59.93 ms                718 ms
Standard 3.0 W Freq W Pos        28.1 GB (8.1 GB w/o .prx)    71.41 ms                978 ms
Standard 3.0 W/O Freq W/O Pos    6.2 GB                       72.72 ms                845 ms
PforDelta W Freq W Pos           22 GB (5 GB w/o .pos)        67.98 ms                783 ms
PforDelta W/O Freq W/O Pos       3.1 GB                       56.08 ms                596 ms
Huffman BL10 W Freq W/O Pos      2.6 GB                       216.29 ms (Mem 14 ms)   1338 ms

I am a little bit curious about the Lucene 3.0 performance results, because
the larger index seems to be faster?! I already ran the test several times.
Are my results realistic at all? I thought PForDelta(2) would outperform the
standard index implementations in query processing.


The last result is my own implementation. I am still trying to get it
smaller, because I think I can improve the compression further. For indexing
I use PForDelta2 in combination with payloads; those are causing the higher
runtimes. In memory it looks nice. The gap between my solution and PForDelta
is already 700 MB, so I would say it is an improvement. :D I will have a look
at it again after I have built an index with your adapted implementation.


I still have another question. The basic idea of my implementation is to
create a "two-level" index structure, specialized for versioned
document collections. On the first level I create a posting list entry for a
document whenever a term occurs in one or more of its versions. The second
level holds the corresponding term frequency information. Is it possible to
build such a structure by creating a codec? For query processing it should
filter per boolean query on the first level and only fetch information from
the second level when the document is in the intersection of the first
level. At the moment I use payloads to "simulate" a two-level structure.
Normally all payloads corresponding to a query get fetched, right?


If this structure were possible, there are several more implementations
with promising results (Two-Level Diff/MSA in this paper:
http://cis.poly.edu/suel/papers/version.pdf).

Regards Alex



--
View this message in context: 
http://lucene.472066.n3.nabble.com/New-codecs-keep-Freq-skip-omit-Pos-tp2849776p284.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

Re: New codecs keep Freq skip/omit Pos

2011-04-23 Thread Alex vB
> it depends upon the type of query.. what queries are you using for
> this benchmarking and how are you benchmarking?
> FYI: for benchmarking standard query types with wikipedia you might be
> interested in http://code.google.com/a/apache-extras.org/p/luceneutil/

I have 10,000 queries from an AOL data set where the followed link led to
Wikipedia.
I benchmark by warming up the IndexSearcher with 5000 of them and perform the
test with the remaining 5000 queries. I just measure the time needed to
execute the queries. I use QueryParser.

> wait, you are indexing payloads for your tests with these other codecs
> when it says "W POS" ?

No, only my last implementation (Huffman) uses payloads; the others do not.
That is why I use a payload-aware query only for Huffman.

> keep in mind that even adding a single payload to your index slows
> down the decompression of the positions tremendously, because payload
> lengths are intertwined with the positions. For block codecs payloads
> really need to be done differently so that blocks of positions are
> really just blocks of positions. This hasn't yet been fixed for the
> sep nor the fixed layouts, so if you add any payloads, and then
> benchmark positional queries then the results are not realistic.

I knew that payloads slow down query processing, but I wasn't aware of the
block codec problem. I assume that by "not realistic" you mean they will be
even slower? Some numbers for the Huffman index:
20 bytes    segments.gen
234.6 KB    fdt
1.8 MB      fdx
20 bytes    fnm
626.1 MB    pos
1.7 GB      pyl
17.8 MB     skp
39.8 MB     tib
2028.5 KB   tiv
268 bytes   segments_2
214.6 MB    doc

For query processing I used the PayloadQueryParser here and adapted the
similarity according to my payloads.

> No they do not, only if you use a payload based query such as
> PayloadTermQuery. Normal non-positional queries like TermQuery and
> even normal positional queries like PhraseQuery don't fetch payloads
> at all...

Sorry, my question was misleading. I was already thinking of a payload-aware
query. When I use one, how exactly is the payload information fetched from
disk? For example, if a query needs to read two posting lists, are all
payloads for both lists fetched directly, or does Lucene first compute the
boolean intersection and then retrieve the payloads only for documents
within that intersection?

> From the description of what you are doing I don't understand how
> payloads fit in because they are per-position? But, I haven't had the
> time to digest the paper you sent yet.

I will try to summarize it and how I adapted it to Lucene.

I already mentioned the idea of two levels for versioned document
collections. When I parse Wikipedia I take, for each article, the union of
the terms of all its versions. From this bag of words I extract each
distinct term and index it with Lucene into one document. Frequency
information is now "lost" on the first level but will be stored on the
second. This is what I meant by "the first level contains a posting for a
document when a term occurs in at least one version". For example, if an
article has two versions, version1: "a b b" and version2: "a a a c c", only
'a', 'b' and 'c' are indexed.

For the second level I collect term frequency information during the parsing
step. These frequencies are stored as a vector in version order; for the
above example the frequency vector for 'a' would be [1, 3]. I store these
vectors as payloads, which I see as the "second level": every distinct term
on the first level receives a single frequency vector on its first position.
So I somewhat abuse payloads.
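
To make the payload part concrete, here is a simplified sketch of how such a
frequency-vector payload could be attached during analysis (not my exact
code; the 4-byte big-endian encoding is only for illustration, and it
assumes the trunk/4.x attribute API where a payload is a BytesRef):

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;
import java.io.IOException;
import java.util.Map;

public final class FrequencyVectorPayloadFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
  private final Map<String, int[]> termToFreqVector;  // distinct term -> freq per version

  public FrequencyVectorPayloadFilter(TokenStream in, Map<String, int[]> termToFreqVector) {
    super(in);
    this.termToFreqVector = termToFreqVector;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // the stream contains each distinct term exactly once, so this is the
    // first (and only) position of the term
    int[] freqs = termToFreqVector.get(termAtt.toString());
    if (freqs != null) {
      byte[] bytes = new byte[freqs.length * 4];
      for (int i = 0; i < freqs.length; i++) {        // 4-byte big-endian ints
        bytes[4 * i]     = (byte) (freqs[i] >>> 24);
        bytes[4 * i + 1] = (byte) (freqs[i] >>> 16);
        bytes[4 * i + 2] = (byte) (freqs[i] >>> 8);
        bytes[4 * i + 3] = (byte) freqs[i];
      }
      payloadAtt.setPayload(new BytesRef(bytes));
    }
    return true;
  }
}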

For query processing I now need to retrieve the docs and the payloads. It
would be optimal to process the posting lists first, ignoring payloads, and
then fetch the payloads (frequency information) only for the remaining docs.
The term frequency is then used for ranking. At the moment I rank with the
highest value from the frequency vector, which corresponds to the
best-matching version.
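
For the ranking side, a small helper along these lines decodes such a payload
and picks the maximum per-version frequency (again only a sketch that matches
the example encoding above):

static int maxVersionFrequency(BytesRef payload) {
  int max = 0;
  for (int i = payload.offset; i + 3 < payload.offset + payload.length; i += 4) {
    int freq = ((payload.bytes[i]     & 0xFF) << 24)
             | ((payload.bytes[i + 1] & 0xFF) << 16)
             | ((payload.bytes[i + 2] & 0xFF) << 8)
             |  (payload.bytes[i + 3] & 0xFF);
    max = Math.max(max, freq);
  }
  return max;
}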

Regards
Alex








Lucene query processing

2011-04-26 Thread Alex vB
Hello everybody,

As far as I know, Lucene processes documents document-at-a-time (DAAT).
Depending on the query, either the intersection or the union of the posting
lists is calculated. For the intersection only documents occurring in all
posting lists are scored; in the union case every document is scored, which
makes it a more expensive operation.

Lucene stores its index in several files, and depending on the query
different files may be accessed for scoring. For example, a payload query
needs to read payloads from .pos.

What is not clear to me is how term frequencies and payloads are processed.
Assuming I want to store term frequencies, I need to set
setOmitTermFreqAndPositions(false).
1) Which queries use term frequencies? I assume all queries do, if term
frequencies are stored?
2) Why is fetching payloads so much more expensive than getting term
frequencies? Both are stored in separate files and therefore require a disk
seek.
3) What value does tf contain if I set setOmitTermFreqAndPositions(true)?
Always 1?
4) How are term freqs and payloads read from disk? In bulk for all remaining
docs at once, or each time a document gets scored?

Regards
Alex







Similarity class and searchPayloads

2011-06-08 Thread Alex vB
Hello everybody,

I am just curious about the following case.
Currently I create a boolean AND query which loads payloads.
In some cases Lucene loads payloads but does not return any hits.

Therefore I assume that payloads are loaded directly with each doc ID from
the posting list, before the boolean filtering. Is that right?
Is it possible to filter documents first and only then load the payload?
For example, with three terms I would check in every posting list whether
the current doc ID is available, and only then load its payload.

Or can anybody tell me where exactly Lucene loads payloads in the code?

Regards
Alex





Question about the CompoundWordTokenFilterBase

2013-09-18 Thread Alex Parvulescu
Hi,

While trying to play with the CompoundWordTokenFilterBase I noticed that
the behavior is to include the original token together with the new
sub-tokens.

I assume this is expected (I haven't found any relevant docs on this), but I
was wondering whether it's a hard requirement, or whether I could propose a
small change to skip the original token (controlled by a flag)?

If there's interest I can put this in a JIRA issue and we can continue the
discussion there.

The patch is not too complicated, but I haven't run any of the tests yet :)

thanks,
alex


Performance issues with the default field compression

2014-04-09 Thread Alex Parvulescu
Hi,

I was investigating some performance issues and during profiling I noticed
that a significant amount of time is being spent decompressing fields which
are unrelated to the actual field I'm trying to load from the Lucene
documents. In our benchmark, which mostly does a simple full-text search,
40% of the time was lost in these parts.

My code does the following: reader.document(id, Set(":path")).get(":path"),
and this is where the fun begins :)
I noticed 2 things, please excuse the ignorance if some of the things I
write here are not 100% correct:

 - all the fields in the document are being decompressed prior to applying
the field filter. We've noticed this because we have a lot of content
stored in the index, so a significant amount of time is lost decompressing
data we never use. At one point I tried adding the field first, thinking
this would save some work, but it doesn't seem to help much. Reference
code: the visitor is only used at the very end. [0]

 - second, and probably of smaller impact, would be to have the
DocumentStoredFieldVisitor return STOP when there are no more fields left
to visit. I only have one field, and it looks like the reader will #skip
through a bunch of other fields before finishing a document. [1]
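
For reference, the kind of single-field visitor I have in mind (only a
sketch; stringField signature as in the current 4.x StoredFieldVisitor API):

import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.StoredFieldVisitor;

class SingleFieldVisitor extends StoredFieldVisitor {
  private final String field;
  private String value;

  SingleFieldVisitor(String field) { this.field = field; }

  @Override
  public Status needsField(FieldInfo fieldInfo) {
    if (value != null) {
      return Status.STOP;   // already got the one field we care about
    }
    return field.equals(fieldInfo.name) ? Status.YES : Status.NO;
  }

  @Override
  public void stringField(FieldInfo fieldInfo, String v) {
    value = v;              // remember the value; the next needsField() call stops
  }

  String value() { return value; }
}

// usage (":path" as in the code above):
// SingleFieldVisitor visitor = new SingleFieldVisitor(":path");
// reader.document(id, visitor);
// String path = visitor.value();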

thanks in advance,
alex


[0]
https://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/codecs/compressing/CompressingStoredFieldsReader.java?view=markup#l364

[1]
https://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/document/DocumentStoredFieldVisitor.java?view=markup#l100


Re: Performance issues with the default field compression

2014-04-10 Thread Alex Parvulescu
Hi Adrien,

Thanks for clarifying!
We're going to go the custom codec & custom visitor route.

best,
alex



On Wed, Apr 9, 2014 at 10:38 PM, Adrien Grand  wrote:

> Hi Alex,
>
> Indeed, one or several (the number depends on the size of your
> documents) documents need to be fully decompressed in order to read a
> single field of a single document.
>
> Regarding the stored fields visitor, the default one doesn't return
> STOP when the field has been found because other fields with the same
> name might be stored further in the stream of stored fields (in case
> of a multivalued field). If you know that you have a single field
> value, you can write your own field visitor that will return STOP
> after the first value has been read. As you noted, this probably has
> less impact on performance than the first point that you raised.
>
> The default stored fields format is rather targeted at large indices
> where compression helps save disk space and can also make stored
> fields retrieval faster since a larger portion of the stored fields
> can fit in the filesystem cache. However, if your index is small and
> fully fits in the filesystem cache, this stored fields format might
> indeed have non-negligible overhead.
>
>
> On Wed, Apr 9, 2014 at 9:17 PM, Alex Parvulescu
>  wrote:
> > Hi,
> >
> > I was investigating some performance issues and during profiling I
> noticed
> > that there is a significant amount of time being spent decompressing
> fields
> > which are unrelated to the actual field I'm trying to load from the
> lucene
> > documents. In our benchmark doing mostly a simple full-test search, 40%
> of
> > the time was lost in these parts.
> >
> > My code does the following: reader.document(id,
> Set(":path")).get(":path"),
> > and this is where the fun begins :)
> > I noticed 2 things, please excuse the ignorance if some of the things I
> > write here are not 100% correct:
> >
> >  - all the fields in the document are being decompressed prior to
> applying
> > the field filter. We've noticed this because we have a lot of content
> > stored in the index, so there is an important time lost around
> > decompressing junk. At one point I tried adding the field first, thinking
> > this will save some work, but it doesn't look like it's doing much.
> > Reference code, the visitor is only used at the very end. [0]
> >
> >  - second, and probably of a smaller impact would be to have the
> > DocumentStoredFieldVisitor return STOP when there are no more fields in
> the
> > visitor to visit. I only have one, and it looks like it will #skip
> through
> > a bunch of other stuff before finishing a document. [1]
> >
> > thanks in advance,
> > alex
> >
> >
> > [0]
> >
> https://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/codecs/compressing/CompressingStoredFieldsReader.java?view=markup#l364
> >
> > [1]
> >
> https://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/document/DocumentStoredFieldVisitor.java?view=markup#l100
>
>
>
> --
> Adrien
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Lucene 5.2.0 global ordinal based query time join on multiple indexes

2015-07-20 Thread Alex Pang
Hi,



Does the Global Ordinal based query time join support joining on multiple
indexes?



From my testing on 2 indexes with a common join field, the document ids I
get back from the ScoreDoc[] when searching are incorrect, though the
number of results is the same as if I use the older join query.


For the parent (to) index, the value of the join field is unique to each
document.

For the child (from) index, multiple documents can have the same value for
the join field, which must be found in the parent index.

Both indexes have a join field indexed with SortedDocValuesField.


The parent index had 7 segments and child index had 3 segments.


Ordinal map is built with:

SortedDocValues[] values = new SortedDocValues[searcher1.getIndexReader().leaves().size()];
for (LeafReaderContext leafContext : searcher1.getIndexReader().leaves()) {
  values[leafContext.ord] = DocValues.getSorted(leafContext.reader(), "join_field");
}
MultiDocValues.OrdinalMap ordinalMap = MultiDocValues.OrdinalMap.build(
    searcher1.getIndexReader().getCoreCacheKey(), values, PackedInts.DEFAULT);

Join Query:

joinQuery = JoinUtil.createJoinQuery("join_field", fromQuery,
    new TermQuery(new Term("type", "to")), searcher2,
    ScoreMode.Max, ordinalMap);



Thanks,

Alex


Re: Lucene 5.2.0 global ordinal based query time join on multiple indexes

2015-07-21 Thread Alex Pang
It seems that if I create a MultiReader from the readers behind my index
searchers, build the ordinal map from that MultiReader, and use an
IndexSearcher created from the MultiReader in createJoinQuery, then the
correct results are found.
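
Roughly (Lucene 5.2; reader1 and reader2 stand for the readers behind the
two searchers):

MultiReader multiReader = new MultiReader(reader1, reader2);
IndexSearcher multiSearcher = new IndexSearcher(multiReader);

SortedDocValues[] values = new SortedDocValues[multiReader.leaves().size()];
for (LeafReaderContext leaf : multiReader.leaves()) {
  values[leaf.ord] = DocValues.getSorted(leaf.reader(), "join_field");
}
MultiDocValues.OrdinalMap ordinalMap = MultiDocValues.OrdinalMap.build(
    multiReader.getCoreCacheKey(), values, PackedInts.DEFAULT);

joinQuery = JoinUtil.createJoinQuery("join_field", fromQuery,
    new TermQuery(new Term("type", "to")), multiSearcher,
    ScoreMode.Max, ordinalMap);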


On Mon, Jul 20, 2015 at 5:48 PM, Alex Pang  wrote:

> Hi,
>
>
>
> Does the Global Ordinal based query time join support joining on multiple
> indexes?
>
>
>
> From my testing on 2 indexes with a common join field, the document ids I
> get back from the ScoreDoc[] when searching are incorrect, though the
> number of results is the same as if I use the older join query.
>
>
> For the parent (to) index, the value of the join field is unique to each
> document.
>
> For the child (from) index, multiple documents can have the same value for
> the join field, which must be found in the parent index.
>
> Both indexes have a join field indexed with SortedDocValuesField.
>
>
> The parent index had 7 segments and child index had 3 segments.
>
>
> Ordinal map is built with:
>
> SortedDocValues[] values = new SortedDocValues[searcher1
>
> .getIndexReader().leaves().size()];
>
> for (LeafReaderContext leadContext : searcher1.getIndexReader()
>
> .leaves()) {
>
>   values[leadContext.ord] = DocValues.getSorted(leadContext.reader(),
>
>   "join_field");
>
> }
>
> MultiDocValues.OrdinalMap ordinalMap = null;
>
> ordinalMap = MultiDocValues.OrdinalMap.build(searcher1.getIndexReader()
>
> .getCoreCacheKey(), values, PackedInts.DEFAULT);
>
>
> Join Query:
>
> joinQuery = JoinUtil.createJoinQuery("join_field",
>
>   fromQuery,
>
>   new TermQuery(new Term("type", "to")), searcher2,
>
>   ScoreMode.Max, ordinalMap);
>
>
>
> Thanks,
>
> Alex
>


LUCENE-8396 performance result?

2018-07-17 Thread alex stark
LUCENE-8396 looks pretty good for LBS use cases. Are there performance
results for this approach? It appears to me that it would greatly reduce the
number of terms needed to index a polygon, but how about search performance?
Does it also perform well for complex polygons with hundreds or more
coordinates?

Legacy filter strategy in Lucene 6.0

2018-08-08 Thread alex stark
As FilteredQuery was removed in Lucene 6.0, we should use a boolean query to
do the filtering. What about the legacy filter strategies such as
LEAP_FROG_FILTER_FIRST_STRATEGY or QUERY_FIRST_FILTER_STRATEGY? What is the
current filter strategy? Thanks,

Re: Legacy filter strategy in Lucene 6.0

2018-08-09 Thread alex stark
Thanks Adrien. I want to filter out docs based on conditions which are stored
in doc values (those conditions are unselective ranges, which are not
appropriate to put into the inverted index), so I plan to use some selective
term conditions for a first-round search and then filter in a second phase.
I see there is a two-phase iterator, but I did not find how to use it. Is
this an appropriate scenario for the two-phase iterator, or is it better to
do it in a collector? Is there any guide to the two-phase iterator?

Best Regards

On Wed, 08 Aug 2018 16:08:39 +0800 Adrien Grand wrote:

> Hi Alex,
>
> These strategies still exist internally, but BooleanQuery decides which one
> to use automatically based on the cost API (cheaper clauses run first) and
> whether sub clauses produce bitset-based or postings-based iterators.
>
> On Wed, 8 Aug 2018 at 09:46, alex stark wrote:
> > As FilteredQuery was removed in Lucene 6.0, we should use a boolean query
> > to do the filtering. What about the legacy filter strategies such as
> > LEAP_FROG_FILTER_FIRST_STRATEGY or QUERY_FIRST_FILTER_STRATEGY? What is
> > the current filter strategy? Thanks,

RE: Legacy filter strategy in Lucene 6.0

2018-08-09 Thread alex stark
Thanks Uwe, I think you are recommending IndexOrDocValuesQuery/DocValuesRangeQuery
and the article by Adrien:
https://www.elastic.co/blog/better-query-planning-for-range-queries-in-elasticsearch
It looks promising for my requirement, I will give it a try.

On Thu, 09 Aug 2018 16:04:27 +0800 Uwe Schindler wrote:

> Hi,
>
> IMHO: I'd split the whole code into a BooleanQuery with two filter clauses.
> The reverse index based condition (term condition, e.g., TermInSetQuery)
> gets added as an Occur.FILTER and the DocValues condition is a separate
> Occur.FILTER. If Lucene executes such a query, it would use the more
> specific condition (based on cost) to lead the execution, which should be
> the terms condition. The docvalues condition is then only checked for
> matches of the first. But you can still go and implement the two-phase
> iterator, but I'd not do that.
>
> Uwe
>
> -----
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> > -----Original Message-----
> > From: alex stark
> > Sent: Thursday, August 9, 2018 9:12 AM
> > To: java-user
> > Cc: java-user@lucene.apache.org
> > Subject: Re: Legacy filter strategy in Lucene 6.0
> >
> > Thanks Adrien, I want to filter out docs based on conditions which are
> > stored in doc values (those conditions are unselective ranges, which are
> > not appropriate to put into the inverted index), so I plan to use some
> > selective term conditions to do a first-round search and then filter in a
> > second phase. I see there is a two-phase iterator, but I did not find how
> > to use it. Is this an appropriate scenario for the two-phase iterator, or
> > is it better to do it in a collector? Is there any guide to the two-phase
> > iterator?
> >
> > Best Regards
> >
> > On Wed, 08 Aug 2018 16:08:39 +0800 Adrien Grand wrote:
> > > Hi Alex, these strategies still exist internally, but BooleanQuery
> > > decides which one to use automatically based on the cost API (cheaper
> > > clauses run first) and whether sub clauses produce bitset-based or
> > > postings-based iterators.
> > >
> > > On Wed, 8 Aug 2018 at 09:46, alex stark wrote:
> > > > As FilteredQuery was removed in Lucene 6.0, we should use a boolean
> > > > query to do the filtering. What about the legacy filter strategies
> > > > such as LEAP_FROG_FILTER_FIRST_STRATEGY or QUERY_FIRST_FILTER_STRATEGY?
> > > > What is the current filter strategy? Thanks,
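
For reference, this is roughly how I read the suggestion (field names are
made up; IndexOrDocValuesQuery needs a newer Lucene than 6.0):

Query termsFilter = new TermInSetQuery("category",
    new BytesRef("books"), new BytesRef("music"));
Query rangeFilter = new IndexOrDocValuesQuery(
    LongPoint.newRangeQuery("price", 10L, 100L),
    SortedNumericDocValuesField.newSlowRangeQuery("price", 10L, 100L));

Query query = new BooleanQuery.Builder()
    .add(termsFilter, BooleanClause.Occur.FILTER)  // selective clause leads the execution
    .add(rangeFilter, BooleanClause.Occur.FILTER)  // range only verifies the matches
    .build();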

Replacement of CollapsingTopDocsCollector

2018-08-20 Thread alex stark
In Lucene 7.x CollapsingTopDocsCollector has been removed; is there any
replacement for it?

Any way to improve document fetching performance?

2018-08-27 Thread alex stark
Hello experts, I am wondering whether there is any way to improve document
fetching performance; it appears to me that reading stored fields is quite
slow. I simply tested using IndexSearcher.doc() to fetch 2000 documents,
which takes 50 ms. Is there any idea to improve that?

Re: Any way to improve document fetching performance?

2018-08-27 Thread alex stark
Quite small, just several simple short-text stored fields. The total index
size is around 1 GB (2M docs).

On Mon, 27 Aug 2018 22:12:07 +0800, ... wrote:
> Alex, how big are those docs?
>
> Best regards
>
> On 8/27/18 10:09 AM, alex stark wrote:
> > Hello experts, I am wondering whether there is any way to improve document
> > fetching performance; it appears to me that reading stored fields is quite
> > slow. I simply tested using IndexSearcher.doc() to fetch 2000 documents,
> > which takes 50 ms. Is there any idea to improve that?

Re: Any way to improve document fetching performance?

2018-08-27 Thread alex stark
Same machine, no network latency in between. When I reduce the limit to 500
it takes 20 ms, which is also slower than I expected. Btw, indexing is
stopped.

On Mon, 27 Aug 2018 22:17:41 +0800, ... wrote:
> Yes, it should actually be less than a ms for those types of files. Index
> and search on the same machine? No net latency in between?
>
> Best
>
> On 8/27/18 10:14 AM, alex stark wrote:
> > Quite small, just several simple short-text stored fields. The total index
> > size is around 1 GB (2M docs).
> >
> > On Mon, 27 Aug 2018 22:12:07 +0800, ... wrote:
> > > Alex, how big are those docs?
> > >
> > > On 8/27/18 10:09 AM, alex stark wrote:
> > > > Hello experts, I am wondering whether there is any way to improve
> > > > document fetching performance; it appears to me that reading stored
> > > > fields is quite slow. I simply tested using IndexSearcher.doc() to
> > > > fetch 2000 documents, which takes 50 ms. Is there any idea to improve
> > > > that?

Re: Any way to improve document fetching performance?

2018-08-28 Thread alex stark
I simply tried MultiDocValues.getBinaryValues to fetch the results via doc
values, and it improves things a lot: 2000 results take only 5 ms. I even
encoded all the returnable fields into binary doc values and then decoded
them, and the results are also good enough. It seems stored fields do not
perform well here.

In our scenario (which I think is more common nowadays), the search phase
should return as many results as possible so that the ranking phase can
re-sort the results with a machine learning algorithm (on other clusters),
so fetching performance is also important.

On Tue, 28 Aug 2018 00:11:40 +0800, Erick Erickson wrote:
> Don't use that call. You're exactly right, it goes out to disk, reads the
> doc, decompresses it (16K blocks minimum per doc IIUC) all just to get the
> field. 2,000 in 50ms actually isn't bad for all that work ;). This sounds
> like an XY problem. You're asking how to speed up fetching docs, but not
> telling us anything about _why_ you want to do this. Fetching 2,000 docs is
> not generally what Solr was built for, it's built for returning the top N
> where N is usually < 100, most frequently < 20. If you want to return lots
> of documents' data you should seriously look at putting the fields you want
> in docValues=true fields and pulling from there. The entire Streaming
> functionality is built on this and is quite fast.
>
> Best,
> Erick
>
> On Mon, Aug 27, 2018 at 7:35 AM ... wrote:
> > can you post your query string?
> >
> > Best
> >
> > On 8/27/18 10:33 AM, alex stark wrote:
> > > Same machine, no net latency in between. When I reduce the limit to 500
> > > it takes 20 ms, which is also slower than I expected. Btw, indexing is
> > > stopped.
> > >
> > > On Mon, 27 Aug 2018 22:17:41 +0800, ... wrote:
> > > > Yes, it should actually be less than a ms for those types of files.
> > > > Index and search on the same machine? No net latency in between?
> > > >
> > > > On 8/27/18 10:14 AM, alex stark wrote:
> > > > > Quite small, just several simple short-text stored fields. The total
> > > > > index size is around 1 GB (2M docs).
> > > > >
> > > > > On Mon, 27 Aug 2018 22:12:07 +0800, ... wrote:
> > > > > > Alex, how big are those docs?
> > > > > >
> > > > > > On 8/27/18 10:09 AM, alex stark wrote:
> > > > > > > Hello experts, I am wondering whether there is any way to improve
> > > > > > > document fetching performance; it appears to me that reading
> > > > > > > stored fields is quite slow. I simply tested using
> > > > > > > IndexSearcher.doc() to fetch 2000 documents, which takes 50 ms.
> > > > > > > Is there any idea to improve that?
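
Roughly what I tried (Lucene 7.x doc values iterator API; "payload" is a
made-up name for the binary doc values field holding the encoded fields):

BinaryDocValues dv = MultiDocValues.getBinaryValues(searcher.getIndexReader(), "payload");
ScoreDoc[] hits = topDocs.scoreDocs.clone();
Arrays.sort(hits, Comparator.comparingInt(h -> h.doc));  // iterator must advance in doc id order
for (ScoreDoc hit : hits) {
  if (dv.advanceExact(hit.doc)) {
    BytesRef bytes = dv.binaryValue();
    // decode bytes into the fields to return
  }
}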

Lucene coreClosedListeners memory issues

2019-06-03 Thread alex stark
Hi experts,



I recently ran into memory issues with Lucene. Looking at a heap dump, most
of the heap is occupied by SegmentCoreReaders.coreClosedListeners, which
accounts for nearly half of the total.





Dominator Tree

num  retain size(bytes)  percent  percent(live) class Name



0   14,024,859,136   21.76%   28.23%   
com.elesearch.activity.core.engine.lucene.LuceneIndex

|

10,259,490,504   15.92%   20.65% 
--org.apache.lucene.index.SegmentCoreReaders

|

10,258,783,280   15.92%   20.65%   --[field coreClosedListeners] 
java.util.Collections$SynchronizedSet

|

10,258,783,248   15.92%   20.65% --[field c] java.util.LinkedHashSet

|

10,258,783,224   15.92%   20.65%   --[field map] java.util.LinkedHashMap



1   11,865,993,448   18.41%   23.89%   
com.elesearch.activity.core.engine.lucene.LuceneIndex



2   11,815,171,240   18.33%   23.79%   
com.elesearch.activity.core.engine.lucene.LuceneIndex



3   6,504,382,648   10.09%   13.09%   
com.elesearch.activity.core.engine.lucene.LuceneIndex

|

5,050,933,760   7.84%   10.17% --org.apache.lucene.index.SegmentCoreReaders

|

5,050,256,008   7.84%   10.17%   --[field coreClosedListeners] 
java.util.Collections$SynchronizedSet

|

5,050,255,976   7.84%   10.17% --[field c] java.util.LinkedHashSet

|

5,050,255,952   7.84%   10.17%   --[field map] java.util.LinkedHashMap



4   2,798,684,240   4.34%   5.63%   
com.elesearch.activity.core.engine.lucene.LuceneIndex



thread stack



histogram

num instances  #bytes  percent  class Name



0 497,527  38,955,989,888  60.44%  long[]

1  18,489,470   7,355,741,784  11.41%  short[]

2  18,680,799   3,903,937,088  6.06%  byte[]

3  35,643,993   3,775,822,640  5.86%  char[]

4   4,017,462   1,851,518,792  2.87%  int[]

5   7,788,280 962,103,784  1.49%  java.lang.Object[]

6   5,256,391 618,467,640  0.96%  java.lang.String[]

7  14,974,224 479,175,168  0.74%  java.lang.String

8   9,585,494 460,103,712  0.71%  java.util.HashMap$Node

9  18,133,885 435,213,240  0.68%  
org.apache.lucene.util.RoaringDocIdSet$ShortArrayDocIdSet

10   1,559,661 351,465,624  0.55%  java.util.HashMap$Node[]

11   4,132,738 264,495,232  0.41%  java.util.HashMap

12   1,519,178 243,068,480  0.38%  java.lang.reflect.Method

13   4,068,400 195,283,200  0.30%  
com.sun.org.apache.xerces.internal.xni.QName

14   1,181,106 183,932,704  0.29%  org.apache.lucene.search.DocIdSet[]

15   5,721,339 183,082,848  0.28%  java.lang.StringBuilder

16   1,515,804 181,896,480  0.28%  java.lang.reflect.Field

17 348,720 134,652,416  0.21%  
com.sun.org.apache.xerces.internal.xni.QName[]

18   3,358,251 134,330,040  0.21%  java.util.ArrayList

19   2,775,517  88,816,544  0.14%  org.apache.lucene.util.BytesRef

total  232,140,701  64,452,630,104


We used LRUQueryCache with maxSize 1000 and maxRamBytesUsed   64MB.



The coreClosedListeners occupy much more heap than I expected; is there any
reason for that?

Re: Lucene coreClosedListeners memory issues

2019-06-03 Thread alex stark
Hi Adrien,



I didn't open readers directly; they are controlled by a SearcherManager.







 On Mon, 03 Jun 2019 16:32:06 +0800 Adrien Grand  wrote 




It looks like you are leaking readers. 
 
On Mon, Jun 3, 2019 at 9:46 AM alex stark <alex.st...@zoho.com.invalid> wrote: 
> 
> Hi experts, 
> 
> 
> 
> I recently have memory issues on Lucene. By checking heap dump, most of them 
> are occupied by SegmentCoreReaders.coreClosedListeners which is about nearly 
> half of all. 
> 
> 
> 
> 
> 
> Dominator Tree 
> 
> num  retain size(bytes)  percent  percent(live) class Name 
> 
>  
> 
> 0   14,024,859,136   21.76%   28.23%   
> com.elesearch.activity.core.engine.lucene.LuceneIndex 
> 
> | 
> 
> 10,259,490,504   15.92%   20.65% 
> --org.apache.lucene.index.SegmentCoreReaders 
> 
> | 
> 
> 10,258,783,280   15.92%   20.65%   --[field coreClosedListeners] 
> java.util.Collections$SynchronizedSet 
> 
> | 
> 
> 10,258,783,248   15.92%   20.65% --[field c] java.util.LinkedHashSet 
> 
> | 
> 
> 10,258,783,224   15.92%   20.65%   --[field map] 
> java.util.LinkedHashMap 
> 
>  
> 
> 1   11,865,993,448   18.41%   23.89%   
> com.elesearch.activity.core.engine.lucene.LuceneIndex 
> 
>  
> 
> 2   11,815,171,240   18.33%   23.79%   
> com.elesearch.activity.core.engine.lucene.LuceneIndex 
> 
>  
> 
> 36,504,382,648   10.09%   13.09%   
> com.elesearch.activity.core.engine.lucene.LuceneIndex 
> 
> | 
> 
> 5,050,933,760   7.84%   10.17% 
> --org.apache.lucene.index.SegmentCoreReaders 
> 
> | 
> 
> 5,050,256,008   7.84%   10.17%   --[field coreClosedListeners] 
> java.util.Collections$SynchronizedSet 
> 
> | 
> 
> 5,050,255,976   7.84%   10.17% --[field c] java.util.LinkedHashSet 
> 
> | 
> 
> 5,050,255,952   7.84%   10.17%   --[field map] 
> java.util.LinkedHashMap 
> 
>  
> 
> 42,798,684,240   4.34%   5.63%   
> com.elesearch.activity.core.engine.lucene.LuceneIndex 
> 
> 
> 
> thread stack 
> 
> 
> 
> histogram 
> 
> num instances  #bytes  percent  class Name 
> 
>  
> 
> 0 497,527  38,955,989,888  60.44%  long[] 
> 
> 1  18,489,470   7,355,741,784  11.41%  short[] 
> 
> 2  18,680,799   3,903,937,088  6.06%  byte[] 
> 
> 3  35,643,993   3,775,822,640  5.86%  char[] 
> 
> 4   4,017,462   1,851,518,792  2.87%  int[] 
> 
> 5   7,788,280 962,103,784  1.49%  java.lang.Object[] 
> 
> 6   5,256,391 618,467,640  0.96%  java.lang.String[] 
> 
> 7  14,974,224 479,175,168  0.74%  java.lang.String 
> 
> 8   9,585,494 460,103,712  0.71%  java.util.HashMap$Node 
> 
> 9  18,133,885 435,213,240  0.68%  
> org.apache.lucene.util.RoaringDocIdSet$ShortArrayDocIdSet 
> 
> 10   1,559,661 351,465,624  0.55%  java.util.HashMap$Node[] 
> 
> 11   4,132,738 264,495,232  0.41%  java.util.HashMap 
> 
> 12   1,519,178 243,068,480  0.38%  java.lang.reflect.Method 
> 
> 13   4,068,400 195,283,200  0.30%  
> com.sun.org.apache.xerces.internal.xni.QName 
> 
> 14   1,181,106 183,932,704  0.29%  
> org.apache.lucene.search.DocIdSet[] 
> 
> 15   5,721,339 183,082,848  0.28%  java.lang.StringBuilder 
> 
> 16   1,515,804 181,896,480  0.28%  java.lang.reflect.Field 
> 
> 17 348,720 134,652,416  0.21%  
> com.sun.org.apache.xerces.internal.xni.QName[] 
> 
> 18   3,358,251 134,330,040  0.21%  java.util.ArrayList 
> 
> 19   2,775,517  88,816,544  0.14%  org.apache.lucene.util.BytesRef 
> 
> total  232,140,701  64,452,630,104 
> 
> 
> We used LRUQueryCache with maxSize 1000 and maxRamBytesUsed   64MB. 
> 
> 
> 
> The coreClosedListeners occupied too much heap than I expected, is there any 
> reason for that? 
 
 
 
-- 
Adrien

Re: Lucene coreClosedListeners memory issues

2019-06-03 Thread alex stark
Thanks Adrien.



I double-checked all the acquire calls, and every acquired searcher is
released in a finally block.
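
i.e. every search path looks like the usual pattern:

IndexSearcher searcher = searcherManager.acquire();
try {
  // run the query with this searcher
} finally {
  searcherManager.release(searcher);
}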



What exactly does SegmentCoreReaders.coreClosedListeners do? It seems to be
used to purge caches when a segment core is closed.



The GC log suggests it is very likely a leak: the old gen keeps growing and
does not drop after CMS collections.



Why did coreClosedListeners grow to such a high number in a single day?



 On Mon, 03 Jun 2019 18:21:34 +0800 Adrien Grand  wrote 



And do you call release on every searcher that you acquire? 
 
On Mon, Jun 3, 2019 at 11:47 AM alex stark <alex.st...@zoho.com> wrote: 
> 
> Hi Adrien, 
> 
> I didn't directly open readers. It is controlled by searcher manager. 
> 
> 
> 
>  On Mon, 03 Jun 2019 16:32:06 +0800 Adrien Grand 
> <mailto:jpou...@gmail.com> wrote  
> 
> It looks like you are leaking readers. 
> 
> On Mon, Jun 3, 2019 at 9:46 AM alex stark 
> <mailto:alex.st...@zoho.com.invalid> wrote: 
> > 
> > Hi experts, 
> > 
> > 
> > 
> > I recently have memory issues on Lucene. By checking heap dump, most of 
> > them are occupied by SegmentCoreReaders.coreClosedListeners which is about 
> > nearly half of all. 
> > 
> > 
> > 
> > 
> > 
> > Dominator Tree 
> > 
> > num retain size(bytes) percent percent(live) class Name 
> > 
> >  
> > 
> > 0 14,024,859,136 21.76% 28.23% 
> > com.elesearch.activity.core.engine.lucene.LuceneIndex 
> > 
> > | 
> > 
> > 10,259,490,504 15.92% 20.65% --org.apache.lucene.index.SegmentCoreReaders 
> > 
> > | 
> > 
> > 10,258,783,280 15.92% 20.65% --[field coreClosedListeners] 
> > java.util.Collections$SynchronizedSet 
> > 
> > | 
> > 
> > 10,258,783,248 15.92% 20.65% --[field c] java.util.LinkedHashSet 
> > 
> > | 
> > 
> > 10,258,783,224 15.92% 20.65% --[field map] java.util.LinkedHashMap 
> > 
> >  
> > 
> > 1 11,865,993,448 18.41% 23.89% 
> > com.elesearch.activity.core.engine.lucene.LuceneIndex 
> > 
> >  
> > 
> > 2 11,815,171,240 18.33% 23.79% 
> > com.elesearch.activity.core.engine.lucene.LuceneIndex 
> > 
> >  
> > 
> > 3 6,504,382,648 10.09% 13.09% 
> > com.elesearch.activity.core.engine.lucene.LuceneIndex 
> > 
> > | 
> > 
> > 5,050,933,760 7.84% 10.17% --org.apache.lucene.index.SegmentCoreReaders 
> > 
> > | 
> > 
> > 5,050,256,008 7.84% 10.17% --[field coreClosedListeners] 
> > java.util.Collections$SynchronizedSet 
> > 
> > | 
> > 
> > 5,050,255,976 7.84% 10.17% --[field c] java.util.LinkedHashSet 
> > 
> > | 
> > 
> > 5,050,255,952 7.84% 10.17% --[field map] java.util.LinkedHashMap 
> > 
> >  
> > 
> > 4 2,798,684,240 4.34% 5.63% 
> > com.elesearch.activity.core.engine.lucene.LuceneIndex 
> > 
> > 
> > 
> > thread stack 
> > 
> > 
> > 
> > histogram 
> > 
> > num instances #bytes percent class Name 
> > 
> >  
> > 
> > 0 497,527 38,955,989,888 60.44% long[] 
> > 
> > 1 18,489,470 7,355,741,784 11.41% short[] 
> > 
> > 2 18,680,799 3,903,937,088 6.06% byte[] 
> > 
> > 3 35,643,993 3,775,822,640 5.86% char[] 
> > 
> > 4 4,017,462 1,851,518,792 2.87% int[] 
> > 
> > 5 7,788,280 962,103,784 1.49% java.lang.Object[] 
> > 
> > 6 5,256,391 618,467,640 0.96% java.lang.String[] 
> > 
> > 7 14,974,224 479,175,168 0.74% java.lang.String 
> > 
> > 8 9,585,494 460,103,712 0.71% java.util.HashMap$Node 
> > 
> > 9 18,133,885 435,213,240 0.68% 
> > org.apache.lucene.util.RoaringDocIdSet$ShortArrayDocIdSet 
> > 
> > 10 1,559,661 351,465,624 0.55% java.util.HashMap$Node[] 
> > 
> > 11 4,132,738 264,495,232 0.41% java.util.HashMap 
> > 
> > 12 1,519,178 243,068,480 0.38% java.lang.reflect.Method 
> > 
> > 13 4,068,400 195,283,200 0.30% com.sun.org.apache.xerces.internal.xni.QName 
> > 
> > 14 1,181,106 183,932,704 0.29% org.apache.lucene.search.DocIdSet[] 
> > 
> > 15 5,721,339 183,082,848 0.28% java.lang.StringBuilder 
> > 
> > 16 1,515,804 181,896,480 0.28% java.lang.reflect.Field 
> > 
> > 17 348,720 134,652,416 0.21% com.sun.org.apache.xerces.internal.xni.QName[] 
> > 
> > 18 3,358,251 134,330,040 0.21% java.util.ArrayList 
> > 
> > 19 2,775,517 88,816,544 0.14% org.apache.lucene.util.BytesRef 
> > 
> > total 232,140,701 64,452,630,104 
> > 
> > 
> > We used LRUQueryCache with maxSize 1000 and maxRamBytesUsed 64MB. 
> > 
> > 
> > 
> > The coreClosedListeners occupied too much heap than I expected, is there 
> > any reason for that? 
> 
> 
> 
> -- 
> Adrien 
> 
> 
> 
 
 
-- 
Adrien 
 

Optimizing a boolean query for 100s of term clauses

2020-06-23 Thread Alex K
Hello all,

I'm working on an Elasticsearch plugin (using Lucene internally) that
allows users to index numerical vectors and run exact and approximate
k-nearest-neighbors similarity queries.
I'd like to get some feedback about my usage of BooleanQueries and
TermQueries, and see if there are any optimizations or performance tricks
for my use case.

An example use case for the plugin is reverse image search. A user can
store vectors representing images and run a nearest-neighbors query to
retrieve the 10 vectors with the smallest L2 distance to a query vector.
More detailed documentation here: http://elastiknn.klibisz.com/

The main method for indexing the vectors is based on Locality Sensitive
Hashing <https://en.wikipedia.org/wiki/Locality-sensitive_hashing>.
The general pattern is:

   1. When indexing a vector, apply a hash function to it, producing a set
   of discrete hashes. Usually there are anywhere from 100 to 1000 hashes.
   Similar vectors are more likely to share hashes (i.e., similar vectors
   produce hash collisions).
   2. Convert each hash to a byte array and store the byte array as a
   Lucene Term at a specific field.
   3. Store the complete vector (i.e. floating point numbers) in a binary
   doc values field.

In other words, I'm converting each vector into a bag of words, though the
words have no semantic meaning.

A query works as follows:

   1. Given a query vector, apply the same hash function to produce a set
   of hashes.
   2. Convert each hash to a byte array and create a Term.
   3. Build and run a BooleanQuery with a clause for each Term. Each clause
   looks like this: `new BooleanClause(new ConstantScoreQuery(new
   TermQuery(new Term(field, new BytesRef(hashValue.toByteArray)))),
   BooleanClause.Occur.SHOULD)`.
   4. As the BooleanQuery produces results, maintain a fixed-size heap of
   its scores. For any score exceeding the min in the heap, load its vector
   from the binary doc values, compute the exact similarity, and update the
   heap. Otherwise the vector gets a score of 0.

When profiling my benchmarks with VisualVM, I've found the Elasticsearch
search threads spend > 50% of the runtime in these two methods:

   - org.apache.lucene.search.DisiPriorityQueue.downHeap (~58% of runtime)
   - org.apache.lucene.search.DisjunctionDISIApproximation.nextDoc (~8% of
   runtime)

So the time seems to be dominated by collecting and ordering the results
produced by the BooleanQuery from step 3 above.
The exact similarity computation is only about 15% of the runtime. If I
disable it entirely, I still see the same bottlenecks in VisualVM.
Reducing the number of hashes yields roughly linear scaling (i.e., 400
hashes take ~2x longer than 200 hashes).

The use case seems different to text search in that there's no semantic
meaning to the terms, their length, their ordering, their stems, etc.
I basically just need the index to be a rudimentary HashMap, and I only
care about the scores for the top k results.
With that in mind, I've made the following optimizations:

   - Disabled tokenization on the FieldType (setTokenized(false))
   - Disabled norms on the FieldType (setOmitNorms(true))
   - Set similarity to BooleanSimilarity on the elasticsearch
   MappedFieldType
   - Set index options to IndexOptions.Docs.
   - Used the MoreLikeThis heuristic to pick a subset of terms. This
   understandably only yields a speedup proportional to the number of
   discarded terms.
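
A minimal sketch of those FieldType settings (Lucene 8.x):

FieldType hashesType = new FieldType();
hashesType.setTokenized(false);                 // hashes are indexed as single opaque terms
hashesType.setOmitNorms(true);                  // no length normalization
hashesType.setIndexOptions(IndexOptions.DOCS);  // no term freqs or positions
hashesType.freeze();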

I'm using Elasticsearch version 7.6.2 with Lucene 8.4.0.
The main query implementation is here
<https://github.com/alexklibisz/elastiknn/blob/c951cf562ab0f911ee760c8be47c19aba98504b9/plugin/src/main/scala/com/klibisz/elastiknn/query/LshQuery.scala>
.
<https://github.com/alexklibisz/elastiknn/blob/c951cf562ab0f911ee760c8be47c19aba98504b9/plugin/src/main/scala/com/klibisz/elastiknn/query/LshQuery.scala>
The actual query that gets executed by Elasticsearch is instantiated on line
98
<https://github.com/alexklibisz/elastiknn/blob/c951cf562ab0f911ee760c8be47c19aba98504b9/plugin/src/main/scala/com/klibisz/elastiknn/query/LshQuery.scala#L98>
.
It's in Scala but all of the Java query classes should look familiar.

Maybe there are some settings that I'm not aware of?
Maybe I could optimize this by implementing a custom query or scorer?
Maybe there's just no way to speed this up?

I appreciate any input, examples, links, etc.. :)
Also, let me know if I can provide any additional details.

Thanks,
Alex Klibisz


Re: Optimizing a boolean query for 100s of term clauses

2020-06-23 Thread Alex K
Hi Michael,
Thanks for the quick response!

I will look into the TermInSetQuery.

My usage of "heap" might've been confusing.
I'm using a FunctionScoreQuery from Elasticsearch.
This gets instantiated with a Lucene query, in this case the boolean query
as I described it, as well as a custom ScoreFunction object.
The ScoreFunction exposes a single method that takes a doc id and the
BooleanQuery score for that doc id, and returns another score.
In that method I use a MinMaxPriorityQueue from the Guava library to
maintain a fixed-capacity subset of the highest-scoring docs and evaluate
exact similarity on them.
Once the queue is at capacity, I just return 0 for any docs that had a
boolean query score smaller than the min in the queue.

But you can actually forget entirely that this ScoreFunction exists. It
only contributes ~6% of the runtime.
Even if I only use the BooleanQuery by itself, I still see the same
behavior and bottlenecks.

Thanks
- AK


On Tue, Jun 23, 2020 at 2:06 PM Michael Sokolov  wrote:

> You might consider using a TermInSetQuery in place of a BooleanQuery
> for the hashes (since they are all in the same field).
>
> I don't really understand why you are seeing so much cost in the heap
> - it's sounds as if you have a single heap with mixed scores - those
> generated by the BooleanQuery and those generated by the vector
> scoring operation. Maybe you comment a little more on the interaction
> there - are there really two heaps? Do you override the standard
> collector?
>
> On Tue, Jun 23, 2020 at 9:51 AM Alex K  wrote:
> >
> > Hello all,
> >
> > I'm working on an Elasticsearch plugin (using Lucene internally) that
> > allows users to index numerical vectors and run exact and approximate
> > k-nearest-neighbors similarity queries.
> > I'd like to get some feedback about my usage of BooleanQueries and
> > TermQueries, and see if there are any optimizations or performance tricks
> > for my use case.
> >
> > An example use case for the plugin is reverse image search. A user can
> > store vectors representing images and run a nearest-neighbors query to
> > retrieve the 10 vectors with the smallest L2 distance to a query vector.
> > More detailed documentation here: http://elastiknn.klibisz.com/
> >
> > The main method for indexing the vectors is based on Locality Sensitive
> > Hashing <https://en.wikipedia.org/wiki/Locality-sensitive_hashing>.
> > The general pattern is:
> >
> >1. When indexing a vector, apply a hash function to it, producing a
> set
> >of discrete hashes. Usually there are anywhere from 100 to 1000
> hashes.
> >Similar vectors are more likely to share hashes (i.e., similar vectors
> >produce hash collisions).
> >2. Convert each hash to a byte array and store the byte array as a
> >Lucene Term at a specific field.
> >3. Store the complete vector (i.e. floating point numbers) in a binary
> >doc values field.
> >
> > In other words, I'm converting each vector into a bag of words, though
> the
> > words have no semantic meaning.
> >
> > A query works as follows:
> >
> >1. Given a query vector, apply the same hash function to produce a set
> >of hashes.
> >2. Convert each hash to a byte array and create a Term.
> >3. Build and run a BooleanQuery with a clause for each Term. Each
> clause
> >looks like this: `new BooleanClause(new ConstantScoreQuery(new
> >TermQuery(new Term(field, new BytesRef(hashValue.toByteArray))),
> >BooleanClause.Occur.SHOULD))`.
> >4. As the BooleanQuery produces results, maintain a fixed-size heap of
> >its scores. For any score exceeding the min in the heap, load its
> vector
> >from the binary doc values, compute the exact similarity, and update
> the
> >heap. Otherwise the vector gets a score of 0.
> >
> > When profiling my benchmarks with VisualVM, I've found the Elasticsearch
> > search threads spend > 50% of the runtime in these two methods:
> >
> >- org.apache.lucene.search.DisiPriorityQueue.downHeap (~58% of
> runtime)
> >- org.apache.lucene.search.DisjunctionDISIApproximation.nextDoc (~8%
> of
> >runtime)
> >
> > So the time seems to be dominated by collecting and ordering the results
> > produced by the BooleanQuery from step 3 above.
> > The exact similarity computation is only about 15% of the runtime. If I
> > disable it entirely, I still see the same bottlenecks in VisualVM.
> > Reducing the number of hashes yields roughly linear scaling (i.e., 400
> > hashes take ~2x longer than 200 hashes).
> >

Re: Optimizing a boolean query for 100s of term clauses

2020-06-23 Thread Alex K
The TermInSetQuery is definitely faster. Unfortunately it doesn't seem to
return the number of terms that matched in a given document. Rather it just
returns the boost value. I'll look into copying/modifying the internals to
return the number of matched terms.

Thanks
- AK

On Tue, Jun 23, 2020 at 3:17 PM Alex K  wrote:

> Hi Michael,
> Thanks for the quick response!
>
> I will look into the TermInSetQuery.
>
> My usage of "heap" might've been confusing.
> I'm using a FunctionScoreQuery from Elasticsearch.
> This gets instantiated with a Lucene query, in this case the boolean query
> as I described it, as well as a custom ScoreFunction object.
> The ScoreFunction exposes a single method that takes a doc id and the
> BooleanQuery score for that doc id, and returns another score.
> In that method I use a MinMaxPriorityQueue from the Guava library to
> maintain a fixed-capacity subset of the highest-scoring docs and evaluate
> exact similarity on them.
> Once the queue is at capacity, I just return 0 for any docs that had a
> boolean query score smaller than the min in the queue.
>
> But you can actually forget entirely that this ScoreFunction exists. It
> only contributes ~6% of the runtime.
> Even if I only use the BooleanQuery by itself, I still see the same
> behavior and bottlenecks.
>
> Thanks
> - AK
>
>
> On Tue, Jun 23, 2020 at 2:06 PM Michael Sokolov 
> wrote:
>
>> You might consider using a TermInSetQuery in place of a BooleanQuery
>> for the hashes (since they are all in the same field).
>>
>> I don't really understand why you are seeing so much cost in the heap
>> - it's sounds as if you have a single heap with mixed scores - those
>> generated by the BooleanQuery and those generated by the vector
>> scoring operation. Maybe you comment a little more on the interaction
>> there - are there really two heaps? Do you override the standard
>> collector?
>>
>> On Tue, Jun 23, 2020 at 9:51 AM Alex K  wrote:
>> >
>> > Hello all,
>> >
>> > I'm working on an Elasticsearch plugin (using Lucene internally) that
>> > allows users to index numerical vectors and run exact and approximate
>> > k-nearest-neighbors similarity queries.
>> > I'd like to get some feedback about my usage of BooleanQueries and
>> > TermQueries, and see if there are any optimizations or performance
>> tricks
>> > for my use case.
>> >
>> > An example use case for the plugin is reverse image search. A user can
>> > store vectors representing images and run a nearest-neighbors query to
>> > retrieve the 10 vectors with the smallest L2 distance to a query vector.
>> > More detailed documentation here: http://elastiknn.klibisz.com/
>> >
>> > The main method for indexing the vectors is based on Locality Sensitive
>> > Hashing <https://en.wikipedia.org/wiki/Locality-sensitive_hashing>.
>> > The general pattern is:
>> >
>> >1. When indexing a vector, apply a hash function to it, producing a
>> set
>> >of discrete hashes. Usually there are anywhere from 100 to 1000
>> hashes.
>> >Similar vectors are more likely to share hashes (i.e., similar
>> vectors
>> >produce hash collisions).
>> >2. Convert each hash to a byte array and store the byte array as a
>> >Lucene Term at a specific field.
>> >3. Store the complete vector (i.e. floating point numbers) in a
>> binary
>> >doc values field.
>> >
>> > In other words, I'm converting each vector into a bag of words, though
>> the
>> > words have no semantic meaning.
>> >
>> > A query works as follows:
>> >
>> >1. Given a query vector, apply the same hash function to produce a
>> set
>> >of hashes.
>> >2. Convert each hash to a byte array and create a Term.
>> >3. Build and run a BooleanQuery with a clause for each Term. Each
>> clause
>> >looks like this: `new BooleanClause(new ConstantScoreQuery(new
>> >TermQuery(new Term(field, new BytesRef(hashValue.toByteArray))),
>> >BooleanClause.Occur.SHOULD))`.
>> >4. As the BooleanQuery produces results, maintain a fixed-size heap
>> of
>> >its scores. For any score exceeding the min in the heap, load its
>> vector
>> >from the binary doc values, compute the exact similarity, and update
>> the
>> >heap. Otherwise the vector gets a score of 0.
>> >
>> > When profiling my benchmarks with Visual

Re: Optimizing a boolean query for 100s of term clauses

2020-06-24 Thread Alex K
Thanks Michael. I managed to translate the TermInSetQuery into Scala
yesterday so now I can modify it in my codebase. This seems promising so
far. Fingers crossed there's a way to maintain scores without basically
converging to the BooleanQuery implementation.
- AK

On Wed, Jun 24, 2020 at 8:40 AM Michael Sokolov  wrote:

> Yeah that will require some changes since what it does currently is to
> maintain a bitset, and or into it repeatedly (once for each term's
> docs). To maintain counts, you'd need a counter per doc (rather than a
> bit), and you might lose some of the speed...
>
> On Tue, Jun 23, 2020 at 8:52 PM Alex K  wrote:
> >
> > The TermsInSetQuery is definitely faster. Unfortunately it doesn't seem
> to
> > return the number of terms that matched in a given document. Rather it
> just
> > returns the boost value. I'll look into copying/modifying the internals
> to
> > return the number of matched terms.
> >
> > Thanks
> > - AK
> >
> > On Tue, Jun 23, 2020 at 3:17 PM Alex K  wrote:
> >
> > > Hi Michael,
> > > Thanks for the quick response!
> > >
> > > I will look into the TermInSetQuery.
> > >
> > > My usage of "heap" might've been confusing.
> > > I'm using a FunctionScoreQuery from Elasticsearch.
> > > This gets instantiated with a Lucene query, in this case the boolean
> query
> > > as I described it, as well as a custom ScoreFunction object.
> > > The ScoreFunction exposes a single method that takes a doc id and the
> > > BooleanQuery score for that doc id, and returns another score.
> > > In that method I use a MinMaxPriorityQueue from the Guava library to
> > > maintain a fixed-capacity subset of the highest-scoring docs and
> evaluate
> > > exact similarity on them.
> > > Once the queue is at capacity, I just return 0 for any docs that had a
> > > boolean query score smaller than the min in the queue.
> > >
> > > But you can actually forget entirely that this ScoreFunction exists. It
> > > only contributes ~6% of the runtime.
> > > Even if I only use the BooleanQuery by itself, I still see the same
> > > behavior and bottlenecks.
> > >
> > > Thanks
> > > - AK
> > >
> > >
> > > On Tue, Jun 23, 2020 at 2:06 PM Michael Sokolov 
> > > wrote:
> > >
> > >> You might consider using a TermInSetQuery in place of a BooleanQuery
> > >> for the hashes (since they are all in the same field).
> > >>
> > >> I don't really understand why you are seeing so much cost in the heap
> > >> - it's sounds as if you have a single heap with mixed scores - those
> > >> generated by the BooleanQuery and those generated by the vector
> > >> scoring operation. Maybe you comment a little more on the interaction
> > >> there - are there really two heaps? Do you override the standard
> > >> collector?
> > >>
> > >> On Tue, Jun 23, 2020 at 9:51 AM Alex K  wrote:
> > >> >
> > >> > Hello all,
> > >> >
> > >> > I'm working on an Elasticsearch plugin (using Lucene internally)
> that
> > >> > allows users to index numerical vectors and run exact and
> approximate
> > >> > k-nearest-neighbors similarity queries.
> > >> > I'd like to get some feedback about my usage of BooleanQueries and
> > >> > TermQueries, and see if there are any optimizations or performance
> > >> tricks
> > >> > for my use case.
> > >> >
> > >> > An example use case for the plugin is reverse image search. A user
> can
> > >> > store vectors representing images and run a nearest-neighbors query
> to
> > >> > retrieve the 10 vectors with the smallest L2 distance to a query
> vector.
> > >> > More detailed documentation here: http://elastiknn.klibisz.com/
> > >> >
> > >> > The main method for indexing the vectors is based on Locality
> Sensitive
> > >> > Hashing <https://en.wikipedia.org/wiki/Locality-sensitive_hashing>.
> > >> > The general pattern is:
> > >> >
> > >> >1. When indexing a vector, apply a hash function to it,
> producing a
> > >> set
> > >> >of discrete hashes. Usually there are anywhere from 100 to 1000
> > >> hashes.
> > >> >Similar vectors are more likely to share hashes (i.e., similar
> > 

Re: Optimizing a boolean query for 100s of term clauses

2020-06-24 Thread Alex K
Hi Toke. Indeed a nice coincidence. It's an interesting and fun problem
space!

My implementation isn't specific to any particular dataset or access
pattern (i.e. infinite vs. subset).
So far the plugin supports exact L1, L2, Jaccard, Hamming, and Angular
similarities with LSH variants for all but L1.
My exact implementation is generally faster than the approximate LSH
implementation, hence the thread.
You make a good point that this is valuable by itself if you're able to
filter down to a small subset of docs.
I put a lot of work into optimizing the vector serialization speed and the
exact query execution.
I imagine with my current implementation there is some breaking point where
LSH becomes faster than exact, but so far I've tested with ~1.2M
~300-dimensional vectors and exact is still faster, especially when
parallelized across many shards.
So speeding up LSH is the current engineering challenge.

Are you using Elasticsearch or Lucene directly?
If you're using ES and have the time, I'd love some feedback on my plugin.
It sounds like you want to compute hamming similarity on your bitmaps?
If so that's currently supported.
There's an example here:
http://demo.elastiknn.klibisz.com/dataset/mnist-hamming?queryId=64121

Also I've compiled a small literature review on some related research here:
https://docs.google.com/document/d/14Z7ZKk9dq29bGeDDmBH6Bsy92h7NvlHoiGhbKTB0YJs/edit
*Fast and Exact NNS in Hamming Space on Full-Text Search Engines* describes
some clever tricks to speed up Hamming similarity.
*Large Scale Image Retrieval with Elasticsearch* describes the idea of
using the largest absolute magnitude values instead of the full vector.
Perhaps you've already read them but I figured I'd share.

Cheers
- AK



On Wed, Jun 24, 2020 at 8:44 AM Toke Eskildsen  wrote:

> On Tue, 2020-06-23 at 09:50 -0400, Alex K wrote:
> > I'm working on an Elasticsearch plugin (using Lucene internally) that
> > allows users to index numerical vectors and run exact and approximate
> > k-nearest-neighbors similarity queries.
>
> Quite a coincidence. I'm looking into the same thing :-)
>
> >   1. When indexing a vector, apply a hash function to it, producing
> > a set of discrete hashes. Usually there are anywhere from 100 to 1000
> > hashes.
>
> Is it important to have "infinite" scaling with inverted index or is it
> acceptable to have a (fast) sequential scan through all documents? If
> the use case is to combine the nearest neighbour search with other
> filters, so that the effective search-space is relatively small, you
> could go directly to computing the Euclidian distance (or whatever you
> use to calculate the exact similarity score).
>
> >   4. As the BooleanQuery produces results, maintain a fixed-size
> > heap of its scores. For any score exceeding the min in the heap, load
> > its vector from the binary doc values, compute the exact similarity,
> > and update the heap.
>
> I did something quite similar for a non-Lucene bases proof of concept,
> except that I delayed the exact similarity calculation and over-
> collected on the heap.
>
> Fleshing that out: Instead of producing similarity hashes, I extracted
> the top-X strongest signals (entries in the vector) and stored them as
> indexes from the raw vector, so the top-3 signals from [10, 3, 6, 12,
> 1, 20] are [0, 3, 5]. The query was similar to your "match as many as
> possible", just with indexes instead of hashes.
>
> >- org.apache.lucene.search.DisiPriorityQueue.downHeap (~58% of
> > runtime)
>
> This sounds strange. How large is your queue? Object-based priority
> queues tend to become slow when they get large (100K+ values).
>
> > Maybe I could optimize this by implementing a custom query or scorer?
>
> My plan for a better implementation is to use an autoencoder to produce
> a condensed representation of the raw vector for a document. In order
> to do so, a network must be trained on (ideally) the full corpus, so it
> will require a bootstrap process and will probably work poorly if
> incoming vectors differ substantially in nature from the existing ones
> (at least until the autoencoder is retrained and the condensed
> representations are reindexed). As our domain is an always growing
> image collection with fairly defined types of images (portraits, line
> drawings, maps...) and since new types are introduced rarely, this is
> acceptable for us.
>
> Back to Lucene, the condensed representation is expected to be a bitmap
> where the (coarse) similarity between two representations is simply the
> number of set bits at the same locations: An AND and a POPCNT of the
> bitmaps.
>
> Th

Re: Optimizing a boolean query for 100s of term clauses

2020-06-25 Thread Alex K
Hi Tommaso, thanks for the input and links! I'll add your paper to my
literature review.

So far I've seen very promising results from modifying the TermInSetQuery.
It was pretty simple to keep a map of `doc id -> matched term count` and
then only evaluate the exact similarity on the top k doc ids.
On a small benchmark, I was able to drop the time for 1000 queries from 45
seconds to 14 seconds.
Now the bottleneck is back in my own code, which I'm happy with because I
can optimize that more easily.
Hopefully I can merge these changes in the next couple days, and I'll post
the diff when I do.
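
Simplified, the per-segment counting and top-k selection have roughly this
shape (a sketch of the idea, not the actual diff; a short[] is enough for
the counts because there are at most a few hundred query terms):

// imports: org.apache.lucene.index.*, org.apache.lucene.search.DocIdSetIterator,
//          org.apache.lucene.util.BytesRef, java.util.*
static int[] topKByMatchCount(LeafReader reader, String field,
                              List<BytesRef> queryTerms, int k) throws IOException {
  short[] counts = new short[reader.maxDoc()];
  Terms terms = reader.terms(field);
  if (terms == null) return new int[0];
  TermsEnum termsEnum = terms.iterator();
  PostingsEnum postings = null;
  for (BytesRef term : queryTerms) {
    if (termsEnum.seekExact(term)) {
      postings = termsEnum.postings(postings, PostingsEnum.NONE);
      for (int d = postings.nextDoc();
           d != DocIdSetIterator.NO_MORE_DOCS;
           d = postings.nextDoc()) {
        counts[d]++;
      }
    }
  }
  // Min-heap on count keeps the k doc ids with the most matched terms.
  PriorityQueue<int[]> heap = new PriorityQueue<>((a, b) -> a[1] - b[1]);
  for (int d = 0; d < counts.length; d++) {
    if (counts[d] == 0) continue;
    if (heap.size() < k) heap.offer(new int[] {d, counts[d]});
    else if (counts[d] > heap.peek()[1]) {
      heap.poll();
      heap.offer(new int[] {d, counts[d]});
    }
  }
  return heap.stream().mapToInt(e -> e[0]).toArray();
}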

- AK



On Thu, Jun 25, 2020 at 5:07 AM Tommaso Teofili 
wrote:

> hi Alex,
>
> I had worked on a similar problem directly on Lucene (within Anserini
> toolkit) using LSH fingerprints of tokenized feature vector values.
> You can find code at [1] and some information on the Anserini documentation
> page [2] and in a short preprint [3].
> As a side note my current thinking is that it would be very cool if we
> could leverage Lucene N dimensional point support by properly reducing the
> dimensionality of the original vectors, however that is hard to do without
> losing important information.
>
> My 2 cents,
> Tommaso
>
> [1] :
>
> https://github.com/castorini/anserini/tree/master/src/main/java/io/anserini/ann
> [2] :
>
> https://github.com/castorini/anserini/blob/master/docs/approximate-nearestneighbor.md
> [3] : https://arxiv.org/abs/1910.10208
>
>
>
>
>
> On Wed, 24 Jun 2020 at 19:47, Alex K  wrote:
>
> > Hi Toke. Indeed a nice coincidence. It's an interesting and fun problem
> > space!
> >
> > My implementation isn't specific to any particular dataset or access
> > pattern (i.e. infinite vs. subset).
> > So far the plugin supports exact L1, L2, Jaccard, Hamming, and Angular
> > similarities with LSH variants for all but L1.
> > My exact implementation is generally faster than the approximate LSH
> > implementation, hence the thread.
> > You make a good point that this is valuable by itself if you're able to
> > filter down to a small subset of docs.
> > I put a lot of work into optimizing the vector serialization speed and
> the
> > exact query execution.
> > I imagine with my current implementation there is some breaking point
> where
> > LSH becomes faster than exact, but so far I've tested with ~1.2M
> > ~300-dimensional vectors and exact is still faster, especially when
> > parallelized across many shards.
> > So speeding up LSH is the current engineering challenge.
> >
> > Are you using Elasticsearch or Lucene directly?
> > If you're using ES and have the time, I'd love some feedback on my
> plugin.
> > It sounds like you want to compute hamming similarity on your bitmaps?
> > If so that's currently supported.
> > There's an example here:
> > http://demo.elastiknn.klibisz.com/dataset/mnist-hamming?queryId=64121
> >
> > Also I've compiled a small literature review on some related research
> here:
> >
> >
> https://docs.google.com/document/d/14Z7ZKk9dq29bGeDDmBH6Bsy92h7NvlHoiGhbKTB0YJs/edit
> > *Fast and Exact NNS in Hamming Space on Full-Text Search Engines*
> describes
> > some clever tricks to speed up Hamming similarity.
> > *Large Scale Image Retrieval with Elasticsearch* describes the idea of
> > using the largest absolute magnitude values instead of the full vector.
> > Perhaps you've already read them but I figured I'd share.
> >
> > Cheers
> > - AK
> >
> >
> >
> > On Wed, Jun 24, 2020 at 8:44 AM Toke Eskildsen  wrote:
> >
> > > On Tue, 2020-06-23 at 09:50 -0400, Alex K wrote:
> > > > I'm working on an Elasticsearch plugin (using Lucene internally) that
> > > > allows users to index numerical vectors and run exact and approximate
> > > > k-nearest-neighbors similarity queries.
> > >
> > > Quite a coincidence. I'm looking into the same thing :-)
> > >
> > > >   1. When indexing a vector, apply a hash function to it, producing
> > > > a set of discrete hashes. Usually there are anywhere from 100 to 1000
> > > > hashes.
> > >
> > > Is it important to have "infinite" scaling with inverted index or is it
> > > acceptable to have a (fast) sequential scan through all documents? If
> > > the use case is to combine the nearest neighbour search with other
> > > filters, so that the effective search-space is relatively small, you
> > could go directly to computing the Euclidean distance (or whatever you
> >

Re: ANN search current state

2020-07-15 Thread Alex K
Hi Mikhail,

I'm not sure about the state of ANN in lucene proper. Very interested to
see the response from others.
I've been doing some work on ANN for an Elasticsearch plugin:
http://elastiknn.klibisz.com/
I think it's possible to extract my custom queries and modeling code so
that it's elasticsearch-agnostic and can be used directly in Lucene apps.
However I'm much more familiar with Elasticsearch's APIs and usage/testing
patterns than I am with raw Lucene, so I'd likely need to get some help
from the Lucene community.
Please LMK if that sounds interesting to anyone.

- Alex



On Wed, Jul 15, 2020 at 11:11 AM Mikhail  wrote:

>
> Hi,
>
>I want to incorporate semantic search in my project, which uses
> Lucene. I want to use sentence embeddings and ANN (approximate nearest
> neighbor) search. I found the related Lucene issues:
> https://issues.apache.org/jira/browse/LUCENE-9004 ,
> https://issues.apache.org/jira/browse/LUCENE-9136 ,
> https://issues.apache.org/jira/browse/LUCENE-9322 . I see that there
> are some related work and related PRs. What is the current state of this
> functionality?
>
> --
> Thanks,
> Mikhail
>
>


Optimizing term-occurrence counting (code included)

2020-07-23 Thread Alex K
Hi all,

I am working on a query that takes a set of terms, finds all documents
containing at least one of those terms, computes a subset of candidate docs
with the most matching terms, and applies a user-provided scoring function
to each of the candidate docs

Simple example of the query:
- query terms ("aaa", "bbb")
- indexed docs with terms:
  docId 0 has terms ("aaa", "bbb")
  docId 1 has terms ("aaa", "ccc")
- number of top candidates = 1
- simple scoring function score(docId) = docId + 10
The query first builds a count array [2, 1], because docId 0 contains two
matching terms and docId 1 contains 1 matching term.
Then it picks docId 0 as the candidate subset.
Then it applies the scoring function, returning a score of 10 for docId 0.

The main bottleneck right now is doing the initial counting, i.e. the part
that returns the [2, 1] array.

I first started by using a BooleanQuery containing a SHOULD clause for every
Term, so the returned score was the count. This was simple but very slow.
Then I got a substantial speedup by copying and modifying the
TermInSetQuery so that it tracks the number of times each docId contains a
query term. The main construct here seems to be PrefixCodedTerms.

At this point I'm not sure if there's any faster construct, or perhaps a
more optimal way to use PrefixCodedTerms?
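
For context, the terms are encoded once into a PrefixCodedTerms (a compact,
sorted, prefix-shared list) and then replayed against each segment, roughly
like this (simplified from what TermInSetQuery does internally; the field
name is a placeholder):

// Terms must be added in sorted order, which is why TermInSetQuery sorts first.
PrefixCodedTerms.Builder builder = new PrefixCodedTerms.Builder();
for (BytesRef term : sortedTerms) {
  builder.add("hashes", term);
}
PrefixCodedTerms prefixCoded = builder.finish();

// Per segment: replay the term list and drive seekExact / postings,
// incrementing the per-doc count as described above.
PrefixCodedTerms.TermIterator iterator = prefixCoded.iterator();
for (BytesRef term = iterator.next(); term != null; term = iterator.next()) {
  if (termsEnum.seekExact(term)) {
    // read postings and count matches per doc id ...
  }
}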

Here is the specific query, highlighting some specific parts of the code:
- Build the PrefixCodedTerms (in my case the terms are called 'hashes'):
https://github.com/alexklibisz/elastiknn/blob/c75b23f/plugin/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L27-L33
- Count the matching terms in a segment (this is the main bottleneck in my
query):
https://github.com/alexklibisz/elastiknn/blob/c75b23f/plugin/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L54-L73

I appreciate any suggestions you might have.

- Alex


Re: Optimizing term-occurrence counting (code included)

2020-07-24 Thread Alex K
Thanks Ali. I don't think that will work in this case, since the data I'm
counting is managed by lucene, but that looks like an interesting project.
-Alex

On Fri, Jul 24, 2020, 00:15 Ali Akhtar  wrote:

> I'm new to lucene so I'm not sure what the best way of speeding this up in
> Lucene is, but I've previously used https://github.com/npgall/cqengine for
> similar stuff. It provided really good performance, especially if you're
> just counting things.
>
> On Fri, Jul 24, 2020 at 6:55 AM Alex K  wrote:
>
> > Hi all,
> >
> > I am working on a query that takes a set of terms, finds all documents
> > containing at least one of those terms, computes a subset of candidate
> docs
> > with the most matching terms, and applies a user-provided scoring
> function
> > to each of the candidate docs
> >
> > Simple example of the query:
> > - query terms ("aaa", "bbb")
> > - indexed docs with terms:
> >   docId 0 has terms ("aaa", "bbb")
> >   docId 1 has terms ("aaa", "ccc")
> > - number of top candidates = 1
> > - simple scoring function score(docId) = docId + 10
> > The query first builds a count array [2, 1], because docId 0 contains two
> > matching terms and docId 1 contains 1 matching term.
> > Then it picks docId 0 as the candidate subset.
> > Then it applies the scoring function, returning a score of 10 for docId
> 0.
> >
> > The main bottleneck right now is doing the initial counting, i.e. the
> part
> > that returns the [2, 1] array.
> >
> > I first started by using a BoolQuery containing a Should clause for every
> > Term, so the returned score was the count. This was simple but very slow.
> > Then I got a substantial speedup by copying and modifying the
> > TermInSetQuery so that it tracks the number of times each docId contains
> a
> > query term. The main construct here seems to be PrefixCodedTerms.
> >
> > At this point I'm not sure if there's any faster construct, or perhaps a
> > more optimal way to use PrefixCodedTerms?
> >
> > Here is the specific query, highlighting some specific parts of the code:
> > - Build the PrefixCodedTerms (in my case the terms are called 'hashes'):
> >
> >
> https://github.com/alexklibisz/elastiknn/blob/c75b23f/plugin/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L27-L33
> > - Count the matching terms in a segment (this is the main bottleneck in
> my
> > query):
> >
> >
> https://github.com/alexklibisz/elastiknn/blob/c75b23f/plugin/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L54-L73
> >
> > I appreciate any suggestions you might have.
> >
> > - Alex
> >
>


Re: TermsEnum.seekExact degraded performance somewhere between Lucene 7.7.0 and 8.5.1.

2020-07-26 Thread Alex K
Hi,

Also have a look here:
https://issues.apache.org/jira/plugins/servlet/mobile#issue/LUCENE-9378

Seems it might be related.
- Alex

On Sun, Jul 26, 2020, 23:31 Trejkaz  wrote:

> Hi all.
>
> I've been tracking down slow seeking performance in TermsEnum after
> updating to Lucene 8.5.1.
>
> On 8.5.1:
>
> SegmentTermsEnum.seekExact: 33,829 ms (70.2%) (remaining time in our
> code)
> SegmentTermsEnumFrame.loadBlock: 29,104 ms (60.4%)
> CompressionAlgorithm$2.read: 25,789 ms (53.5%)
> LowercaseAsciiCompression.decompress: 25,789 ms (53.5%)
> DataInput.readVInt: 24,690 ms (51.2%)
> SegmentTermsEnumFrame.scanToTerm: 2,921 ms (6.1%)
>
> On 7.7.0 (previous version we were using):
>
> SegmentTermsEnum.seekExact: 5,897 ms (43.7%) (remaining time in our
> code)
> SegmentTermsEnumFrame.loadBlock: 3,499 ms (25.9%)
> BufferedIndexInput.readBytes: 1,500 ms (11.1%)
> DataInput.readVInt: 1,108 (8.2%)
> SegmentTermsEnumFrame.scanToTerm: 1,501 ms (11.1%)
>
> So on the surface it sort of looks like the new version spends less
> time scanning and much more time loading blocks to decompress?
>
> Looking for some clues to what might have changed here, and whether
> it's something we can avoid, but currently LUCENE-4702 looks like it
> may be related.
>
> TX
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Simultaneous Indexing and searching

2020-09-02 Thread Alex K
FWIW, I agree with Michael: this is not a simple problem and there's been a
lot of effort in Elasticsearch and Solr to solve it in a robust way. If you
can't use ES/solr, I believe there are some posts on the ES blog about how
they write/delete/merge shards (Lucene indices).
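
If the searcher does end up in the same JVM as the writer (as Michael
suggests below), the usual Lucene-level building block is a SearcherManager
opened on the IndexWriter; a minimal sketch (path, analyzer, doc, id and
query are placeholders):

IndexWriter writer = new IndexWriter(FSDirectory.open(indexPath),
                                     new IndexWriterConfig(analyzer));
SearcherManager manager = new SearcherManager(writer, new SearcherFactory());

// Indexing side: add/update, then refresh (often on a timer, e.g. once a second).
writer.updateDocument(new Term("id", docId), doc);
manager.maybeRefresh();

// Search side: acquire/release around every query.
IndexSearcher searcher = manager.acquire();
try {
  TopDocs hits = searcher.search(query, 10);
} finally {
  manager.release(searcher);
}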

On Tue, Sep 1, 2020 at 11:40 AM Michael Sokolov  wrote:

> So ... this is a fairly complex topic I can't really cover it in depth
> here; how to architect a distributed search engine service. Most
> people opt to use Solr or Elasticsearch since they solve that problem
> for you. Those systems work best when the indexes are local to the
> service that is accessing them, and build systems to distribute data
> internally; distributing via NFS is generally not a *good idea* (tm),
> although it may work most of the time. In your case, have you
> considered building a search service that runs on the same box as your
> indexer and responds to queries from the web server(s)?
>
> On Tue, Sep 1, 2020 at 11:13 AM Richard So
>  wrote:
> >
> > Hi there,
> >
> > I am beginner for using Lucene especially in the area of Indexing and
> searching simultaneously.
> >
> > Our environment is that we have several webserver for the search
> front-end that submit search request and also a backend server that do the
> full text indexing; whereas the index files are stored in a NFS volume such
> that both the indexing and searchs are pointing to this same NFS volume.
> The indexing may happen whenever something new documents comes in or get
> updated.
> >
> > Our project requires that both indexing and searching can be happened at
> the same time (or the blocking should be as short as possible, e.g. under a
> second)
> >
> > We have search through the Internet and found something like this
> references:
> >
> http://blog.mikemccandless.com/2011/09/lucenes-searchermanager-simplifies.html
> >
> http://blog.mikemccandless.com/2011/11/near-real-time-readers-with-lucenes.html
> >
> > but seems those only apply to indexing and search in the same server
> (correct me if I am wrong).
> >
> > Could somebody tell me how to implement such system, e.g. what Lucene
> classes to be used and the caveat, or how to setup ,etc?
> >
> > Regards
> > Richard
> >
> >
> >
> >
> >
> >
> >
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Optimizing term-occurrence counting (code included)

2020-09-20 Thread Alex K
Hi all, I'm still a bit stuck on this particular issue. I posted an issue on
the Elastiknn repo outlining some measurements and thoughts on potential
solutions: https://github.com/alexklibisz/elastiknn/issues/160

To restate the question: Is there a known optimal way to find and count
docs matching 10s to 100s of terms? It seems the bottleneck is in the
PostingsFormat implementation. Perhaps there is a PostingsFormat better
suited for this use case?

Thanks,
Alex

On Fri, Jul 24, 2020 at 7:59 AM Alex K  wrote:

> Thanks Ali. I don't think that will work in this case, since the data I'm
> counting is managed by lucene, but that looks like an interesting project.
> -Alex
>
> On Fri, Jul 24, 2020, 00:15 Ali Akhtar  wrote:
>
>> I'm new to lucene so I'm not sure what the best way of speeding this up in
>> Lucene is, but I've previously used https://github.com/npgall/cqengine
>> for
>> similar stuff. It provided really good performance, especially if you're
>> just counting things.
>>
>> On Fri, Jul 24, 2020 at 6:55 AM Alex K  wrote:
>>
>> > Hi all,
>> >
>> > I am working on a query that takes a set of terms, finds all documents
>> > containing at least one of those terms, computes a subset of candidate
>> docs
>> > with the most matching terms, and applies a user-provided scoring
>> function
>> > to each of the candidate docs
>> >
>> > Simple example of the query:
>> > - query terms ("aaa", "bbb")
>> > - indexed docs with terms:
>> >   docId 0 has terms ("aaa", "bbb")
>> >   docId 1 has terms ("aaa", "ccc")
>> > - number of top candidates = 1
>> > - simple scoring function score(docId) = docId + 10
>> > The query first builds a count array [2, 1], because docId 0 contains
>> two
>> > matching terms and docId 1 contains 1 matching term.
>> > Then it picks docId 0 as the candidate subset.
>> > Then it applies the scoring function, returning a score of 10 for docId
>> 0.
>> >
>> > The main bottleneck right now is doing the initial counting, i.e. the
>> part
>> > that returns the [2, 1] array.
>> >
>> > I first started by using a BoolQuery containing a Should clause for
>> every
>> > Term, so the returned score was the count. This was simple but very
>> slow.
>> > Then I got a substantial speedup by copying and modifying the
>> > TermInSetQuery so that it tracks the number of times each docId
>> contains a
>> > query term. The main construct here seems to be PrefixCodedTerms.
>> >
>> > At this point I'm not sure if there's any faster construct, or perhaps a
>> > more optimal way to use PrefixCodedTerms?
>> >
>> > Here is the specific query, highlighting some specific parts of the
>> code:
>> > - Build the PrefixCodedTerms (in my case the terms are called 'hashes'):
>> >
>> >
>> https://github.com/alexklibisz/elastiknn/blob/c75b23f/plugin/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L27-L33
>> > - Count the matching terms in a segment (this is the main bottleneck in
>> my
>> > query):
>> >
>> >
>> https://github.com/alexklibisz/elastiknn/blob/c75b23f/plugin/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L54-L73
>> >
>> > I appreciate any suggestions you might have.
>> >
>> > - Alex
>> >
>>
>


How to access block-max metadata?

2020-10-11 Thread Alex K
Hi all,
There was some fairly recent work in Lucene to introduce Block-Max WAND
Scoring (
https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf
, https://issues.apache.org/jira/browse/LUCENE-8135).

I've been working on a use-case where I need very efficient top-k scoring
for 100s of query terms (usually between 300 and 600 terms, k between 100
and 1, each term contributes a simple TF-IDF score). There's some
discussion here: https://github.com/alexklibisz/elastiknn/issues/160.

Now that block-based metadata are presumably available in Lucene, how would
I access this metadata?

I've read the WANDScorer.java code, but I couldn't quite understand how
exactly it is leveraging a block-max codec or block-based statistics. In my
own code, I'm exploring some ways to prune low-quality docs, and I figured
there might be some block-max metadata that I can access to improve the
pruning. I'm iterating over the docs matching each term using the
.advance() and .nextDoc() methods on a PostingsEnum. I don't see any
block-related methods on the PostingsEnum interface. I feel like I'm
missing something.. hopefully something simple!

I appreciate any tips or examples!

Thanks,
Alex


Re: How to access block-max metadata?

2020-10-12 Thread Alex K
Thanks Adrien. Very helpful.
The doc for ImpactsSource.advanceShallow says it's more efficient than
DocIdSetIterator.advance.
Is that because advanceShallow is skipping entire blocks at a time, whereas
advance is not?
One possible optimization I've explored involves skipping pruned docIDs. I
tried this using .advance() instead of .nextDoc(), but found the
improvement was negligible. I'm thinking maybe advanceShallow() would let
me get that speedup.
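
For anyone else digging into this, the shape of the impacts API looks
roughly like this (field, term and targetDocId are placeholders; null
checks omitted):

TermsEnum termsEnum = leafReader.terms("my_field").iterator();
if (termsEnum.seekExact(new BytesRef("some_term"))) {
  ImpactsEnum impactsEnum = termsEnum.impacts(PostingsEnum.FREQS);
  impactsEnum.advanceShallow(targetDocId);   // advances block metadata only
  Impacts impacts = impactsEnum.getImpacts();
  // Level 0 is the smallest block covering targetDocId; higher levels span more docs.
  int upTo = impacts.getDocIdUpTo(0);
  for (Impact impact : impacts.getImpacts(0)) {
    // impact.freq and impact.norm bound the best score any doc up to upTo can get
  }
  // Actual matching still happens via impactsEnum.nextDoc() / advance().
}
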
- AK

On Mon, Oct 12, 2020 at 2:59 AM Adrien Grand  wrote:

> Hi Alex,
>
> The entry point for block-max metadata is TermsEnum#impacts (
>
> https://lucene.apache.org/core/8_6_0/core/org/apache/lucene/index/TermsEnum.html#impacts(int)
> )
> which returns a view of the postings lists that includes block-max
> metadata. In particular, see documentation for ImpactsSource#advanceShallow
> and ImpactsSource#getImpacts (
>
> https://lucene.apache.org/core/8_6_0/core/org/apache/lucene/index/ImpactsSource.html
> ).
>
> You can look at ImpactsDISI to see how this metadata is leveraged in
> practice to turn this metadata into score upper bounds, which is in-turn
> used to skip irrelevant documents.
>
> On Mon, Oct 12, 2020 at 2:45 AM Alex K  wrote:
>
> > Hi all,
> > There was some fairly recent work in Lucene to introduce Block-Max WAND
> > Scoring (
> >
> >
> https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf
> > , https://issues.apache.org/jira/browse/LUCENE-8135).
> >
> > I've been working on a use-case where I need very efficient top-k scoring
> > for 100s of query terms (usually between 300 and 600 terms, k between 100
> > and 1, each term contributes a simple TF-IDF score). There's some
> > discussion here: https://github.com/alexklibisz/elastiknn/issues/160.
> >
> > Now that block-based metadata are presumably available in Lucene, how
> would
> > I access this metadata?
> >
> > I've read the WANDScorer.java code, but I couldn't quite understand how
> > exactly it is leveraging a block-max codec or block-based statistics. In
> my
> > own code, I'm exploring some ways to prune low-quality docs, and I
> figured
> > there might be some block-max metadata that I can access to improve the
> > pruning. I'm iterating over the docs matching each term using the
> > .advance() and .nextDoc() methods on a PostingsEnum. I don't see any
> > block-related methods on the PostingsEnum interface. I feel like I'm
> > missing something.. hopefully something simple!
> >
> > I appreciate any tips or examples!
> >
> > Thanks,
> > Alex
> >
>
>
> --
> Adrien
>


Re: How to access block-max metadata?

2020-10-12 Thread Alex K
I see. So I'm most likely rarely skipping a block's worth of docs, so using
advance() vs nextDoc() doesn't make much of a difference.
All good to know. Thank you.

On Mon, Oct 12, 2020 at 11:42 AM Adrien Grand  wrote:

> advanceShallow is indeed faster than advance because it does less:
> advanceShallow only advances the cursor for block-max metadata, this allows
> reasoning about maximum scores without actually advancing the doc ID.
> advanceShallow is implicitly called via advance.
>
> If your optimization rarely helps skip entire blocks, then it's expected
> that advance doesn't help much over nextDoc. advanceShallow is rarely a
> drop-in replacement for advance since it's unable to tell whether a
> document matches or not, it can only be used to reason about maximum scores
> for a range of doc IDs when combined with ImpactsSource#getImpacts.
>
> On Mon, Oct 12, 2020 at 5:21 PM Alex K  wrote:
>
> > Thanks Adrien. Very helpful.
> > The doc for ImpactSource.advanceShallow says it's more efficient than
> > DocIDSetIterator.advance.
> > Is that because advanceShallow is skipping entire blocks at a time,
> whereas
> > advance is not?
> > One possible optimization I've explored involves skipping pruned docIDs.
> I
> > tried this using .advance() instead of .nextDoc(), but found the
> > improvement was negligible. I'm thinking maybe advanceShallow() would let
> > me get that speedup.
> > - AK
> >
> > On Mon, Oct 12, 2020 at 2:59 AM Adrien Grand  wrote:
> >
> > > Hi Alex,
> > >
> > > The entry point for block-max metadata is TermsEnum#impacts (
> > >
> > >
> >
> https://lucene.apache.org/core/8_6_0/core/org/apache/lucene/index/TermsEnum.html#impacts(int)
> > > )
> > > which returns a view of the postings lists that includes block-max
> > > metadata. In particular, see documentation for
> > ImpactsSource#advanceShallow
> > > and ImpactsSource#getImpacts (
> > >
> > >
> >
> https://lucene.apache.org/core/8_6_0/core/org/apache/lucene/index/ImpactsSource.html
> > > ).
> > >
> > > You can look at ImpactsDISI to see how this metadata is leveraged in
> > > practice to turn this metadata into score upper bounds, which is
> in-turn
> > > used to skip irrelevant documents.
> > >
> > > On Mon, Oct 12, 2020 at 2:45 AM Alex K  wrote:
> > >
> > > > Hi all,
> > > > There was some fairly recent work in Lucene to introduce Block-Max
> WAND
> > > > Scoring (
> > > >
> > > >
> > >
> >
> https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf
> > > > , https://issues.apache.org/jira/browse/LUCENE-8135).
> > > >
> > > > I've been working on a use-case where I need very efficient top-k
> > scoring
> > > > for 100s of query terms (usually between 300 and 600 terms, k between
> > 100
> > > > and 1, each term contributes a simple TF-IDF score). There's some
> > > > discussion here: https://github.com/alexklibisz/elastiknn/issues/160
> .
> > > >
> > > > Now that block-based metadata are presumably available in Lucene, how
> > > would
> > > > I access this metadata?
> > > >
> > > > I've read the WANDScorer.java code, but I couldn't quite understand
> how
> > > > exactly it is leveraging a block-max codec or block-based statistics.
> > In
> > > my
> > > > own code, I'm exploring some ways to prune low-quality docs, and I
> > > figured
> > > > there might be some block-max metadata that I can access to improve
> the
> > > > pruning. I'm iterating over the docs matching each term using the
> > > > .advance() and .nextDoc() methods on a PostingsEnum. I don't see any
> > > > block-related methods on the PostingsEnum interface. I feel like I'm
> > > > missing something.. hopefully something simple!
> > > >
> > > > I appreciate any tips or examples!
> > > >
> > > > Thanks,
> > > > Alex
> > > >
> > >
> > >
> > > --
> > > Adrien
> > >
> >
>
>
> --
> Adrien
>


Re: Lucene/Solr and BERT

2021-04-21 Thread Alex K
There were a couple additions recently merged into lucene but not yet
released:
- A first-class vector codec
- An implementation of HNSW for approximate nearest neighbor search

They are however available in the snapshot releases. I started on a small
project to get the HNSW implementation into the ann-benchmarks project, but
had to set it aside. Here's the code:
https://github.com/alexklibisz/ann-benchmarks-lucene. There are some test
suites that index and search Glove vectors. My first impression was that
indexing seems surprisingly slow, but it's entirely possible I'm doing
something wrong.
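
For reference, the new API looks roughly like this (class names as they
appear in the Lucene 9.x line; snapshot builds may differ, and writer,
vector and queryVector are placeholders):

// Indexing: one float[] per document.
Document doc = new Document();
doc.add(new KnnVectorField("vec", vector, VectorSimilarityFunction.EUCLIDEAN));
writer.addDocument(doc);

// Searching: approximate top-k via the per-segment HNSW graphs.
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(writer));
TopDocs topDocs = searcher.search(new KnnVectorQuery("vec", queryVector, 10), 10);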

On Wed, Apr 21, 2021 at 9:31 AM Michael Wechner 
wrote:

> Hi
>
> I recently found the following articles re Lucene/Solr and BERT
>
> https://dmitry-kan.medium.com/neural-search-with-bert-and-solr-ea5ead060b28
>
> https://medium.com/swlh/fun-with-apache-lucene-and-bert-embeddings-c2c496baa559
>
> and would like to ask whether there might be more recent developments
> within the Lucene/Solr community re BERT integration?
>
> Also how these developments relate to
>
> https://sbert.net/
>
> ?
>
> Thanks very much for your insights!
>
> Michael
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Lucene/Solr and BERT

2021-05-25 Thread Alex K
Hi Michael and others,

Sorry just now getting back to you. For your three original questions:

- Yes, I was referring to the Lucene90Hnsw* classes. Michael S. had a
thorough response.
- As far as I know Opendistro is calling out to a C/C++ binary to run the
actual HNSW algorithm and store the HNSW part of the index. When they
implemented it about a year ago, Lucene did not have this yet. I assume the
Lucene HNSW implementation is solid, but would not be surprised if it's
slower than the C/C++ based implementation, given the JVM has some
disadvantages for these kinds of CPU-bound/number crunching algos.
- I just haven't had much time to invest into my benchmark recently. In
particular, I got stuck on why indexing was taking extremely long. Just
indexing the vectors would have easily exceeded the current time
limitations in the ANN-benchmarks project. Maybe I had some naive mistake
in my implementation, but I profiled and dug pretty deep to make it fast.

I'm assuming you want to use Lucene, but not necessarily via Elasticsearch?
If so, another option you might try for ANN is the elastiknn-models
and elastiknn-lucene packages. elastiknn-models contains the Locality
Sensitive Hashing implementations of ANN used by Elastiknn, and
elastiknn-lucene contains the Lucene queries used by Elastiknn. The Lucene
query is the MatchHashesAndScoreQuery
<https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L18-L22>.
There are a couple of scala test suites that show how to use it:
MatchHashesAndScoreQuerySuite
<https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-testing/src/test/scala/com/klibisz/elastiknn/query/MatchHashesAndScoreQuerySuite.scala>.
MatchHashesAndScoreQueryPerformanceSuite
<https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-testing/src/test/scala/com/klibisz/elastiknn/query/MatchHashesAndScoreQueryPerformanceSuite.scala>.
This is all designed to work independently from Elasticsearch and is
published on Maven: com.klibisz.elastiknn / lucene
<https://search.maven.org/artifact/com.klibisz.elastiknn/lucene/7.12.1.0/jar>
and
com.klibisz.elastiknn / models
<https://search.maven.org/artifact/com.klibisz.elastiknn/models/7.12.1.0/jar>.
The tests are Scala but all of the implementation is in Java.

Thanks,
Alex

On Mon, May 24, 2021 at 3:06 AM Michael Wechner 
wrote:

> Hi Russ
>
> I would like to use it for detecting duplicated questions, whereas I am
> currently using the project sbert.net you mention below to do the
> embedding with a size of 768 for indexing and querying.
>
> sbert has an example listed using "util.pytorch_cos_sim(A,B) as a
> brute-force approach
>
> https://sbert.net/docs/usage/semantic_textual_similarity.html
>
> and "paraphrase mining" approach for larger document collections
>
> https://sbert.net/examples/applications/paraphrase-mining/README.html
>
> Re the Lucene ANN implementation(s) I think it would be very interesting
> to participate in the ANN benchmarking challenge which Julie mentioned
> on the dev list
>
>
> http://mail-archives.apache.org/mod_mbox/lucene-dev/202105.mbox/%3CCAKDq9%3D4rSuuczoK%2BcVg_N6Lwvh42E%2BXUoSGQ6m7BgqzuDvACew%40mail.gmail.com%3E
>
>
> https://medium.com/big-ann-benchmarks/neurips-2021-announcement-the-billion-scale-approximate-nearest-neighbor-search-challenge-72858f768f69
>
> Thanks
>
> Michael
>
>
>
> Am 24.05.21 um 05:31 schrieb Russell Jurney:
> > For practical search using BERT on any reasonable sized dataset, they're
> > going to need ANN, which Lucene recently added. This won't work in
> practice
> > if the query and document are of a different size, which is where
> sentence
> > transformers see a lot of use for documents up to 500 words.
> >
> > https://issues.apache.org/jira/plugins/servlet/mobile#issue/LUCENE-9004
> >
> > https://github.com/UKPLab/sentence-transformers
> >
> > Russ
> >
> > On Sun, May 23, 2021 at 8:23 PM Michael Sokolov 
> wrote:
> >
> >> Hi Michael, that is fully-functional in the sense that Lucene will
> >> build an HNSW graph for a vector-valued field and you can then use the
> >> VectorReader.search method to do KNN-based search. Next steps may
> >> include some integration with lexical, inverted-index type search so
> >> that you can retrieve N-closest constrained by other constraints.
> >> Today you can approximate that by oversampling and filtering. There is
> >> also interest in pursuing other KNN search algorithms, and we have
> >> been working to make sure the VectorFormat API (might still get
> >> renamed due to confusion with other kinds of vectors existing in
> >> 

Re: Lucene/Solr and BERT

2021-05-26 Thread Alex K
Thanks Michael. IIRC, the thing that was taking so long was merging into a
single segment. Is there already benchmarking code for HNSW
available somewhere? I feel like I remember someone posting benchmarking
results on one of the Jira tickets.

Thanks,
Alex

On Wed, May 26, 2021 at 3:41 PM Michael Sokolov  wrote:

> This java implementation will be slower than the C implementation. I
> believe the algorithm is essentially the same, however this is new and
> there may be bugs!  I (and I think Julie had similar results IIRC)
> measured something like 8x slower than hnswlib (using ann-benchmarks).
> It is also surprising (to me) though how this varies with
> differently-learned vectors so YMMV. I still think there is value
> here, and look forward to improved performance, especially as JDK16
> has some improved support for vectorized instructions.
>
> Please also understand that the HNSW algorithm interacts with Lucene's
> segmented architecture in a tricky way. Because we built a graph
> *per-segment* when flushing/merging, these must be rebuilt whenever
> segments are merged. So your indexing performance can be heavily
> influenced by how often you flush, as well as by your merge policy
> settings. Also, when searching, there is a bigger than usual benefit
> for searching across fewer segments, since the cost of searching an
> HNSW graph scales more or less with log N (so searching a single large
> graph is cheaper than searching the same documents divided among
> smaller graphs). So I do recommend using a multithreaded collector in
> order to get best latency with HNSW-based search. To get the best
> indexing, and searching, performance, you should generally index as
> large a number of documents as possible before flushing.
>
> -Mike
>
> On Wed, May 26, 2021 at 9:43 AM Michael Wechner
>  wrote:
> >
> > Hi Alex
> >
> > Thank you very much for your feedback and the various insights!
> >
> > Am 26.05.21 um 04:41 schrieb Alex K:
> > > Hi Michael and others,
> > >
> > > Sorry just now getting back to you. For your three original questions:
> > >
> > > - Yes, I was referring to the Lucene90Hnsw* classes. Michael S. had a
> > > thorough response.
> > > - As far as I know Opendistro is calling out to a C/C++ binary to run
> the
> > > actual HNSW algorithm and store the HNSW part of the index. When they
> > > implemented it about a year ago, Lucene did not have this yet. I
> assume the
> > > Lucene HNSW implementation is solid, but would not be surprised if it's
> > > slower than the C/C++ based implementation, given the JVM has some
> > > disadvantages for these kinds of CPU-bound/number crunching algos.
> > > - I just haven't had much time to invest into my benchmark recently. In
> > > particular, I got stuck on why indexing was taking extremely long. Just
> > > indexing the vectors would have easily exceeded the current time
> > > limitations in the ANN-benchmarks project. Maybe I had some naive
> mistake
> > > in my implementation, but I profiled and dug pretty deep to make it
> fast.
> >
> > I am trying to get Julie's branch running
> >
> > https://github.com/jtibshirani/lucene/tree/hnsw-bench
> >
> > Maybe this will help and is comparable
> >
> >
> > >
> > > I'm assuming you want to use Lucene, but not necessarily via
> Elasticsearch?
> >
> > Yes, for more simple setups I would like to use Lucene standalone, but
> > for setups which have to scale I would use either Elasticsearch or Solr.
> >
> > Thanks
> >
> > Michael
> >
> >
> >
> > > If so, another option you might try for ANN is the elastiknn-models
> > > and elastiknn-lucene packages. elastiknn-models contains the Locality
> > > Sensitive Hashing implementations of ANN used by Elastiknn, and
> > > elastiknn-lucene contains the Lucene queries used by Elastiknn.The
> Lucene
> > > query is the MatchHashesAndScoreQuery
> > > <
> https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-lucene/src/main/java/org/apache/lucene/search/MatchHashesAndScoreQuery.java#L18-L22
> >.
> > > There are a couple of scala test suites that show how to use it:
> > > MatchHashesAndScoreQuerySuite
> > > <
> https://github.com/alexklibisz/elastiknn/blob/master/elastiknn-testing/src/test/scala/com/klibisz/elastiknn/query/MatchHashesAndScoreQuerySuite.scala
> >.
> > > MatchHashesAndScoreQueryPerformanceSuite
> > > <
> https://github.com/alexklibisz/elastiknn/blob/master/ela

Does Lucene have anything like a covering index as an alternative to DocValues?

2021-07-04 Thread Alex K
Hi all,

I am curious if there is anything in Lucene that resembles a covering index
(from the relational database world) as an alternative to DocValues for
commonly-accessed values?

Consider the following use-case: I'm indexing docs in a Lucene index. Each
doc has some terms, which are not stored. Each doc also has a UUID
corresponding to some other system, which is stored using DocValues. When I
run a query, I get back the TopDocs and use the doc ID to fetch the UUID
from DocValues. I know that I will *always* need to go fetch this UUID. Is
there any way to have the UUID stored in the actual index, rather than
using DocValues?

Thanks in advance for any tips

Alex Klibisz


Control the number of segments without using forceMerge.

2021-07-04 Thread Alex K
Hi all,

I'm trying to figure out if there is a way to control the number of
segments in an index without explicitly calling forceMerge.

My use-case looks like this: I need to index a static dataset of ~1
billion documents. I know the exact number of docs before indexing starts.
I know the VM where this index is searched has 64 threads. I'd like to end
up with exactly 64 segments, so I can search them in a parallelized fashion.

I know that I could call forceMerge(64), but this takes an extremely long
time.

Is there a straightforward way to ensure that I end up with 64 segments
without force-merging after adding all of the documents?

Thanks in advance for any tips

Alex Klibisz


Re: Does Lucene have anything like a covering index as an alternative to DocValues?

2021-07-05 Thread Alex K
Hi Uwe,
Thanks for clarifying. That makes sense.
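
For the record, what this boils down to in code is something like the
following (field names are just examples):

// Index time: the UUID is indexed (for TermQuery lookups) and also stored.
Document doc = new Document();
doc.add(new StringField("uuid", uuid, Field.Store.YES));
// ... the non-stored term fields ...
writer.addDocument(doc);

// Query time: read the stored value straight off each hit.
TopDocs hits = searcher.search(query, 10);
for (ScoreDoc sd : hits.scoreDocs) {
  String hitUuid = searcher.doc(sd.doc).get("uuid");
}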
Thanks,
Alex Klibisz

On Mon, Jul 5, 2021 at 9:22 AM Uwe Schindler  wrote:

> Hi,
>
> Sorry I misunderstood you question, you want to lookup the UUID in another
> system!
> Then the approach you are doing is correct. Either store as stored field
> or as docvalue. An inverted index cannot store additional data, because it
> *is* inverted, it is focused around *terms* not documents. The posting list
> of each term can only store internal, numeric lucene doc ids. Those have
> then to be used to lookup the actual contents from e.g. stored fields
> (possibility A) or DocValues (possibility B). We can't store UUIDs in the
> highly compressed posting list.
>
> Uwe
>
> -
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> > -Original Message-
> > From: Uwe Schindler 
> > Sent: Monday, July 5, 2021 3:10 PM
> > To: java-user@lucene.apache.org
> > Subject: RE: Does Lucene have anything like a covering index as an
> alternative
> > to DocValues?
> >
> > You need to index the UUID as a standard indexed StringField. Then you
> can do
> > a lookup using TermQuery. That's how all systems like Solr or
> Elasticsearch
> > handle document identifiers.
> >
> > DocValues are for facetting and sorting, but looking up by ID is a
> typical use
> > case for an inverted index. If you still need to store it as DocValues
> field, just
> > add it with both types.
> >
> > Uwe
> >
> > -
> > Uwe Schindler
> > Achterdiek 19, D-28357 Bremen
> > https://www.thetaphi.de
> > eMail: u...@thetaphi.de
> >
> > > -Original Message-
> > > From: Alex K 
> > > Sent: Monday, July 5, 2021 2:30 AM
> > > To: java-user@lucene.apache.org
> > > Subject: Does Lucene have anything like a covering index as an
> alternative to
> > > DocValues?
> > >
> > > Hi all,
> > >
> > > I am curious if there is anything in Lucene that resembles a covering
> index
> > > (from the relational database world) as an alternative to DocValues for
> > > commonly-accessed values?
> > >
> > > Consider the following use-case: I'm indexing docs in a Lucene index.
> Each
> > > doc has some terms, which are not stored. Each doc also has a UUID
> > > corresponding to some other system, which is stored using DocValues.
> When I
> > > run a query, I get back the TopDocs and use the doc ID to fetch the
> UUID
> > > from DocValues. I know that I will *always* need to go fetch this
> UUID. Is
> > > there any way to have the UUID stored in the actual index, rather than
> > > using DocValues?
> > >
> > > Thanks in advance for any tips
> > >
> > > Alex Klibisz
> >
> >
> > -
> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Control the number of segments without using forceMerge.

2021-07-05 Thread Alex K
Ok, so it sounds like if you want a very specific number of segments you
have to do a forceMerge at some point?

Is there some simple summary on how segments are formed in the first place?
Something like, "one segment is created every time you flush from an
IndexWriter"? Based on some experimenting and reading the code, it seems to
be quite complicated, especially once you start calling addDocument from
several threads in parallel.

It's good to learn about the MultiReader. I'll look into that some more.
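
If I understand the suggestion, that would look something like this (the
directories array and the thread pool are placeholders):

// One reader per force-merged, single-segment index.
IndexReader[] readers = new IndexReader[64];
for (int i = 0; i < readers.length; i++) {
  readers[i] = DirectoryReader.open(directories[i]);
}
IndexReader combined = new MultiReader(readers);  // flat view over 64 leaves

// An executor lets the searcher fan a single query out across the leaves.
ExecutorService pool = Executors.newFixedThreadPool(64);
IndexSearcher searcher = new IndexSearcher(combined, pool);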

Thanks,
Alex

On Mon, Jul 5, 2021 at 9:14 AM Uwe Schindler  wrote:

> If you want an exact number of segments, create 64 indexes, each
> forceMerged to one segment.
> After that use MultiReader to create a view on all separate indexes.
> MultiReaders's contents are always flattened to a list of those 64 indexes.
>
> But keep in mind that this should only ever be done with *static* indexes.
> As soon as you have updates, this is a bad idea (forceMerge in general) and
> also splitting indexes like this. Parallelization should normally come from
> multiple queries running in parallel, but you shouldn't force Lucene to run
> a single query over so many indexes.
>
> Uwe
>
> -
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> > -Original Message-
> > From: Alex K 
> > Sent: Monday, July 5, 2021 4:04 AM
> > To: java-user@lucene.apache.org
> > Subject: Control the number of segments without using forceMerge.
> >
> > Hi all,
> >
> > I'm trying to figure out if there is a way to control the number of
> > segments in an index without explicitly calling forceMerge.
> >
> > My use-case looks like this: I need to index a static dataset of ~1
> > billion documents. I know the exact number of docs before indexing
> starts.
> > I know the VM where this index is searched has 64 threads. I'd like to
> end
> > up with exactly 64 segments, so I can search them in a parallelized
> fashion.
> >
> > I know that I could call forceMerge(64), but this takes an extremely long
> > time.
> >
> > Is there a straightforward way to ensure that I end up with 64 threads
> > without force-merging after adding all of the documents?
> >
> > Thanks in advance for any tips
> >
> > Alex Klibisz
>
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


Re: Control the number of segments without using forceMerge.

2021-07-05 Thread Alex K
After some more reading, the NoMergePolicy seems to mostly solve my problem.

I've configured my IndexWriterConfig with:

.setMaxBufferedDocs(Integer.MAX_VALUE)
.setRAMBufferSizeMB(Double.MAX_VALUE)
.setMergePolicy(NoMergePolicy.INSTANCE)

With this config I consistently end up with a number of segments that is a
multiple of the number of processors on the indexing VM. I don't have to
force merge at all. This also makes the indexing job faster overall.

I think I was previously confused by the behavior of the
ConcurrentMergeScheduler. I'm sure it's great for most use-cases, but I
really need to just move as many docs as possible as fast as possible to a
predictable number of segments, so the NoMergePolicy seems to be a good
choice for my use-case.
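
Put together, the writer setup is roughly the following (directory path and
analyzer are placeholders):

IndexWriterConfig config = new IndexWriterConfig(analyzer);
config.setMaxBufferedDocs(Integer.MAX_VALUE);
config.setRAMBufferSizeMB(Double.MAX_VALUE);
config.setMergePolicy(NoMergePolicy.INSTANCE);
IndexWriter writer = new IndexWriter(FSDirectory.open(indexPath), config);
// With auto-flush effectively disabled and merging turned off, segments are
// written out on commit (or when a per-thread RAM hard limit is hit), so the
// final segment count tracks the number of concurrent indexing threads.
// ... addDocument(...) from N threads ...
writer.commit();
writer.close();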

Also, I learned a lot from Uwe's recent talk at Berlin Buzzwords
<https://2021.berlinbuzzwords.de/sites/berlinbuzzwords.de/files/2021-06/The%20future%20of%20Lucene%27s%20MMapDirectory.pdf>,
and his great post about MMapDirectory from a few years ago
<https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html>.
Definitely recommended for others.

Thanks,
Alex

On Mon, Jul 5, 2021 at 1:53 PM Alex K  wrote:

> Ok, so it sounds like if you want a very specific number of segments you
> have to do a forceMerge at some point?
>
> Is there some simple summary on how segments are formed in the first
> place? Something like, "one segment is created every time you flush from an
> IndexWriter"? Based on some experimenting and reading the code, it seems to
> be quite complicated, especially once you start calling addDocument from
> several threads in parallel.
>
> It's good to learn about the MultiReader. I'll look into that some more.
>
> Thanks,
> Alex
>
> On Mon, Jul 5, 2021 at 9:14 AM Uwe Schindler  wrote:
>
>> If you want an exact number of segments, create 64 indexes, each
>> forceMerged to one segment.
>> After that use MultiReader to create a view on all separate indexes.
>> MultiReaders's contents are always flattened to a list of those 64 indexes.
>>
>> But keep in mind that this should only ever be done with *static*
>> indexes. As soon as you have updates, this is a bad idea (forceMerge in
>> general) and also splitting indexes like this. Parallelization should
>> normally come from multiple queries running in parallel, but you shouldn't
>> force Lucene to run a single query over so many indexes.
>>
>> Uwe
>>
>> -
>> Uwe Schindler
>> Achterdiek 19, D-28357 Bremen
>> https://www.thetaphi.de
>> eMail: u...@thetaphi.de
>>
>> > -Original Message-
>> > From: Alex K 
>> > Sent: Monday, July 5, 2021 4:04 AM
>> > To: java-user@lucene.apache.org
>> > Subject: Control the number of segments without using forceMerge.
>> >
>> > Hi all,
>> >
>> > I'm trying to figure out if there is a way to control the number of
>> > segments in an index without explicitly calling forceMerge.
>> >
>> > My use-case looks like this: I need to index a static dataset of ~1
>> > billion documents. I know the exact number of docs before indexing
>> starts.
>> > I know the VM where this index is searched has 64 threads. I'd like to
>> end
>> > up with exactly 64 segments, so I can search them in a parallelized
>> fashion.
>> >
>> > I know that I could call forceMerge(64), but this takes an extremely
>> long
>> > time.
>> >
>> > Is there a straightforward way to ensure that I end up with 64 threads
>> > without force-merging after adding all of the documents?
>> >
>> > Thanks in advance for any tips
>> >
>> > Alex Klibisz
>>
>>
>> -
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>


Using setIndexSort on a binary field

2021-10-15 Thread Alex K
Hi all,

Could someone point me to an example of using the
IndexWriterConfig.setIndexSort for a field containing binary values?

To be specific, the fields are constructed using the Field(String name,
byte[] value, IndexableFieldType type) constructor, and I'd like to try
using the java.util.Arrays.compareUnsigned method to sort the fields.
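
From the replies below, the sort has to be driven by doc values rather than
the plain Field constructor above; a minimal sketch of that shape, where the
BytesRef ordering is the same unsigned byte-wise order that
Arrays.compareUnsigned gives (analyzer, bytes and fieldType are
placeholders):

IndexWriterConfig config = new IndexWriterConfig(analyzer);
config.setIndexSort(new Sort(new SortField("sortKey", SortField.Type.STRING)));

Document doc = new Document();
doc.add(new SortedDocValuesField("sortKey", new BytesRef(bytes)));
doc.add(new Field("sortKey", bytes, fieldType));  // the indexed copy, if still needed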

Thanks,
Alex


Re: Using setIndexSort on a binary field

2021-10-15 Thread Alex K
Thanks Adrien. This makes me think I might not be understanding the use
case for index sorting correctly. I basically want to make it so that my
terms are sorted across segments. For example, let's say I have integer
terms 1 to 100 and 10 segments. I'd like terms 1 to 10 to occur in segment
1, terms 11 to 20 in segment 2, terms 21 to 30 in segment 3, and so on.
With default indexing settings, I see terms duplicated across segments. I
thought index sorting was the way to achieve this, but the use of doc
values makes me think it might actually be used for something else? Is
something like what I described possible? Any clarification would be great.
Thanks,
Alex


On Fri, Oct 15, 2021 at 12:43 PM Adrien Grand  wrote:

> Hi Alex,
>
> You need to use a BinaryDocValuesField so that the field is indexed with
> doc values.
>
> `Field` is not going to work because it only indexes the data while index
> sorting requires doc values.
>
> On Fri, Oct 15, 2021 at 6:40 PM Alex K  wrote:
>
> > Hi all,
> >
> > Could someone point me to an example of using the
> > IndexWriterConfig.setIndexSort for a field containing binary values?
> >
> > To be specific, the fields are constructed using the Field(String name,
> > byte[] value, IndexableFieldType type) constructor, and I'd like to try
> > using the java.util.Arrays.compareUnsigned method to sort the fields.
> >
> > Thanks,
> > Alex
> >
>
>
> --
> Adrien
>


Re: Using setIndexSort on a binary field

2021-10-18 Thread Alex K
Thanks Michael. Totally agree this is a contrived setup. It's mostly for
benchmarking purposes right now. I was actually able to rephrase my problem
in a way that made more sense for the existing setIndexSort API using float
doc values and saw an appreciable speedup in searches. The IndexRearranger
is also good to know about.

Cheers,
Alex

On Sun, Oct 17, 2021 at 9:32 AM Michael Sokolov  wrote:

> Yeah, index sorting doesn't do that -- it sorts *within* each segment
> so that when documents are iterated (within that segment) by any of
> the many DocIdSetIterators that underlie the Lucene search API, they
> are retrieved in the order specified (which is then also docid order).
>
> To achieve what you want you would have to tightly control the
> indexing process. For example you could configure a NoMergePolicy to
> prevent the segments you manually create from being merged, set a very
> large RAM buffer size on the index writer so it doesn't unexpectedly
> flush a segment while you're indexing, and then index documents in the
> sequence you want to group them by, committing after each block of
> documents. But this is a very artificial setup; it wouldn't survive
> any normal indexing workflow where merges are allowed, documents may
> be updated, etc.
>
> For testing purposes we've recently added the ability to rearrange the
> index (IndexRearranger) according to a specific assignment of docids
> to segments - you could apply this to an existing index. But again,
> this is not really intended for use in a production on-line index that
> receives updates.
>
> On Fri, Oct 15, 2021 at 1:27 PM Alex K  wrote:
> >
> > Thanks Adrien. This makes me think I might not be understanding the use
> > case for index sorting correctly. I basically want to make it so that my
> > terms are sorted across segments. For example, let's say I have integer
> > terms 1 to 100 and 10 segments. I'd like terms 1 to 10 to occur in
> segment
> > 1, terms 11 to 20 in segment 2, terms 21 to 30 in segment 3, and so on.
> > With default indexing settings, I see terms duplicated across segments. I
> > thought index sorting was the way to achieve this, but the use of doc
> > values makes me think it might actually be used for something else? Is
> > something like what I described possible? Any clarification would be
> great.
> > Thanks,
> > Alex
> >
> >
> > On Fri, Oct 15, 2021 at 12:43 PM Adrien Grand  wrote:
> >
> > > Hi Alex,
> > >
> > > You need to use a BinaryDocValuesField so that the field is indexed
> with
> > > doc values.
> > >
> > > `Field` is not going to work because it only indexes the data while
> index
> > > sorting requires doc values.
> > >
> > > On Fri, Oct 15, 2021 at 6:40 PM Alex K  wrote:
> > >
> > > > Hi all,
> > > >
> > > > Could someone point me to an example of using the
> > > > IndexWriterConfig.setIndexSort for a field containing binary values?
> > > >
> > > > To be specific, the fields are constructed using the Field(String
> name,
> > > > byte[] value, IndexableFieldType type) constructor, and I'd like to
> try
> > > > using the java.util.Arrays.compareUnsigned method to sort the fields.
> > > >
> > > > Thanks,
> > > > Alex
> > > >
> > >
> > >
> > > --
> > > Adrien
> > >
>
> -
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>


lucene source code changes

2009-05-19 Thread Alex Steward
Hello,

I have a need to implement a custom inverted index in Lucene. I
have files like the ones I have attached here. The files have words
and scores assigned to each word. There will be hundreds of such
files. Each file will have at least 5 such name-value pairs.

Note: Currently the file only shows 10s of such name-value pairs, but
my real production data will have 5 plus name-value pairs per file.

Currently I index the data using Lucene's inverted index. The query
that is executed against the index has 100 words. When the query is
executed against the index the result is returned in 100 milliseconds
or so.


Problem: Once I have the results of the query, I have to go through
each file (e.g. attached file one). Then for each word in the user's
input query, I have to compute the total score. Doing this against
hundreds of files and hundreds of keywords is causing the score
computation to be slow, i.e. about 3-5 seconds.

I need help resolving the above problem so that score computation takes
less than 200 milliseconds or so.
One resolution I was considering is modifying the Lucene source code for
creating the inverted index. In this index we store the score in the
index itself. When the results of the query are returned, we will get
the scores along with the file names, thereby eliminating the need to
search the file for the keyword and corresponding score. I need to
compute the total of all scores that belong to one single file.
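
Concretely, the mechanism I have in mind for keeping the score in the index
is a payload attached to each term occurrence, along these lines (a rough
sketch only, using the newer attribute-based analysis API; the score map is
whatever gets parsed out of the file):

final class ScorePayloadFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
  private final Map<String, Float> scores;  // word -> score parsed from the file

  ScorePayloadFilter(TokenStream in, Map<String, Float> scores) {
    super(in);
    this.scores = scores;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) return false;
    Float score = scores.get(termAtt.toString());
    if (score != null) {
      payloadAtt.setPayload(new BytesRef(PayloadHelper.encodeFloat(score)));
    }
    return true;
  }
}

At query time the payloads come back per term position (via the term
positions / postings APIs), so the per-file totals could be summed without
re-reading the files at all.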


I am also open to any other ideas that you may have. Any suggestions regarding 
this will be very helpful.

Thanks,
Abhilasha




  
-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: lucene code changes

2009-05-19 Thread Alex Steward
I have a need to implement a custom inverted index in Lucene. I
have files like the ones I have attached here. The files have words
and scores assigned to each word. There will be hundreds of such
files. Each file will have at least 5 such name-value pairs.

Note: Currently the file only shows 10s of such name-value pairs, but
my real production data will have 5 plus name-value pairs per file.

Currently I index the data using Lucene's inverted index. The query
that is executed against the index has 100 words. When the query is
executed against the index the result is returned in 100 milliseconds
or so.


Problem: Once I have the results of the query, I have to go through
each file (e.g. attached file one). Then for each word in the user's
input query, I have to compute the total score. Doing this against
hundreds of files and hundreds of keywords is causing the score
computation to be slow, i.e. about 3-5 seconds.

I need help resolving the above problem so that score computation takes
less than 200 milliseconds or so.
One resolution I was considering is modifying the Lucene source code for
creating the inverted index. In this index we store the score in the
index itself. When the results of the query are returned, we will get
the scores along with the file names, thereby eliminating the need to
search the file for the keyword and corresponding score. I need to
compute the total of all scores that belong to one single file.


I am also open to any other ideas that you may have. Any suggestions regarding 
this will be very helpful.

a.




  


  
-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

cannot retrieve the values of a field that is not stored in the index

2009-06-04 Thread Alex Steward


Hi,

  Is there a way I can retrieve the value of a field that is not stored in the 
Index?


private static void indexFile(IndexWriter writer, File f)
        throws IOException {

    if (f.isHidden() || !f.exists() || !f.canRead()) {
        return;
    }

    System.out.println("Indexing " + f.getCanonicalPath());

    Document doc = new Document();

    // add contents of file (Reader-valued field: indexed, never stored)
    FileReader fr = new FileReader(f);
    doc.add(new Field("contents", fr));

    // second field which contains the path of the file
    doc.add(new Field("path", f.getCanonicalPath(),
            Field.Store.NO,
            Field.Index.NOT_ANALYZED));

    // hand the document to the writer
    writer.addDocument(doc);
}

Is there a way I can access the value of the field "path" from the document 
hits?
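
With Field.Store.NO the value never goes into the stored-fields file, so it
cannot be read back from document hits; the usual fix is to store the path as
well. A minimal sketch (same Field API as above; searcher and docId are
assumed to come from an ordinary search):

    // Store the path so it can be read back later; it is still indexed as a
    // single untokenized term, so exact-match queries on "path" keep working.
    doc.add(new Field("path", f.getCanonicalPath(),
            Field.Store.YES,
            Field.Index.NOT_ANALYZED));

    // Reading it back from a hit:
    Document hit = searcher.doc(docId);
    String path = hit.get("path");   // null if the field was not stored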

Thanks,
a



  

Lucene sorting case-sensitive by default?

2008-01-11 Thread Alex Wang
Hi All,

 

I was searching my index with sorting on a field called "Label" which is
not tokenized, here is what came back:

 

Extended Sites Catalog Asset Store

Extended Sites Catalog Asset Store SALES

Print Catalog 2

Print catalog test

Test Print Catalog

Test refresh catalog

print test  3

test catalog 1

 

Looks like Lucene is separating upper case and lower case while sorting.
Can someone shed some light on why this is happening and
how to fix it?

 

Thanks in advance for your help!

 

Alex

 

 



RE: Lucene sorting case-sensitive by default?

2008-01-14 Thread Alex Wang
Thanks everyone for your replies! Guess I did not fully understand the
meaning of "natural order" in the Lucene Java doc.

To add another all-lower-case field for each sortable field in my index
is a little too much, since the app requires sorting on pretty much all
fields (over 100).

Toke, you mentioned "Using a Collator works but does take a fair amount
of memory", can you please elaborate a little more on that. Thanks.

Alex

-Original Message-
From: Toke Eskildsen [mailto:[EMAIL PROTECTED] 
Sent: Monday, January 14, 2008 3:13 AM
To: java-user@lucene.apache.org
Subject: Re: Lucene sorting case-sensitive by default?

On Fri, 2008-01-11 at 11:40 -0500, Alex Wang wrote:
> Looks like Lucene is separating upper case and lower case while
sorting.

As Tom points out, default sorting uses natural order. It's worth noting
that this implies that default sorting does not produce usable results
as soon as you use non-ASCII characters. Using a Collator works but does
take a fair amount of memory.
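
For reference, a minimal sketch of the Collator-based sort (the "Label" field
name and the Locale are assumptions; searcher and query come from an ordinary
search):

    // A Locale-based SortField compares values through a java.text.Collator,
    // so upper- and lower-case entries interleave alphabetically instead of
    // forming two blocks; the per-field string cache used for sorting is the
    // memory cost mentioned above.
    Sort sort = new Sort(new SortField("Label", Locale.ENGLISH));
    Hits hits = searcher.search(query, sort);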


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Lucene sorting case-sensitive by default?

2008-01-14 Thread Alex Wang
Thanks a lot Erik for the great tip! I do need to display all the fields
and allow the users to sort by each field as they wish. My index is
currently about 200 mb.

Your suggestion about storing (but not index) the cased version, and
indexing (but not store) the lower-case version is an excellent solution
for me. 

Is it possible to do it in the same field or do I have to do it in 2
separate fields? If I do it in one field, what are the Lucene
class/methods I need to overwrite?

Thanks again for your help!

Alex
 

This message may contain confidential and/or privileged information. If
you are not the addressee or authorized to receive this for the
addressee, you must not use, copy, disclose, or take any action based on
this message or any information herein. If you have received this
message in error, please advise the sender immediately by reply e-mail
and delete this message. Thank you for your cooperation.


-Original Message-
From: Erick Erickson [mailto:[EMAIL PROTECTED] 
Sent: Monday, January 14, 2008 11:24 AM
To: java-user@lucene.apache.org
Subject: Re: Lucene sorting case-sensitive by default?

Several things:

1> do you need to display all the fields? Would just storing them
lower-case work? The only time I've needed to store fields case-
sensitive is when I'm showing them to the user. If the user is just
searching on them, I can store them any way I want and she'll never
know.

2> You might very well be surprised at how little extra it takes to
index (but not store) the lower-case version. How big is your index
anyway? And be warned that the size increase is not linear, so
just comparing the index sizes for, say, 10 documents is misleading.
If your index is 10M, there's no reason at all not to store twice. If
it's 10G...

3> You could store (but not index) the cased version. You could
index (but not store) the lower-case version. The total size of
your index is (I believe) about the same as indexing AND storing
the fields. That gives you a way to search caselessly and display
case-sensitively.
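
A minimal sketch of point 3>, with assumed field names, an assumed String
variable "label", and the 2.3-era Field.Index constants:

    // Display copy: stored exactly as typed, not indexed at all.
    doc.add(new Field("Label", label,
            Field.Store.YES, Field.Index.NO));

    // Sort copy: lower-cased, indexed as one untokenized term, not stored.
    doc.add(new Field("LabelSort", label.toLowerCase(),
            Field.Store.NO, Field.Index.UN_TOKENIZED));

    // Sort on the lower-cased field, show the stored original in results.
    Sort sort = new Sort(new SortField("LabelSort", SortField.STRING));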

Best
Erick

On Jan 14, 2008 10:58 AM, Alex Wang <[EMAIL PROTECTED]> wrote:

> Thanks everyone for your replies! Guess I did not fully understand the
> meaning of "natural order" in the Lucene Java doc.
>
> To add another all-lower-case field for each sortable field in my
index
> is a little too much, since the app requires sorting on pretty much
all
> fields (over 100).
>
> Toke, you mentioned "Using a Collator works but does take a fair
amount
> of memory", can you please elaborate a little more on that. Thanks.
>
> Alex
>
> -Original Message-
> From: Toke Eskildsen [mailto:[EMAIL PROTECTED]
> Sent: Monday, January 14, 2008 3:13 AM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene sorting case-sensitive by default?
>
> On Fri, 2008-01-11 at 11:40 -0500, Alex Wang wrote:
> > Looks like Lucene is separating upper case and lower case while
> sorting.
>
> As Tom points out, default sorting uses natural order. It's worth
noting
> that this implies that default sorting does not produce usable results
> as soon as you use non-ASCII characters. Using a Collator works but
does
> take a fair amount of memory.
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Lucene sorting case-sensitive by default?

2008-01-14 Thread Alex Wang
No problem Erick. Thanks for clarifying it.

Alex

-Original Message-
From: Erick Erickson [mailto:[EMAIL PROTECTED] 
Sent: Monday, January 14, 2008 12:35 PM
To: java-user@lucene.apache.org
Subject: Re: Lucene sorting case-sensitive by default?

Sorry, I was confused about this for the longest time (and it shows!).
You don't actually have to store two separate fields. Field.Store.YES
stores the input exactly as is, without passing it through anything. So
you really only have to store your field. I still think of it
conceptually as two entirely different things, but it's not.

This code:
    public static void main(String[] args) throws Exception {
        try {
            RAMDirectory dir = new RAMDirectory();
            IndexWriter iw = new IndexWriter(
                    dir,
                    new StandardAnalyzer(Collections.emptySet()),
                    true);

            Document doc = new Document();

            doc.add(new Field(
                    "f",
                    "This is Some Mixed, case Junk($*%& With Ugly SYmbols",
                    Field.Store.YES,
                    Field.Index.TOKENIZED));
            iw.addDocument(doc);
            iw.close();

            IndexReader ir = IndexReader.open(dir);
            Document d = ir.document(0);
            System.out.println(d.get("f"));
        } catch (Exception e) {
            e.printStackTrace();
        }

        System.out.println("done");
    }

prints "This is Some Mixed, case Junk($*%& With Ugly SYmbols"
yet still finds the document with a search for "junk" using
StandardAnalyzer.
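
For completeness, one way the "junk" search could look against the code above
(TermQuery is used here for brevity; a QueryParser query with the same
analyzer behaves the same way for this term):

    // Matches because StandardAnalyzer lower-cased the tokens at index time,
    // while d.get("f") above still returns the original mixed-case string.
    IndexSearcher searcher = new IndexSearcher(dir);
    Hits hits = searcher.search(new TermQuery(new Term("f", "junk")));
    System.out.println(hits.length());   // prints 1
    searcher.close();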

Sorry for the confusion!
Erick

On Jan 14, 2008 11:48 AM, Alex Wang <[EMAIL PROTECTED]> wrote:

> Thanks a lot Erik for the great tip! I do need to display all the
fields
> and allow the users to sort by each field as they wish. My index is
> currently about 200 mb.
>
> Your suggestion about storing (but not index) the cased version, and
> indexing (but not store) the lower-case version is an excellent
solution
> for me.
>
> Is it possible to do it in the same field or do I have to do it in 2
> separate fields? If I do it in one field, what are the Lucene
> class/methods I need to overwrite?
>
> Thanks again for your help!
>
> Alex
>
>
> This message may contain confidential and/or privileged information.
If
> you are not the addressee or authorized to receive this for the
> addressee, you must not use, copy, disclose, or take any action based
on
> this message or any information herein. If you have received this
> message in error, please advise the sender immediately by reply e-mail
> and delete this message. Thank you for your cooperation.
>
>
> -Original Message-
> From: Erick Erickson [mailto:[EMAIL PROTECTED]
> Sent: Monday, January 14, 2008 11:24 AM
> To: java-user@lucene.apache.org
> Subject: Re: Lucene sorting case-sensitive by default?
>
> Several things:
>
> 1> do you need to display all the fields? Would just storing them
> lower-case work? The only time I've needed to store fields case-
> sensitive is when I'm showing them to the user. If the user is just
> searching on them, I can store them any way I want and she'll never
> know.
>
> 2> You might very well be surprised at how little extra it takes to
> index (but not store) the lower-case version. How big is your index
> anyway? And be warned that the size increase is not linear, so
> just comparing the index sizes for, say, 10 document is misleading.
> If your index is 10M, there's no reason at all not to store twice. If
> it's
> 10G
>
> 3> You could store (but not index) the cased version. You could
> index (but not store) the lower-case version. The total size of
> your index is (I believe) about the same as indexing AND storing
> the fields. That gives you a way to search caselessly and display
> case-sensitively.
>
> Best
> Erick
>
> On Jan 14, 2008 10:58 AM, Alex Wang <[EMAIL PROTECTED]> wrote:
>
> > Thanks everyone for your replies! Guess I did not fully understand
the
> > meaning of "natural order" in the Lucene Java doc.
> >
> > To add another all-lower-case field for each sortable field in my
> index
> > is a little too much, since the app requires sorting on pretty much
> all
> > fields (over 100).
> >
> > Toke, you mentioned "Using a Collator works but does take a fair
> amount
> > of memory", can you please elaborate a little more on that. Thanks.
> >
> > Alex
> >
> > -Original Message-
> > From: Toke Eskildsen [mailto:[EMAIL PROTECTED]
> > Sent: Mon

Can I use HDFS in Lucene 2.3.1?

2008-04-25 Thread Alex Chew
Hi,
Does anybody have experience building a distributed application with Lucene
and Hadoop/HDFS?
Lucene 2.3.1 does not appear to expose an HDFS-backed Directory implementation.

Any advice will be appreciated.
Regards,
Alex


searching for C++

2008-06-24 Thread Alex Soto
Hello:

I have a problem where I need to search for the term "C++".
If I use StandardAnalyzer, the "+" characters are removed and the
search is done on just the "c" character which is not what is
intended.
Yet, I need to use standard analyzer for the other benefits it provides.

I think I need to write a specialized tokenizer (and accompanying
analyzer) that lets the "+" characters pass.
I would take the JFlex-provided one, modify it, and add it to my project.

My question is:

Is there any simpler way to accomplish the same?


Best regards,
Alex Soto
[EMAIL PROTECTED]

-
Amicus Plato, sed magis amica veritas.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: searching for C++

2008-06-24 Thread Alex Soto
Thanks everyone. I appreciate the help.

I think I will write my own tokenizer, because I do not have a
predefined list of words with symbols.
I will modify the grammar by defining a SYMBOL token as John suggested
and redefine ALPHANUM to include it.
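
As an alternative to changing the JFlex grammar, a minimal sketch of a
CharTokenizer that simply treats "+" (and "#") as token characters; the class
name is made up, and this loses StandardTokenizer's special handling of
e-mail addresses, acronyms, and so on:

import java.io.Reader;
import org.apache.lucene.analysis.CharTokenizer;

// Keeps "+" and "#" inside tokens, so "C++" and "C#" survive as single terms.
// normalize() also lower-cases, matching what LowerCaseFilter would do.
public class SymbolKeepingTokenizer extends CharTokenizer {

    public SymbolKeepingTokenizer(Reader in) {
        super(in);
    }

    protected boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c) || c == '+' || c == '#';
    }

    protected char normalize(char c) {
        return Character.toLowerCase(c);
    }
}

The same character test can instead be folded into the JFlex ALPHANUM rule if
the rest of StandardTokenizer's behaviour is still needed.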

Regards,
Alex Soto



On Tue, Jun 24, 2008 at 12:12 PM, N. Hira <[EMAIL PROTECTED]> wrote:
> This isn't ideal, but if you have a defined list of such terms, you may find
> it easier to filter these terms out into a separate field for indexing.
>
> -h
> --
> Hira, N.R.
> Solutions Architect
> Cognocys, Inc.
> (773) 251-7453
>
> On 24-Jun-2008, at 11:03 AM, John Byrne wrote:
>
>> I don't think there is a simpler way. I think you will have to modify the
>> tokenizer. Once you go beyond basic human-readable text, you always end up
>> having to do that. I have modified the JavaCC version of StandardTokenizer
>>  for allowing symbols to pass through, but I've never used the JFlex version
>> - don't know anything about JFlex I'm afraid!
>>
>> A good strategy might be to make a new type of lexical token called
>> "SYMBOL" and try to catch as many symbols as you can think of; then maybe
>> create new token types which are ALPHANUM types that can have pre-fixed or
>> post-fixed symbols.
>>
>> That way, you'll be able to catch things like "c++" in a TokenFilter, and
>> you can choose to pass it through as a single token, or split it up into two
>> tokens, or whatever you want.
>>
>> Hope that helps.
>>
>> Regards,
>> JB
>>
>> Alex Soto wrote:
>>>
>>> Hello:
>>>
>>> I have a problem where I need to search for the term "C++".
>>> If I use StandardAnalyzer, the "+" characters are removed and the
>>> search is done on just the "c" character which is not what is
>>> intended.
>>> Yet, I need to use standard analyzer for the other benefits it provides.
>>>
>>> I think I need to write a specialized tokenizer (and accompanying
>>> analyzer) that let the "+" characters pass.
>>> I would use the JFlex provided one, modify it and add it to my project.
>>>
>>> My question is:
>>>
>>> Is there any simpler way to accomplish the same?
>>>
>>>
>>> Best regards,
>>> Alex Soto
>>> [EMAIL PROTECTED]
>>>
>>> -
>>> Amicus Plato, sed magis amica veritas.
>>>
>>> -
>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>>> For additional commands, e-mail: [EMAIL PROTECTED]
>>>
>>>
>>>
>>>
>>>
>
>
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>



-- 
Alex Soto
[EMAIL PROTECTED]

-
Amicus Plato, sed magis amica veritas.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



IndexDeletionPolicy to delete commits after N minutes

2008-06-25 Thread Alex Cheng
hi,
what is the correct way to instruct the IndexWriter (or other
classes?) to delete old commit points after N minutes?
I tried to write a customized IndexDeletionPolicy that uses the
parameters to schedule future jobs to perform file deletion. However,
I am only getting the file names through the parameters, and not
absolute file names.
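
A minimal sketch of a time-based policy, assuming the goal is to let Lucene do
the deleting rather than removing files by hand: calling
IndexCommitPoint.delete() marks a commit, and the writer's own file deleter
then removes the underlying files, so absolute paths are never needed
(Lucene 2.3/2.4-era API; the class name and the N-minute threshold passed to
the constructor are made up):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.lucene.index.IndexCommitPoint;
import org.apache.lucene.index.IndexDeletionPolicy;

// Keeps the newest commit plus any older commit younger than maxAgeMillis.
// Ages are counted from when this policy first sees a commit.
public class ExpireAfterNMinutesPolicy implements IndexDeletionPolicy {

    private final long maxAgeMillis;
    private final Map<String, Long> firstSeen = new HashMap<String, Long>();

    public ExpireAfterNMinutesPolicy(long minutes) {
        this.maxAgeMillis = minutes * 60L * 1000L;
    }

    public void onInit(List commits) {
        onCommit(commits);
    }

    public void onCommit(List commits) {
        long now = System.currentTimeMillis();
        // Commits are ordered oldest first; always keep the last (newest) one.
        for (int i = 0; i < commits.size() - 1; i++) {
            IndexCommitPoint commit = (IndexCommitPoint) commits.get(i);
            String key = commit.getSegmentsFileName();
            Long seen = firstSeen.get(key);
            if (seen == null) {
                firstSeen.put(key, new Long(now));
            } else if (now - seen.longValue() > maxAgeMillis) {
                commit.delete();   // Lucene's IndexFileDeleter removes the files
                firstSeen.remove(key);
            }
        }
    }
}

Note that ages are only evaluated when onInit/onCommit run (writer open or
commit), so an old commit is dropped at the next commit after it crosses the
threshold, not at exactly N minutes.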

thanks.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: How Lucene Search

2008-06-26 Thread Alex Cheng
The debugger that comes with Eclipse is pretty good for this purpose.
You can create a small project and then attach the Lucene source to it
for debugging.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Urgent Help Please: "Resource temporarily unavailable"

2008-08-06 Thread Alex Wang

Hi Everyone,

We have an application built using Lucene 1.9. The app allows incremental 
updating of the index while other users are searching the same index. Today,
some searches suddenly returned nothing when we know they should return hits.
This does not happen all the time; sometimes the search succeeds. When
checking the logs, I found the following error during the search:

Parameter[0]: java.io.IOException: Resource temporarily unavailable


When this error occurred, there were 2 other users deleting documents from the 
same index. The deletions seemed to succeed, but the search failed.

I have no clue what could have caused such an error. Unfortunately there is no
further info in the logs. Can someone please shed some light on this? Thanks.


Alex



