Re: merging indexes together
Thanks. I didn't think about such a simple solution :)

Mordo, Aviran (EXP N-NANNATEK) wrote:
> Why don't you just add the new information directly to the main index? As long as you don't open a new IndexReader, you should be able to access the old information. Once your indexing and deletion is done, just get a new IndexReader instance to access the new documents.
>
> Aviran
> http://www.aviransplace.com
>
> -----Original Message-----
> From: Volodymyr Bychkoviak [mailto:[EMAIL PROTECTED]]
> Sent: Monday, August 08, 2005 1:50 PM
> To: java-user@lucene.apache.org
> Subject: merging indexes together
>
> Hello All.
>
> In my program I index new information into a temporary directory, and after that I delete outdated information from the main index and add the new information by calling the indexWriter.addIndexes() method. This works fine when the number of documents is relatively small, but as the index grows, every call to addIndexes can take a very long time. (NOTE: the new information is only part of the whole index.)
>
> The reason I'm using this approach is that I want the old information to remain available while the new information is being indexed, and then to switch to the new information as fast as I can.
>
> Current index: 336Mb / 110 Docs, and growing... Current time to merge indexes is about 5 min.
>
> Any ideas how to optimize this?
>
> --
> regards,
> Volodymyr Bychkoviak

--
regards,
Volodymyr Bychkoviak

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
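Aviran's suggestion could be sketched roughly as below, against the Lucene 1.4-era API. The directory path, field names, and the "id" term used for deletion are all placeholders, and deletes/writes are sequenced so only one holder of the write lock exists at a time; treat this as a sketch, not a drop-in implementation.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;

public class IncrementalUpdate {
    public static void main(String[] args) throws Exception {
        String indexDir = "/path/to/main/index"; // placeholder

        // Searches through this reader keep seeing the old data for as
        // long as the reader stays open, even while the index is modified.
        IndexReader oldReader = IndexReader.open(indexDir);
        IndexSearcher oldSearcher = new IndexSearcher(oldReader);

        // Delete outdated docs via a second reader instance, so the
        // searching reader above is left untouched. (In 1.4, deletes go
        // through IndexReader, not IndexWriter.)
        IndexReader deleter = IndexReader.open(indexDir);
        deleter.delete(new Term("id", "outdated-doc-id")); // hypothetical key field
        deleter.close();

        // Add the new documents directly to the main index -- no separate
        // temporary index and no addIndexes() merge.
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
        Document doc = new Document();
        doc.add(Field.Text("contents", "new information"));
        writer.addDocument(doc);
        writer.close();

        // Only now switch: a freshly opened reader sees the new documents.
        IndexSearcher newSearcher = new IndexSearcher(IndexReader.open(indexDir));
        oldSearcher.close();
    }
}
```

The switch-over is then just swapping one searcher reference for another, which is essentially instantaneous compared to a 5-minute merge.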
Re: JDBC proxy implementing Lucene?
Hi,

That is exactly the path I took with Compass and Hibernate. Compass integrates with Hibernate events (update/delete/create) and syncs the search engine using them. I had problems with the Hibernate 2 interceptors (the external id is null, among other issues), so it currently works only with Hibernate events.

The nice thing about Compass is that once the application works at the object level for search as well (OSEM), this becomes very simple to do. Compass simply registers with the Hibernate events and persists to the search engine any changes made to objects that have both ORM (obviously) and OSEM definitions.

One could then extend the notion of intercepting and define a generic Aspect (AOP) for search engine syncs, which could be applied to any application as long as it has OSEM (Object to Search Engine mappings) or any other type of mapping, since you must have some knowledge of how to combine the two.

Shay

Otis Gospodnetic wrote:
> Hi Chris,
>
> --- Chris Lu <[EMAIL PROTECTED]> wrote:
>> Hi,
>> Just an idea to make Lucene work with databases more easily. When I communicated with Shay Banon (Compass' author), it came to me that maybe Lucene could be wrapped around JDBC drivers. Let's call it L-JDBC. Whenever an object is stored through JDBC, L-JDBC could, according to some XML configuration file, index the updated object/document, or delete it from the index. Basically, it would make Lucene indexing transparent to new/existing applications. Not really a super idea. I am wondering whether anyone would find it helpful?
>
> Yes, that would be handy, as lots of people have applications that use both Lucene and an RDBMS and use various tricks to keep the two in sync. If an application uses Hibernate, then one can make use of the various Hibernate interceptors and use them to trigger operations on an external Lucene index. I know at least one application that does something similar (see my .signature).
>
> Otis
>
> Simpy -- http://www.simpy.com/ -- Find it. Tag it. Share it.
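The event-driven sync Shay describes could be sketched along these lines, assuming Hibernate 3's event API and a hypothetical `LuceneIndexer` helper (Compass itself does considerably more, and the listener still has to be registered in the Hibernate configuration):

```java
import org.hibernate.event.PostInsertEvent;
import org.hibernate.event.PostInsertEventListener;

// Hypothetical helper that turns a mapped object into a Lucene Document
// and writes it to the index -- the "OSEM" side of the equation.
interface LuceneIndexer {
    void index(Object entity, java.io.Serializable id);
}

public class SearchEngineSyncListener implements PostInsertEventListener {

    private final LuceneIndexer indexer;

    public SearchEngineSyncListener(LuceneIndexer indexer) {
        this.indexer = indexer;
    }

    // Called by Hibernate after an entity is inserted. At this point the
    // identifier has been assigned, which avoids the null-id problem the
    // Hibernate 2 interceptors had.
    public void onPostInsert(PostInsertEvent event) {
        indexer.index(event.getEntity(), event.getId());
    }
}
```

Analogous listeners for post-update and post-delete events would keep the index in sync for the other two cases.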
Re: Regarding range queries.
1. Use RangeFilters on the lowest-precision date you need. If you only need to filter to the day, index the date in a separate field with day precision. This will speed up filter creation a great deal.

2. Use as few characters as possible when indexing, so if you can come up with your own date representation as a String, that will work well for you.

3. Update your index as little as possible. If you need to update it regularly, consider having two indexes: one small index that allows many updates, which you use for TODAY, and one large index that each night is updated with the contents of the small one, after which the small index is swapped out for a new one. This is very handy if docs are added in date order; you can use that fact to sort more efficiently (i.e. no cross-index sorting -- just append the sorted results of one index to the other).

4. Use a robust filter-caching scheme that is shared across users (give users an easy way to select common date ranges). By robust I mean: cache some filters in memory and some on disk. Reading a filter from disk can be a heck of a lot cheaper than recreating it. Use a simple list and put recently used filters at the front; keep a certain number of filters in memory, a certain number on disk, and drop the rest.

As a side note: I think there are a few things that should be added to Lucene to really give a huge benefit to applications that are centered around dates. If documents are added in date order (generally, but not exactly), you can use this fact to improve Lucene's memory usage in several ways:

1. A sparse bitset can be used instead of a full array for date RangeFilters.
2. Sorting can be improved by storing the StringIndex (sort array) to disk when the index is updated, then loading only the portions required for a particular search. Most people will be searching more recent docs, so you can keep those portions of the sort array in memory and load the "older" portions only when needed.
3. The same sparse (and reversible) bitset can be used instead of Lucene's BitVector for storing the deleted docs of a particular segment (very old docs are probably the ones being deleted, based on date).
4. Sorting can also be greatly improved by NOT storing the field values in memory if the index is not used in a "multi-index" environment.

I have implemented these techniques in my own application-logs search tool and have seen incredible results: many users searching 50 million application logs (1k each) with 512 MB of memory for my app, where users sort and filter on every search. Again, these features will only be useful for indexes that have a rough date-to-docid correlation (which I believe is very common).

Tony Schwartz
[EMAIL PROTECTED]
"What we need is more cowbell."

> Hi all,
> I am a new user of Lucene. This query is posted at least once on almost all Lucene mailing lists: the question of handling date fields.
>
> In my case I need to find documents with dates older than a particular date, so ideally I am not supposed to specify the lower bound. Using the default date handling provided by Lucene in conjunction with RangeQuery results in havoc.
>
> But recently, during my search for a solution to this problem, I came across one which said to convert the dates to a string of the form YYYY:MM:DD. This is because "Lucene can handle String ranges without having to add every possible value as a comparison clause". Here is the link:
> http://www.redhillconsulting.com.au/blogs/simon/archives/000232.html
>
> Now my questions are:
> (1) Is the above statement true?
> (2) If yes, will it work with the YYYY:MM:DD HH:MM:SS format too?
>
> Other solutions are also welcome.
>
> Thanks a lot.
> Santo.
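Tip 1 above (a day-precision date field, filtered with a RangeFilter and wrapped in a cache) might look roughly like this against the Lucene 1.4-era API; the field names and date values are placeholders:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.CachingWrapperFilter;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.RangeFilter;

public class DayPrecisionFilter {

    // At index time, store the date twice: full precision for display or
    // fine-grained queries, and day precision ("yyyyMMdd") in a separate
    // untokenized field used only for filtering.
    static void addDateFields(Document doc, String fullTimestamp, String day) {
        doc.add(Field.Keyword("timestamp", fullTimestamp)); // e.g. "20050809142700"
        doc.add(Field.Keyword("day", day));                 // e.g. "20050809"
    }

    // At search time, a day-precision field has far fewer distinct terms to
    // enumerate, so building the filter's bit set is much cheaper. Wrapping
    // it in CachingWrapperFilter keeps the computed bits, so repeated use of
    // a popular range (per IndexReader) is nearly free.
    static Filter lastWeekFilter() {
        Filter raw = new RangeFilter("day", "20050802", "20050809", true, true);
        return new CachingWrapperFilter(raw);
    }
}
```

The disk-backed cache and LRU eviction from tip 4 would sit on top of this, keyed by the range's lower and upper bounds.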
Re: Regarding range queries.
On Aug 9, 2005, at 2:27 AM, santo santo wrote:
> Hi all,
> I am a new user of Lucene. This query is posted at least once on almost all Lucene mailing lists: the question of handling date fields.
>
> In my case I need to find documents with dates older than a particular date, so ideally I am not supposed to specify the lower bound. Using the default date handling provided by Lucene in conjunction with RangeQuery results in havoc.

Could you elaborate on the havoc you've experienced?

> But recently, during my search for a solution to this problem, I came across one which said to convert the dates to a string of the form YYYY:MM:DD. This is because "Lucene can handle String ranges without having to add every possible value as a comparison clause". Here is the link:
> http://www.redhillconsulting.com.au/blogs/simon/archives/000232.html
>
> Now my questions are:
> (1) Is the above statement true?
> (2) If yes, will it work with the YYYY:MM:DD HH:MM:SS format too?

Yes, and yes. You still have to watch out for the TooManyClauses exception when doing a plain RangeQuery, but there is now a RangeFilter available to help with this situation (which may require changing how you construct Query objects in some way). You need to ensure that the string representations of any terms used for range queries are in lexicographic order; every term in Lucene is essentially a string.

Hope this helps some.

Erik
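Erik's point about lexicographic order is the whole trick: zero-padded, fixed-width date strings compare character by character in exactly chronological order, with or without a time component. A quick standalone check:

```java
public class DateStringOrder {
    public static void main(String[] args) {
        // Zero-padded, fixed-width date strings sort lexicographically in
        // the same order as the dates they represent.
        String earlier  = "2005:08:09 02:27:00";
        String later    = "2005:08:10 00:00:00";
        String lastYear = "2004:12:31 23:59:59";

        System.out.println(earlier.compareTo(later) < 0);    // true
        System.out.println(lastYear.compareTo(earlier) < 0); // true

        // Caveat: this only holds with zero padding. An unpadded "2005:8:9"
        // sorts AFTER "2005:10:01" because '8' > '1' as a character.
        System.out.println("2005:8:9".compareTo("2005:10:01") > 0); // true
    }
}
```

So the YYYY:MM:DD HH:MM:SS format works for range queries as long as every field is zero-padded to a fixed width.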
Re: Regarding range queries.
Tony,

If your improvements are of general utility, please contribute them. Even if they are not, post them as-is and perhaps someone will take the time to make them more reusable.

Cheers,
Doug

Tony Schwartz wrote:
> I think there are a few things that should be added to Lucene to really give a huge benefit to applications that are centered around dates. If documents are added in date order (generally, but not exactly), you can use this fact to improve Lucene's memory usage in several ways:
>
> 1. A sparse bitset can be used instead of a full array for date RangeFilters.
> 2. Sorting can be improved by storing the StringIndex (sort array) to disk when the index is updated, then loading only the portions required for a particular search. Most people will be searching more recent docs, so you can keep those portions of the sort array in memory and load the "older" portions only when needed.
> 3. The same sparse (and reversible) bitset can be used instead of Lucene's BitVector for storing the deleted docs of a particular segment (very old docs are probably the ones being deleted, based on date).
> 4. Sorting can also be greatly improved by NOT storing the field values in memory if the index is not used in a "multi-index" environment.
>
> I have implemented these techniques in my own application-logs search tool and have seen incredible results: many users searching 50 million application logs (1k each) with 512 MB of memory for my app, where users sort and filter on every search. Again, these features will only be useful for indexes that have a rough date-to-docid correlation (which I believe is very common).
>
> Tony Schwartz
> [EMAIL PROTECTED]
> "What we need is more cowbell."
IBM Open-Sources New Search Technology
FYI - as it is relevant to search technology. I can't for the life of me figure out the current or future open-source licensing, though...?

Scott

IBM Open-Sources New Search Technology
http://www.eweek.com/article2/0,1895,1844710,00.asp

"IBM plans to release as open-source a sophisticated new search and text analysis technology that is able to find relationships, trends and facts buried in a wide range of unstructured data, including e-mails, Web pages, text documents, images, audio and video. Called the UIMA (Unstructured Information Management Architecture), the technology is able to go beyond the keyword analysis typically used by most search engines to discern the semantic meanings within text and other unstructured data, said Nelson Mattos, vice president of information integration with IBM in San Jose, Calif."
Why is Hits.java not Serializable?
Hello,

I am looking at the RemoteSearchable code for inspiration on how to do remote searches. (I will probably use something like SEDA to implement the RPC, to avoid the heavy thread-creation issues of RMI; my question should apply to any implementation of a remote searcher, however.)

I see that RemoteSearchable does not extend Searcher, but implements Searchable only. In particular, this means that the "public Hits search() {...}" methods of Searcher are not implemented in RemoteSearchable. In my case this is transparent to the client, since I obtain RemoteSearchables from multiple remote indexes and combine them using MultiSearcher (which does implement "public Hits search() {...}").

I am concerned about what goes on under the hood here: which form of the Searchable interface gets called on the server? The javadoc, for example, says that "void search(Query query, Filter filter, HitCollector results)" should not be used unless one is after all of the results. So if I'm only interested in the top 100 hits, it seems like a bad thing if this particular method gets called. Maybe the form that returns TopDocs gets called (the javadoc gives that method an "expert" qualification). I could dig into the code to see what happens, but I am hoping an expert can answer this question in much shorter order.

Another way to ask this question is: why is Hits.java not declared Serializable, so that the search methods which return Hits objects could be exposed via the Searchable interface rather than the abstract class Searcher? Hits would have to be declared Serializable, since Searchable extends java.rmi.Remote (presumably because it is implemented by RemoteSearchable!).

I can think of 3 reasons why search methods returning Hits objects are not exposed in Searchable:

1) Someone forgot to declare Hits Serializable.

2) There is a fundamental reason why the forms of search which return Hits objects cannot be called remotely; some non-optimal form of search will get called on the server(s) and I can't do anything about it. For example, "void search(Query query, Filter filter, HitCollector results)" gets called.

3) Under the hood everything takes care of itself: when I call "public Hits search() {...}" on the client and use the Hits object to retrieve the 100 most relevant or top-sorting results, a non-optimal form of search does *not* get called on the server (maybe a form returning TopDocs is called). In this case I'm worrying unnecessarily!?

My hoped-for answers are 3), or at least 1). Or I may be missing something and there is another answer. Sorry for the long-winded question; I just can't seem to ask it in a few words.

Many thanks,
Ali
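For reference, the client-side setup the question describes looks roughly like this in the Lucene 1.4-era API (host names, the RMI binding name, and the field names are placeholders); the open question is which Searchable method MultiSearcher ends up invoking on the remote stubs:

```java
import java.rmi.Naming;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.search.TermQuery;

public class RemoteSearchClient {
    public static void main(String[] args) throws Exception {
        // Look up the RemoteSearchable stubs exported by the index servers.
        Searchable remoteA = (Searchable) Naming.lookup("//hostA/Searchable");
        Searchable remoteB = (Searchable) Naming.lookup("//hostB/Searchable");

        // MultiSearcher is a Searcher, so it offers the Hits-returning
        // search() methods even though the remote stubs themselves do not.
        Searcher searcher = new MultiSearcher(new Searchable[] { remoteA, remoteB });

        Hits hits = searcher.search(new TermQuery(new Term("contents", "lucene")));
        for (int i = 0; i < Math.min(100, hits.length()); i++) {
            System.out.println(hits.doc(i).get("title")); // "title" is a placeholder field
        }
        searcher.close();
    }
}
```

As far as I can tell from the sources of that era, Hits fetches results from its Searcher in batches via the TopDocs-returning search method rather than the HitCollector one, which would make answer 3) the likely one; but this is worth verifying against the exact Lucene version in use.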