Re: merging indexes together

2005-08-09 Thread Volodymyr Bychkoviak

Thanks. I didn't think of such a simple solution :)

Mordo, Aviran (EXP N-NANNATEK) wrote:


Why don't you just add the new information directly to the main index? As
long as you don't get a new IndexReader, you should still be able to access
the old information. Once your indexing and deletion is done, just get a
new IndexReader instance to access the new documents.

Aviran
http://www.aviransplace.com

-Original Message-
From: Volodymyr Bychkoviak [mailto:[EMAIL PROTECTED] 
Sent: Monday, August 08, 2005 1:50 PM

To: java-user@lucene.apache.org
Subject: merging indexes together

Hello All.

In my program I index new information into a temporary directory, and then I
delete the outdated information from the main index and add the new
information by calling indexWriter.addIndexes(). This works fine when the
number of documents is relatively small, but as the index grows, every call
to addIndexes can take a very long time. (NOTE: the new information is only
a small part of the whole index.)

The reason I'm using this approach is that I want the old information to be
available while the new information is being indexed, and then to switch
over to the new information as fast as I can.
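The switch-over described here (keep serving the old view while the new data is indexed, then flip) can be sketched with a plain atomic reference. This is a conceptual sketch only: the Reader type below is a stand-in, not the real Lucene IndexReader, and all names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicReference;

public class ReaderSwap {
    // A stand-in for a point-in-time view of the index.
    static class Reader {
        final int version;
        Reader(int version) { this.version = version; }
    }

    private final AtomicReference<Reader> current =
            new AtomicReference<>(new Reader(1));

    // Searches use whatever reader was current when they started.
    public Reader acquire() { return current.get(); }

    // Only after indexing and deletion are complete is a fresh reader
    // published; in-flight searches keep the old point-in-time view.
    public void reopen(int newVersion) { current.set(new Reader(newVersion)); }

    public static void main(String[] args) {
        ReaderSwap idx = new ReaderSwap();
        Reader before = idx.acquire(); // old view, still valid during indexing
        idx.reopen(2);                 // indexing done: publish the new view
        System.out.println(before.version + " -> " + idx.acquire().version);
    }
}
```

The flip itself is a single reference assignment, which is why the switch can be near-instant regardless of index size.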

Current index: 336 MB / 110 docs, and growing...
Current time to merge the indexes: about 5 min.

Any ideas how to optimize this?

--
regards,
Volodymyr Bychkoviak


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]






--
regards,
Volodymyr Bychkoviak





Re: JDBC proxy implementing Lucene?

2005-08-09 Thread Shay Banon

Hi,

   That is exactly the path I took with Compass and Hibernate. Compass 
integrates with Hibernate events (update/delete/create) and uses them to 
sync the search engine. I had problems with Hibernate 2 interceptors 
(external id is null, and other stuff), so it currently works only with 
Hibernate events. The nice thing about Compass is that once the 
application works at the Object level for search as well (OSEM), 
it becomes very simple to do. Compass simply registers with 
Hibernate events and persists to the search engine any changes made to 
objects that have both ORM (obviously) and OSEM definitions.


   One can then extend the notion of intercepting and actually define a 
generic Aspect (AOP) for search-engine syncs, which can be applied to 
any application as long as it has OSEM - Object to Search Engine 
mappings (or any other type of mappings), since you must have some kind 
of knowledge of how to combine the two.


   Shay

Otis Gospodnetic wrote:


Hi Chris,

--- Chris Lu <[EMAIL PROTECTED]> wrote:

 


Hi, Just an idea to make Lucene work with databases more easily.

When I communicated with Shay Banon(Compass' author), it came to me
that maybe Lucene can be wrapped around JDBC drivers. Let's say it's
L-JDBC.

So whenever an object is stored through JDBC, according to some XML
configuration file, L-JDBC can index the updated object/document, or
delete it from the index.

Basically make Lucene indexing transparent to new/existing
applications.

Not really a super idea. I am wondering if anyone will find it helpful?
   



Yes, that would be handy, as lots of people have applications that use
both Lucene and a RDBMS and use various tricks to keep the two in sync.
If an application uses Hibernate, then one can make use of various
Hibernate interceptors and use them to trigger operations on an
external Lucene index.  I know at least one application that does
something similar (see my .signature).
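The interception idea under discussion can be sketched as a write-through decorator: every write through the data-access layer is mirrored into a search index. This is a toy sketch with hypothetical names; a real version would delegate to JDBC or Hibernate events on one side and a Lucene IndexWriter on the other.

```java
import java.util.HashMap;
import java.util.Map;

public class IndexingStore {
    private final Map<String, String> database = new HashMap<>();
    private final Map<String, String> searchIndex = new HashMap<>();

    // The interception point: persist normally, then mirror to the index.
    public void save(String id, String doc) {
        database.put(id, doc);    // normal persistence path
        searchIndex.put(id, doc); // keep the search index in sync
    }

    public void delete(String id) {
        database.remove(id);
        searchIndex.remove(id);   // deletions are mirrored too
    }

    public String search(String id) { return searchIndex.get(id); }
}
```

The point is that applications only ever call save/delete; the index never drifts out of sync because there is no second code path to forget.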

Otis

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ -- Find it. Tag it. Share it.








Re: Regarding range queries.

2005-08-09 Thread Tony Schwartz
1.  Use RangeFilters on the lowest-precision date you need.  If you only need
to filter to the day, index the date in a separate field with day precision.
This will speed up filter creation a great deal.
2.  Use as few characters as possible when indexing, so if you can come up
with your own date representation as a String, that will work well for you.
3.  Try to update your index as little as possible.  If you need to update
your index regularly, consider having two indexes.  For example: one small
index that allows many updates, which you use for TODAY, and one large index
that is updated each night with the contents of index 1; then swap out
index 1 for a new one.  This is very handy if docs are added in date order:
you can use this fact to sort more efficiently (i.e. no cross-index
sorting - just append the sorted results of one index to the other).
4.  Use a robust filter caching scheme that is shared across users (give the
users the ability and ease of selecting common date ranges).  By robust, I
mean: cache some in memory and cache some to disk.  Reading a filter from
disk can be a heck of a lot cheaper than recreating the filter.  Use a
simple list and put recently used filters at the front; store a certain
number of filters in memory, then store a certain number of filters on
disk, then drop the rest.
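The in-memory tier of point 4 can be sketched with a size-bounded, access-ordered LinkedHashMap. The BitSet values and range-string keys are illustrative only, and the disk tier described above is omitted.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// An LRU cache of computed filters, keyed by a normalized date-range string.
public class FilterCache extends LinkedHashMap<String, java.util.BitSet> {
    private final int maxEntries;

    public FilterCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder=true: recently used stay at tail
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, java.util.BitSet> e) {
        return size() > maxEntries; // evict least recently used beyond bound
    }
}
```

Looking filters up by a normalized key such as "20050801-20050809" lets all users who select the same common range share one cached filter.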



As a side note:

I think there are a few things that should be added to Lucene to really give
a huge benefit to applications of Lucene that are centered around dates.  If
documents are added in date order (generally, but not exactly), you can use
this fact to improve Lucene's memory usage in several ways.

1.  A sparse bitset can be used instead of a full array for date RangeFilters.
2.  Sorting can be improved by storing the StringIndex (sort array) to disk
when the index is updated, then loading only the portions required for a
particular search.  Most people will be searching more recent docs, so you
can keep those portions of the sort array in memory and load the "older"
portions only when needed.
3.  Use the same sparse (and reversible) bitset instead of the Lucene
BitVector for storing the deleted docs for a particular segment (very old
docs are rarely deleted again, based on date).
4.  Sorting can also be greatly improved by NOT storing the field values in
memory if the index is not used in a "multi-index" environment.

I have implemented these techniques for my particular implementation of an
application-logs search tool and have seen incredible results.  I have many
users searching 50 million application logs (1k each) with 512 MB of memory
for my app, where users are sorting and filtering on every search.

Again, these features will only be useful for indexes that have a relative
date-to-docid correlation (which I believe happens to be very common).
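A minimal sketch of the sparse bitset in point 1: pages are allocated only where bits are set, so a filter matching a narrow (e.g. recent) docid range costs memory proportional to that range rather than to the whole index. Page size and class names here are illustrative, not Lucene's.

```java
import java.util.HashMap;
import java.util.Map;

public class SparseBitSet {
    private static final int PAGE_BITS = 4096;         // bits per page
    private final Map<Integer, long[]> pages = new HashMap<>();

    public void set(int bit) {
        long[] page = pages.computeIfAbsent(bit / PAGE_BITS,
                                            k -> new long[PAGE_BITS / 64]);
        page[(bit % PAGE_BITS) / 64] |= 1L << (bit % 64);
    }

    public boolean get(int bit) {
        long[] page = pages.get(bit / PAGE_BITS);
        if (page == null) return false;                // page never allocated
        return (page[(bit % PAGE_BITS) / 64] & (1L << (bit % 64))) != 0;
    }

    // Memory cost is proportional to touched pages, not to the max docid.
    public int pageCount() { return pages.size(); }
}
```

Setting one bit at docid 50,000,000 allocates a single 512-byte page instead of a ~6 MB full bit array.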

Tony Schwartz
[EMAIL PROTECTED]
"What we need is more cowbell."

> Hi all,
> I am a new user of Lucene. This question is posted at least
> once on almost all Lucene mailing lists: the question
> being about the handling of date fields.
>
> In my case I need to find documents with dates older
> than a particular date, so ideally I am not supposed
> to specify the lower bound. Using the default
> date handling provided by Lucene in conjunction with
> RangeQuery results in havoc.
>
> But recently, during my search for a solution to this
> problem, I came across a suggestion to
> convert the dates to strings of the form
> YYYY:MM:DD. This is because "Lucene can handle
> String ranges without having to add every possible
> value as a comparison clause". Here is the link:
> http://www.redhillconsulting.com.au/blogs/simon/archives/000232.html
>
> Now my question is:
> (1) Is the above statement true?
> (2) If yes, will it work with the YYYY:MM:DD HH:MM:SS
> format too?
>
> Other solutions are also welcome.
>
> Thanks a lot.
> Santo.
>
>
>
>
>
>
>





Re: Regarding range queries.

2005-08-09 Thread Erik Hatcher


On Aug 9, 2005, at 2:27 AM, santo santo wrote:


Hi all,
I am a new user of Lucene. This question is posted at least
once on almost all Lucene mailing lists: the question
being about the handling of date fields.

In my case I need to find documents with dates older
than a particular date, so ideally I am not supposed
to specify the lower bound. Using the default
date handling provided by Lucene in conjunction with
RangeQuery results in havoc.


Could you elaborate on the havoc you've experienced?


But recently, during my search for a solution to this
problem, I came across a suggestion to
convert the dates to strings of the form
YYYY:MM:DD. This is because "Lucene can handle
String ranges without having to add every possible
value as a comparison clause". Here is the link:
http://www.redhillconsulting.com.au/blogs/simon/archives/000232.html




Now my question is:
(1) Is the above statement true?
(2) If yes, will it work with the YYYY:MM:DD HH:MM:SS
format too?


Yes, and yes.  You still have to watch out for the TooManyClauses
exception when doing a plain RangeQuery, but there is now a RangeFilter
available to help with this situation (which may require changing how you
construct Query objects in some way).


You need to ensure that the string representation of any terms used
for range queries sorts in lexicographical order.  Every term in Lucene
is essentially a string.
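For illustration, zero-padded date strings in the format asked about compare in chronological order, which is exactly the property a string range query relies on. The inRange helper below is hypothetical; it simply mimics an inclusive term range using plain string comparison.

```java
public class LexDates {
    // Inclusive bounds, plain String comparison, as a term range would use.
    public static boolean inRange(String term, String lo, String hi) {
        return term.compareTo(lo) >= 0 && term.compareTo(hi) <= 0;
    }

    public static void main(String[] args) {
        String lo = "2005:01:01 00:00:00";
        String hi = "2005:08:09 23:59:59";
        // Zero padding makes lexicographic order match chronological order.
        System.out.println(inRange("2005:08:09 02:27:00", lo, hi)); // true
        System.out.println(inRange("2004:12:31 23:59:59", lo, hi)); // false
    }
}
```

Note that this only works because every field is fixed-width and most significant first; "2005:1:1" would break the ordering.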


Hope this helps some.

Erik






Re: Regarding range queries.

2005-08-09 Thread Doug Cutting

Tony,

If your improvements are of general utility, please contribute them.
Even if they are not, post them as-is and perhaps someone will take the
time to make them more reusable.


Cheers,

Doug

Tony Schwartz wrote:

I think there are a few things that should be added to lucene to really
give a huge benefit to applications of lucene that are centered around
dates. [...]





IBM Open-Sources New Search Technology

2005-08-09 Thread Scott Ganyo
FYI - as it is relevant to search technology.  I can't for the life of me
figure out the current or future open source licensing, though...?


Scott


IBM Open-Sources New Search Technology
http://www.eweek.com/article2/0,1895,1844710,00.asp

"IBM plans to release as open source a sophisticated new search and
text analysis technology that is able to find relationships, trends
and facts buried in a wide range of unstructured data, including
e-mails, Web pages, text documents, images, audio and video.
Called UIMA (Unstructured Information Management Architecture),
the technology is able to go beyond the keyword analysis
typically used by most search engines to discern the semantic
meanings within text and other unstructured data, said Nelson Mattos,
vice president of information integration with IBM in San Jose, Calif."





Why is Hits.java not Serializable?

2005-08-09 Thread Ali Rouhi
Hello

I am looking at the RemoteSearchable code for inspiration on how to do
remote searches. (I will probably use something like SEDA to implement
the RPC, to avoid RMI's heavy thread-creation issues; my question
should apply to any implementation of a remote searcher, however.)

I see that RemoteSearchable does not extend Searcher, but implements
Searchable only. In particular, this means that the "public Hits
search(){...}" interfaces of Searcher are not implemented in
RemoteSearchable. In my case this is transparent to the client, since
I obtain RemoteSearchables from multiple remote indexes and combine
them using MultiSearcher (which does implement "public Hits
search(){...}").

I am concerned about what goes on under the hood here. Which form of
the Searchable interface gets called on the server? The javadoc for
example says that "void search(Query query, Filter filter,
HitCollector results)" should not be used unless one is after all of
the results. So if I'm only interested in the top 100 hits, this seems
not to be a good thing if this particular interface gets called. Maybe
the form that returns "TopDocs" gets called (the javadoc gives an
"expert" qualification for this interface). I could dig into the code
to see what happens, but I am hoping an expert can answer this
question in much shorter order.

Another way to ask this question is: why is Hits.java not declared
Serializable, so that the search methods which return Hits objects
can be exposed via the Searchable interface rather than the abstract
class Searcher? Hits would have to be declared Serializable, since
Searchable extends java.rmi.Remote (presumably because it is
implemented by RemoteSearchable!).

I can think of 3 reasons why search methods returning Hits objects
are not exposed in Searchable:

1) Someone forgot to declare Hits Serializable.
2) There is a fundamental reason why the forms of search which return
Hits objects cannot be called remotely: some non-optimal form of search
will get called on the server(s) and I can't do anything about it. For
example, "void search(Query query, Filter filter, HitCollector
results)" gets called.
3) Under the hood everything takes care of itself: when I call
"public Hits search(){...}" on the client and use the Hits object to
retrieve the 100 most relevant or top-sorting results, a non-optimal
form of search does *not* get called on the server (maybe a form
returning "TopDocs" is called). In this case I'm worrying
unnecessarily!?

My hoped-for answers are 3) or at least 1). Or I may be missing
something and there is another answer.

Sorry for the long-winded question; I just can't seem to ask it in a
few words.

Many thanks
Ali
