Re: Using lucene as a database... good idea or bad idea?

2008-07-31 Thread Andy Liu
If essentially all you need is key-value storage, Berkeley DB Java Edition works
well.  Lookup by ID is fast, you can iterate through documents, and it supports
secondary keys, updates, etc.

Lucene would work relatively well for this, although inserting documents
might not be as fast, because segments need to be merged and data ends up
getting copied over again at certain points.  So if you're running a batch
process with a lot of inserts, you might get better throughput with BDB as
opposed to Lucene, but, of course, benchmark to confirm ;)

Andy
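
For reference, a minimal sketch of this kind of key-value storage with
Berkeley DB Java Edition might look like the following (the storage
directory, database name, and keys are illustrative, not from this thread):

    import java.io.File;
    import com.sleepycat.je.*;

    public class BdbDocStore {
        public static void main(String[] args) throws DatabaseException {
            // The environment home directory must already exist.
            EnvironmentConfig envConfig = new EnvironmentConfig();
            envConfig.setAllowCreate(true);
            Environment env = new Environment(new File("/tmp/bdb-env"), envConfig);

            DatabaseConfig dbConfig = new DatabaseConfig();
            dbConfig.setAllowCreate(true);
            Database db = env.openDatabase(null, "docs", dbConfig);

            // Store a document under its ID.
            DatabaseEntry key = new DatabaseEntry("doc-42".getBytes());
            db.put(null, key, new DatabaseEntry("document contents".getBytes()));

            // Fast lookup by ID.
            DatabaseEntry result = new DatabaseEntry();
            if (db.get(null, key, result, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
                System.out.println(new String(result.getData()));
            }

            db.close();
            env.close();
        }
    }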

On Thu, Jul 31, 2008 at 9:12 AM, Karsten F.
<[EMAIL PROTECTED]>wrote:

>
> Hi Ganesh,
>
> in this thread nobody said that Lucene is a good storage server, only
> that "it could be used as a storage server" (Grant: "Connect data storage
> with simple, fast lookup and Lucene...").
>
> I don't know about automatic retention, but for the rest of your feature
> list I suggest taking a deep look at:
>  - Jackrabbit (the standard JCR (JSR-170) implementation; I like the webDAV
> support)
>  - DSpace (real, working content repository software with good permissions
> management)
>
> Both use Lucene for searching.
>
> Best regards
>Karsten
>
>
> Ganesh - yahoo wrote:
> >
> > Which one would be best to use as a storage server: Lucene or
> > Jackrabbit?
> >
> > My requirement is to provide support to:
> > 1) Archive the documents
> > 2) Do full-text search on the documents
> > 3) Back up the index store and archive store [on a periodic basis]
> > 4) Remove the documents after a certain period [retention policy]
> >
> > Could Lucene be used as an archival store? Most people on this mailing
> > list said 'yes'. If so, would it be a better option to use a separate
> > database to archive the data and a separate database to index it, or to
> > use one database as both archive and index?
> >
> > One more idea from this list is to use Jackrabbit / JDBM / MySQL to
> > archive the data. Which would be the best?
> >
> > I am in the design phase and I have time to explore and prototype any
> > other products. Please suggest a good one.
> >
> > Regards
> > Ganesh
> >
> >
> >
>


Using ParallelReader over large immutable index and small updatable index

2007-03-06 Thread Andy Liu

Is there a working solution out there that would let me use ParallelReader
to search over a large, immutable index and a smaller, auxiliary index that
is updated frequently?  Currently, from my understanding, ParallelReader
fails when one of the indexes is updated, because the document IDs get out
of sync.  Using ParallelReader in this way is attractive to me because it
would allow me to quickly update only the fields that change.

The alternative is to use one index.  However, an update would require me to
delete the entire document (which is quite large in my application) and
reinsert it after making updates.  This requires a lot more I/O and is a lot
slower, and I'd like to avoid this if possible.

I can think of other alternatives, but all involve storing data and/or
bitsets in memory, which is not very scalable.  I need to be able to handle
millions of documents.

I'm also open to any solution that doesn't involve ParallelReader and would
help me make quick updates in the most non-disruptive and scalable fashion.
But it just seems that ParallelReader would be perfect for my needs, if I
can get past this issue.

I've seen posts about this issue on the list, but nothing pointing to a
solution.  Can somebody help me out?

Andy
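
For reference, the single-index delete-and-reinsert path described above is
roughly what IndexWriter.updateDocument(Term, Document) does in one call
(added around Lucene 2.1); a sketch, with an illustrative "id" field:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class FullReplace {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter("/index/main", new StandardAnalyzer(), false);

            // The whole (large) document must be rebuilt just to change one field:
            Document doc = new Document();
            doc.add(new Field("id", "42", Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("field1", "foo", Field.Store.NO, Field.Index.TOKENIZED));
            doc.add(new Field("field4", "bar", Field.Store.NO, Field.Index.TOKENIZED));

            // Deletes the old document matching the term, then adds the new one.
            writer.updateDocument(new Term("id", "42"), doc);
            writer.close();
        }
    }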


Re: Using ParallelReader over large immutable index and small updatable index

2007-03-07 Thread Andy Liu

From my understanding, MultiSearcher is used to combine two indexes that
have the same fields but different documents, while ParallelReader is used
to combine two indexes that have the same documents but different fields.
I'm trying to do the latter.  Is my understanding correct?  For example,
what I'm trying to do is have one immutable index that has these fields:

field1
field2
field3

and my "update" index that has one field

field4

Both indexes have the same documents, and the docIDs are synchronized.
This allows me to execute searches like:

+field1:foo +field4:bar

field4 is a field that would be updated frequently and as close to real
time as possible.  However, once I update field4, the docIDs are no longer
synchronized, and ParallelReader fails.

Andy
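
For context, a minimal sketch of the setup described above, using the
Lucene 2.x API of that era (index paths are illustrative):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.ParallelReader;
    import org.apache.lucene.search.IndexSearcher;

    public class ParallelSetup {
        public static void main(String[] args) throws Exception {
            // Two indexes holding the same documents, with docIDs aligned:
            // a large immutable index with field1..field3, and a small
            // frequently rebuilt index with field4.
            ParallelReader parallel = new ParallelReader();
            parallel.add(IndexReader.open("/index/immutable"));
            parallel.add(IndexReader.open("/index/updates"));

            // Queries can now mix fields from both indexes,
            // e.g. +field1:foo +field4:bar.
            IndexSearcher searcher = new IndexSearcher(parallel);

            // Caveat from this thread: updating either index renumbers its
            // docIDs and breaks the alignment.
        }
    }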

On 3/6/07, Alexey Lef <[EMAIL PROTECTED]> wrote:


We use MultiSearcher for a similar scenario. This way you can keep the
Searcher/Reader for the read-only index alive and refresh the small index
Searcher whenever an update is made. If you have any cached filters, they
are mapped to a Reader, so the cached filters for the big index will stay
alive as well. The only (small) problem I have found so far is how
MultiSearcher handles custom Similarity (see
https://issues.apache.org/jira/browse/LUCENE-789).

Hope this helps,

Alexey
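
A rough sketch of the MultiSearcher setup Alexey describes (Lucene 2.x API;
index paths are illustrative):

    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Searchable;

    public class TwoIndexSearch {
        public static void main(String[] args) throws Exception {
            // Keep the searcher over the big, read-only index alive...
            IndexSearcher big = new IndexSearcher("/index/big");
            // ...and re-open only this one whenever the small index changes.
            IndexSearcher small = new IndexSearcher("/index/small");

            // Documents from both indexes appear in one result set.
            MultiSearcher searcher = new MultiSearcher(new Searchable[] { big, small });
        }
    }

Note that, as Andy points out above, this combines indexes by document
rather than by field, so it fits the case where the updated documents live
in the small index, not the parallel-fields case.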





Re: Range search in numeric fields

2007-04-03 Thread Andy Liu

You can try using MemoryCachedRangeFilter.

https://issues.apache.org/jira/browse/LUCENE-855

It stores field values in memory as longs, so your values don't have to be
lexicographically comparable.  Also, MemoryCachedRangeFilter can be orders of
magnitude faster than the standard RangeFilter, depending on your data.

Andy

On 4/3/07, Ivan Vasilev <[EMAIL PROTECTED]> wrote:


Hi All,
I have the following problem:
I have to implement range search for fields that contain numbers, for
example a size field that contains file sizes. The problem is that the
numbers are not stored as strings of fixed length; there are field values
like "32", "421", and "1201". So when making a search like
+size:[10 TO 50], since the ordering of strings is lexicographic, the
result contains the documents with sizes 32 and 1201. I can see the
following possible approaches:
1. Changing the indexing process so that all data entered in those fields
has a fixed length, e.g. 0000032, 0000421, 0001201.
Disadvantages here are:
- All existing indexes have to be reindexed;
- The index will grow a bit.

2. Generating a query without ranges but including all numbers between the
bounds, ORed together: size:10 size:11 size:12 ... size:49 size:50. For
narrow ranges it makes sense, but for large ones... :)

3. Generating a query with intervals (inclusive and exclusive), but the
number of these intervals will be about the same as (or one more than) the
clauses in point 2: +size:[10 TO 50] -size:[10 TO 119]
-size:[11 TO 1299] ... etc.

So if someone can help with some new approach, please mail.

Thanks in advance.
Ivan
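
For what it's worth, approach 1 above (fixed-width padding) is the classic
workaround; a sketch of it (Lucene 2.x API; the ten-digit width and field
name are illustrative):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.RangeQuery;

    public class PaddedSizeField {
        // Pad so lexicographic order matches numeric order.
        static String pad(long value) {
            return String.format("%010d", value);  // 32 -> "0000000032"
        }

        // At index time: add the padded value as an untokenized field.
        static void addSize(Document doc, long size) {
            doc.add(new Field("size", pad(size), Field.Store.NO, Field.Index.UN_TOKENIZED));
        }

        // At query time: pad the bounds the same way.
        static RangeQuery sizeRange(long lower, long upper) {
            return new RangeQuery(new Term("size", pad(lower)),
                                  new Term("size", pad(upper)), true);  // inclusive
        }
    }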





Re: Index updates between machines

2007-04-03 Thread Andy Liu

Sounds like you might have an I/O issue.  If you have multiple partitions /
disks on the searching server, you can serve searches from one partition
while copying the new index to the other, then alternate.  If you're using
RAID, note that different RAID levels are optimized differently for
simultaneous reads and writes.

If you have a third machine, you can load-balance two search servers and
take one out of the cluster while the index is being copied.  Alternatively,
if it's possible, you can copy the index at an off-peak hour.

Andy
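
If neither option is available, throttling the copy itself can also help; a
rough sketch of a bandwidth-capped copy in plain Java (the chunk size and
rate parameter are illustrative):

    import java.io.*;

    public class ThrottledCopy {
        // Copy a file in chunks, sleeping between chunks to cap throughput
        // and leave disk/network headroom for searches.
        public static void copy(File src, File dst, long bytesPerSecond)
                throws IOException, InterruptedException {
            byte[] buf = new byte[64 * 1024];
            InputStream in = new FileInputStream(src);
            OutputStream out = new FileOutputStream(dst);
            try {
                int n;
                while ((n = in.read(buf)) > 0) {
                    out.write(buf, 0, n);
                    // Sleep long enough that this chunk fits the rate budget.
                    Thread.sleep(n * 1000L / bytesPerSecond);
                }
            } finally {
                in.close();
                out.close();
            }
        }
    }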

On 4/3/07, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:


How fast are your disks?  Perhaps they are having trouble keeping up with
simultaneous searches and massive file copying.

Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

----- Original Message -----
From: Chun Wei Ho <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Tuesday, April 3, 2007 10:40:16 AM
Subject: Index updates between machines

We are running a search service on the internet using two machines. We
have a crawler machine which crawls the web and merges new documents
found into the Lucene index. We have a searcher machine which allows
users to perform searches on the Lucene index.

Periodically, we would copy the newest version of the index from the
crawler machine over to the searcher machine (via a copy over an NFS
mount). The searcher would then detect the new version, close the old
index, open the new index, and resume the search service.

As the index has been growing in size, we have noticed that the search
response time on the searcher machine increases drastically while an
index (about 15GB) is being copied from the crawler to the searcher.
Both machines run Fedora Core 4 and are on a Gbps LAN.

We've tried a number of ways to reduce the impact of the copy over NFS
on search performance, such as "nice"ing the copy process, but to no
avail. I wonder if anyone is running a Lucene search service over a
similar architecture, and how you are managing the updates to the
Lucene index.

Thanks!

Regards,
CW





Lopsided scores for each term in BooleanQuery

2006-09-18 Thread Andy Liu

For multi-word queries, I would like to reward documents that contain a more
even distribution of each word and penalize documents that have a skewed
distribution.  For example, if my search query is:

+content:fast +content:car

I would prefer a document that contains each word an equal number of times
over a document that contains the word "fast" 100 times and the word "car" 1
time.  In other words, I would like to compare the scores of each
BooleanQuery term and adjust the score according to the distribution.

Can somebody point me in the right direction as to how I would implement
this?

Thanks,
Andy


Re: Lopsided scores for each term in BooleanQuery

2006-09-18 Thread Andy Liu

In our application we have multiple fields that are searched, so "fast car"
becomes:

+(field1:fast field2:fast field3:fast) +(field1:car field2:car field3:car)

I understand that the default sqrt implementation of tf() would help the
"lopsided score" phenomenon with searches within the same field.  But when
searching in multiple fields, this effect is obscured since each matching
field adds to the score of that clause.  Is there a way to "peek" at the
scores of each clause, and adjust based on how divergent the scores are?  Or
is there an easier way to do this that I'm just not seeing?

Andy

On 9/18/06, Paul Elschot <[EMAIL PROTECTED]> wrote:


It's already there in DefaultSimilarity.tf(), which is the square root:

(sqrt(1) + sqrt(1)) > (sqrt(0) + sqrt(2))


Regards,
Paul Elschot
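
If the square root doesn't damp things enough, a custom Similarity can
flatten term-frequency skew further; a sketch (the logarithmic tf is an
illustrative choice, not something from this thread):

    import org.apache.lucene.search.DefaultSimilarity;

    // Damp repeated occurrences harder than the default sqrt(freq), so 100
    // occurrences of one term can't dominate a two-term conjunction.
    public class FlatTfSimilarity extends DefaultSimilarity {
        public float tf(float freq) {
            return freq > 0 ? (float) (1.0 + Math.log(freq)) : 0.0f;
        }
    }

    // Usage: searcher.setSimilarity(new FlatTfSimilarity()); set the same
    // Similarity on the IndexWriter so index-time norms stay consistent.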





Relative term frequency?

2005-06-06 Thread Andy Liu
Is there a way to calculate term frequency scores relative to the number
of terms in the document's field?  We want to override tf() in this way
to curb keyword spamming in web pages.  In Similarity, only the raw term
frequency is passed into the tf() method:

float tf(int freq)

It would be nice to have something like:

float tf(int freq, String fieldName, int numTerms)

If this isn't available out of the box, how difficult would it be to
hack up Lucene to allow for this?

Thanks,
Andy
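
One possible hook: Similarity already receives the field length through
lengthNorm(String fieldName, int numTokens), which is folded into the score
at index time.  A sketch of using it against term stuffing (the field name
and cutoff are illustrative):

    import org.apache.lucene.search.DefaultSimilarity;

    // Give very long fields a sharply smaller norm than the
    // default 1/sqrt(numTokens).
    public class AntiSpamSimilarity extends DefaultSimilarity {
        public float lengthNorm(String fieldName, int numTokens) {
            if ("content".equals(fieldName) && numTokens > 10000) {
                return 0.1f / (float) Math.sqrt(numTokens);
            }
            return super.lengthNorm(fieldName, numTokens);
        }
    }

Since norms are computed when documents are indexed, changing lengthNorm
requires setting the Similarity on the IndexWriter and reindexing.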




Re: A very technical question.

2005-09-28 Thread Andy Liu
While you're indexing, you can give each doc a field that reflects how long
the document is. So, for example, you can add a field named "docLength" to
each document and assign it one of a set of discrete values such as
"veryshort", "short", "medium", "long", or "verylong", depending on how
granular you need it. Then at query time you can boost the values of that
field as desired, e.g.:

civil war docLength:verylong^5 docLength:long^3

Andy
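
A quick sketch of that indexing step (the bucket thresholds are made up for
illustration; Field.Keyword is the Lucene 1.4-era API):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class DocLengthField {
        // Bucket the document length into a keyword field at index time.
        static void addDocLength(Document doc, int numTerms) {
            String bucket;
            if (numTerms < 100)        bucket = "veryshort";
            else if (numTerms < 500)   bucket = "short";
            else if (numTerms < 2000)  bucket = "medium";
            else if (numTerms < 10000) bucket = "long";
            else                       bucket = "verylong";
            doc.add(Field.Keyword("docLength", bucket));
        }
    }

    // At query time, boost the preferred buckets, e.g.:
    //   civil war docLength:verylong^5 docLength:long^3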

On 9/28/05, Dawid Weiss <[EMAIL PROTECTED]> wrote:
>
>
> Hi.
>
> I have a very technical question. I need to alter document scores (or in
> fact, document boosts) for an existing index, per query. In other words,
> I'd like to have pseudo-queries of the form:
>
> 1. civil war PREFER:shorter
> 2. civil war PREFER:longer
>
> For these two queries, 1 would score shorter documents higher than option
> 2, which would in turn score longer documents higher. Note that these
> preferences are expressed at query time, so static document boosts are of
> little help.
>
> I'd appreciate it if those familiar with the internals of Lucene gave me
> brief instructions on how this could be achieved (my rough guess is that
> I'll need to build my own Scorer... but how do I access document length,
> and where do I plug in that scorer... besides, I'd rather hear it from
> somebody with more expertise).
>
> Thanks,
> D.
>


--
Andy Liu
[EMAIL PROTECTED]
(301) 873-8458