Incorrect Token Offset when using multiple fieldable instances

2008-03-04 Thread Renaud Delbru

Hi,

I currently use multiple Fieldable instances to index the sentences of a 
document.
When there is only a single Fieldable instance, the token offsets 
generated in DocumentWriter are correct.
The problem appears when there are two or more Fieldable instances. In the 
DocumentWriter$FieldData#invertField method, if the field is tokenized, 
the offset attribute is not advanced by stringValue.length() (as it is 
when the field is not tokenized, line 1458). Instead, you update the 
offset attribute with the end offset of the last token (line 1503: 
offset = offsetEnd+1;).
As a consequence, if a trailing token has been filtered out (for example a 
stopword, a dot, a space, etc.), the offset attribute is updated with the 
end offset of the last token that was not filtered. In this case, the 
stored offset is too small (it is shifted back), and all subsequent 
Fieldable instances have their offsets shifted back as well.
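
The effect described above can be reproduced with a small stand-alone sketch. It does not use Lucene: OffsetSketch, its whitespace tokenizer and its stop list are hypothetical stand-ins, and the "length" rule adds the same one-character gap as the token rule, which is an assumption made so that both rules agree when nothing is filtered.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-alone model of the base-offset bookkeeping; not Lucene code.
final class OffsetSketch {
    static final Set<String> STOP = new HashSet<>(Arrays.asList("the", "on", "a"));

    /** End offset (exclusive) of the last token that survives stop filtering, or 0 if none. */
    static int lastKeptEnd(String value) {
        int lastEnd = 0;
        int i = 0;
        while (i < value.length()) {
            while (i < value.length() && value.charAt(i) == ' ') i++;   // skip spaces
            int start = i;
            while (i < value.length() && value.charAt(i) != ' ') i++;   // scan one token
            if (i > start && !STOP.contains(value.substring(start, i))) lastEnd = i;
        }
        return lastEnd;
    }

    /** Base offset of each value when the base advances to (last kept token end + 1). */
    static int[] basesFromLastToken(List<String> values) {
        int[] bases = new int[values.size()];
        int offset = 0;
        for (int k = 0; k < values.size(); k++) {
            bases[k] = offset;
            offset = offset + lastKeptEnd(values.get(k)) + 1;
        }
        return bases;
    }

    /** Base offset of each value when the base advances by the full string length (+1, assumed). */
    static int[] basesFromLength(List<String> values) {
        int[] bases = new int[values.size()];
        int offset = 0;
        for (int k = 0; k < values.size(); k++) {
            bases[k] = offset;
            offset = offset + values.get(k).length() + 1;
        }
        return bases;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("the cat sat on the", "dog ran");
        System.out.println(Arrays.toString(basesFromLastToken(values)));
        System.out.println(Arrays.toString(basesFromLength(values)));
    }
}
```

For the two values "the cat sat on the" and "dog ran", the token-based rule starts the second value at 12, while the length-based rule starts it at 19; the seven missing characters are the filtered tail " on the".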


Is this a bug? Or is it the desired behavior (and if so, why)?

Regards.

--
Renaud Delbru,
E.C.S., Ph.D. Student,
Semantic Information Systems and
Language Engineering Group (SmILE),
Digital Enterprise Research Institute,
National University of Ireland, Galway.
http://smile.deri.ie/

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Why indexing database is necessary? (RE: indexing database)

2008-03-04 Thread Duan, Nick
Could anyone provide any insight on why someone would use nutch/lucene
or any other search engines to index relational databases? With use
cases if possible?  Shouldn't the database's own indexing mechanism be
used since it is more efficient?

If there is such a need of indexing the database content using search
engines, what would be the best approach other than de-normalizing the
database?  

Thanks a lot in advance!

ND
-Original Message-
From: payo [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 04, 2008 12:36 PM
To: [EMAIL PROTECTED]
Subject: indexing database


Hi to all,

Can I index a database with Nutch?

I am using Nutch 0.8.1.

Thanks
-- 
View this message in context:
http://www.nabble.com/indexing-database-tp15832696p15832696.html
Sent from the Nutch - User mailing list archive at Nabble.com.





RE: Why indexing database is necessary? (RE: indexing database)

2008-03-04 Thread Darren Hartford
Reasons to index with lucene/nutch on top of, or instead of, DB indexing:

1) relevance scoring
2) alias searching (e.g. a large number of aliases, like first names)
3) highlighting
4) cross-datasource searching (multi DB, DB + XML files, etc).

As for the best approach to indexing externally, I do not have any direct
pointers.  I would recommend looking at an ETL tool that can be extended
for this purpose (I've started writing a plugin for Pentaho, but got
pulled off and haven't finished it -- and that was for Solr, not
lucene/nutch).

-D




RE: Why indexing database is necessary? (RE: indexing database)

2008-03-04 Thread Will Johnson
Don't forget the number 1 reason: speed.  For certain types of queries a
search engine can return results orders of magnitude faster than a database.
I've seen search engines return hits in hundreds of milliseconds when the
same database query took hours or even days.  That's not to say that a
search engine is always better, just that it often is when the
inputs and outputs are carefully defined.

- will




C++ as token in StandardAnalyzer?

2008-03-04 Thread Donna L Gresh
I saw some discussion in the archives some time ago about the fact that 
C++ is tokenized as C by the StandardAnalyzer; this still seems to be the 
case. Is there a simple way to get the behavior I want for C++ in 
particular (that it is tokenized as C++), and perhaps for other, more 
idiosyncratic terms I may have in my own application?
Thanks
Donna




RE: Why indexing database is necessary? (RE: indexing database)

2008-03-04 Thread Duan, Nick
Hmm, I guess that's because a database query returns a list of records,
whereas search engine returns only the links, not the actual content.
So a search engine works only in the index space, whereas a database
query engine would have to work in both index and content space...


ND




RE: Why indexing database is necessary? (RE: indexing database)

2008-03-04 Thread Will Johnson
Not necessarily. Many of the high-traffic search sites on the market today,
for everything from yellow pages to job boards to ecommerce sites, use search
engines exclusively to search *and* retrieve/serve content.  The key is that
they don't have to return all matching rows, only the 'best' ones, which are
probably the ones you would want anyway.

- will




Re: More IndexDeletionPolicy questions

2008-03-04 Thread Tim Brennan
The bigger picture here is NFS-safety.  When I run a search, I hand off the 
search results to another thread so that they can be processed as necessary -- 
in particular so that they can be JOINed with a SQL DB -- but I don't want to 
completely lock the index from writes while doing a bunch of SQL calls.  Using 
the commit point tracking I can make sure the appropriate snapshot stays around 
until I'm completely done with it, even if I'm using an NFS-mounted filesystem 
that doesn't have delete-on-last-close semantics.
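
The commit point tracking described above can be sketched as reference counting. The class below is a stand-alone model, not Lucene code: SnapshotTracker and its methods are hypothetical names, and commits are identified by plain strings. In a real IndexDeletionPolicy the same decision would presumably be made inside onInit/onCommit, deleting only the commit points that the tracker reports as unreferenced.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone model of commit-point reference counting; not Lucene code.
final class SnapshotTracker {
    private final Map<String, Integer> refCounts = new HashMap<>();

    /** Called when a searcher starts using the given commit. */
    synchronized void acquire(String commit) {
        refCounts.merge(commit, 1, Integer::sum);
    }

    /** Called when all work on results from the given commit is finished. */
    synchronized void release(String commit) {
        Integer n = refCounts.get(commit);
        if (n == null) return;                       // never acquired, nothing to do
        if (n <= 1) refCounts.remove(commit);
        else refCounts.put(commit, n - 1);
    }

    /**
     * Given all known commits ordered oldest first, returns the ones that may
     * be deleted: everything except the newest commit and any commit that is
     * still referenced.
     */
    synchronized List<String> deletable(List<String> commits) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < commits.size() - 1; i++) {   // the newest commit is always kept
            String c = commits.get(i);
            if (!refCounts.containsKey(c)) out.add(c);
        }
        return out;
    }
}
```

A search thread would call acquire before handing results off and release once the SQL work is finished; until then the older commit is never reported as deletable, so its files stay on the NFS mount.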

--tim




> > It seems like it should be pretty simple -- keep a list of open
> > IndexReaders, track what Segment files they're pointing to, and in
> > onCommit don't delete those segments.
>
> This implies you have multiple readers in a single JVM?  If so, you
> should not need to make a custom deletion policy to handle this case
> -- the OS should be properly protecting open files from deletion.
> Can you shed more light on the bigger picture here?
>
> > Unfortunately it ends up being very difficult to directly determine
> > what Segment an IndexReader is pointing to.  Is there some
> > straightforward way that I'm missing -- all I've managed to do so
> > far is to remember the most recent one from onCommit/onInit and use
> > that one. That works OK, but makes bootstrapping a pain if you
> > try to open a Reader before you've opened the writer once.
> >
> > Also, when I use IndexReader.reopen(), can I assume that the newly
> > returned reader is pointing at the "most recent" segment?  I think
> > so...
>
> Yes, except you have a synchronization challenge: if the writer is in
> the process of committing just as your reader opens you can't be
> certain whether the reader got the new commit or the previous one.
> If you have external synchronization to ensure reader only re-opens
> after writer has fully committed then this isn't an issue.
> >
> > Here's a sequence of steps that happen in my app:
> >
> > 0) open writer, onInit tells me that seg_1 is the most recent segment
> >
> > 1) Open reader, assume it is pointing to seg_1 from (0)
> >
> > 2) New write commits into seg_2, don't delete seg_1 b/c of (1)
> >
> > 3) Call reader.reopen() on the reader from (1); new reader is
> > pointing to seg_2 now?
> >
> > 4) seg_1 stays around until the next time I open or commit a  
> > writer, then it is removed.
> >
> >
> > Does that seem reasonable?
> >
> >
> > --tim



Looking for an example of Using Position Increment Gap

2008-03-04 Thread Matthew Hall

Fellows,

I'm working on a project here where we are trying to use our Lucene 
indexes to return concrete objects.  One of the things we want to be 
able to match on is the vocabulary terms annotated to an object, as 
well as all of the child vocabulary terms of each annotated term.


So, what I was thinking about doing is extending the index that returns 
objects of that type to include a new field, say "sub_term".  In this 
field I would put the text of all of these vocabulary sub-terms 
together, and introduce phrase boundaries using some of the techniques 
described in the analysis section of the Javadoc (basically, 
writing a custom analyzer that introduces a position increment gap 
between phrases).  I am curious, however, whether an example of such 
usage exists somewhere that I could use as a basis for the analyzer that 
I'm going to have to write to handle this case.


Does anyone know of a good example?
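
As a rough illustration of what a position increment gap buys you, here is a stand-alone sketch that does not use Lucene. PositionGapSketch, its whitespace tokenizer and its two-term phrase check are all hypothetical; in Lucene itself the gap would normally come from overriding Analyzer.getPositionIncrementGap(String fieldName) and adding each sub-term as a separate value of the same field.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone model of token positions across several field values.
final class PositionGapSketch {

    /** Maps each term to its positions, leaving `gap` unused positions between values. */
    static Map<String, List<Integer>> index(List<String> values, int gap) {
        Map<String, List<Integer>> postings = new HashMap<>();
        int position = 0;
        for (int v = 0; v < values.size(); v++) {
            if (v > 0) position += gap;               // the position increment gap
            for (String term : values.get(v).toLowerCase().split("\\s+")) {
                if (term.isEmpty()) continue;
                postings.computeIfAbsent(term, k -> new ArrayList<>()).add(position);
                position++;
            }
        }
        return postings;
    }

    /** True if `second` occurs at the position immediately after `first`. */
    static boolean matchesPhrase(Map<String, List<Integer>> postings, String first, String second) {
        List<Integer> a = postings.get(first);
        List<Integer> b = postings.get(second);
        if (a == null || b == null) return false;
        for (int p : a) {
            if (b.contains(p + 1)) return true;
        }
        return false;
    }
}
```

With the values "heart disease" and "lung cancer" and a gap of 0, the phrase "disease lung" matches across the boundary between the two values; with a gap of 100 it no longer does, while "heart disease" still matches in both cases.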

Matt





Re: C++ as token in StandardAnalyzer?

2008-03-04 Thread Erick Erickson
Almost by definition, you have to write your own analyzer. This may be as
simple as chaining another filter into one of the regular analyzers or as
complex as defining your own grammar.

As far as I know, there's no "keep word" list. But that would be an
interesting addition. That is, a variety of analyzer to which you pass not
only a list of stop words but also a list of "keep words",
or words that should NOT be massaged at all. I can imagine that this
would get pretty tricky for, say, StandardAnalyzer, but something like
this in the chain of WhitespaceTokenizer >> LowerCaseFilter >>
KeepwordFilter might be useful...
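
The "keep word" idea can be sketched without Lucene as follows. KeepWordSketch is a hypothetical name, and the punctuation stripping only stands in for whatever massaging a real filter chain would do; it is not how StandardAnalyzer works.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-alone model of whitespace split >> lowercase >> keep-word check.
final class KeepWordSketch {

    static List<String> analyze(String text, Set<String> keepWords) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            String token = raw.toLowerCase();
            if (!keepWords.contains(token)) {
                // Not a keep word: strip everything except letters and digits.
                token = token.replaceAll("[^a-z0-9]", "");
            }
            if (!token.isEmpty()) tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> keep = new HashSet<>(Arrays.asList("c++"));
        System.out.println(analyze("I know C++ and Java.", keep));
    }
}
```

One limitation of this sketch: a keep word followed directly by punctuation (for example "C++,") is not recognized, because the lookup happens on the whole whitespace-delimited token.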

All this is right off the top of my head without much thought, but...

Best
Erick



Re: Why indexing database is necessary? (RE: indexing database)

2008-03-04 Thread Chris Lu
Hi, Nick,

A Lucene index is, in a sense, just another kind of database index,
one that happens to be inverted.

If we ask why we need many database indexes, the answer is: different
query execution paths.
The same goes for a Lucene index, which is faster for term matching.

A Lucene index can actually do more. For example, faceted search
tells you how many matches fall in each category (facet), in addition to the
matched results. This makes it more convenient for websites to display
results and to provide additional links for users to narrow down the
results.
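
To make the facet idea concrete, here is a minimal stand-alone sketch that counts matches per category. It is not how Lucene or DBSight implement faceting: FacetSketch, Doc and the substring match are hypothetical simplifications.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone model of facet counting over matched documents.
final class FacetSketch {

    static final class Doc {
        final String title;
        final String category;

        Doc(String title, String category) {
            this.title = title;
            this.category = category;
        }
    }

    /** Counts, per category, the documents whose title contains the query term. */
    static Map<String, Integer> facetCounts(List<Doc> docs, String term) {
        Map<String, Integer> counts = new TreeMap<>();
        String needle = term.toLowerCase();
        for (Doc d : docs) {
            if (d.title.toLowerCase().contains(needle)) {
                counts.merge(d.category, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
                new Doc("Red digital camera", "cameras"),
                new Doc("Digital camera bag", "accessories"),
                new Doc("Compact digital camera", "cameras"),
                new Doc("Leather wallet", "accessories"));
        System.out.println(facetCounts(docs, "camera"));
    }
}
```

A site would show these counts next to the result list so that users can narrow down by category.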

-- 
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request)
got 2.6 Million Euro funding!





Re: Why indexing database is necessary? (RE: indexing database)

2008-03-04 Thread Erick Erickson
And one other point. You probably *don't* need a search engine for your
database *if* you don't have much textual data. That is, if your database
consists of "classical" tables with columns like "firstname", "lastname",
etc.

But if your database has columns in it containing, say, a page of text then
searching that text is a real pain. *That's* where a search engine shines.

Searching a large DB text field for a single word becomes...er...awkward.
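
A toy version of what the search engine does differently: rather than scanning every text column for the word, it builds a word-to-rows map once and answers each query with a single lookup. The sketch below is stand-alone and hypothetical (InvertedIndexSketch is not a Lucene class, and rows are plain strings identified by their list position).

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-alone model of an inverted index over text rows.
final class InvertedIndexSketch {

    /** Builds a map from each lowercased word to the ids of the rows containing it. */
    static Map<String, Set<Integer>> build(List<String> rows) {
        Map<String, Set<Integer>> index = new HashMap<>();
        for (int rowId = 0; rowId < rows.size(); rowId++) {
            for (String word : rows.get(rowId).toLowerCase().split("[^a-z0-9]+")) {
                if (word.isEmpty()) continue;
                index.computeIfAbsent(word, k -> new TreeSet<>()).add(rowId);
            }
        }
        return index;
    }

    /** Returns the ids of the rows containing the word, or an empty set. */
    static Set<Integer> lookup(Map<String, Set<Integer>> index, String word) {
        return index.getOrDefault(word.toLowerCase(), Collections.<Integer>emptySet());
    }
}
```

The cost of a lookup depends on the number of distinct words and matching rows, not on the total amount of text, which is why single-word searches over large text fields are so much cheaper this way.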

That said, there's a long thread on the Lucene list, which I didn't
understand at all, about embedding Lucene in Oracle. You might try
searching the Lucene list archives for that...

Best
Erick


NO_NORM and TOKENIZED

2008-03-04 Thread Tobias Hill
Hi,

I am quite new to the Lucene API. I find the Field-constructor
unintuitive. Maybe I have misunderstood it. Let's find out...

It can be used either as:
new Field("field", "data", Store.NO, TOKENIZED)

or:
new Field("field", "data", Store.NO, NO_NORM)


As I understand it NO_NORM and TOKENIZED are not settings for
a one-dimensional behaviour - on the contrary they are rather
orthogonal.

I.e. it is quite likely that I would want _both_ TOKENIZED and NO_NORM.
This is especially true for fields that are of approx. equal and short length
over the doc-space.
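
The orthogonality argued above can be written down as two independent switches. This is a stand-alone sketch, not the Lucene API: NormSketch and its boolean parameters are hypothetical, and only the 1/sqrt(numTerms) length factor is taken from Lucene's default similarity (boosts and the one-byte norm encoding are ignored).

```java
// Hypothetical stand-alone model: tokenization and norms as independent switches.
final class NormSketch {

    /** Number of indexed terms: one for an untokenized value, else one per word. */
    static int termCount(String value, boolean tokenized) {
        if (!tokenized) return 1;
        String trimmed = value.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Length factor applied at scoring time; 1.0 when norms are omitted. */
    static double lengthFactor(String value, boolean tokenized, boolean omitNorms) {
        if (omitNorms) return 1.0;
        int n = termCount(value, tokenized);
        return n == 0 ? 1.0 : 1.0 / Math.sqrt(n);
    }

    public static void main(String[] args) {
        System.out.println(lengthFactor("a b c d", true, false));
        System.out.println(lengthFactor("a b c d", true, true));
    }
}
```

All four combinations of the two switches are meaningful here, and for fields of roughly equal length the factor is nearly constant across documents, so omitting it changes little.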

- Am I right in my reasoning (which means that the API is a bit unclear)?
Or
- Have I misunderstood something fundamental about TOKENIZED and NO_NORM?

Thankful for any feedback on this,
Tobias
