Richard Jones wrote:
If you're willing to continue subsetting / summarizing the data out into
Lucene, how about subsetting it out into a dedicated MySQL instance for
this purpose? 100 artists * 1M profiles * 2 ints * 4 bytes/int =
roughly 1 GB of data, which would easily fit into RAM. Queries should
be pretty fast off of that. Good luck!
We used to do this, but there is a lot of overhead involved in
updating/deleting/inserting all those rows and db indexes; more wasted cycles
and disk activity than we see with Lucene. Even with MyISAM, which skips the
fancy ACID stuff (no referential integrity), it's still slower.
Furthermore, with Lucene I can query "artists:1" and it returns what Lucene
deems to be the "best" matches for artist 1 (Radiohead). This is far easier
than with an SQL database, because the person whose listen count for
Radiohead is highest isn't necessarily the "biggest fan"; it depends on the
size of the profile. This gets even more complicated when trying to find the
"best" fan of a combination of a few artists. Lucene is more useful for this
than a database query.
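For example, finding the best fans of artists 1 and 42 combined comes down
to a single search. Here is an untested sketch assuming a pre-5.x Lucene API
(where BooleanQuery is built with add()); the "artists" field name and both
artist ids are invented for illustration:

    import java.io.IOException;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    // Find the top 10 "best fans" of artists 1 and 42 combined.
    TopDocs bestFans(IndexSearcher searcher) throws IOException {
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("artists", "1")), BooleanClause.Occur.MUST);
        query.add(new TermQuery(new Term("artists", "42")), BooleanClause.Occur.MUST);
        // Default scoring rewards high term frequency (many listens) but
        // normalizes by field length (profile size), so a devoted fan with
        // a small profile can outrank a heavy listener with a huge one.
        return searcher.search(query, 10);
    }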
Ah, so the fact that "1" actually appears many times in the string you
give Lucene is important. Neat application!
Sounds like the custom Analyzer (really a custom TokenStream) approach
suggested by others may be the way for you to go. If the information
you get from the MySQL profiles is an artist id and a count, you could code
up an Analyzer that emits N "1" tokens in a row, rather than concatenating N
"1"s together into a single string and then letting QueryParser et al.
parse them back apart again. Even if you get all N of the "1"s from
MySQL, a custom Analyzer could emit them one at a time, rather than going
through the concatenate, parse-apart cycle.
--MDC