Re: Search Suggestions

2006-12-14 Thread Bhavin Pandya
Hi Simon, you can index the past query log for your search application and search that index the way you want... - Bhavin Pandya - Original Message - From: "Simon Wistow" <[EMAIL PROTECTED]> To: "Lucene" Sent: Friday, December 15, 2006 3:52 AM Subject: Search Suggestions Yahoo!
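
The query-log idea above can be sketched without Lucene. The following is only an illustration under stated assumptions: QuerySuggester and its methods are hypothetical names, and an in-memory map stands in for the index of past queries that the reply proposes building.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not a Lucene API: suggests logged queries that contain the typed term.
class QuerySuggester {
    // Number of times each normalised query appeared in the log.
    private final Map<String, Integer> counts = new HashMap<>();

    // Record one line of the query log (trimmed and lower-cased).
    void record(String query) {
        counts.merge(query.trim().toLowerCase(Locale.ROOT), 1, Integer::sum);
    }

    // Up to `limit` logged queries containing `term`, most frequent first, ties in alphabetical order.
    List<String> suggest(String term, int limit) {
        String needle = term.trim().toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        for (String q : counts.keySet()) {
            // Skip the query itself; only longer queries that contain it are suggestions.
            if (!q.equals(needle) && q.contains(needle)) {
                matches.add(q);
            }
        }
        matches.sort((a, b) -> {
            int byCount = Integer.compare(counts.get(b), counts.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(matches.subList(0, Math.min(limit, matches.size())));
    }

    public static void main(String[] args) {
        QuerySuggester s = new QuerySuggester();
        for (int i = 0; i < 3; i++) s.record("nike shoes");
        for (int i = 0; i < 2; i++) s.record("payless shoes");
        s.record("jordan shoes");
        s.record("shoes");
        s.record("hats");
        System.out.println(s.suggest("shoes", 2)); // prints [nike shoes, payless shoes]
    }
}
```

With a real index, the map lookup would be replaced by a search over the indexed query log; the ranking by frequency is one possible choice, not something the thread prescribes.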

Re: Duplicates removal in search results

2006-12-14 Thread Bhavin Pandya
Hi qaz, You can remove duplicates at search time by writing your own HitCollector... - Bhavin pandya - Original Message - From: "qaz zaq" <[EMAIL PROTECTED]> To: Sent: Friday, December 15, 2006 1:01 AM Subject: Duplicates removal in search results How can i remove the duplicates re
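
A rough sketch of the collect-time idea above, assuming one surviving hit per title is wanted and that the best-scoring hit should win. TitleDedupCollector is a hypothetical stand-in that does not extend any Lucene class, and the titleByDoc array is an assumed per-document lookup of the 'title' field.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in for a custom collector: keeps the best-scoring document for each distinct title.
class TitleDedupCollector {
    private final String[] titleByDoc;                                  // assumed title lookup by document number
    private final Map<String, Integer> bestDoc = new LinkedHashMap<>(); // title -> best document seen so far
    private final Map<String, Float> bestScore = new HashMap<>();       // title -> score of that document

    TitleDedupCollector(String[] titleByDoc) {
        this.titleByDoc = titleByDoc;
    }

    // Called once per matching document; later hits replace earlier ones only if they score higher.
    void collect(int doc, float score) {
        String title = titleByDoc[doc];
        Float previous = bestScore.get(title);
        if (previous == null || score > previous) {
            bestScore.put(title, score);
            bestDoc.put(title, doc);
        }
    }

    // Surviving document numbers, ordered by the first appearance of each title.
    List<Integer> results() {
        return new ArrayList<>(bestDoc.values());
    }

    public static void main(String[] args) {
        String[] titles = {"A", "B", "A", "C"};
        TitleDedupCollector c = new TitleDedupCollector(titles);
        c.collect(0, 0.5f);
        c.collect(1, 0.9f);
        c.collect(2, 0.8f); // same title as doc 0 but a higher score, so it replaces doc 0
        c.collect(3, 0.1f);
        System.out.println(c.results()); // prints [2, 1, 3]
    }
}
```

Memory grows with the number of distinct titles, which is worth keeping in mind for result sets as large as those mentioned elsewhere in this thread.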

Re: Duplicates removal in search results

2006-12-14 Thread qaz zaq
Thanks Erick, using TermDocs/TermEnum should work. My concern is performance: the search results could reach 100K, so performance may suffer. One alternative I am considering is to collapse the data at indexing time, but I haven't decided to go that way. - Ori

Re: Search index performance

2006-12-14 Thread Chris Hostetter
:Just wondering if my repository has 1TB of index file, when I perform : searching, does it takes up or allocate a lot of memory usage to read and : retrieve the results? try a mailing list search for "memory usage" ... i think you'll find some previous discussions that may help. -Hoss -

Re: Lucene id generation

2006-12-14 Thread Chris Hostetter
Karl: it sounds like you are just referring to using the Lucene docid as an array index for the FieldCache of your "MyID" field ... that's a perfectly valid use of the docid, the key being that you aren't expecting the id to contain any meaningful data itself -- it's just a reference number. : > if

Search index performance

2006-12-14 Thread spinergywmy
Hi, Just wondering: if my repository has 1TB of index files, does performing a search take up or allocate a lot of memory to read and retrieve the results? Thanks, regards, Wooi Meng -- View this message in context: http://www.nabble.com/Search-index-performance-tf2825038

Re: Lucene & LSA

2006-12-14 Thread Miles Efron
U of Tennessee professor Michael Berry maintains a good site regarding software for computing SVD on large, sparse matrices: http://www.cs.utk.edu/~lsi/ The site also points to the LSI patent. FWIW it's very easy to extract term-doc counts from a lucene index and format them for softw

Duplicates removal in search results

2006-12-14 Thread qaz zaq
How can I remove duplicate records from the search results? I.e., I have multiple results with the same title in the 'title' field, and I want only 1 record per title. How can I achieve that? Thanks!! Need

RE: Index XML file

2006-12-14 Thread MALCOLM CLARK
Hi, Sent you a private email with some code attached ;-) Malcolm yeohwm <[EMAIL PROTECTED]> wrote: Hi, Thanks for the help. Please do let me know what jar file that I needed and where I can find them. Regards, Wooi Meng

RE: Index XML file

2006-12-14 Thread yeohwm
Hi, Thanks for the help. Please do let me know what jar file that I needed and where I can find them. Regards, Wooi Meng

Re: Duplicates removal in search results

2006-12-14 Thread Erick Erickson
You need to search for all documents with the title you care about, decide which one to keep and remove all the others. You'll probably need a TermDocs/TermEnum to go through all the items in your index to create the list of documents to remove. Erick On 12/14/06, qaz zaq <[EMAIL PROTECTED]> wr

Search Suggestions

2006-12-14 Thread Simon Wistow
Yahoo! has a search suggestion feature so that if you search for, say, 'shoes', it also recommends payless shoes, jordan shoes, aldo shoes, nike shoes, bakers shoes and a bunch of others. Has anyone built something like that in Lucene? Simon

Re: range query on dates

2006-12-14 Thread Doron Cohen
There is an example in TestDateFilter http://svn.apache.org/viewvc/lucene/java/trunk/src/test/org/apache/lucene/search/TestDateFilter.java?view=log "Cam Bazz" <[EMAIL PROTECTED]> wrote: > Hello, > > how can I make a query to bring documents between timestamp begin and > timestamp end, given that I

Re: Lucene & LSA

2006-12-14 Thread Marvin Humphrey
On Dec 14, 2006, at 11:16 AM, Soeren Pekrul wrote: it is possible to extract the matrix from the indexing file? I don’t know any API to extract the matrix from the index file directly. How could we make it work to write an open source decomposed vector model search engine a la LSA witho

Re: Index XML file

2006-12-14 Thread MALCOLM CLARK
Hi, I used the SAX API last year to parse and index the INEX 1.4 collection using Lucene (eventually succeeded after many naive attempts). Can you give me a sample of the XML you are trying to parse? Email me and I should be able to send you some code which may help. regards, Malcol

Re: Lucene & LSA

2006-12-14 Thread Soeren Pekrul
mariolone wrote: They are successful to extract the matrix. But with collections of large documents is not one too much expensive solution? I have a quite small collection with 14,960 documents and 29,828 unique terms. If I remember right it took a few minutes on a normal laptop computer to

Re: range query on dates

2006-12-14 Thread Erick Erickson
I'd search this mail archive for DateTools; this has been discussed repeatedly and you'd get lots and lots of info. Erick On 12/14/06, Cam Bazz <[EMAIL PROTECTED]> wrote: Hello, how can I make a query to bring documents between timestamp begin and timestamp end, given that I have stored my da

Re: Lucene change field values to wrong ones when indexing

2006-12-14 Thread Doron Cohen
Two things I would check: 1) converting pubDate to String during indexing for later date-range-filtering search results might not work well, because, e.g., string wise, "9" > "100". You could use Lucene's DateTools - there's an example in TestDateFilter - http://svn.apache.org/viewvc/lucene/ja
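
The string-ordering pitfall mentioned above ("9" > "100") can be reproduced in a few lines. This is a minimal sketch, assuming non-negative integers; SortableNumbers and pad are hypothetical names, and zero-padding to a fixed width is shown as one common workaround rather than as what the poster's code does.

```java
// Shows why plain number strings sort incorrectly and how fixed-width padding restores the order.
class SortableNumbers {
    // Left-pads a non-negative number with zeros so that string order matches numeric order.
    static String pad(int value, int width) {
        if (value < 0) {
            throw new IllegalArgumentException("negative values are not handled in this sketch");
        }
        String digits = Integer.toString(value);
        if (digits.length() > width) {
            throw new IllegalArgumentException("value does not fit in " + width + " digits");
        }
        StringBuilder sb = new StringBuilder();
        for (int i = digits.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(digits).toString();
    }

    public static void main(String[] args) {
        // Compared character by character, '9' sorts after '1', so "9" comes after "100".
        System.out.println("9".compareTo("100") > 0);             // prints true
        // After padding, "009" sorts before "100", matching the numeric order.
        System.out.println(pad(9, 3).compareTo(pad(100, 3)) < 0); // prints true
    }
}
```

Date strings have the same property: they compare correctly only when every value uses the same fixed-width, most-significant-first layout.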

range query on dates

2006-12-14 Thread Cam Bazz
Hello, how can I make a query to bring documents between timestamp begin and timestamp end, given that I have stored my dates using DateTools.timeToString(long)? Best regards, -C.B.
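
As an illustration of why a range over such date strings works, here is a sketch that uses only the JDK. It assumes the stored values are fixed-width yyyyMMddHHmm strings in UTC, which is meant to resemble minute-resolution DateTools output but should be checked against the Lucene version in use; DateRangeSketch, encode and inRange are hypothetical names.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Fixed-width timestamp strings sort in chronological order, so a range check is a string comparison.
class DateRangeSketch {
    // Assumed layout: yyyyMMddHHmm in UTC (an approximation of minute-resolution date strings).
    private static final DateTimeFormatter MINUTES =
            DateTimeFormatter.ofPattern("yyyyMMddHHmm").withZone(ZoneOffset.UTC);

    // Encodes a timestamp in milliseconds since the epoch as a sortable string.
    static String encode(long epochMillis) {
        return MINUTES.format(Instant.ofEpochMilli(epochMillis));
    }

    // True if `stored` lies between `begin` and `end`, both inclusive, in lexicographic order.
    static boolean inRange(String stored, String begin, String end) {
        return stored.compareTo(begin) >= 0 && stored.compareTo(end) <= 0;
    }

    public static void main(String[] args) {
        long day = 24L * 60 * 60 * 1000;
        long dec14 = 1166054400000L; // 2006-12-14 00:00 UTC
        String begin = encode(dec14);
        String end = encode(dec14 + day);
        System.out.println(begin);                                          // prints 200612140000
        System.out.println(inRange(encode(dec14 + day / 2), begin, end));   // prints true
        System.out.println(inRange(encode(dec14 + 2 * day), begin, end));   // prints false
    }
}
```

In Lucene itself the comparison would be done by a range query or filter over the date field with the encoded begin and end values as bounds; the follow-ups in this thread point to TestDateFilter for a concrete example.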

Re: datetools and index storage question

2006-12-14 Thread Cam Bazz
This made it very clear. Thank you. On 12/14/06, Erick Erickson <[EMAIL PROTECTED]> wrote: UN_TOKENIZED is probably the safest way to store your dates. You could get by with using, say, WhitespaceAnalyzer for indexing and parsing the query, but that would invite hard-to-track bugs to no advanta

Re: datetools and index storage question

2006-12-14 Thread Erick Erickson
UN_TOKENIZED is probably the safest way to store your dates. You could get by with using, say, WhitespaceAnalyzer for indexing and parsing the query, but that would invite hard-to-track bugs to no advantage I can see. I'll let someone more knowledgeable than me talk about NORMS field.store.NO p

Re: Lucene change field values to wrong ones when indexing

2006-12-14 Thread Steven Rowe
Hi Adrian, I don't see anything obviously wrong with your code. Can you give more details about which field values are different from what you expect? I'm guessing it's the id field you're worried about, but it's not clear from what you have written whether it's the title or the id field which i

Re: lucene functionality

2006-12-14 Thread Patrick Turcotte
On 12/14/06, Erik Hatcher <[EMAIL PROTECTED]> wrote: On Dec 13, 2006, at 1:51 PM, Patrick Turcotte wrote: > I would suggest you take a look at exist-db (http://exist-db.org/). I really doubt eXist can handle 10M XML files. Last time I tried it, it choked on 20k of them. It is true I don't

Re: lucene functionality

2006-12-14 Thread Erik Hatcher
On Dec 13, 2006, at 1:51 PM, Patrick Turcotte wrote: I would suggest you take a look at exist-db (http://exist-db.org/). I really doubt eXist can handle 10M XML files. Last time I tried it, it choked on 20k of them. Erik A database for XML documents that support XQuery. We a

datetools and index storage question

2006-12-14 Thread Cam Bazz
Hello Everyone, I have two fields that contain the original and modification dates of certain documents. I decided to store them like: Document entry = new Document(); entry.add(new Field("edate", DateTools.timeToString(edate.getTime(), DateTools.Resolution.MINUTE), Field.Store.YES, Field.Index.

Re: Lucene scoring: coord_q_d factor

2006-12-14 Thread Grant Ingersoll
FYI: The Wiki has a fair number of resources on IR: http://wiki.apache.org/jakarta-lucene/InformationRetrieval (I have added a link to this conversation, which contains a lot of useful information) Karl, if you are so inclined, please feel free to add any of the references you have found t

Lucene change field values to wrong ones when indexing

2006-12-14 Thread Java Programmer
Hello, I have a problem with my search code - I try to index some data while searching simultaneously. Everything goes fine until some amount of data has been indexed; then my field values become wrong. E.g., I have a field with title indexed as "Nowitzki führt "Mavs" zum ersten Heimsieg" and inner id "15" (not doc id,

Lucene Vector Model

2006-12-14 Thread Gorka Naveira
Hi! I'm working on Lucene's vector model, and its way of scoring, and I have some doubts. As I think Lucene introduces terms (DocumentWriter.addPosition, using Postings) in index with some information, such as offset, document number and term frequency. I would like to apply to each term anoth

Re: Index XML file

2006-12-14 Thread Martin Braun
Hi Wooi, >Just wondering is there anyone used Digester to extract xml content and > index the xml file? Is there any source that I can refer to on how to > extract the xml contents. Or is there any other xml parser is much easier to > use? Perhaps this article may help: http://www-128.ibm.com

Re: Lucene id generation

2006-12-14 Thread karl wettin
On 11 Dec 2006, at 20:04, Chris Hostetter wrote: if you are trying to think of Lucene's docid as a meaningful number, you are doing something wrong. There is this one place where I use it. The index is add only, and the only data that interests me is the stored field MyID, also kept track i

Re: Lucene & LSA

2006-12-14 Thread mariolone
Thanks for the aid, Soren!!! They are successful to extract the matrix. But with collections of large documents is not one too much expensive solution? it is possible to extract the matrix from the indexing file? Mario Sören Pekrul wrote: > > Hello Mario, > > I had a similar problem a few

Re: Index XML file

2006-12-14 Thread Heikki Doeleman
I use XmlBeans to "unmarshall" an XML file into Java objects, from which you can easily retrieve the textual values of any element to be used for indexing. See http://xmlbeans.apache.org/ for more information on this library. There are various similar libraries but I find XmlBeans superior in s

Index XML file

2006-12-14 Thread spinergywmy
Hi, Just wondering whether anyone has used Digester to extract XML content and index the XML file. Is there any source that I can refer to on how to extract the XML contents? Or is there any other XML parser that is much easier to use? Thanks, regards, Wooi Meng -- View this message in context:

Re: Lucene scoring: coord_q_d factor

2006-12-14 Thread Soeren Pekrul
Soeren Pekrul wrote: The score for a document is the sum of the term weights w(tf, idf) for each containing term. So you have already the combination of coordination level matching with IDF. Now it is possible that your query requests three terms A, B and C. Two of them (A and B) are quite ofte
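
The combination described above, a sum of term weights scaled by a coordination factor, can be sketched as follows. This is a deliberately simplified model, not Lucene's full scoring formula: the weights are assumed to be precomputed w(tf, idf) values, the coordination factor is taken as matched query terms divided by total query terms, and CoordSketch is a hypothetical name.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified scoring: (sum of weights of matched query terms) * (matched terms / query terms).
class CoordSketch {
    static double score(List<String> queryTerms, Map<String, Double> docTermWeights) {
        if (queryTerms.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        int matched = 0;
        for (String term : queryTerms) {
            Double weight = docTermWeights.get(term);
            if (weight != null) {
                sum += weight;
                matched++;
            }
        }
        // Coordination factor: fraction of the query terms that the document contains.
        return sum * matched / queryTerms.size();
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("A", "B", "C");

        Map<String, Double> doc1 = new HashMap<>(); // two frequent, low-weight terms
        doc1.put("A", 1.0);
        doc1.put("B", 1.0);

        Map<String, Double> doc2 = new HashMap<>(); // one rare, high-weight term
        doc2.put("C", 3.0);

        System.out.println(score(query, doc1)); // 2.0 * 2/3, about 1.33
        System.out.println(score(query, doc2)); // 3.0 * 1/3 = 1.0
    }
}
```

In this made-up example doc2 would win on the raw weight sum (3.0 against 2.0), but the coordination factor moves doc1 ahead because it matches more of the query.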

Re: Lucene scoring: coord_q_d factor

2006-12-14 Thread Karl Koch
I think I understand now. I also have evidence from literature. So I would say that my question is solved. :) Thank you, Otis, and everybody else for contributing! Karl Original Message Date: Thu, 14 Dec 2006 09:40:31 +0100 From: Soeren Pekrul <[EMAIL PROTECTED]> To: java-us

Re: Lucene & LSA

2006-12-14 Thread Soeren Pekrul
Hello Mario, I had a similar problem a few weeks ago (thread "How to get Term Weights (document term matrix)?", 2006-11-02, http://www.gossamer-threads.com/lists/lucene/java-user/41726). I think there is no simple function creating a document term matrix or accessing it. I extracted the matr
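
For readers unfamiliar with the term, a document-term matrix of raw counts can be sketched as below. The example works on already-tokenised text and does not read a Lucene index; TermDocMatrix and build are hypothetical names, and the sparse map-of-maps layout is just one convenient representation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Builds a sparse term -> (document index -> count) matrix from tokenised documents.
class TermDocMatrix {
    static SortedMap<String, SortedMap<Integer, Integer>> build(List<List<String>> docs) {
        SortedMap<String, SortedMap<Integer, Integer>> matrix = new TreeMap<>();
        for (int d = 0; d < docs.size(); d++) {
            for (String term : docs.get(d)) {
                // Create the row for a new term on first sight, then increment the cell for document d.
                matrix.computeIfAbsent(term, k -> new TreeMap<>()).merge(d, 1, Integer::sum);
            }
        }
        return matrix;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("lucene", "index", "lucene"),
                Arrays.asList("index", "matrix"));
        System.out.println(build(docs)); // prints {index={0=1, 1=1}, lucene={0=2}, matrix={1=1}}
    }
}
```

Only non-zero cells are stored, which matters for collections like the one discussed in this thread, where most terms occur in only a few documents.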

Re: Lucene scoring: coord_q_d factor

2006-12-14 Thread Soeren Pekrul
Karl Koch wrote: If I do not misunderstand that extract, I would say it suggests the combination of coordination level matching with IDF. I am interested in your view and those who read this? I understand that sentence: "The natural solution is to correlate a term's matching value with its co

Re: Indexing clarification , please advice

2006-12-14 Thread Lukas Vlcek
Hi, Maybe you could consider using Compass (http://www.opensymphony.com/compass/) which could help you in your situation. They claim that some actions (like updating the index very often) are treated in a very efficient way (due to caching which is not a native part of Lucene library). Regards, L