Announcing: IBM OmniFind Yahoo! Edition

2006-12-13 Thread Andreas Neumann
As you may have already heard, IBM and Yahoo! today released a new product named IBM OmniFind Yahoo! Edition. It is a free-of-charge search engine for web sites and file systems, which builds on Lucene and other components such as UIMA

Re: Advice on 3NF Data Structures and Lucene Please

2006-12-13 Thread Chris Lu
I think the last structure is good. The index should be structured according to how you want to search it. If your needs change, simply build another index. A single index for everything is not a good idea. An index trades space for time, so duplication is not really a concern. The first

Re: Index Excel File

2006-12-13 Thread rajan
I think there is a problem with the following line... row = sheet.getRow(i); -> row = sheet.getRow(j); Also, the following code will give you the contents: === Workbook excelDoc = Workbook.getWorkbook(new FileInputStream( file)); String content = ""

Re: Index Excel File

2006-12-13 Thread spinergywmy
Hi, I did use jexcelapi to extract the contents of the Excel file; however, I get no content when I sysout it. Below is the code that I wrote; perhaps you can point out what I have done wrong. Thanks. Workbook excelDoc = Workbook.getWorkbook(new FileInputStream(file));

Re: Index Excel File

2006-12-13 Thread rajan
Hello, I used jexcelapi. It includes a class called CSV.java in the demo package. Using that, I extracted the text from the Excel file and added that text to the index. I hope this helps. Regards, Rajan. -Original Message- From: spinergywmy <[EMAIL PROTECTED]> To: java-user@lucene.a

Re: Index Excel File

2006-12-13 Thread spinergywmy
Hi, Can you show me an example of how to extract the text from an Excel file and index it? Thanks, regards, Wooi Meng

Re: Lucene scoring: coord_q_d factor

2006-12-13 Thread Otis Gospodnetic
Hi, But isn't "coord" + TFIDF pretty intuitive? Independently, they are both useful and contribute to the final score for the match. Otis - Original Message From: Karl Koch <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Sent: Wednesday, December 13, 2006 8:35:55 PM Subject: Re: L

Re: Index Excel File

2006-12-13 Thread rajan
Hello, Please try jexcelapi. I used it successfully. POI threw an exception when an image was present in the Excel file. Regards, Rajan. -Original Message- From: spinergywmy <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Date: Wed, 13 Dec 2006 17:21:11 -0800 (PST) Subj

Re: Lucene scoring: coord_q_d factor

2006-12-13 Thread Karl Koch
Hello Paul, thank you for providing the link to that paper. I read it again, and you are right. I discovered the following text part: "In normal term co-ordination matches, if a request and document have a frequent term in common, this counts for as much as a non-frequent one; so if a request

Index Excel File

2006-12-13 Thread spinergywmy
Hi, Has anyone indexed an Excel file before? I took a look at the API classes provided by POI HSSF; however, I did not find any method to extract the text from an Excel file and index it. Please assist and let me know where I can find an example to refer to. Thanks, regards, Wooi Meng

Re: Advice on 3NF Data Structures and Lucene Please

2006-12-13 Thread Andrew Hughes
Thanks Erick, I'll give a representation of the data structure that I am trying to index (in xml). This represents a relational data structure. Because all Place (ie Kazakhstan) Person's are grouped together eta USA U.S.A US

Re: lucene functionality

2006-12-13 Thread Marcelo Ochoa
Hi Chris: > (1) Each field is searchable and indexable. ...and I assumed the real problem is being able to address use cases like "find all documents where the DRECONTENT contains the words "Action" and the words "News" near each other -- using stemming and other Text Analysis tricks I may want

Re: lucene functionality

2006-12-13 Thread Chris Hostetter
: > : For 10 million records We recommend a strong database such as Oracle. : > : > eh ... who is "We" in that statement? : We are independent consultants working for many years with Oracle databases ;) And that's a perfectly acceptable answer, I just don't want any first time Lucene users

Re: lucene functionality

2006-12-13 Thread Marcelo Ochoa
Hi Chris: On 12/13/06, Chris Hostetter <[EMAIL PROTECTED]> wrote: : For 10 million records We recommend a strong database such as Oracle. eh ... who is "We" in that statement? We are independent consultants working for many years with Oracle databases ;) I suspect you'll find other peop

Re: lucene functionality

2006-12-13 Thread Chris Hostetter
: For 10 million records We recommend a strong database such as Oracle. eh ... who is "We" in that statement? I suspect you'll find other people on this list who have no problems running Lucene indexes containing 10 million documents. If you want a database, then by all means use a database,

Re: Lucene scoring: coord_q_d factor

2006-12-13 Thread Paul Elschot
On Wednesday 13 December 2006 16:42, Karl Koch wrote: > Do you know about any papers that discuss this? Coordination is called co-ordination in the original idf paper by K. Spärck Jones, A statistical interpretation of term specificity and its application in retrieval, Journal of Documentation 2

Re: Advice on 3NF Data Structures and Lucene Please

2006-12-13 Thread Chris Lu
You are right. A database is usually in 3NF, while Lucene usually works on an array of objects. Different databases have different data models. There is quite some effort involved in crawling the database, creating the Lucene index, keeping it in sync with the database, and rendering the search results. If data model cha

Re: Indexing clarification , please advice

2006-12-13 Thread Daniel Naber
On Wednesday 13 December 2006 14:10, abdul aleem wrote: > a) Indexing large file ( more than 4MB ) >    Do i need to read the entire file as string using >    java.io and create a Document object ? You can also use a reader: http://lucene.apache.org/java/2_0_0/api/org/apache/lucene/document/Fiel

Re: lucene functionality

2006-12-13 Thread Doron Cohen
Lucene RangeQuery would do for the "time" and "numeric" reqs. "Mark Mei" <[EMAIL PROTECTED]> wrote: > At the bottom of this email is the sample xml file that we are using today. > We have about 10 million of these. > > We need to know whether Lucene can support the following functionalities. > (1)
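A note on the range approach mentioned above: term ranges in Lucene of this era are compared as strings, so numeric and time values generally need a fixed-width, sortable encoding before they are indexed. The following is a minimal sketch of that idea in plain Java, without any Lucene classes. The class name `SortableKeys`, the 10-digit width, and the restriction to non-negative values are assumptions made for illustration only.

```java
import java.util.Locale;

/** Hypothetical helper: encodes numbers so that string order equals numeric order. */
final class SortableKeys {
    private static final long MAX = 9_999_999_999L; // largest value that fits in 10 digits

    /** Zero-pads a non-negative number to a fixed width of 10 digits. */
    static String pad(long value) {
        if (value < 0 || value > MAX) {
            throw new IllegalArgumentException("value out of range: " + value);
        }
        return String.format(Locale.ROOT, "%010d", value);
    }

    /** Inclusive range check using plain string comparison, as a term range would. */
    static boolean inRange(String key, String lower, String upper) {
        return key.compareTo(lower) >= 0 && key.compareTo(upper) <= 0;
    }
}
```

With unpadded strings, "42" falls outside the range "5" to "100" because "4" sorts before "5"; with `pad`, the same check succeeds. Timestamps such as STARTTIME and ENDTIME behave the same way if they are stored in a fixed-width form like yyyyMMddHHmmss.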

Re: lucene functionality

2006-12-13 Thread Marcelo Ochoa
Hi Mark: For 10 million records We recommend a strong database such as Oracle. You can annotate the Schema (.xsd) which describes your XML record to store some fields in traditional VARCHAR2 or NUMBER columns to query them faster, and in a CLOB column. You can find more information at: http://ww

Re: lucene functionality

2006-12-13 Thread Patrick Turcotte
I would suggest you take a look at exist-db (http://exist-db.org/), a database for XML documents that supports XQuery. We are using both products here (Lucene and exist-db), and for what you are looking for, exist-db seems better. Our documents are far more complex than yours (about 500 differen

Re: Problems with Queries which contain '_' and wildcards

2006-12-13 Thread Ronnie Kolehmainen
I recognize that error message ;) You're using AnalyzingQueryParser - yes? These are imo the two most obvious options: 1. Revert to standard QueryParser - it won't analyze prefix- and

Lucene & LSA

2006-12-13 Thread mariolone
Hi, I have a problem: I must create a term-by-document matrix in which every element represents the number of occurrences of that term in the document. How can I do this? Can someone help me? Thanks to all. P.S. I must apply LSA to this matrix.
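The matrix described in this question can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the documents are already tokenized into lists of strings, and the class `TermDocumentMatrix` with its `count` method is a hypothetical name, not part of Lucene. In a Lucene setting the counts would have to come from the index itself (the TermEnum and TermDocs classes suggested elsewhere on this list for extracting words and frequencies are one possible starting point).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: term-by-document count matrix built from tokenized documents. */
final class TermDocumentMatrix {
    final List<String> terms; // row labels, in alphabetical order
    final int[][] counts;     // counts[t][d] = occurrences of term t in document d

    TermDocumentMatrix(List<List<String>> docs) {
        // First pass: collect every distinct term (TreeMap keeps them sorted).
        TreeMap<String, Integer> rowOf = new TreeMap<>();
        for (List<String> doc : docs) {
            for (String token : doc) {
                rowOf.put(token, 0);
            }
        }
        // Assign row indices in sorted order.
        int row = 0;
        for (Map.Entry<String, Integer> entry : rowOf.entrySet()) {
            entry.setValue(row++);
        }
        terms = new ArrayList<>(rowOf.keySet());
        counts = new int[terms.size()][docs.size()];
        // Second pass: count occurrences per document.
        for (int d = 0; d < docs.size(); d++) {
            for (String token : docs.get(d)) {
                counts[rowOf.get(token)][d]++;
            }
        }
    }

    /** Returns the count of a term in a document, or 0 if the term is unknown. */
    int count(String term, int doc) {
        int t = terms.indexOf(term);
        return t < 0 ? 0 : counts[t][doc];
    }
}
```

The resulting `counts` array is the raw input that an LSA step (a singular value decomposition, usually done with a separate linear algebra library) would consume.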

Problems with Queries which contain '_' and wildcards

2006-12-13 Thread Stefan Schütz
Hi, first let me explain the situation: We have to index a document which contains a field "file" to store filenames. Sometimes filenames contain an underscore or a minus (_ or -). => e.g. foo_bar.doc Indexing isn't the problem so far. But if we now try to search for "foo_b*" the QueryP

RE: de-boosting fields

2006-12-13 Thread Scott Smith
One other thing I discovered that I mention so no one else is tripped up by it. I set the boost to zero for the categories in the query. When I ran my unit tests, some of them started to fail. I eventually realized that the failures were in searches where I only wanted to find documents in the

lucene functionality

2006-12-13 Thread Mark Mei
At the bottom of this email is the sample xml file that we are using today. We have about 10 million of these. We need to know whether Lucene can support the following functionalities. (1) Each field is searchable and indexable. (2) Fields such as STARTTIME and ENDTIME need to be treated as a pai

Re: Lucene scoring: coord_q_d factor

2006-12-13 Thread Karl Koch
Do you know about any papers that discuss this? Karl Original Message Date: Wed, 13 Dec 2006 10:31:41 -0500 From: "Yonik Seeley" <[EMAIL PROTECTED]> To: java-user@lucene.apache.org Subject: Re: Lucene scoring: coord_q_d factor > On 12/13/06, Karl Koch <[EMAIL PROTECTED]> wrot

Re: Lucene scoring: coord_q_d factor

2006-12-13 Thread Yonik Seeley
On 12/13/06, Karl Koch <[EMAIL PROTECTED]> wrote: To me, it seems that coordination level matching could be used if I don't want to use TFxIDF but not together with it. In this context, I wonder what benefit the "coordination level matching" has in combination with TFxIDF? Well, if I search f
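To make the question about combining coordination with TFxIDF concrete, here is a deliberately simplified sketch in plain Java. It is not Lucene's actual scoring formula: it assumes distinct query terms, uses raw term counts as tf, takes idf values from a caller-supplied map, and leaves out normalization and boosts. The class `CoordSketch` and its methods are hypothetical names. The coordination factor is taken as the fraction of query terms found in the document.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a coordination factor applied on top of a tf-idf sum. */
final class CoordSketch {
    /** Fraction of the (distinct) query terms that occur in the document. */
    static double coord(List<String> query, List<String> doc) {
        if (query.isEmpty()) {
            return 0.0;
        }
        int overlap = 0;
        for (String term : query) {
            if (doc.contains(term)) {
                overlap++;
            }
        }
        return (double) overlap / query.size();
    }

    /** Sum over query terms of (raw term count in doc) * (supplied idf). */
    static double tfIdf(List<String> query, List<String> doc, Map<String, Double> idf) {
        double sum = 0.0;
        for (String term : query) {
            int tf = Collections.frequency(doc, term);
            sum += tf * idf.getOrDefault(term, 0.0);
        }
        return sum;
    }

    /** Combined score: coordination factor times the tf-idf sum. */
    static double score(List<String> query, List<String> doc, Map<String, Double> idf) {
        return coord(query, doc) * tfIdf(query, doc, idf);
    }
}
```

In this toy setup, for a two-term query with equal idf values, a document that repeats one query term three times has a higher tf-idf sum than a document containing each term once, but the coordination factor halves its score and reverses the ranking. That is one way to see how the two signals can complement each other.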

Re: Lucene scoring: coord_q_d factor

2006-12-13 Thread Karl Koch
Hello Steven, unfortunately I don't have access to these books right now. I will try to get hold of them. Thank you for these pointers. :) I had a quick look at "coordination level matching" on the web and found evidence that this seemed to be an early retrieval strategy. My question is mainly

Re: Indexing clarification , please advice

2006-12-13 Thread abdul aleem
Many thanks Erick, Your points are valid. I was thinking of the entire log file as one Lucene document, which was wrong; chopping up the log file might be the way to go. Sorry for my poor wording; yes, you got that right, the timestamp must be added as a "FIELD", that is what I meant. I really appreciate your detailed reply,
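The "chop the log file" idea can be sketched in plain Java as below. The log format is an assumption, since the thread does not show it: each entry is taken to start with a timestamp of the form yyyy-MM-dd HH:mm:ss, and lines without a leading timestamp are treated as continuations of the previous entry. `LogChopper` and `Entry` are hypothetical names; each resulting entry would then become one Lucene document with the timestamp in its own field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: splits log lines into entries, one per leading timestamp. */
final class LogChopper {
    /** One log entry: its timestamp plus the message text (may span several lines). */
    static final class Entry {
        final String timestamp;
        final StringBuilder text = new StringBuilder();

        Entry(String timestamp, String firstLine) {
            this.timestamp = timestamp;
            this.text.append(firstLine);
        }
    }

    // Assumed entry start: "yyyy-MM-dd HH:mm:ss <message>"
    private static final Pattern START =
        Pattern.compile("^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}) (.*)$");

    static List<Entry> chop(Iterable<String> lines) {
        List<Entry> entries = new ArrayList<>();
        Entry current = null;
        for (String line : lines) {
            Matcher m = START.matcher(line);
            if (m.matches()) {
                // A new timestamp starts a new entry.
                current = new Entry(m.group(1), m.group(2));
                entries.add(current);
            } else if (current != null) {
                // Continuation line (for example a stack trace) joins the current entry.
                current.text.append('\n').append(line);
            }
            // Lines before the first timestamp are skipped.
        }
        return entries;
    }
}
```

Because the method accepts any `Iterable<String>`, the lines can be fed from a buffered reader one at a time, so a file larger than 4 MB never needs to be held in memory as a single string.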

Re: Indexing clarification , please advice

2006-12-13 Thread Erick Erickson
Let me take a crack at it. See below... On 12/13/06, abdul aleem <[EMAIL PROTECTED]> wrote: Hello All, Apologies if it is a naive question a) Indexing large file ( more than 4MB ) Do I need to read the entire file as string using java.io and create a Document object ? Essentially yes.

Re: Advice on 3NF Data Structures and Lucene Please

2006-12-13 Thread Erick Erickson
Tell us more about the problem you are trying to solve. Lucene is designed for large text searching, not relations. Trying to "index a data structure" seems like mis-application of Lucene. Without some idea of what you are trying to accomplish, any advice you get is irrelevant at best... Best Er

Indexing clarification , please advice

2006-12-13 Thread abdul aleem
Hello All, Apologies if it is a naive question a) Indexing large file ( more than 4MB ) Do I need to read the entire file as string using java.io and create a Document object ? The file contains timestamp, if I need to index on timestamp is parsing the entire file manually (tokeni

Re: Extracting data from Lucene index files

2006-12-13 Thread Grant Ingersoll
Take a look at TermDocs and TermEnum. -Grant On Dec 13, 2006, at 6:02 AM, Venkateshprasanna wrote: I would like to use the data stored in the Lucene indexes, like the words and their frequencies and store them in a database. Can anyone suggest a way of going about it or is it possible at

Extracting data from Lucene index files

2006-12-13 Thread Venkateshprasanna
I would like to use the data stored in the Lucene indexes, like the words and their frequencies, and store them in a database. Can anyone suggest a way of going about it, or is it possible at all? TIA Prasanna