Let me take a crack at it. See below... On 12/13/06, abdul aleem <[EMAIL PROTECTED]> wrote:
> Hello All,
>
> Apologies if this is a naive question.
>
> a) Indexing a large file (more than 4 MB): do I need to read the entire
> file as a string using java.io and create a Document object?
Essentially yes, IF you must index the whole document as a single Lucene
document; see my reply later for why you may not want to do this. The
following are equivalent:

Document doc = new Document();
doc.add(new Field("textfield", "data1 data2 data3",
        Field.Store.NO, Field.Index.TOKENIZED));
doc.add(new Field("textfield", "data4 data5 data6",
        Field.Store.NO, Field.Index.TOKENIZED));

and

Document doc = new Document();
doc.add(new Field("textfield", "data1 data2 data3 data4 data5 data6",
        Field.Store.NO, Field.Index.TOKENIZED));

So you could read chunks of your file and index them into the *same*
field before writing the document to the index, or you could read the
file as a single chunk and index it all at once. HOWEVER: note that by
default Lucene indexes only the first 10,000 tokens of a field in a
single document; see IndexWriter.setMaxFieldLength. A sketch of the
chunked approach follows.
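For concreteness, here is a minimal, untested sketch of that chunked
approach against the Lucene 2.0-era API. The index path, the field name,
treating each line as a "chunk", and raising maxFieldLength to
Integer.MAX_VALUE are all my own choices, not anything from your post:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class ChunkedFileIndexer {
    public static void main(String[] args) throws IOException {
        // true = create a fresh index at this (hypothetical) path
        IndexWriter writer = new IndexWriter("/tmp/log-index",
                new StandardAnalyzer(), true);

        // Raise the cap from the default of 10,000 tokens so a large
        // file isn't silently truncated.
        writer.setMaxFieldLength(Integer.MAX_VALUE);

        Document doc = new Document();
        BufferedReader reader = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = reader.readLine()) != null) {
            // Repeated add() calls to the same field name are indexed
            // as if the values had been concatenated.
            doc.add(new Field("textfield", line,
                    Field.Store.NO, Field.Index.TOKENIZED));
        }
        reader.close();

        writer.addDocument(doc);
        writer.close();
    }
}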
> The file contains a timestamp; if I need to index on the timestamp, is
> parsing the entire file manually (tokenizing) and storing the timestamp
> in the document object the only way? Or are there any alternatives?
Perhaps I don't understand the problem. You can store the timestamp as a
*field* in a document; is that what you mean by storing it as a
"document object"? But there's no way I know of to have Lucene
automatically do something with arbitrary text in the input stream. You
could write a custom Analyzer (see the SynonymAnalyzer in Lucene in
Action for a model). That Analyzer would be responsible for recognizing
timestamps in the input stream and doing something special with them;
a rough sketch is below.
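Untested sketch of what such an Analyzer might look like on the Lucene
2.0-era Token API, loosely modeled on the SynonymAnalyzer idea. The
HH:mm:ss regex and the "timestamp" token type are placeholders I made
up; adapt both to your log format:

import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

/** Tags tokens that look like HH:mm:ss timestamps with a distinct type. */
public class TimestampAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new TimestampFilter(new WhitespaceTokenizer(reader));
    }

    private static class TimestampFilter extends TokenFilter {
        TimestampFilter(TokenStream in) { super(in); }

        public Token next() throws IOException {
            Token t = input.next();
            if (t != null && t.termText().matches("\\d{2}:\\d{2}:\\d{2}")) {
                // Re-create the token with a special type so downstream
                // code can treat timestamps differently.
                return new Token(t.termText(), t.startOffset(),
                        t.endOffset(), "timestamp");
            }
            return t;
        }
    }
}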
> b) I need to search the contents of a log file which is changing
> rapidly. From initial testing I see that changes to the file are not
> reflected unless it is *indexed* again. Do we always need to index the
> files before searching if the content of the file changes dynamically?
Yes. That is, you can't search data you haven't indexed.
> (The log file has a pattern and always logs in a similar fashion. Each
> time, indexing takes a lot of time because the file is large (approach
> a). Are there any workarounds for this?)
I think you need to re-think your approach. A document in Lucene is
whatever you want to think of it as. For instance, you could index your
log file such that each Lucene "document" is all the data added to the
log file over some specified time interval.

Say you have a log that starts at midnight. Each Lucene "document" could
be all the data added to the log during each minute: one document for
the data logged between 12:00:00 and 12:00:59, another for everything
between 12:01:00 and 12:01:59, and so on. That way you don't have to
re-index the entire log, just everything since the last interval you
already indexed.

I'm not necessarily recommending this approach, but using it to
illustrate that you don't need to think of a Lucene Document as your
entire log file. You may be much better off slicing the data up somehow
and having a one-to-many relationship between your log and the Lucene
documents. Perhaps you could index every message as an individual
document (which would deal with your timestamp issue), or..... Something
along those lines is sketched below.

Hope this helps
Erick
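To make the slicing idea concrete, here's a rough sketch of the
one-document-per-log-line variant. The field names, the offset
bookkeeping, and the assumption that the timestamp is the first token on
each line are all mine; in real use you'd persist lastOffset between
runs and handle character encoding more carefully than readLine() does:

import java.io.IOException;
import java.io.RandomAccessFile;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

/**
 * Indexes only what was appended to the log since the last run;
 * each log line becomes its own Lucene Document.
 */
public class IncrementalLogIndexer {
    private long lastOffset = 0;  // persist this between runs in real use

    public void indexNewEntries(String logPath, String indexPath)
            throws IOException {
        RandomAccessFile log = new RandomAccessFile(logPath, "r");
        log.seek(lastOffset);  // skip everything already indexed

        // create=false appends to an existing index;
        // pass true on the very first run.
        IndexWriter writer = new IndexWriter(indexPath,
                new StandardAnalyzer(), false);
        String line;
        while ((line = log.readLine()) != null) {
            Document doc = new Document();
            // Assumes the timestamp is the first token of each line.
            String timestamp = line.split(" ", 2)[0];
            doc.add(new Field("timestamp", timestamp,
                    Field.Store.YES, Field.Index.UN_TOKENIZED));
            doc.add(new Field("body", line,
                    Field.Store.YES, Field.Index.TOKENIZED));
            writer.addDocument(doc);
        }
        lastOffset = log.getFilePointer();
        log.close();
        writer.close();
    }
}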
> I would greatly appreciate any inputs on the above.
>
> Many thanks,
> Abdul