Yes, the biggest drawback is text that spans lines:

  L1 - it was the best of times,
  L2 - it was the worst of times

A search for "it was the best of times, it was the worst of times" (with quotes) will return no hits, because no single Lucene document contains the whole phrase.
I would be interested in an alternative approach here, because I have encountered this problem myself. A possible solution would be to keep both a fulltext index and a linetext index: the query is run against the fulltext index, and the hits it returns are then compared against the linetext index to find the exact line number of each fulltext hit.
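The two-step lookup suggested above can be sketched without Lucene at all. In this minimal sketch, plain String.indexOf stands in for the query against the whole text, and a table of line start offsets stands in for the per-line index. The class and method names (LineLocator, locate, lineOf) are made up for illustration, and joining lines with a single space is an assumption about how the text would be normalised.

```java
import java.util.ArrayList;
import java.util.List;

final class LineLocator {
    /** Returns {startLine, endLine} (1-based) for every occurrence of phrase. */
    static List<int[]> locate(String[] lines, String phrase) {
        List<int[]> hits = new ArrayList<>();
        if (phrase.isEmpty()) {
            return hits; // nothing sensible to locate
        }
        // Join the lines with one space so a phrase may span a line break,
        // and remember the offset at which each line starts.
        StringBuilder full = new StringBuilder();
        int[] starts = new int[lines.length];
        for (int i = 0; i < lines.length; i++) {
            if (i > 0) {
                full.append(' ');
            }
            starts[i] = full.length();
            full.append(lines[i]);
        }
        String text = full.toString();
        // Step 1: search the whole text. Step 2: map offsets back to lines.
        int pos = text.indexOf(phrase);
        while (pos >= 0) {
            int end = pos + phrase.length() - 1;
            hits.add(new int[] { lineOf(starts, pos), lineOf(starts, end) });
            pos = text.indexOf(phrase, pos + 1);
        }
        return hits;
    }

    /** 1-based number of the last line whose start offset is <= offset. */
    static int lineOf(int[] starts, int offset) {
        int line = 0;
        for (int i = 0; i < starts.length; i++) {
            if (starts[i] <= offset) {
                line = i;
            }
        }
        return line + 1;
    }

    public static void main(String[] args) {
        String[] lines = { "it was the best of times,", "it was the worst of times" };
        for (int[] hit : locate(lines, "it was the best of times, it was the worst of times")) {
            System.out.println("hit spans lines " + hit[0] + " to " + hit[1]);
        }
    }
}
```

With a real index the offsets would have to come from the search engine (for example from stored term positions), which this sketch does not attempt.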
Regards, Karl Øie
On 11. apr. 2005, at 15.46, cerberus yao wrote:
But "crash.java" is physically just a single document.
Is there any drawback if we treat each line in "crash.java" as a document?
Another question: if we need to present the search result as the hit line plus n lines before and after it, how can I do this when each line is stored in a separate document? For example:
1. The contents of crash.java are:

     public class crash {
       public static void main(String[] args) {
       }
     }

2. The query is "main".

3. The search result is the hit line, +1 line and -1 line:

     1 public class crash {
     2   public static void main(String[] args) {
     3   }
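One way to picture the "hit line plus n lines" requirement: once the hit's line number is known, the context is simply the range of lines from hit - n to hit + n, clamped to the file. With one Lucene document per line, that would presumably mean fetching the documents that share the same file id and whose line field falls in that range. The sketch below shows only the range logic in plain Java; ContextLines and around are hypothetical names, and the lines are assumed to be already loaded into a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ContextLines {
    /**
     * Returns the hit line plus up to n lines before and after it,
     * each prefixed with its 1-based line number.
     */
    static List<String> around(List<String> lines, int hitLine, int n) {
        // Clamp the window so it never runs past either end of the file.
        int first = Math.max(1, hitLine - n);
        int last = Math.min(lines.size(), hitLine + n);
        List<String> out = new ArrayList<>();
        for (int i = first; i <= last; i++) {
            out.add(i + " " + lines.get(i - 1));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> crash = Arrays.asList(
            "public class crash {",
            "public static void main(String[] args) {",
            "}",
            "}");
        // The query "main" hits line 2; show one line either side.
        for (String s : around(crash, 2, 1)) {
            System.out.println(s);
        }
    }
}
```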
On Apr 11, 2005 8:28 PM, Karl Øie <[EMAIL PROTECTED]> wrote:

Most indexing creates one Lucene document for each source document. What you would need is to create a Lucene document for each line.
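    // Assumes `reader` is a java.io.BufferedReader over crash.java and
    // `index_writer` is an open org.apache.lucene.index.IndexWriter.
    String src_doc = "crash.java";
    int line_number = 0;  // 0-based; start at 1 to match editor line numbers
    String line;
    // readLine() returns null at end of file
    while ((line = reader.readLine()) != null) {
        Document ld = new Document();
        // Field(name, value, store, index, tokenize)
        ld.add(new Field("id", src_doc, true, true, false));
        ld.add(new Field("line", "" + line_number, true, true, false));
        ld.add(new Field("text", line, false, true, true));
        index_writer.addDocument(ld);
        line_number++;
    }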
This will create a small Lucene document for each line. When you search,
you will find documents based on the content of the line, with the line
number available as a field. The reason syntax highlighting works without
creating a Lucene document for each line is that syntax highlighting
bases its result on groups of occurrences of text, not on line numbers.
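To make the effect of one-document-per-line indexing concrete, here is a toy stand-in for the index in plain Java. It is not Lucene: a hash map from term to "file:line" postings plays the role of the index, the split on non-word characters is a crude assumption about tokenisation, and lines are numbered from 1 here. LineIndex, addFile and search are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LineIndex {
    // term -> "file:line" postings, one posting per line containing the term
    private final Map<String, List<String>> postings = new HashMap<>();

    /** Indexes every line of a file as its own tiny "document". */
    void addFile(String id, List<String> lines) {
        for (int i = 0; i < lines.size(); i++) {
            String posting = id + ":" + (i + 1); // 1-based line number
            for (String term : lines.get(i).toLowerCase().split("\\W+")) {
                if (term.isEmpty()) {
                    continue;
                }
                List<String> list = postings.computeIfAbsent(term, k -> new ArrayList<>());
                if (!list.contains(posting)) {
                    list.add(posting);
                }
            }
        }
    }

    /** Returns the "file:line" postings for a single term. */
    List<String> search(String term) {
        return postings.getOrDefault(term.toLowerCase(), new ArrayList<>());
    }

    public static void main(String[] args) {
        LineIndex index = new LineIndex();
        index.addFile("crash.java", Arrays.asList(
            "public class crash {",
            "public static void main(String[] args) {",
            "}",
            "}"));
        System.out.println(index.search("main"));
    }
}
```

Because every hit is a single line, the line number comes for free, at the cost that a phrase spanning two lines can never match.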
Regards, Karl Øie
--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
- Somewhere, out there on the Net, is an HD full of lame quotes