Analysis

2005-11-01 Thread Malcolm

Hi,
I've been reading my new project bible, 'Lucene in Action', about Analysis in 
Chapter 4, and wondered what others are doing for indexing XML (if anyone else 
is, that is!).
Are you folks just writing your own analyzers or utilising the current Lucene 
analysis libraries?

thanks,
Malcolm Clark 






Re: Analysis

2005-11-01 Thread Malcolm

Hi,
I'm just asking for opinions on Analyzers for the indexing. For example, 
Otis in his article uses the WhitespaceAnalyzer and the Sandbox program uses 
the StandardAnalyzer. I am just gauging opinions on the subject with regard 
to XML.
I'm using a mix of the Sandbox XMLDocumentHandlerSAX and a bit extra. I 
originally started using Digester but found that I preferred the Sandbox 
implementation.
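
For anyone comparing, a quick way to see what each analyzer does to the same 
text is to dump its tokens. A minimal sketch, assuming the Lucene 1.4-era 
TokenStream.next()/Token.termText() API:

import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class AnalyzerDemo {
    private static void dump(Analyzer analyzer, String text) throws Exception {
        TokenStream stream = analyzer.tokenStream("body", new StringReader(text));
        Token token;
        while ((token = stream.next()) != null) {
            System.out.print("[" + token.termText() + "] ");
        }
        System.out.println();
    }

    public static void main(String[] args) throws Exception {
        String text = "IEEE Annals of the History of Computing, Vol. 17, No. 1";
        // WhitespaceAnalyzer only splits on whitespace: case and punctuation survive.
        dump(new WhitespaceAnalyzer(), text);
        // StandardAnalyzer lowercases, drops English stop words and most punctuation.
        dump(new StandardAnalyzer(), text);
    }
}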

Thanks,
Malcolm Clark 






Re: Analysis

2005-11-01 Thread Malcolm
I'm currently indexing the INEX collection and then performing queries on 
the format features within the text. I am using a wide range of the XML 
features. The reason I asked about the XML analysis is that I am interested 
in opinions and reasons, for adding a wide range of discussion to my 
dissertation regarding Lucene.

Thanks,
Malcolm 






Re: AW: AW: Determine the index of a hit after using MultiSearcher

2005-11-29 Thread Malcolm
Are you going to write another edition with lots of Servlet code? If that's 
the case, put me down for an advance copy. Lucene and servlets is a direction 
I may be going in the future.

Thanks,
Malcolm Clark 






Re: repeating fields

2005-12-07 Thread Malcolm


Firstly, you should obtain Luke and check everything is laid out correctly 
in your index.

Secondly, maybe a wildcard/prefix query or a TermQuery. For example (TermQuery):

TermQuery heTerm = new TermQuery(new Term("p", "x"));
TermQuery sheTerm = new TermQuery(new Term("p", "xx"));
TermQuery theyTerm = new TermQuery(new Term("p", "xxx"));

I'm sure the folks on here will be able to come up with a more efficient 
method. Try obtaining Lucene in Action or look at the examples at 
http://lucenebook.com/

cheers,
Malcolm Clark 






Re: Non scoring search

2005-12-07 Thread Malcolm
Probably being very naive here but:
These are my index details:

Location:C:\LuceneDemo\Project6thDec
Number of documents in Index: 571
Index Current Version: 2
Last Modified: 1133899684000
The index has not had any deletions.

What is: Last Modified: 1133899684000?

I thought of indexing a date for each field, but that's too inefficient as I only 
want it once. I just want the date for the index. Where does it lie?
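
Incidentally, that number looks like milliseconds since the Unix epoch. A 
minimal sketch of reading and converting it, assuming Lucene's static 
IndexReader.lastModified helper:

import java.util.Date;
import org.apache.lucene.index.IndexReader;

public class IndexDate {
    public static void main(String[] args) throws Exception {
        // Static helper: reads the timestamp without opening the whole index.
        long millis = IndexReader.lastModified("C:\\LuceneDemo\\Project6thDec");
        System.out.println(new Date(millis)); // prints a human-readable date
    }
}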
cheers,
Malcolm

Re: repeating fields

2005-12-07 Thread Malcolm


That's what I have: loads of different tags and (abstract) tags etc. in each 
XML document, so a Lucene document for each is okay.
malcolm 






Re: Vector Space Model <-> Probabilistic Model

2006-02-17 Thread Malcolm

I know of one I used for my thesis. The reference is:
Fuhr, N. 2001, "Models in information retrieval", pp. 21-50.

http://portal.acm.org/citation.cfm?id=567294

I may have an electronic version. If you need it, give me an email address, as 
this service doesn't allow attachments.


Hope this helps,

Malcolm Clark





Re: Lucene and SAX

2005-10-25 Thread Malcolm
It's XML like this. It has 120-ish volumes with references to 12,107 articles, 
which are like this below (the element tags were stripped by the list archive; 
only the text content survives):

A1003
10.1041/A1003s-1995
IEEE Annals of the History of Computing
1058-6180/95/$4.00 © 1995 IEEE
Vol. 17, No. 1
Spring 1995
pp. 3-3

About this Issue, pp. 3-3
J.A.N. Lee, Editor-in-Chief


The first issue of our 17th volume is as diverse in topics as any nontheme 
issue that we have tried to present over the past many years. However, it still 
represents the work of the English‐speaking world of the North Atlantic 
rather than a broader picture of computing in the whole world. The Editorial 
Board and the article editors of the Annals
are doing their best to bring the history of the whole world of computing to 
our readers, but it does require authors in other countries to offer their 
manuscripts for our consideration. Please take this as an open invitation to 
authors in other parts of the world to submit papers to the Annals
for review and help us to follow the lead of our parent organization in being 
“The World’s Computer Society.”
The five major articles in this issue represent several manuscripts that 
have been in our files for some time, and we are grateful to the authors for 
having “stuck with us” while we reviewed, re‐reviewed, and 
reworked their papers. Articles in the field of history do not always present 
the work of the authors themselves (though we welcome pioneers to give us their 
own stories, as in the case of the 1935 article by John McPherson in this 
issue); thus, answering the question “is it accurate?” is not 
always easy. In fact, we ask our referees to answer the following questions 
about each manuscript, and their responses determine whether we accept the 
manuscript “as is” or whether we ask the author(s) to revise the 
material:


Are the issues addressed in the paper stated clearly enough?




Re: Lucene and SAX

2005-10-25 Thread Malcolm

Hi Grant,
A highly shortened version of the volume is like below (the archive stripped 
the markup; the tags here are reconstructed from the surviving entity 
references and the element names used in the Digester rules later in this 
archive, so take the exact tag names with a pinch of salt):

<!DOCTYPE books [
  <!ENTITY A1003 SYSTEM "A1003.xml">
  <!ENTITY A1004 SYSTEM "A1004.xml">
  <!ENTITY A1006 SYSTEM "A1006.xml">
]>
<books>
  <journal>
    <title>IEEE Annals of the History of Computing</title>
    <issue>Spring 1995 (Vol. 17, No. 1)</issue>
    <publisher>Published by the IEEE Computer Society</publisher>
    <sec1>
      <title>About this Issue</title>
      &A1003;
      <title>Comments, Queries, and Debate</title>
      &A1004;
      <title>Articles</title>
      &A1006;
    </sec1>
  </journal>
</books>






Re: Lucene and SAX

2005-10-25 Thread Malcolm
I'm not in any way an expert, in fact far from it, but when I try to reference 
each article separately it complains about entities, as the XML articles are 
not well-formed.

Thanks,
MC 






Re: Index XML file

2006-12-14 Thread MALCOLM CLARK
Hi,
  I used the SAX API last year to parse and index the INEX 1.4 collection using 
Lucene (eventually succeeded after many naive attempts). 
  Can you give me a sample of the XML you are trying to parse? 
  Email me and I should be able to send you some code which may help.
   
  regards,
  Malcolm Clark




RE: Index XML file

2006-12-14 Thread MALCOLM CLARK
Hi,
  Sent you a private email with some code attached ;-)
  Malcolm
  

yeohwm <[EMAIL PROTECTED]> wrote:
  
Hi,

Thanks for the help. Please do let me know what jar file that I
needed and where I can find them.

Regards,
Wooi Meng 







Re: Lucene and Sax

2005-10-31 Thread MALCOLM CLARK
Grant,
Thanks for your help with the problem I was experiencing. I broke it all down 
and realised the problem was the location of the index writing (it was not in 
the correct place within the SAX processing), and also some poor error 
handling on my part. 
kind thanks,
Malcolm


 

Re: Lucene and SAX

2005-10-31 Thread MALCOLM CLARK

Grant,

 

Thanks for your tips. I have considered DOM processing but it seemed to take a 
hell of a long time to process all the documents (12,125).



Re: Lucene and Sax

2005-10-31 Thread MALCOLM CLARK



Karl,

 

Thanks for your tips. I have considered DOM processing but it seemed to take a 
hell of a long time to process all the documents (12,125).

 

Malcolm Clark



Re: Extract term and its frequency from the index and file?

2005-11-14 Thread MALCOLM CLARK


Hi,

Could you send me the URL for HighFreqTerms.java in CVS?

 

Thanks,

Malcolm



Re: Extract term and its frequency from the index and file?

2005-11-14 Thread MALCOLM CLARK


cheers 


Memory fault

2005-11-15 Thread MALCOLM CLARK

I'm currently trying to index another collection. I am suffering a problem with 
writer.close. Basically, at the end of indexing it only works if I remove the 
writer.close. It simply can't find the routine, despite being able to find 
writer.optimize.

Has anyone else discovered this problem, and what was the solution?

Secondly, by removing the writer.close, will this cause heap problems (running 
out!)?

I have used the Lucene in Action tips (mergeFactor, maxMergeDocs and 
minMergeDocs) to try and stop the memory problem.
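
For reference, a minimal sketch of those knobs, assuming the Lucene 1.4-era 
public tuning fields on IndexWriter, with close() guarded by finally so the 
index is always flushed:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class TunedIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("C:\\index", new StandardAnalyzer(), true);
        try {
            writer.mergeFactor = 10;    // segments merged at once; higher = faster, more RAM/file handles
            writer.minMergeDocs = 100;  // docs buffered in RAM before a segment is flushed
            // ... writer.addDocument(...) calls go here ...
            writer.optimize();
        } finally {
            writer.close(); // flushes buffered documents and releases the write lock
        }
    }
}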

Thanks in advance,

Malcolm



Commit changes

2005-11-25 Thread Malcolm Clark
Hi,
I am not that experienced with Java and am attempting to implement the commit 
method of the IndexReader for the application I'm developing.
I am trying to extend the IndexReader class but it won't let me! Should I extend 
something else, as I can't see anything in the API to suggest what to use?

My class is this:

public abstract class commitDelete extends IndexReader {

    protected final void commitIndex() {
        try {
            super.commit();
        } catch (IOException e) {
            // ignored for now; should at least be logged
        }
    }
}

Incidentally, if I close the index, does this commit anyway?
Please help, as I'm stumped.
thanks in advance,
Malcolm Clark

Re: Commit changes

2005-11-28 Thread MALCOLM CLARK

Hi Oren,

In the grand scheme of things, and in comparison to some of the participants' 
knowledge on here, I am fairly new and inexperienced with Java and Lucene.

I thought my way might be the most effectual method of implementing the commit. 
I am using many methods of searching/reading the index for a variety of 
reasons. The facility I am trying to implement is the ability to delete a 
document from the index. Do I need to commit, or just reader.close()? I have 
the LIA book, which is superb, and have read the sections regarding delete. If 
it mentions commit, maybe I missed it?
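
A minimal sketch of the delete-then-close sequence in question, assuming the 
Lucene 1.4-era IndexReader.delete(Term) API and a hypothetical indexed "id" 
field:

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

public class DeleteDoc {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("C:\\index");
        // Marks matching documents as deleted; undoable via undeleteAll()
        // until the change is written out.
        int deleted = reader.delete(new Term("id", "A1003"));
        System.out.println("deleted: " + deleted);
        reader.close(); // closing the reader commits the deletions
    }
}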

Thanks,

Malcolm



Re: Commit changes

2005-11-28 Thread MALCOLM CLARK

Hi thanks for your reply,

So when I delete a document, the writer.close() actually commits the 
deletion to the index, which is not reversible?

I have a facility which deletes but leaves the delete 'undoable' until the 
change is committed by closing the reader. I cannot access the doCommit or 
commit methods, as 'they are not visible'.

I have the LIA book at home and not with me.

Thanks.



Re: Commit changes

2005-11-28 Thread MALCOLM CLARK



Okay. Thanks to you both.

 

Malcolm



Re: IndexReader.open crashes JVM

2005-12-15 Thread Malcolm Clark


Hi,
Maybe post some of the code which is giving you problems and people can view 
it and try and see what's wrong.

Cheers,
MC 






TREC,INEX and Lucene

2006-02-22 Thread Malcolm Clark

Hi all,
I am planning on participating in the INEX and hopefully passively on a 
couple of TREC tracks mainly using the Lucene API.

Is anyone else on this list planning on using Lucene during participation?
I am particularly interested in the SPAM, Blog and ADHOC tracks.
Malcolm Clark 






TREC and INEX

2006-03-25 Thread Malcolm Clark
Hi all,
Are any of you planning on using Lucene in any way for the NLP in INEX this 
year or the Enterprise track in TREC?
Thanks,
MC

Lucene probabilistic

2006-04-14 Thread Malcolm Clark
Hi all,
I came across an old mailing-list item from 2003 exploring the possibilities of 
a more probabilistic approach to using Lucene. Do the online experts know if 
anyone has achieved this since?
Thanks for any advice,
Malc

Re: search pdf

2006-04-16 Thread Malcolm Clark


Hi,
You have to parse/index the PDF files, and then you can search the index 
with Lucene.
Have a look at Lucene in Action and the source code which comes with it. 
There is a good demo which parses common formats such as PDF, Word, XML 
etc.
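
As a rough illustration of the parse-then-index step, a sketch assuming the 
PDFBox helper class of that era (org.pdfbox.searchengine.lucene.LucenePDFDocument) 
is on the classpath:

import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.pdfbox.searchengine.lucene.LucenePDFDocument;

public class PdfIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("C:\\pdfindex", new StandardAnalyzer(), true);
        try {
            // Extracts the PDF's text into a ready-made Lucene Document.
            Document doc = LucenePDFDocument.getDocument(new File("paper.pdf"));
            writer.addDocument(doc);
            writer.optimize();
        } finally {
            writer.close();
        }
    }
}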

Cheers,
MC 






Re: search pdf

2006-04-16 Thread Malcolm Clark

URL for all the source code:

http://www.lucenebook.com/LuceneInAction.zip




Reuters

2006-04-21 Thread Malcolm Clark
Hi all,
I didn't know whether to add this to the thread asking about TREC indexing or 
start a new one.
Anyway, has anyone attempted to index/search the Reuters collection, which 
consists of SGML?
Mine seems to run through the process okay, but alas I'm left with nothing in 
the index when I check with Luke or my own search engine.
Anyone got any hints (apart from "don't do it")?
cheers,
MC

Re: Reuters

2006-04-21 Thread Malcolm Clark


Okay, converting to XML sounds like a great option.
Thanks,
Malcolm




Recommendations please

2006-05-13 Thread Malcolm Clark
Hi everyone,
I am about to index the INEX collection (22 files with 3 files in each-ish) 
using Java Lucene. I am undecided on the approach to indexing and have left 
my LIA book at uni :-/
Would you recommend:

  1. indexing all files into one big index? (would this be inefficient to 
search?)
  2. 23 separate indexes and then merging them?
  3. 23 separate indexes and then searching an array of indexes? (see the 
sketch below)
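
For options 2 and 3, a minimal sketch of both the merge and the 
array-of-indexes search, assuming the Lucene 1.4-era addIndexes and 
MultiSearcher APIs (paths hypothetical):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MultiSearcher;
import org.apache.lucene.search.Searchable;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IndexCombiner {
    // Option 2: merge the separate indexes into one big index.
    static void merge(String[] parts, String target) throws Exception {
        IndexWriter writer = new IndexWriter(target, new StandardAnalyzer(), true);
        Directory[] dirs = new Directory[parts.length];
        for (int i = 0; i < parts.length; i++) {
            dirs[i] = FSDirectory.getDirectory(parts[i], false);
        }
        writer.addIndexes(dirs); // copies and merges the segments
        writer.close();
    }

    // Option 3: leave the indexes in place and search them as one.
    static MultiSearcher openAll(String[] parts) throws Exception {
        Searchable[] searchers = new Searchable[parts.length];
        for (int i = 0; i < parts.length; i++) {
            searchers[i] = new IndexSearcher(parts[i]);
        }
        return new MultiSearcher(searchers);
    }
}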

Also has anyone else indexed the INEX collection using Java Lucene and what did 
you do?
Thanks for any helpful advice.
MC

Scoring

2006-05-23 Thread Malcolm Clark
Hi experts,
I'm currently indexing the new INEX collection using Lucene and pondering this 
question.
When searching, how do I retrieve the score based on a section or paragraph 
etc., and not the document score, when the documents are indexed in multiple 
fields (XML)?
Can anyone point me in the correct direction?
Thanks,
Malcolm

Re: Lucene in Action

2006-06-06 Thread Malcolm Clark

Try here..


http://www.abebooks.co.uk

Maybe they have one cheaper.
Malcolm

- Original Message - 
From: "digby" <[EMAIL PROTECTED]>

To: 
Sent: Tuesday, June 06, 2006 11:55 AM
Subject: Re: Lucene in Action


Thanks everyone, although now I'm not sure what to do! Blackwells is quicker 
but more expensive, but is a new edition due...???


Think I'll blow the moths off my wallet and get on with it...


[EMAIL PROTECTED] wrote:





It's an invaluable book if you're new to Lucene. There have been some
changes to the Lucene API since the book was published but you shouldn't
let this put you off - they're relatively minor. I think Lucene In Action
v2.0 might be a little while in coming (checkout Otis' blog
http://www.jroller.com/page/otis?catname=%2FLucene).
Regards
Paul I.



From: digby <[EMAIL PROTECTED]> (sent by: news)
To: java-user@lucene.apache.org
Date: 06/06/2006 10:59
Subject: Lucene in Action
Please respond to: [EMAIL PROTECTED]
Does everyone recommend getting this book? I'm just starting out with

Lucene and like to have a book beside me as well as the web / this
mailing list, but the book looks quite old now, has a 1-2 month delivery
wait time here in the UK and is quite expensive. Is it worth waiting for
a new edition perhaps?

Thanks,

Digby












Re: HTML text extraction

2006-06-29 Thread MALCOLM CLARK
Hi,
Would you please send me your parser too?
Thanks!
Malcolm


- Original Message 
From: Liao Xuefeng <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Friday, June 23, 2006 12:54:29 AM
Subject: RE: HTML text extraction


hi, all,
  I wrote my own HTML parser because it just meets my requirements and does not
depend on third-party libs, and I'd like to share it (in attachment).

  This class provides some static methods to do HTML <-> text conversion:

  HtmlUtil.html2text(String html);
  HtmlUtil.text2html(String text);

and 
  HtmlUtil.removeScriptTags(String html);
can remove script and ActiveX tags in HTML; this is used to check users' blog
posts before writing into the database.

Best regards,
  Xuefeng

http://www.crackj2ee.com

-Original Message-
From: Michael Wechner [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 22, 2006 11:30 PM
To: java-user@lucene.apache.org
Subject: Re: HTML text extraction

John Wang wrote:
> Hi Xuefeng:
>
> Can you please send me your htmlparser too?

Xuefeng, would it be possible to open source your parser?

Thanks

Michi
>
> thanks
>
> -John
>
> On 6/21/06, Daniel Noll <[EMAIL PROTECTED]> wrote:
>>
>> Simon Courtenage wrote:
>> > I also use htmlparser, which is rather good.  I've had to customize
>> it,
>> > though, to parse strings containing html source rather than accept 
>> > urls of resources to fetch etc.
>> Also it
>> > crashes on meta tags that don't have name attributes (something I 
>> > discovered only a couple of days ago).
>>
>> Actually, it already accepts strings without modifying the library:
>>
>> String htmlSource = "...";
>> Parser parser = new Parser(new Lexer(htmlSource));
>>
>> I will have to watch out for those meta tags though.  Time to go test 
>> it.
>>
>> Daniel
>>
>>
>> --
>> Daniel Noll
>>
>> Nuix Pty Ltd
>> Suite 79, 89 Jones St, Ultimo NSW 2007, Australia
>> Ph: +61 2 9280 0699   Fax: +61 2 9212 6902
>> Web: http://www.nuix.com.au/
>>
>>
>>
>>
>


--
Michael Wechner
Wyona  -   Open Source Content Management   -Apache Lenya
http://www.wyona.com  http://lenya.apache.org
[EMAIL PROTECTED][EMAIL PROTECTED]
+41 44 272 91 61



Re: Indexing large sets of documents?

2006-07-27 Thread MALCOLM CLARK



Is this the W3 Ent collection you are indexing?
 
MC

Output of index

2006-07-27 Thread MALCOLM CLARK
Hi,
I'm going to attempt to output several thousand documents from a 3+ million 
document collection into a CSV file. 
What is the most efficient method of retrieving all the text from the fields of 
each document, one by one? Please help!
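
A minimal sketch of a one-document-at-a-time dump, assuming the text was 
stored at index time (IndexReader.document(i) returns stored fields only) and 
using hypothetical field names:

import java.io.FileWriter;
import java.io.PrintWriter;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

public class DumpToCsv {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("C:\\index");
        PrintWriter out = new PrintWriter(new FileWriter("dump.csv"));
        for (int i = 0; i < reader.maxDoc(); i++) {
            if (reader.isDeleted(i)) continue;  // skip deleted slots
            Document doc = reader.document(i);  // loads the stored fields
            // "title" and "body" are hypothetical stored field names.
            out.println(csv(doc.get("title")) + "," + csv(doc.get("body")));
        }
        out.close();
        reader.close();
    }

    // Naive CSV escaping: quote the value and double any embedded quotes.
    static String csv(String s) {
        return s == null ? "\"\"" : "\"" + s.replaceAll("\"", "\"\"") + "\"";
    }
}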
 
Thanks,
Malcolm

Lucene/Digester

2005-10-16 Thread Malcolm Clark
Hi all,
I'm using Lucene/Digester etc. for my MSc, and I'm quite new to these APIs. I'm 
trying to obtain advice, but it's hard to say whether the problem is Lucene or 
Digester.
Firstly:
I am trying to index the INEX collection, but when I try to index repetitive 
elements, only the last one is indexed. For example (the archive stripped the 
markup; roughly):

<Chapter>...</Chapter>
<Chapter>...</Chapter>
<Chapter>...</Chapter>  <!-- this is the only one indexed -->

Only the last Chapter element will be indexed; it will skip the first two.
Secondly:
When using Digester/Lucene with XML, does each file have to contain e.g.


Re: Lucene in Action : example code -> document-parsing framework ...

2005-10-18 Thread MALCOLM CLARK

Hi,

Could somebody please help me regarding Lucene and Digester? I have discovered 
this problem while indexing the INEX collection of XML for my MSc project.

During the parsing of the XML files (all named Volume.xml), the parser will 
only index the last XML element in any repetitive list. For example (the 
archive stripped the markup; roughly):

<title>...</title>
<title>...</title>
<title>...</title>  <!-- only this title element is indexed -->

How does one put multiple fields in one Digester field for Lucene indexing?

 

Thanks in advance.

MC



Re: Lucene/Digester

2005-10-19 Thread MALCOLM CLARK

Hi

I used Luke to check the content of the index and they are not there.

cheers,

MC



Re: Lucene/Digester

2005-10-19 Thread MALCOLM CLARK

Okay I'll do that. Thanks very much for the advice as it's much appreciated.

Malcolm Clark



Lucene and Digester

2005-10-20 Thread MALCOLM CLARK

Hi, I have tried as suggested and isolated Digester from Lucene. Digester 
doesn't trigger an element-matching pattern for each element, only the last one 
of each repeating tag. My XML (trimmed a bit) looks like this; the archive 
stripped the markup, so the tags below are reconstructed from the Digester 
rules in the PS and may not be exact:

<books>
  <journal>
    <title>IEEE Annals of the History of Computing</title>
    <issue>Spring 1995 (Vol. 17, No. 1)</issue>
    <publisher>Published by the IEEE Computer Society</publisher>
    <sec1>
      <title>About this Issue</title>
      <title>Comments, Queries, and Debate</title>
      <title>Articles</title>
    </sec1>
  </journal>
</books>

Only the last title ("Articles") appears when parsed by Digester, as it 
ignores the first two.

Does anyone know how to stuff many bits of data into one field?

Any help will be really appreciated,

MC

 

PS: my code looks like this. I have tried many other field types, like 
Keyword, for the booksDocument.add(Field.Text("title", books.getTitle())):

Digester digester = new Digester();
digester.setValidating(true);
digester.addObjectCreate("books", Digester.class);
digester.addObjectCreate("books/journal", Books.class);
digester.addCallMethod("books/journal/title", "setTitle", 0);
digester.addCallMethod("books/journal/issue", "setIssue", 0);
digester.addCallMethod("books/journal/publisher", "setPublisher", 0);
digester.addCallMethod("books/journal/sec1/title", "setTitle2", 0);
digester.addSetNext("books/journal", "addBooks");

public void addBooks(Books books) throws IOException
{
    System.out.println("Adding " + books.getTitle());
    System.out.println("Adding " + books.getIssue());
    System.out.println("Adding " + books.getPublisher());
    System.out.println("Adding " + books.getTitle2());
    Document booksDocument = new Document();
    booksDocument.add(Field.Text("title", books.getTitle()));
    booksDocument.add(Field.Text("issue", books.getIssue()));
    booksDocument.add(Field.Text("publisher", books.getPublisher()));
    booksDocument.add(Field.Text("title", books.getTitle2()));
}
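
For later readers: Digester fires the rule once per matching element, so a 
setter like setTitle2 simply overwrites, keeping only the last value. A 
minimal sketch that collects every repeat and emits one Lucene field per 
value instead (names hypothetical; Lucene happily holds several fields with 
the same name in one Document):

import java.io.File;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.commons.digester.Digester;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class Books {
    private List titles = new ArrayList();

    // Called once per matching <title>; appends instead of overwriting.
    public void addTitle(String title) {
        titles.add(title);
    }

    public Document toDocument() {
        Document doc = new Document();
        for (Iterator it = titles.iterator(); it.hasNext();) {
            doc.add(Field.Text("title", (String) it.next())); // repeated field
        }
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Digester digester = new Digester();
        digester.addObjectCreate("books/journal", Books.class);
        digester.addCallMethod("books/journal/sec1/title", "addTitle", 0);
        Books books = (Books) digester.parse(new File("volume.xml"));
        System.out.println(books.toDocument());
    }
}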



Re: indexwriter and index searcher

2005-10-24 Thread MALCOLM CLARK

Hi all,

I am relatively new to, and scared by, Lucene, so please don't flame me. I have 
abandoned Digester and am now just using other SAX stuff.

I have used the sandbox stuff to parse an XML file with SAX, which then bungs 
it into a document in a Lucene index. The bit I'm stuck on is how an 
elementBuffer is split up into several items. I have an elementBuffer with 
three 'article' documents, but it only shows as one when using Luke to view 
the index.

Please advise.

Thanks very much.

MC



Lucene and SAX

2005-10-25 Thread Malcolm Clark

Hi again,
I am desperately asking for aid!!

I have used the sandbox demo to parse the INEX collection. The problem being, 
it points to a volume file which references 50 other XML articles, and Lucene 
only treats this as one document. Is there any method I'm overlooking that 
halts after each reference?
Could somebody please help, and I won't post again until I submit something 
useful.


The code is:
public class XMLDocumentHandlerSAX
    extends HandlerBase
{
    /** A buffer for each XML element. */
    private StringBuffer elementBuffer = new StringBuffer();

    private Document mDocument;

    // constructor: parse the given file with a SAX parser
    public XMLDocumentHandlerSAX(File xmlFile)
        throws ParserConfigurationException, SAXException, IOException
    {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        SAXParser parser = spf.newSAXParser();
        parser.parse(xmlFile, this);
    }

    // called at document start
    public void startDocument()
    {
        mDocument = new Document();
        elementBuffer.setLength(0);
    }

    // called at element start
    public void startElement(String localName, AttributeList atts)
        throws SAXException
    {
        if (localName.equals("article")) {
            elementBuffer.setLength(0);
        }
    }

    // called when character data is found
    public void characters(char[] text, int start, int length)
    {
        elementBuffer.append(text, start, length);
    }

    // called at element end
    public void endElement(String localName)
        throws SAXException
    {
        if (localName.equals("article")) {
            System.out.println("Article: " + elementBuffer.length());
            elementBuffer.setLength(0);
        }

        mDocument.add(Field.Text(localName, elementBuffer.toString()));
        System.out.println("EB: " + elementBuffer);
        elementBuffer.setLength(0);
    }

    public Document getDocument()
    {
        return mDocument;
    }

    public static void main(String[] args)
        throws Exception
    {
        try
        {
            Date start = new Date();
            String indexDir = "C:\\LuceneDemo\\index";
            IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);

            indexDocs(writer, new File("C:\\1995\\volume.xml"));

            writer.optimize();
            writer.close();

            Date end = new Date();
        }
        catch (Exception e)
        {
            System.out.println(" caught a " + e.getClass() + "\n with message: " + e.getMessage());
            throw e;
        }
    }

    public static void indexDocs(IndexWriter writer, File file)
        throws Exception
    {
        if (file.isDirectory())
        {
            String[] files = file.list();
            for (int i = 0; i < files.length; i++)
                indexDocs(writer, new File(file, files[i]));
        }
        else
        {
            System.out.println("adding " + file);

            XMLDocumentHandlerSAX hdlr = new XMLDocumentHandlerSAX(file);
            StandardAnalyzer anal = new StandardAnalyzer();
            writer.addDocument(hdlr.getDocument(), anal);
            System.out.println("Documents added to Index: " + writer.docCount());
        }
    }
}
Thanks very much again.
MC 
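
For later readers: the handler above builds one Document per file, which is 
why the 50 referenced articles collapse into a single index entry. A rough 
sketch of the usual fix, creating a fresh Document per article and adding it 
to the writer inside endElement (an illustration only, not Malcolm's eventual 
code; it assumes the handler is handed the IndexWriter and keeps the same 
elementBuffer field as above):

    private IndexWriter mWriter;  // passed in via the constructor
    private Document mDocument;

    public void startElement(String localName, AttributeList atts)
        throws SAXException
    {
        if (localName.equals("article")) {
            mDocument = new Document();   // a new Lucene document per article
        }
        elementBuffer.setLength(0);
    }

    public void endElement(String localName)
        throws SAXException
    {
        if (localName.equals("article")) {
            try {
                mWriter.addDocument(mDocument);  // write the finished article
            } catch (IOException e) {
                throw new SAXException(e);
            }
        } else if (mDocument != null) {
            mDocument.add(Field.Text(localName, elementBuffer.toString()));
        }
        elementBuffer.setLength(0);
    }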


