Hi,
Sent you a private email with some code attached ;-)
Malcolm
yeohwm <[EMAIL PROTECTED]> wrote:
Hi,
Thanks for the help. Please do let me know which jar files I need
and where I can find them.
Regards,
Wooi Meng
Malcolm Clark
Hi,
I'm going to attempt to output several thousand documents from a 3+ million
document collection into a CSV file.
What is the most efficient method of retrieving all the text from the fields of
each document, one by one? Please help!
Thanks,
Malcolm
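A rough sketch of how I'd approach the dump, assuming a Lucene 1.x-era IndexReader and hypothetical stored field names "id" and "title"; the Lucene loop is left in comments, and the runnable part is the CSV quoting, which is the bit that usually bites:

```java
import java.util.List;

public class CsvExport {
    // Quote a value if it contains a comma, quote, or newline; double embedded quotes.
    static String escape(String value) {
        if (value == null) return "";
        if (value.contains(",") || value.contains("\"") || value.contains("\n")) {
            return "\"" + value.replace("\"", "\"\"") + "\"";
        }
        return value;
    }

    // Join field values into one CSV row, escaping each as needed.
    static String row(List<String> fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) sb.append(',');
            sb.append(escape(fields.get(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // With a 1.x-era Lucene index the loop would look roughly like:
        //   IndexReader reader = IndexReader.open("index");
        //   for (int i = 0; i < reader.maxDoc(); i++) {
        //       if (reader.isDeleted(i)) continue;
        //       Document doc = reader.document(i);  // stored fields only
        //       out.println(row(List.of(doc.get("id"), doc.get("title"))));
        //   }
        System.out.println(row(List.of("A1003", "About this \"Issue\", 1995")));
    }
}
```

One caveat: only stored fields come back from reader.document(i); text that was tokenised but not stored cannot be recovered this way.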
Is this the W3 Ent collection you are indexing?
MC
Hi,
Would you please send me your parser too?
Thanks!
Malcolm
- Original Message
From: Liao Xuefeng <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Friday, June 23, 2006 12:54:29 AM
Subject: RE: HTML text extraction
hi, all,
I wrote my own html parser because it just
Try here..
http://www.abebooks.co.uk
Maybe they have one cheaper.
Malcolm
- Original Message -
From: "digby" <[EMAIL PROTECTED]>
To:
Sent: Tuesday, June 06, 2006 11:55 AM
Subject: Re: Lucene in Action
Thanks everyone, although now I'm not sure what to do! Am I heading
in the correct direction?
Thanks,
Malcolm
Hi everyone,
I am about to index the INEX collection (22 files with 3 files in each-ish)
using Java Lucene. I am undecided on the approach to indexing and have left
my LIA book at uni :-/
Would you recommend:
1. indexing all files into one big index? (would this be inefficient to
search
Okay, converting to XML sounds like a great option.
Thanks,
Malcolm
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
Hi all,
I didn't know whether to add this to the thread asking about TREC indexing or
start a new one.
Anyway, has anyone attempted to index/search the Reuters collection which
consists of SGML?
Mine seems to run through the process okay, but alas I'm left with nothing in
the index when I check w
URL for all the source code:
http://www.lucenebook.com/LuceneInAction.zip
Hi,
You have to parse/index the PDF files and then you can search the index
with Lucene.
Have a look at Lucene in Action and the source code which comes with it.
There is a good demo which parses common formats such as PDF, Word, XML,
etc.
Cheers,
MC
Hi all,
I came across an old mailing list item from 2003 exploring the possibilities of a
more probabilistic approach to using Lucene. Do the online experts know if
anyone has achieved this since?
Thanks for any advice,
Malc
Hi all,
Are any of you planning on using Lucene in any way for the NLP in INEX this
year or the Enterprise track in TREC?
Thanks,
MC
Hi all,
I am planning on participating in the INEX and hopefully passively on a
couple of TREC tracks mainly using the Lucene API.
Is anyone else on this list planning on using Lucene during participation?
I am particularly interested in the SPAM, Blog and ADHOC tracks.
Malcolm Clark
Hope this helps,
Malcolm Clark
Hi,
Maybe post some of the code which is giving you problems, and people can view
it and try to see what's wrong.
Cheers,
MC
That's what I have: loads of different tags and (abstract) tags etc.
in each XML document, so a Lucene document for each is okay.
malcolm
for each field, but it's too inefficient as I only
want it once. I just want the date for the index. Where does it lie?
cheers,
Malcolm
TermQuery myTerm = new TermQuery(new Term("p", "xx"));
TermQuery theyTerm = new TermQuery(new Term("p", "xxx"));
I'm sure the folks on here will be able to come up with a more efficient
method. Try obtaining Lucene in Action or
Are you going to write another edition with lots of Servlet code? If that's
the case, put me down for an advance copy. Lucene and servlets is a direction
I may be going in the future.
Thanks,
Malcolm Clark
Okay. Thanks to you both.
Malcolm
Hi, thanks for your reply,
So when I delete a document, does writer.close() actually commit the
deletion to the index, making it irreversible?
I have a facility which deletes but leaves the delete 'undoable' until the
change is committed by closing the reader. I cannot access the doCommit o
a variety of
reasons. The facility I am trying to implement is the ability to delete a
document from the index. Do I need to commit, or just reader.close()? I have the
LIA book, which is superb, and have read the sections regarding delete. If it
mentions commit, maybe I missed it?
Thanks,
Malcolm
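For what it's worth, a sketch of my understanding of the 1.x/2.0-era API (method names assumed from that version, not tested; "index" and docNum are placeholders):

```java
import org.apache.lucene.index.IndexReader;

public class DeleteSketch {
    public static void delete(String indexDir, int docNum) throws Exception {
        IndexReader reader = IndexReader.open(indexDir);
        reader.deleteDocument(docNum); // marks the doc deleted; still undoable...
        // reader.undeleteAll();       // ...for as long as this reader stays open
        reader.close();                // close() commits the deletions to the index
    }
}
```

So if I've understood it right: there is no separate commit call to make; closing the reader is what makes the deletion permanent, and undeleteAll() is only available before that point.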
using?
My class is this:
public abstract class commitDelete extends IndexReader {
    protected final void commitIndex() {
        try {
            super.commit();
        } catch (IOException e) {
            // swallowing this hides failures; at least log it
        }
    }
}
Incidentally if I close the index does this commit anyway?
Please help as I'm stumped.
thanks in advance,
Malcolm Clark
roblem and what was the solution?
Secondly, by removing the writer.close(), will this cause heap problems (running
out!)?
I have used the Lucene in Action tips: mergeFactor, maxMergeDocs and
writer.minMergeDocs to try to stop the memory problem.
Thanks in advance,
Malcolm
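A sketch of the knobs mentioned above, as I believe the 1.4/2.0-era IndexWriter exposed them (setter names assumed from that API; minMergeDocs was a public field in 1.4 and later became setMaxBufferedDocs):

```java
// Sketch only -- names from the Lucene 1.4/2.0-era API, values illustrative.
IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
writer.setMergeFactor(10);       // segments merged per level; lower = smaller RAM/IO bursts
writer.setMaxMergeDocs(100000);  // cap segment size so merges stay bounded
writer.setMaxBufferedDocs(100);  // docs buffered in RAM before a flush to disk
// ... addDocument() loop ...
writer.close();                  // flush remaining buffered docs, release the write lock
```

The usual heap culprit is buffering too many documents before a flush, so lowering the buffered-docs setting is the first thing I'd try.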
cheers
Hi,
Could you send me the URL for HighFreqTerms.java in CVS?
Thanks,
Malcolm
to my
dissertation regarding Lucene.
Thanks,
Malcolm
tHandlerSAX and a bit extra. I
originally started using Digester but found that I preferred the Sandbox
implementation.
Thanks,
Malcolm Clark
Hi,
I've been reading my new project bible 'Lucene in Action' about Analysis in
Chapter 4 and wondered what others are doing for indexing XML (if anyone else
is, that is!).
Are you folks just writing your own or utilising the current Lucene analysis
libraries?
thanks
Karl,
Thanks for your tips. I have considered DOM processing but it seemed to take a
hell of a long time to process all the documents (12,125).
Malcolm Clark
Grant,
Thanks for your tips. I have considered DOM processing but it seemed to take a
hell of a long time to process all the documents (12,125).
Grant,
Thanks for your help with the problem I was experiencing. I split it all down
and realised the problem was the location of the IndexWriting (it was not in the
correct place within the SAX processing) and also because of some poor error
handling on my part.
kind thanks,
Malcolm
I'm not in any way an expert, in fact far from it, but when I try to reference
each article separately it complains of entities, as the XML articles are
not well-formed.
Thanks,
MC
Hi Grant,
A highly shortened version of the volume is shown below.
]>
IEEE Annals of the History of Computing
Spring 1995 (Vol. 17, No. 1)
Published by the IEEE Computer Society
About this Issue
&A1003;
Comments, Queries, and Debate
&A1004;
Articles
&A1006;
It's XML like this. It has 120-ish volumes with references to 12,107 articles
which are like this below:
A1003
10.1041/A1003s-1995
IEEE Annals of the History of Computing
1058-6180/95/$4.00 © 1995
IEEE
Vol. 17, No. 1
Spring 1995
pp. 3-3
About this Issue, pp. 3-3
J.A.N. Lee, Editor-in-Chief
The firs
Hi again,
I am desperately asking for aid!
I have used the sandbox demo to parse the INEX collection. The problem being
it points to a volume file which references 50 other XML articles. Lucene
only treats this as one document. Is there any method which I'm
overlooking that halts after each r
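In case it helps anyone else with the same problem, this is the shape of the fix I'd try: a SAX handler that closes off one record per article end tag, sketched here with stdlib SAX and a plain Map standing in for the Lucene Document (the element names "article" and "ti" are made up for the demo; the Lucene calls are left in comments):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class PerArticleHandler extends DefaultHandler {
    private final List<Map<String, String>> articles = new ArrayList<>();
    private Map<String, String> current; // non-null while inside an <article>
    private StringBuilder text;          // non-null while inside a field element

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("article".equals(qName)) {
            current = new HashMap<>();   // start a fresh record per article
        } else if (current != null) {
            text = new StringBuilder();  // start collecting this field's text
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if ("article".equals(qName)) {
            // This is where one Lucene Document per article would be built:
            //   Document doc = new Document();
            //   for each (k, v) in current: doc.add(new Field(k, v, ...));
            //   writer.addDocument(doc);
            articles.add(current);
            current = null;
        } else if (current != null && text != null) {
            current.put(qName, text.toString().trim());
            text = null;
        }
    }

    public List<Map<String, String>> getArticles() { return articles; }

    public static List<Map<String, String>> parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        PerArticleHandler handler = new PerArticleHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.getArticles();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<volume><article><ti>One</ti></article>"
                   + "<article><ti>Two</ti></article></volume>";
        System.out.println(parse(xml).size()); // two articles -> two records
    }
}
```

The key point is that the boundary lives in endElement: each closing article tag finishes one record, so a volume file referencing 50 articles yields 50 documents instead of one.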
Hi all,
I am relatively new and scared by Lucene so please don't flame me. I have
abandoned Digester and am now just using other SAX stuff.
I have used the sandbox stuff to parse an XML file with SAX, which then bungs it
into a document in a Lucene index. The bit I'm stuck on is how is a
element
Hi, I have tried as suggested and isolated Digester from Lucene. Digester
doesn't trigger an Element Matching Pattern for each element, only the last one
of each repeating tag. My XML (trimmed a bit) looks like this:
IEEE Annals of the History of Computing
Spring 1995 (Vol. 17, No. 1)
Pub
Okay I'll do that. Thanks very much for the advice as it's much appreciated.
Malcolm Clark
Hi
I used Luke to check the content of the index and they are not there.
cheers,
MC
Hi,
Could somebody please help me regarding Lucene and Digester. I have discovered
this problem while indexing the INEX collection of XML for my MSc project.
During the parsing of the XML files, all named Volume.xml, the parser will only
index the last XML element in any repetitive list. For ex
Hi all,
I'm using Lucene/Digester etc. for my MSc and I'm quite new to these APIs. I'm
trying to obtain advice, but it's hard to say whether the problem is Lucene or
Digester.
Firstly:
I am trying to index the INEX collection but when I try to index repetitive
elements only the last one is indexed. F
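If it's of any use, the failure mode can be shown without either library: a single-valued setter (which is roughly how a Digester bean property behaves) keeps only the last occurrence, whereas the index needs every occurrence kept, and Lucene's Document does allow doc.add(new Field("p", ...)) to be called once per occurrence. A stdlib-only sketch of the contrast:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RepeatedElements {
    // Single-valued property: each repeated element overwrites the last.
    static String lastOnly(String[] values) {
        Map<String, String> m = new HashMap<>();
        for (String v : values) m.put("p", v); // each put replaces the previous
        return m.get("p");
    }

    // What the index needs: every occurrence retained. Lucene's Document
    // supports this directly -- doc.add(new Field("p", v, ...)) may be
    // called once per occurrence and all the values are indexed.
    static List<String> allValues(String[] values) {
        List<String> all = new ArrayList<>();
        for (String v : values) all.add(v);
        return all;
    }

    public static void main(String[] args) {
        String[] ps = {"first", "second", "third"}; // three repeated <p> elements, say
        System.out.println(lastOnly(ps));   // only "third" survives
        System.out.println(allValues(ps));  // [first, second, third]
    }
}
```

So the bug is upstream of Lucene: the fix is to accumulate the repeated values (or add a field per match in the Digester rule) rather than letting the last one win.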