oding, header, and other
specificity.
Nutch uses specific Word tools (http://lucene.apache.org/nutch/apidocs/org/apache/nutch/parse/msword/package-summary.html), but, IMHO, it's
not the most difficult part.
M.
On 8 June 07 at 19:23, jim shirreffs wrote:
Hi,
I am trying to index mswor
rt.
M.
On 8 June 07 at 19:23, jim shirreffs wrote:
Hi,
I am trying to index msword documents. I've got things working but I do
not think I am doing things properly.
To index msword docs I use an extractor to extract the text. Then I write
the text to a .txt file and index that using
many thanks I will try that, thanks again!
jim s
- Original Message -
From: "Donna L Gresh" <[EMAIL PROTECTED]>
To:
Sent: Friday, June 08, 2007 12:52 PM
Subject: Re: Indexing MSword Documents
I do this exact thing. "text" (the second input to the Field constructor)
is MSWord text
Hi,
I am trying to index msword documents. I've got things working but I do not
think I am doing things properly.
To index msword docs I use an extractor to extract the text. Then I write
the text to a .txt file and index that using an HTMLDocument object. Seems
to me that since I have the te
I am trying to index msword documents. I’ve got things working but I do not
think I am doing things properly.
To index msword docs I use an extractor to extract the text. Then I write
the text to a .txt file and index that using an HTMLDocument object. Seems
to me that since I have the text
4 PM
Subject: Re: Indexing PDF document
You need to include both the Bouncy Castle jars and the FontBox jar.
Both are included with the PDFBox distribution.
Ben
Quoting jim shirreffs <[EMAIL PROTECTED]>:
Thanks I rebuilt PDFbox and got past that problem but now I am getting
Exc
Thanks I rebuilt PDFbox and got past that problem but now I am getting
Exception in thread "main" java.lang.NoClassDefFoundError:
org/bouncycastle/jce/provider/BouncyCastleProvider
seems my test pdf file is provider locked so I tried a Lucene pdf file and
got
java.lang.NoClassDefFoundError
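A NoClassDefFoundError like the one quoted above usually means a jar is absent from the runtime classpath, which matches Ben's advice about the Bouncy Castle and FontBox jars. A minimal sketch of a pre-flight check follows, using only the JDK. The class name ClasspathCheck and the method missing are made up for illustration and are not part of PDFBox or Lucene.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports which class names cannot be loaded
// from the classpath the program is currently running with.
class ClasspathCheck {

    // Returns the subset of classNames that cannot be resolved.
    static List<String> missing(String... classNames) {
        List<String> notFound = new ArrayList<>();
        for (String name : classNames) {
            try {
                // initialize=false: only locate the class, do not run its static initializers
                Class.forName(name, false, ClasspathCheck.class.getClassLoader());
            } catch (ClassNotFoundException | LinkageError e) {
                notFound.add(name);
            }
        }
        return notFound;
    }

    public static void main(String[] args) {
        // First name is taken from the stack trace above, written with dots
        // instead of slashes; the second is a JDK class that always resolves.
        List<String> gaps = missing(
            "org.bouncycastle.jce.provider.BouncyCastleProvider",
            "java.lang.String");
        System.out.println("Missing from classpath: " + gaps);
    }
}
```

Running it with the same -classpath as the indexing program shows whether the Bouncy Castle jar is actually visible before any PDF is parsed.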
Well, I got nowhere trying to index OpenOffice documents, so I thought I'd try
indexing PDF documents. Seemed like PDFBox was a good bet: it claimed to offer
Lucene support and was on the Lucene recommended list. But after numerous
attempts failed, I decided to try the IndexFiles.java that comes with PDF
code up a Reader that just spits out "Here I am" a few
hundred times and see what happens. LOL.
thank you for the reply and advice.
jim s
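The test Reader jim describes could look roughly like the sketch below. It uses only java.io. The class name RepeatingReader and the helper drain are invented for illustration, and how the Reader would then be handed to the indexer is left out because the message does not say.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical test Reader: emits a fixed phrase a given number of times, then signals end of stream.
class RepeatingReader extends Reader {
    private final String phrase;
    private final long total; // total number of characters to emit
    private long pos = 0;     // characters emitted so far

    RepeatingReader(String phrase, int times) {
        if (phrase.isEmpty()) {
            throw new IllegalArgumentException("phrase must not be empty");
        }
        this.phrase = phrase;
        this.total = (long) phrase.length() * times;
    }

    @Override
    public int read(char[] buf, int off, int len) {
        if (len == 0) return 0;
        if (pos >= total) return -1; // end of stream
        int n = (int) Math.min(len, total - pos);
        for (int i = 0; i < n; i++) {
            // Cycle through the phrase based on the absolute position
            buf[off + i] = phrase.charAt((int) ((pos + i) % phrase.length()));
        }
        pos += n;
        return n;
    }

    @Override
    public void close() {
        // Nothing to release
    }

    // Reads any Reader to the end and returns its content as a String.
    static String drain(Reader in) {
        StringBuilder out = new StringBuilder();
        char[] buf = new char[64];
        try {
            int n;
            while ((n = in.read(buf, 0, buf.length)) != -1) {
                out.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = drain(new RepeatingReader("Here I am ", 300));
        System.out.println(text.length()); // prints 3000
    }
}
```

A Reader like this takes the document parser out of the picture, so any remaining failure points at the indexing side rather than at text extraction.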
- Original Message -
From: "Andrzej Bialecki" <[EMAIL PROTECTED]>
To:
Sent: Friday, May 25, 2007 1:10 PM
Subject: R
I've been working on this for a while, I am trying to get the demo code that
comes with Lucene to index OpenOffice documents. I've looked at LIUS code
and at Nutch code. But can't find an easy way. So I am digging into the
code.
I wrote a KcmiDocument class that returns a Document. In it I
magic" to index it that I know of.
Erick
On 5/23/07, jim shirreffs <[EMAIL PROTECTED]> wrote:
Is it possible to index CAD formats such as AutoCad or CGM? I know some
commercial products (excalaber) claim to be able to do that. If so, what
about
TIFF?
thanks
jim s
---
Is it possible to index CAD formats such as AutoCad or CGM? I know some
commercial products (excalaber) claim to be able to do that. If so, what about
TIFF?
thanks
jim s
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additiona
Anyone know how to add OpenOffice documents to a Lucene index? Is there a
parser for OpenOffice?
thanks in advance
jim s.
Hi, I'm a relative Lucene newbie and would appreciate some expert advice.
I would like to make files distributed on various local hosts in the
intranet full-text searchable. My startup plan is to index these files locally
and then merge all the little indexes into a master index on a search
mixing databases and text searching, and I don't want to go
there
Of course, this would all work if we could just create the DWIM
algorithm...
Do What I Mean..
Erick
On 4/21/07, jim shirreffs <[EMAIL PROTECTED]> wrote:
"Lucene has no concept of "document identity&qu
"Lucene has no concept of "document identity" in that you can index
the same document 15 times in a row and Lucene will have 15 entries. "
Is this true? Whenever I run the demo indexing logic, documents already
indexed are skipped. What am I missing?
jim s
start java org.apache.lucene.demo.In
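To make the quoted statement concrete, here is a toy model in plain Java. It is NOT Lucene's API: ToyIndex, add, and addIfAbsent are invented names. It only illustrates the difference between an append-only index with no notion of identity and an application that keeps its own record of what it has already added, which is one possible reason a program could skip already-indexed documents.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy stand-in for an index; not Lucene. Models "append-only, no document identity".
class ToyIndex {
    private final List<String> entries = new ArrayList<>(); // one entry per successful add
    private final Set<String> seenPaths = new HashSet<>();  // identity tracked by the application

    // Plain add: appends every time, as in the quoted description.
    void add(String path) {
        entries.add(path);
    }

    // Application-level guard: skips a path already added through this method.
    boolean addIfAbsent(String path) {
        if (!seenPaths.add(path)) {
            return false; // already recorded, nothing appended
        }
        entries.add(path);
        return true;
    }

    int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        ToyIndex plain = new ToyIndex();
        ToyIndex guarded = new ToyIndex();
        for (int i = 0; i < 15; i++) {
            plain.add("docs/report.txt");
            guarded.addIfAbsent("docs/report.txt");
        }
        System.out.println(plain.size());   // prints 15
        System.out.println(guarded.size()); // prints 1
    }
}
```

In this model the skipping lives entirely in the calling code, so both observations in the thread can hold at once: the index keeps no identity, yet a particular indexing program may still avoid duplicates.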
Can indexing logic on one host update an index on another host?
In my application the files I wish to index/search live in distributed
vaults on "safe" hosts in the intranet. Accessing those files is strictly
controlled by application logic in a (Tomcat) servlet.
Crawling the vaults is not an
Thanks to Karl and Donna, I followed your suggestions and was able to get a
test driver (modified demo code) working, thanks again.
jim s
Hi, I have been using Lucene "out of the box" since 1.4.3, wonderful full
text engine, I love it.
But I can't use it "out of the box" any more, I am going to have to write
some code (Oh no! Mr Bill.). I am fairly certain that the code needed will
be trivial, but I am unfamiliar with Lucene's A