Daniel Noll wrote:
The only screenshots I can see look like plain text to me, and I'm
currently working on something which needs to convert Word to HTML,
which is why I ask.
wvWare, which I mentioned earlier, can convert Word to HTML and does a
pretty good job of maintaining formatting. abiwor
Ryan Ackley wrote:
>> Any comments on this are appreciated. One thing I thought of would be
>> to continue to offer the text extraction as open source but add html
>> conversion with hit highlighting for a variety of file formats as a
>> commercial add on. Is this something anyone would pay for? W
Ryan Ackley wrote:
The 512 byte thing is a limitation of POIFS I think. I could be wrong
though. Have you tried opening the file with just POIFS?
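
Before handing a file to POIFS, a quick pre-check can narrow down where the problem lies. The following is only a minimal sketch, and it assumes the "512 byte thing" refers to POIFS reading the file in 512-byte blocks. It uses only the JDK, not POI itself, and the class and method names (`Ole2Check`, `hasOle2Signature`, `isWholeBlocks`, `describe`) are hypothetical. It tests for the 8-byte OLE2 signature and reports whether the file length is a whole number of 512-byte blocks.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of POI: a cheap triage step before parsing.
class Ole2Check {
    // The 8-byte signature that starts an OLE2 compound document.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Assumed block size; this is the "512 byte" figure from the thread.
    static final int BLOCK_SIZE = 512;

    /** Returns true if the given leading bytes start with the OLE2 signature. */
    static boolean hasOle2Signature(byte[] head) {
        if (head == null || head.length < OLE2_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < OLE2_MAGIC.length; i++) {
            if (head[i] != OLE2_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns true if the length is a positive whole number of blocks. */
    static boolean isWholeBlocks(long length) {
        return length > 0 && length % BLOCK_SIZE == 0;
    }

    /** Reads the first bytes of a file and summarises what was found. */
    static String describe(Path file) throws IOException {
        byte[] head;
        try (InputStream in = Files.newInputStream(file)) {
            head = in.readNBytes(OLE2_MAGIC.length); // requires Java 11+
        }
        if (!hasOle2Signature(head)) {
            return "not an OLE2 container";
        }
        return isWholeBlocks(Files.size(file))
                ? "OLE2, whole 512-byte blocks"
                : "OLE2, trailing partial block";
    }
}
```

Calling `Ole2Check.describe(Path.of("some.doc"))` on a failing file would show whether it is an OLE2 container at all, and whether its length leaves a partial block at the end. Either result is only a hint about where to look next, not a diagnosis.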
It was some time ago, but it looks like I used both
org.apache.poi.hwpf.extractor.WordExtractor
org.apache.poi.hdf.extractor.WordDocument
with the
John Haxby wrote:
Sami Siren wrote:
There's also antiword [1] which can convert your .doc to plain text or
PS, not sure how good it is.
antiword isn't very good. I use wvWare
(http://wvware.sourceforge.net/) directly, but you may find that using
abiword is better for you (abiword is an edi
Sami Siren wrote:
There's also antiword [1] which can convert your .doc to plain text or
PS, not sure how good it is.
antiword isn't very good. I use wvWare (http://wvware.sourceforge.net/)
directly, but you may find that using abiword is better for you (abiword
is an editor, but it also do
The 512 byte thing is a limitation of POIFS I think. I could be wrong
though. Have you tried opening the file with just POIFS?
On 3/26/07, Antony Bowesman <[EMAIL PROTECTED]> wrote:
Ryan Ackley wrote:
> Yes I do have plans for adding fast save support and support for more
> file formats. The tim
g Ryan's textmining in preference to the POI as internally
> TM uses
> > POI and the Word6 extractor so handles a greater variety of files.
> >
> > Ryan, thanks for fixing your site. Do you have any plans/ideas on how
> to parse
> > the 'fast-saved' fil
anks for fixing your site. Do you have any plans/ideas on how
to parse
> the 'fast-saved' files and any ideas on Word files older than the Word 6
format?
>
> Regards
> Antony
>
>
> Ryan Ackley wrote:
> > As the author of both Word POI and textmining.org, I recomme
Ryan Ackley wrote:
Yes I do have plans for adding fast save support and support for more
file formats. The time frame for this happening is the next couple of
months.
That would be good when it comes. It would be nice if it could handle a 'brute
force' mode where in the event of problems, it
Ryan Ackley wrote:
As the author of both Word POI and textmining.org, I recommend using
textmining.org. POI is for general purpose manipulation of Word
documents. textmining's only purpose is extracting text.
I wish the two would collaborate though. It's true that POI contains
code for writin
so handles a greater variety of files.
Ryan, thanks for fixing your site. Do you have any plans/ideas on how to parse
the 'fast-saved' files and any ideas on Word files older than the Word 6 format?
Regards
Antony
Ryan Ackley wrote:
> As the author of both Word POI and
I've been using Ryan's textmining in preference to the POI as internally TM uses
POI and the Word6 extractor so handles a greater variety of files.
Ryan, thanks for fixing your site. Do you have any plans/ideas on how to parse
the 'fast-saved' files and any ideas on Word
As the author of both Word POI and textmining.org, I recommend using
textmining.org. POI is for general purpose manipulation of Word
documents. textmining's only purpose is extracting text.
Also, people recommend using POI for text extraction but the only
place I've seen an actual how-to on this
Can anyone make a comparison between the two, namely POI API and the one
from textmining.org?
On 3/24/07, Ryan Ackley <[EMAIL PROTECTED]> wrote:
The site is down but you can download the word extractor library direct
here:
http://www.textmining.org/textmining.zip
Going to fix the site this we
The site is down but you can download the word extractor library direct here:
http://www.textmining.org/textmining.zip
Going to fix the site this weekend.
On 3/24/07, Sami Siren <[EMAIL PROTECTED]> wrote:
Antony Bowesman wrote:
>> Are there other solutions?
There's also antiword [1] which c
Antony Bowesman wrote:
>> Are there other solutions?
There's also antiword [1] which can convert your .doc to plain text or
PS, not sure how good it is.
--
Sami Siren
[1] http://www.winfield.demon.nl/
www.textmining.org, but the site is no longer accessible. Check Nutch which has
a Word parser - it seems to be the original textmining.org Word6+POI parser.
Pre-Word 6 and "fast-saved" files will not work. I've not found a solution for
those
Antony
[EMAIL PROTECTED] wrote:
Thank you,
Are
: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Friday, March 23, 2007 5:03:32 PM
Subject: RE: index word files ( doc )
Thank you,
Are there other solutions?
From: jafarim [mailto:[EMAIL PROTECTED]
Sent: Fri 23-
Thank you,
Are there other solutions?
From: jafarim [mailto:[EMAIL PROTECTED]
Sent: Fri 23-3-2007 18:55
To: java-user@lucene.apache.org
Subject: Re: index word files ( doc )
Hi
My experience has not been very satisfactory. It breaks very easily on many
Hi
My experience has not been very satisfactory. It breaks very easily on many files.
On 3/23/07, [EMAIL PROTECTED] <
[EMAIL PROTECTED]> wrote:
Hello,
I am planning to index Word 2003 files. I read I have to use Jakarta
Apache POI, but I also read on the POI site that their work with doc's is in
an
Hello,
I am planning to index Word 2003 files. I read I have to use Jakarta Apache
POI, but I also read on the POI site that their work with doc's is in an early
stage.
Is POI advisable? Or are there better alternatives?
Please give some advice.
Regards,
Erik
On Thu, 9 Feb 2006, Christiaan Fluit wrote:
Yes, that's exactly what I'm doing. Having this in POI would benefit me
a lot though, as I hardly understand the POI basics to be honest (my
fault, not POI's).
OK, that's now in POI (you'll need a scratchpad build from late yesterday
or today, see h
Dmitry Goldenberg wrote:
Awesome stuff. A few questions: Is your Excel extractor somehow
better than POI's? And what do you see as the timeframe for adding
WordPerfect support? Are you considering supporting any other sources
such as MS Project, FrameMaker, etc.?
I just committed a WordPerfectE
EMAIL PROTECTED]
To: java-user@lucene.apache.org
Sent: Thu 09 Feb 2006 01:36:47 PM EST
Subject: Word files
Hello,
I use the POI API to parse MS Word files in order to index the content to
enable Lucene search.
For that I download the latest jars from POI (including the scratchpad
one) and use
mitry
From: Christiaan Fluit [mailto:[EMAIL PROTECTED]
Sent: Thu 2/9/2006 4:09 AM
To: java-user@lucene.apache.org
Subject: Re: Word files & Build vs. Buy?
Hello all,
I'm replying to two threads at once as what I have to say relates to both.
My company recently started an open
Nick Burch wrote:
You could try using org.apache.poi.hwpf.HWPFDocument, and getting the
range, then the paragraphs, and grab the text from each paragraph. If
there's interest, I could probably commit an extractor that does this to
poi.
Yes, that's exactly what I'm doing. Having this in POI wo
On Thu, 9 Feb 2006, Christiaan Fluit wrote:
My experience is that the WordDocument class crashes on about 25% of the
documents, i.e. it throws some sort of Exception. I've tested POI
2.5.1-final as well as the current code in CVS, but both produce this
result. I even suspect the output to be 10
t wd = new WordDocument(is);
[EMAIL PROTECTED] wrote:
MS Word - I know that POI exists, but development on the Word portion
seems to have stopped, and there are a lot of nasty looking bugs in
their DB. Since we're involved in dealing with contracts, many of our
Word files are large and co
Hello,
I use the POI API to parse MS Word files in order to index the content to
enable Lucene search.
For that I download the latest jars from POI (including the scratchpad
one) and use the parser from lucenebook called POIWordDocHandler.
It works quite well, but for some files the parser does
29 matches