Re: Plugin extracting text from docs

2009-07-17 Thread Jonas Eckerman
Matus UHLAR - fantomas wrote: I've been thinking about it. The pdftohtml could provide interesting information like colour information that could lead to better spam detection. Any experiences with this? I've been thinking a bit more about this. My current plan is to download the trunk vers

Re: Plugin extracting text from docs

2009-07-13 Thread Jonas Eckerman
Matus UHLAR - fantomas wrote: Ah. I didn't see that option. That's nice. I'm now using pdftotext instead of pdftohtml here as well. :-) I've been thinking about it. The pdftohtml could provide interesting information like colour information that could lead to better spam detection. Any exp

Re: Plugin extracting text from docs

2009-07-13 Thread Matus UHLAR - fantomas
On 10.07.09 16:48, Jonas Eckerman wrote: > Rosenbaum, Larry M. wrote: > >> I have found the Xpdf package [...] has a pdftotext command line utility. > > If you build it with the "--without-x" option, > > Ah. I didn't see that option. That's nice. I'm now using pdftotext > instead of pdftohtml her

Re: Plugin extracting text from docs

2009-07-10 Thread Jonas Eckerman
Rosenbaum, Larry M. wrote: I have found the Xpdf package [...] has a pdftotext command line utility. > If you build it with the "--without-x" option, Ah. I didn't see that option. That's nice. I'm now using pdftotext instead of pdftohtml here as well. :-) And I've just uploaded a new versio

RE: Plugin extracting text from docs

2009-07-06 Thread Rosenbaum, Larry M.
> From: Jonas Eckerman [mailto:jonas_li...@frukt.org] > > Rosenbaum, Larry M. wrote: > > > It appears that "pdftohtml" is only available as a Windows executable > (on Sourceforge). > > If you want a precompiled executable it seems Windows is the only > platform, but AFAICS the source code is als

Re: Plugin extracting text from docs

2009-07-02 Thread Jonas Eckerman
Rosenbaum, Larry M. wrote: It appears that "pdftohtml" is only available as a Windows executable (on Sourceforge). If you want a precompiled executable it seems Windows is the only platform, but AFAICS the source code is also available at http://sourceforge.net/projects/pdftohtml/files/ >

RE: Plugin extracting text from docs

2009-07-02 Thread Martin Gregorie
On Thu, 2009-07-02 at 14:15 -0400, Rosenbaum, Larry M. wrote: > > And, please tell me of problems. > > > pdftohtml is imho not found in gentoo, but pdf2html is maybe the same ? > > It appears that "pdftohtml" is only available as a Windows executable > (on Sourceforge). I need something that wi

Re: Plugin extracting text from docs

2009-07-02 Thread Jonas Eckerman
Benny Pedersen wrote: pdftohtml is imho not found in gentoo, but pdf2html is maybe the same ? I wouldn't know since I haven't got any Gentoo machines. The "pdftohtml" I'm using is installed from FreeBSD ports. It can be downloaded from only problem i had

RE: Plugin extracting text from docs

2009-07-02 Thread Rosenbaum, Larry M.
> And, please tell me of problems. > pdftohtml is imho not found in gentoo, but pdf2html is maybe the same ? It appears that "pdftohtml" is only available as a Windows executable (on Sourceforge). I need something that will run on Solaris.

Re: Plugin extracting text from docs

2009-07-02 Thread Benny Pedersen
On Thu, July 2, 2009 15:50, Jonas Eckerman wrote: > Benny Pedersen wrote: >> just tested this plugin here, all i can say it rooks viagra out of docs rtf >> files :) > I just saw it extract a 419 from a Word doc so that it was caught by > bayes and a bunch of rules (it would actually have slipped

Re: Plugin extracting text from docs

2009-07-02 Thread Jonas Eckerman
Benny Pedersen wrote: just tested this plugin here, all i can say it rooks viagra out of docs rtf files :) I just saw it extract a 419 from a Word doc so that it was caught by bayes and a bunch of rules (it would actually have slipped past our filter otherwise). :-) > well done Thanks.

Re: Plugin extracting text from docs

2009-07-02 Thread Jonas Eckerman
Benny Pedersen wrote: ). I've now mirrored the file as I hope that will work better. Regards /Jonas -- Jonas Eckerman Fruktträdet & Förbundet Sveriges Dövblinda http://www.fsdb.org/ http://www.fru

Re: Plugin extracting text from docs

2009-07-01 Thread Benny Pedersen
On Wed, July 1, 2009 21:51, Jonas Eckerman wrote: > ). i had to use wget --continue to get it downloaded, is this a firewall limit ? stalls in 8k here, so multiple wget try to get the full zip down :( -- xpoint

Re: Plugin extracting text from docs

2009-07-01 Thread Jonas Eckerman
Rosenbaum, Larry M. wrote: We can use antiword to render text from MSWord files, and unrtf to render text from RTF files. What is the best tool to render text from PDF files? I don't know what the best tool is, but I'm currently using pdftohtml in XML mode (and then stripping the XML) in my
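
A minimal sketch of the "convert to XML, then strip the XML" step described above, in Python rather than the plugin's own language. The sample markup is made up for illustration and is an assumption; the XML actually emitted by pdftohtml may be structured differently, and this is not the plugin's code.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input; real pdftohtml XML output may look different.
SAMPLE_XML = """<doc>
  <page number="1">
    <text>Dear friend,</text>
    <text>I have a <b>business</b> proposal.</text>
  </page>
</doc>"""


def strip_xml(xml_text: str) -> str:
    """Return only the character data of an XML document.

    All markup is dropped and runs of whitespace are collapsed to single
    spaces, so the result can be handed to text-based rules.
    """
    root = ET.fromstring(xml_text)
    # itertext() yields every text fragment in document order.
    raw = "".join(root.itertext())
    return " ".join(raw.split())


if __name__ == "__main__":
    print(strip_xml(SAMPLE_XML))
```

Parsing with a real XML parser instead of a regular expression also decodes entities such as `&amp;`, at the cost of failing on malformed input.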

RE: Plugin extracting text from docs (was: new spam using large images)

2009-07-01 Thread Giampaolo Tomassoni
> We can use antiword to render text from MSWord files, and unrtf to > render text from RTF files. What is the best tool to render text from > PDF files? > > (We are running Solaris 9) FWIK, antiword is the best tradeoff between speed and conversion quality. The best converter I know of, even f
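
The tool choice discussed in this thread (antiword for Word, unrtf for RTF, pdftotext for PDF) can be sketched as a small lookup table. This is only an illustration: the function and table names are hypothetical, the command-line arguments are assumptions about how each tool writes text to standard output and should be checked against the installed man pages, and choosing by file extension is a simplification.

```python
from pathlib import Path
from typing import List, Optional

# Assumed invocations that print extracted text to stdout.
# Verify the flags against the tools installed on your platform.
CONVERTERS = {
    ".doc": ["antiword", "{path}"],
    ".rtf": ["unrtf", "--text", "{path}"],
    ".pdf": ["pdftotext", "{path}", "-"],
}


def converter_command(path: str) -> Optional[List[str]]:
    """Return the argv list for the converter matching the file extension.

    Returns None when no converter is registered for the extension.
    """
    template = CONVERTERS.get(Path(path).suffix.lower())
    if template is None:
        return None
    return [arg.format(path=path) for arg in template]


if __name__ == "__main__":
    print(converter_command("offer.pdf"))
```

The returned list can be passed to `subprocess.run(cmd, capture_output=True, timeout=...)`; passing a list rather than a shell string avoids quoting problems with attacker-chosen attachment names.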

RE: Plugin extracting text from docs (was: new spam using large images)

2009-07-01 Thread Rosenbaum, Larry M.
We can use antiword to render text from MSWord files, and unrtf to render text from RTF files. What is the best tool to render text from PDF files? (We are running Solaris 9) L > -Original Message- > From: Jonas Eckerman [mailto:jonas_li...@frukt.org] > Sent: Wednesday, June 24, 2009 1

Re: Plugin extracting text from docs

2009-06-25 Thread David B Funk
On Fri, 26 Jun 2009, Jonas Eckerman wrote: > Theo Van Dinter wrote: > > > the convolution is a > > fingerprint that you could write a rule for and then you don't care > > what the content actually is. For example, you'd render something > > like "doc_pdf_jpg", which would make an obvious Bayes to

Re: Plugin extracting text from docs

2009-06-25 Thread Jonas Eckerman
Theo Van Dinter wrote: the convolution is a fingerprint that you could write a rule for and then you don't care what the content actually is. For example, you'd render something like "doc_pdf_jpg", which would make an obvious Bayes token. In the same way for a zip file, you could do "zip_pdf z
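
The fingerprint idea quoted above (rendering a nesting chain as a token such as "doc_pdf_jpg") could look roughly like this. The function name and the normalisation rules are assumptions made for illustration, not something specified in the thread.

```python
import re
from typing import Iterable


def container_token(type_chain: Iterable[str]) -> str:
    """Join a chain of nested container types into one token.

    Each type name is lower-cased and stripped of non-alphanumeric
    characters so the token survives tokenisation as a single word.
    Empty names are skipped.
    """
    cleaned = (re.sub(r"[^a-z0-9]", "", t.lower()) for t in type_chain)
    return "_".join(c for c in cleaned if c)


if __name__ == "__main__":
    # A Word document containing a PDF containing a JPEG image.
    print(container_token(["doc", "pdf", "jpg"]))
```

Because the token depends only on the nesting structure, a rule or Bayes can match it without looking at the content of the innermost file.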

Re: Plugin extracting text from docs

2009-06-25 Thread Theo Van Dinter
On Thu, Jun 25, 2009 at 3:41 PM, Jonas Eckerman wrote: > Matus' example was a Word document that contained a PDF which (might in turn > contain an image). A plugin that knows how to read Word documents could > extract the text of the Word document and then use "set_rendered" to make > that available

Re: Plugin extracting text from docs

2009-06-25 Thread Jonas Eckerman
Theo Van Dinter wrote: I would comment that plugins should probably skip parts they want to render that already have rendered text available. Ah. That's a good idea. Now I'll have to search for a nice way to check that. :-) I can't see how "set_rendered" would help in creating a functioning
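
The suggestion above, that an extractor should skip parts which already have rendered text, can be sketched as follows. The `Part` class and its fields are hypothetical stand-ins invented for this example; they are not the SpamAssassin message-part API, which is Perl.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Part:
    """Hypothetical stand-in for a message part (not the SpamAssassin API)."""
    mime_type: str
    rendered: Optional[str] = None  # text set by an earlier plugin, if any


def parts_needing_extraction(parts: List[Part], wanted_types: set) -> List[Part]:
    """Return parts of a wanted type that no earlier plugin has rendered."""
    return [
        p for p in parts
        if p.mime_type in wanted_types and p.rendered is None
    ]


if __name__ == "__main__":
    parts = [
        Part("application/pdf"),
        Part("application/pdf", rendered="already extracted"),
        Part("image/jpeg"),
    ]
    todo = parts_needing_extraction(parts, {"application/pdf"})
    print(len(todo))
```

Checking for existing rendered text first means two extraction plugins handling the same type do not overwrite each other's output.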

Re: Plugin extracting text from docs

2009-06-25 Thread Theo Van Dinter
On Thu, Jun 25, 2009 at 1:12 PM, Jonas Eckerman wrote: >> Already exists, check recent list history for "set_rendered". > > I thought that was for text only. It is only for text. > In any case, any plugin looking for images, or a PDF, will most likely look > at MIME type and/or file name, and then

Re: Plugin extracting text from docs

2009-06-25 Thread Jonas Eckerman
Theo Van Dinter wrote: I am not sure but I think something alike was done. What I mean is to have generic chain of format converters, where at the end would be plain image or even text, that could be processed by classic rules like bayes, replacetags etc. Already exists, check recent list his

Re: Plugin extracting text from docs

2009-06-25 Thread Jonas Eckerman
Matus UHLAR - fantomas wrote: This I don't understand. Do they put PDFs inside .doc files as if the .doc was an archive? I am not sure but I think something alike was done. Considering that an OpenXML format is basically a zip file with XML files inside and that the actual document can co

Re: Plugin extracting text from docs

2009-06-25 Thread Theo Van Dinter
On Thu, Jun 25, 2009 at 11:48 AM, Matus UHLAR - fantomas wrote: > I am not sure but I think something alike was done. What I mean is to have > generic chain of format converters, where at the end would be plain image > or even text, that could be processed by classic rules like bayes, > replacetags

Re: Plugin extracting text from docs

2009-06-25 Thread Matus UHLAR - fantomas
> Matus UHLAR - fantomas wrote: > >>> I'm currently working on a modular plugin for extracting text and adding >>> it to SA message parts. >> >> if possible, extract images too, so the fuzzyocr and similar plugins would >> be able to look at that too. > > You mean extract images and add them as part

Re: Plugin extracting text from docs

2009-06-25 Thread Jonas Eckerman
Jonas Eckerman wrote: You mean extract images and add them as parts to the message? I guess that should be doable. I know that "unrtf" can extract images from RTF files. I'll probably implement support for this, but I'll probably not implement actually doing it right away. This'll probably

Re: Plugin extracting text from docs

2009-06-25 Thread Jonas Eckerman
Matus UHLAR - fantomas wrote: I'm currently working on a modular plugin for extracting text and adding it to SA message parts. if possible, extract images too, so the fuzzyocr and similar plugins would be able to look at that too. You mean extract images and add them as parts to the message?

Re: Plugin extracting text from docs (was: new spam using large images)

2009-06-25 Thread Matus UHLAR - fantomas
> Jason Haar wrote: > >> Speaking of image/rtf/word attachment spam; is there any work going on >> to standardize this so that the textual output of such attachments could >> be fed back into SA? On 24.06.09 19:33, Jonas Eckerman wrote: > Just as a note: > > I'm currently working on a modular plug