Bruno Lavoie, 07.11.2008 19:20:
Hello,
The intent is to use pdftotext and store the resulting text in a database
for full text search purposes... I'm trying to develop a mini content
server where I'll put PDF documents to make them searchable.
Generally, the PDFs are 500 to 3000 pages, resulting in text of
500 KB to 2 MB...
I'm also looking at open source projects like Alfresco to see if they can
serve my purpose more easily... Does anyone use this one? Comments are welcome.
If you are not bound to "native" Postgres tools, you might want to take a look
at my SQL Workbench/J (http://www.sql-workbench.net)
It can insert the contents of files (located on the client) into tables. You can either do this using an extended SQL syntax:
UPDATE pdf_table
SET text_content = {$clobfile=c:/temp/converted.txt encoding=utf8}
WHERE id = 42;
(of course this statement cannot be run with psql)
You could also bulk-upload several files at once using my flat-file import.
(http://www.sql-workbench.net/manual/command-import.html)
Assuming the table has two columns (id, text_content), the flat file would look
like this:
id|text_content
1|content_1.txt
2|content_2.txt
3|content_3.txt
and the import would store the content of the files, not the literal string
'content_1.txt', in the text_content column.
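The corresponding WbImport call could look something like this (only a sketch;
the file name is just an example, and you should check the manual page above
for the exact parameters, in particular whether -clobIsFilename is the right
switch for treating the column values as file names to be loaded):

-- load the flat file; text_content values are treated as file names
WbImport -file=c:/temp/file_list.txt
         -table=pdf_table
         -type=text
         -delimiter='|'
         -header=true
         -clobIsFilename=true
         -mode=insert;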
You can either insert or update the content, depending on your needs. You could
even store the original PDF file if the table contains a bytea column for the
blob data.
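As for the full text search itself (independent of how you load the data), the
Postgres side could look roughly like this; the table, column and index names
are of course just examples:

-- extracted text plus the original PDF stored as bytea
CREATE TABLE pdf_table (
    id           integer PRIMARY KEY,
    text_content text,
    pdf_data     bytea
);

-- GIN index on the tsvector of the extracted text
CREATE INDEX pdf_table_fts_idx
    ON pdf_table
    USING gin (to_tsvector('english', text_content));

-- find documents matching both words
SELECT id
  FROM pdf_table
 WHERE to_tsvector('english', text_content) @@ to_tsquery('english', 'search & term');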
Contact me offline (contact information on my homepage) if you need help.
Regards
Thomas