Re: Issues when indexing PDF files

Charlie Hull Thu, 17 Dec 2015 02:49:03 -0800

On 17/12/2015 08:45, Zheng Lin Edwin Yeo wrote:

Hi Alexandre,


Thanks for your reply.

So the only way to solve this issue is to explore with PDF specific tools
and change the encoding of the file?
Is there any way to configure it in Solr?

Solr uses Tika to extract plain text from PDFs. If the PDFs have been
created in a way that Tika cannot easily extract the text, there's
nothing you can do in Solr that will help.

Unfortunately PDF isn't a content format but a presentation format - so
extracting plain text is fraught with difficulty. You may see a
character on a PDF page, but exactly how that character is generated
(using a specific encoding, font, or even by drawing a picture) is
outside your control. There are various businesses built on this premise
- they charge for creating clean extracted text from PDFs - and even
they have trouble with some PDFs.
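
Since the fix has to happen before Solr sees the text, one option is to
flag the problem files up front. The following is only a rough sketch,
not anything Solr or Tika provides: the function name looks_garbled and
the 0.3 threshold are made up for illustration. It assumes you already
have the plain text that Tika extracted for a file, and treats the text
as suspect if it is empty or mostly "?" / U+FFFD placeholder characters.

```python
def looks_garbled(text, threshold=0.3):
    """Return True if extracted text is empty or mostly placeholders.

    Hypothetical helper: the threshold is an arbitrary starting point.
    """
    # Ignore whitespace so layout padding does not dilute the ratio.
    visible = "".join(ch for ch in text if not ch.isspace())
    if not visible:
        return True  # nothing was extracted at all
    # Count "?" and the Unicode replacement character U+FFFD.
    placeholders = sum(1 for ch in visible if ch in "?\ufffd")
    return placeholders / len(visible) >= threshold
```

Files flagged this way could be set aside for a PDF-specific tool or
OCR instead of being indexed with unusable content.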


HTH

Charlie


Regards,
Edwin


On 17 December 2015 at 15:42, Alexandre Rafalovitch <arafa...@gmail.com>
wrote:

They could be using custom fonts and non-Unicode characters. That's
probably something to explore with PDF specific tools.
On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" <edwinye...@gmail.com>
wrote:

I've checked all the files that have problems with their content in the
Solr index using the Tika app. All of them show the same issues as what
I see in the Solr index.

So does the issue lie with the encoding of the file? Are we able to check
the encoding of the file?


Regards,
Edwin


On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
wrote:

Hi Erik,

I've shared the file on dropbox, which you can access via the link here:
https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0


This is what I get from the Tika app after dropping the file in.

Content-Length: 75092
Content-Type: application/pdf
Type: COSName{Info}
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
X-TIKA:digest:SHA256: d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
access_permission:assemble_document: true
access_permission:can_modify: true
access_permission:can_print: true
access_permission:can_print_degraded: true
access_permission:extract_content: true
access_permission:extract_for_accessibility: true
access_permission:fill_in_form: true
access_permission:modify_annotations: true
dc:format: application/pdf; version=1.3
pdf:PDFVersion: 1.3
pdf:encrypted: false
producer: null
resourceName: Desmophen+670+BAe.pdf
xmpTPg:NPages: 3


Regards,
Edwin


On 17 December 2015 at 00:15, Erik Hatcher <erik.hatc...@gmail.com>
wrote:

Edwin - Can you share one of those PDF files?

Also, drop the file into the Tika app and see what it sees directly - get
the tika-app JAR and run that desktop application.

Could be an encoding issue?

         Erik

—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com

On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
wrote:


Hi,

I'm using Solr 5.3.0

I'm indexing some PDF documents. However, certain PDF files contain
Chinese text, and after indexing, what is indexed in the content is
either a series of "??????" or empty content.

I'm using the post.jar that comes together with Solr.

What could be the reason that causes this?

Regards,
Edwin



--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
