On 17/12/2015 08:45, Zheng Lin Edwin Yeo wrote:
Hi Alexandre,

Thanks for your reply.

So the only way to solve this issue is to explore with PDF specific tools
and change the encoding of the file?
Is there any way to configure it in Solr?

Solr uses Tika to extract plain text from PDFs. If the PDFs have been created in a way that Tika cannot easily extract the text, there's nothing you can do in Solr that will help.
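To confirm that the text is already lost at the Tika stage inside Solr, the extracting handler can be asked to return its output instead of indexing it. The sketch below only builds such a request URL. It assumes the handler is registered at /update/extract in solrconfig.xml; the host and the core name "mycore" are hypothetical placeholders.

```python
from urllib.parse import urlencode


def extract_only_url(solr_base, core, resource_name):
    """Build a /update/extract URL that returns Tika's output without indexing it."""
    params = {
        "extractOnly": "true",    # return the extraction, do not add a document
        "extractFormat": "text",  # plain text instead of the default XML
        "resource.name": resource_name,
        "wt": "json",
    }
    return f"{solr_base}/{core}/update/extract?{urlencode(params)}"


# Host and core name are assumptions for illustration only.
url = extract_only_url("http://localhost:8983/solr", "mycore", "Desmophen+670+BAe.pdf")
print(url)
```

The PDF bytes would then be sent as the body of a POST to that URL. If the response already contains "??????" or nothing, no Solr-side setting is involved.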

Unfortunately PDF isn't a content format but a presentation format - so extracting plain text is fraught with difficulty. You may see a character on a PDF page, but exactly how that character is generated (using a specific encoding, font, or even by drawing a picture) is outside your control. There are various businesses built on this premise - they charge for creating clean extracted text from PDFs - and even they have trouble with some PDFs.
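As a rough illustration of why this happens: a PDF page stores font-specific character codes, and an extractor can only turn them back into text if a code-to-Unicode mapping is available. The following is a toy model with made-up codes, not how Tika or PDFBox is actually implemented, and real extractors may emit other junk or nothing at all rather than "?".

```python
# Toy model only; this is not how Tika or PDFBox is implemented.
# A PDF content stream stores font-specific character codes, not Unicode.
# Hypothetical embedded font: codes 1 and 2 draw the glyphs for "中" and "文".
TO_UNICODE = {1: "中", 2: "文"}  # optional code -> Unicode table


def extract_text(codes, to_unicode=None):
    """Map character codes to text, emitting '?' where no mapping is known."""
    mapping = to_unicode if to_unicode is not None else {}
    return "".join(mapping.get(code, "?") for code in codes)


page_codes = [1, 2, 1, 2, 1, 2]
with_mapping = extract_text(page_codes, TO_UNICODE)  # mapping shipped with the PDF
without_mapping = extract_text(page_codes)           # mapping missing
print(with_mapping)
print(without_mapping)
```

Both cases render identically on screen, because the glyph shapes come from the font; only the extracted text differs.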

HTH

Charlie


Regards,
Edwin


On 17 December 2015 at 15:42, Alexandre Rafalovitch <arafa...@gmail.com>
wrote:

They could be using custom fonts and non-Unicode characters. That's
probably something to explore with PDF specific tools.
On 17 Dec 2015 1:37 pm, "Zheng Lin Edwin Yeo" <edwinye...@gmail.com>
wrote:

I've checked all the files that have problems with their content in the Solr
index using the Tika app. All of them show the same issues as I see
in the Solr index.

So does the issue lie with the encoding of the file? Are we able to check
the encoding of the file?
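A PDF does not carry one file-level text encoding that could simply be read off, but the symptom itself can be checked automatically on whatever text Tika returns. A minimal sketch, assuming the extracted text is already available as a string; the function names and the 0.5 threshold are arbitrary choices for illustration:

```python
def garbled_ratio(text):
    """Fraction of non-whitespace characters that are '?' or U+FFFD."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 1.0  # empty content counts as fully unusable
    bad = sum(1 for c in chars if c in ("?", "\ufffd"))
    return bad / len(chars)


def looks_garbled(text, threshold=0.5):
    """Flag extracted text that is empty or mostly placeholder characters."""
    return garbled_ratio(text) >= threshold


print(looks_garbled("??????"))
print(looks_garbled("Desmophen 670 BA"))
```

Running this over the extracted text of each file gives a quick list of the PDFs that need attention with PDF-specific tools.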


Regards,
Edwin


On 17 December 2015 at 00:33, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
wrote:

Hi Erik,

I've shared the file on Dropbox, which you can access via the link here:

https://www.dropbox.com/s/rufi9esmnsmzhmw/Desmophen%2B670%2BBAe.pdf?dl=0

This is what I get from the Tika app after dropping the file in.

Content-Length: 75092
Content-Type: application/pdf
Type: COSName{Info}
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-TIKA:digest:MD5: de67120e29ec7ffa24aec7e17104b6bf
X-TIKA:digest:SHA256: d0f04580d87290c1bc8068f3d5b34d797a0d8ccce2b18f626a37958c439733e7
access_permission:assemble_document: true
access_permission:can_modify: true
access_permission:can_print: true
access_permission:can_print_degraded: true
access_permission:extract_content: true
access_permission:extract_for_accessibility: true
access_permission:fill_in_form: true
access_permission:modify_annotations: true
dc:format: application/pdf; version=1.3
pdf:PDFVersion: 1.3
pdf:encrypted: false
producer: null
resourceName: Desmophen+670+BAe.pdf
xmpTPg:NPages: 3


Regards,
Edwin


On 17 December 2015 at 00:15, Erik Hatcher <erik.hatc...@gmail.com>
wrote:

Edwin - Can you share one of those PDF files?

Also, drop the file into the Tika app and see what it sees directly - get
the tika-app JAR and run that desktop application.
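The same JAR can also be driven from the command line, which is handy when many files need checking. A minimal sketch, assuming java is on the PATH; the JAR path is whatever tika-app file was downloaded. Forcing UTF-8 output avoids the console's own encoding turning Chinese characters into "?".

```python
import subprocess


def tika_command(jar_path, pdf_path, mode="text"):
    """Build the tika-app command line for plain text or metadata output."""
    flags = {"text": "--text", "metadata": "--metadata"}
    return ["java", "-jar", jar_path, "--encoding=UTF-8", flags[mode], pdf_path]


def run_tika(jar_path, pdf_path, mode="text"):
    """Run tika-app and return its standard output decoded as UTF-8."""
    result = subprocess.run(
        tika_command(jar_path, pdf_path, mode),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


# "tika-app.jar" is a placeholder path, not a specific release name.
print(tika_command("tika-app.jar", "Desmophen+670+BAe.pdf"))
```

Calling run_tika with mode="metadata" should give a listing like the Content-Type and pdf:PDFVersion output that the desktop application shows.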

Could be an encoding issue?

         Erik

—
Erik Hatcher, Senior Solutions Architect
http://www.lucidworks.com



On Dec 16, 2015, at 10:51 AM, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
wrote:

Hi,

I'm using Solr 5.3.0

I'm indexing some PDF documents. However, certain PDF files contain
Chinese text, and after indexing, what is indexed as the content is
either a series of "??????" or empty.

I'm using the post.jar that comes together with Solr.

What could be the reason that causes this?

Regards,
Edwin








--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
