PDF processing is very difficult, because the entire standard is a dumpster
fire.  For example, a typical PDF has no concept of structure like headings,
paragraphs or sentences: each character is stored as just a glyph with a
location coordinate, a font size and a font type.

To process the document and recover some of that structure, you have to
use heuristics.  Check out
https://github.com/pdfminer/pdfminer.six and, as Mike said above, good
luck.  It is not a simple task.
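
To give a feel for what "heuristics" means here, below is a minimal sketch
in plain Python.  It does not call pdfminer.six; it assumes you already
have one record per character (from whatever extraction library you use),
and the Glyph class, the fake_glyphs helper and all thresholds are
made up for illustration.  It groups characters into lines by their
vertical position, guesses where spaces go from horizontal gaps, and
treats lines set noticeably larger than the median font size as headings.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Glyph:
    """One character as a page stores it (hypothetical record type)."""
    char: str
    x: float     # left edge, in points
    y: float     # baseline, in points (PDF origin is bottom-left)
    size: float  # font size, in points
    font: str


def group_into_lines(glyphs, y_tolerance=2.0):
    """Cluster glyphs with nearly equal baselines into lines, top first."""
    lines = []
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0].y - g.y) <= y_tolerance:
            lines[-1].append(g)
        else:
            lines.append([g])
    for line in lines:
        line.sort(key=lambda g: g.x)  # left to right within a line
    return lines


def line_text(line, gap_factor=0.9):
    """Join glyphs, inserting a space where the horizontal gap is large."""
    out = [line[0].char]
    for prev, cur in zip(line, line[1:]):
        if cur.x - prev.x > gap_factor * prev.size:
            out.append(" ")
        out.append(cur.char)
    return "".join(out)


def classify(glyphs, heading_ratio=1.2):
    """Return (label, text) pairs; label is 'heading' or 'body'."""
    body_size = median(g.size for g in glyphs)  # assume body text dominates
    result = []
    for line in group_into_lines(glyphs):
        size = max(g.size for g in line)
        label = "heading" if size >= heading_ratio * body_size else "body"
        result.append((label, line_text(line)))
    return result


def fake_glyphs(text, y, size, font="Helvetica"):
    """Build demo glyphs; spaces are dropped, as many PDFs store none."""
    return [Glyph(c, i * 0.6 * size, y, size, font)
            for i, c in enumerate(text) if c != " "]


page = (fake_glyphs("Intro", 700, 18)
        + fake_glyphs("PDF has no paragraphs", 670, 10)
        + fake_glyphs("only positioned glyphs", 656, 10))

for label, text in classify(page):
    print(label, "|", text)
```

Every number in there (the baseline tolerance, the gap factor, the
heading ratio) is a guess that will break on some real document, for
example multi-column layouts, rotated text or tightly kerned fonts.
That is the part that makes this hard.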
