PDF processing is very difficult, because the entire standard is a dumpster fire. For example, the page content has no concept of structure like headings, paragraphs or sentences: each and every character is just a glyph with a location coordinate, a font size and a font type.
To process the document and try to extract some of that structure, you have to use heuristics. Check out https://github.com/pdfminer/pdfminer.six and, as Mike said above, good luck. It is not a simple task.
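To give a feel for what "heuristics" means here, below is a minimal sketch in plain Python. It is not pdfminer.six code: the `Glyph` class, the function names, and the thresholds (`y_tolerance`, `heading_ratio`) are all made up for illustration. It assumes you have already pulled out each character with its position and font size (in pdfminer.six that information is available on the `LTChar` objects you get by walking the output of `extract_pages`), and it assumes space characters are present as glyphs, which real PDFs do not guarantee.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Glyph:
    """Hypothetical record for one character: all a PDF page really gives you."""
    text: str
    x: float     # horizontal position
    y: float     # baseline position (PDF y grows upwards)
    size: float  # font size


def group_into_lines(glyphs, y_tolerance=2.0):
    """Group glyphs with (nearly) the same baseline into lines, top to bottom."""
    lines = []
    for g in sorted(glyphs, key=lambda g: (-g.y, g.x)):
        if lines and abs(lines[-1][0].y - g.y) <= y_tolerance:
            lines[-1].append(g)
        else:
            lines.append([g])
    return [sorted(line, key=lambda g: g.x) for line in lines]


def label_lines(glyphs, heading_ratio=1.2):
    """Guess 'heading' vs 'body' by comparing each line to the median font size."""
    body_size = median(g.size for g in glyphs)
    labelled = []
    for line in group_into_lines(glyphs):
        text = "".join(g.text for g in line)
        line_size = max(g.size for g in line)
        kind = "heading" if line_size >= heading_ratio * body_size else "body"
        labelled.append((kind, text))
    return labelled


def fake_line(text, y, size, x0=72.0):
    """Build made-up glyphs for one line of text, purely for the demo."""
    return [Glyph(ch, x0 + i * size * 0.5, y, size) for i, ch in enumerate(text)]


sample = (
    fake_line("Intro", 700, 18)
    + fake_line("PDF text is just glyphs.", 670, 10)
    + fake_line("No paragraphs exist.", 656, 10)
)

for kind, text in label_lines(sample):
    print(kind, "|", text)
```

Even this toy version falls over on multi-column layouts, rotated text, superscripts and documents that draw bold headings at body size, which is why it is not a simple task.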