Parsing Nested List
I am trying to parse a Python nested list that is the result of the getOutlines() function of the PyPDF2 module, using the pyparsing module. This is the result I get. What in the world is 'expandtabs', and why is it making a difference to my parse attempt?

    import PyPDF2, pyparsing
    from pyparsing import Word, alphas, nums
    pdfFileObj = open('x.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    List = pdfReader.getOutlines()
    myparser = Word(alphas) + Word(nums, exact=2) + "of" + Word(nums, exact=2)
    myparser.parseString(List)

This is the error I get:

    Traceback (most recent call last):
      File "", line 1, in
        myparser.parseString(List)
      File "C:\python\lib\site-packages\pyparsing.py", line 1620, in parseString
        instring = instring.expandtabs()
    AttributeError: 'list' object has no attribute 'expandtabs'

Thanks so much, not getting any helpful responses from https://python-forum.io.

-- 
https://mail.python.org/mailman/listinfo/python-list
Re: Parsing Nested List
On Sunday, February 4, 2018 at 5:06:26 PM UTC-6, Steven D'Aprano wrote:
> On Sun, 04 Feb 2018 14:26:10 -0800, Stanley Denman wrote:
>
> > I am trying to parse a Python nested list that is the result of the
> > getOutlines() function of module PyPDF2 using pyparsing module.
>
> pyparsing parses strings, not lists.
>
> I fear that you have completely misunderstood what pyparsing does: it
> isn't a general-purpose parser of arbitrary Python objects like lists.
> Like most parsers (actually, all parsers that I know of...) it takes text
> as input and produces some sort of machine representation:
>
> https://en.wikipedia.org/wiki/Parsing#Computer_languages
>
> So your code is not working because you are calling parseString() with a
> list argument:
>
>     myparser.parseString(List)
>
> The name of the function, parseString(), should have been a hint that it
> requires a *string* as argument.
>
> You have generated an outline:
>
>     List = pdfReader.getOutlines()
>
> but do you know what the format of that list is? I'm going to assume that
> it looks something like this:
>
>     ['ABCD 01 of 99', 'EFGH 02 of 99', 'IJKL 03 of 99', ...]
>
> since that matches the template you gave to pyparsing. Notice that:
>
> - words are separated by spaces;
>
> - the first word is any arbitrary word, made up of just letters;
>
> - followed by EXACTLY two digits;
>
> - followed by the word "of";
>
> - followed by EXACTLY two digits.
>
> Furthermore, I'm assuming it is a simple, non-nested list. If that is not
> the case, you will need to explain precisely what the format of the
> outline actually is.
>
> To parse this list is simple and pyparsing is not required:
>
>     for item in List:
>         words = item.split()
>         if len(words) != 4:
>             raise ValueError('bad input data: %r' % item)
>         first, number, x, total = words
>         number = int(number)
>         assert x == 'of'
>         total = int(total)
>         print(first, number, total)
>
> Hope this helps.
>
> (Please keep any replies on the list.)
>
> -- 
> Steve

Thank you so much, Steve. I do seem to be barking up the wrong tree. The result of running getOutlines() is indeed a nested list: it is the PDF's bookmarks. There are 3 levels. Level 1 is the sections, A-F. Within a section there are exhibits, so in Section A we have exhibits 1A to nA. Finally, there are bookmarks for the individual pages in an exhibit. So we have this for Section A:

    [{'/Title': 'Section A. Payment Documents/Decisions', '/Page': IndirectObject(1, 0), '/Type': '/FitB'},
     [{'/Title': '1A: Disability Determination Transmittal (831) Dec. Dt.: 05/27/2016 (1 page)', '/Page': IndirectObject(1, 0), '/Type': '/FitB'},
      [{'/Title': '1A (Page 1 of 1)', '/Page': IndirectObject(1, 0), '/Type': '/FitB'}],
      {'/Title': '2A: Disability Determination Explanation (DDE) Dec. Dt.: 05/27/2016 (10 pages)', '/Page': IndirectObject(6, 0), '/Type': '/FitB'},
      [{'/Title': '2A (Page 1 of 10)', '/Page': IndirectObject(6, 0), '/Type': '/FitB'},
       {'/Title': '2A (Page 2 of 10)', '/Page': IndirectObject(10, 0), '/Type': '/FitB'},
       {'/Title': '2A (Page 3 of 10)', '/Page': IndirectObject(14, 0), '/Type': '/FitB'},
       {'/Title': '2A (Page 4 of 10)', '/Page': IndirectObject(18, 0), '/Type': '/FitB'},
       {'/Title': '2A (Page 5 of 10)', '/Page': IndirectObject(22, 0), '/Type': '/FitB'},
       {'/Title': '2A (Page 6 of 10)', '/Page': IndirectObject(26, 0), '/Type': '/FitB'},
       {'/Title': '2A (Page 7 of 10)', '/Page': IndirectObject(30, 0), '/Type': '/FitB'},
       {'/Title': '2A (Page 8 of 10)', '/Page': IndirectObject(34, 0), '/Type': '/FitB'},
       {'/Title': '2A (Page 9 of 10)', '/Page': IndirectObject(38, 0), '/Type': '/FitB'},
       {'/Title': '2A (Page 10 of 10)', '/Page': IndirectObject(42, 0), '/Type': '/FitB'}],
      {'/Title': '3A: ALJ Hearing Decision (ALJDEC) Dec. Dt.: 12/17/2012 (22 pages)', '/Page': IndirectObject(47, 0), '/Type': '/FitB'},
      [{&
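For a three-level structure like the one described in this message, one possible approach is a small recursive walk that reports each bookmark with its depth. This is only a sketch: walk_outlines is a hypothetical helper name, the sample data is a hand-made stand-in with the '/Page' and '/Type' entries left out, and it assumes that a sub-list always holds the children of the bookmark just before it.

```python
def walk_outlines(outlines, depth=1):
    """Yield (depth, title) for every bookmark in a nested outline list."""
    for entry in outlines:
        if isinstance(entry, list):
            # Assumption: a sub-list holds the children of the preceding bookmark.
            yield from walk_outlines(entry, depth + 1)
        else:
            yield depth, entry['/Title']


# Hand-made stand-in for part of a getOutlines() result (not real PyPDF2 objects).
sample = [
    {'/Title': 'Section A. Payment Documents/Decisions'},
    [
        {'/Title': '1A: Disability Determination Transmittal (831) Dec. Dt.: 05/27/2016 (1 page)'},
        [{'/Title': '1A (Page 1 of 1)'}],
        {'/Title': '2A: Disability Determination Explanation (DDE) Dec. Dt.: 05/27/2016 (10 pages)'},
        [{'/Title': '2A (Page 1 of 10)'}, {'/Title': '2A (Page 2 of 10)'}],
    ],
]

bookmarks = list(walk_outlines(sample))
for depth, title in bookmarks:
    # Indent each title by its level: section, exhibit, page.
    print('  ' * (depth - 1) + title)
```

With the depth available, the exhibit-level entries (depth 2) can be picked out without caring how many page bookmarks sit underneath them.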
Re: Parsing Nested List
On Sunday, February 4, 2018 at 5:32:51 PM UTC-6, Stanley Denman wrote:
> On Sunday, February 4, 2018 at 4:26:24 PM UTC-6, Stanley Denman wrote:
> > I am trying to parse a Python nested list that is the result of the
> > getOutlines() function of module PyPDF2 using pyparsing module. This is the
> > result I get. what in the world are 'expandtabs' and why is that making a
> > difference to my parse attempt?
> >
> >     import PyPDF2, pyparsing
> >     from pyparsing import Word, alphas, nums
> >     pdfFileObj = open('x.pdf', 'rb')
> >     pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
> >     List = pdfReader.getOutlines()
> >     myparser = Word(alphas) + Word(nums, exact=2) + "of" + Word(nums, exact=2)
> >     myparser.parseString(List)
> >
> > This is the error I get:
> >
> >     Traceback (most recent call last):
> >       File "", line 1, in
> >         myparser.parseString(List)
> >       File "C:\python\lib\site-packages\pyparsing.py", line 1620, in parseString
> >         instring = instring.expandtabs()
> >     AttributeError: 'list' object has no attribute 'expandtabs'
> >
> > Thanks so much, not getting any helpful responses from
> > https://python-forum.io.

I have found that I can use the index values in the list to print out the section I need. So print(MyList[7]) gets me to Section F, which is the one I want. print(MyList[9][1]), for example, gives me a string that is the bookmark entry for Exhibit 1F. But this index value would presumably be different for each PDF file: there may not always be Sections A-E, but there will always be a Section F. In other words, the index values that get me to the right section would be different in each PDF file.
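Since the position of Section F changes from file to file, one way around hard-coded indexes is to look the section up by its title. The sketch below is an illustration under assumptions, not something taken from PyPDF2: find_section is a hypothetical helper, the 'Section F. Medical Records' title is invented for the example, and the top-level list is assumed to alternate between a section bookmark (a dict) and the list of its children.

```python
def find_section(outlines, prefix):
    """Return the child list of the top-level bookmark whose title starts with prefix.

    Returns [] if the section has no children and None if it is not found.
    """
    for index, entry in enumerate(outlines):
        if isinstance(entry, dict) and entry['/Title'].startswith(prefix):
            # Assumption: the children, if any, are the list right after the bookmark.
            if index + 1 < len(outlines) and isinstance(outlines[index + 1], list):
                return outlines[index + 1]
            return []
    return None


# Hand-made stand-in for a getOutlines() result; the Section F title is made up.
sample = [
    {'/Title': 'Section A. Payment Documents/Decisions'},
    [{'/Title': '1A: Disability Determination Transmittal (831) Dec. Dt.: 05/27/2016 (1 page)'}],
    {'/Title': 'Section F. Medical Records'},
    [
        {'/Title': '1F: Progress Notes Src.: MILANI, JOHN C Tmt. Dt.: 05/12/2014 - 05/28/2014 (9 pages)'},
        [{'/Title': '1F (Page 1 of 9)'}],
    ],
]

section_f = find_section(sample, 'Section F')
print(section_f[0]['/Title'])
```

The same lookup works whether or not Sections A-E are present, because it never depends on where Section F sits in the list.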
Extracting data from Python dictionary object
I am new to Python. I am trying to extract text from the bookmarks in a PDF file that would provide the data for a Word template merge. I have gotten down to a string of text pulled out of the list object that I got from using the PyPDF2 module. I am stuck on how to get the data I need out of the string. I am calling it a string, but Python recognizes it as a dictionary object.

Here is the string:

    {'/Title': '1F: Progress Notes Src.: MILANI, JOHN C Tmt. Dt.: 05/12/2014 - 05/28/2014 (9 pages)', '/Page': IndirectObject(465, 0), '/Type': '/FitB'}

What I want is the following to end up as fields on my Word template merge:

    MedSourceFirstName: "John"
    MedSourceLastName: "Milani"
    MedSourceLastTreatment: "05/28/2014"

If I use keys() on the dictionary I get this:

    ['/Title', '/Page', '/Type']

I was hoping "Src." and "Tmt. Dt." would be treated as keys. It seems like the key/value pairs of a dictionary would translate nicely to field name and field data for a Word document merge. Here is my code so far.

[python]
import PyPDF2
pdfFileObj = open('x.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
MyList = pdfReader.getOutlines()
MyDict = (MyList[-1][0])
print(isinstance(MyDict, dict))
print(MyDict)
print(list(MyDict.keys()))
[/python]

I get this output in Sublime Text:

    True
    {'/Title': '1F: Progress Notes Src.: MILANI, JOHN C Tmt. Dt.: 05/12/2014 - 05/28/2014 (9 pages)', '/Page': IndirectObject(465, 0), '/Type': '/FitB'}
    ['/Title', '/Page', '/Type']
    [Finished in 0.4s]

Thank you in advance for any suggestions.
Re: Extracting data from Python dictionary object
On Friday, February 9, 2018 at 1:08:27 AM UTC-6, dieter wrote:
> Stanley Denman writes:
>
> > I am new to Python. I am trying to extract text from the bookmarks in a PDF
> > file that would provide the data for a Word template merge. I have gotten
> > down to a string of text pulled out of the list object that I got from
> > using PyPDF2 module. I am stuck on now to get the data out of the string
> > that I need. I am calling it a string, but Python is recognizing as a
> > dictionary object.
> >
> > Here is the string:
> >
> > {'/Title': '1F: Progress Notes Src.: MILANI, JOHN C Tmt. Dt.:
> > 05/12/2014 - 05/28/2014 (9 pages)', '/Page': IndirectObject(465, 0),
> > '/Type': '/FitB'}
> >
> > What a want is the following to end up as fields on my Word template merge:
> > MedSourceFirstName: "John"
> > MedSourceLastName: "Milani"
> > MedSourceLastTreatment: "05/28/2014"
> >
> > If I use keys() on the dictionary I get this:
> > ['/Title', '/Page', '/Type']I was hoping "Src" and Tmt Dt." would be
> > treated as keys. Seems like the key/value pair of a dictionary would
> > translate nicely to fieldname and fielddata for a Word document merge.
> > Here is my code so far.
>
> A Python "dict" is a mapping of keys to values. Its "keys" method
> gives you the keys (as you have used above).
> The subscription syntax ("[]"; e.g.
> "pdf_info['/Title']") allows you to access the value associated with
> a key.
>
> In your case, relevant information is coded inside the values themselves.
> You will need to extract this information yourself. Python's "re" module
> might be of help (see the "library reference", for details).

Thanks for your response. Nice to know I am at least on the right path. Sounds like I am going to have to dig into regex to get at the text I want.
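As a rough illustration of the two steps mentioned in this exchange (subscripting the dict to get the title, then using "re" on that string), here is a minimal sketch. It assumes every title of interest is shaped like "... Src.: LAST, FIRST ... Tmt. Dt.: start - end (...)", as in the one example given; TITLE_RE and the group names are hypothetical, and the '/Page' entry is left out of the sample dict because IndirectObject comes from PyPDF2.

```python
import re

# Assumed title shape: "... Src.: LAST, FIRST [INITIAL] Tmt. Dt.: [start - ]end (...)".
TITLE_RE = re.compile(
    r'Src\.:\s*(?P<last>[^,]+),\s*(?P<first>\S+)'    # e.g. "MILANI, JOHN"
    r'.*?Tmt\. Dt\.:\s*'
    r'(?:\d{2}/\d{2}/\d{4}\s*-\s*)?'                  # optional start date
    r'(?P<last_treatment>\d{2}/\d{2}/\d{4})'          # last (or only) date
)

pdf_info = {
    '/Title': '1F: Progress Notes Src.: MILANI, JOHN C Tmt. Dt.: 05/12/2014 - 05/28/2014 (9 pages)',
    '/Type': '/FitB',
}

# Subscript the dict to get the title string, then search that string.
match = TITLE_RE.search(pdf_info['/Title'])
fields = None
if match:
    fields = {
        'MedSourceFirstName': match.group('first').title(),
        'MedSourceLastName': match.group('last').strip().title(),
        'MedSourceLastTreatment': match.group('last_treatment'),
    }
print(fields)
```

Titles that do not contain both markers simply produce no match, so fields stays None for them.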
Re: Extracting data from Python dictionary object (Posting On Python-List Prohibited)
On Friday, February 9, 2018 at 12:20:29 AM UTC-6, Lawrence D’Oliveiro wrote:
> On Friday, February 9, 2018 at 6:04:48 PM UTC+13, Stanley Denman wrote:
> > {'/Title': '1F: Progress Notes Src.: MILANI, JOHN C Tmt. Dt.:
> > 05/12/2014 - 05/28/2014 (9 pages)', '/Page': IndirectObject(465, 0),
> > '/Type': '/FitB'}
> >
> > What a want is the following to end up as fields on my Word template merge:
> > MedSourceFirstName: "John"
> > MedSourceLastName: "Milani"
> > MedSourceLastTreatment: "05/28/2014"
> >
> > If I use keys() on the dictionary I get this:
> > ['/Title', '/Page', '/Type']I was hoping "Src" and Tmt Dt." would be treated
> > as keys. Seems like the key/value pair of a dictionary would translate
> > nicely to fieldname and fielddata ...
>
> It would, except that’s not how the information is represented in the PDF
> file. Looks like what you want is all in the title string. So extracting it
> will require some string manipulation. Do all the title strings follow the
> same format? That should simplify the manipulations you need to do.

Thank you, Lawrence, for your response. Sounds like I am going to have to dig into regex to get at the text I want.
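The "string manipulation" route suggested here does not strictly need regular expressions. The following is a sketch using only str methods, again assuming the "Src.: LAST, FIRST ... Tmt. Dt.: start - end (...)" layout of the one example title; title_to_fields is a hypothetical helper, and titles without both markers are returned as None rather than guessed at.

```python
def title_to_fields(title):
    """Split a bookmark title into Word merge fields, or return None if it does not fit."""
    if 'Src.:' not in title or 'Tmt. Dt.:' not in title:
        return None  # e.g. exhibits in other sections that use "Dec. Dt.:" instead
    # Text between "Src.:" and "Tmt. Dt.:" is the source name, "LAST, FIRST [INITIAL]".
    _, _, rest = title.partition('Src.:')
    name, _, dates = rest.partition('Tmt. Dt.:')
    last, _, given = name.partition(',')
    given_parts = given.split()
    if not given_parts:
        return None
    # Dates look like "05/12/2014 - 05/28/2014 (9 pages)"; keep the last one.
    date_part = dates.split('(')[0]
    last_date = date_part.split('-')[-1].strip()
    return {
        'MedSourceFirstName': given_parts[0].title(),
        'MedSourceLastName': last.strip().title(),
        'MedSourceLastTreatment': last_date,
    }


title = '1F: Progress Notes Src.: MILANI, JOHN C Tmt. Dt.: 05/12/2014 - 05/28/2014 (9 pages)'
print(title_to_fields(title))
```

Whether this holds up depends on the answer to the question asked above: it only works if all the title strings in Section F really follow the same format.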
Regex on a Dictionary
I am trying to perform a regex on a "string" of text that Python's isinstance is telling me is a dictionary. When I run the code I get the following error:

    {'/Title': '1F: Progress Notes Src.: MILANI, JOHN C Tmt. Dt.: 05/12/2014 - 05/28/2014 (9 pages)', '/Page': IndirectObject(465, 0), '/Type': '/FitB'}
    Traceback (most recent call last):
      File "C:\Users\stand\Desktop\PythonSublimeText.py", line 9, in
        x=MyRegex.findall(MyDict)
    TypeError: expected string or bytes-like object

Here is the "string" of code I am working with:

    {'/Title': '1F: Progress Notes Src.: MILANI, JOHN C Tmt. Dt.: 05/12/2014 - 05/28/2014 (9 pages)', '/Page': IndirectObject(465, 0), '/Type': '/FitB'}

I want to grab the name "MILANI, JOHN C" and the last date ("mm/dd/yyyy") as a pair, such that if I have N strings like the above I will end up with N pairs of values (name and date). Here is my code:

    import PyPDF2, re
    pdfFileObj = open('x.pdf', 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
    Result = pdfReader.getOutlines()
    MyDict = (Result[-1][0])
    print(MyDict)
    print(isinstance(MyDict, dict))
    MyRegex = re.compile(r"MILANI,")
    x = MyRegex.findall(MyDict)
    print(x)

Thanks in advance for any help.
Re: Regex on a Dictionary
On Tuesday, February 13, 2018 at 9:41:14 AM UTC-6, Mark Lawrence wrote:
> On 13/02/18 13:11, Stanley Denman wrote:
> > I am trying to performance a regex on a "string" of text that python
> > isinstance is telling me is a dictionary. When I run the code I get the
> > following error:
> >
> > {'/Title': '1F: Progress Notes Src.: MILANI, JOHN C Tmt. Dt.:
> > 05/12/2014 - 05/28/2014 (9 pages)', '/Page': IndirectObject(465, 0),
> > '/Type': '/FitB'}
> >
> > Traceback (most recent call last):
> >   File "C:\Users\stand\Desktop\PythonSublimeText.py", line 9, in
> >     x=MyRegex.findall(MyDict)
> > TypeError: expected string or bytes-like object
> >
> > Here is the "string" of code I am working with:
>
> Please call it a dictionary as in the subject line, quite clearly it is
> not a string in any way, shape or form.
>
> > {'/Title': '1F: Progress Notes Src.: MILANI, JOHN C Tmt. Dt.:
> > 05/12/2014 - 05/28/2014 (9 pages)', '/Page': IndirectObject(465, 0),
> > '/Type': '/FitB'}
> >
> > I want to grab the name "MILANI, JOHN C" and the last date "-mm/dd/" as
> > a pair such that if I have X numbers of string like the above I will end
> > out with N pairs of values (name and date)/ Here is my code:
> >
> >     import PyPDF2,re
> >     pdfFileObj=open('x.pdf','rb')
> >     pdfReader=PyPDF2.PdfFileReader(pdfFileObj)
> >     Result=pdfReader.getOutlines()
> >     MyDict=(Result[-1][0])
> >     print(MyDict)
> >     print(isinstance(MyDict,dict))
> >     MyRegex=re.compile(r"MILANI,")
> >     x=MyRegex.findall(MyDict)
> >     print(x)
> >
> > Thanks in advance for any help.
>
> Was the string methods solution that I gave a week or so ago so bad that
> you still think that you need a regex to solve this?
>
> -- 
> My fellow Pythonistas, ask not what our language can do for you, ask
> what you can do for our language.
>
> Mark Lawrence

My apologies, Mark. You took the time to give me the basis of a non-regex solution and I had not taken the time to fully review your answer. I did not understand it at first blush, but I think I do now.
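For reference, the TypeError in this thread comes from handing findall() the dict itself; the pattern has to be applied to the '/Title' value, which is the actual string. Below is a sketch of collecting one (name, date) pair per exhibit from a list of such dicts. It is not the string-methods solution referred to above, which is not reproduced in this thread. PAIR_RE and collect_pairs are hypothetical names, the second sample title is invented, and the pattern assumes the last date is the one right before the "(n pages)" suffix.

```python
import re

# Captures the text after "Src.:" and the last date before the "(n pages)" suffix.
PAIR_RE = re.compile(r'Src\.:\s*(.+?)\s+Tmt\. Dt\.:.*?(\d{2}/\d{2}/\d{4})\s*\(')


def collect_pairs(entries):
    """Return a list of (name, last_date) tuples, one per matching exhibit title."""
    pairs = []
    for entry in entries:
        if not isinstance(entry, dict):
            continue  # nested lists hold the page-level bookmarks
        # Pass the title string, not the dict, to the regex.
        pairs.extend(PAIR_RE.findall(entry['/Title']))
    return pairs


# Hand-made stand-in for the Section F list; the 2F entry is made up.
entries = [
    {'/Title': '1F: Progress Notes Src.: MILANI, JOHN C Tmt. Dt.: 05/12/2014 - 05/28/2014 (9 pages)'},
    [{'/Title': '1F (Page 1 of 9)'}],
    {'/Title': '2F: Office Treatment Records Src.: DOE, JANE Tmt. Dt.: 01/03/2015 (4 pages)'},
]

try:
    PAIR_RE.findall(entries[0])  # a dict: this is the call that fails
except TypeError as exc:
    print('TypeError:', exc)

pairs = collect_pairs(entries)
print(pairs)
```

Each tuple can then be fed to whatever builds the Word merge fields.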