[JOB] Looking for a Full-Time Plone Developer
Hi All,

Please take a look at a new job opportunity for Python/Plone developers.

Patrick Waldo, Project Manager, Decernis <http://decernis.com/>

*Job Description: Full-Time Python/Plone Developer*

We are looking for a highly motivated and self-reliant developer to work on systems built with Plone in a small but lively team. Our ideal candidate is not afraid to roll up their sleeves to tackle complex problems in code, and can offer innovative solutions at the planning and design stages. The position also calls for rapid prototyping for client meetings, maintaining current systems, and fulfilling other tasks as necessary. It requires experience in Plone administration, setting up backup and failover instances, optimizing the ZODB, testing, and documentation. The job also entails creating clean user interfaces, building forms, and integrating with Oracle. The position will begin with a six-month trial period, at which point full-time employment will be re-evaluated based on performance. The candidate will be able to choose their hours and work remotely, but must meet deadlines and report progress effectively.

*Key Skills*

· At least 3 years of Plone and Python development
· At least 3 years of web development (HTML, CSS, jQuery, etc.)
· Server administration (Apache)
· Oracle integration (cx_Oracle, SQLAlchemy, etc.)
· Task-oriented, with solid project management skills
· Data mining and data visualization experience a plus
· Java or Perl experience a plus
· Proficient English
· Effective communication

*About Decernis*

Decernis is a global information systems company that works with industry leaders and government agencies to meet complex regulatory compliance and risk management needs in the areas of food, consumer products, and industrial goods. We hold ourselves to high standards in the technically challenging areas our clients face, to ensure comprehensive, current, and global solutions.
Decernis has offices in Rockville, MD and Frankfurt, Germany, as well as teams located around the world. For more information, please visit our website: http://www.decernis.com.

*Contact*

Please send resume, portfolio, and cover letter to Cynthia Gamboa, cgam...@decernis.com. Decernis is an equal opportunity employer.

--
http://mail.python.org/mailman/listinfo/python-list
[JOB] Two opportunities at Decernis
Hi All,

The company I work for, Decernis, has two job opportunities that might be of interest. Decernis provides global systems for regulatory compliance management of foods and consumer products to world leaders in each sector. The company has offices in Rockville, MD as well as Frankfurt, Germany.

First, we are looking for a highly effective, full-time senior software engineer with experience in both development and client interaction. This position will work mostly in Java, but Python is definitely a plus.

Second, we are looking for a highly motivated and self-reliant independent contractor to help us build customized RSS feeds, web crawlers, and site monitors. This position is part-time and all programs will be written in Python. Experience in Plone will be an added benefit.

Please see below for more information. Send resume and cover letter to Cynthia Gamboa, cgam...@decernis.com.

Best,
Patrick
Project Manager, Decernis News & Issue Management

*Job Description: Full-Time Senior Software Engineer*

We are looking for a highly effective senior software engineer with experience in both development and client interaction. Our company provides global systems for regulatory compliance management of foods and consumer products to world leaders in each sector. Our ideal candidate has the following experience:

· 5 or more years of Java/J2EE development, including JBoss/Tomcat, web applications, and deployment
· 4 or more years of Oracle database development, including Oracle 10g or later versions
· Strong Unix/Linux OS working experience
· Strong scripting-language experience in Python and Perl
· Experience with rule-based expert systems
· Experience with Plone and other CMSes a plus

Salary commensurate with experience. This position reports directly to the Director of System Development.
*About Decernis*

Decernis is a global information company that works with industry leaders and government agencies to meet complex regulatory compliance and risk management needs. We work closely with our clients to produce results that meet the high standards demanded in technically challenging areas, ensuring comprehensive, current, and global solutions. Our team has the regulatory, scientific, data, and systems expertise to succeed with our clients, and we are dedicated to results. Decernis has offices in Rockville, MD and Frankfurt, Germany. Relocating to the Washington, DC area is a requirement of the position. Decernis is an equal opportunity employer and will not discriminate against any individual, employee, or applicant for employment on the basis of race, color, marital status, religion, age, sex, sexual orientation, national origin, handicap, or any other legally protected status recognized by federal, state, or local law.

###

*Job Description: Part-Time Python Programmer*

We are looking for a highly motivated and self-reliant independent contractor to help us build customized RSS feeds, web crawlers, and site monitors. Our ideal candidate has experience with data mining techniques as well as building web crawlers and scrapers. The candidate will be able to choose their hours and work remotely, but must meet expected deadlines and be able to report progress effectively. In addition, we are looking for someone who can think through the problem set and contribute their own solutions while balancing project goals and direction. The project will last approximately three months, but sufficient performance could lead to future work. This position reports directly to the Director of System Development.
*Key Skills*

· Data mining & web crawling (required)
· Python development (required)
· Statistics
· Task-oriented
· Proficient English
· Effective communication

*About Decernis*

Decernis is a global information company that works with industry leaders and government agencies to meet complex regulatory compliance and risk management needs. We work closely with our clients to produce results that meet the high standards demanded in technically challenging areas, ensuring comprehensive, current, and global solutions. Our team has the regulatory, scientific, data, and systems expertise to succeed with our clients, and we are dedicated to results. Decernis has offices in Rockville, MD and Frankfurt, Germany. Relocating to the Washington, DC area is not a requirement. Decernis is an equal opportunity employer and will not discriminate against any individual, employee, or applicant for employment on the basis of race, color, marital status, religion, age, sex, sexual orientation, national origin, handicap, or any other legally protected status recognized by federal, state, or local law.

###
Simple Text Processing Help
Hi all,

I started Python just a little while ago and I am stuck on something that is really simple, but I just can't figure out. Essentially I need to take a text document with some chemical information in Czech and organize it into another text file. The information is always EINECS number, CAS, chemical name, and formula in tables. I need to organize them into lines with | in between. So it goes from:

    200-763-1 71-73-8 nátrium-tiopentál C11H18N2O2S.Na

to:

    200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na

but if I have a chemical like kyselina močová, I get:

    200-720-7|69-93-2|kyselina|močová|C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál

and then it is all off. How can I get Python to realize that a chemical name may have a space in it?

Thank you,
Patrick

So far I have:

    # take tables in one text file and organize them into lines in another
    import codecs

    path = "c:\\text_samples\\chem_1_utf8.txt"
    path2 = "c:\\text_samples\\chem_2.txt"
    input = codecs.open(path, 'r', 'utf8')
    output = codecs.open(path2, 'w', 'utf8')

    # read and enter into a list
    chem_file = []
    chem_file.append(input.read())

    # split words and store them in a list
    for word in chem_file:
        words = word.split()

    # starting values in list
    e = 0   # EINECS
    c = 1   # CAS
    ch = 2  # chemical name
    f = 3   # formula
    n = 0
    loop = 1
    x = len(words)  # counts how many words there are in the file
    print '-'*100
    while loop == 1:
        if n
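For what it's worth, a minimal sketch of the record-joining approach (written in current Python 3 syntax; it assumes tokens arrive in EINECS, CAS, name..., formula order, with the name possibly split across several tokens):

```python
def join_record(tokens):
    # Everything between the CAS number and the formula is the name,
    # so collapse tokens[2:-1] into a single space-joined element.
    tokens = list(tokens)
    tokens[2:-1] = [' '.join(tokens[2:-1])]
    return '|'.join(tokens)

print(join_record(['200-763-1', '71-73-8', 'natrium-tiopental', 'C11H18N2O2S.Na']))
print(join_record(['200-720-7', '69-93-2', 'kyselina', 'mocova', 'C5H4N4O3']))
```

This only works per record, of course; the replies in the thread deal with how to split the input stream into records in the first place.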
Re: Simple Text Processing Help
Thank you both for helping me out. I am still rather new to Python and so I'm probably trying to reinvent the wheel here. When I try Paul's suggestion, I get:

    >>> tokens = line.strip().split()
    []

So I am not quite sure how to read line by line. tokens = input.read().split() gets me all the information from the file. tokens[2:-1] = [u' '.join(tokens[2:-1])] works just fine, like in the example; however, how can I loop this for the entire document? Also, when I try output.write(tokens), I get "TypeError: coercing to Unicode: need string or buffer, list found". Any ideas?

On Oct 14, 4:25 pm, Paul Hankin <[EMAIL PROTECTED]> wrote:
> On Oct 14, 2:48 pm, [EMAIL PROTECTED] wrote:
> > Hi all,
> >
> > I started Python just a little while ago and I am stuck on something
> > that is really simple, but I just can't figure out.
> >
> > Essentially I need to take a text document with some chemical
> > information in Czech and organize it into another text file. The
> > information is always EINECS number, CAS, chemical name, and formula
> > in tables. I need to organize them into lines with | in between. So
> > it goes from:
> >
> > 200-763-1 71-73-8 nátrium-tiopentál C11H18N2O2S.Na
> >
> > to:
> >
> > 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na
> >
> > but if I have a chemical like: kyselina močová
> >
> > I get:
> > 200-720-7|69-93-2|kyselina|močová|C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál
> >
> > and then it is all off.
> >
> > How can I get Python to realize that a chemical name may have a space
> > in it?
>
> In the original file, is every chemical on a line of its own? I assume
> it is here. You might use a regexp (look at the re module), or I think
> here you can use the fact that only chemicals have spaces in them.
> Then, you can split each line on whitespace (like you're doing), and
> join back together all the words between the 3rd (i.e. index 2) and
> the last (i.e. index -1) using tokens[2:-1] = [u' '.join(tokens[2:-1])].
> This uses the somewhat unusual Python syntax for replacing a section
> of a list with another list.
>
> The approach you took involves reading the whole file and building a
> list of all the chemicals, which you don't seem to use; I've changed
> it to a per-line version and removed the big lists.
>
>     path = "c:\\text_samples\\chem_1_utf8.txt"
>     path2 = "c:\\text_samples\\chem_2.txt"
>     input = codecs.open(path, 'r', 'utf8')
>     output = codecs.open(path2, 'w', 'utf8')
>
>     for line in input:
>         tokens = line.strip().split()
>         tokens[2:-1] = [u' '.join(tokens[2:-1])]
>         chemical = u'|'.join(tokens)
>         print chemical + u'\n'
>         output.write(chemical + u'\r\n')
>
>     input.close()
>     output.close()
>
> Obviously, this isn't tested because I don't have your chem_1_utf8.txt
> file.
>
> --
> Paul Hankin
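The slice-assignment idiom Paul mentions is worth seeing in isolation (a minimal sketch, Python 3 syntax): assigning a one-element list to the slice `tokens[2:-1]` replaces however many name tokens there were with a single joined element, so the list shrinks to exactly four fields.

```python
tokens = ['200-720-7', '69-93-2', 'kyselina', 'mocova', 'C5H4N4O3']

# Replace the slice [2:-1] (all the name tokens) with one joined element.
# The list changes length: 5 elements in, 4 elements out.
tokens[2:-1] = [' '.join(tokens[2:-1])]
print(tokens)
```

Note that with exactly four tokens the slice holds one element and the join is a no-op, which is why the same line works for single-word names; lines with fewer than three tokens would need guarding.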
Re: Simple Text Processing Help
> lines = open('your_file.txt').readlines()[:4]
> print lines
> print map(len, lines)

gave me:

    ['\xef\xbb\xbf200-720-769-93-2\n', 'kyselina mo\xc4\x8dov \xc3\xa1 C5H4N4O3\n', '\n', '200-001-8\t50-00-0\n']
    [28, 32, 1, 18]

I think it means that I'm still at option 3. I got the line-by-line part. My code is a lot cleaner now:

    import codecs

    path = "c:\\text_samples\\chem_1_utf8.txt"
    path2 = "c:\\text_samples\\chem_2.txt"
    input = codecs.open(path, 'r', 'utf8')
    output = codecs.open(path2, 'w', 'utf8')

    for line in input:
        tokens = line.strip().split()
        tokens[2:-1] = [u' '.join(tokens[2:-1])]  # this doesn't seem to combine the fields correctly
        file = u'|'.join(tokens)  # this does put '|' in between
        print file + u'\n'
        output.write(file + u'\r\n')

    input.close()
    output.close()

My sample input file looks like this (not organized, as you see it):

    200-720-769-93-2
    kyselina mocová C5H4N4O3

    200-001-8 50-00-0
    formaldehyd CH2O

    200-002-3 50-01-1
    guanidínium-chlorid CH5N3.ClH

etc., and after the program I get:

    200-720-7|69-93-2|
    kyselina|mocová||C5H4N4O3
    200-001-8|50-00-0|
    formaldehyd|CH2O|
    200-002-3|
    50-01-1|
    guanidínium-chlorid|CH5N3.ClH|

etc. So, I am sort of back at the start again. If I add:

    tokens = line.strip().split()
    for token in tokens:
        print token

I get all the single tokens, which I thought I could then put together, except when I did:

    for token in tokens:
        s = u'|'.join(token)
        print s

I got ?|2|0|0|-|7|2|0|-|7, etc. How can I join these together into nice neat little lines? When I try to store the tokens in a list, the tokens double and I don't know why. I can work on getting the chemical names together after...baby steps, or maybe I am just missing something obvious. The first two numbers will always be the same: three digits-three digits-one digit, and then two digits-two digits-one digit. This seems to be the only pattern.
My intuition tells me that I need to add an if statement that says: if the first two numbers follow the pattern, then continue; if they don't (i.e. a chemical name was accidentally split apart), then the third entry needs to be put together. Something like:

    if tokens[1] and tokens[2] startswith('pattern') == true
        tokens[2] = join(tokens[2]:tokens[3])
        token[3] = token[4]
        del token[4]

but the code isn't right...any ideas?

Again, thanks so much. I've gone to http://gnosis.cx/TPiP/ and I have a couple of O'Reilly books, but they don't seem to have a straightforward example for this kind of text manipulation.

Patrick

On Oct 14, 11:17 pm, John Machin <[EMAIL PROTECTED]> wrote:
> On Oct 14, 11:48 pm, [EMAIL PROTECTED] wrote:
> > Hi all,
> >
> > I started Python just a little while ago and I am stuck on something
> > that is really simple, but I just can't figure out.
> >
> > Essentially I need to take a text document with some chemical
> > information in Czech and organize it into another text file. The
> > information is always EINECS number, CAS, chemical name, and formula
> > in tables. I need to organize them into lines with | in between. So
> > it goes from:
> >
> > 200-763-1 71-73-8 nátrium-tiopentál C11H18N2O2S.Na
> >
> > to:
> >
> > 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na
> >
> > but if I have a chemical like: kyselina močová
> >
> > I get:
> > 200-720-7|69-93-2|kyselina|močová|C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál
> >
> > and then it is all off.
> >
> > How can I get Python to realize that a chemical name may have a space
> > in it?
>
> Your input file could be in one of THREE formats:
> (1) fields are separated by TAB characters (represented in Python by
>     the escape sequence '\t', and equivalent to '\x09')
> (2) fields are fixed width and padded with spaces
> (3) fields are separated by a random number of whitespace characters
>     (and can contain spaces).
>
> What makes you sure that you have format 3?
> You might like to try something like:
>
>     lines = open('your_file.txt').readlines()[:4]
>     print lines
>     print map(len, lines)
>
> This will print a *precise* representation of what is in the first
> four lines, plus their lengths. Please show us the output.
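John's diagnostic can be tried on any sample without a file at all (a sketch in current Python 3; the sample string is made up to resemble the data above): printing the list of lines shows tabs, BOMs, and newlines explicitly in the repr, which is exactly what settles the three-format question.

```python
# Print a *precise* representation of the first lines of a sample.
# The repr makes invisible characters (\t, \ufeff BOM, \n) visible.
sample = '\ufeff200-720-7\t69-93-2\nkyselina mocova  C5H4N4O3\n'
lines = sample.splitlines(True)[:4]
print(lines)                  # tabs and the BOM show up as escapes
print(list(map(len, lines)))  # field widths reveal fixed-width padding
```

If the first print shows `\t` between fields, the file is tab-separated (format 1) and a plain `split('\t')` would be enough.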
Re: Simple Text Processing Help
Wow, thank you all. All three work. To output correctly I needed to add:

    output.write("\r\n")

This is really a great help!! Because of my limited Python knowledge, I will need to figure out exactly how they work, for future text manipulation and for my own knowledge. Could you recommend some resources for this kind of text manipulation? Also, I conceptually get it, but would you mind walking me through:

    for tok in tokens:
        if NR_RE.match(tok) and len(chem) >= 4:
            chem[2:-1] = [' '.join(chem[2:-1])]
            yield chem
            chem = []
        chem.append(tok)

and

    for key, group in groupby(instream, unicode.isspace):
        if not key:
            yield "".join(group)

Thanks again,
Patrick

On Oct 15, 2:16 pm, Peter Otten <[EMAIL PROTECTED]> wrote:
> patrick.waldo wrote:
> > my sample input file looks like this (not organized, as you see it):
> >
> > 200-720-769-93-2
> > kyselina mocová C5H4N4O3
> >
> > 200-001-8 50-00-0
> > formaldehyd CH2O
> >
> > 200-002-3
> > 50-01-1
> > guanidínium-chlorid CH5N3.ClH
>
> Assuming that the records are always separated by blank lines and only
> the third field in a record may contain spaces, the following might work:
>
>     import codecs
>     from itertools import groupby
>
>     path = "c:\\text_samples\\chem_1_utf8.txt"
>     path2 = "c:\\text_samples\\chem_2.txt"
>
>     def fields(s):
>         parts = s.split()
>         return parts[0], parts[1], " ".join(parts[2:-1]), parts[-1]
>
>     def records(instream):
>         for key, group in groupby(instream, unicode.isspace):
>             if not key:
>                 yield "".join(group)
>
>     if __name__ == "__main__":
>         outstream = codecs.open(path2, 'w', 'utf8')
>         for record in records(codecs.open(path, "r", "utf8")):
>             outstream.write("|".join(fields(record)))
>             outstream.write("\n")
>
> Peter
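Peter's groupby trick can be walked through in isolation (a small sketch in current Python 3, where `str.isspace` plays the role of Python 2's `unicode.isspace`): groupby labels each run of consecutive lines with the key function's result, so blank lines (key True) act as separators and each run of non-blank lines (key False) is one record.

```python
from itertools import groupby

def records(lines):
    # Group consecutive lines by "is this line blank?"; each run of
    # non-blank lines is joined back into a single record string.
    for is_blank, group in groupby(lines, str.isspace):
        if not is_blank:
            yield "".join(group)

sample = ['200-720-7 69-93-2\n', 'kyselina mocova C5H4N4O3\n',
          '\n',
          '200-001-8 50-00-0\n', 'formaldehyd CH2O\n']
for rec in records(sample):
    print(repr(rec))
```

Each yielded record can then be split into fields knowing the first two and the last token are fixed, with everything in between belonging to the name.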
Re: Simple Text Processing Help
And now for something completely different... I've been reading up a bit about Python and Excel, and I quickly told the program to output to Excel quite easily. However, what if the input file were a Word document? I can't seem to find much information about parsing Word files. What could I add to make the same program work for a Word file? Again, thanks a lot.

And the Excel add-on...

    import codecs
    import re
    from win32com.client import Dispatch

    path = "c:\\text_samples\\chem_1_utf8.txt"
    path2 = "c:\\text_samples\\chem_2.txt"
    input = codecs.open(path, 'r', 'utf8')
    output = codecs.open(path2, 'w', 'utf8')

    NR_RE = re.compile(r'^\d+-\d+-\d+$')  # pattern for EINECS number
    tokens = input.read().split()

    def iter_elements(tokens):
        product = []
        for tok in tokens:
            if NR_RE.match(tok) and len(product) >= 4:
                product[2:-1] = [' '.join(product[2:-1])]
                yield product
                product = []
            product.append(tok)
        yield product

    xlApp = Dispatch("Excel.Application")
    xlApp.Visible = 1
    xlApp.Workbooks.Add()

    c = 1
    for element in iter_elements(tokens):
        xlApp.ActiveSheet.Cells(c, 1).Value = element[0]
        xlApp.ActiveSheet.Cells(c, 2).Value = element[1]
        xlApp.ActiveSheet.Cells(c, 3).Value = element[2]
        xlApp.ActiveSheet.Cells(c, 4).Value = element[3]
        c = c + 1

    xlApp.ActiveWorkbook.Close(SaveChanges=1)
    xlApp.Quit()
    xlApp.Visible = 0
    del xlApp

    input.close()
    output.close()
Problem Converting Word to UTF8 Text File
Hi all,

I'm trying to copy a bunch of Microsoft Word documents that have unicode characters into utf-8 text files. Everything works fine at the beginning: the Word documents get converted and new utf-8 text files with the same names get created. And then I try to copy the data and I keep on getting "TypeError: coercing to Unicode: need string or buffer, instance found". I'm probably copying the Word document wrong. What can I do?

Thanks,
Patrick

    import os, codecs, glob, shutil, win32com.client
    from win32com.client import Dispatch

    input = 'C:\\text_samples\\source\\*.doc'
    output_dir = 'C:\\text_samples\\source\\output'
    FileFormat = win32com.client.constants.wdFormatText

    for doc in glob.glob(input):
        doc_copy = shutil.copy(doc, output_dir)
        WordApp = Dispatch("Word.Application")
        WordApp.Visible = 1
        WordApp.Documents.Open(doc)
        WordApp.ActiveDocument.SaveAs(doc, FileFormat)
        WordApp.ActiveDocument.Close()
        WordApp.Quit()

    for doc in glob.glob(input):
        txt_split = os.path.splitext(doc)
        txt_doc = txt_split[0] + '.txt'
        txt_doc = codecs.open(txt_doc, 'w', 'utf-8')
        shutil.copyfile(doc, txt_doc)
Re: Problem Converting Word to UTF8 Text File
Indeed, the shutil.copyfile(doc, txt_doc) was causing the problem, for the reason you stated. So I changed it to this:

    for doc in glob.glob(input):
        txt_split = os.path.splitext(doc)
        txt_doc = txt_split[0] + '.txt'
        txt_doc_dir = os.path.join(input_dir, txt_doc)
        doc_dir = os.path.join(input_dir, doc)
        shutil.copy(doc_dir, txt_doc_dir)

However, I still cannot read the unicode from the Word file. If I take out the first for-statement, I get a bunch of garbled text, which isn't helpful. I would save them all manually, but I want to figure out how to do it in Python, since I'm just beginning. My intuition says the problem is with:

    FileFormat = win32com.client.constants.wdFormatText

because it converts fine to a text file, just not a utf-8 text file. How can I modify this, or is there another way to code this type of file conversion from *.doc to *.txt with unicode characters?

Thanks

On Oct 21, 7:02 pm, "Gabriel Genellina" <[EMAIL PROTECTED]> wrote:
> En Sun, 21 Oct 2007 13:35:43 -0300, <[EMAIL PROTECTED]> escribió:
>
> > Hi all,
> >
> > I'm trying to copy a bunch of microsoft word documents that have
> > unicode characters into utf-8 text files. Everything works fine at
> > the beginning. The word documents get converted and new utf-8 text
> > files with the same name get created. And then I try to copy the data
> > and I keep on getting "TypeError: coercing to Unicode: need string or
> > buffer, instance found". I'm probably copying the word document
> > wrong. What can I do?
>
> Always remember to provide the full traceback.
> Where do you get the error? In the last line, shutil.copyfile?
> If the file already contains the text in utf-8, and you just want to
> make a copy, use shutil.copy as before.
> (Or, why not tell Word to save the file using the .txt extension in
> the first place?)
> > for doc in glob.glob(input):
> >     txt_split = os.path.splitext(doc)
> >     txt_doc = txt_split[0] + '.txt'
> >     txt_doc = codecs.open(txt_doc, 'w', 'utf-8')
> >     shutil.copyfile(doc, txt_doc)
>
> copyfile expects path names as arguments, not a
> codecs-wrapped file-like object.
>
> --
> Gabriel Genellina
Re: Problem Converting Word to UTF8 Text File
That KB document was really helpful, but the problem still isn't solved. What's weird now is that the unicode characters become things like è in some odd conversion. However, I noticed that when I try to open the Word documents after I run the first for statement, Word gives me a File Conversion window asking how I want to encode the text. None of the unicode options retain the characters. Then I looked some more and found it has a Central European option, both ISO and Windows, which works perfectly, since the documents I am looking at are in Czech. Then I try to save the document in Word, and it says that if I try to save it as a text file I will lose the formatting! So I guess I'm back at the start. Judging from some internet searches, I'm not the only one having this problem. For some reason Word can only save as .doc, even though .txt can support the utf8 format with all these characters. Any ideas?

On Oct 22, 5:39 am, "Gabriel Genellina" <[EMAIL PROTECTED]> wrote:
> En Sun, 21 Oct 2007 15:32:57 -0300, <[EMAIL PROTECTED]> escribió:
>
> > However, I still cannot read the unicode from the Word file. If I take
> > out the first for-statement, I get a bunch of garbled text, which
> > isn't helpful. I would save them all manually, but I want to figure
> > out how to do it in Python, since I'm just beginning.
> >
> > My intuition says the problem is with
> >
> >     FileFormat=win32com.client.constants.wdFormatText
> >
> > because it converts fine to a text file, just not a utf-8 text file.
> > How can I modify this or is there another way to code this type of
> > file conversion from *.doc to *.txt with unicode characters?
>
> Ah! I thought you were getting the right file format.
> I can't test it now, but this KB document
> http://support.microsoft.com/kb/209186/en-us
> suggests you should use wdFormatUnicodeText when saving the document.
> What the MS docs call "unicode" when dealing with files is, in
> general, utf16.
> In this case, if you want to convert to utf8, the sequence would be:
>
>     f = open(original_filename, "rb")
>     udata = f.read().decode("utf16")
>     f.close()
>     f = open(new_filename, "wb")
>     f.write(udata.encode("utf8"))
>     f.close()
>
> --
> Gabriel Genellina
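Gabriel's utf16-to-utf8 sequence can be checked entirely in memory, without Word or any files (a sketch in current Python 3; the Czech sample string is illustrative):

```python
# Simulate what Word's "Unicode text" save would write: UTF-16 with a BOM.
utf16_bytes = 'kyselina mo\u010dov\u00e1'.encode('utf-16')

# Decode UTF-16 (the BOM is consumed automatically by the codec),
# then re-encode the text as UTF-8 for the output file.
text = utf16_bytes.decode('utf-16')
utf8_bytes = text.encode('utf-8')

print(utf8_bytes.decode('utf-8'))
```

The same decode/encode pair is what the file-based version does; the only difference is where the bytes come from and go to.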
Regular Expression
Hi,

I'm trying to learn regular expressions, but I am having trouble with this. I want to search a document that has mixed data; however, the last line of every entry has something like C5H4N4O3 or CH5N3.ClH. All of the letters are upper case and there will always be numbers and possibly one '.'. However, the below only gave me None:

    import os, codecs, re

    text = 'C:\\text_samples\\sample.txt'
    text = codecs.open(text,'r','utf-8')

    test = re.compile('\u+\d+\.')

    for line in text:
        print test.search(line)
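For comparison, a pattern that would actually match such formulas (a sketch, and only a rough heuristic; note that `\u` is not a regex character class, which is one reason the pattern above finds nothing; explicit classes like [A-Z] are what was intended):

```python
import re

# An uppercase letter, then letters/digits, optionally one '.' followed
# by more letters/digits, anchored at the end of the line.
FORMULA = re.compile(r'[A-Z][A-Za-z0-9]*(?:\.[A-Za-z0-9]+)?$')

for line in ['C5H4N4O3', 'CH5N3.ClH', 'kyselina mocova']:
    print(line, bool(FORMULA.search(line)))
```

This is only a heuristic: any capitalized trailing word would also match, so (as the replies note) it cannot fully distinguish a formula from an ordinary capitalized name.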
Re: Regular Expression
This is related to my last post (see: http://groups.google.com/group/comp.lang.python/browse_thread/thread/c333cbbb5d496584/998af2bb2ca10e88#998af2bb2ca10e88).

I have a text file with an EINECS number, a CAS number, a chemical name, and a chemical formula, always in this order. However, I realized as I ran my script that I had entries like:

    274-989-4 70892-58-9 diazotovaná kyselina 4-aminobenzénsulfónová,
    kopulovaná s farbiarskym moruovým (Chlorophora tinctoria) extraktom,
    komplexy so železom komplexy železa s produktami kopulácie diazotovanej
    kyseliny 4-aminobenzénsulfónovej s látkou registrovanou v Indexe farieb
    pod identifikačným číslom Indexu farieb, C.I. 75240.

which become:

    274-989-4|70892-58-9|diazotovaná kyselina 4-aminobenzénsulfónová,
    kopulovaná s farbiarskym moruovým (Chlorophora tinctoria) extraktom,
    komplexy so železom komplexy železa s produktami kopulácie diazotovanej
    kyseliny 4-aminobenzénsulfónovej s látkou registrovanou v Indexe farieb
    pod identifikačným číslom Indexu farieb, C.I.|75240.

The "C.I. 75240" is not a chemical formula, and there isn't one for this entry. So I want to add a regular expression for the chemical name, for an if statement that stipulates: if there is no chemical formula, move on. However, I must be getting confused by the regular expression tutorials I've been reading. Any ideas?

Original code:

    # For text files in a directory...
    # Analyzes a randomly organized UTF8 document with EINECS, CAS, Chemical,
    # and Chemical Formula into a document structured as
    # EINECS|CAS|Chemical|Chemical Formula.
    import os
    import codecs
    import re

    path = "C:\\text_samples\\text"           #folder with all text files
    path2 = "C:\\text_samples\\text\\output"  #output of all text files

    NR_RE = re.compile(r'^\d+-\d+-\d+$')      #pattern for EINECS number

    def iter_elements(tokens):
        product = []
        for tok in tokens:
            if NR_RE.match(tok) and len(product) >= 4:
                product[2:-1] = [' '.join(product[2:-1])]
                yield product
                product = []
            product.append(tok)
        yield product

    for text in os.listdir(path):
        input_text = os.path.join(path, text)
        output_text = os.path.join(path2, text)
        input = codecs.open(input_text, 'r', 'utf8')
        output = codecs.open(output_text, 'w', 'utf8')
        tokens = input.read().split()
        for element in iter_elements(tokens):
            #print '|'.join(element)
            output.write('|'.join(element))
            output.write("\r\n")
        input.close()
        output.close()

On Oct 23, 5:03 pm, Paul McGuire <[EMAIL PROTECTED]> wrote:
> On Oct 22, 5:29 pm, [EMAIL PROTECTED] wrote:
> [snip]
>
> If those are chemical symbols, then I guarantee that there will be
> lower case letters in the expression (like the "l" in "ClH").
>
> -- Paul
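To see what the grouping generator actually produces, here is a self-contained run over two invented sample records (the tokens are made up for illustration). Note that the final `yield` emits the last record without collapsing its middle fields into one name, which is the same split-last-line symptom reported later in the thread:

```python
import re

NR_RE = re.compile(r'^\d+-\d+-\d+$')  # EINECS-style number

def iter_elements(tokens):
    # Same logic as the posted generator: start a new record whenever
    # an EINECS-style token appears and the current record already has
    # its minimum four fields.
    product = []
    for tok in tokens:
        if NR_RE.match(tok) and len(product) >= 4:
            product[2:-1] = [' '.join(product[2:-1])]
            yield product
            product = []
        product.append(tok)
    yield product

# Invented sample: EINECS, CAS, name word(s), formula -- twice.
tokens = ('200-001-8 50-00-0 formaldehyde CH2O '
          '200-002-3 50-01-1 guanidine hydrochloride CH5N3.ClH').split()
for rec in iter_elements(tokens):
    print('|'.join(rec))
```

The first record comes out as four pipe-separated fields; the last keeps 'guanidine' and 'hydrochloride' as two separate fields because the trailing `yield` skips the join step.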
Re: Regular Expression
Marc, thank you for the example; it made me realize where I was getting things wrong. I didn't realize how specific I needed to be. Also, http://weitz.de/regex-coach/ really helped me test things out on this one. I realized I had some more exceptions like C18H34O2.1/2Cu, and I also realized I didn't really understand regular expressions (which I still don't, but I think it's getting better).

    FORMULA = re.compile(r'([A-Z][A-Za-z0-9]+\.?[A-Za-z0-9]+/?[A-Za-z0-9]+)')

This gets all chemical formulas like C14H28, C18H34O2.1/2Cu, and C8H17ClO2, i.e. a word that begins with a capital letter, followed by any number of upper or lower case letters and numbers, followed by a possible '.', followed by any number of upper or lower case letters and numbers, followed by a possible '/', followed by any number of upper or lower case letters and numbers. Say that five times fast!

So now I want to tell the program that if it finds the formula at the end then continue; otherwise, if it finds C.I. 75240 or any other type of word, it should not be broken by a | and should be lumped into the whole line. But now I get:

    Traceback (most recent call last):
      File "C:\Python24\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
        exec codeObject in __main__.__dict__
      File "C:\Documents and Settings\Patrick Waldo\My Documents\Python\WORD\try5-2-file-1-1.py", line 32, in ?
        input = codecs.open(input_text, 'r','utf8')
      File "C:\Python24\lib\codecs.py", line 666, in open
        file = __builtin__.open(filename, mode, buffering)
    IOError: [Errno 13] Permission denied: 'C:\\Documents and Settings\\Patrick Waldo\\Desktop\\decernis\\DAD\\EINECS_SK\\text\\output'

Ideas?

#For text files in a directory...
#Analyzes a randomly organized UTF8 document with EINECS, CAS, Chemical, and Chemical Formula
#into a document structured as EINECS|CAS|Chemical|Chemical Formula.
    import os
    import codecs
    import re

    path = "C:\\text"
    path2 = "C:\\text\\output"

    EINECS = re.compile(r'^\d\d\d-\d\d\d-\d$')
    FORMULA = re.compile(r'([A-Z][A-Za-z0-9]+\.?[A-Za-z0-9]+/?[A-Za-z0-9]+)')

    def iter_elements(tokens):
        product = []
        for tok in tokens:
            if EINECS.match(tok) and len(product) >= 4:
                if product[-1] == FORMULA.findall(tok):
                    product[2:-1] = [' '.join(product[2:-1])]
                    yield product
                    product = []
                else:
                    product[2:-1] = [' '.join(product[2:])]
                    yield product
                    product = []
            product.append(tok)
        yield product

    for text in os.listdir(path):
        input_text = os.path.join(path, text)
        output_text = os.path.join(path2, text)
        input = codecs.open(input_text, 'r', 'utf8')
        output = codecs.open(output_text, 'w', 'utf8')
        tokens = input.read().split()
        for element in iter_elements(tokens):
            output.write('|'.join(element))
            output.write("\r\n")
        input.close()
        output.close()
Re: Regular Expression
Finally I solved the problem, with some really minor things to tweak. I guess it's true that I had two problems working with regular expressions. Thank you all for your help. I really learned a lot on quite a difficult problem.

Final Code:

    #For text files in a directory...
    #Analyzes a randomly organized UTF8 document with EINECS, CAS, Chemical, and Chemical Formula
    #into a document structured as EINECS|CAS|Chemical|Chemical Formula.

    import os
    import codecs
    import re

    path = "C:\\text_samples\\text\\"
    path2 = "C:\\text_samples\\text\\output\\"

    EINECS = re.compile(r'^\d\d\d-\d\d\d-\d$')
    CAS = re.compile(r'^\d*-\d\d-\d$')
    FORMULA = re.compile(r'([A-Z][A-Za-z0-9]+\.?[A-Za-z0-9]+/?[A-Za-z0-9]+)')

    def iter_elements(tokens):
        product = []
        for tok in tokens:
            if EINECS.match(tok) and len(product) >= 4:
                match = re.match(FORMULA, product[-1])
                if match:
                    product[2:-1] = [' '.join(product[2:-1])]
                    yield product
                    product = []
                else:
                    product[2:-1] = [' '.join(product[2:])]
                    del product[-1]
                    yield product
                    product = []
            product.append(tok)
        yield product

    for text in os.listdir(path):
        input_text = os.path.join(path, text)
        output_text = os.path.join(path2, text)
        input = codecs.open(input_text, 'r', 'utf8')
        output = codecs.open(output_text, 'w', 'utf8')
        tokens = input.read().split()
        for element in iter_elements(tokens):
            output.write('|'.join(element))
            output.write("\r\n")
        input.close()
        output.close()
Problem--IOError: [Errno 13] Permission denied
Hi all, After sludging my way through many obstacles with this interesting puzzle of a text-parsing program, I found myself with one final error:

    Traceback (most recent call last):
      File "C:\Python24\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
        exec codeObject in __main__.__dict__
      File "C:\Documents and Settings\Patrick Waldo\My Documents\Python\WORD\try5-2-file-1-all patterns.py", line 77, in ?
        input = codecs.open(input_text, 'r','utf8')
      File "C:\Python24\lib\codecs.py", line 666, in open
        file = __builtin__.open(filename, mode, buffering)
    IOError: [Errno 13] Permission denied: 'C:\\text_samples\\test\\output'

The error doesn't stop the program from functioning as it should, except that the last line of every document gets split with | in between the words, which is just strange. I have no idea why either is happening, but perhaps they are related. Any ideas?

#For text files in a directory...
#Analyzes a randomly organized UTF8 document with EINECS, CAS, Chemical, and Chemical Formula
#into a document structured as EINECS|CAS|Chemical|Chemical Formula.
    import os
    import codecs
    import re

    path = "C:\\text_samples\\test\\"
    path2 = "C:\\text_samples\\test\\output\\"

    EINECS = re.compile(r'^\d\d\d-\d\d\d-\d$')
    FORMULA = re.compile(r'([A-Z][a-zA-Z0-9]*\.?[A-Za-z0-9]*/?[A-Za-z0-9]*)')
    FALSE_POS = re.compile(r'^[A-Z][a-z]{4,40}\)?\.?')
    FALSE_POS1 = re.compile(r'C\.I\..*')
    FALSE_POS2 = re.compile(r'vit.*')
    FALSE_NEG = re.compile(r'C\d+\.')

    def iter_elements(tokens):
        product = []
        for tok in tokens:
            if EINECS.match(tok) and len(product) >= 3:
                match = re.match(FORMULA, product[-1])
                match_false_pos = re.match(FALSE_POS, product[-1])
                match_false_pos1 = re.match(FALSE_POS1, product[-1])
                match_false_pos2 = re.match(FALSE_POS2, product[2])
                match_false_neg = re.match(FALSE_NEG, product[-1])
                if match_false_neg:
                    product[2:-1] = [' '.join(product[2:])]
                    del product[-1]
                    yield product
                    product = []
                elif match_false_pos:
                    product[2:-1] = [' '.join(product[2:])]
                    del product[-1]
                    yield product
                    product = []
                elif match:
                    product[2:-1] = [' '.join(product[2:-1])]
                    yield product
                    product = []
                elif match_false_pos1 or match_false_pos2:
                    product[2:-1] = [' '.join(product[2:])]
                    del product[-1]
                    yield product
                    product = []
                else:
                    product[2:-1] = [' '.join(product[2:])]
                    del product[-1]
                    yield product
                    product = []
            product.append(tok)
        yield product

    for text in os.listdir(path):
        input_text = os.path.join(path, text)
        output_text = os.path.join(path2, text)
        input = codecs.open(input_text, 'r', 'utf8')
        output = codecs.open(output_text, 'w', 'utf8')
        tokens = input.read().split()
        for element in iter_elements(tokens):
            output.write('|'.join(element))
            output.write("\r\n")
        input.close()
        output.close()
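For what it's worth, one plausible cause of the Errno 13 (judging from the paths in the traceback, where the output folder lives inside the folder being listed): os.listdir() returns subdirectory names too, so the loop eventually hands the 'output' directory itself to codecs.open(), and opening a directory fails. A minimal sketch of filtering the listing down to regular files, using a temporary layout invented for illustration:

```python
import os
import tempfile

# Invented layout: an input folder whose output subfolder lives
# inside it, like the paths in the post.
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, 'output'))
with open(os.path.join(root, 'a.txt'), 'w') as f:
    f.write('data')

# Keep only regular files before opening anything.
names = [n for n in os.listdir(root)
         if os.path.isfile(os.path.join(root, n))]
print(names)  # ['a.txt']; the 'output' directory is skipped
```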
Sorting Countries by Region
Hi all, I'm analyzing some data that has a lot of country data. What I need to do is sort through this data and output it into an Excel doc with summary information. The countries, though, need to be sorted by region, but the way I thought I could do it isn't quite working out. So far I can only successfully get the data alphabetically. Any ideas?

    import xlrd
    import pyExcelerator

    def get_countries_list(list):
        countries_list = []
        for country in countries:
            if country not in countries_list:
                countries_list.append(country)

    EU = ["Austria", "Belgium", "Cyprus", "Czech Republic", "Denmark", "Estonia", "Finland"]
    NA = ["Canada", "United States"]
    AP = ["Australia", "China", "Hong Kong", "India", "Indonesia", "Japan"]
    Regions_tot = {'European Union':EU, 'North America':NA, 'Asia Pacific':AP,}

    path_file = "c:\\1\country_data.xls"
    book = xlrd.open_workbook(path_file)
    Counts = book.sheet_by_index(1)
    countries = Counts.col_values(0, start_rowx=1, end_rowx=None)
    get_countries_list(countries)

    wb = pyExcelerator.Workbook()
    matrix = wb.add_sheet("matrix")
    n = 1
    for country in unique_countries:
        matrix.write(n, 1, country)
        n = n + 1
    wb.save('c:\\1\\matrix.xls')
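One way to get a region-then-alphabetical order is to invert the region lists into a country-to-region lookup and pass a tuple key to sorted(). This is a sketch using trimmed versions of the lists above, not the poster's code:

```python
# Trimmed region lists from the post, for illustration.
EU = ["Austria", "Belgium", "Denmark"]
NA = ["Canada", "United States"]
AP = ["Australia", "China", "Japan"]
regions = {'European Union': EU, 'North America': NA, 'Asia Pacific': AP}

# Invert to country -> region, then sort on (region, country).
region_of = {c: r for r, cs in regions.items() for c in cs}
countries = ['Canada', 'Austria', 'Japan', 'Belgium']
ordered = sorted(countries, key=lambda c: (region_of[c], c))
print(ordered)  # ['Japan', 'Austria', 'Belgium', 'Canada']
```

Here regions sort by their names ('Asia Pacific' < 'European Union' < 'North America'); a fixed region order could be imposed with a region -> rank dict instead.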
Re: Fwd: Sorting Countries by Region
Great, this is very helpful. I'm new to Python, so hence the inefficient or nonsensical code!

> 2) I would suggest using countries.sort(...) or sorted(countries,...),
> specifying cmp or key options to sort by region instead.

I don't understand how to do this. The countries.sort() lists alphabetically, and I tried to do a lambda x,y: cmp() type function, but it doesn't sort correctly. Help with that?

For martyw's example, I don't need to get any sort of population info. I'm actually getting the number of various types of documents. So the entry is like this:

    Argentina  Food and Consumer Products  Food Additives  Color Additives       1
    Argentina  Food and Consumer Products  Food Additives  Flavors               1
    Argentina  Food and Consumer Products  Food Additives  General               6
    Argentina  Food and Consumer Products  Food Additives  labeling              1
    Argentina  Food and Consumer Products  Food Additives  Prohibited Additives  1
    Argentina  Food and Consumer Products  Food Contact    Cellulose             1
    Argentina  Food and Consumer Products  Food Contact    Food Packaging        1
    Argentina  Food and Consumer Products  Food Contact    Plastics              4
    Argentina  Food and Consumer Products  Food Contact    Waxes                 1
    Belize     etc...

So I'll need to add up all the entries for Food Additives and Food Contact; the other info like Color Additives isn't important. So I will have an output like this:

               Food Additives  Food Contact
    Argentina  10              7
    Belize     etc...

Thanks so much for the help!
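The adding-up step described here can be done with a plain dictionary tally. A sketch over invented rows shaped like the entries above (country, topic, count):

```python
from collections import defaultdict

# Invented rows shaped like the entries above.
rows = [
    ('Argentina', 'Food Additives', 1),
    ('Argentina', 'Food Additives', 6),
    ('Argentina', 'Food Contact', 4),
    ('Argentina', 'Food Contact', 1),
    ('Belize', 'Food Additives', 2),
]

totals = defaultdict(int)           # missing keys start at 0
for country, topic, n in rows:
    totals[(country, topic)] += n

print(totals[('Argentina', 'Food Additives')])  # 7
print(totals[('Argentina', 'Food Contact')])    # 5
```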
Re: Fwd: Sorting Countries by Region
This is how I solved it last night, in my inefficient sort of way, after re-reading some of my Python books on dictionaries. So far this gets the job done. However, I'd like to test whether there are any countries in the Excel input that are not represented, i.e. that the input is all the information I have and the dictionary functions as the information I expect. What I did worked yesterday, but doesn't work anymore...see comment. Otherwise I tried doing this:

    for i, country in countries_list:
        if country in REGIONS_COUNTRIES['European Union']:
            matrix.write(i+2, 1, country)

but I got "ValueError: too many values to unpack". Again, this has been a great help. Any ideas of how I can make this a bit more efficient, as I'm dealing with 5 regions and numerous countries, would be greatly appreciated. Here's the code:

    #keeping all the countries short
    REGIONS_COUNTRIES = {'European Union': ["Austria", "Belgium", "France", "Germany", "Greece"],
                         'North America': ["Canada", "United States"]}

    path_file = "c:\\1\\build\\data\\matrix2\\Update_oct07a.xls"
    book = xlrd.open_workbook(path_file)
    Counts = book.sheet_by_index(1)

    wb = pyExcelerator.Workbook()
    matrix = wb.add_sheet("matrix")

    countries = Counts.col_values(0, start_rowx=1, end_rowx=None)
    countries_list = list(set(countries))
    countries_list.sort()

    #This seems to not work today and I don't know why
    #for country in countries_list:
        #if country not in REGIONS_COUNTRIES['European Union'] or not in REGIONS_COUNTRIES['North America']:
            #print "%s is not in the expected list", country

    #This sorts well
    n = 2
    for country in countries_list:
        if country in REGIONS_COUNTRIES['European Union']:
            matrix.write(n, 1, country)
            n = n + 1
    for country in countries_list:
        if country in REGIONS_COUNTRIES['North America']:
            matrix.write(n, 1, country)
            n = n + 1
    wb.save('c:\\1\\matrix.xls')

On Nov 17, 1:12 am, "Sergio Correia" <[EMAIL PROTECTED]> wrote:
> About the sort:
>
> Check this (also on http://pastebin.com/f12b5b6ca)
>
> def make_regions():
>
>     # Values you provided
>     EU = ["Austria", "Belgium", "Cyprus", "Czech Republic",
>           "Denmark", "Estonia", "Finland"]
>     NA = ["Canada", "United States"]
>     AP = ["Australia", "China", "Hong Kong", "India", "Indonesia",
>           "Japan"]
>     regions = {'European Union': EU, 'North America': NA, 'Asia Pacific': AP}
>
>     ans = {}
>     for reg_name, reg in regions.items():
>         for cou in reg:
>             ans[cou] = reg_name
>     return ans
>
> def cmp_region(cou1, cou2):
>     ans = cmp(regions[cou1], regions[cou2])
>     if ans:  # If the region is the same, sort by country
>         return cmp(cou1, cou2)
>     else:
>         return ans
>
> regions = make_regions()
> some_countries = ['Austria', 'Canada', 'China', 'India']
>
> print 'Old:', some_countries
> some_countries.sort(cmp_region)
> print 'New:', some_countries
>
> Why that code?
> Because the first thing I want is a dictionary where the key is the
> name of the country and the value is the region. Then, I just make a
> quick function that compares considering the region and country.
> Finally, I sort.
>
> Btw, the code is just a quick hack, as it can be improved -a lot-.
>
> About the rest of your code:
> - martyw's example is much more useful than you think. Why? Because
> you can just iterate across your document, adding the values you get
> to the adequate object property. That is, instead of using size or
> pop, use the variables you are interested in.
>
> Best, and good luck with python,
> Sergio
>
> On Nov 16, 2007 5:15 PM, <[EMAIL PROTECTED]> wrote:
> [snip]
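Two small fixes for the problems mentioned above, sketched with made-up lists: the ValueError comes from unpacking two variables from a plain list of strings, which enumerate() solves by supplying the index; and the commented-out membership check needs `not in` spelled out per container, joined with `and` (a country is unexpected only if it is in neither region):

```python
countries_list = ['Austria', 'Canada', 'France', 'Narnia']  # made up
REGIONS_COUNTRIES = {'European Union': ['Austria', 'France'],
                     'North America': ['Canada', 'United States']}

# enumerate() supplies the row index the bare unpacking was missing.
placed = []
for i, country in enumerate(countries_list):
    if country in REGIONS_COUNTRIES['European Union']:
        placed.append((i + 2, country))
print(placed)  # [(2, 'Austria'), (4, 'France')]

# Each 'not in' needs its own right-hand side.
unknown = [c for c in countries_list
           if c not in REGIONS_COUNTRIES['European Union']
           and c not in REGIONS_COUNTRIES['North America']]
print(unknown)  # ['Narnia']
```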
Yet Another Tabular Data Question
Hi all, Fairly new Python guy here. I am having a lot of trouble trying to figure this out. I have some data on some regulations in Excel, and I need to basically add up the total regulations for each country--a statistical analysis thing that I'll copy to another Excel file. Writing with pyExcelerator has been easier than reading with xlrd for me...So that's what I did first, but now I'd like to learn how to crunch some data. The input looks like this:

    Country    Module                      Topic           # of Docs
    Argentina  Food and Consumer Products  Cosmetics       1
    Argentina  Food and Consumer Products  Cosmetics       8
    Argentina  Food and Consumer Products  Food Additives  1
    Argentina  Food and Consumer Products  Food Additives  1
    Australia  Food and Consumer Products  Drinking Water  7
    Australia  Food and Consumer Products  Food Additives  3
    Australia  Food and Consumer Products  Food Additives  1
    etc...

So I need to add up all the docs for Argentina, Australia, etc., and add up the total amount for each Topic for each country, so Argentina has 9 Cosmetics laws and 2 Food Additives laws, etc. So, here is the reduced code that can't add anything...Any thoughts would be really helpful.

    import xlrd
    import pyExcelerator
    from pyExcelerator import *

    #Open Excel files for reading and writing
    path_file = "c:\\1\\data.xls"
    book = xlrd.open_workbook(path_file)
    Counts = book.sheet_by_index(1)
    wb = pyExcelerator.Workbook()
    matrix = wb.add_sheet("matrix")

    #Get all Excel data
    n = 1
    data = []
    while n
Pivot Table/Groupby/Sum question
Hi all, I tried reading http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/334695 on the same subject, but it didn't work for me. I'm trying to learn how to make pivot tables from some Excel sheets, and I am trying to abstract this into a simple sort of example. Essentially I want to take input data like this:

    Name  Time of day  Amount
    Bob   Morn         240
    Bob   Aft          300
    Joe   Morn         70
    Joe   Aft          80
    Jil   Morn         100
    Jil   Aft          150

And output it as:

    Name   Total  Morning  Afternoon
    Bob    540    240      300
    Joe    150    70       80
    Jil    250    100      150
    Total  940    410      530

Writing the output part is the easy part. However, I have a couple problems.

1) Grouping by name seems to work perfectly, but grouping by time does not, i.e. I will get:

    Bob
        240
        300
    Joe
        70
        80
    Jil
        100
        150

which is great, but...

    Morn
        240
    Aft
        300
    Morn
        70
    Aft
        80
    Morn
        100
    Aft
        150

and not

    Morn
        240
        70
        100
    Aft
        300
        80
        150

2) I can't figure out how to sum these values because of the iteration. I always get an error like: TypeError: iteration over non-sequence

Here's the code:

    from itertools import groupby

    data = [['Bob', 'Morn', 240], ['Bob', 'Aft', 300], ['Joe', 'Morn', 70],
            ['Joe', 'Aft', 80], ['Jil', 'Morn', 100], ['Jil', 'Aft', 150]]

    NAME, TIME, AMOUNT = range(3)

    for k, g in groupby(data, key=lambda r: r[NAME]):
        print k
        for record in g:
            print "\t", record[AMOUNT]

    for k, g in groupby(data, key=lambda r: r[TIME]):
        print k
        for record in g:
            print "\t", record[AMOUNT]

Thanks for any comments
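A sketch of the sorting point that comes up in the replies: groupby() only merges adjacent equal keys, so grouping by time needs the data sorted on that key first. Names only appeared to work because equal names happened to be adjacent in the input:

```python
from itertools import groupby

data = [['Bob', 'Morn', 240], ['Bob', 'Aft', 300], ['Joe', 'Morn', 70],
        ['Joe', 'Aft', 80], ['Jil', 'Morn', 100], ['Jil', 'Aft', 150]]
NAME, TIME, AMOUNT = range(3)

# groupby() starts a new group at every key change, so sort first.
by_time = sorted(data, key=lambda r: r[TIME])
sums = {}
for k, g in groupby(by_time, key=lambda r: r[TIME]):
    sums[k] = sum(r[AMOUNT] for r in g)
print(sums)
```

This yields 530 for 'Aft' and 410 for 'Morn', matching the column totals in the desired output.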
Re: Pivot Table/Groupby/Sum question
On Dec 27, 10:59 pm, John Machin <[EMAIL PROTECTED]> wrote:
> On Dec 28, 4:56 am, [EMAIL PROTECTED] wrote:
>
> > from itertools import groupby
>
> You seem to have overlooked this important sentence in the
> documentation: "Generally, the iterable needs to already be sorted on
> the same key function"

Yes, but I imagine this shouldn't prevent me from using and manipulating the data. It also doesn't explain why the names get sorted correctly and the time does not. I was trying to do this:

    count_tot = []
    for k, g in groupby(data, key=lambda r: r[NAME]):
        for record in g:
            count_tot.append((k, record[SALARY]))
    for i in count_tot:
        # here I want to add all the numbers for each person,
        # but I'm missing something

If you have any ideas about how to solve this pivot table issue, which seems to be scant on Google, I'd much appreciate it. I know I can do this in Excel easily with the automated wizard, but I want to know how to do it myself and format it to my needs.
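The missing summing step can be a dictionary keyed by name. A sketch over pairs shaped like count_tot (the sample amounts are taken from the earlier example data, not the poster's real sheet):

```python
# Pairs shaped like count_tot: (name, amount).
count_tot = [('Bob', 240), ('Bob', 300), ('Joe', 70), ('Joe', 80)]

totals = {}
for name, amount in count_tot:
    # dict.get supplies 0 the first time a name is seen.
    totals[name] = totals.get(name, 0) + amount
print(totals)  # {'Bob': 540, 'Joe': 150}
```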
Re: Pivot Table/Groupby/Sum question
Wow, I did not realize it would be this complicated! I'm fairly new to Python, and somehow I thought I could find a simpler solution. I'll have to mull over this for a bit to fully understand how it works. Thanks a lot!

On Dec 28, 4:03 am, John Machin <[EMAIL PROTECTED]> wrote:
> On Dec 28, 11:48 am, John Machin <[EMAIL PROTECTED]> wrote:
>
> > On Dec 28, 10:05 am, [EMAIL PROTECTED] wrote:
>
> > > If you have any ideas about how to solve this pivot table issue, which
> > > seems to be scant on Google, I'd much appreciate it. I know I can do
> > > this in Excel easily with the automated wizard, but I want to know how
> > > to do it myself and format it to my needs.
>
> > Watch this space.
>
> Tested as much as you see:
>
> 8<---
> class SimplePivotTable(object):
>
>     def __init__(
>         self,
>         row_order=None, col_order=None,  # see example
>         missing=0,  # what to return for an empty cell. Alternatives: '', 0.0, None, 'NULL'
>         ):
>         self.row_order = row_order
>         self.col_order = col_order
>         self.missing = missing
>         self.cell_dict = {}
>         self.row_total = {}
>         self.col_total = {}
>         self.grand_total = 0
>         self.headings_OK = False
>
>     def add_item(self, row_key, col_key, value):
>         self.grand_total += value
>         try:
>             self.col_total[col_key] += value
>         except KeyError:
>             self.col_total[col_key] = value
>         try:
>             self.cell_dict[row_key][col_key] += value
>             self.row_total[row_key] += value
>         except KeyError:
>             try:
>                 self.cell_dict[row_key][col_key] = value
>                 self.row_total[row_key] += value
>             except KeyError:
>                 self.cell_dict[row_key] = {col_key: value}
>                 self.row_total[row_key] = value
>
>     def _process_headings(self):
>         if self.headings_OK:
>             return
>         self.row_headings = self.row_order or list(sorted(self.row_total.keys()))
>         self.col_headings = self.col_order or list(sorted(self.col_total.keys()))
>         self.headings_OK = True
>
>     def get_col_headings(self):
>         self._process_headings()
>         return self.col_headings
>
>     def generate_row_info(self):
>         self._process_headings()
>         for row_key in self.row_headings:
>             row_dict = self.cell_dict[row_key]
>             row_vals = [row_dict.get(col_key, self.missing) for col_key in self.col_headings]
>             yield row_key, self.row_total[row_key], row_vals
>
>     def get_col_totals(self):
>         self._process_headings()
>         row_dict = self.col_total
>         row_vals = [row_dict.get(col_key, self.missing) for col_key in self.col_headings]
>         return self.grand_total, row_vals
>
> if __name__ == "__main__":
>
>     data = [
>         ['Bob', 'Morn', 240],
>         ['Bob', 'Aft', 300],
>         ['Joe', 'Morn', 70],
>         ['Joe', 'Aft', 80],
>         ['Jil', 'Morn', 100],
>         ['Jil', 'Aft', 150],
>         ['Bob', 'Aft', 40],
>         ['Bob', 'Aft', 5],
>         ['Dozy', 'Aft', 1],  # Dozy doesn't show up till lunch-time
>         ]
>     NAME, TIME, AMOUNT = range(3)
>
>     print
>     ptab = SimplePivotTable(
>         col_order=['Morn', 'Aft'],
>         missing='uh-oh',
>         )
>     for s in data:
>         ptab.add_item(row_key=s[NAME], col_key=s[TIME], value=s[AMOUNT])
>     print ptab.get_col_headings()
>     for x in ptab.generate_row_info():
>         print x
>     print 'Tots', ptab.get_col_totals()
> 8<---
Re: Pivot Table/Groupby/Sum question
Petr, thanks for the SQL suggestion, but I'm having enough trouble in Python. John, would you mind walking me through your class in normal speak? I only have a vague idea of why it works, and this would help me a lot to get a grip on classes and this sort of particular problem. The next step is to imagine there was another variable, like departments, and add up the information by name, department, and time, and so on...that will come another day. Thanks.

On Dec 29, 1:00 am, John Machin <[EMAIL PROTECTED]> wrote:
> On Dec 29, 9:58 am, [EMAIL PROTECTED] wrote:
>
> > What about letting SQL work for you?
>
> The OP is "trying to learn how to make pivot tables from some excel
> sheets". You had better give him a clue on how to use ODBC on an
> "excel sheet" :-)
>
> [snip]
>
> > SELECT
> >     NAME,
> >     sum(AMOUNT) as TOTAL,
> >     sum(case when (TIME_OF_DAY) = 'Morn' then AMOUNT else 0 END) as MORN,
> >     sum(case when (TIME_OF_DAY) = 'Aft' then AMOUNT else 0 END) as AFT
>
> This technique requires advance knowledge of what the column key
> values are (the hard-coded 'Morn' and 'Aft').
>
> It is the sort of thing that one sees when %SQL% is the *only*
> language used to produce end-user reports. Innocuous when there are
> only 2 possible columns, but bletchworthy when there are more than 20
> and the conditions are complex and the whole thing is replicated
> several times in the %SQL% script because either %SQL% doesn't support
> temporary procedures/functions or the BOsFH won't permit their use...
> not in front of the newbies, please!
Re: Pivot Table/Groupby/Sum question
On Dec 29, 3:00 pm, [EMAIL PROTECTED] wrote:
> Patrick,
>
> in your first posting you wrote "... I'm trying to learn how to
> make pivot tables from some excel sheets...". Can you be more specific,
> please? AFAIK Excel offers very good support for pivot tables, so why
> read tabular data from the Excel sheet and then transform it to a
> pivot table in Python?
>
> Petr

Yes, I realize Excel has excellent support for pivot tables. However, I hate how Excel does it, and, for my particular Excel files, I need them to be formatted in an automated way, because I will have a number of them over time, and I'd prefer to have Python do it in a flash rather than doing it every time in Excel.

> It's about time you got a *concrete* idea of how something works.

Absolutely right. I tend to take on ideas that I'm not ready for, in the sense that I only started using Python some months ago for some basic tasks, and now I'm trying some more complicated ones. With time, though, I will get a concrete idea of what python.exe does; but, for someone who studied art history and not comp sci, I'm doing my best to get a handle on all of it. I think a pad of paper might be a good way to visualize it.
Re: Pivot Table/Groupby/Sum question
Sorry for the delay in my response. New Year's Eve and moving apartment.

> - Where do the data come from (I mean: are your data in Excel already
> when you get them)?
> - If your primary source of data is the Excel file, how do you read
> data from the Excel file into Python (I mean, did you solve this part
> of the task already)?

Yes, the data comes from Excel, and I use xlrd and pyExcelerator to read and write, respectively.

    #open for reading
    path_file = "c:\\1\\data.xls"
    book = xlrd.open_workbook(path_file)
    Counts = book.sheet_by_index(1)

    #get data
    n = 1
    data = []
    while n
Re: Pivot Table/Groupby/Sum question
Yes, in the sense that the top part will have merged cells so that Horror and Classics don't need to be repeated every time, but the headers aren't the important part. At this point I'm more interested in organizing the data itself, and I can worry about putting it into a new Excel file later.
Re: Pivot Table/Groupby/Sum question
Petr, thanks so much for your input. I'll try to learn SQL, especially if I'll be doing a lot of database work. I tried to do it John's way as an exercise, and I'm happy to say I understand a lot more. Basically I didn't realize I could nest dictionaries like db = {country:{genre:{sub_genre:3}}} and call them like db[country][genre][sub_genre]. The Python Cookbook was quite helpful for figuring out why items needed to be added the way they did. Also, using the structure of the dictionary was a conceptually easier solution than what I found on http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/334695. So, now I need to work on writing it to Excel. I'll update with the final code. Thanks again.

    #Movie Store Example
    class PivotData:
        def __init__(self):
            self.total_mov = 0
            self.total_cou = {}
            self.total_gen = {}
            self.total_sub = {}
            self.total_cou_gen = {}
            self.db = {}

        def add_data(self, country, genre, sub_genre, value):
            self.total_mov += value
            try:
                self.total_cou[country] += value
            except KeyError:
                self.total_cou[country] = value
            try:
                self.total_gen[genre] += value
            except:
                self.total_gen[genre] = value
            try:
                self.total_sub[sub_genre] += value
            except:
                self.total_sub[sub_genre] = value
            try:
                self.total_cou_gen[country][genre] += value
            except KeyError:
                try:
                    self.total_cou_gen[country][genre] = value
                except KeyError:
                    self.total_cou_gen[country] = {genre: value}
            try:
                self.db[country][genre][sub_genre] += value
            except KeyError:
                try:
                    self.db[country][genre][sub_genre] = value
                except KeyError:
                    try:
                        self.db[country][genre] = {sub_genre: value}
                    except:
                        self.db[country] = {genre: {sub_genre: value}}

    data = [['argentina', 'Horror', 'Slasher', 4],
            ['argentina', 'Horror', 'Halloween', 6],
            ['argentina', 'Drama', 'Romance', 5],
            ['argentina', 'Drama', 'Romance', 1],
            ['argentina', 'Drama', 'True Life', 1],
            ['japan', 'Classics', 'WWII', 1],
            ['japan', 'Cartoons', 'Anime', 1],
            ['america', 'Comedy', 'Stand-Up', 1],
            ['america', 'Cartoons', 'WB', 10],
            ['america', 'Cartoons', 'WB', 3]]

    COUNTRY, GENRE, SUB_GENRE, VALUE = range(4)

    x = PivotData()
    for s in data:
        x.add_data(s[COUNTRY], s[GENRE], s[SUB_GENRE], s[VALUE])

    print
    print 'Total Movies:\n', x.total_mov
    print 'Total for each country\n', x.total_cou
    print 'Total Genres\n', x.total_gen
    print 'Total Sub Genres\n', x.total_sub
    print 'Total Genres for each Country\n', x.total_cou_gen
    print
    print x.db
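For comparison, the nested try/except ladders above can be collapsed with collections.defaultdict. This is an alternative sketch over a slice of the same movie data, not the poster's code:

```python
from collections import defaultdict

# Nested country -> genre -> sub_genre tallies, no KeyError handling.
db = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

data = [['argentina', 'Horror', 'Slasher', 4],
        ['argentina', 'Horror', 'Halloween', 6],
        ['america', 'Cartoons', 'WB', 10],
        ['america', 'Cartoons', 'WB', 3]]
for country, genre, sub_genre, value in data:
    db[country][genre][sub_genre] += value

print(db['america']['Cartoons']['WB'])  # 13
```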
pyExcelerator: writing multiple rows
Hi all, I was just curious whether there is a built-in or more efficient way to take multiple rows of information and write them into Excel using pyExcelerator. This is how I resolved the problem:

    from pyExcelerator import *

    data = [[1, 2, 3], [4, 5, 'a'], ['', 's'], [6, 7, 'g']]

    wb = pyExcelerator.Workbook()
    test = wb.add_sheet("test")

    c = 1
    r = 0
    while r
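A sketch of the nested-loop shape this usually takes. The `write` function here is a stand-in for pyExcelerator's `sheet.write(row, col, value)` so the snippet runs on its own; `enumerate` replaces the manual row/column counters:

```python
data = [[1, 2, 3], [4, 5, 'a'], ['', 's'], [6, 7, 'g']]

cells = []
def write(r, c, value):
    # Stand-in for sheet.write(row, col, value).
    cells.append((r, c, value))

# enumerate() yields (index, item), so no counters to maintain.
for r, row in enumerate(data):
    for c, value in enumerate(row):
        write(r, c, value)

print(len(cells))  # 11 cells written
```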
Re: joining strings question
I tried to make a simple abstraction of my problem, but it's probably better to get down to it. As for the funkiness of the data, I'm relatively new to Python, and I'm either not processing it well or it's because of BeautifulSoup. Basically, I'm using BeautifulSoup to strip the tables from the Federal Register (http://www.access.gpo.gov/su_docs/aces/fr-cont.html). So far my code strips the HTML and gets only the departments I'd like to see. Now I need to put it into an Excel file (with pyExcelerator) with the name of the record and the pdf. A snippet of my data from BeautifulSoup looks like this:

    ['Environmental Protection Agency', 'RULES',
     'Approval and Promulgation of Air Quality Implementation Plans:',
     'Illinois; Revisions to Emission Reduction Market System, ',
     '11042 [E8-3800]', 'E8-3800.pdf',
     'Ohio; Oxides of Nitrogen Budget Trading Program; Correction, ',
     '11192 [Z8-2506]', 'Z8-2506.pdf',
     'NOTICES',
     'Agency Information Collection Activities; Proposals, Submissions, and Approvals, ',
     '11108-0 [E8-3934]', 'E8-3934.pdf',
     'Data Availability for Lead National Ambient Air Quality Standard Review, ',
     '0-1 [E8-3935]', 'E8-3935.pdf',
     'Environmental Impacts Statements; Notice of Availability, ',
     '2 [E8-3917]', 'E8-3917.pdf']

What I'd like to see in Excel is this:

    'Approval and Promulgation of Air Quality Implementation Plans: Illinois; Revisions to Emission Reduction Market System, 11042 [E8-3800]' | 'E8-3800.pdf' | RULES
    'Ohio; Oxides of Nitrogen Budget Trading Program; Correction, 11192 [Z8-2506]' | 'Z8-2506.pdf' | RULES
    'Agency Information Collection Activities; Proposals, Submissions, and Approvals, 11108-0 [E8-3934]' | 'E8-3934.pdf' | NOTICES
    'Data Availability for Lead National Ambient Air Quality Standard Review, 0-1 [E8-3935]' | 'E8-3935.pdf' | NOTICES
    'Environmental Impacts Statements; Notice of Availability, 2 [E8-3917]' | 'E8-3917.pdf' | NOTICES

etc...for every department I want.
Now that I look at it I've got another problem, because 'Approval and Promulgation of Air Quality Implementation Plans:' should be joined to both Illinois and Ohio... I love finding these little inconsistencies! Once I get the data organized with all the titles joined together appropriately, outputting it to Excel should be relatively easy. So my problem is how to join these titles together. There are a couple of patterns. Every law is followed by a number, which is always followed by the pdf. Any ideas would be much appreciated. My code so far (excuse the ugliness):

    import urllib
    import re, codecs, os
    import pyExcelerator
    from pyExcelerator import *
    from BeautifulSoup import BeautifulSoup as BS

    #Get the url, make the soup, and get the table to be processed
    url = "http://www.access.gpo.gov/su_docs/aces/fr-cont.html"
    site = urllib.urlopen(url)
    soup = BS(site)
    body = soup('table')[1]
    tds = body.findAll('td')
    mess = []
    for td in tds:
        mess.append(str(td))

    spacer = re.compile(r'.*')
    data = []
    x = 0
    for n, t in enumerate(mess):
        if spacer.match(t):
            data.append(mess[x:n])
            x = n

    dept = re.compile(r'.*')
    title = re.compile(r'.*')
    title2 = re.compile(r'.*')
    none = re.compile(r'None')

    #Strip the html and organize by department
    group = []
    db_list = []
    for d in data:
        pre_list = []
        for item in d:
            if dept.match(item):
                dept_soup = BS(item)
                try:
                    dept_contents = dept_soup('a')[0]['name']
                    pre_list.append(str(dept_contents))
                except IndexError:
                    break
            elif title.match(item) or title2.match(item):
                title_soup = BS(item)
                title_contents = title_soup.td.string
                if none.match(str(title_contents)):
                    pre_list.append(str(title_soup('a')[0]['href']))
                else:
                    pre_list.append(str(title_contents))
            elif link.match(item):
                link_soup = BS(item)
                link_contents = link_soup('a')[1]['href']
                pre_list.append(str(link_contents))
        db_list.append(pre_list)

    for db in db_list:
        for n, dash_space in enumerate(db):
            dash_space = dash_space.replace('–', '-')
            dash_space = dash_space.replace(' ', ' ')
            db[n] = dash_space

    download = re.compile(r'http://.*')
    for db in db_list:
        for n, pdf in enumerate(db):
            if download.match(pdf):
                filename = re.split('http://.*/', pdf)
                db[n] = filename[1]

    #Strip out these departments
    AgrDep = re.compile(r'Agriculture Department')
    EPA = re.compile(r'Environmental Protection Agency')
    FDA = re.compile(r'Food and Drug Administration')
    key_data = []
    for list in db_list:
        for db in list:
            if AgrDep.match(db) or EPA.match(db) or FDA.match(db):
                key_data.append(list)

    #Get appropriate links from covered departments as well
    LINK = re.compile(r'^#.*')
    links = []
    for kd in key_data:
        for item in kd:
            if LINK.match(item):
                links.append(item[1:])

    for list in db_list:
joining strings question
Hi all, I have some data with categories, titles, subtitles, and a link to their pdf, and I need to join the title and the subtitle for every file and divide them into their separate groups. The data comes in like this:

data = ['RULES', 'title', 'subtitle', 'pdf', 'title1', 'subtitle1', 'pdf1', 'NOTICES', 'title2', 'subtitle2', 'pdf', 'title3', 'subtitle3', 'pdf']

What I'd like to see is this:

['RULES', 'title subtitle', 'pdf', 'title1 subtitle1', 'pdf1'], ['NOTICES', 'title2 subtitle2', 'pdf', 'title3 subtitle3', 'pdf'], etc...

I've racked my brain for a while about this and I can't seem to figure it out. Any ideas would be much appreciated. Thanks
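One way to think about it: the category markers ('RULES', 'NOTICES') each start a new group, and within a group the items repeat in a fixed title/subtitle/pdf rhythm, so plain index arithmetic is enough. A minimal sketch along those lines (assuming every group really does follow that strict three-item pattern, which the follow-up posts show is not always true of the real data):

```python
data = ['RULES', 'title', 'subtitle', 'pdf',
        'title1', 'subtitle1', 'pdf1',
        'NOTICES', 'title2', 'subtitle2', 'pdf',
        'title3', 'subtitle3', 'pdf']

CATEGORIES = {'RULES', 'NOTICES'}

# Pass 1: split the flat list into one sublist per category
groups = []
for item in data:
    if item in CATEGORIES:
        groups.append([item])
    else:
        groups[-1].append(item)

# Pass 2: inside each group, merge each title/subtitle pair and keep the pdf
result = []
for g in groups:
    merged = [g[0]]
    for i in range(1, len(g), 3):
        merged.append(g[i] + ' ' + g[i + 1])  # title + subtitle
        merged.append(g[i + 2])               # pdf
    result.append(merged)

print(result[0])  # ['RULES', 'title subtitle', 'pdf', 'title1 subtitle1', 'pdf1']
```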
Re: joining strings question
> def category_iterator(source):
>     source = iter(source)
>     try:
>         while True:
>             item = source.next()

This gave me a lot of inspiration. After a couple of days of banging my head against the wall, I finally figured out code that could attach headers, titles, numbers, and categories in their appropriate combinations: basically one BIG logic puzzle. It's not the prettiest thing in the world, but it works. If anyone has a better way to do it, I'm all ears. Anyway, thank you all for your input; it helped me think outside the box.

    import re

    data = ['RULES', 'Approval and Promulgation of Air Quality Implementation Plans:', 'Illinois; Revisions to Emission Reduction Market System, ', '11042 [E8-3800]', 'E8-3800.pdf', 'Ohio; Oxides of Nitrogen Budget Trading Program; Correction, ', '11192 [Z8-2506]', 'Z8-2506.pdf', 'NOTICES', 'Agency Information Collection Activities; Proposals, Submissions, and Approvals, ', '11108-0 [E8-3934]', 'E8-3934.pdf', 'Data Availability for Lead National Ambient Air Quality Standard Review, ', '0-1 [E8-3935]', 'E8-3935.pdf', 'Environmental Impacts Statements; Notice of Availability, ', '2 [E8-3917]', 'E8-3917.pdf']

    NOTICES = re.compile(r'NOTICES')
    RULES = re.compile(r'RULES')
    TITLE = re.compile(r'[A-Z][a-z].*')
    NUM = re.compile(r'\d.*')
    PDF = re.compile(r'.*\.pdf')

    counted = []
    sorted = []
    title = []
    tot = len(data)
    x = 0
    while x < tot:
        try:
            item = data[x]
            title = []
            if NOTICES.match(item) or RULES.match(item):
                module = item
                header = ''
                if TITLE.match(data[x+1]) and TITLE.match(data[x+2]) and NUM.match(data[x+3]):
                    #Header
                    header = data[x+1]
                    counted.append(data[x+1])
                    sorted.append(data[x+1])
                    #Title
                    counted.append(data[x+2])
                    sorted.append(data[x+2])
                    #Number
                    counted.append(data[x+3])
                    sorted.append(data[x+3])
                    title.append(''.join(sorted))
                    print title, module
                    print
                    sorted = []
                    x += 1
                elif TITLE.match(data[x+1]) and NUM.match(data[x+2]):
                    #Title
                    counted.append(data[x+1])
                    sorted.append(data[x+1])
                    #Number
                    counted.append(data[x+2])
                    sorted.append(data[x+2])
                    title.append(''.join(sorted))
                    print title, module
                    print
                    sorted = []
                    x += 1
                else:
                    print item, "strange1"
                    break
                x += 1
            else:
                if item in counted:
                    x += 1
                elif PDF.match(item):
                    x += 1
                elif TITLE.match(data[x]) and TITLE.match(data[x+1]) and NUM.match(data[x+2]):
                    #Header
                    header = data[x]
                    counted.append(data[x])
                    sorted.append(data[x])
                    #Title
                    counted.append(data[x+1])
                    sorted.append(data[x+1])
                    #Number
                    counted.append(data[x+2])
                    sorted.append(data[x+2])
                    title.append(''.join(sorted))
                    sorted = []
                    print title, module
                    print
                    x += 1
                elif TITLE.match(data[x]) and NUM.match(data[x+1]):
                    #Title
                    sorted.append(header)
                    counted.append(data[x])
                    sorted.append(data[x])
                    #Number
                    counted.append(data[x+1])
                    sorted.append(data[x+1])
                    title.append(''.join(sorted))
                    sorted = []
                    print title, module
                    print
                    x += 1
                else:
                    print item, "strange2"
                    x += 1
                    break
        except IndexError:
            break
UnicodeDecodeError quick question
Hi Everyone, I am using Python 2.4 and I am converting an Excel spreadsheet to a pipe-delimited text file; some of the cells contain UTF-8 characters. I solved this problem in a very unintuitive way and I wanted to ask why. If I do

    csvfile.write(cell.encode("utf-8"))

I get a UnicodeDecodeError. However, if I do

    c = unicode(cell.encode("utf-8"), "utf-8")
    csvfile.write(c)

it works. Why should I have to encode the cell to UTF-8 and then make it unicode in order to write to a text file? Is there a more intuitive way to get around these bothersome unicode errors? Thanks for any advice, Patrick

Code:

    # -*- coding: utf-8 -*-
    import xlrd, codecs, os

    xls_file = "/home/pwaldo2/work/docpool_plone/2008-12-4/EU-2008-12-4.xls"
    book = xlrd.open_workbook(xls_file)
    bibliography_sheet = book.sheet_by_index(0)
    csv = os.path.split(xls_file)[0] + '/' + os.path.split(xls_file)[1][:-4] + '.csv'
    csvfile = codecs.open(csv, 'w', encoding='utf-8')
    rowcount = 0
    data = []
    while rowcount
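For what it's worth, the usual advice in Python 2 is to keep text as unicode inside the program and let the codecs wrapper do the encoding on the way out: writing an already-encoded byte string to a file opened with codecs.open triggers an implicit ASCII decode, which is where the UnicodeDecodeError comes from. A minimal sketch of the decode-once-then-write pattern (the file path and sample bytes here are made up for illustration):

```python
import codecs, os, tempfile

# Pretend these bytes came from a spreadsheet cell: UTF-8 for u'café'
raw = b'caf\xc3\xa9'

# Decode bytes -> unicode once, up front...
text = raw.decode('utf-8')

# ...then write unicode; codecs.open encodes it to UTF-8 on the way out
path = os.path.join(tempfile.gettempdir(), 'unicode_demo.csv')
f = codecs.open(path, 'w', encoding='utf-8')
f.write(text + u'|next field\n')
f.close()
```

The key point is that encode and decode each happen exactly once, in one direction, instead of round-tripping through encode-then-unicode as in the post.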
xlrd cell background color
Hi all, I am trying to figure out a way to read colors with xlrd, but I did not understand the formatting.py module. Basically, I want to sort rows that are red or green. My initial attempt discovered that

>>> print cell
text:u'test1.txt' (XF:22)
text:u'test2.txt' (XF:15)
text:u'test3.txt' (XF:15)
text:u'test4.txt' (XF:15)
text:u'test5.txt' (XF:23)

So I thought that XF:22 represented my red highlighted row and XF:23 represented my green highlighted row. However, that was not always true. If one row is blank and I only highlighted one row, I got:

>>> print cell
text:u'test1.txt' (XF:22)
text:u'test2.txt' (XF:22)
text:u'test3.txt' (XF:22)
text:u'test4.txt' (XF:22)
text:u'test5.txt' (XF:22)
empty:'' (XF:15)
text:u'test6.txt' (XF:22)
text:u'test7.txt' (XF:23)

Now NoFill is XF:22! I am sure I am going about this the wrong way, but I just want to store filenames into a dictionary based on whether they are red or green. Any ideas would be much appreciated. My code is below.

Best, Patrick

    filenames = {}
    filenames.setdefault('GREEN', [])
    filenames.setdefault('RED', [])
    book = xlrd.open_workbook("/home/pwaldo2/work/workbench/Summary.xls", formatting_info=True)
    SumDoc = book.sheet_by_index(0)
    n = 1
    while n
        cell = SumDoc.cell(n, 5)
        print cell
        filename = str(cell)[7:-9]
        color = str(cell)[-3:-1]
        if color == '22':
            filenames['RED'].append(filename)
            n += 1
        elif color == '23':
            filenames['GREEN'].append(filename)
            n += 1
Re: xlrd cell background color
Thank you very much. I did not know there was a python-excel group, which I will certainly take note of in the future. The previous post answered my question, but I wanted to clarify the difference between xf.background.background_colour_index, xf.background.pattern_colour_index, and book.colour_map:

>>> color = xf.background.background_colour_index
>>> print color
60
60
60
65
65
65
49

60 = red and 49 = green

>>> color = xf.background.pattern_colour_index
>>> print color
10
10
10
64
64
64
11

10 = red, 11 = green

>>> print book.colour_map
{0: (0, 0, 0), 1: (255, 255, 255), 2: (255, 0, 0), 3: (0, 255, 0),
 4: (0, 0, 255), 5: (255, 255, 0), 6: (255, 0, 255), 7: (0, 255, 255),
 8: (0, 0, 0), 9: (255, 255, 255), 10: (255, 0, 0), 11: (0, 255, 0),
 12: (0, 0, 255), 13: (255, 255, 0), 14: (255, 0, 255), 15: (0, 255, 255),
 16: (128, 0, 0), 17: (0, 128, 0), 18: (0, 0, 128), 19: (128, 128, 0),
 20: (128, 0, 128), 21: (0, 128, 128), 22: (192, 192, 192), 23: (128, 128, 128),
 24: (153, 153, 255), 25: (153, 51, 102), 26: (255, 255, 204), 27: (204, 255, 255),
 28: (102, 0, 102), 29: (255, 128, 128), 30: (0, 102, 204), 31: (204, 204, 255),
 32: (0, 0, 128), 33: (255, 0, 255), 34: (255, 255, 0), 35: (0, 255, 255),
 36: (128, 0, 128), 37: (128, 0, 0), 38: (0, 128, 128), 39: (0, 0, 255),
 40: (0, 204, 255), 41: (204, 255, 255), 42: (204, 255, 204), 43: (255, 255, 153),
 44: (153, 204, 255), 45: (255, 153, 204), 46: (204, 153, 255), 47: (255, 204, 153),
 48: (51, 102, 255), 49: (51, 204, 204), 50: (153, 204, 0), 51: (255, 204, 0),
 52: (255, 153, 0), 53: (255, 102, 0), 54: (102, 102, 153), 55: (150, 150, 150),
 56: (0, 51, 102), 57: (51, 153, 102), 58: (0, 51, 0), 59: (51, 51, 0),
 60: (153, 51, 0), 61: (153, 51, 102), 62: (51, 51, 153), 63: (51, 51, 51),
 64: None, 65: None, 81: None, 32767: None}

After looking at the colors, OpenOffice says I am using 'light red' for the first 3 rows and 'light green' for the last one, so how the numbers change for the first two examples makes sense.
However, how the numbers change for book.colour_map does not make much sense to me, since the numbers change without an apparent pattern. Could you clarify?

Best, Patrick

Revised Code:

    import xlrd

    filenames = {}
    filenames.setdefault('GREEN', [])
    filenames.setdefault('RED', [])
    book = xlrd.open_workbook("/home/pwaldo2/work/workbench/Summary.xls", formatting_info=True)
    SumDoc = book.sheet_by_index(0)
    print book.colour_map
    n = 1
    while n

... wrote:
> On Aug 14, 6:03 am, [EMAIL PROTECTED] wrote in
> news:comp.lang.python thusly:
>
> > Hi all,
> >
> > I am trying to figure out a way to read colors with xlrd, but I did
> > not understand the formatting.py module.
>
> It is complicated, because it is digging out complicated info which
> varies in somewhat arbitrary fashion between the 5 (approx.) versions
> of Excel that xlrd handles. Sometimes I don't understand it, and I
> wrote it :-)
>
> What I do when I want to *use* the formatting info, however, is to
> read the xlrd documentation, and I suggest that you do the same. More
> details at the end.
>
> > Basically, I want to sort rows that are red or green. My initial
> > attempt discovered that
> > >>> print cell
> > text:u'test1.txt' (XF:22)
> > text:u'test2.txt' (XF:15)
> > text:u'test3.txt' (XF:15)
> > text:u'test4.txt' (XF:15)
> > text:u'test5.txt' (XF:23)
>
> > So, I thought that XF:22 represented my red highlighted row and XF:23
> > represented my green highlighted row. However, that was not always
> > true. If one row is blank and I only highlighted one row, I got:
> > >>> print cell
> > text:u'test1.txt' (XF:22)
> > text:u'test2.txt' (XF:22)
> > text:u'test3.txt' (XF:22)
> > text:u'test4.txt' (XF:22)
> > text:u'test5.txt' (XF:22)
> > empty:'' (XF:15)
> > text:u'test6.txt' (XF:22)
> > text:u'test7.txt' (XF:23)
>
> > Now NoFill is XF:22! I am sure I am going about this the wrong way,
> > but I just want to store filenames into a dictionary based on whether
> > they are red or green. Any ideas would be much appreciated. My code
> > is below.
>
> > Best,
> > Patrick
>
> > filenames = {}
> > filenames.setdefault('GREEN',[])
> > filenames.setdefault('RED',[])
>
> > book = xlrd.open_workbook("/home/pwaldo2/work/workbench/Summary.xls",formatting_info=True)
> > SumDoc = book.sheet_by_index(0)
>
> > n=1
> > while n
> >     cell = SumDoc.cell(n,5)
> >     print cell
> >     filename = str(cell)[7:-9]
> >     color = str(cell)[-3:-1]
> >     if color == '22':
> >         filenames['RED'].append(filename)
> >         n+=1
> >     elif color == '23':
> >         filenames['GREEN'].append(filename)
> >         n+=1
>
> 22 and 23 are not colours, they are indexes into a list of XFs
> (extended formats). The indexes after 16 have no fixed meaning, and as
> you found, if you add/subtract formatting features to your XLS file,
> the actual indexes used will change. Don't use str(cell). Use
> cell.xf_index.
>
> Here is your reading path through the docs, starting at "The Cell
> class":
> Cell.xf_index
> Book.xf_list
>
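Following the reading path in the reply above, the whole chain from a cell to an RGB triple can be sketched as one small helper. The attribute names (cell_xf_index, xf_list, background.pattern_colour_index, colour_map) are xlrd's documented ones, but the stub classes below only stand in for a real workbook so the sketch is self-contained:

```python
class _Stub(object):
    """Bare attribute holder standing in for xlrd's Book/Sheet/XF objects."""
    def __init__(self, **kw):
        self.__dict__.update(kw)

def cell_fill_rgb(book, sheet, rowx, colx):
    # cell -> XF record -> pattern colour index -> RGB triple (or None)
    xf = book.xf_list[sheet.cell_xf_index(rowx, colx)]
    return book.colour_map.get(xf.background.pattern_colour_index)

# Fake workbook: a single XF whose pattern colour index 10 maps to red
book = _Stub(
    xf_list=[_Stub(background=_Stub(pattern_colour_index=10))],
    colour_map={10: (255, 0, 0), 11: (0, 255, 0), 64: None},
)
sheet = _Stub(cell_xf_index=lambda rowx, colx: 0)

print(cell_fill_rgb(book, sheet, 0, 5))  # (255, 0, 0)
```

With a real workbook opened via xlrd.open_workbook(..., formatting_info=True), the same function could classify rows as red or green by comparing the returned RGB triple instead of the unstable XF index.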
xlrd and cPickle.dump/rows to list
Hi all, I have to work with a very large Excel file and I have two questions. First, the documentation says that cPickle.dump would be the best way to work with it. However, I keep getting:

Traceback (most recent call last):
  File "C:\Python24\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
    exec codeObject in __main__.__dict__
  File "C:\python_files\pickle_test.py", line 12, in ?
    cPickle.dump(book,wb.save(pickle_path))
  File "C:\Python24\lib\copy_reg.py", line 69, in _reduce_ex
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle file objects

I tried to use open(filename, 'w') as well as pyExcelerator (wb.save(pickle_path)) to create the pickle file, but neither worked. Any ideas would be much appreciated. Patrick
xlrd and cPickle.dump
Hi all, I have to work with a very large Excel file and I have two questions. First, the documentation says that cPickle.dump would be the best way to work with it. However, I keep getting:

Traceback (most recent call last):
  File "C:\Python24\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
    exec codeObject in __main__.__dict__
  File "C:\python_files\pickle_test.py", line 12, in ?
    cPickle.dump(book,wb.save(pickle_path))
  File "C:\Python24\lib\copy_reg.py", line 69, in _reduce_ex
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle file objects

I tried to use open(filename, 'w') as well as pyExcelerator (wb.save(pickle_path)) to create the pickle file, but neither worked. Secondly, I am trying to make an ID number from three columns of data: category | topic | sub_topic, so that I can . I imagine that a dictionary would be the best way to sort out the repeats
xlrd and cPickle.dump
Hi all, Sorry for the repeat; I needed to reformulate my question and had some problems... silly me. The xlrd documentation says:

"Pickleable. Default is true. In Python 2.4 or earlier, setting to false will cause use of array.array objects which save some memory but can't be pickled. In Python 2.5, array.arrays are used unconditionally. Note: if you have large files that you need to read multiple times, it can be much faster to cPickle.dump() the xlrd.Book object once, and use cPickle.load() multiple times."

I'm using Python 2.4 and I have an extremely large Excel file that I need to work with. The documentation leads me to believe that cPickle will be a more efficient option, but I am having trouble pickling the Excel file. So far, I have this:

    import cPickle, xlrd
    import pyExcelerator
    from pyExcelerator import *

    data_path = """C:\test.xls"""
    pickle_path = """C:\pickle.xls"""

    book = xlrd.open_workbook(data_path)
    Data_sheet = book.sheet_by_index(0)
    wb = pyExcelerator.Workbook()
    proc = wb.add_sheet("proc")

    #Neither of these work
    #1) pyExcelerator try
    #cPickle.dump(book, wb.save(pickle_path))
    #2) Normal pickle try
    #pickle_file = open(pickle_path, 'w')
    #cPickle.dump(book, pickle_file)
    #file.close()

Any ideas would be helpful. Otherwise, I won't pickle the Excel file and will deal with the lag time. Patrick
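If the Book object itself refuses to pickle (it can hold references to open files and modules, which the pickle protocol rejects), one workaround is to pickle the extracted cell values instead of the Book: plain lists round-trip fine. A sketch under that assumption; the sample rows are invented, and with a real sheet they would come from sheet.row_values(n):

```python
try:
    import cPickle as pickle  # Python 2, as in the original post
except ImportError:
    import pickle             # Python 3 fallback
import os, tempfile

# Stand-in for [sheet.row_values(n) for n in range(sheet.nrows)]
rows = [[u'book', u'non-fiction', u'biography'],
        [u'book', u'fiction', u'literature']]

path = os.path.join(tempfile.gettempdir(), 'rows.pickle')
f = open(path, 'wb')           # binary mode matters for pickle files
pickle.dump(rows, f, -1)       # -1 selects the highest protocol
f.close()

f = open(path, 'rb')
restored = pickle.load(f)
f.close()
print(restored == rows)  # True
```

Note the two details the thread converges on anyway: open the file in 'wb' (not 'w'), and pass a protocol argument.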
Re: xlrd and cPickle.dump
> How many megabytes is "extremely large"? How many seconds does it take
> to open it with xlrd.open_workbook?

The document is 15 MB and 50,000+ rows (for test purposes I will use a smaller sample), but my computer hangs (i.e. it takes a long time) when I try to do simple manipulations, and the documentation leads me to believe cPickle will be more efficient. If this is not true, then I don't have a problem (i.e. I just have to wait), but I would still like to figure out how to pickle an xlrd object anyway.

> You only need one of the above imports at the best of times, and for
> what you are attempting to do, you don't need pyExcelerator at all.

Using pyExcelerator was a guess, because the traditional way didn't work and I thought it might be because it's an Excel file. Secondly, I import it twice because sometimes, and I don't know why, PythonWin does not import pyExcelerator the first time. This has only been true with pyExcelerator.

> > data_path = """C:\test.xls"""
>
> It is extremely unlikely that you have a file whose basename begins with
> a TAB ('\t') character. Please post the code that you actually ran.

You're right; I had just quickly erased my Documents and Settings folder to make it smaller for an example.

> Please post the minimal pyExcelerator-free script that demonstrates your
> problem. Ensure that it includes the following line:
> import sys; print sys.version; print xlrd.__VERSION__
> Also post the output and the traceback (in full).

As to copy_reg.py, I downloaded ActiveState Python 2.4 and that was it, so I have had no other version on my computer.

Here's the code:

    import cPickle, xlrd, sys

    print sys.version
    print xlrd.__VERSION__

    data_path = """C:\\test\\test.xls"""
    pickle_path = """C:\\test\\pickle.pickle"""

    book = xlrd.open_workbook(data_path)
    Data_sheet = book.sheet_by_index(0)

    pickle_file = open(pickle_path, 'w')
    cPickle.dump(book, pickle_file)
    pickle_file.close()

Here's the output:

2.4.3 (#69, Apr 11 2006, 15:32:42) [MSC v.1310 32 bit (Intel)]
0.6.1
Traceback (most recent call last):
  File "C:\Python24\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
    exec codeObject in __main__.__dict__
  File "C:\text analysis\pickle_test2.py", line 13, in ?
    cPickle.dump(book, pickle_file)
  File "C:\Python24\lib\copy_reg.py", line 69, in _reduce_ex
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle module objects

Thanks for the advice!
Re: xlrd and cPickle.dump
Still no luck:

Traceback (most recent call last):
  File "C:\Python24\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
    exec codeObject in __main__.__dict__
  File "C:\text analysis\pickle_test2.py", line 13, in ?
    cPickle.dump(Data_sheet, pickle_file, -1)
PicklingError: Can't pickle : attribute lookup __builtin__.module failed

My code remains the same, except I added 'wb' and the -1 following your suggestions:

    import cPickle, xlrd, sys

    print sys.version
    print xlrd.__VERSION__

    data_path = """C:\\test\\test.xls"""
    pickle_path = """C:\\test\\pickle.pickle"""

    book = xlrd.open_workbook(data_path)
    Data_sheet = book.sheet_by_index(0)

    pickle_file = open(pickle_path, 'wb')
    cPickle.dump(Data_sheet, pickle_file, -1)
    pickle_file.close()

To begin with (I forgot to mention this before) I get this error:

WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero

I'm not sure what this means.

> What do you describe as "simple manipulations"? Please describe your
> computer, including how much memory it has.

I have a 1.8 GHz HP dv6000 with 2 GB of RAM, which should be speedy enough for my programming projects. However, when I try to print out the rows in the Excel file, my computer gets very slow and choppy, which makes experimenting slow and frustrating. Maybe cPickle won't solve this problem at all! For this first part, I am trying to make ID numbers for the different permutations of categories, topics, and sub_topics. So I will have [book,non-fiction,biography], [book,non-fiction,history-general], [book,fiction,literature], etc., and I want the combination of

[book,non-fiction,biography] = 1
[book,non-fiction,history-general] = 2
[book,fiction,literature] = 3
etc...

My code does this, except sort returns None, which is strange. I just want an alphabetical sort by the first element, which sort should do automatically. When I do a test like

>>> nest_list = [['bbc', 'cds'], ['jim', 'ex'], ['abc', 'sd']]
>>> nest_list.sort()
[['abc', 'sd'], ['bbc', 'cds'], ['jim', 'ex']]

it works fine, but not for my rows. Here's the code (unpickled/unsorted):

    import xlrd, pyExcelerator

    path_file = "C:\\text_analysis\\test.xls"
    book = xlrd.open_workbook(path_file)
    ProcFT_QC = book.sheet_by_index(0)
    log_path = "C:\\text_analysis\\ID_Log.log"
    logfile = open(log_path,'wb')

    set_rows = []
    rows = []
    db = {}
    n = 0
    while n
        rows.append(ProcFT_QC.row_values(n, 6, 9))
        n += 1
    print rows.sort()  # Outputs None
    ID = 1
    for row in rows:
        if row not in set_rows:
            set_rows.append(row)
            db[ID] = row
            entry = str(ID) + '|' + str(row).strip('u[]') + '\r\n'
            logfile.write(entry)
            ID += 1
    logfile.close()

> Also, any good reason for sticking with Python 2.4?

Trying to learn Zope/Plone too, so I'm sticking with Python 2.4.

Thanks again
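On the sort question specifically: list.sort() sorts in place and returns None, which is why print rows.sort() shows None; sorted() returns a new list. A small sketch of building the ID table that way, with sample tuples invented from the categories mentioned above:

```python
rows = [('book', 'non-fiction', 'biography'),
        ('book', 'fiction', 'literature'),
        ('book', 'non-fiction', 'biography'),   # duplicate entry
        ('book', 'non-fiction', 'history-general')]

# list.sort() mutates the list and returns None, hence the confusing output
print(rows.sort() is None)  # True

# sorted(set(...)) deduplicates and orders in one step,
# so assigning IDs becomes a single enumerate pass
db = {}
for ID, combo in enumerate(sorted(set(rows)), 1):
    db[ID] = combo

print(db[1])  # ('book', 'fiction', 'literature')
```

Using tuples rather than lists for the rows is deliberate: tuples are hashable, so they can go into a set for deduplication.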
Re: xlrd and cPickle.dump
> FWIW, it works here on 2.5.1 without errors or warnings. Output is:
> 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)]
> 0.6.1

I guess it's a version issue then... I forgot about sorted! Yes, that would make sense! Thanks for the input.

On Apr 2, 4:23 pm, [EMAIL PROTECTED] wrote:
> Still no luck:
>
> Traceback (most recent call last):
>   File "C:\Python24\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
>     exec codeObject in __main__.__dict__
>   File "C:\text analysis\pickle_test2.py", line 13, in ?
>     cPickle.dump(Data_sheet, pickle_file, -1)
> PicklingError: Can't pickle : attribute lookup
> __builtin__.module failed
>
> My code remains the same, except I added 'wb' and the -1 following
> your suggestions:
>
> import cPickle, xlrd, sys
>
> print sys.version
> print xlrd.__VERSION__
>
> data_path = """C:\\test\\test.xls"""
> pickle_path = """C:\\test\\pickle.pickle"""
>
> book = xlrd.open_workbook(data_path)
> Data_sheet = book.sheet_by_index(0)
>
> pickle_file = open(pickle_path, 'wb')
> cPickle.dump(Data_sheet, pickle_file, -1)
> pickle_file.close()
>
> To begin with (I forgot to mention this before) I get this error:
> WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
>
> I'm not sure what this means.
>
> > What do you describe as "simple manipulations"? Please describe your
> > computer, including how much memory it has.
>
> I have a 1.8Ghz HP dv6000 with 2Gb of ram, which should be speedy
> enough for my programming projects. However, when I try to print out
> the rows in the excel file, my computer gets very slow and choppy,
> which makes experimenting slow and frustrating. Maybe cPickle won't
> solve this problem at all! For this first part, I am trying to make
> ID numbers for the different permutation of categories, topics, and
> sub_topics. So I will have [book,non-fiction,biography],
> [book,non-fiction,history-general], [book,fiction,literature], etc.
>
> so I want the combination of
> [book,non-fiction,biography] = 1
> [book,non-fiction,history-general] = 2
> [book,fiction,literature] = 3
> etc...
>
> My code does this, except sort returns None, which is strange. I just
> want an alphabetical sort of the first option, which sort should do
> automatically. When I do a test like
> >>> nest_list = [['bbc', 'cds'], ['jim', 'ex'], ['abc', 'sd']]
> >>> nest_list.sort()
> [['abc', 'sd'], ['bbc', 'cds'], ['jim', 'ex']]
> It works fine, but not for my rows.
>
> Here's the code (unpickled/unsorted):
> import xlrd, pyExcelerator
>
> path_file = "C:\\text_analysis\\test.xls"
> book = xlrd.open_workbook(path_file)
> ProcFT_QC = book.sheet_by_index(0)
> log_path = "C:\\text_analysis\\ID_Log.log"
> logfile = open(log_path,'wb')
>
> set_rows = []
> rows = []
> db = {}
> n=0
> while n
>     rows.append(ProcFT_QC.row_values(n, 6,9))
>     n+=1
> print rows.sort() #Outputs None
> ID = 1
> for row in rows:
>     if row not in set_rows:
>         set_rows.append(row)
>         db[ID] = row
>         entry = str(ID) + '|' + str(row).strip('u[]') + '\r\n'
>         logfile.write(entry)
>         ID+=1
> logfile.close()
>
> > Also, any good reason for sticking with Python 2.4?
>
> Trying to learn Zope/Plone too, so I'm sticking with Python 2.4.
>
> Thanks again
Converting .doc to .txt in Linux
Hi Everyone, I had previously asked a similar question,
http://groups.google.com/group/comp.lang.python/browse_thread/thread/2953d6d5d8836c4b/9dc901da63d8d059?lnk=gst&q=convert+doc+txt#9dc901da63d8d059
but at that point I was using Windows and now I am using Linux. Basically, I have some .doc files that I need to convert into .txt files encoded in UTF-8. However, win32com.client doesn't work on Linux. It's been giving me quite a headache all day. Any ideas would be greatly appreciated.

Best, Patrick

    #Windows Code:
    import glob, os, codecs, shutil, win32com.client
    from win32com.client import Dispatch

    input = '/home/pwaldo2/work/workbench/current_documents/*.doc'
    input_dir = '/home/pwaldo2/work/workbench/current_documents/'
    outpath = '/home/pwaldo2/work/workbench/current_documents/TXT/'

    for doc in glob.glob(input):
        WordApp = Dispatch("Word.Application")
        WordApp.Visible = 1
        WordApp.Documents.Open(doc)
        WordApp.ActiveDocument.SaveAs(doc, 7)
        WordApp.ActiveDocument.Close()
        WordApp.Quit()

    for doc in glob.glob(input):
        txt_split = os.path.splitext(doc)
        txt_doc = txt_split[0] + '.txt'
        txt_doc_path = os.path.join(outpath, txt_doc)
        doc_path = os.path.join(input_dir, doc)
        shutil.copy(doc_path, txt_doc_path)
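On Linux one common route (an assumption here, not something from the thread) is to shell out to a command-line converter such as antiword instead of driving Word via COM. The sketch below separates the pure path bookkeeping, which is testable anywhere, from the actual conversion call, which assumes the antiword utility is installed; subprocess.check_output also needs Python 2.7 or later:

```python
import os
import subprocess

def txt_target(doc_path, out_dir):
    """Compute the output .txt path for a given .doc file."""
    base = os.path.splitext(os.path.basename(doc_path))[0]
    return os.path.join(out_dir, base + '.txt')

def doc_to_txt(doc_path, out_dir):
    """Convert one .doc to UTF-8 text; assumes the antiword CLI is installed."""
    out_path = txt_target(doc_path, out_dir)
    # '-m UTF-8.txt' selects antiword's UTF-8 character mapping file
    text = subprocess.check_output(['antiword', '-m', 'UTF-8.txt', doc_path])
    with open(out_path, 'wb') as f:
        f.write(text)
    return out_path

print(txt_target('/home/pwaldo2/work/workbench/current_documents/report.doc', '/tmp/TXT'))
# /tmp/TXT/report.txt
```

Looping doc_to_txt over glob.glob('.../current_documents/*.doc') would then replace both loops of the Windows version in one pass.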