[JOB] Looking for a Full-Time Plone Developer
Hi All,

Please take a look at a new job opportunity for Python/Plone developers.

Patrick Waldo, Project Manager, Decernis <http://decernis.com/>

*Job Description: Full-Time Python/Plone Developer*

We are looking for a highly motivated and self-reliant developer to work on systems built with Plone in a small but lively team. Our ideal candidate is not afraid to roll up their sleeves to tackle complex problems in code, and can offer innovative solutions at the planning and design stages. The position also calls for rapid prototyping for client meetings, maintaining current systems, and fulfilling other tasks as necessary. It requires experience in Plone administration, setting up backup and failover instances, optimizing the ZODB, testing, and documentation. The job also entails creating clean user interfaces, building forms, and integrating with Oracle. The position will begin with a six-month trial period, at which point full-time employment will be re-evaluated based on performance. The candidate will be able to choose their hours and work remotely, but must meet deadlines and report progress effectively.

*Key Skills*

· At least 3 years of Plone and Python development
· At least 3 years of web development (HTML, CSS, jQuery, etc.)
· Server administration (Apache)
· Oracle integration (cx_Oracle, SQLAlchemy, etc.)
· Task-oriented, with solid project management skills
· Data mining and data visualization experience a plus
· Java or Perl experience a plus
· Proficient English
· Effective communication

*About Decernis*

Decernis is a global information systems company that works with industry leaders and government agencies to meet complex regulatory compliance and risk management needs in the areas of food, consumer products, and industrial goods. We hold ourselves to high standards in the technically challenging areas our clients face, to ensure comprehensive, current, and global solutions.
Decernis has offices in Rockville, MD and Frankfurt, Germany, as well as teams located around the world. For more information, please visit our website: http://www.decernis.com.

*Contact*

Please send resume, portfolio, and cover letter to Cynthia Gamboa, cgam...@decernis.com. Decernis is an equal opportunity employer.

--
http://mail.python.org/mailman/listinfo/python-list
[JOB] Two opportunities at Decernis
Hi All,

The company I work for, Decernis, has two job opportunities that might be of interest. Decernis provides global systems for regulatory compliance management of foods and consumer products to world leaders in each sector. The company has offices in Rockville, MD as well as Frankfurt, Germany.

First, we are looking for a highly effective, full-time senior software engineer with experience in both development and client interaction. This position will work mostly in Java, but Python is definitely a plus.

Second, we are looking for a highly motivated and self-reliant independent contractor to help us build customized RSS feeds, web crawlers, and site monitors. This position is part-time and all programs will be written in Python. Experience in Plone will be an added benefit.

Please see below for more information. Send resume and cover letter to Cynthia Gamboa, cgam...@decernis.com.

Best,
Patrick
Project Manager, Decernis News & Issue Management

*Job Description: Full-Time Senior Software Engineer*

We are looking for a highly effective senior software engineer with experience in both development and client interaction. Our company provides global systems for regulatory compliance management of foods and consumer products to world leaders in each sector. Our ideal candidate has the following experience:

· 5 or more years of Java/J2EE development, including JBoss/Tomcat, web applications, and deployment
· 4 or more years of Oracle database development, including Oracle 10g or later versions
· Strong Unix/Linux OS working experience
· Strong scripting-language experience in Python and Perl
· Experience with rule-based expert systems
· Experience with Plone and other CMSes a plus

Salary commensurate with experience. This position reports directly to the Director of System Development.
*About Decernis*

Decernis is a global information company that works with industry leaders and government agencies to meet complex regulatory compliance and risk management needs. We work closely with our clients to produce results that meet the high standards demanded in technically challenging areas, ensuring comprehensive, current, and global solutions. Our team has the regulatory, scientific, data, and systems expertise to succeed with our clients, and we are dedicated to results. Decernis has offices in Rockville, MD and Frankfurt, Germany. Relocating to the Washington, DC area is a requirement of the position. Decernis is an equal opportunity employer and will not discriminate against any individual, employee, or applicant for employment on the basis of race, color, marital status, religion, age, sex, sexual orientation, national origin, handicap, or any other legally protected status recognized by federal, state, or local law.

###

*Job Description: Part-Time Python Programmer*

We are looking for a highly motivated and self-reliant independent contractor to help us build customized RSS feeds, web crawlers, and site monitors. Our ideal candidate has experience with data mining techniques as well as building web crawlers and scrapers. The candidate will be able to choose their hours and work remotely, but must meet expected deadlines and be able to report progress effectively. In addition, we are looking for someone who can think through the problem set and contribute their own solutions while balancing project goals and direction. The project will last approximately three months, but sufficient performance could lead to future work. This position reports directly to the Director of System Development.
*Key Skills*

· Data mining & web crawling (required)
· Python development (required)
· Statistics
· Task-oriented
· Proficient English
· Effective communication

*About Decernis*

Decernis is a global information company that works with industry leaders and government agencies to meet complex regulatory compliance and risk management needs. We work closely with our clients to produce results that meet the high standards demanded in technically challenging areas, ensuring comprehensive, current, and global solutions. Our team has the regulatory, scientific, data, and systems expertise to succeed with our clients, and we are dedicated to results. Decernis has offices in Rockville, MD and Frankfurt, Germany. Relocating to the Washington, DC area is not a requirement. Decernis is an equal opportunity employer and will not discriminate against any individual, employee, or applicant for employment on the basis of race, color, marital status, religion, age, sex, sexual orientation, national origin, handicap, or any other legally protected status recognized by federal, state, or local law.

###
Simple Text Processing Help
Hi all,

I started Python just a little while ago and I am stuck on something that is really simple, but I just can't figure out. Essentially I need to take a text document with some chemical information in Czech and organize it into another text file. The information is always EINECS number, CAS, chemical name, and formula in tables. I need to organize them into lines with | in between. So it goes from:

    200-763-1 71-73-8 nátrium-tiopentál C11H18N2O2S.Na

to:

    200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na

but if I have a chemical like kyselina močová, I get:

    200-720-7|69-93-2|kyselina|močová|C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál

and then it is all off. How can I get Python to realize that a chemical name may have a space in it?

Thank you,
Patrick

So far I have:

    # take tables in one text file and organize them into lines in another
    import codecs

    path = "c:\\text_samples\\chem_1_utf8.txt"
    path2 = "c:\\text_samples\\chem_2.txt"
    input = codecs.open(path, 'r', 'utf8')
    output = codecs.open(path2, 'w', 'utf8')

    # read and enter into a list
    chem_file = []
    chem_file.append(input.read())

    # split words and store them in a list
    for word in chem_file:
        words = word.split()

    # starting values in list
    e = 0   # EINECS
    c = 1   # CAS
    ch = 2  # chemical name
    f = 3   # formula
    n = 0
    loop = 1
    x = len(words)  # counts how many words there are in the file
    print '-'*100
    while loop == 1:
        if n
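For what it's worth, a minimal sketch of the record-joining approach (written in current Python 3 syntax; it assumes tokens arrive in EINECS, CAS, name..., formula order, with the name possibly split across several tokens):

```python
def join_record(tokens):
    # Everything between the CAS number and the formula is the name,
    # so collapse tokens[2:-1] into a single space-joined element.
    tokens = list(tokens)
    tokens[2:-1] = [' '.join(tokens[2:-1])]
    return '|'.join(tokens)

print(join_record(['200-763-1', '71-73-8', 'natrium-tiopental', 'C11H18N2O2S.Na']))
print(join_record(['200-720-7', '69-93-2', 'kyselina', 'mocova', 'C5H4N4O3']))
```

This only works per record, of course; the replies in the thread deal with how to split the input stream into records in the first place.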
Re: Simple Text Processing Help
Thank you both for helping me out. I am still rather new to Python and so I'm probably trying to reinvent the wheel here. When I try Paul's suggestion, I get:

    >>> tokens = line.strip().split()
    []

So I am not quite sure how to read line by line. tokens = input.read().split() gets me all the information from the file. tokens[2:-1] = [u' '.join(tokens[2:-1])] works just fine, like in the example; however, how can I loop this for the entire document? Also, when I try output.write(tokens), I get "TypeError: coercing to Unicode: need string or buffer, list found". Any ideas?

On Oct 14, 4:25 pm, Paul Hankin <[EMAIL PROTECTED]> wrote:
> On Oct 14, 2:48 pm, [EMAIL PROTECTED] wrote:
> > Hi all,
> >
> > I started Python just a little while ago and I am stuck on something
> > that is really simple, but I just can't figure out.
> >
> > Essentially I need to take a text document with some chemical
> > information in Czech and organize it into another text file. The
> > information is always EINECS number, CAS, chemical name, and formula
> > in tables. I need to organize them into lines with | in between. So
> > it goes from:
> >
> > 200-763-1 71-73-8 nátrium-tiopentál C11H18N2O2S.Na
> >
> > to:
> >
> > 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na
> >
> > but if I have a chemical like: kyselina močová
> >
> > I get:
> > 200-720-7|69-93-2|kyselina|močová|C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál
> >
> > and then it is all off.
> >
> > How can I get Python to realize that a chemical name may have a space
> > in it?
>
> In the original file, is every chemical on a line of its own? I assume
> it is here. You might use a regexp (look at the re module), or I think
> here you can use the fact that only chemicals have spaces in them.
> Then, you can split each line on whitespace (like you're doing), and
> join back together all the words between the 3rd (i.e. index 2) and
> the last (i.e. index -1) using tokens[2:-1] = [u' '.join(tokens[2:-1])].
> This uses the somewhat unusual Python syntax for replacing a section
> of a list with another list.
>
> The approach you took involves reading the whole file and building a
> list of all the chemicals, which you don't seem to use; I've changed
> it to a per-line version and removed the big lists.
>
>     path = "c:\\text_samples\\chem_1_utf8.txt"
>     path2 = "c:\\text_samples\\chem_2.txt"
>     input = codecs.open(path, 'r', 'utf8')
>     output = codecs.open(path2, 'w', 'utf8')
>
>     for line in input:
>         tokens = line.strip().split()
>         tokens[2:-1] = [u' '.join(tokens[2:-1])]
>         chemical = u'|'.join(tokens)
>         print chemical + u'\n'
>         output.write(chemical + u'\r\n')
>
>     input.close()
>     output.close()
>
> Obviously, this isn't tested because I don't have your chem_1_utf8.txt
> file.
>
> --
> Paul Hankin
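The slice-assignment idiom Paul mentions is worth seeing in isolation (a minimal sketch, Python 3 syntax): assigning a one-element list to the slice `tokens[2:-1]` replaces however many name tokens there were with a single joined element, so the list shrinks to exactly four fields.

```python
tokens = ['200-720-7', '69-93-2', 'kyselina', 'mocova', 'C5H4N4O3']

# Replace the slice [2:-1] (all the name tokens) with one joined element.
# The list changes length: 5 elements in, 4 elements out.
tokens[2:-1] = [' '.join(tokens[2:-1])]
print(tokens)
```

Note that with exactly four tokens the slice holds one element and the join is a no-op, which is why the same line works for single-word names; lines with fewer than three tokens would need guarding.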
Re: Simple Text Processing Help
> lines = open('your_file.txt').readlines()[:4]
> print lines
> print map(len, lines)

gave me:

    ['\xef\xbb\xbf200-720-769-93-2\n', 'kyselina mo\xc4\x8dov \xc3\xa1 C5H4N4O3\n', '\n', '200-001-8\t50-00-0\n']
    [28, 32, 1, 18]

I think it means that I'm still at option 3. I got the line-by-line part. My code is a lot cleaner now:

    import codecs

    path = "c:\\text_samples\\chem_1_utf8.txt"
    path2 = "c:\\text_samples\\chem_2.txt"
    input = codecs.open(path, 'r', 'utf8')
    output = codecs.open(path2, 'w', 'utf8')

    for line in input:
        tokens = line.strip().split()
        tokens[2:-1] = [u' '.join(tokens[2:-1])]  # this doesn't seem to combine the fields correctly
        file = u'|'.join(tokens)  # this does put '|' in between
        print file + u'\n'
        output.write(file + u'\r\n')

    input.close()
    output.close()

My sample input file looks like this (not organized, as you see it):

    200-720-769-93-2
    kyselina mocová C5H4N4O3

    200-001-8 50-00-0
    formaldehyd CH2O

    200-002-3 50-01-1
    guanidínium-chlorid CH5N3.ClH

etc., and after the program I get:

    200-720-7|69-93-2|
    kyselina|mocová||C5H4N4O3
    200-001-8|50-00-0|
    formaldehyd|CH2O|
    200-002-3|
    50-01-1|
    guanidínium-chlorid|CH5N3.ClH|

etc. So, I am sort of back at the start again. If I add:

    tokens = line.strip().split()
    for token in tokens:
        print token

I get all the single tokens, which I thought I could then put together, except when I did:

    for token in tokens:
        s = u'|'.join(token)
        print s

I got ?|2|0|0|-|7|2|0|-|7, etc. How can I join these together into nice neat little lines? When I try to store the tokens in a list, the tokens double and I don't know why. I can work on getting the chemical names together after...baby steps, or maybe I am just missing something obvious. The first two numbers will always be the same: three digits-three digits-one digit, and then two digits-two digits-one digit. This seems to be the only pattern.
My intuition tells me that I need to add an if statement that says: if the first two numbers follow the pattern, then continue; if they don't (i.e. a chemical name was accidentally split apart), then the third entry needs to be put together. Something like:

    if tokens[1] and tokens[2] startswith('pattern') == true
        tokens[2] = join(tokens[2]:tokens[3])
        token[3] = token[4]
        del token[4]

but the code isn't right...any ideas?

Again, thanks so much. I've gone to http://gnosis.cx/TPiP/ and I have a couple of O'Reilly books, but they don't seem to have a straightforward example for this kind of text manipulation.

Patrick

On Oct 14, 11:17 pm, John Machin <[EMAIL PROTECTED]> wrote:
> On Oct 14, 11:48 pm, [EMAIL PROTECTED] wrote:
> > Hi all,
> >
> > I started Python just a little while ago and I am stuck on something
> > that is really simple, but I just can't figure out.
> >
> > Essentially I need to take a text document with some chemical
> > information in Czech and organize it into another text file. The
> > information is always EINECS number, CAS, chemical name, and formula
> > in tables. I need to organize them into lines with | in between. So
> > it goes from:
> >
> > 200-763-1 71-73-8 nátrium-tiopentál C11H18N2O2S.Na
> >
> > to:
> >
> > 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na
> >
> > but if I have a chemical like: kyselina močová
> >
> > I get:
> > 200-720-7|69-93-2|kyselina|močová|C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál
> >
> > and then it is all off.
> >
> > How can I get Python to realize that a chemical name may have a space
> > in it?
>
> Your input file could be in one of THREE formats:
> (1) fields are separated by TAB characters (represented in Python by
>     the escape sequence '\t', and equivalent to '\x09')
> (2) fields are fixed width and padded with spaces
> (3) fields are separated by a random number of whitespace characters
>     (and can contain spaces).
>
> What makes you sure that you have format 3?
> You might like to try something like:
>
>     lines = open('your_file.txt').readlines()[:4]
>     print lines
>     print map(len, lines)
>
> This will print a *precise* representation of what is in the first
> four lines, plus their lengths. Please show us the output.
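John's diagnostic can be tried on any sample without a file at all (a sketch in current Python 3; the sample string is made up to resemble the data above): printing the list of lines shows tabs, BOMs, and newlines explicitly in the repr, which is exactly what settles the three-format question.

```python
# Print a *precise* representation of the first lines of a sample.
# The repr makes invisible characters (\t, \ufeff BOM, \n) visible.
sample = '\ufeff200-720-7\t69-93-2\nkyselina mocova  C5H4N4O3\n'
lines = sample.splitlines(True)[:4]
print(lines)                  # tabs and the BOM show up as escapes
print(list(map(len, lines)))  # field widths reveal fixed-width padding
```

If the first print shows `\t` between fields, the file is tab-separated (format 1) and a plain `split('\t')` would be enough.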
Re: Simple Text Processing Help
Wow, thank you all. All three work. To output correctly I needed to add:

    output.write("\r\n")

This is really a great help!! Because of my limited Python knowledge, I will need to figure out exactly how they work, for future text manipulation and for my own knowledge. Could you recommend some resources for this kind of text manipulation? Also, I conceptually get it, but would you mind walking me through:

    for tok in tokens:
        if NR_RE.match(tok) and len(chem) >= 4:
            chem[2:-1] = [' '.join(chem[2:-1])]
            yield chem
            chem = []
        chem.append(tok)

and

    for key, group in groupby(instream, unicode.isspace):
        if not key:
            yield "".join(group)

Thanks again,
Patrick

On Oct 15, 2:16 pm, Peter Otten <[EMAIL PROTECTED]> wrote:
> patrick.waldo wrote:
> > my sample input file looks like this (not organized, as you see it):
> >
> > 200-720-769-93-2
> > kyselina mocová C5H4N4O3
> >
> > 200-001-8 50-00-0
> > formaldehyd CH2O
> >
> > 200-002-3
> > 50-01-1
> > guanidínium-chlorid CH5N3.ClH
>
> Assuming that the records are always separated by blank lines and only
> the third field in a record may contain spaces, the following might work:
>
>     import codecs
>     from itertools import groupby
>
>     path = "c:\\text_samples\\chem_1_utf8.txt"
>     path2 = "c:\\text_samples\\chem_2.txt"
>
>     def fields(s):
>         parts = s.split()
>         return parts[0], parts[1], " ".join(parts[2:-1]), parts[-1]
>
>     def records(instream):
>         for key, group in groupby(instream, unicode.isspace):
>             if not key:
>                 yield "".join(group)
>
>     if __name__ == "__main__":
>         outstream = codecs.open(path2, 'w', 'utf8')
>         for record in records(codecs.open(path, "r", "utf8")):
>             outstream.write("|".join(fields(record)))
>             outstream.write("\n")
>
> Peter
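Peter's groupby trick can be walked through in isolation (a small sketch in current Python 3, where `str.isspace` plays the role of Python 2's `unicode.isspace`): groupby labels each run of consecutive lines with the key function's result, so blank lines (key True) act as separators and each run of non-blank lines (key False) is one record.

```python
from itertools import groupby

def records(lines):
    # Group consecutive lines by "is this line blank?"; each run of
    # non-blank lines is joined back into a single record string.
    for is_blank, group in groupby(lines, str.isspace):
        if not is_blank:
            yield "".join(group)

sample = ['200-720-7 69-93-2\n', 'kyselina mocova C5H4N4O3\n',
          '\n',
          '200-001-8 50-00-0\n', 'formaldehyd CH2O\n']
for rec in records(sample):
    print(repr(rec))
```

Each yielded record can then be split into fields knowing the first two and the last token are fixed, with everything in between belonging to the name.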
Re: Simple Text Processing Help
And now for something completely different... I've been reading up a bit about Python and Excel, and I quickly told the program to output to Excel quite easily. However, what if the input file were a Word document? I can't seem to find much information about parsing Word files. What could I add to make the same program work for a Word file? Again, thanks a lot.

And the Excel add-on...

    import codecs
    import re
    from win32com.client import Dispatch

    path = "c:\\text_samples\\chem_1_utf8.txt"
    path2 = "c:\\text_samples\\chem_2.txt"
    input = codecs.open(path, 'r', 'utf8')
    output = codecs.open(path2, 'w', 'utf8')

    NR_RE = re.compile(r'^\d+-\d+-\d+$')  # pattern for EINECS number
    tokens = input.read().split()

    def iter_elements(tokens):
        product = []
        for tok in tokens:
            if NR_RE.match(tok) and len(product) >= 4:
                product[2:-1] = [' '.join(product[2:-1])]
                yield product
                product = []
            product.append(tok)
        yield product

    xlApp = Dispatch("Excel.Application")
    xlApp.Visible = 1
    xlApp.Workbooks.Add()

    c = 1
    for element in iter_elements(tokens):
        xlApp.ActiveSheet.Cells(c, 1).Value = element[0]
        xlApp.ActiveSheet.Cells(c, 2).Value = element[1]
        xlApp.ActiveSheet.Cells(c, 3).Value = element[2]
        xlApp.ActiveSheet.Cells(c, 4).Value = element[3]
        c = c + 1

    xlApp.ActiveWorkbook.Close(SaveChanges=1)
    xlApp.Quit()
    xlApp.Visible = 0
    del xlApp

    input.close()
    output.close()
Problem Converting Word to UTF8 Text File
Hi all,

I'm trying to copy a bunch of Microsoft Word documents that have unicode characters into utf-8 text files. Everything works fine at the beginning: the Word documents get converted and new utf-8 text files with the same names get created. And then I try to copy the data and I keep on getting "TypeError: coercing to Unicode: need string or buffer, instance found". I'm probably copying the Word document wrong. What can I do?

Thanks,
Patrick

    import os, codecs, glob, shutil, win32com.client
    from win32com.client import Dispatch

    input = 'C:\\text_samples\\source\\*.doc'
    output_dir = 'C:\\text_samples\\source\\output'
    FileFormat = win32com.client.constants.wdFormatText

    for doc in glob.glob(input):
        doc_copy = shutil.copy(doc, output_dir)
        WordApp = Dispatch("Word.Application")
        WordApp.Visible = 1
        WordApp.Documents.Open(doc)
        WordApp.ActiveDocument.SaveAs(doc, FileFormat)
        WordApp.ActiveDocument.Close()
        WordApp.Quit()

    for doc in glob.glob(input):
        txt_split = os.path.splitext(doc)
        txt_doc = txt_split[0] + '.txt'
        txt_doc = codecs.open(txt_doc, 'w', 'utf-8')
        shutil.copyfile(doc, txt_doc)
Re: Problem Converting Word to UTF8 Text File
Indeed, the shutil.copyfile(doc, txt_doc) was causing the problem, for the reason you stated. So I changed it to this:

    for doc in glob.glob(input):
        txt_split = os.path.splitext(doc)
        txt_doc = txt_split[0] + '.txt'
        txt_doc_dir = os.path.join(input_dir, txt_doc)
        doc_dir = os.path.join(input_dir, doc)
        shutil.copy(doc_dir, txt_doc_dir)

However, I still cannot read the unicode from the Word file. If I take out the first for-statement, I get a bunch of garbled text, which isn't helpful. I would save them all manually, but I want to figure out how to do it in Python, since I'm just beginning. My intuition says the problem is with:

    FileFormat = win32com.client.constants.wdFormatText

because it converts fine to a text file, just not a utf-8 text file. How can I modify this, or is there another way to code this type of file conversion from *.doc to *.txt with unicode characters?

Thanks

On Oct 21, 7:02 pm, "Gabriel Genellina" <[EMAIL PROTECTED]> wrote:
> En Sun, 21 Oct 2007 13:35:43 -0300, <[EMAIL PROTECTED]> escribió:
>
> > Hi all,
> >
> > I'm trying to copy a bunch of microsoft word documents that have
> > unicode characters into utf-8 text files. Everything works fine at
> > the beginning. The word documents get converted and new utf-8 text
> > files with the same name get created. And then I try to copy the data
> > and I keep on getting "TypeError: coercing to Unicode: need string or
> > buffer, instance found". I'm probably copying the word document
> > wrong. What can I do?
>
> Always remember to provide the full traceback.
> Where do you get the error? In the last line, shutil.copyfile?
> If the file already contains the text in utf-8, and you just want to
> make a copy, use shutil.copy as before.
> (Or, why not tell Word to save the file using the .txt extension in
> the first place?)
> > for doc in glob.glob(input):
> >     txt_split = os.path.splitext(doc)
> >     txt_doc = txt_split[0] + '.txt'
> >     txt_doc = codecs.open(txt_doc, 'w', 'utf-8')
> >     shutil.copyfile(doc, txt_doc)
>
> copyfile expects path names as arguments, not a
> codecs-wrapped file-like object.
>
> --
> Gabriel Genellina
Re: Problem Converting Word to UTF8 Text File
That KB document was really helpful, but the problem still isn't solved. What's weird now is that the unicode characters become things like è in some odd conversion. However, I noticed that when I try to open the Word documents after I run the first for statement, Word gives me a File Conversion window asking how I want to encode the text. None of the unicode options retain the characters. Then I looked some more and found it has a Central European option, both ISO and Windows, which works perfectly, since the documents I am looking at are in Czech. Then I try to save the document in Word, and it says that if I try to save it as a text file I will lose the formatting! So I guess I'm back at the start. Judging from some internet searches, I'm not the only one having this problem. For some reason Word can only save as .doc, even though .txt can support the utf8 format with all these characters. Any ideas?

On Oct 22, 5:39 am, "Gabriel Genellina" <[EMAIL PROTECTED]> wrote:
> En Sun, 21 Oct 2007 15:32:57 -0300, <[EMAIL PROTECTED]> escribió:
>
> > However, I still cannot read the unicode from the Word file. If I take
> > out the first for-statement, I get a bunch of garbled text, which
> > isn't helpful. I would save them all manually, but I want to figure
> > out how to do it in Python, since I'm just beginning.
> >
> > My intuition says the problem is with
> >
> >     FileFormat=win32com.client.constants.wdFormatText
> >
> > because it converts fine to a text file, just not a utf-8 text file.
> > How can I modify this or is there another way to code this type of
> > file conversion from *.doc to *.txt with unicode characters?
>
> Ah! I thought you were getting the right file format.
> I can't test it now, but this KB document
> http://support.microsoft.com/kb/209186/en-us
> suggests you should use wdFormatUnicodeText when saving the document.
> What the MS docs call "unicode" when dealing with files is, in
> general, utf16.
> In this case, if you want to convert to utf8, the sequence would be:
>
>     f = open(original_filename, "rb")
>     udata = f.read().decode("utf16")
>     f.close()
>     f = open(new_filename, "wb")
>     f.write(udata.encode("utf8"))
>     f.close()
>
> --
> Gabriel Genellina
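Gabriel's utf16-to-utf8 sequence can be checked entirely in memory, without Word or any files (a sketch in current Python 3; the Czech sample string is illustrative):

```python
# Simulate what Word's "Unicode text" save would write: UTF-16 with a BOM.
utf16_bytes = 'kyselina mo\u010dov\u00e1'.encode('utf-16')

# Decode UTF-16 (the BOM is consumed automatically by the codec),
# then re-encode the text as UTF-8 for the output file.
text = utf16_bytes.decode('utf-16')
utf8_bytes = text.encode('utf-8')

print(utf8_bytes.decode('utf-8'))
```

The same decode/encode pair is what the file-based version does; the only difference is where the bytes come from and go to.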
Regular Expression
Hi,

I'm trying to learn regular expressions, but I am having trouble with this. I want to search a document that has mixed data; however, the last line of every entry has something like C5H4N4O3 or CH5N3.ClH. All of the letters are upper case and there will always be numbers and possibly one '.'. However, the below only gave me None:

    import os, codecs, re

    text = 'C:\\text_samples\\sample.txt'
    text = codecs.open(text,'r','utf-8')

    test = re.compile('\u+\d+\.')

    for line in text:
        print test.search(line)
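For comparison, a pattern that would actually match such formulas (a sketch, and only a rough heuristic; note that `\u` is not a regex character class, which is one reason the pattern above finds nothing; explicit classes like [A-Z] are what was intended):

```python
import re

# An uppercase letter, then letters/digits, optionally one '.' followed
# by more letters/digits, anchored at the end of the line.
FORMULA = re.compile(r'[A-Z][A-Za-z0-9]*(?:\.[A-Za-z0-9]+)?$')

for line in ['C5H4N4O3', 'CH5N3.ClH', 'kyselina mocova']:
    print(line, bool(FORMULA.search(line)))
```

This is only a heuristic: any capitalized trailing word would also match, so (as the replies note) it cannot fully distinguish a formula from an ordinary capitalized name.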
Re: Regular Expression
This is related to my last post (see: http://groups.google.com/group/comp.lang.python/browse_thread/thread/c333cbbb5d496584/998af2bb2ca10e88#998af2bb2ca10e88).

I have a text file with an EINECS number, a CAS number, a chemical name, and a chemical formula, always in this order. However, I realized as I ran my script that I had entries like:

    274-989-4 70892-58-9 diazotovaná kyselina 4-aminobenzénsulfónová,
    kopulovaná s farbiarskym moruovým (Chlorophora tinctoria) extraktom,
    komplexy so železom komplexy železa s produktami kopulácie diazotovanej
    kyseliny 4-aminobenzénsulfónovej s látkou registrovanou v Indexe farieb
    pod identifikačným číslom Indexu farieb, C.I. 75240.

which become:

    274-989-4|70892-58-9|diazotovaná kyselina 4-aminobenzénsulfónová,
    kopulovaná s farbiarskym moruovým (Chlorophora tinctoria) extraktom,
    komplexy so železom komplexy železa s produktami kopulácie diazotovanej
    kyseliny 4-aminobenzénsulfónovej s látkou registrovanou v Indexe farieb
    pod identifikačným číslom Indexu farieb, C.I.|75240.

The "C.I. 75240" is not a chemical formula, and there isn't one for this entry. So I want to add a regular expression for the chemical name, for an if statement that stipulates: if there is no chemical formula, move on. However, I must be getting confused by the regular expression tutorials I've been reading. Any ideas?

Original code:

    # For text files in a directory...
    # Analyzes a randomly organized UTF8 document with EINECS, CAS, Chemical,
    # and Chemical Formula into a document structured as
    # EINECS|CAS|Chemical|Chemical Formula.
    import os
    import codecs
    import re

    path = "C:\\text_samples\\text"           #folder with all text files
    path2 = "C:\\text_samples\\text\\output"  #output of all text files

    NR_RE = re.compile(r'^\d+-\d+-\d+$')      #pattern for EINECS number

    def iter_elements(tokens):
        product = []
        for tok in tokens:
            if NR_RE.match(tok) and len(product) >= 4:
                product[2:-1] = [' '.join(product[2:-1])]
                yield product
                product = []
            product.append(tok)
        yield product

    for text in os.listdir(path):
        input_text = os.path.join(path, text)
        output_text = os.path.join(path2, text)
        input = codecs.open(input_text, 'r', 'utf8')
        output = codecs.open(output_text, 'w', 'utf8')
        tokens = input.read().split()
        for element in iter_elements(tokens):
            #print '|'.join(element)
            output.write('|'.join(element))
            output.write("\r\n")
        input.close()
        output.close()

On Oct 23, 5:03 pm, Paul McGuire <[EMAIL PROTECTED]> wrote:
> On Oct 22, 5:29 pm, [EMAIL PROTECTED] wrote:
> [snip]
>
> If those are chemical symbols, then I guarantee that there will be
> lower case letters in the expression (like the "l" in "ClH").
>
> -- Paul
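To see what the grouping generator actually produces, here is a self-contained run over two invented sample records (the tokens are made up for illustration). Note that the final `yield` emits the last record without collapsing its middle fields into one name, which is the same split-last-line symptom reported later in the thread:

```python
import re

NR_RE = re.compile(r'^\d+-\d+-\d+$')  # EINECS-style number

def iter_elements(tokens):
    # Same logic as the posted generator: start a new record whenever
    # an EINECS-style token appears and the current record already has
    # its minimum four fields.
    product = []
    for tok in tokens:
        if NR_RE.match(tok) and len(product) >= 4:
            product[2:-1] = [' '.join(product[2:-1])]
            yield product
            product = []
        product.append(tok)
    yield product

# Invented sample: EINECS, CAS, name word(s), formula -- twice.
tokens = ('200-001-8 50-00-0 formaldehyde CH2O '
          '200-002-3 50-01-1 guanidine hydrochloride CH5N3.ClH').split()
for rec in iter_elements(tokens):
    print('|'.join(rec))
```

The first record comes out as four pipe-separated fields; the last keeps 'guanidine' and 'hydrochloride' as two separate fields because the trailing `yield` skips the join step.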
Re: Regular Expression
Marc, thank you for the example; it made me realize where I was getting things wrong. I didn't realize how specific I needed to be. Also, http://weitz.de/regex-coach/ really helped me test things out on this one. I realized I had some more exceptions like C18H34O2.1/2Cu, and I also realized I didn't really understand regular expressions (which I still don't, but I think it's getting better).

    FORMULA = re.compile(r'([A-Z][A-Za-z0-9]+\.?[A-Za-z0-9]+/?[A-Za-z0-9]+)')

This gets all chemical formulas like C14H28, C18H34O2.1/2Cu, and C8H17ClO2, i.e. a word that begins with a capital letter, followed by any number of upper or lower case letters and numbers, followed by a possible '.', followed by any number of upper or lower case letters and numbers, followed by a possible '/', followed by any number of upper or lower case letters and numbers. Say that five times fast!

So now I want to tell the program that if it finds the formula at the end then continue; otherwise, if it finds C.I. 75240 or any other type of word, it should not be broken by a | and should be lumped into the whole line. But now I get:

    Traceback (most recent call last):
      File "C:\Python24\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
        exec codeObject in __main__.__dict__
      File "C:\Documents and Settings\Patrick Waldo\My Documents\Python\WORD\try5-2-file-1-1.py", line 32, in ?
        input = codecs.open(input_text, 'r','utf8')
      File "C:\Python24\lib\codecs.py", line 666, in open
        file = __builtin__.open(filename, mode, buffering)
    IOError: [Errno 13] Permission denied: 'C:\\Documents and Settings\\Patrick Waldo\\Desktop\\decernis\\DAD\\EINECS_SK\\text\\output'

Ideas?

#For text files in a directory...
#Analyzes a randomly organized UTF8 document with EINECS, CAS, Chemical, and Chemical Formula
#into a document structured as EINECS|CAS|Chemical|Chemical Formula.
    import os
    import codecs
    import re

    path = "C:\\text"
    path2 = "C:\\text\\output"

    EINECS = re.compile(r'^\d\d\d-\d\d\d-\d$')
    FORMULA = re.compile(r'([A-Z][A-Za-z0-9]+\.?[A-Za-z0-9]+/?[A-Za-z0-9]+)')

    def iter_elements(tokens):
        product = []
        for tok in tokens:
            if EINECS.match(tok) and len(product) >= 4:
                if product[-1] == FORMULA.findall(tok):
                    product[2:-1] = [' '.join(product[2:-1])]
                    yield product
                    product = []
                else:
                    product[2:-1] = [' '.join(product[2:])]
                    yield product
                    product = []
            product.append(tok)
        yield product

    for text in os.listdir(path):
        input_text = os.path.join(path, text)
        output_text = os.path.join(path2, text)
        input = codecs.open(input_text, 'r', 'utf8')
        output = codecs.open(output_text, 'w', 'utf8')
        tokens = input.read().split()
        for element in iter_elements(tokens):
            output.write('|'.join(element))
            output.write("\r\n")
        input.close()
        output.close()
Re: Regular Expression
Finally I solved the problem, with some really minor things to tweak. I guess it's true that I had two problems working with regular expressions. Thank you all for your help. I really learned a lot on quite a difficult problem.

Final Code:

    #For text files in a directory...
    #Analyzes a randomly organized UTF8 document with EINECS, CAS, Chemical, and Chemical Formula
    #into a document structured as EINECS|CAS|Chemical|Chemical Formula.

    import os
    import codecs
    import re

    path = "C:\\text_samples\\text\\"
    path2 = "C:\\text_samples\\text\\output\\"

    EINECS = re.compile(r'^\d\d\d-\d\d\d-\d$')
    CAS = re.compile(r'^\d*-\d\d-\d$')
    FORMULA = re.compile(r'([A-Z][A-Za-z0-9]+\.?[A-Za-z0-9]+/?[A-Za-z0-9]+)')

    def iter_elements(tokens):
        product = []
        for tok in tokens:
            if EINECS.match(tok) and len(product) >= 4:
                match = re.match(FORMULA, product[-1])
                if match:
                    product[2:-1] = [' '.join(product[2:-1])]
                    yield product
                    product = []
                else:
                    product[2:-1] = [' '.join(product[2:])]
                    del product[-1]
                    yield product
                    product = []
            product.append(tok)
        yield product

    for text in os.listdir(path):
        input_text = os.path.join(path, text)
        output_text = os.path.join(path2, text)
        input = codecs.open(input_text, 'r', 'utf8')
        output = codecs.open(output_text, 'w', 'utf8')
        tokens = input.read().split()
        for element in iter_elements(tokens):
            output.write('|'.join(element))
            output.write("\r\n")
        input.close()
        output.close()
Problem--IOError: [Errno 13] Permission denied
Hi all, After sludging my way through many obstacles with this interesting puzzle of a text-parsing program, I found myself with one final error:

    Traceback (most recent call last):
      File "C:\Python24\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
        exec codeObject in __main__.__dict__
      File "C:\Documents and Settings\Patrick Waldo\My Documents\Python\WORD\try5-2-file-1-all patterns.py", line 77, in ?
        input = codecs.open(input_text, 'r','utf8')
      File "C:\Python24\lib\codecs.py", line 666, in open
        file = __builtin__.open(filename, mode, buffering)
    IOError: [Errno 13] Permission denied: 'C:\\text_samples\\test\\output'

The error doesn't stop the program from functioning as it should, except that the last line of every document gets split with | in between the words, which is just strange. I have no idea why either is happening, but perhaps they are related. Any ideas?

#For text files in a directory...
#Analyzes a randomly organized UTF8 document with EINECS, CAS, Chemical, and Chemical Formula
#into a document structured as EINECS|CAS|Chemical|Chemical Formula.
    import os
    import codecs
    import re

    path = "C:\\text_samples\\test\\"
    path2 = "C:\\text_samples\\test\\output\\"

    EINECS = re.compile(r'^\d\d\d-\d\d\d-\d$')
    FORMULA = re.compile(r'([A-Z][a-zA-Z0-9]*\.?[A-Za-z0-9]*/?[A-Za-z0-9]*)')
    FALSE_POS = re.compile(r'^[A-Z][a-z]{4,40}\)?\.?')
    FALSE_POS1 = re.compile(r'C\.I\..*')
    FALSE_POS2 = re.compile(r'vit.*')
    FALSE_NEG = re.compile(r'C\d+\.')

    def iter_elements(tokens):
        product = []
        for tok in tokens:
            if EINECS.match(tok) and len(product) >= 3:
                match = re.match(FORMULA, product[-1])
                match_false_pos = re.match(FALSE_POS, product[-1])
                match_false_pos1 = re.match(FALSE_POS1, product[-1])
                match_false_pos2 = re.match(FALSE_POS2, product[2])
                match_false_neg = re.match(FALSE_NEG, product[-1])
                if match_false_neg:
                    product[2:-1] = [' '.join(product[2:])]
                    del product[-1]
                    yield product
                    product = []
                elif match_false_pos:
                    product[2:-1] = [' '.join(product[2:])]
                    del product[-1]
                    yield product
                    product = []
                elif match:
                    product[2:-1] = [' '.join(product[2:-1])]
                    yield product
                    product = []
                elif match_false_pos1 or match_false_pos2:
                    product[2:-1] = [' '.join(product[2:])]
                    del product[-1]
                    yield product
                    product = []
                else:
                    product[2:-1] = [' '.join(product[2:])]
                    del product[-1]
                    yield product
                    product = []
            product.append(tok)
        yield product

    for text in os.listdir(path):
        input_text = os.path.join(path, text)
        output_text = os.path.join(path2, text)
        input = codecs.open(input_text, 'r', 'utf8')
        output = codecs.open(output_text, 'w', 'utf8')
        tokens = input.read().split()
        for element in iter_elements(tokens):
            output.write('|'.join(element))
            output.write("\r\n")
        input.close()
        output.close()
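For what it's worth, one plausible cause of the Errno 13 (judging from the paths in the traceback, where the output folder lives inside the folder being listed): os.listdir() returns subdirectory names too, so the loop eventually hands the 'output' directory itself to codecs.open(), and opening a directory fails. A minimal sketch of filtering the listing down to regular files, using a temporary layout invented for illustration:

```python
import os
import tempfile

# Invented layout: an input folder whose output subfolder lives
# inside it, like the paths in the post.
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, 'output'))
with open(os.path.join(root, 'a.txt'), 'w') as f:
    f.write('data')

# Keep only regular files before opening anything.
names = [n for n in os.listdir(root)
         if os.path.isfile(os.path.join(root, n))]
print(names)  # ['a.txt']; the 'output' directory is skipped
```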
Sorting Countries by Region
Hi all, I'm analyzing some data that has a lot of country data. What I need to do is sort through this data and output it into an Excel doc with summary information. The countries, though, need to be sorted by region, but the way I thought I could do it isn't quite working out. So far I can only successfully get the data alphabetically. Any ideas?

    import xlrd
    import pyExcelerator

    def get_countries_list(list):
        countries_list = []
        for country in countries:
            if country not in countries_list:
                countries_list.append(country)

    EU = ["Austria", "Belgium", "Cyprus", "Czech Republic", "Denmark", "Estonia", "Finland"]
    NA = ["Canada", "United States"]
    AP = ["Australia", "China", "Hong Kong", "India", "Indonesia", "Japan"]
    Regions_tot = {'European Union':EU, 'North America':NA, 'Asia Pacific':AP,}

    path_file = "c:\\1\country_data.xls"
    book = xlrd.open_workbook(path_file)
    Counts = book.sheet_by_index(1)
    countries = Counts.col_values(0, start_rowx=1, end_rowx=None)
    get_countries_list(countries)

    wb = pyExcelerator.Workbook()
    matrix = wb.add_sheet("matrix")
    n = 1
    for country in unique_countries:
        matrix.write(n, 1, country)
        n = n + 1
    wb.save('c:\\1\\matrix.xls')
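One way to get a region-then-alphabetical order is to invert the region lists into a country-to-region lookup and pass a tuple key to sorted(). This is a sketch using trimmed versions of the lists above, not the poster's code:

```python
# Trimmed region lists from the post, for illustration.
EU = ["Austria", "Belgium", "Denmark"]
NA = ["Canada", "United States"]
AP = ["Australia", "China", "Japan"]
regions = {'European Union': EU, 'North America': NA, 'Asia Pacific': AP}

# Invert to country -> region, then sort on (region, country).
region_of = {c: r for r, cs in regions.items() for c in cs}
countries = ['Canada', 'Austria', 'Japan', 'Belgium']
ordered = sorted(countries, key=lambda c: (region_of[c], c))
print(ordered)  # ['Japan', 'Austria', 'Belgium', 'Canada']
```

Here regions sort by their names ('Asia Pacific' < 'European Union' < 'North America'); a fixed region order could be imposed with a region -> rank dict instead.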
Re: Fwd: Sorting Countries by Region
Great, this is very helpful. I'm new to Python, so hence the inefficient or nonsensical code!

> 2) I would suggest using countries.sort(...) or sorted(countries,...),
> specifying cmp or key options to sort by region instead.

I don't understand how to do this. The countries.sort() lists alphabetically, and I tried to do a lambda x,y: cmp() type function, but it doesn't sort correctly. Help with that?

For martyw's example, I don't need to get any sort of population info. I'm actually getting the number of various types of documents. So the entry is like this:

    Argentina  Food and Consumer Products  Food Additives  Color Additives       1
    Argentina  Food and Consumer Products  Food Additives  Flavors               1
    Argentina  Food and Consumer Products  Food Additives  General               6
    Argentina  Food and Consumer Products  Food Additives  labeling              1
    Argentina  Food and Consumer Products  Food Additives  Prohibited Additives  1
    Argentina  Food and Consumer Products  Food Contact    Cellulose             1
    Argentina  Food and Consumer Products  Food Contact    Food Packaging        1
    Argentina  Food and Consumer Products  Food Contact    Plastics              4
    Argentina  Food and Consumer Products  Food Contact    Waxes                 1
    Belize     etc...

So I'll need to add up all the entries for Food Additives and Food Contact; the other info like Color Additives isn't important. So I will have an output like this:

               Food Additives  Food Contact
    Argentina  10              7
    Belize     etc...

Thanks so much for the help!
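The adding-up step described here can be done with a plain dictionary tally. A sketch over invented rows shaped like the entries above (country, topic, count):

```python
from collections import defaultdict

# Invented rows shaped like the entries above.
rows = [
    ('Argentina', 'Food Additives', 1),
    ('Argentina', 'Food Additives', 6),
    ('Argentina', 'Food Contact', 4),
    ('Argentina', 'Food Contact', 1),
    ('Belize', 'Food Additives', 2),
]

totals = defaultdict(int)           # missing keys start at 0
for country, topic, n in rows:
    totals[(country, topic)] += n

print(totals[('Argentina', 'Food Additives')])  # 7
print(totals[('Argentina', 'Food Contact')])    # 5
```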
Re: Fwd: Sorting Countries by Region
This is how I solved it last night, in my inefficient sort of way, after re-reading some of my Python books on dictionaries. So far this gets the job done. However, I'd like to test whether there are any countries in the Excel input that are not represented, i.e. that the input is all the information I have and the dictionary functions as the information I expect. What I did worked yesterday, but doesn't work anymore...see comment. Otherwise I tried doing this:

    for i, country in countries_list:
        if country in REGIONS_COUNTRIES['European Union']:
            matrix.write(i+2, 1, country)

but I got "ValueError: too many values to unpack". Again, this has been a great help. Any ideas of how I can make this a bit more efficient, as I'm dealing with 5 regions and numerous countries, would be greatly appreciated. Here's the code:

    #keeping all the countries short
    REGIONS_COUNTRIES = {'European Union': ["Austria", "Belgium", "France", "Germany", "Greece"],
                         'North America': ["Canada", "United States"]}

    path_file = "c:\\1\\build\\data\\matrix2\\Update_oct07a.xls"
    book = xlrd.open_workbook(path_file)
    Counts = book.sheet_by_index(1)

    wb = pyExcelerator.Workbook()
    matrix = wb.add_sheet("matrix")

    countries = Counts.col_values(0, start_rowx=1, end_rowx=None)
    countries_list = list(set(countries))
    countries_list.sort()

    #This seems to not work today and I don't know why
    #for country in countries_list:
        #if country not in REGIONS_COUNTRIES['European Union'] or not in REGIONS_COUNTRIES['North America']:
            #print "%s is not in the expected list", country

    #This sorts well
    n = 2
    for country in countries_list:
        if country in REGIONS_COUNTRIES['European Union']:
            matrix.write(n, 1, country)
            n = n + 1
    for country in countries_list:
        if country in REGIONS_COUNTRIES['North America']:
            matrix.write(n, 1, country)
            n = n + 1
    wb.save('c:\\1\\matrix.xls')

On Nov 17, 1:12 am, "Sergio Correia" <[EMAIL PROTECTED]> wrote:
> About the sort:
>
> Check this (also on http://pastebin.com/f12b5b6ca)
>
> def make_regions():
>
>     # Values you provided
>     EU = ["Austria", "Belgium", "Cyprus", "Czech Republic",
>           "Denmark", "Estonia", "Finland"]
>     NA = ["Canada", "United States"]
>     AP = ["Australia", "China", "Hong Kong", "India", "Indonesia",
>           "Japan"]
>     regions = {'European Union': EU, 'North America': NA, 'Asia Pacific': AP}
>
>     ans = {}
>     for reg_name, reg in regions.items():
>         for cou in reg:
>             ans[cou] = reg_name
>     return ans
>
> def cmp_region(cou1, cou2):
>     ans = cmp(regions[cou1], regions[cou2])
>     if ans:  # If the region is the same, sort by country
>         return cmp(cou1, cou2)
>     else:
>         return ans
>
> regions = make_regions()
> some_countries = ['Austria', 'Canada', 'China', 'India']
>
> print 'Old:', some_countries
> some_countries.sort(cmp_region)
> print 'New:', some_countries
>
> Why that code?
> Because the first thing I want is a dictionary where the key is the
> name of the country and the value is the region. Then, I just make a
> quick function that compares considering the region and country.
> Finally, I sort.
>
> Btw, the code is just a quick hack, as it can be improved -a lot-.
>
> About the rest of your code:
> - martyw's example is much more useful than you think. Why? Because
> you can just iterate across your document, adding the values you get
> to the adequate object property. That is, instead of using size or
> pop, use the variables you are interested in.
>
> Best, and good luck with python,
> Sergio
>
> On Nov 16, 2007 5:15 PM, <[EMAIL PROTECTED]> wrote:
> [snip]
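Two small fixes for the problems mentioned above, sketched with made-up lists: the ValueError comes from unpacking two variables from a plain list of strings, which enumerate() solves by supplying the index; and the commented-out membership check needs `not in` spelled out per container, joined with `and` (a country is unexpected only if it is in neither region):

```python
countries_list = ['Austria', 'Canada', 'France', 'Narnia']  # made up
REGIONS_COUNTRIES = {'European Union': ['Austria', 'France'],
                     'North America': ['Canada', 'United States']}

# enumerate() supplies the row index the bare unpacking was missing.
placed = []
for i, country in enumerate(countries_list):
    if country in REGIONS_COUNTRIES['European Union']:
        placed.append((i + 2, country))
print(placed)  # [(2, 'Austria'), (4, 'France')]

# Each 'not in' needs its own right-hand side.
unknown = [c for c in countries_list
           if c not in REGIONS_COUNTRIES['European Union']
           and c not in REGIONS_COUNTRIES['North America']]
print(unknown)  # ['Narnia']
```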
Yet Another Tabular Data Question
Hi all, Fairly new Python guy here. I am having a lot of trouble trying to figure this out. I have some data on some regulations in Excel, and I need to basically add up the total regulations for each country--a statistical analysis thing that I'll copy to another Excel file. Writing with pyExcelerator has been easier than reading with xlrd for me...So that's what I did first, but now I'd like to learn how to crunch some data. The input looks like this:

    Country    Module                      Topic           # of Docs
    Argentina  Food and Consumer Products  Cosmetics       1
    Argentina  Food and Consumer Products  Cosmetics       8
    Argentina  Food and Consumer Products  Food Additives  1
    Argentina  Food and Consumer Products  Food Additives  1
    Australia  Food and Consumer Products  Drinking Water  7
    Australia  Food and Consumer Products  Food Additives  3
    Australia  Food and Consumer Products  Food Additives  1
    etc...

So I need to add up all the docs for Argentina, Australia, etc., and add up the total amount for each Topic for each country, so Argentina has 9 Cosmetics laws and 2 Food Additives laws, etc. So, here is the reduced code that can't add anything...Any thoughts would be really helpful.

    import xlrd
    import pyExcelerator
    from pyExcelerator import *

    #Open Excel files for reading and writing
    path_file = "c:\\1\\data.xls"
    book = xlrd.open_workbook(path_file)
    Counts = book.sheet_by_index(1)
    wb = pyExcelerator.Workbook()
    matrix = wb.add_sheet("matrix")

    #Get all Excel data
    n = 1
    data = []
    while n
Pivot Table/Groupby/Sum question
Hi all, I tried reading http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/334695 on the same subject, but it didn't work for me. I'm trying to learn how to make pivot tables from some Excel sheets, and I am trying to abstract this into a simple sort of example. Essentially I want to take input data like this:

    Name  Time of day  Amount
    Bob   Morn         240
    Bob   Aft          300
    Joe   Morn         70
    Joe   Aft          80
    Jil   Morn         100
    Jil   Aft          150

And output it as:

    Name   Total  Morning  Afternoon
    Bob    540    240      300
    Joe    150    70       80
    Jil    250    100      150
    Total  940    410      530

Writing the output part is the easy part. However, I have a couple problems.

1) Grouping by name seems to work perfectly, but grouping by time does not, i.e. I will get:

    Bob
        240
        300
    Joe
        70
        80
    Jil
        100
        150

which is great, but...

    Morn
        240
    Aft
        300
    Morn
        70
    Aft
        80
    Morn
        100
    Aft
        150

and not

    Morn
        240
        70
        100
    Aft
        300
        80
        150

2) I can't figure out how to sum these values because of the iteration. I always get an error like: TypeError: iteration over non-sequence

Here's the code:

    from itertools import groupby

    data = [['Bob', 'Morn', 240], ['Bob', 'Aft', 300], ['Joe', 'Morn', 70],
            ['Joe', 'Aft', 80], ['Jil', 'Morn', 100], ['Jil', 'Aft', 150]]

    NAME, TIME, AMOUNT = range(3)

    for k, g in groupby(data, key=lambda r: r[NAME]):
        print k
        for record in g:
            print "\t", record[AMOUNT]

    for k, g in groupby(data, key=lambda r: r[TIME]):
        print k
        for record in g:
            print "\t", record[AMOUNT]

Thanks for any comments
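A sketch of the sorting point that comes up in the replies: groupby() only merges adjacent equal keys, so grouping by time needs the data sorted on that key first. Names only appeared to work because equal names happened to be adjacent in the input:

```python
from itertools import groupby

data = [['Bob', 'Morn', 240], ['Bob', 'Aft', 300], ['Joe', 'Morn', 70],
        ['Joe', 'Aft', 80], ['Jil', 'Morn', 100], ['Jil', 'Aft', 150]]
NAME, TIME, AMOUNT = range(3)

# groupby() starts a new group at every key change, so sort first.
by_time = sorted(data, key=lambda r: r[TIME])
sums = {}
for k, g in groupby(by_time, key=lambda r: r[TIME]):
    sums[k] = sum(r[AMOUNT] for r in g)
print(sums)
```

This yields 530 for 'Aft' and 410 for 'Morn', matching the column totals in the desired output.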
Re: Pivot Table/Groupby/Sum question
On Dec 27, 10:59 pm, John Machin <[EMAIL PROTECTED]> wrote:
> On Dec 28, 4:56 am, [EMAIL PROTECTED] wrote:
>
> > from itertools import groupby
>
> You seem to have overlooked this important sentence in the
> documentation: "Generally, the iterable needs to already be sorted on
> the same key function"

Yes, but I imagine this shouldn't prevent me from using and manipulating the data. It also doesn't explain why the names get sorted correctly and the time does not. I was trying to do this:

    count_tot = []
    for k, g in groupby(data, key=lambda r: r[NAME]):
        for record in g:
            count_tot.append((k, record[SALARY]))
    for i in count_tot:
        # here I want to add all the numbers for each person,
        # but I'm missing something

If you have any ideas about how to solve this pivot table issue, which seems to be scant on Google, I'd much appreciate it. I know I can do this in Excel easily with the automated wizard, but I want to know how to do it myself and format it to my needs.
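The missing summing step can be a dictionary keyed by name. A sketch over pairs shaped like count_tot (the sample amounts are taken from the earlier example data, not the poster's real sheet):

```python
# Pairs shaped like count_tot: (name, amount).
count_tot = [('Bob', 240), ('Bob', 300), ('Joe', 70), ('Joe', 80)]

totals = {}
for name, amount in count_tot:
    # dict.get supplies 0 the first time a name is seen.
    totals[name] = totals.get(name, 0) + amount
print(totals)  # {'Bob': 540, 'Joe': 150}
```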
Re: Pivot Table/Groupby/Sum question
Wow, I did not realize it would be this complicated! I'm fairly new to Python, and somehow I thought I could find a simpler solution. I'll have to mull over this for a bit to fully understand how it works. Thanks a lot!

On Dec 28, 4:03 am, John Machin <[EMAIL PROTECTED]> wrote:
> On Dec 28, 11:48 am, John Machin <[EMAIL PROTECTED]> wrote:
>
> > On Dec 28, 10:05 am, [EMAIL PROTECTED] wrote:
>
> > > If you have any ideas about how to solve this pivot table issue, which
> > > seems to be scant on Google, I'd much appreciate it. I know I can do
> > > this in Excel easily with the automated wizard, but I want to know how
> > > to do it myself and format it to my needs.
>
> > Watch this space.
>
> Tested as much as you see:
>
> 8<---
> class SimplePivotTable(object):
>
>     def __init__(
>         self,
>         row_order=None, col_order=None,  # see example
>         missing=0,  # what to return for an empty cell. Alternatives: '', 0.0, None, 'NULL'
>         ):
>         self.row_order = row_order
>         self.col_order = col_order
>         self.missing = missing
>         self.cell_dict = {}
>         self.row_total = {}
>         self.col_total = {}
>         self.grand_total = 0
>         self.headings_OK = False
>
>     def add_item(self, row_key, col_key, value):
>         self.grand_total += value
>         try:
>             self.col_total[col_key] += value
>         except KeyError:
>             self.col_total[col_key] = value
>         try:
>             self.cell_dict[row_key][col_key] += value
>             self.row_total[row_key] += value
>         except KeyError:
>             try:
>                 self.cell_dict[row_key][col_key] = value
>                 self.row_total[row_key] += value
>             except KeyError:
>                 self.cell_dict[row_key] = {col_key: value}
>                 self.row_total[row_key] = value
>
>     def _process_headings(self):
>         if self.headings_OK:
>             return
>         self.row_headings = self.row_order or list(sorted(self.row_total.keys()))
>         self.col_headings = self.col_order or list(sorted(self.col_total.keys()))
>         self.headings_OK = True
>
>     def get_col_headings(self):
>         self._process_headings()
>         return self.col_headings
>
>     def generate_row_info(self):
>         self._process_headings()
>         for row_key in self.row_headings:
>             row_dict = self.cell_dict[row_key]
>             row_vals = [row_dict.get(col_key, self.missing) for col_key in self.col_headings]
>             yield row_key, self.row_total[row_key], row_vals
>
>     def get_col_totals(self):
>         self._process_headings()
>         row_dict = self.col_total
>         row_vals = [row_dict.get(col_key, self.missing) for col_key in self.col_headings]
>         return self.grand_total, row_vals
>
> if __name__ == "__main__":
>
>     data = [
>         ['Bob', 'Morn', 240],
>         ['Bob', 'Aft', 300],
>         ['Joe', 'Morn', 70],
>         ['Joe', 'Aft', 80],
>         ['Jil', 'Morn', 100],
>         ['Jil', 'Aft', 150],
>         ['Bob', 'Aft', 40],
>         ['Bob', 'Aft', 5],
>         ['Dozy', 'Aft', 1],  # Dozy doesn't show up till lunch-time
>         ]
>     NAME, TIME, AMOUNT = range(3)
>
>     print
>     ptab = SimplePivotTable(
>         col_order=['Morn', 'Aft'],
>         missing='uh-oh',
>         )
>     for s in data:
>         ptab.add_item(row_key=s[NAME], col_key=s[TIME], value=s[AMOUNT])
>     print ptab.get_col_headings()
>     for x in ptab.generate_row_info():
>         print x
>     print 'Tots', ptab.get_col_totals()
> 8<---
Re: Pivot Table/Groupby/Sum question
Petr, thanks for the SQL suggestion, but I'm having enough trouble in Python. John, would you mind walking me through your class in normal speak? I only have a vague idea of why it works, and this would help me a lot to get a grip on classes and this sort of particular problem. The next step is to imagine there was another variable, like departments, and add up the information by name, department, and time, and so on...that will come another day. Thanks.

On Dec 29, 1:00 am, John Machin <[EMAIL PROTECTED]> wrote:
> On Dec 29, 9:58 am, [EMAIL PROTECTED] wrote:
>
> > What about letting SQL work for you?
>
> The OP is "trying to learn how to make pivot tables from some excel
> sheets". You had better give him a clue on how to use ODBC on an
> "excel sheet" :-)
>
> [snip]
>
> > SELECT
> >     NAME,
> >     sum(AMOUNT) as TOTAL,
> >     sum(case when (TIME_OF_DAY) = 'Morn' then AMOUNT else 0 END) as MORN,
> >     sum(case when (TIME_OF_DAY) = 'Aft' then AMOUNT else 0 END) as AFT
>
> This technique requires advance knowledge of what the column key
> values are (the hard-coded 'Morn' and 'Aft').
>
> It is the sort of thing that one sees when %SQL% is the *only*
> language used to produce end-user reports. Innocuous when there are
> only 2 possible columns, but bletchworthy when there are more than 20
> and the conditions are complex and the whole thing is replicated
> several times in the %SQL% script because either %SQL% doesn't support
> temporary procedures/functions or the BOsFH won't permit their use...
> not in front of the newbies, please!
Re: Pivot Table/Groupby/Sum question
On Dec 29, 3:00 pm, [EMAIL PROTECTED] wrote:
> Patrick,
>
> in your first posting you wrote "... I'm trying to learn how to
> make pivot tables from some excel sheets...". Can you be more specific,
> please? AFAIK Excel offers very good support for pivot tables, so why
> read tabular data from the Excel sheet and then transform it to a
> pivot table in Python?
>
> Petr

Yes, I realize Excel has excellent support for pivot tables. However, I hate how Excel does it, and, for my particular Excel files, I need them to be formatted in an automated way, because I will have a number of them over time, and I'd prefer to have Python do it in a flash rather than doing it every time in Excel.

> It's about time you got a *concrete* idea of how something works.

Absolutely right. I tend to take on ideas that I'm not ready for, in the sense that I only started using Python some months ago for some basic tasks, and now I'm trying some more complicated ones. With time, though, I will get a concrete idea of what python.exe does; but, for someone who studied art history and not comp sci, I'm doing my best to get a handle on all of it. I think a pad of paper might be a good way to visualize it.
Re: Pivot Table/Groupby/Sum question
Sorry for the delay in my response. New Year's Eve and moving apartment.

> - Where do the data come from (I mean: are your data in Excel already
> when you get them)?
> - If your primary source of data is the Excel file, how do you read
> data from the Excel file into Python (I mean, did you solve this part
> of the task already)?

Yes, the data comes from Excel, and I use xlrd and pyExcelerator to read and write, respectively.

    #open for reading
    path_file = "c:\\1\\data.xls"
    book = xlrd.open_workbook(path_file)
    Counts = book.sheet_by_index(1)

    #get data
    n = 1
    data = []
    while n
Re: Pivot Table/Groupby/Sum question
Yes, in the sense that the top part will have merged cells so that Horror and Classics don't need to be repeated every time, but the headers aren't the important part. At this point I'm more interested in organizing the data itself, and I can worry about putting it into a new Excel file later.
Re: Pivot Table/Groupby/Sum question
Petr, thanks so much for your input. I'll try to learn SQL, especially if I'll be doing a lot of database work. I tried to do it John's way as an exercise, and I'm happy to say I understand a lot more. Basically I didn't realize I could nest dictionaries like db = {country:{genre:{sub_genre:3}}} and call them like db[country][genre][sub_genre]. The Python Cookbook was quite helpful for figuring out why items needed to be added the way they did. Also, using the structure of the dictionary was a conceptually easier solution than what I found on http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/334695. So, now I need to work on writing it to Excel. I'll update with the final code. Thanks again.

    #Movie Store Example
    class PivotData:
        def __init__(self):
            self.total_mov = 0
            self.total_cou = {}
            self.total_gen = {}
            self.total_sub = {}
            self.total_cou_gen = {}
            self.db = {}

        def add_data(self, country, genre, sub_genre, value):
            self.total_mov += value
            try:
                self.total_cou[country] += value
            except KeyError:
                self.total_cou[country] = value
            try:
                self.total_gen[genre] += value
            except:
                self.total_gen[genre] = value
            try:
                self.total_sub[sub_genre] += value
            except:
                self.total_sub[sub_genre] = value
            try:
                self.total_cou_gen[country][genre] += value
            except KeyError:
                try:
                    self.total_cou_gen[country][genre] = value
                except KeyError:
                    self.total_cou_gen[country] = {genre: value}
            try:
                self.db[country][genre][sub_genre] += value
            except KeyError:
                try:
                    self.db[country][genre][sub_genre] = value
                except KeyError:
                    try:
                        self.db[country][genre] = {sub_genre: value}
                    except:
                        self.db[country] = {genre: {sub_genre: value}}

    data = [['argentina', 'Horror', 'Slasher', 4],
            ['argentina', 'Horror', 'Halloween', 6],
            ['argentina', 'Drama', 'Romance', 5],
            ['argentina', 'Drama', 'Romance', 1],
            ['argentina', 'Drama', 'True Life', 1],
            ['japan', 'Classics', 'WWII', 1],
            ['japan', 'Cartoons', 'Anime', 1],
            ['america', 'Comedy', 'Stand-Up', 1],
            ['america', 'Cartoons', 'WB', 10],
            ['america', 'Cartoons', 'WB', 3]]

    COUNTRY, GENRE, SUB_GENRE, VALUE = range(4)

    x = PivotData()
    for s in data:
        x.add_data(s[COUNTRY], s[GENRE], s[SUB_GENRE], s[VALUE])

    print
    print 'Total Movies:\n', x.total_mov
    print 'Total for each country\n', x.total_cou
    print 'Total Genres\n', x.total_gen
    print 'Total Sub Genres\n', x.total_sub
    print 'Total Genres for each Country\n', x.total_cou_gen
    print
    print x.db
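For comparison, the nested try/except ladders above can be collapsed with collections.defaultdict. This is an alternative sketch over a slice of the same movie data, not the poster's code:

```python
from collections import defaultdict

# Nested country -> genre -> sub_genre tallies, no KeyError handling.
db = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

data = [['argentina', 'Horror', 'Slasher', 4],
        ['argentina', 'Horror', 'Halloween', 6],
        ['america', 'Cartoons', 'WB', 10],
        ['america', 'Cartoons', 'WB', 3]]
for country, genre, sub_genre, value in data:
    db[country][genre][sub_genre] += value

print(db['america']['Cartoons']['WB'])  # 13
```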
pyExcelerator: writing multiple rows
Hi all, I was just curious whether there is a built-in or more efficient way to take multiple rows of information and write them into Excel using pyExcelerator. This is how I resolved the problem:

    from pyExcelerator import *

    data = [[1, 2, 3], [4, 5, 'a'], ['', 's'], [6, 7, 'g']]

    wb = pyExcelerator.Workbook()
    test = wb.add_sheet("test")

    c = 1
    r = 0
    while r
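A sketch of the nested-loop shape this usually takes. The `write` function here is a stand-in for pyExcelerator's `sheet.write(row, col, value)` so the snippet runs on its own; `enumerate` replaces the manual row/column counters:

```python
data = [[1, 2, 3], [4, 5, 'a'], ['', 's'], [6, 7, 'g']]

cells = []
def write(r, c, value):
    # Stand-in for sheet.write(row, col, value).
    cells.append((r, c, value))

# enumerate() yields (index, item), so no counters to maintain.
for r, row in enumerate(data):
    for c, value in enumerate(row):
        write(r, c, value)

print(len(cells))  # 11 cells written
```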
Re: joining strings question
I tried to make a simple abstraction of my problem, but it's probably better to get down to it. As for the funkiness of the data, I'm relatively new to Python, and I'm either not processing it well or it's because of BeautifulSoup. Basically, I'm using BeautifulSoup to strip the tables from the Federal Register (http://www.access.gpo.gov/su_docs/aces/fr-cont.html). So far my code strips the HTML and gets only the departments I'd like to see. Now I need to put it into an Excel file (with pyExcelerator) with the name of the record and the pdf. A snippet of my data from BeautifulSoup looks like this:

    ['Environmental Protection Agency', 'RULES',
     'Approval and Promulgation of Air Quality Implementation Plans:',
     'Illinois; Revisions to Emission Reduction Market System, ',
     '11042 [E8-3800]', 'E8-3800.pdf',
     'Ohio; Oxides of Nitrogen Budget Trading Program; Correction, ',
     '11192 [Z8-2506]', 'Z8-2506.pdf',
     'NOTICES',
     'Agency Information Collection Activities; Proposals, Submissions, and Approvals, ',
     '11108-0 [E8-3934]', 'E8-3934.pdf',
     'Data Availability for Lead National Ambient Air Quality Standard Review, ',
     '0-1 [E8-3935]', 'E8-3935.pdf',
     'Environmental Impacts Statements; Notice of Availability, ',
     '2 [E8-3917]', 'E8-3917.pdf']

What I'd like to see in Excel is this:

    'Approval and Promulgation of Air Quality Implementation Plans: Illinois; Revisions to Emission Reduction Market System, 11042 [E8-3800]' | 'E8-3800.pdf' | RULES
    'Ohio; Oxides of Nitrogen Budget Trading Program; Correction, 11192 [Z8-2506]' | 'Z8-2506.pdf' | RULES
    'Agency Information Collection Activities; Proposals, Submissions, and Approvals, 11108-0 [E8-3934]' | 'E8-3934.pdf' | NOTICES
    'Data Availability for Lead National Ambient Air Quality Standard Review, 0-1 [E8-3935]' | 'E8-3935.pdf' | NOTICES
    'Environmental Impacts Statements; Notice of Availability, 2 [E8-3917]' | 'E8-3917.pdf' | NOTICES

etc...for every department I want.
Now that I look at it I've got another problem, because 'Approval and Promulgation of Air Quality Implementation Plans:' should be joined to both Illinois and Ohio... I love finding these little inconsistencies! Once I get the data organized with all the titles joined together appropriately, outputting it to Excel should be relatively easy. So my problem is how to join these titles together. There are a couple of patterns. Every law is followed by a number, which is always followed by the pdf. Any ideas would be much appreciated. My code so far (excuse the ugliness):

    import urllib
    import re, codecs, os
    import pyExcelerator
    from pyExcelerator import *
    from BeautifulSoup import BeautifulSoup as BS

    #Get the url, make the soup, and get the table to be processed
    url = "http://www.access.gpo.gov/su_docs/aces/fr-cont.html"
    site = urllib.urlopen(url)
    soup = BS(site)
    body = soup('table')[1]
    tds = body.findAll('td')
    mess = []
    for td in tds:
        mess.append(str(td))

    spacer = re.compile(r'.*')
    data = []
    x = 0
    for n, t in enumerate(mess):
        if spacer.match(t):
            data.append(mess[x:n])
            x = n

    dept = re.compile(r'.*')
    title = re.compile(r'.*')
    title2 = re.compile(r'.*')
    none = re.compile(r'None')

    #Strip the html and organize by department
    group = []
    db_list = []
    for d in data:
        pre_list = []
        for item in d:
            if dept.match(item):
                dept_soup = BS(item)
                try:
                    dept_contents = dept_soup('a')[0]['name']
                    pre_list.append(str(dept_contents))
                except IndexError:
                    break
            elif title.match(item) or title2.match(item):
                title_soup = BS(item)
                title_contents = title_soup.td.string
                if none.match(str(title_contents)):
                    pre_list.append(str(title_soup('a')[0]['href']))
                else:
                    pre_list.append(str(title_contents))
            elif link.match(item):
                link_soup = BS(item)
                link_contents = link_soup('a')[1]['href']
                pre_list.append(str(link_contents))
        db_list.append(pre_list)

    for db in db_list:
        for n, dash_space in enumerate(db):
            dash_space = dash_space.replace('–', '-')
            dash_space = dash_space.replace(' ', ' ')
            db[n] = dash_space

    download = re.compile(r'http://.*')
    for db in db_list:
        for n, pdf in enumerate(db):
            if download.match(pdf):
                filename = re.split('http://.*/', pdf)
                db[n] = filename[1]

    #Strip out these departments
    AgrDep = re.compile(r'Agriculture Department')
    EPA = re.compile(r'Environmental Protection Agency')
    FDA = re.compile(r'Food and Drug Administration')
    key_data = []
    for list in db_list:
        for db in list:
            if AgrDep.match(db) or EPA.match(db) or FDA.match(db):
                key_data.append(list)

    #Get appropriate links from covered departments as well
    LINK = re.compile(r'^#.*')
    links = []
    for kd in key_data:
        for item in kd:
            if LINK.match(item):
                links.append(item[1:])

    for list in db_list:
joining strings question
Hi all, I have some data with categories, titles, subtitles, and a link to their pdf, and I need to join the title and the subtitle for every file and divide them into their separate groups. The data comes in like this:

data = ['RULES', 'title', 'subtitle', 'pdf', 'title1', 'subtitle1', 'pdf1', 'NOTICES', 'title2', 'subtitle2', 'pdf', 'title3', 'subtitle3', 'pdf']

What I'd like to see is this:

['RULES', 'title subtitle', 'pdf', 'title1 subtitle1', 'pdf1'], ['NOTICES', 'title2 subtitle2', 'pdf', 'title3 subtitle3', 'pdf'], etc...

I've racked my brain for a while about this and I can't seem to figure it out. Any ideas would be much appreciated. Thanks
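One way to think about it: the category markers ('RULES', 'NOTICES') each start a new group, and within a group the items repeat in a fixed title/subtitle/pdf rhythm, so plain index arithmetic is enough. A minimal sketch along those lines (assuming every group really does follow that strict three-item pattern, which the follow-up posts show is not always true of the real data):

```python
data = ['RULES', 'title', 'subtitle', 'pdf',
        'title1', 'subtitle1', 'pdf1',
        'NOTICES', 'title2', 'subtitle2', 'pdf',
        'title3', 'subtitle3', 'pdf']

CATEGORIES = {'RULES', 'NOTICES'}

# Pass 1: split the flat list into one sublist per category
groups = []
for item in data:
    if item in CATEGORIES:
        groups.append([item])
    else:
        groups[-1].append(item)

# Pass 2: inside each group, merge each title/subtitle pair and keep the pdf
result = []
for g in groups:
    merged = [g[0]]
    for i in range(1, len(g), 3):
        merged.append(g[i] + ' ' + g[i + 1])  # title + subtitle
        merged.append(g[i + 2])               # pdf
    result.append(merged)

print(result[0])  # ['RULES', 'title subtitle', 'pdf', 'title1 subtitle1', 'pdf1']
```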
Re: joining strings question
> def category_iterator(source):
>     source = iter(source)
>     try:
>         while True:
>             item = source.next()

This gave me a lot of inspiration. After a couple of days of banging my head against the wall, I finally figured out code that could attach headers, titles, numbers, and categories in their appropriate combinations: basically one BIG logic puzzle. It's not the prettiest thing in the world, but it works. If anyone has a better way to do it, I'm all ears. Anyway, thank you all for your input; it helped me think outside the box.

    import re

    data = ['RULES', 'Approval and Promulgation of Air Quality Implementation Plans:', 'Illinois; Revisions to Emission Reduction Market System, ', '11042 [E8-3800]', 'E8-3800.pdf', 'Ohio; Oxides of Nitrogen Budget Trading Program; Correction, ', '11192 [Z8-2506]', 'Z8-2506.pdf', 'NOTICES', 'Agency Information Collection Activities; Proposals, Submissions, and Approvals, ', '11108-0 [E8-3934]', 'E8-3934.pdf', 'Data Availability for Lead National Ambient Air Quality Standard Review, ', '0-1 [E8-3935]', 'E8-3935.pdf', 'Environmental Impacts Statements; Notice of Availability, ', '2 [E8-3917]', 'E8-3917.pdf']

    NOTICES = re.compile(r'NOTICES')
    RULES = re.compile(r'RULES')
    TITLE = re.compile(r'[A-Z][a-z].*')
    NUM = re.compile(r'\d.*')
    PDF = re.compile(r'.*\.pdf')

    counted = []
    sorted = []
    title = []
    tot = len(data)
    x = 0
    while x < tot:
        try:
            item = data[x]
            title = []
            if NOTICES.match(item) or RULES.match(item):
                module = item
                header = ''
                if TITLE.match(data[x+1]) and TITLE.match(data[x+2]) and NUM.match(data[x+3]):
                    #Header
                    header = data[x+1]
                    counted.append(data[x+1])
                    sorted.append(data[x+1])
                    #Title
                    counted.append(data[x+2])
                    sorted.append(data[x+2])
                    #Number
                    counted.append(data[x+3])
                    sorted.append(data[x+3])
                    title.append(''.join(sorted))
                    print title, module
                    print
                    sorted = []
                    x += 1
                elif TITLE.match(data[x+1]) and NUM.match(data[x+2]):
                    #Title
                    counted.append(data[x+1])
                    sorted.append(data[x+1])
                    #Number
                    counted.append(data[x+2])
                    sorted.append(data[x+2])
                    title.append(''.join(sorted))
                    print title, module
                    print
                    sorted = []
                    x += 1
                else:
                    print item, "strange1"
                    break
                x += 1
            else:
                if item in counted:
                    x += 1
                elif PDF.match(item):
                    x += 1
                elif TITLE.match(data[x]) and TITLE.match(data[x+1]) and NUM.match(data[x+2]):
                    #Header
                    header = data[x]
                    counted.append(data[x])
                    sorted.append(data[x])
                    #Title
                    counted.append(data[x+1])
                    sorted.append(data[x+1])
                    #Number
                    counted.append(data[x+2])
                    sorted.append(data[x+2])
                    title.append(''.join(sorted))
                    sorted = []
                    print title, module
                    print
                    x += 1
                elif TITLE.match(data[x]) and NUM.match(data[x+1]):
                    #Title
                    sorted.append(header)
                    counted.append(data[x])
                    sorted.append(data[x])
                    #Number
                    counted.append(data[x+1])
                    sorted.append(data[x+1])
                    title.append(''.join(sorted))
                    sorted = []
                    print title, module
                    print
                    x += 1
                else:
                    print item, "strange2"
                    x += 1
                    break
        except IndexError:
            break
UnicodeDecodeError quick question
Hi Everyone, I am using Python 2.4 and I am converting an Excel spreadsheet to a pipe-delimited text file; some of the cells contain UTF-8 characters. I solved this problem in a very unintuitive way and I wanted to ask why. If I do

    csvfile.write(cell.encode("utf-8"))

I get a UnicodeDecodeError. However, if I do

    c = unicode(cell.encode("utf-8"), "utf-8")
    csvfile.write(c)

it works. Why should I have to encode the cell to UTF-8 and then make it unicode in order to write to a text file? Is there a more intuitive way to get around these bothersome unicode errors? Thanks for any advice, Patrick

Code:

    # -*- coding: utf-8 -*-
    import xlrd, codecs, os

    xls_file = "/home/pwaldo2/work/docpool_plone/2008-12-4/EU-2008-12-4.xls"
    book = xlrd.open_workbook(xls_file)
    bibliography_sheet = book.sheet_by_index(0)
    csv = os.path.split(xls_file)[0] + '/' + os.path.split(xls_file)[1][:-4] + '.csv'
    csvfile = codecs.open(csv, 'w', encoding='utf-8')
    rowcount = 0
    data = []
    while rowcount
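For what it's worth, the usual advice in Python 2 is to keep text as unicode inside the program and let the codecs wrapper do the encoding on the way out: writing an already-encoded byte string to a file opened with codecs.open triggers an implicit ASCII decode, which is where the UnicodeDecodeError comes from. A minimal sketch of the decode-once-then-write pattern (the file path and sample bytes here are made up for illustration):

```python
import codecs, os, tempfile

# Pretend these bytes came from a spreadsheet cell: UTF-8 for u'café'
raw = b'caf\xc3\xa9'

# Decode bytes -> unicode once, up front...
text = raw.decode('utf-8')

# ...then write unicode; codecs.open encodes it to UTF-8 on the way out
path = os.path.join(tempfile.gettempdir(), 'unicode_demo.csv')
f = codecs.open(path, 'w', encoding='utf-8')
f.write(text + u'|next field\n')
f.close()
```

The key point is that encode and decode each happen exactly once, in one direction, instead of round-tripping through encode-then-unicode as in the post.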
xlrd cell background color
Hi all, I am trying to figure out a way to read colors with xlrd, but I did not understand the formatting.py module. Basically, I want to sort rows that are red or green. My initial attempt discovered that

>>> print cell
text:u'test1.txt' (XF:22)
text:u'test2.txt' (XF:15)
text:u'test3.txt' (XF:15)
text:u'test4.txt' (XF:15)
text:u'test5.txt' (XF:23)

So I thought that XF:22 represented my red highlighted row and XF:23 represented my green highlighted row. However, that was not always true. If one row is blank and I only highlighted one row, I got:

>>> print cell
text:u'test1.txt' (XF:22)
text:u'test2.txt' (XF:22)
text:u'test3.txt' (XF:22)
text:u'test4.txt' (XF:22)
text:u'test5.txt' (XF:22)
empty:'' (XF:15)
text:u'test6.txt' (XF:22)
text:u'test7.txt' (XF:23)

Now NoFill is XF:22! I am sure I am going about this the wrong way, but I just want to store filenames into a dictionary based on whether they are red or green. Any ideas would be much appreciated. My code is below.

Best, Patrick

    filenames = {}
    filenames.setdefault('GREEN', [])
    filenames.setdefault('RED', [])
    book = xlrd.open_workbook("/home/pwaldo2/work/workbench/Summary.xls", formatting_info=True)
    SumDoc = book.sheet_by_index(0)
    n = 1
    while n
        cell = SumDoc.cell(n, 5)
        print cell
        filename = str(cell)[7:-9]
        color = str(cell)[-3:-1]
        if color == '22':
            filenames['RED'].append(filename)
            n += 1
        elif color == '23':
            filenames['GREEN'].append(filename)
            n += 1
Re: xlrd cell background color
Thank you very much. I did not know there was a python-excel group, which I will certainly take note of in the future. The previous post answered my question, but I wanted to clarify the difference between xf.background.background_colour_index, xf.background.pattern_colour_index, and book.colour_map:

>>> color = xf.background.background_colour_index
>>> print color
60
60
60
65
65
65
49

60 = red and 49 = green

>>> color = xf.background.pattern_colour_index
>>> print color
10
10
10
64
64
64
11

10 = red, 11 = green

>>> print book.colour_map
{0: (0, 0, 0), 1: (255, 255, 255), 2: (255, 0, 0), 3: (0, 255, 0),
 4: (0, 0, 255), 5: (255, 255, 0), 6: (255, 0, 255), 7: (0, 255, 255),
 8: (0, 0, 0), 9: (255, 255, 255), 10: (255, 0, 0), 11: (0, 255, 0),
 12: (0, 0, 255), 13: (255, 255, 0), 14: (255, 0, 255), 15: (0, 255, 255),
 16: (128, 0, 0), 17: (0, 128, 0), 18: (0, 0, 128), 19: (128, 128, 0),
 20: (128, 0, 128), 21: (0, 128, 128), 22: (192, 192, 192), 23: (128, 128, 128),
 24: (153, 153, 255), 25: (153, 51, 102), 26: (255, 255, 204), 27: (204, 255, 255),
 28: (102, 0, 102), 29: (255, 128, 128), 30: (0, 102, 204), 31: (204, 204, 255),
 32: (0, 0, 128), 33: (255, 0, 255), 34: (255, 255, 0), 35: (0, 255, 255),
 36: (128, 0, 128), 37: (128, 0, 0), 38: (0, 128, 128), 39: (0, 0, 255),
 40: (0, 204, 255), 41: (204, 255, 255), 42: (204, 255, 204), 43: (255, 255, 153),
 44: (153, 204, 255), 45: (255, 153, 204), 46: (204, 153, 255), 47: (255, 204, 153),
 48: (51, 102, 255), 49: (51, 204, 204), 50: (153, 204, 0), 51: (255, 204, 0),
 52: (255, 153, 0), 53: (255, 102, 0), 54: (102, 102, 153), 55: (150, 150, 150),
 56: (0, 51, 102), 57: (51, 153, 102), 58: (0, 51, 0), 59: (51, 51, 0),
 60: (153, 51, 0), 61: (153, 51, 102), 62: (51, 51, 153), 63: (51, 51, 51),
 64: None, 65: None, 81: None, 32767: None}

After looking at the colors, OpenOffice says I am using 'light red' for the first 3 rows and 'light green' for the last one, so how the numbers change for the first two examples makes sense.
However, how the numbers change for book.colour_map does not make much sense to me, since the numbers change without an apparent pattern. Could you clarify?

Best, Patrick

Revised Code:

    import xlrd

    filenames = {}
    filenames.setdefault('GREEN', [])
    filenames.setdefault('RED', [])
    book = xlrd.open_workbook("/home/pwaldo2/work/workbench/Summary.xls", formatting_info=True)
    SumDoc = book.sheet_by_index(0)
    print book.colour_map
    n = 1
    while n

... wrote:
> On Aug 14, 6:03 am, [EMAIL PROTECTED] wrote in
> news:comp.lang.python thusly:
>
> > Hi all,
> >
> > I am trying to figure out a way to read colors with xlrd, but I did
> > not understand the formatting.py module.
>
> It is complicated, because it is digging out complicated info which
> varies in somewhat arbitrary fashion between the 5 (approx.) versions
> of Excel that xlrd handles. Sometimes I don't understand it, and I
> wrote it :-)
>
> What I do when I want to *use* the formatting info, however, is to
> read the xlrd documentation, and I suggest that you do the same. More
> details at the end.
>
> > Basically, I want to sort rows that are red or green. My initial
> > attempt discovered that
> > >>> print cell
> > text:u'test1.txt' (XF:22)
> > text:u'test2.txt' (XF:15)
> > text:u'test3.txt' (XF:15)
> > text:u'test4.txt' (XF:15)
> > text:u'test5.txt' (XF:23)
>
> > So, I thought that XF:22 represented my red highlighted row and XF:23
> > represented my green highlighted row. However, that was not always
> > true. If one row is blank and I only highlighted one row, I got:
> > >>> print cell
> > text:u'test1.txt' (XF:22)
> > text:u'test2.txt' (XF:22)
> > text:u'test3.txt' (XF:22)
> > text:u'test4.txt' (XF:22)
> > text:u'test5.txt' (XF:22)
> > empty:'' (XF:15)
> > text:u'test6.txt' (XF:22)
> > text:u'test7.txt' (XF:23)
>
> > Now NoFill is XF:22! I am sure I am going about this the wrong way,
> > but I just want to store filenames into a dictionary based on whether
> > they are red or green. Any ideas would be much appreciated. My code
> > is below.
>
> > Best,
> > Patrick
>
> > filenames = {}
> > filenames.setdefault('GREEN',[])
> > filenames.setdefault('RED',[])
>
> > book = xlrd.open_workbook("/home/pwaldo2/work/workbench/Summary.xls",formatting_info=True)
> > SumDoc = book.sheet_by_index(0)
>
> > n=1
> > while n
> >     cell = SumDoc.cell(n,5)
> >     print cell
> >     filename = str(cell)[7:-9]
> >     color = str(cell)[-3:-1]
> >     if color == '22':
> >         filenames['RED'].append(filename)
> >         n+=1
> >     elif color == '23':
> >         filenames['GREEN'].append(filename)
> >         n+=1
>
> 22 and 23 are not colours, they are indexes into a list of XFs
> (extended formats). The indexes after 16 have no fixed meaning, and as
> you found, if you add/subtract formatting features to your XLS file,
> the actual indexes used will change. Don't use str(cell). Use
> cell.xf_index.
>
> Here is your reading path through the docs, starting at "The Cell
> class":
> Cell.xf_index
> Book.xf_list
>
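Following the reading path in the reply above, the whole chain from a cell to an RGB triple can be sketched as one small helper. The attribute names (cell_xf_index, xf_list, background.pattern_colour_index, colour_map) are xlrd's documented ones, but the stub classes below only stand in for a real workbook so the sketch is self-contained:

```python
class _Stub(object):
    """Bare attribute holder standing in for xlrd's Book/Sheet/XF objects."""
    def __init__(self, **kw):
        self.__dict__.update(kw)

def cell_fill_rgb(book, sheet, rowx, colx):
    # cell -> XF record -> pattern colour index -> RGB triple (or None)
    xf = book.xf_list[sheet.cell_xf_index(rowx, colx)]
    return book.colour_map.get(xf.background.pattern_colour_index)

# Fake workbook: a single XF whose pattern colour index 10 maps to red
book = _Stub(
    xf_list=[_Stub(background=_Stub(pattern_colour_index=10))],
    colour_map={10: (255, 0, 0), 11: (0, 255, 0), 64: None},
)
sheet = _Stub(cell_xf_index=lambda rowx, colx: 0)

print(cell_fill_rgb(book, sheet, 0, 5))  # (255, 0, 0)
```

With a real workbook opened via xlrd.open_workbook(..., formatting_info=True), the same function could classify rows as red or green by comparing the returned RGB triple instead of the unstable XF index.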
xlrd and cPickle.dump/rows to list
Hi all, I have to work with a very large Excel file and I have two questions. First, the documentation says that cPickle.dump would be the best way to work with it. However, I keep getting:

Traceback (most recent call last):
  File "C:\Python24\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
    exec codeObject in __main__.__dict__
  File "C:\python_files\pickle_test.py", line 12, in ?
    cPickle.dump(book,wb.save(pickle_path))
  File "C:\Python24\lib\copy_reg.py", line 69, in _reduce_ex
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle file objects

I tried to use open(filename, 'w') as well as pyExcelerator (wb.save(pickle_path)) to create the pickle file, but neither worked. Any ideas would be much appreciated. Patrick
xlrd and cPickle.dump
Hi all, I have to work with a very large Excel file and I have two questions. First, the documentation says that cPickle.dump would be the best way to work with it. However, I keep getting:

Traceback (most recent call last):
  File "C:\Python24\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
    exec codeObject in __main__.__dict__
  File "C:\python_files\pickle_test.py", line 12, in ?
    cPickle.dump(book,wb.save(pickle_path))
  File "C:\Python24\lib\copy_reg.py", line 69, in _reduce_ex
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle file objects

I tried to use open(filename, 'w') as well as pyExcelerator (wb.save(pickle_path)) to create the pickle file, but neither worked. Secondly, I am trying to make an ID number from three columns of data: category | topic | sub_topic, so that I can . I imagine that a dictionary would be the best way to sort out the repeats
xlrd and cPickle.dump
Hi all, Sorry for the repeat; I needed to reformulate my question and had some problems... silly me. The xlrd documentation says:

"Pickleable. Default is true. In Python 2.4 or earlier, setting to false will cause use of array.array objects which save some memory but can't be pickled. In Python 2.5, array.arrays are used unconditionally. Note: if you have large files that you need to read multiple times, it can be much faster to cPickle.dump() the xlrd.Book object once, and use cPickle.load() multiple times."

I'm using Python 2.4 and I have an extremely large Excel file that I need to work with. The documentation leads me to believe that cPickle will be a more efficient option, but I am having trouble pickling the Excel file. So far, I have this:

    import cPickle, xlrd
    import pyExcelerator
    from pyExcelerator import *

    data_path = """C:\test.xls"""
    pickle_path = """C:\pickle.xls"""

    book = xlrd.open_workbook(data_path)
    Data_sheet = book.sheet_by_index(0)
    wb = pyExcelerator.Workbook()
    proc = wb.add_sheet("proc")

    #Neither of these work
    #1) pyExcelerator try
    #cPickle.dump(book, wb.save(pickle_path))
    #2) Normal pickle try
    #pickle_file = open(pickle_path, 'w')
    #cPickle.dump(book, pickle_file)
    #file.close()

Any ideas would be helpful. Otherwise, I won't pickle the Excel file and will deal with the lag time. Patrick
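If the Book object itself refuses to pickle (it can hold references to open files and modules, which the pickle protocol rejects), one workaround is to pickle the extracted cell values instead of the Book: plain lists round-trip fine. A sketch under that assumption; the sample rows are invented, and with a real sheet they would come from sheet.row_values(n):

```python
try:
    import cPickle as pickle  # Python 2, as in the original post
except ImportError:
    import pickle             # Python 3 fallback
import os, tempfile

# Stand-in for [sheet.row_values(n) for n in range(sheet.nrows)]
rows = [[u'book', u'non-fiction', u'biography'],
        [u'book', u'fiction', u'literature']]

path = os.path.join(tempfile.gettempdir(), 'rows.pickle')
f = open(path, 'wb')           # binary mode matters for pickle files
pickle.dump(rows, f, -1)       # -1 selects the highest protocol
f.close()

f = open(path, 'rb')
restored = pickle.load(f)
f.close()
print(restored == rows)  # True
```

Note the two details the thread converges on anyway: open the file in 'wb' (not 'w'), and pass a protocol argument.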
Re: xlrd and cPickle.dump
> How many megabytes is "extremely large"? How many seconds does it take
> to open it with xlrd.open_workbook?

The document is 15 MB and 50,000+ rows (for test purposes I will use a smaller sample), but my computer hangs (i.e. it takes a long time) when I try to do simple manipulations, and the documentation leads me to believe cPickle will be more efficient. If this is not true, then I don't have a problem (i.e. I just have to wait), but I would still like to figure out how to pickle an xlrd object anyway.

> You only need one of the above imports at the best of times, and for
> what you are attempting to do, you don't need pyExcelerator at all.

Using pyExcelerator was a guess, because the traditional way didn't work and I thought it might be because it's an Excel file. Secondly, I import it twice because sometimes, and I don't know why, PythonWin does not import pyExcelerator the first time. This has only been true with pyExcelerator.

> > data_path = """C:\test.xls"""
>
> It is extremely unlikely that you have a file whose basename begins with
> a TAB ('\t') character. Please post the code that you actually ran.

You're right; I had just quickly erased my Documents and Settings folder to make it smaller for an example.

> Please post the minimal pyExcelerator-free script that demonstrates your
> problem. Ensure that it includes the following line:
> import sys; print sys.version; print xlrd.__VERSION__
> Also post the output and the traceback (in full).

As to copy_reg.py, I downloaded ActiveState Python 2.4 and that was it, so I have had no other version on my computer.

Here's the code:

    import cPickle, xlrd, sys

    print sys.version
    print xlrd.__VERSION__

    data_path = """C:\\test\\test.xls"""
    pickle_path = """C:\\test\\pickle.pickle"""

    book = xlrd.open_workbook(data_path)
    Data_sheet = book.sheet_by_index(0)

    pickle_file = open(pickle_path, 'w')
    cPickle.dump(book, pickle_file)
    pickle_file.close()

Here's the output:

2.4.3 (#69, Apr 11 2006, 15:32:42) [MSC v.1310 32 bit (Intel)]
0.6.1
Traceback (most recent call last):
  File "C:\Python24\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
    exec codeObject in __main__.__dict__
  File "C:\text analysis\pickle_test2.py", line 13, in ?
    cPickle.dump(book, pickle_file)
  File "C:\Python24\lib\copy_reg.py", line 69, in _reduce_ex
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle module objects

Thanks for the advice!
Re: xlrd and cPickle.dump
Still no luck:

Traceback (most recent call last):
  File "C:\Python24\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
    exec codeObject in __main__.__dict__
  File "C:\text analysis\pickle_test2.py", line 13, in ?
    cPickle.dump(Data_sheet, pickle_file, -1)
PicklingError: Can't pickle : attribute lookup __builtin__.module failed

My code remains the same, except I added 'wb' and the -1 following your suggestions:

    import cPickle, xlrd, sys

    print sys.version
    print xlrd.__VERSION__

    data_path = """C:\\test\\test.xls"""
    pickle_path = """C:\\test\\pickle.pickle"""

    book = xlrd.open_workbook(data_path)
    Data_sheet = book.sheet_by_index(0)

    pickle_file = open(pickle_path, 'wb')
    cPickle.dump(Data_sheet, pickle_file, -1)
    pickle_file.close()

To begin with (I forgot to mention this before) I get this error:

WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero

I'm not sure what this means.

> What do you describe as "simple manipulations"? Please describe your
> computer, including how much memory it has.

I have a 1.8 GHz HP dv6000 with 2 GB of RAM, which should be speedy enough for my programming projects. However, when I try to print out the rows in the Excel file, my computer gets very slow and choppy, which makes experimenting slow and frustrating. Maybe cPickle won't solve this problem at all! For this first part, I am trying to make ID numbers for the different permutations of categories, topics, and sub_topics. So I will have [book,non-fiction,biography], [book,non-fiction,history-general], [book,fiction,literature], etc., and I want the combination of

[book,non-fiction,biography] = 1
[book,non-fiction,history-general] = 2
[book,fiction,literature] = 3
etc...

My code does this, except sort returns None, which is strange. I just want an alphabetical sort by the first element, which sort should do automatically. When I do a test like

>>> nest_list = [['bbc', 'cds'], ['jim', 'ex'], ['abc', 'sd']]
>>> nest_list.sort()
[['abc', 'sd'], ['bbc', 'cds'], ['jim', 'ex']]

it works fine, but not for my rows. Here's the code (unpickled/unsorted):

    import xlrd, pyExcelerator

    path_file = "C:\\text_analysis\\test.xls"
    book = xlrd.open_workbook(path_file)
    ProcFT_QC = book.sheet_by_index(0)
    log_path = "C:\\text_analysis\\ID_Log.log"
    logfile = open(log_path,'wb')

    set_rows = []
    rows = []
    db = {}
    n = 0
    while n
        rows.append(ProcFT_QC.row_values(n, 6, 9))
        n += 1
    print rows.sort()  # Outputs None
    ID = 1
    for row in rows:
        if row not in set_rows:
            set_rows.append(row)
            db[ID] = row
            entry = str(ID) + '|' + str(row).strip('u[]') + '\r\n'
            logfile.write(entry)
            ID += 1
    logfile.close()

> Also, any good reason for sticking with Python 2.4?

Trying to learn Zope/Plone too, so I'm sticking with Python 2.4.

Thanks again
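On the sort question specifically: list.sort() sorts in place and returns None, which is why print rows.sort() shows None; sorted() returns a new list. A small sketch of building the ID table that way, with sample tuples invented from the categories mentioned above:

```python
rows = [('book', 'non-fiction', 'biography'),
        ('book', 'fiction', 'literature'),
        ('book', 'non-fiction', 'biography'),   # duplicate entry
        ('book', 'non-fiction', 'history-general')]

# list.sort() mutates the list and returns None, hence the confusing output
print(rows.sort() is None)  # True

# sorted(set(...)) deduplicates and orders in one step,
# so assigning IDs becomes a single enumerate pass
db = {}
for ID, combo in enumerate(sorted(set(rows)), 1):
    db[ID] = combo

print(db[1])  # ('book', 'fiction', 'literature')
```

Using tuples rather than lists for the rows is deliberate: tuples are hashable, so they can go into a set for deduplication.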
Re: xlrd and cPickle.dump
> FWIW, it works here on 2.5.1 without errors or warnings. Output is:
> 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)]
> 0.6.1

I guess it's a version issue then... I forgot about sorted! Yes, that would make sense! Thanks for the input.

On Apr 2, 4:23 pm, [EMAIL PROTECTED] wrote:
> Still no luck:
>
> Traceback (most recent call last):
>   File "C:\Python24\Lib\site-packages\pythonwin\pywin\framework\scriptutils.py", line 310, in RunScript
>     exec codeObject in __main__.__dict__
>   File "C:\text analysis\pickle_test2.py", line 13, in ?
>     cPickle.dump(Data_sheet, pickle_file, -1)
> PicklingError: Can't pickle : attribute lookup
> __builtin__.module failed
>
> My code remains the same, except I added 'wb' and the -1 following
> your suggestions:
>
> import cPickle, xlrd, sys
>
> print sys.version
> print xlrd.__VERSION__
>
> data_path = """C:\\test\\test.xls"""
> pickle_path = """C:\\test\\pickle.pickle"""
>
> book = xlrd.open_workbook(data_path)
> Data_sheet = book.sheet_by_index(0)
>
> pickle_file = open(pickle_path, 'wb')
> cPickle.dump(Data_sheet, pickle_file, -1)
> pickle_file.close()
>
> To begin with (I forgot to mention this before) I get this error:
> WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
>
> I'm not sure what this means.
>
> > What do you describe as "simple manipulations"? Please describe your
> > computer, including how much memory it has.
>
> I have a 1.8Ghz HP dv6000 with 2Gb of ram, which should be speedy
> enough for my programming projects. However, when I try to print out
> the rows in the excel file, my computer gets very slow and choppy,
> which makes experimenting slow and frustrating. Maybe cPickle won't
> solve this problem at all! For this first part, I am trying to make
> ID numbers for the different permutation of categories, topics, and
> sub_topics. So I will have [book,non-fiction,biography],
> [book,non-fiction,history-general], [book,fiction,literature], etc.
>
> so I want the combination of
> [book,non-fiction,biography] = 1
> [book,non-fiction,history-general] = 2
> [book,fiction,literature] = 3
> etc...
>
> My code does this, except sort returns None, which is strange. I just
> want an alphabetical sort of the first option, which sort should do
> automatically. When I do a test like
> >>> nest_list = [['bbc', 'cds'], ['jim', 'ex'], ['abc', 'sd']]
> >>> nest_list.sort()
> [['abc', 'sd'], ['bbc', 'cds'], ['jim', 'ex']]
> It works fine, but not for my rows.
>
> Here's the code (unpickled/unsorted):
> import xlrd, pyExcelerator
>
> path_file = "C:\\text_analysis\\test.xls"
> book = xlrd.open_workbook(path_file)
> ProcFT_QC = book.sheet_by_index(0)
> log_path = "C:\\text_analysis\\ID_Log.log"
> logfile = open(log_path,'wb')
>
> set_rows = []
> rows = []
> db = {}
> n=0
> while n
>     rows.append(ProcFT_QC.row_values(n, 6,9))
>     n+=1
> print rows.sort() #Outputs None
> ID = 1
> for row in rows:
>     if row not in set_rows:
>         set_rows.append(row)
>         db[ID] = row
>         entry = str(ID) + '|' + str(row).strip('u[]') + '\r\n'
>         logfile.write(entry)
>         ID+=1
> logfile.close()
>
> > Also, any good reason for sticking with Python 2.4?
>
> Trying to learn Zope/Plone too, so I'm sticking with Python 2.4.
>
> Thanks again
Converting .doc to .txt in Linux
Hi Everyone, I had previously asked a similar question,
http://groups.google.com/group/comp.lang.python/browse_thread/thread/2953d6d5d8836c4b/9dc901da63d8d059?lnk=gst&q=convert+doc+txt#9dc901da63d8d059
but at that point I was using Windows and now I am using Linux. Basically, I have some .doc files that I need to convert into .txt files encoded in UTF-8. However, win32com.client doesn't work on Linux. It's been giving me quite a headache all day. Any ideas would be greatly appreciated.

Best, Patrick

    #Windows Code:
    import glob, os, codecs, shutil, win32com.client
    from win32com.client import Dispatch

    input = '/home/pwaldo2/work/workbench/current_documents/*.doc'
    input_dir = '/home/pwaldo2/work/workbench/current_documents/'
    outpath = '/home/pwaldo2/work/workbench/current_documents/TXT/'

    for doc in glob.glob(input):
        WordApp = Dispatch("Word.Application")
        WordApp.Visible = 1
        WordApp.Documents.Open(doc)
        WordApp.ActiveDocument.SaveAs(doc, 7)
        WordApp.ActiveDocument.Close()
        WordApp.Quit()

    for doc in glob.glob(input):
        txt_split = os.path.splitext(doc)
        txt_doc = txt_split[0] + '.txt'
        txt_doc_path = os.path.join(outpath, txt_doc)
        doc_path = os.path.join(input_dir, doc)
        shutil.copy(doc_path, txt_doc_path)
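On Linux one common route (an assumption here, not something from the thread) is to shell out to a command-line converter such as antiword instead of driving Word via COM. The sketch below separates the pure path bookkeeping, which is testable anywhere, from the actual conversion call, which assumes the antiword utility is installed; subprocess.check_output also needs Python 2.7 or later:

```python
import os
import subprocess

def txt_target(doc_path, out_dir):
    """Compute the output .txt path for a given .doc file."""
    base = os.path.splitext(os.path.basename(doc_path))[0]
    return os.path.join(out_dir, base + '.txt')

def doc_to_txt(doc_path, out_dir):
    """Convert one .doc to UTF-8 text; assumes the antiword CLI is installed."""
    out_path = txt_target(doc_path, out_dir)
    # '-m UTF-8.txt' selects antiword's UTF-8 character mapping file
    text = subprocess.check_output(['antiword', '-m', 'UTF-8.txt', doc_path])
    with open(out_path, 'wb') as f:
        f.write(text)
    return out_path

print(txt_target('/home/pwaldo2/work/workbench/current_documents/report.doc', '/tmp/TXT'))
# /tmp/TXT/report.txt
```

Looping doc_to_txt over glob.glob('.../current_documents/*.doc') would then replace both loops of the Windows version in one pass.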