Comparing two book chapters (text files)

2009-02-04 Thread Nick Matzke

Hi all,

So I have an interesting challenge.  I want to compare two book 
chapters, which I have in plain text format, and find out (a) percentage 
similarity and (b) what has changed.


Some features make this problem different than what seems to be the 
standard text-matching problem solvable with e.g. difflib.  Here is what 
I mean:


* there is no guarantee that single lines from each file will be 
directly comparable -- e.g., if a few words are inserted into a 
sentence, then a chunk of the sentence will be moved to the next line, 
then a chunk of that line moved to the next, etc.


* Also, there are cases where paragraphs have been moved around, 
sections re-ordered, etc.  So it can't just be a "linear" match.
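
One way to sidestep the line-wrapping issue described above is to compare word sequences rather than lines, using the standard library's difflib. The following is only a minimal sketch: the helper name word_diff and the two sample strings are made up for illustration, and a linear match like this does not by itself detect paragraphs that were moved around.

```python
import difflib

def word_diff(old_text, new_text):
    """Compare two texts word by word, ignoring where the line breaks fall."""
    old_words = old_text.split()
    new_words = new_text.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Record what kind of change it was and the words involved.
            changes.append((tag,
                            " ".join(old_words[i1:i2]),
                            " ".join(new_words[j1:j2])))
    return matcher.ratio(), changes

# Same sentence, one word changed, line break in a different place.
old = "The quick brown fox\njumps over the lazy dog."
new = "The quick red fox jumps\nover the lazy dog."
ratio, changes = word_diff(old, new)
print(round(ratio * 100, 1), changes)
# changes == [("replace", "brown", "red")]
```

Here ratio is a value between 0 and 1, so multiplying by 100 gives a rough percentage similarity. For re-ordered sections, one possible extension would be to split each chapter into paragraphs first and match each paragraph to its best-scoring counterpart.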


I imagine this kind of thing can't be all that hard in the grand scheme 
of things, but I couldn't find an easily applicable solution readily 
available.  I have advanced beginner python skills but am not quite 
where I could do this kind of thing from scratch without some guidance 
about the likely functions, libraries etc. to use.


PS: I am going to have to do this for multiple book chapters so various 
software packages, e.g. for windows, are not really usable.


Any help is much appreciated!!

Cheers,
Nick



--

Nicholas J. Matzke
Ph.D. student, Graduate Student Researcher
Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley

Lab websites:
http://ib.berkeley.edu/people/lab_detail.php?lab=54
http://fisher.berkeley.edu/cteg/hlab.html
Dept. personal page: 
http://ib.berkeley.edu/people/students/person_detail.php?person=370

Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html
Lab phone: 510-643-6299
Dept. fax: 510-643-6264
Cell phone: 510-301-0179
Email: mat...@berkeley.edu

Mailing address:
Department of Integrative Biology
3060 VLSB #3140
Berkeley, CA 94720-3140

-
"[W]hen people thought the earth was flat, they were wrong. When people 
thought the earth was spherical, they were wrong. But if you think that 
thinking the earth is spherical is just as wrong as thinking the earth 
is flat, then your view is wronger than both of them put together."


Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 
14(1), 35-44. Fall 1989.

http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm

--
http://mail.python.org/mailman/listinfo/python-list


Re: Comparing two book chapters (text files)

2009-02-04 Thread Nick Matzke



Chris Rebert wrote:

On Wed, Feb 4, 2009 at 5:20 PM, Nick Matzke  wrote:

Hi all,

So I have an interesting challenge.  I want to compare two book chapters,
which I have in plain text format, and find out (a) percentage similarity
and (b) what has changed.

Some features make this problem different than what seems to be the standard
text-matching problem solvable with e.g. difflib.  Here is what I mean:

* there is no guarantee that single lines from each file will be directly
comparable -- e.g., if a few words are inserted into a sentence, then a
chunk of the sentence will be moved to the next line, then a chunk of that
line moved to the next, etc.

* Also, there are cases where paragraphs have been moved around, sections
re-ordered, etc.  So it can't just be a "linear" match.

I imagine this kind of thing can't be all that hard in the grand scheme of
things, but I couldn't find an easily applicable solution readily available.
 I have advanced beginner python skills but am not quite where I could do
this kind of thing from scratch without some guidance about the likely
functions, libraries etc. to use.

PS: I am going to have to do this for multiple book chapters so various
software packages, e.g. for windows, are not really usable.


Though not written in Python, wdiff
(http://www.gnu.org/software/wdiff/wdiff.html) might be a good
starting point.



Wow -- this is actually amazingly effective.  And fast!   Simple to run 
from python & then use python to parse the output.
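
A rough sketch of what "run from python and parse the output" could look like, assuming wdiff's default markup of [-deleted-] and {+inserted+}. The function names run_wdiff and parse_wdiff are hypothetical, and the sample string is hand-written rather than real wdiff output.

```python
import re
import subprocess

# Default wdiff markup (assumed): [-removed words-] and {+added words+}.
DELETED = re.compile(r"\[-(.*?)-\]", re.DOTALL)
INSERTED = re.compile(r"\{\+(.*?)\+\}", re.DOTALL)

def run_wdiff(old_path, new_path):
    """Run wdiff on two files and return its marked-up output as text."""
    # No check=True here: wdiff is assumed to exit non-zero when files differ.
    done = subprocess.run(["wdiff", old_path, new_path],
                          capture_output=True, text=True)
    return done.stdout

def parse_wdiff(marked_up):
    """Split wdiff output into lists of deleted and inserted chunks."""
    return DELETED.findall(marked_up), INSERTED.findall(marked_up)

# Hand-written sample in wdiff's format, so this runs without wdiff installed.
sample = "The quick [-brown-] {+red+} fox jumps over the lazy dog."
deleted, inserted = parse_wdiff(sample)
print(deleted, inserted)
```

In real use, the string returned by run_wdiff would be passed to parse_wdiff in place of the sample.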


Thanks!
Nick




Cheers,
Chris





global name 'sqrt' is not defined

2009-02-05 Thread Nick Matzke

Hi all,

So, I can run this in the ipython shell just fine:

===
a = ["12", "15", "16", "38.2"]
dim = int(sqrt(size(a)))

dim
>2
===


But if I move these commands to a function in another file, it freaks out:


=
a = distances_matrix.split('\t')
from LR_run_functions_v2 import make_half_square_array
d = make_half_square_array(a)

>
---
NameError Traceback (most recent call last)

/bioinformatics/phylocom/ in ()

/bioinformatics/phylocom/_scripts/LR_run_functions_v2.py in 
make_half_square_array(linear_version_of_square_array)

   1548
   1549 a = linear_version_of_square_array
-> 1550 dim = int(sqrt(size(a)))
   1551
   1552

NameError: global name 'sqrt' is not defined

=




Here's the function in LR_run_functions_v2.py

==
def make_half_square_array(linear_version_of_square_array):

    a = linear_version_of_square_array
    dim = int(sqrt(size(a)))
==



Any ideas?  If I do something like "import math" in the subfunction, 
then the error changes to "global name 'math' is not defined".


Thanks!
Nick









Re: global name 'sqrt' is not defined

2009-02-05 Thread Nick Matzke



Scott David Daniels wrote:

M.-A. Lemburg wrote:

On 2009-02-05 10:08, Nick Matzke wrote:

..., I can run this in the ipython shell just fine:
a = ["12", "15", "16", "38.2"]
dim = int(sqrt(size(a)))
...But if I move these commands to a function in another file, it 
freaks out:

You need to add:

from math import sqrt

or:
from cmath import sqrt
or:
from numpy import sqrt






The weird thing is, when I do this, I still get the error:


n...@mws2[phylocom]|27> a = ["12", "15", "16", "38.2"]
n...@mws2[phylocom]|28> from LR_run_functions_v2 import make_half_square_array

n...@mws2[phylocom]|24> d = make_half_square_array(a)
---
NameError Traceback (most recent call last)

/bioinformatics/phylocom/ in ()

/bioinformatics/phylocom/_scripts/LR_run_functions_v2.py in 
make_half_square_array(linear_version_of_square_array)

   1548 from numpy import sqrt
   1549 a = linear_version_of_square_array
-> 1550 dim = int(sqrt(size(a)))
   1551
   1552

NameError: global name 'sqrt' is not defined
n...@mws2[phylocom]|25>


Is there some other place I should put the import command?  I.e.:
1. In the main script/ipython command line

2. In the called function, i.e. make_half_square_array() in 
LR_run_functions_v2.py


3. At the top of LR_run_functions_v2.py, outside of the individual 
functions?


Thanks...sorry for the noob questions!
Nick





Each with their own, slightly different, meaning.
Hence the reason many of us prefer to import the module
and reference the function as a module attribute.

Note that _many_ (especially older) package documents describe
their code without the module name.  I believe that such behavior
is because, when working to produce prose about a package, it
feels too much like useless redundancy when describing each function
or class as "package.name".


--Scott David Daniels
scott.dani...@acm.org
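
To illustrate the point about the different meanings of sqrt, here is a small sketch comparing math.sqrt and cmath.sqrt while keeping the module name visible at each call. The helper describe_sqrt is invented for this example. numpy.sqrt behaves differently again (for a negative float it should return nan with a warning), but it is left out here so the example needs only the standard library.

```python
import math
import cmath

def describe_sqrt(x):
    """Return what math.sqrt and cmath.sqrt each give for x."""
    try:
        real_result = math.sqrt(x)
    except ValueError:
        real_result = None  # math.sqrt rejects negative input
    return real_result, cmath.sqrt(x)

print(describe_sqrt(4))   # (2.0, (2+0j))
print(describe_sqrt(-4))  # (None, 2j)
```

Writing math.sqrt or cmath.sqrt at the call site makes it obvious which of these behaviours the code relies on.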


Re: global name 'sqrt' is not defined

2009-02-05 Thread Nick Matzke
OK, so the problem was that I had to exit ipython, re-enter it, and then 
import my module to get the errors to disappear.  Thanks for the help!


(PS: Is there a way to force a complete reload of a module, without 
exiting ipython?  Just doing the import command again doesn't seem to do 
it.)
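
On the reload question: a second import statement is a no-op because the module is already cached in sys.modules. A minimal sketch of forcing a reload follows, written for Python 3, where the function is importlib.reload (in the Python 2 of this thread it was the builtin reload). The module name demo_functions and the temporary directory are made up so that the example is self-contained.

```python
import importlib
import os
import sys
import tempfile

sys.dont_write_bytecode = True  # avoid a stale cached .pyc in this quick demo

# Create a throwaway module on disk (a stand-in for LR_run_functions_v2.py).
workdir = tempfile.mkdtemp()
sys.path.insert(0, workdir)
module_path = os.path.join(workdir, "demo_functions.py")
with open(module_path, "w") as handle:
    handle.write("def answer():\n    return 1\n")

import demo_functions
from demo_functions import answer as old_answer

# Edit the file, as one would in an editor while the session stays open.
with open(module_path, "w") as handle:
    handle.write("def answer():\n    return 2  # edited\n")

import demo_functions            # does nothing: module is already loaded
importlib.reload(demo_functions)  # re-executes the edited source

print(demo_functions.answer())  # 2, the new code
print(old_answer())             # 1, a name bound earlier via "from ... import"
```

The last line shows a likely reason the error appeared to persist: a function pulled in with "from module import name" keeps pointing at the old object until that import line is run again after the reload.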


Thanks!
Nick


Diez B. Roggisch wrote:

Nick Matzke wrote:



Is there some other place I should put the import command?  I.e.:
1. In the main script/ipython command line

2. In the called function, i.e. make_half_square_array() in 
LR_run_functions_v2.py


3. At the top of LR_run_functions_v2.py, outside of the individual 
functions?


The latter. Python's imports are always local to the module/file they 
are in, not globally effective.



Diez
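
The point that imports are local to the module they appear in can be demonstrated with a small sketch. To keep it in one runnable file, two stand-in modules are built from source strings with types.ModuleType. The names broken_functions, fixed_functions and half_dim are hypothetical, and half_dim takes a plain number instead of calling size() to keep it short.

```python
import types

# Two versions of the same module source: without and with the import.
source_without_import = "def half_dim(n):\n    return int(sqrt(n))\n"
source_with_import = "from math import sqrt\n\n" + source_without_import

def load(name, source):
    """Build a module object from source text, as if imported from a file."""
    module = types.ModuleType(name)
    exec(source, module.__dict__)
    return module

from math import sqrt  # imported in the *calling* namespace only

broken = load("broken_functions", source_without_import)
fixed = load("fixed_functions", source_with_import)

try:
    broken.half_dim(4)
    outcome = "no error"
except NameError as err:
    outcome = str(err)  # sqrt is not visible inside broken_functions

result = fixed.half_dim(4)
print(outcome)
print(result)  # 2
```

The import in the calling session makes no difference to broken_functions. Only the version with the import at the top of its own source can see sqrt.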


hist without plotting

2009-02-15 Thread Nick Matzke

Hi,

Is there a way to run the numpy hist function or something similar and 
get the outputs (bins, bar heights) without actually producing the plot 
on the screen?


(R has a plot = false option, something like this is what I'm looking 
for...)


Cheers!
Nick




Re: hist without plotting

2009-02-15 Thread Nick Matzke
Never mind, I was running the pylab hist; the numpy.histogram function 
generates the bar counts etc. without plotting the histogram.
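
For reference, a minimal sketch of the numpy.histogram call. The data values and the choice of bins=3 are invented for illustration.

```python
import numpy as np

data = [1, 2, 2, 3, 3, 3]

# Returns the bar heights and the bin edges; nothing is drawn.
counts, bin_edges = np.histogram(data, bins=3)

print(counts)     # one count per bin
print(bin_edges)  # one more edge than there are bins
```

The counts array has one entry per bin, while bin_edges has one extra entry because it holds both the left and right edge of every bin.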


Cheers!
Nick







pythonic array subsetting

2009-02-16 Thread Nick Matzke

Hi,

So I've got a square floating point array that is about 1000 x 1000.  I 
need to  subset this array as efficiently as possible based on an 
ordered sublist of the list of rownames/colnames (they are the same, 
this is a symmetric array).


e.g., if sublist is of length 500, and matches the rownames list at 
every other entry, I need to pull out a 500x500 array holding every 
other row & column in the parent array.


I have to do this hundreds of times, so speed would be useful.

Cheers!
Nick





Re: pythonic array subsetting

2009-02-17 Thread Nick Matzke
Looks like "compress" is the right numpy function, but it took forever 
for me to find it...


x = array([[1,2,3], [4,5,6], [7,8,9]], dtype=float)
compress([1,2], x, axis=1)

result:
array([[ 1.,  2.],
       [ 4.,  5.],
       [ 7.,  8.]])
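
A caveat on the example above: as far as the NumPy documentation goes, compress expects a boolean condition rather than a list of indices, so [1,2] is read as [True, True] and the call returns columns 0 and 1. For picking rows and columns by name, integer indexing with numpy.ix_ may be a better fit. The sketch below assumes that approach. The helper subset_square and the names "a", "b", "c" are made up for illustration.

```python
import numpy as np

def subset_square(matrix, names, wanted):
    """Return the sub-array of `matrix` for the rows/columns named in `wanted`."""
    # Map each row/column name to its position once, then look up the subset.
    position = {name: i for i, name in enumerate(names)}
    idx = [position[name] for name in wanted]
    # np.ix_ builds an open mesh, so this selects rows idx AND columns idx.
    return matrix[np.ix_(idx, idx)]

x = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)
sub = subset_square(x, ["a", "b", "c"], ["a", "c"])
print(sub)
# [[1. 3.]
#  [7. 9.]]
```

The same call should scale to the 1000 x 1000 case, since the selection itself is done inside NumPy rather than in a Python loop.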




Gary Herron wrote:

Nick Matzke wrote:

Hi,

So I've got a square floating point array that is about 1000 x 1000.  
I need to  subset this array as efficiently as possible based on an 
ordered sublist of the list of rownames/colnames (they are the same, 
this is a symmetric array).


e.g., if sublist is of length 500, and matches the rownames list at 
every other entry, I need to pull out a 500x500 array holding every 
other row & column in the parent array.


I have to do this hundreds of times, so speed would be useful.

Cheers!
Nick


Check out numpy at http://numpy.scipy.org

It can do what you want very efficiently.


Gary Herron












debugging in IPython

2009-04-24 Thread Nick Matzke
This is a general question, but maybe there is some obvious solution 
I've missed.


When I am writing code, I have a main script that calls functions in 
another .py file.  When there is a bug or crash in the main script, in 
IPython I can just start typing the names of variables etc. to see what 
they contained at the point where the script crashed.


However, if the bug is in a function I've called from the main script, 
the traceback indicates the function, line of code, etc. where 
the crash occurred, but the only variables I can access at the 
IPython prompt are those used in the main script.


Is there any way to access the variables in those sub-functions after a 
crash, in IPython or something similar?  The only other option is 
pasting all the code from each function into IPython manually, or 
adding print lines throughout the relevant sub-functions.  This is 
doable but extremely tedious when the crash occurred 5 functions deep, 
or at some unknown point within a for loop.
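
One possible answer: in IPython, the %debug magic opens a post-mortem debugger on the last traceback, and %pdb turns that on automatically. Inside the debugger, the "up" and "down" commands move between the nested function calls. The mechanism underneath can be sketched in plain Python by walking the traceback to the frame that crashed. The functions inner, outer and innermost_locals below are invented for the example.

```python
def innermost_locals(exc):
    """Return the local variables of the frame where `exc` was raised."""
    tb = exc.__traceback__
    # Follow the chain of frames down to the deepest one.
    while tb.tb_next is not None:
        tb = tb.tb_next
    return dict(tb.tb_frame.f_locals)

def inner(values):
    total = sum(values)
    count = 0
    return total / count  # raises ZeroDivisionError

def outer():
    return inner([1, 2, 3])

try:
    outer()
except ZeroDivisionError as err:
    crashed = innermost_locals(err)

print(crashed)  # the variables of inner() at the moment of the crash
```

This recovers values, total and count from inside inner(), even though the exception was caught two calls higher up.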


Any help much appreciated!

Cheers!
Nick




updating NumPy in EPD

2010-06-08 Thread Nick Matzke

Hi all,

I have a slightly weird question.  I would like to install 
the PyCogent library.  However, this requires NumPy 1.3 or 
higher.  I only have NumPy 1.1.1, because I got it as part 
of the Enthought Python Distribution (4.1) back in 2008.


Now, when I download & install a new version of NumPy, it 
seems to work.  However, the PyCogent installer can only see 
the NumPy 1.1.1 version.


Any advice on what I might do to fix this?

Cheers!
Nick


--

Nicholas J. Matzke
Ph.D. Candidate, Graduate Student Researcher
Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley

Graduate Student Instructor, IB200A
Principles of Phylogenetics: Systematics
http://ib.berkeley.edu/courses/ib200a/index.shtml

Lab websites:
http://ib.berkeley.edu/people/lab_detail.php?lab=54
http://fisher.berkeley.edu/cteg/hlab.html
Dept. personal page: 
http://ib.berkeley.edu/people/students/person_detail.php?person=370
Lab personal page: 
http://fisher.berkeley.edu/cteg/members/matzke.html

Lab phone: 510-643-6299
Dept. fax: 510-643-6264
Cell phone: 510-301-0179
Email: mat...@berkeley.edu

Mailing address:
Department of Integrative Biology
3060 VLSB #3140
Berkeley, CA 94720-3140

-
"[W]hen people thought the earth was flat, they were wrong. 
When people thought the earth was spherical, they were 
wrong. But if you think that thinking the earth is spherical 
is just as wrong as thinking the earth is flat, then your 
view is wronger than both of them put together."


Isaac Asimov (1989). "The Relativity of Wrong." The 
Skeptical Inquirer, 14(1), 35-44. Fall 1989.

http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm

--
http://mail.python.org/mailman/listinfo/python-list


Re: updating NumPy in EPD

2010-06-08 Thread Nick Matzke
Oh yes -- I would just update my version of EPD, which is 
where my NumPy came from -- however, Enthought only has 
available for academic download a version of EPD that works 
on OS X 10.5 or later, and my Mac is a 10.4.11 and I'd 
rather not completely reinstall the OS just to get one 
little library to work.


Cheers!
Nick








Re: updating NumPy in EPD

2010-06-08 Thread Nick Matzke

Hi again,

I got the solution on the NumPy list, I thought I would 
share for posterity...



Jeff Hsu wrote:
> Check which version of numpy python is importing with "import numpy;
> print numpy.__file__".  I had a similar question and this worked after I
> removed that installation of numpy.  I think the enthought distro
> installs it somewhere else that has priority.

Ah, this was totally the trick!  To summarize for posterity:

=
Fix for an old version of NumPy installed with the Enthought Python 
Distribution (EPD):



0. Figure out what your current version is, and where it is located:


ipython
import numpy
print numpy.__file__
numpy.__version__


1. Download newest Numpy.tar.gz (1.4.1) from sourceforge, unzip

2. Install with:
cd ~/Desktop/downloads/numpy-1.4.1
python setup.py (3 times -- configure, build, install)

3. Delete or rename old NumPy, redirect IPython's location to new install:

cd /Library/Frameworks/Python.framework/Versions/2.5.2001/lib/python2.5/site-packages/numpy-1.0.4.0004-py2.5-macosx-10.3-fat.egg

mv numpy numpy_old

ln -s /Library/Frameworks/Python.framework/Versions/2.5.2001/lib/python2.5/site-packages/numpy numpy





Manual install of PyCogent:

1. download from sourceforge
2. working install:

cd /bioinformatics/pythonstuff/PyCogent-1.4.1/
python setup.py build
sudo python setup.py install
python setup.py build_ext -if

(no NumPy version error this time!)

Finally:

ipython
import numpy
dir(numpy)
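
The "which copy is Python actually importing?" check from step 0 can also be done without importing the package. This is a small sketch for Python 3 using importlib.util.find_spec. The helper name locate is made up, and "json" is used only as a module that is sure to exist. Substitute "numpy" on a machine where it is installed.

```python
import importlib.util
import sys

def locate(module_name):
    """Return the file Python would import for `module_name`, or None."""
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin

print(locate("json"))                 # path to the copy that wins
print(locate("no_such_module_xyz"))   # None
print(sys.path[:3])                   # earlier entries take priority
```

If the path printed is inside an old distribution's directory, that directory sits earlier on sys.path than the new install, which matches the situation described above.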


Thanks!
Nick

> On Tue, Jun 8, 2010 at 10:30 PM, Nick Matzke
> <mailto:mat...@berkeley.edu> wrote:
>
> Hi NumPy gurus,
>
> I have a slightly weird question.  I would like to install
> the PyCogent python library.  However, this requires NumPy
> 1.3 or higher.  I only have NumPy 1.1.1, because I got it as
> part of the Enthought Python Distribution (4.1) back in 2008.
>
> Now, when I download & install a new version of NumPy, the
> install seems to work.  However, the PyCogent installer can
> still only see the NumPy 1.1.1 version.
>
> Any advice on what I might do to fix this?
>
> I would just update my version of EPD, which is
> where my NumPy came from -- however, Enthought only has
> available for academic download a version of EPD that works
> on OS X 10.5 or later, and my Mac is a 10.4.11 and I'd
> rather not completely reinstall the OS just to get one
> little library to work.
>
> Any help much appreciated!!
>
> Cheers!
> Nick










cleaning up an ASCII file?

2009-06-10 Thread Nick Matzke

Hi all,

So I'm parsing an XML file returned from a database.  However, the 
database entries have occasional non-ASCII characters, and this is 
crashing my parsers.


Is there some handy function out there that will schlep through a file 
like this, and do something like fix the characters that it can 
recognize, and delete those that it can't?  Basically, like the BBEdit 
"convert to ASCII" menu option under "Text".


I googled some on this, but nothing obvious came up that wasn't specific 
to fixing one or a few characters.


Thanks!
Nick


--

Nicholas J. Matzke
Ph.D. Candidate, Graduate Student Researcher
Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley

Lab websites:
http://ib.berkeley.edu/people/lab_detail.php?lab=54
http://fisher.berkeley.edu/cteg/hlab.html
Dept. personal page: 
http://ib.berkeley.edu/people/students/person_detail.php?person=370

Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html
Lab phone: 510-643-6299
Dept. fax: 510-643-6264
Cell phone: 510-301-0179
Email: mat...@berkeley.edu

Mailing address:
Department of Integrative Biology
3060 VLSB #3140
Berkeley, CA 94720-3140

-
"[W]hen people thought the earth was flat, they were wrong. When people 
thought the earth was spherical, they were wrong. But if you think that 
thinking the earth is spherical is just as wrong as thinking the earth 
is flat, then your view is wronger than both of them put together."


Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 
14(1), 35-44. Fall 1989.

http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm

--
http://mail.python.org/mailman/listinfo/python-list


Re: cleaning up an ASCII file?

2009-06-10 Thread Nick Matzke
Apologies, I figured there was some easy, obvious solution, since there 
is in BBEdit.  I will explain further...


John Machin wrote:

> On Jun 11, 6:09 am, Nick Matzke wrote:
>
>> Hi all,
>>
>> So I'm parsing an XML file returned from a database.  However, the
>> database entries have occasional non-ASCII characters, and this is
>> crashing my parsers.
>
> So fix your parsers. google("unicode"). Deleting stuff that you don't
> understand is an "interesting" approach to academic research :-(


Not if it's just weird versions of dash characters, umlauted 
characters and the like, which is what I bet it is.  Those sorts of things 
and the apparent inability of lots of email readers and websites to deal 
with them have been annoying me for years, so I tend to move straight 
towards genocidal tactics when I detect their presence.


(My database source is GBIF, they get museum specimen submissions from 
around the planet, there are zillions of records, I am just a user, so 
fixing it on their end is not a realistic option.)



> Care to divulge what "crash" means? e.g. the full traceback and error
> message, plus what version of python on what platform, what version of
> ElementTree or other XML software you are using ...


All that is fine, the problem is actually when I try to print to screen 
in IPython:



UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in 
position 293: ordinal not in range(128)
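
For reference, the same error can be reproduced in a few lines. This is a minimal sketch written for Python 3 (an assumption, so that it runs on a current interpreter; the thread itself uses Python 2.5), and the helper name safe_for_ascii is hypothetical. Strict ASCII encoding fails on the first non-ASCII character, while an explicit error handler lets the text be printed without deleting anything:

```python
text = "Jyv\xe4skyl\xe4 University Museum"  # \xe4 is a-umlaut

# Strict ASCII encoding fails on the first non-ASCII character,
# which is what an ASCII-only terminal does when asked to print it.
try:
    text.encode("ascii")
except UnicodeEncodeError as err:
    failed_at = err.start  # index of the offending character


def safe_for_ascii(s):
    """Hypothetical helper: return an ASCII-only str, escaping the rest."""
    return s.encode("ascii", "backslashreplace").decode("ascii")


print(failed_at)
print(safe_for_ascii(text))
```

The escaped form is ugly but lossless, which can be handy for debugging output before deciding how to transliterate.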



Probably this is the line in the file which is causing problems (as 
displayed in BBEdit):


==
  -

This document contains data shared through the GBIF Network - see 
http://data.gbif.org/ for more information.


All usage of these data must be in accordance with the GBIF Data Use 
Agreement - see http://www.gbif.org/DataProviders/Agreements/DUA


Please cite these data as follows:

Jyväskylä University Museum - The Section of Natural Sciences, 
Vascular plant collection of Jyvaskyla University Museum (accessed 
through GBIF data portal, http://data.gbif.org/datasets/resource/462, 
2009-06-11)
Missouri Botanical Garden, Missouri Botanical Garden (accessed through 
GBIF data portal, http://data.gbif.org/datasets/resource/621, 2009-06-11)
Museo Nacional de Costa Rica, herbario (accessed through GBIF data 
portal, http://data.gbif.org/datasets/resource/566, 2009-06-11)
National Science Museum, Japan, Kurashiki Museum of Natural History 
(accessed through GBIF data portal, 
http://data.gbif.org/datasets/resource/599, 2009-06-11)
The Swedish Museum of Natural History (NRM), Herbarium of Oskarshamn 
(OHN) (accessed through GBIF data portal, 
http://data.gbif.org/datasets/resource/1024, 2009-06-11)
Tiroler Landesmuseum Ferdinandeum, Tiroler Landesmuseum Ferdinandeum 
(accessed through GBIF data portal, 
http://data.gbif.org/datasets/resource/1509, 2009-06-11)
UCD, Database Schema for UC Davis [Herbarium Labels] (accessed through 
GBIF data portal, http://data.gbif.org/datasets/resource/734, 2009-06-11)


-

==


Presumably "Jyväskylä University Museum" is the problem since 
there are umlauted a's in there. (Note, though, that I have thousands of 
records to parse, so there are going to be all kinds of other umlauted & 
accented stuff in these sorts of search results.)


So the goal is to replace the characters with un-umlauted versions or 
some such.


Cheers!
Nick


PS: versions I am using:

nick$ python -V
Python 2.5.2 |EPD Py25 4.1.30101|






>> Center for Theoretical Evolutionary Genomics
>
> If your .sig evolves much more, it will consume all available
> bandwidth in the known universe and then some ;-)


...it's easier to have a big sig than to try and remember all that stuff 
;-)...






Re: cleaning up an ASCII file?

2009-06-10 Thread Nick Matzke


Looks like this was a solution:

1. Use this guy's unescape function to convert from HTML/XML Entities to 
unicode

http://effbot.org/zone/re-sub.htm#unescape-html


2. Take the unicode and convert to approximate plain ASCII matches with 
unicodedata (after import unicodedata)



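import unicodedata

# step 1: unescape() comes from the effbot page above
ascii_content2 = unescape(line)
# step 2: NFKD decomposes accents; 'ignore' drops non-ASCII leftovers
ascii_content = unicodedata.normalize('NFKD',
    unicode(ascii_content2)).encode('ascii', 'ignore')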



The string "line" would give the error, but ascii_content does not.
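
For anyone reading this on a newer interpreter, the same two steps can be sketched in Python 3 as follows. This is an adaptation, not the code I actually ran: it assumes the standard library's html.unescape as a stand-in for the effbot unescape function, and the name to_ascii is made up for illustration.

```python
import html
import unicodedata


def to_ascii(line):
    """Approximate a string in plain ASCII, dropping what cannot be mapped."""
    # Step 1: turn HTML/XML entities such as &auml; into real characters.
    text = html.unescape(line)
    # Step 2: NFKD splits accented letters into base letter + combining mark;
    # encoding with 'ignore' then discards the non-ASCII marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Jyv&auml;skyl&auml; University Museum"))
```

Note that characters with no decomposition (an en dash, for example) are silently deleted rather than replaced, so this matches the "fix what you can, delete what you can't" behaviour and nothing smarter.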

Cheers!
Nick

PS: "asciiDammit" is also fun to look at




