Comparing two book chapters (text files)
Hi all,

So I have an interesting challenge. I want to compare two book chapters, which I have in plain text format, and find out (a) percentage similarity and (b) what has changed.

Some features make this problem different than what seems to be the standard text-matching problem solvable with e.g. difflib. Here is what I mean:

* there is no guarantee that single lines from each file will be directly comparable -- e.g., if a few words are inserted into a sentence, then a chunk of the sentence will be moved to the next line, then a chunk of that line moved to the next, etc.

* Also, there are cases where paragraphs have been moved around, sections re-ordered, etc. So it can't just be a "linear" match.

I imagine this kind of thing can't be all that hard in the grand scheme of things, but I couldn't find an easily applicable solution readily available. I have advanced beginner python skills but am not quite where I could do this kind of thing from scratch without some guidance about the likely functions, libraries etc. to use.

PS: I am going to have to do this for multiple book chapters so various software packages, e.g. for windows, are not really usable.

Any help is much appreciated!!

Cheers,
Nick

--
Nicholas J. Matzke
Ph.D. student, Graduate Student Researcher
Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley

Lab websites:
http://ib.berkeley.edu/people/lab_detail.php?lab=54
http://fisher.berkeley.edu/cteg/hlab.html
Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370
Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html
Lab phone: 510-643-6299
Dept. fax: 510-643-6264
Cell phone: 510-301-0179
Email: mat...@berkeley.edu

Mailing address:
Department of Integrative Biology
3060 VLSB #3140
Berkeley, CA 94720-3140

-
"[W]hen people thought the earth was flat, they were wrong.
When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm -- http://mail.python.org/mailman/listinfo/python-list
Re: Comparing two book chapters (text files)
Chris Rebert wrote:
> On Wed, Feb 4, 2009 at 5:20 PM, Nick Matzke wrote:
>> [...] I want to compare two book chapters, which I have in plain text format, and find out (a) percentage similarity and (b) what has changed. [...]
>
> Though not written in Python, wdiff (http://www.gnu.org/software/wdiff/wdiff.html) might be a good starting point.
>
> Cheers,
> Chris

Wow -- this is actually amazingly effective. And fast! Simple to run from python & then use python to parse the output.

Thanks!
Nick
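For readers who would rather stay in pure Python, the word-by-word idea can be sketched with the standard library's difflib.SequenceMatcher, fed lists of words instead of lines so that line breaks stop mattering. This is not how wdiff is implemented and not what was used in this thread; the function name word_diff and the sample strings are made up for illustration, and the code targets Python 3. The comparison is still sequential, so a moved paragraph will most likely show up as a deletion in one place and an insertion in another.

```python
import difflib


def word_diff(old_text, new_text):
    """Compare two texts word by word, ignoring where the line breaks fall.

    Returns (percent_similarity, changes), where changes is a list of
    (tag, old_words, new_words) tuples for every non-matching region.
    """
    old_words = old_text.split()
    new_words = new_text.split()
    # autojunk=False keeps difflib from discarding very frequent words
    # in long texts, which would distort the similarity figure.
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append(
                (tag, " ".join(old_words[i1:i2]), " ".join(new_words[j1:j2]))
            )
    return matcher.ratio() * 100.0, changes


# Same sentence, re-wrapped, with one word inserted.
old = "The quick brown fox\njumps over the lazy dog."
new = "The quick brown fox jumps\nover the very lazy dog."
percent, changes = word_diff(old, new)
print("%.1f%% similar" % percent)
print(changes)
```

The ratio is 2 * matches / total words, so it is a rough similarity measure rather than a precise "percent unchanged" figure.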
global name 'sqrt' is not defined
Hi all,

So, I can run this in the ipython shell just fine:

===
a = ["12", "15", "16", "38.2"]
dim = int(sqrt(size(a)))
dim
>2
===

But if I move these commands to a function in another file, it freaks out:

===
a = distances_matrix.split('\t')
from LR_run_functions_v2 import make_half_square_array
d = make_half_square_array(a)
>
---
NameError Traceback (most recent call last)
/bioinformatics/phylocom/ in ()
/bioinformatics/phylocom/_scripts/LR_run_functions_v2.py in make_half_square_array(linear_version_of_square_array)
   1548
   1549     a = linear_version_of_square_array
-> 1550     dim = int(sqrt(size(a)))
   1551
   1552
NameError: global name 'sqrt' is not defined
===

Here's the function in LR_run_functions_v2.py:

===
def make_half_square_array(linear_version_of_square_array):
    a = linear_version_of_square_array
    dim = int(sqrt(size(a)))
===

Any ideas? If I do something like "import math" in the subfunction, then the error changes to "global name 'math' is not defined".

Thanks!
Nick
Re: global name 'sqrt' is not defined
Scott David Daniels wrote:
> M.-A. Lemburg wrote:
>> On 2009-02-05 10:08, Nick Matzke wrote:
>>> ..., I can run this in the ipython shell just fine:
>>>
>>> a = ["12", "15", "16", "38.2"]
>>> dim = int(sqrt(size(a)))
>>>
>>> ...But if I move these commands to a function in another file, it freaks out:
>>
>> You need to add:
>>
>> from math import sqrt
>>
>> or:
>>
>> from cmath import sqrt
>>
>> or:
>>
>> from numpy import sqrt

The weird thing is, when I do this, I still get the error:

n...@mws2[phylocom]|27> a = ["12", "15", "16", "38.2"]
n...@mws2[phylocom]|28> from LR_run_functions_v2 import make_half_square_array
n...@mws2[phylocom]|24> d = make_half_square_array(a)
---
NameError Traceback (most recent call last)
/bioinformatics/phylocom/ in ()
/bioinformatics/phylocom/_scripts/LR_run_functions_v2.py in make_half_square_array(linear_version_of_square_array)
   1548     from numpy import sqrt
   1549     a = linear_version_of_square_array
-> 1550     dim = int(sqrt(size(a)))
   1551
   1552
NameError: global name 'sqrt' is not defined
n...@mws2[phylocom]|25>

Is there some other place I should put the import command? I.e.:

1. In the main script/ipython command line
2. In the called function, i.e. make_half_square_array() in LR_run_functions_v2.py
3. At the top of LR_run_functions_v2.py, outside of the individual functions?

Thanks...sorry for the noob questions!
Nick

> Each with their own, slightly different, meaning. Hence the reason many of us prefer to import the module and reference the function as a module attribute.
>
> Note that _many_ (especially older) package documents describe their code without the module name. I believe that such behavior is because, when working to produce prose about a package, it feels too much like useless redundancy when describing each function or class as "package.name".
>
> --Scott David Daniels
> scott.dani...@acm.org
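Scott's point about the "slightly different meaning" of each sqrt can be seen in a small sketch. It uses only the two standard-library modules so that it runs anywhere, and it is written for Python 3; numpy.sqrt is left out here, but it differs again in that it operates elementwise on arrays. The variable names are illustrative only.

```python
import cmath
import math

# Referencing each function through its module makes it obvious which
# sqrt is meant, which is the style recommended in the reply above.
real_root = math.sqrt(4)         # plain float result
complex_root = cmath.sqrt(4)     # complex result, even for a positive input
imaginary_root = cmath.sqrt(-1)  # cmath accepts negative inputs

try:
    math.sqrt(-1)                # math.sqrt rejects negative inputs
    math_accepts_negative = True
except ValueError:
    math_accepts_negative = False

print(real_root, complex_root, imaginary_root, math_accepts_negative)
```

With "from math import sqrt" and "from cmath import sqrt" both in play, whichever import ran last silently wins, which is one practical reason to prefer the module.attribute form.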
Re: global name 'sqrt' is not defined
OK, so the problem was that I had to exit ipython, re-enter it, and then import my module to get the errors to disappear. Thanks for the help!

(PS: Is there a way to force a complete reload of a module, without exiting ipython? Just doing the import command again doesn't seem to do it.)

Thanks!
Nick

Diez B. Roggisch wrote:
> Nick Matzke wrote:
>> Is there some other place I should put the import command? I.e.:
>>
>> 1. In the main script/ipython command line
>> 2. In the called function, i.e. make_half_square_array() in LR_run_functions_v2.py
>> 3. At the top of LR_run_functions_v2.py, outside of the individual functions?
>
> The latter. Python's imports are always local to the module/file they are in, not globally effective.
>
> Diez
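Both halves of this exchange (imports belong at the top of the module that uses the name, and an edited module has to be reloaded rather than re-imported) can be reproduced in a self-contained sketch. Everything here is an assumption made for the demonstration: the module name lr_demo_functions, the function dim_of, and the use of len() in place of the original size() call. It is written for Python 3, where the reload function lives in importlib; in the Python 2.5 used in this thread the equivalent was the built-in reload(module).

```python
import importlib
import os
import sys
import tempfile

# Hypothetical stand-ins for the two states of LR_run_functions_v2.py.
BROKEN = "def dim_of(a):\n    return int(sqrt(len(a)))\n"
FIXED = "from math import sqrt\n\n\ndef dim_of(a):\n    return int(sqrt(len(a)))\n"

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "lr_demo_functions.py")
with open(path, "w") as handle:
    handle.write(BROKEN)

sys.path.insert(0, tmpdir)
importlib.invalidate_caches()
module = importlib.import_module("lr_demo_functions")

a = ["12", "15", "16", "38.2"]
try:
    module.dim_of(a)
    first_error = None
except NameError as err:
    # The module never imported sqrt, so the name is missing there,
    # no matter what the calling session has imported.
    first_error = str(err)

# Fix the file on disk: the import now sits at the top of the module.
with open(path, "w") as handle:
    handle.write(FIXED)

# A second "import" would just return the cached module object;
# reload re-executes the source file in place.
importlib.invalidate_caches()
module = importlib.reload(module)
result = module.dim_of(a)
print(first_error, result)
```

Names pulled in earlier with "from module import name" keep pointing at the old function objects after a reload, so they need to be imported again or accessed as module.name.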
hist without plotting
Hi,

Is there a way to run the numpy hist function or something similar and get the outputs (bins, bar heights) without actually producing the plot on the screen?

(R has a plot = FALSE option, something like this is what I'm looking for...)

Cheers!
Nick
Re: hist without plotting
Nevermind, I was running the pylab hist; the numpy.histogram function generates the bar counts etc. without plotting the histogram.

Cheers!
Nick

Nick Matzke wrote:
> Is there a way to run the numpy hist function or something similar and get the outputs (bins, bar heights) without actually producing the plot on the screen?
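A minimal sketch of the numpy.histogram call mentioned above, assuming NumPy is installed and using made-up sample data. It returns the bar heights and the bin edges as arrays and never opens a plot window.

```python
import numpy as np

# Made-up sample values, for illustration only.
data = [0.5, 1.5, 1.7, 2.2, 3.9, 3.1]

# Four equal-width bins covering 0..4; no figure is created.
counts, edges = np.histogram(data, bins=4, range=(0, 4))

print(counts)  # bar heights, one per bin
print(edges)   # bin edges, one more entry than there are bins
```

Note that the edges array always has one more element than the counts array, since it holds both ends of every bin.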
pythonic array subsetting
Hi,

So I've got a square floating point array that is about 1000 x 1000. I need to subset this array as efficiently as possible based on an ordered sublist of the list of rownames/colnames (they are the same, this is a symmetric array).

e.g., if sublist is of length 500, and matches the rownames list at every other entry, I need to pull out a 500x500 array holding every other row & column in the parent array. I have to do this hundreds of times, so speed would be useful.

Cheers!
Nick
Re: pythonic array subsetting
Looks like "compress" is the right numpy function, but it took forever for me to find it. Note that its first argument is a boolean mask along the chosen axis, not a list of indices:

x = array([[1,2,3], [4,5,6], [7,8,9]], dtype=float)
compress([True, True, False], x, axis=1)

result:

array([[ 1.,  2.],
       [ 4.,  5.],
       [ 7.,  8.]])

Gary Herron wrote:
> Check out numpy at http://numpy.scipy.org
> It can do what you want very efficiently.
>
> Gary Herron
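For the original goal (pulling out the same subset of rows and columns by name), one possible approach is integer indexing with numpy.ix_ rather than compress. The sketch below assumes NumPy is installed and is written for Python 3; the names list, the 4 x 4 array and the helper dictionary index_of are all made up for illustration, with a tiny array standing in for the 1000 x 1000 one.

```python
import numpy as np

# Hypothetical row/column labels and a small stand-in for the big array.
names = ["a", "b", "c", "d"]
x = np.arange(16, dtype=float).reshape(4, 4)

# Ordered sublist of labels to keep (every other entry here).
sublist = ["a", "c"]

# Build the name -> position lookup once, then reuse it for every subset.
index_of = {name: i for i, name in enumerate(names)}
idx = [index_of[name] for name in sublist]

# np.ix_ turns the index list into a row selector and a column selector,
# so the result is the square block at the chosen rows and columns.
sub = x[np.ix_(idx, idx)]
print(sub)
```

Plain x[idx, idx] would instead pick out only the diagonal elements of that block, which is why the ix_ helper is used.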
debugging in IPython
This is a general question, but maybe there is some obvious solution I've missed.

When I am writing code, I have a main script that calls functions in another .py file. When there is a bug or crash in the main script, in IPython I can just start typing the names of variables etc. to see what they contained at the point where the script crashed.

However, if the bug is in a function I've called from the main script, the crash dialog will indicate the function, line of code, etc. where the crash occurred. However, the only variables I can access at the IPython prompt are those used in the main script.

Is there any way to access the variables in those sub-functions after a crash, in IPython or something similar?

The only other option is pasting all the code from each function into IPython manually, or adding print lines throughout the relevant sub-functions. This is doable but extremely tedious when the crash occurred 5 functions deep, or at some unknown point within a for loop.

Any help much appreciated!

Cheers!
Nick
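No reply to this question appears in this part of the archive. As a hedged pointer: the variables of the function that crashed are still reachable through the traceback object, which is what post-mortem debuggers rely on. Interactively, pdb.pm() from the standard library, or IPython's %debug magic typed right after the crash, open a debugger in the failing frame. The sketch below shows the same idea non-interactively in Python 3; the functions inner and outer are made up for illustration.

```python
import sys


def inner(values):
    total = sum(values)
    ratio = total / 0  # deliberate crash, several calls below the caller
    return ratio


def outer():
    data = [1, 2, 3]
    return inner(data)


try:
    outer()
except ZeroDivisionError:
    tb = sys.exc_info()[2]
    # Walk down the traceback chain to the frame where the error was raised.
    while tb.tb_next is not None:
        tb = tb.tb_next
    crash_function = tb.tb_frame.f_code.co_name
    # Copy the local variables of that frame as they were at the crash.
    crash_locals = dict(tb.tb_frame.f_locals)

print(crash_function, crash_locals)
```

In an interactive debugger session the same frames can be walked with the "up" and "down" commands instead of by hand.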
updating NumPy in EPD
Hi all,

I have a slightly weird question. I would like to install the PyCogent library. However, this requires NumPy 1.3 or higher. I only have NumPy 1.1.1, because I got it as part of the Enthought Python Distribution (4.1) back in 2008.

Now, when I download & install a new version of NumPy, it seems to work. However, the PyCogent installer can only see the NumPy 1.1.1 version.

Any advice on what I might do to fix this?

Cheers!
Nick
Re: updating NumPy in EPD
Oh yes -- I would just update my version of EPD, which is where my NumPy came from -- however, Enthought only has available for academic download a version of EPD that works on OS X 10.5 or later, and my Mac is a 10.4.11 and I'd rather not completely reinstall the OS just to get one little library to work.

Cheers!
Nick
Re: updating NumPy in EPD
Hi again,

I got the solution on the NumPy list, I thought I would share for posterity...

Jeff Hsu wrote:
> Check which version of numpy python is importing with "import numpy;
> print numpy.__file__". I had a similar question and this worked after I
> removed that installation of numpy. I think the enthought distro
> installs it somewhere else that has priority.

Ah, this was totally the trick! To summarize for posterity:

=
Fix for an old version of NumPy installed with the EPD (Enthought Python Distribution):

0. Figure out what your current version is, and where it is located:

ipython
import numpy
print numpy.__file__
numpy.__version__

1. Download newest Numpy.tar.gz (1.4.1) from sourceforge, unzip

2. Install with:

cd ~/Desktop/downloads/numpy-1.4.1
python setup.py (3 times -- configure, build, install)

3. Delete or rename old Numpy, redirect IPython's location to new install:

cd /Library/Frameworks/Python.framework/Versions/2.5.2001/lib/python2.5/site-packages/numpy-1.0.4.0004-py2.5-macosx-10.3-fat.egg
mv numpy numpy_old
ln -s /Library/Frameworks/Python.framework/Versions/2.5.2001/lib/python2.5/site-packages/numpy numpy

Manual install of PyCogent:

1. Download from sourceforge

2. Working install:

cd /bioinformatics/pythonstuff/PyCogent-1.4.1/
python setup.py build
sudo python setup.py install
python setup.py build_ext -if
(no NumPy version error this time!)

Finally:

ipython
import numpy
dir(numpy)

Thanks!
Nick
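Step 0 above (finding out which copy of a package Python will actually pick up) can also be done without importing the package, using the standard library in Python 3. This is a sketch, not part of the original fix; the helper name where_is is made up, and "json" is used as the demo target only because it ships with every Python.

```python
import importlib.util
import sys


def where_is(module_name):
    """Return the file a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin


# A standard-library package, always present.
json_location = where_is("json")
print(json_location)

# Prints None if NumPy is not installed in this interpreter.
print(where_is("numpy"))

# Earlier entries on sys.path take priority, which is how an old copy
# bundled with a distribution can shadow a freshly installed one.
for entry in sys.path[:5]:
    print(entry)
```

Comparing the reported location against the directory the new version was installed into shows quickly whether an older copy is shadowing it.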
cleaning up an ASCII file?
Hi all,

So I'm parsing an XML file returned from a database. However, the database entries have occasional non-ASCII characters, and this is crashing my parsers.

Is there some handy function out there that will schlep through a file like this, and do something like fix the characters that it can recognize, and delete those that it can't? Basically, like the BBEdit "convert to ASCII" menu option under "Text".

I googled some on this, but nothing obvious came up that wasn't specific to fixing one or a few characters.

Thanks!
Nick
Re: cleaning up an ASCII file?
Apologies, I figured there was some easy, obvious solution, since there is in BBEdit. I will explain further...

John Machin wrote:
> On Jun 11, 6:09 am, Nick Matzke wrote:
>> Hi all, So I'm parsing an XML file returned from a database. However, the database entries have occasional non-ASCII characters, and this is crashing my parsers.
>
> So fix your parsers. google("unicode"). Deleting stuff that you don't understand is an "interesting" approach to academic research :-(

Not if it's just weird versions of dash characters, umlauted characters and the like, which is what I bet it is. Those sorts of things, and the apparent inability of lots of email readers and websites to deal with them, have been annoying me for years, so I tend to move straight towards genocidal tactics when I detect their presence. (My database source is GBIF. They get museum specimen submissions from around the planet, there are zillions of records, and I am just a user, so fixing it on their end is not a realistic option.)

> Care to divulge what "crash" means? e.g. the full traceback and error message, plus what version of python on what platform, what version of ElementTree or other XML software you are using ...

All that is fine; the problem is actually when I try to print to screen in IPython:

    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 293: ordinal not in range(128)

Probably this is the line in the file which is causing problems (as displayed in BBEdit):

==
- This document contains data shared through the GBIF Network - see http://data.gbif.org/ for more information.
All usage of these data must be in accordance with the GBIF Data Use Agreement - see http://www.gbif.org/DataProviders/Agreements/DUA
Please cite these data as follows:
Jyväskylä University Museum - The Section of Natural Sciences, Vascular plant collection of Jyvaskyla University Museum (accessed through GBIF data portal, http://data.gbif.org/datasets/resource/462, 2009-06-11)
Missouri Botanical Garden, Missouri Botanical Garden (accessed through GBIF data portal, http://data.gbif.org/datasets/resource/621, 2009-06-11)
Museo Nacional de Costa Rica, herbario (accessed through GBIF data portal, http://data.gbif.org/datasets/resource/566, 2009-06-11)
National Science Museum, Japan, Kurashiki Museum of Natural History (accessed through GBIF data portal, http://data.gbif.org/datasets/resource/599, 2009-06-11)
The Swedish Museum of Natural History (NRM), Herbarium of Oskarshamn (OHN) (accessed through GBIF data portal, http://data.gbif.org/datasets/resource/1024, 2009-06-11)
Tiroler Landesmuseum Ferdinandeum, Tiroler Landesmuseum Ferdinandeum (accessed through GBIF data portal, http://data.gbif.org/datasets/resource/1509, 2009-06-11)
UCD, Database Schema for UC Davis [Herbarium Labels] (accessed through GBIF data portal, http://data.gbif.org/datasets/resource/734, 2009-06-11) -
==

Presumably "Jyväskylä University Museum" is the problem, since there are umlauted a's in there. (Note, though, that I have thousands of records to parse, so there is going to be all kinds of other umlauted and accented stuff in these sorts of search results.) So the goal is to replace the characters with un-umlauted versions or some such.

Cheers!
Nick

PS: versions I am using:

    nick$ python -V
    Python 2.5.2 |EPD Py25 4.1.30101|

>> Center for Theoretical Evolutionary Genomics
>
> If your .sig evolves much more, it will consume all available bandwidth in the known universe and then some ;-)

...it's easier to have a big sig than to try and remember all that stuff ;-)...
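For readers who want to reproduce the UnicodeEncodeError quoted in the message above, here is a minimal sketch. It assumes Python 3 rather than the Python 2.5.2 used in the thread, and the variable names are purely illustrative. It shows that the failure comes from encoding the character U+00E4 (the umlauted "a") to ASCII, and that passing an error handler to encode() avoids the exception without touching the source data.

```python
# A sample string containing the character from the traceback (U+00E4).
text = "Jyv\u00e4skyl\u00e4 University Museum"

# Encoding to ASCII with the default "strict" handler raises the error.
try:
    text.encode("ascii")
    failure = None
except UnicodeEncodeError as err:
    # err.start is the index of the first character that cannot be encoded.
    failure = (err.start, err.object[err.start])

# "replace" substitutes a question mark for each unencodable character.
safe_replace = text.encode("ascii", "replace").decode("ascii")

# "backslashreplace" keeps the information as a visible escape sequence.
safe_escape = text.encode("ascii", "backslashreplace").decode("ascii")

print(failure)
print(safe_replace)
print(safe_escape)
```

Either handler makes the text printable on an ASCII-only terminal; which one is appropriate depends on whether the original characters need to stay recoverable.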
Re: cleaning up an ASCII file?
Looks like this was a solution:

1. Use this guy's unescape function to convert from HTML/XML entities to Unicode: http://effbot.org/zone/re-sub.htm#unescape-html

2. Take the Unicode and convert it to approximate plain ASCII matches with unicodedata:

    import unicodedata
    ascii_content2 = unescape(line)
    ascii_content = unicodedata.normalize('NFKD', unicode(ascii_content2)).encode('ascii', 'ignore')

The string "line" would give the error, but ascii_content does not.

Cheers!
Nick

PS: "asciiDammit" is also fun to look at

John Machin wrote:
> On Jun 11, 6:09 am, Nick Matzke wrote:
>> Hi all, So I'm parsing an XML file returned from a database. However, the database entries have occasional non-ASCII characters, and this is crashing my parsers.
>
> So fix your parsers. google("unicode"). Deleting stuff that you don't understand is an "interesting" approach to academic research :-(
>
> Care to divulge what "crash" means? e.g. the full traceback and error message, plus what version of python on what platform, what version of ElementTree or other XML software you are using ...
>
>> Center for Theoretical Evolutionary Genomics
>
> If your .sig evolves much more, it will consume all available bandwidth in the known universe and then some ;-)
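The two-step fix described in this message can be sketched in Python 3 roughly as follows. This is an assumption-laden translation, not the code from the thread: the standard library's html.unescape stands in for the effbot unescape function, the helper name to_ascii is made up for illustration, and the str type replaces Python 2's unicode().

```python
import html
import unicodedata


def to_ascii(text):
    """Return an approximate ASCII rendering of text (hypothetical helper)."""
    # Step 1: turn entities such as &auml; or &#228; into real characters.
    decoded = html.unescape(text)
    # Step 2: NFKD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", decoded)
    # Encoding with errors="ignore" then drops everything outside ASCII,
    # including the combining marks produced in step 2.
    return decomposed.encode("ascii", "ignore").decode("ascii")


cleaned = to_ascii("Jyv&auml;skyl&#228; University Museum")
# An en dash has no ASCII decomposition, so it is silently removed.
dash_lost = to_ascii("1990\u20132000")

print(cleaned)
print(dash_lost)
```

Note the trade-off in this approach: characters with a decomposition lose only their accents, but characters without one (dashes, "ß", "ø" and so on) vanish entirely, so it may be worth logging what was dropped before relying on the cleaned text.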