Hi,

I observed encoding problems when reading descriptions from UDD when they contain non-ASCII characters, and I wonder what I might be doing wrong. Here is a little test program which queries some descriptions I found to be problematic:
########################################################
#!/usr/bin/python

PORT=5441

import psycopg2
from sys import stderr, exit

conn = psycopg2.connect(host="localhost",port=PORT,user="guest",database="udd")
curs = conn.cursor()

query = """PREPARE query_desc (text) AS
   SELECT description, long_description, version
   FROM packages
   WHERE package = $1 AND architecture = 'i386' and release = 'sid'"""
curs.execute(query)

for pkg in ['mafft', 'melting', 'rnahybrid', 't-coffee']:
    query = "EXECUTE query_desc ('%s')" % pkg
    curs.execute(query)
    for row in curs.fetchall():
        try:
            string = unicode(row[1])
            print "%s: %s (%s)\n%s\n" % (pkg, row[0], row[2], row[1])
        except UnicodeDecodeError, errtxt:
            print >> stderr, "----> %s UnicodeDecodeError: '%s'; ErrTxt: %s" % \
                (pkg, row[1], errtxt)
########################################################

This results in:

----> mafft UnicodeDecodeError: ' MAFFT is a multiple sequence alignment program which offers three
 accuracy-oriented methods:
  * L-INS-i (probably most accurate; recommended for <200 sequences;
    iterative refinement method incorporating local pairwise alignment
    information),
  * G-INS-i (suitable for sequences of similar lengths; recommended for
    <200 sequences; iterative refinement method incorporating global
    pairwise alignment information),
  * E-INS-i (suitable for sequences containing large unalignable regions;
    recommended for <200 sequences),
 and five speed-oriented methods:
  * FFT-NS-i (iterative refinement method; two cycles only),
  * FFT-NS-i (iterative refinement method; max. 1000 iterations),
  * FFT-NS-2 (fast; progressive method),
  * FFT-NS-1 (very fast; recommended for >2000 sequences; progressive
    method with a rough guide tree),
  * NW-NS-PartTree-1 (recommended for <E2><88><BC>50,000 sequences; progressive
    method with the PartTree algorithm).'; ErrTxt: 'ascii' codec can't decode byte 0xe2 in position 889: ordinal not in range(128)
----> melting UnicodeDecodeError: ' This program computes, for a nucleic acid duplex, the enthalpy, the
 entropy and the melting temperature of the helix-coil transitions.
 Three types of hybridisation are possible: DNA/DNA,
...

I tried several encode / decode combinations, but without success. Is there a simple solution to handle those non-ASCII strings appropriately?

Kind regards

      Andreas.

-- 
http://fam-tille.de
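
For reference, the error can be reproduced without the database. The following is only a minimal sketch, assuming the description arrives in Python as a UTF-8 encoded byte string; the names `raw` and `try_decode` are made up for illustration and are not part of the test program above. The three bytes E2 88 BC from the mafft output are the UTF-8 encoding of U+223C (TILDE OPERATOR), so decoding with the 'ascii' codec fails at 0xe2 while decoding as UTF-8 succeeds:

```python
# -*- coding: utf-8 -*-
# Assumption: the column value reaches Python as UTF-8 encoded bytes.
# This fragment is cut from the mafft long description shown above.
raw = b"recommended for \xe2\x88\xbc50,000 sequences"


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# The 'ascii' codec rejects 0xe2 because it is outside range(128).
as_ascii = try_decode(raw, "ascii")

# UTF-8 accepts the sequence E2 88 BC and yields the single character U+223C.
as_utf8 = try_decode(raw, "utf-8")

print(repr(as_ascii))
print(repr(as_utf8))
```

If that assumption holds, calling `row[1].decode('utf-8')` instead of `unicode(row[1])` may be enough in the test program. psycopg2 also provides `psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)`, which makes it return unicode objects under Python 2; whether that gives correct results here depends on the client encoding of the connection, which I have not checked.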