Re: hex dump w/ or w/out utf-8 chars

Steven D'Aprano Sun, 07 Jul 2013 22:54:01 -0700

On Sun, 07 Jul 2013 17:22:26 -0700, blatt wrote:

> Hi all,
> but a particular hello to Chris Angelino which with their critics and
> suggestions pushed me to make a full revision of my application on hex
> dump in presence of utf-8 chars.


I don't understand what you are trying to say. All characters are UTF-8 
characters. "a" is a UTF-8 character. So is "ă".


> If you are not using python 3, the utf-8 codec can add further
> programming problems, 

On the contrary, I find that so long as you understand what you are doing 
it solves problems, not adds them. However, if you are confused about the 
difference between characters (text strings) and bytes, or if you are 
dealing with arbitrary binary data and trying to treat it as if it were 
UTF-8 encoded text, then you can have errors. Those errors are a good 
thing.
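
To make the distinction concrete, here is a minimal sketch, assuming 
Python 3 (where text strings and bytes are separate types). The sample 
values are made up for illustration. Encoding text gives bytes, and the 
two have different lengths as soon as a non-ASCII character is involved; 
decoding arbitrary binary data as UTF-8 fails loudly instead of silently 
producing garbage:

```python
# Text (characters) versus its UTF-8 encoding (bytes).
text = 'non \u00e8'              # 5 characters, the last one is e-grave
data = text.encode('utf-8')      # 6 bytes, e-grave becomes c3 a8

char_count = len(text)
byte_count = len(data)

# Arbitrary binary data is not necessarily valid UTF-8.
blob = b'\xff\xfe\x00\x01'       # made-up bytes; 0xff never occurs in UTF-8
decode_error = None
try:
    blob.decode('utf-8')
except UnicodeDecodeError as exc:
    decode_error = exc           # the error tells you the data is not text
```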


> especially if you are not a guru.... The script
> seems very long but I commented too much ... sorry. It is very useful
> (at least IMHO...)
> It works under Linux. but there is still a little problem which I didn't
> solve (at least programmatically...).
> 
> 
> # -*- coding: utf-8 -*-
> # px.py vers. 11 (pxb.py)   
> # python 2.6.6 # hex-dump w/ or w/out utf-8 chars
> # Using spaces as separators, this script shows 
> # (better than tabnanny) uncorrect  indentations.

The word you are looking for is "incorrect".


> # to save output > python pxb.py hex.txt > px9_out_hex.txt
> 
> nLenN=3          # n. of digits for lines
> 
> # version almost thoroughly rewritten on the ground of 
> # the critics and modifications suggested by Chris Angelico
> 
> # in the first version the utf-8 conversion to hex was shown horizontally:
> 
> # 005 # qwerty: non è unicode bensì ascii 
> #     2 7767773 666 ca 7666666 6667ca 676660
> #     3 175249a efe 38 5e93f45 25e33c 13399a

Oh! We're supposed to read the output *downwards*! That's not very 
intuitive. It took me a while to work that out. You should at least say 
so.


> # ... but I had to insert additional chars to keep the
> # synchronization between the literal and the hex part
> 
> # 005 # qwerty: non è. unicode bensì. ascii 
> #     2 7767773 666 ca 7666666 6667ca 676660
> #     3 175249a efe 38 5e93f45 25e33c 13399a

Well that sucks, because now sometimes you have to read downwards 
(character 'q' -> hex 71, reading downwards) and sometimes you read both 
downwards and across (character 'è' -> hex c3a8). Sometimes a dot means a 
dot and sometimes it means filler. How is the user supposed to know when 
to read down and when across?

 
> # in the second version I followed Chris suggestion:
> # "to show the hex utf-8 vertically"

You're already showing UTF-8 characters vertically, if they happen to be 
one-byte characters. Better to be consistent and always show characters 
vertically, regardless of whether they are one, two, three or four bytes.


> # 005 # qwerty: non è unicode bensì ascii
> #     2 7767773 666 c 7666666 6667c 676660
> #     3 175249a efe 3 5e93f45 25e33 13399a 
> #                   a             a
> #                   8             c

Much better! Now at least you can trivially read down the column to see 
the bytes used for each character. As an alternative, you can space each 
character to show the bytes horizontally, displaying spaces and other 
invisible characters either as dots, backslash escapes, or Unicode 
control pictures, whichever you prefer. The example below uses dots for 
spaces and backslash escape for newline:

q  w  e  r  t  y  :  .  n  o  n  .  è     .  u  n  i  
71 77 65 72 74 79 3a 20 6e 6f 6e 20 c3 a8 20 75 6e 69

c  o  d  e  .  b  e  n  s  ì     .  a  s  c  i  i  \n
63 6f 64 65 20 62 65 6e 73 c3 ac 20 61 73 63 69 69 0a


There will always be some ambiguity between (e.g.) a dot representing a 
dot, and a dot representing an invisible control character or space, but 
the reader can always tell them apart by reading the hex value, which you 
*always* read horizontally whether it is one byte, two, three or four. 
There's never any confusion whether you should read down or across.
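
One possible way to produce that two-row layout, as a rough sketch for 
Python 3. The function name hexdump_rows is invented for this example, 
and it assumes every displayed character is one column wide (which is 
not true for wide East Asian characters or combining marks):

```python
def hexdump_rows(text):
    """Return (characters_row, hex_row) with each character aligned
    above the hex of its UTF-8 bytes."""
    char_cells = []
    hex_cells = []
    for ch in text:
        encoded = ch.encode('utf-8')
        hex_cell = ' '.join('%02x' % byte for byte in encoded)
        if ch == '\n':
            shown = '\\n'            # backslash escape for newline
        elif ch <= ' ' or ch == '\x7f':
            shown = '.'              # space and other invisible characters
        else:
            shown = ch
        # Pad the character so it is as wide as its hex bytes.
        char_cells.append(shown.ljust(len(hex_cell)))
        hex_cells.append(hex_cell)
    return ' '.join(char_cells).rstrip(), ' '.join(hex_cells)


chars_row, hex_row = hexdump_rows('non \u00e8\n')
print(chars_row)
print(hex_row)
```

Because each character cell is padded to the width of its own bytes, a 
multi-byte character simply takes a wider column and everything after it 
stays in step.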

Unfortunately, most fonts don't support the Unicode control pictures. But 
if you choose to use them, here they are, together with their Unicode 
name. You can use the form

'\N{...}'  # Python 3
u'\N{...}'  # Python 2

to get the characters, replacing ... with the name shown below:


␀ SYMBOL FOR NULL
␁ SYMBOL FOR START OF HEADING
␂ SYMBOL FOR START OF TEXT
␃ SYMBOL FOR END OF TEXT
␄ SYMBOL FOR END OF TRANSMISSION
␅ SYMBOL FOR ENQUIRY
␆ SYMBOL FOR ACKNOWLEDGE
␇ SYMBOL FOR BELL
␈ SYMBOL FOR BACKSPACE
␉ SYMBOL FOR HORIZONTAL TABULATION
␊ SYMBOL FOR LINE FEED
␋ SYMBOL FOR VERTICAL TABULATION
␌ SYMBOL FOR FORM FEED
␍ SYMBOL FOR CARRIAGE RETURN
␎ SYMBOL FOR SHIFT OUT
␏ SYMBOL FOR SHIFT IN
␐ SYMBOL FOR DATA LINK ESCAPE
␑ SYMBOL FOR DEVICE CONTROL ONE
␒ SYMBOL FOR DEVICE CONTROL TWO
␓ SYMBOL FOR DEVICE CONTROL THREE
␔ SYMBOL FOR DEVICE CONTROL FOUR
␕ SYMBOL FOR NEGATIVE ACKNOWLEDGE
␖ SYMBOL FOR SYNCHRONOUS IDLE
␗ SYMBOL FOR END OF TRANSMISSION BLOCK
␘ SYMBOL FOR CANCEL
␙ SYMBOL FOR END OF MEDIUM
␚ SYMBOL FOR SUBSTITUTE
␛ SYMBOL FOR ESCAPE
␜ SYMBOL FOR FILE SEPARATOR
␝ SYMBOL FOR GROUP SEPARATOR
␞ SYMBOL FOR RECORD SEPARATOR
␟ SYMBOL FOR UNIT SEPARATOR
␠ SYMBOL FOR SPACE
␡ SYMBOL FOR DELETE
␢ BLANK SYMBOL
␣ OPEN BOX
␤ SYMBOL FOR NEWLINE
␥ SYMBOL FOR DELETE FORM TWO
␦ SYMBOL FOR SUBSTITUTE FORM TWO


(I wish more fonts would support these characters, they are very useful.)
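
If you do want to use them, the control pictures for the ASCII control 
characters and space sit at U+2400 plus the character's code, and DELETE 
has its own symbol. A small sketch, assuming Python 3; the helper name 
control_picture is made up for illustration:

```python
def control_picture(ch):
    """Return a visible stand-in for a control character, space or DEL."""
    code = ord(ch)
    if code <= 0x20:
        # U+2400 (SYMBOL FOR NULL) through U+2420 (SYMBOL FOR SPACE)
        return chr(0x2400 + code)
    if code == 0x7F:
        return '\N{SYMBOL FOR DELETE}'
    return ch  # everything else is shown unchanged


visible = ''.join(control_picture(c) for c in 'a b\tc\n')
```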


[...]
> # works on any n. of bytes for utf-8
> 
> # For the user: it is helpful to have in a separate file
> # all special characters of interest, together with their names.

In Python, you can use the unicodedata module to look up characters by 
name, or given the character, find out what its name is.
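
For example (a short sketch, assuming Python 3; in Python 2 the same 
functions take unicode strings):

```python
import unicodedata

# Character -> name
name = unicodedata.name('\u00e8')

# Name -> character
char = unicodedata.lookup('LATIN SMALL LETTER E WITH GRAVE')

# Control characters have no name, so name() raises ValueError for them
# unless you pass a default as the second argument.
newline_name = unicodedata.name('\n', '<unnamed>')
```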


[...]
> import fileinput
> import sys, commands
> 
> lF=[]                           # input file as list
> for line in fileinput.input():  # handles all the details of args-or-stdin
>     lF.append(line)


That is more easily written as:

lF = list(fileinput.input())

and better written with a meaningful variable name. Whenever you have a 
variable, and find the need to give a comment explaining what the 
variable name means, you should consider a more descriptive name.

When that name is a cryptic two letter name, that goes double.


> sSpacesXLN = ' ' * (nLenN+1)
> 
> 
> for n in xrange(len(lF)):
>     sLineHexND=lF[n].encode('hex')     # ND = no delimiter (space)

You're programming like a Pascal or C programmer. There is almost never 
any need to write code like that in Python. Rather than iterate over the 
indexes, then extract the part you want, it is better to iterate directly 
over the parts you want:

for line in lF:
    sLineHexND = line.encode('hex')
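
If you also need the line number, enumerate gives you both without any 
indexing. The sketch below assumes Python 3, where the 'hex' codec on 
strings no longer exists, so it encodes to UTF-8 first and uses 
binascii.hexlify instead. The sample lines and the names are made up 
for illustration:

```python
import binascii

lines = ['qwerty\n', 'non \u00e8\n']    # stand-in for the input file

numbered = []
for number, line in enumerate(lines, 1):   # count lines starting from 1
    raw = line.encode('utf-8')             # text -> bytes
    hex_digits = binascii.hexlify(raw).decode('ascii')
    numbered.append('%03d %s' % (number, hex_digits))
```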



>     sLineHex  =lF[n].encode('hex').replace('20','  ')
>     sLineHexH =sLineHex[::2]
>     sLineHexL =sLineHex[1::2]

Trying to keep code lined up in this way is a bad habit to get into. It 
just sets you up for many hours of unproductive adding and deleting 
spaces trying to keep things aligned.

Also, what on earth are all these "s" prefixes?

>     sSynchro=''
>     for k in xrange(0,len(sLineHexND),2):

Probably the best way to walk through a string, grabbing the characters 
in pairs, comes from the itertools module: see the recipe for "grouper".

http://docs.python.org/2/library/itertools.html

Here is a simplified version:

assert len(line) % 2 == 0
for pair in zip(*([iter(line)]*2)):
    ...

although understanding how it works requires a little advanced knowledge.
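
A sketch of why it works, assuming Python 3 (the sample string is made 
up). The list [iter(line)]*2 contains the *same* iterator object twice, 
so each time zip builds a tuple it pulls two consecutive items from that 
one iterator:

```python
hex_digits = 'c3a820'

# Spelled out: one iterator, passed to zip twice.
shared = iter(hex_digits)
pairs = list(zip(shared, shared))

# The compact form from the recipe does exactly the same thing.
same_pairs = list(zip(*([iter(hex_digits)] * 2)))

# Join each pair back into a two-digit hex string.
byte_strings = [''.join(pair) for pair in pairs]
```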


>         if sLineHexND[k]<'8':
>             sSynchro+= sLineHexND[k]+sLineHexND[k+1] 
>             k+=1
>         elif sLineHexND[k]=='c':
>             sSynchro+='c'+sLineHexND[k+1]+sLineHexND[k+2]+sLineHexND[k+3]+'2e'
>             k+=3
>         elif sLineHexND[k]=='e':
>             sSynchro+='e'+sLineHexND[k+1]+sLineHexND[k+2]+sLineHexND[k+3]+\
>                           sLineHexND[k+4]+sLineHexND[k+5]+'2e2e'
>             k+=5

Apart from being hideously ugly to read, I do not believe this code works 
the way you think it works. Adding to the loop variable doesn't advance 
the loop. Try this and see for yourself:


for i in range(10):
    print(i)
    i += 5


The loop variable just gets reset once it reaches the top of the loop 
again.
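
If you really do need to skip ahead by a variable amount, use a while 
loop and manage the index yourself. Here is a rough sketch, assuming 
Python 3 and well-formed UTF-8 input (it does not validate continuation 
bytes); the function names are invented for this example. Note that 
two-byte sequences start with any lead byte from 0xc2 to 0xdf, so 
testing only for a leading hex digit 'c' misses the ones starting with 
'd', and four-byte sequences start with 'f':

```python
def utf8_sequence_length(lead_byte):
    """Number of bytes in the UTF-8 sequence starting with lead_byte."""
    if lead_byte < 0x80:
        return 1
    if 0xC0 <= lead_byte < 0xE0:
        return 2
    if 0xE0 <= lead_byte < 0xF0:
        return 3
    if 0xF0 <= lead_byte < 0xF8:
        return 4
    raise ValueError('not a UTF-8 lead byte: %#04x' % lead_byte)


def split_utf8(data):
    """Split UTF-8 bytes into one bytes object per character."""
    groups = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        groups.append(data[i:i + n])
        i += n                      # here the increment really takes effect
    return groups


groups = split_utf8('\u00e8 a\u20ac'.encode('utf-8'))
```

In practice it is usually simpler to decode the bytes to text first and 
then encode each character separately, which avoids inspecting lead 
bytes at all.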



-- 
Steven