On Sun, 07 Jul 2013 17:22:26 -0700, blatt wrote:

> Hi all,
> but a particular hello to Chris Angelino which with their critics and
> suggestions pushed me to make a full revision of my application on hex
> dump in presence of utf-8 chars.

I don't understand what you are trying to say. All characters are UTF-8
characters. "a" is a UTF-8 character. So is "ă".

> If you are not using python 3, the utf-8 codec can add further
> programming problems,

On the contrary, I find that so long as you understand what you are doing
it solves problems rather than adding them. However, if you are confused
about the difference between characters (text strings) and bytes, or if
you are dealing with arbitrary binary data and trying to treat it as if it
were UTF-8 encoded text, then you can have errors. Those errors are a good
thing.

> especially if you are not a guru.... The script
> seems very long but I commented too much ... sorry. It is very useful
> (at least IMHO...)
> It works under Linux. but there is still a little problem which I didn't
> solve (at least programmatically...).
>
>
> # -*- coding: utf-8 -*-
> # px.py vers. 11 (pxb.py)
> # python 2.6.6 # hex-dump w/ or w/out utf-8 chars
> # Using spaces as separators, this script shows
> # (better than tabnanny) uncorrect indentations.

The word you are looking for is "incorrect".

> # to save output > python pxb.py hex.txt > px9_out_hex.txt
>
> nLenN=3 # n. of digits for lines
>
> # version almost thoroughly rewritten on the ground of
> # the critics and modifications suggested by Chris Angelico
>
> # in the first version the utf-8 conversion to hex was shown
> horizontaly:
>
> # 005 # qwerty: non è unicode bensì ascii
> #     2 7767773 666 ca 7666666 6667ca 676660
> #     3 175249a efe 38 5e93f45 25e33c 13399a

Oh! We're supposed to read the output *downwards*! That's not very
intuitive. It took me a while to work that out. You should at least say
so.

> # ... but I had to insert additional chars to keep the
> # synchronization between the literal and the hex part
>
> # 005 # qwerty: non è. unicode bensì. ascii
> #     2 7767773 666 ca 7666666 6667ca 676660
> #     3 175249a efe 38 5e93f45 25e33c 13399a

Well that sucks, because now sometimes you have to read downwards
(character 'q' -> hex 71, reading downwards) and sometimes you read both
downwards and across (character 'è' -> hex c3a8). Sometimes a dot means a
dot and sometimes it means filler. How is the user supposed to know when
to read down and when across?

> # in the second version I followed Chris suggestion:
> # "to show the hex utf-8 vertically"

You're already showing UTF-8 characters vertically, if they happen to be
one-byte characters. Better to be consistent and always show characters
vertically, regardless of whether they are one, two, three or four bytes.

> # 005 # qwerty: non è unicode bensì ascii
> #     2 7767773 666 c 7666666 6667c 676660
> #     3 175249a efe 3 5e93f45 25e33 13399a
> #                   a             a
> #                   8             c

Much better! Now at least you can trivially read down the column to see
the bytes used for each character. As an alternative, you can space each
character to show the bytes horizontally, displaying spaces and other
invisible characters either as dots, backslash escapes, or Unicode control
pictures, whichever you prefer. The example below uses dots for spaces and
a backslash escape for newline:

q  w  e  r  t  y  :  .  n  o  n  .  è     .  u  n  i
71 77 65 72 74 79 3a 20 6e 6f 6e 20 c3 a8 20 75 6e 69

c  o  d  e  .  b  e  n  s  ì     .  a  s  c  i  i  \n
63 6f 64 65 20 62 65 6e 73 c3 ac 20 61 73 63 69 69 0a

There will always be some ambiguity between (e.g.) a dot representing a
dot, and a dot representing an invisible control character or space, but
the reader can always tell them apart by reading the hex value, which you
*always* read horizontally whether the character is one byte, two, three
or four. There's never any confusion whether you should read down or
across.

Unfortunately, most fonts don't support the Unicode control pictures. But
if you choose to use them, here they are, together with their Unicode
name.
You can use the form

    '\N{...}'   # Python 3
    u'\N{...}'  # Python 2

to get the characters, replacing ... with the name shown below:

␀ SYMBOL FOR NULL
␁ SYMBOL FOR START OF HEADING
␂ SYMBOL FOR START OF TEXT
␃ SYMBOL FOR END OF TEXT
␄ SYMBOL FOR END OF TRANSMISSION
␅ SYMBOL FOR ENQUIRY
␆ SYMBOL FOR ACKNOWLEDGE
␇ SYMBOL FOR BELL
␈ SYMBOL FOR BACKSPACE
␉ SYMBOL FOR HORIZONTAL TABULATION
␊ SYMBOL FOR LINE FEED
␋ SYMBOL FOR VERTICAL TABULATION
␌ SYMBOL FOR FORM FEED
␍ SYMBOL FOR CARRIAGE RETURN
␎ SYMBOL FOR SHIFT OUT
␏ SYMBOL FOR SHIFT IN
␐ SYMBOL FOR DATA LINK ESCAPE
␑ SYMBOL FOR DEVICE CONTROL ONE
␒ SYMBOL FOR DEVICE CONTROL TWO
␓ SYMBOL FOR DEVICE CONTROL THREE
␔ SYMBOL FOR DEVICE CONTROL FOUR
␕ SYMBOL FOR NEGATIVE ACKNOWLEDGE
␖ SYMBOL FOR SYNCHRONOUS IDLE
␗ SYMBOL FOR END OF TRANSMISSION BLOCK
␘ SYMBOL FOR CANCEL
␙ SYMBOL FOR END OF MEDIUM
␚ SYMBOL FOR SUBSTITUTE
␛ SYMBOL FOR ESCAPE
␜ SYMBOL FOR FILE SEPARATOR
␝ SYMBOL FOR GROUP SEPARATOR
␞ SYMBOL FOR RECORD SEPARATOR
␟ SYMBOL FOR UNIT SEPARATOR
␠ SYMBOL FOR SPACE
␡ SYMBOL FOR DELETE
␢ BLANK SYMBOL
␣ OPEN BOX
␤ SYMBOL FOR NEWLINE
␥ SYMBOL FOR DELETE FORM TWO
␦ SYMBOL FOR SUBSTITUTE FORM TWO

(I wish more fonts would support these characters, they are very useful.)

[...]

> # works on any n. of bytes for utf-8
>
> # For the user: it is helpful to have in a separate file
> # all special characters of interest, together with their names.

In Python, you can use the unicodedata module to look up characters by
name, or given the character, find out what its name is.

[...]

> import fileinput
> import sys, commands
>
> lF=[] # input file as list
> for line in fileinput.input(): # handles all the details of args-or-stdin
>     lF.append(line)

That is more easily written as:

    lF = list(fileinput.input())

and better written with a meaningful name. Whenever you have a variable,
and find the need to give a comment explaining what the variable name
means, you should consider a more descriptive name.
When that name is a cryptic two letter name, that goes double.

> sSpacesXLN = ' ' * (nLenN+1)
>
>
> for n in xrange(len(lF)):
>     sLineHexND=lF[n].encode('hex') # ND = no delimiter (space)

You're programming like a Pascal or C programmer. There is almost never
any need to write code like that in Python. Rather than iterate over the
indexes, then extract the part you want, it is better to iterate directly
over the parts you want:

    for line in lF:
        sLineHexND = line.encode('hex')

>     sLineHex  =lF[n].encode('hex').replace('20',' ')
>     sLineHexH =sLineHex[::2]
>     sLineHexL =sLineHex[1::2]

Trying to keep code lined up in this way is a bad habit to get into. It
just sets you up for many hours of unproductive adding and deleting
spaces trying to keep things aligned.

Also, what on earth are all these "s" prefixes?

>     sSynchro=''
>     for k in xrange(0,len(sLineHexND),2):

Probably the best way to walk through a string, grabbing the characters
in pairs, comes from the itertools module: see the recipe for "grouper".

http://docs.python.org/2/library/itertools.html

Here is a simplified version:

    assert len(line) % 2 == 0
    for pair in zip(*([iter(line)]*2)):
        ...

although understanding how it works requires a little advanced knowledge.

>         if sLineHexND[k]<'8':
>             sSynchro+= sLineHexND[k]+sLineHexND[k+1]
>             k+=1
>         elif sLineHexND[k]=='c':
>             sSynchro+='c'+sLineHexND[k+1]+sLineHexND[k+2]+sLineHexND[k+3]+'2e'
>             k+=3
>         elif sLineHexND[k]=='e':
>             sSynchro+='e'+sLineHexND[k+1]+sLineHexND[k+2]+sLineHexND[k+3]+\
>                 sLineHexND[k+4]+sLineHexND[k+5]+'2e2e'
>             k+=5

Apart from being hideously ugly to read, I do not believe this code works
the way you think it works. Adding to the loop variable doesn't advance
the loop. Try this and see for yourself:

    for i in range(10):
        print(i)
        i += 5

The loop variable just gets reset once it reaches the top of the loop
again.


-- 
Steven
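
P.S. If the intent of those k+=1 / k+=3 / k+=5 lines was to skip over the
rest of a multi-byte character, one way to get that effect is a while loop
whose index you advance yourself. What follows is only a rough sketch, not
a drop-in replacement for your script. It is written for Python 3 (3.5 or
later, where indexing a bytes object gives an integer and bytes.hex()
exists), the name split_utf8 is made up for illustration, and it assumes
the input is already valid UTF-8, so it makes no attempt to check the
continuation bytes.

```python
def split_utf8(data):
    """Group the bytes of (assumed valid) UTF-8 data, one chunk per character."""
    chunks = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:      # 0xxxxxxx: single byte (ASCII)
            width = 1
        elif lead < 0xE0:    # 110xxxxx: lead byte of a two-byte sequence
            width = 2
        elif lead < 0xF0:    # 1110xxxx: lead byte of a three-byte sequence
            width = 3
        else:                # 11110xxx: lead byte of a four-byte sequence
            width = 4
        chunks.append(data[i:i + width])
        i += width           # in a while loop, this really does skip ahead
    return chunks


text = "non \u00e8"          # "non" followed by a space and e-grave
print(" ".join(chunk.hex() for chunk in split_utf8(text.encode("utf-8"))))
# prints: 6e 6f 6e 20 c3a8
```

Because each chunk holds all the bytes of exactly one character, the
display code can then decide how to lay them out (down a column or across
the line) without any filler bookkeeping.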