Re: Need to know if a file as only ASCII charaters

norseman Tue, 16 Jun 2009 16:08:09 -0700

Scott David Daniels wrote:

norseman wrote:
Scott David Daniels wrote:
Dave Angel wrote:
Jorge wrote: ...
I'm making an application that reads third-party generated ASCII files, but sometimes the files are totally or partially corrupted, and I need to know whether a file is ASCII with *nix line terminators.
In Linux I can run the file command, but the application should run on
Windows.
You are looking for a \x0D (the carriage return) \x0A (the line feed) combination. If present you have Microsoft compatibility. If not, you don't. If you think high bits might be part of the corruption, filter each byte with byte & \x7F (byte AND'ed with hex 7F, or 127 base 10), then check for the \x0D \x0A combination.
Well  ASCII defines a \x0D as the return code, and \x0A as line feed.
It is unix that is wrong, not Microsoft (don't get me wrong, I know
Microsoft has often redefined what it likes invalidly).  If you
open the file with 'U', Python will return lines w/o the \r character
whether or not they started with it, equally well on both unix and
Microsoft systems.
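
A minimal sketch of the universal-newline behaviour described above, for readers on current Python. The 'U' flag was deprecated in Python 3 and removed in 3.11; plain text mode with the default newline=None does the same translation. The sample bytes and the helper name read_lines_universal are made up for illustration.

```python
import os
import tempfile


def read_lines_universal(path):
    """Read a text file with CRLF, CR and LF all translated to LF."""
    # newline=None (the default) switches on universal-newline translation.
    with open(path, "r", encoding="ascii", newline=None) as f:
        return f.read().split("\n")


# Hypothetical sample: one DOS line, one Unix line, one old-Mac line.
sample = b"dos\r\nunix\nmac\rlast"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)

lines = read_lines_universal(tmp.name)
os.remove(tmp.name)
print(lines)
```

Note that this hides the original terminators, so it suits reading the data, not diagnosing which terminators a file actually contains.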


Yep - but if you are on Microsoft systems you will usually need the \r.

Remove them and open the file in Notepad to see what I mean.
Wordpad handles the lack of \r OK. Handles larger files too.

Many moons ago the high order bit was used as a
parity bit, but few communication systems do that these days, so
anything with the high bit set is likely corruption.


OH?  How did one transfer binary files over the phone?

I used PIP or Kermit and it got there just fine, high bits and all. Mail and other so-called "text only" programs CAN (but do not necessarily) use 7-bit transfer protocols. Can we say MIME? FTP transfers the high bit just fine too.

Set protocols to 8,1 and none. (8bit, 1 stop, no parity)

As to how his 3rd party ASCII files are generated? He does not know, I do not know, we do not know (or care), so test before use. Filter out the high bits, remove all control characters except cr, lf and perhaps keep the ff too, then test what's left.


                ASCII
cr - carriage return       ^M    x0D   \r
lf - line feed             ^J    x0A   \n
ff - form feed (new page)  ^L    x0C   \f
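
The "test before use" advice can be sketched roughly as follows, working on the raw bytes of the file (read with open(path, "rb")). The function names and the returned dictionary layout are illustrative assumptions, not anything from the thread. The allowed control set is just cr, lf and ff from the table above; add 0x09 if tabs are legitimate in your files.

```python
ALLOWED_CONTROLS = frozenset({0x0A, 0x0C, 0x0D})  # lf, ff, cr


def strip_high_bits(data):
    """Mask every byte with 0x7F, discarding the high (eighth) bit."""
    return bytes(b & 0x7F for b in data)


def check_ascii(data, allowed_controls=ALLOWED_CONTROLS):
    """Report where a bytes object departs from plain 7-bit ASCII text."""
    high_bit = [i for i, b in enumerate(data) if b > 0x7F]
    bad_ctrl = [
        i for i, b in enumerate(data)
        if (b < 0x20 or b == 0x7F) and b not in allowed_controls
    ]
    return {
        "high_bit_offsets": high_bit,   # bytes outside 7-bit ASCII
        "control_offsets": bad_ctrl,    # unexpected control characters
        "has_cr": b"\r" in data,        # any carriage return at all
    }


def is_unix_ascii(data):
    """True if data is 7-bit ASCII with no stray controls and no CR."""
    report = check_ascii(data)
    return not (report["high_bit_offsets"]
                or report["control_offsets"]
                or report["has_cr"])


print(is_unix_ascii(b"hello\nworld\n"))
print(check_ascii(b"a\x80b\x00"))
```

Reporting offsets instead of silently masking lets you decide whether the file is worth repairing; strip_high_bits is there only for the case where you choose to filter first and test what's left.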

.... Intel uses one order and the SUN and the internet another. The BIG/Little ending confuses many. Intel reverses the order of multibyte numerics. Thus- Small machine has big ego or largest byte value last. Big Ending. Big machine has small ego. Little Ending. Some coders get the 0D0A backwards, some don't. You might want to test both.

(2^32)(2^24)(2^16)(2^8)  4 bytes correct math order  little ending
Intel stores them (2^8)(2^16)(2^24)(2^32)   big ending
SUN/Internet stores them in correct math order.
Python will use \r\n (0D0A) and \n\r (0A0D) correctly.


This is the most confused summary of byte sex I've ever read.
There is no such thing as "correct math order" (numbers are numbers).

"...numbers are numbers..." Nope! Numbers represented as characters may be in ASCII, but you should take a look at IBM mainframes. They use EBCDIC and the 'numbers' are different bit patterns. Has anyone taken the time to read the IEEE floating point specs? To an electronic calculating machine, internally everything is a bit. Bytes are a group of bits and the CPU structure determines what a given bit pattern is. The computer has no notion of number, character or program instruction. It only knows what it is told. Try this - set the next instruction (jump) to a data value and watch the machine try to execute it as a program instruction. (I assume you can program in assembly. If not - don't tell, because 'REAL programmers do assembly'. I think the last time I used it was 1980 or so. The program ran until the last of the hardware died and replacements could not be found. The client hired another to write for the new machines and closed shop shortly after. I think the owner was tired and found an excuse to retire. :)
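
A small illustration of the "everything is a bit pattern" point, assuming Python's standard struct module and its cp037 codec (one of several EBCDIC code pages). The four sample bytes are arbitrary.

```python
import struct

raw = bytes([0x41, 0x42, 0x43, 0x44])   # the same four bytes throughout

as_text = raw.decode("ascii")            # read as ASCII characters
as_be_int = struct.unpack(">I", raw)[0]  # read as a big-endian 32-bit integer
as_le_int = struct.unpack("<I", raw)[0]  # read as a little-endian 32-bit integer
as_float = struct.unpack(">f", raw)[0]   # read as an IEEE 754 single

# The characters '1', '2', '3' have different bit patterns in ASCII and EBCDIC.
ascii_digits = "123".encode("ascii")
ebcdic_digits = "123".encode("cp037")

print(as_text, hex(as_be_int), hex(as_le_int), as_float)
print(ascii_digits.hex(), ebcdic_digits.hex())
```

Nothing in the bytes says which reading is intended; the program doing the reading decides.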

The '\n\r' vs. '\r\n' has _nothing_ to do with little-endian vs.
big-endian.  By the way, there are great arguments for each order,
and no clear winner.

I don't care. Not the point. Point is some people get it fouled up and cause others problems. Test for both. You will save yourself a great deal of trouble in the long run.
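
One way to "test for both" is to count each terminator style in the raw bytes. This is only a sketch; the function name and the result keys are made up for illustration. Pairs are matched before lone characters so that \r\n is not also counted as a \r plus a \n. In a file that mixes styles, a run such as \n\r\n is genuinely ambiguous and is counted here as \n\r followed by \n.

```python
import re

# Alternation is tried left to right, so the two-byte pairs win over lone bytes.
_TERMINATOR = re.compile(rb"\r\n|\n\r|\r|\n")

_NAMES = {b"\r\n": "crlf", b"\n\r": "lfcr", b"\r": "cr", b"\n": "lf"}


def count_terminators(data):
    """Count CRLF, LFCR, lone CR and lone LF sequences in a bytes object."""
    counts = {name: 0 for name in _NAMES.values()}
    for match in _TERMINATOR.finditer(data):
        counts[_NAMES[match.group()]] += 1
    return counts


print(count_terminators(b"a\r\nb\nc\n\rd\re"))
```

A clean *nix file should show only lf; anything else says which flavour of trouble you have.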

Network order was defined for sending numbers
across a wire, the idea was that you'd unpack them to native order
as you pulled the data off the wire.
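
The pack-to-network, unpack-to-native idea looks roughly like this with Python's struct module, where "!" means network (big-endian) order and "=" means the native order of the machine running the code. The sample value is arbitrary.

```python
import struct
import sys

value = 0x0A0B0C0D

network = struct.pack("!I", value)  # network order: most significant byte first
little = struct.pack("<I", value)   # little-endian: least significant byte first
native = struct.pack("=I", value)   # whichever order this machine uses

# Receiving side: unpack from network order back into an ordinary integer.
received = struct.unpack("!I", network)[0]

print(sys.byteorder, network.hex(), little.hex(), native.hex())
print(received == value)
```

The integer is the same on both ends; only the byte layout on the wire is fixed by convention.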


"... sending BINARY FORMATTED numbers..." (versus character - type'able)

Network order was defined to reduce machine time. Since the servers that worked day in and day out were SUN, SUN order won. I haven't used EBCDIC in so long I really don't remember for sure, but it seems to me they used SUN order before SUN was around. Same for the VAX, I think.


The '\n\r' vs. '\r\n' differences harken back to the days when they were
format effectors (carriage return moved the carriage to the extreme
left, line feed advanced the paper).  You needed both to properly
position the print head.

Yep. There wasn't enough intelligence in the old printers to 'cook' the stream.

ASCII uses the pair, and defined the effect
of each.


Actually the Teletype people defined most of the \x00 - \x1f concepts.

If I remember the trivia correctly - original teletype was 6 bit bytes. Bit pattern was neither ASCII nor EBCDIC. Both of those adopted the teletype control-character concept.

As ASCII was being worked out, MIT even defined a "line
starve" character to move up one line just as line feed went down one.
The order of the format effectors most used was '\r\n' because the
carriage return involved the most physical motion on many devices, and
the vertical motion time of the line feed could happen while the
carriage was moving.

True. My experiment with reversing the two instructions would sometimes cause the printer to malfunction. One of my first 'black boxes' (filters) included instructions to see and correct the "wrong" pattern. Then I had to modify it to allow pure binary to get 'pictures' on the dot matrix types.

After that, you often added padding bytes (typically ASCII NUL ('\x00') or DEL ('\x7F')) to allow the hardware
time to finish before you did the spacing and printing.


If I remember correctly:
ASCII NULL   x00      In my opinion, NULL should be none set :)
IBM NULL     x80      IBM card  80 Cols
Sperry-Rand  x90      S/R Card  90 Cols

Trivia question:
Why is a byte 8 bits?

Ans: people have 10 fingers and the hardware to handle morse code (single wire - serial transfers) needed timers. 1-start, 8 data, 1-stop makes it a count by ten. Burroughs had 10 bits but counting by 12s just didn't come 'naturally'.

That was the best answer I've heard to date. In reality - who knows?

'...padding...'

I never did. Never had to. Printers I used had enough buffer to void that practice. Thirty-two character buffer seemed to be enough to disallow overflow. Of course we were using 300 to 1200 BAUD and DTR (pin 19 in most cases) -OR- the RTS and CTS pair of wires to control flow, since ^S/^Q could be a valid dot matrix byte(s). Same for hardwired PIP or Kermit transfers.

--Scott David Daniels
scott.dani...@acm.org


--
http://mail.python.org/mailman/listinfo/python-list
