Scott David Daniels wrote:
norseman wrote:
Scott David Daniels wrote:
Dave Angel wrote:
Jorge wrote: ...
I'm making an application that reads third-party generated ASCII files, but sometimes the files are totally or partially corrupted, and I need to know if it's an ASCII file with *nix line terminators.
On Linux I can run the file command, but the application should run on
Windows.
You are looking for a \x0D (the Carriage Return) \x0A (the Line Feed) combination. If present, you have Microsoft compatibility. If not, you don't. If you think high bits might be part of the corruption, filter each byte with byte & 0x7F (byte ANDed with hex 7F, or 127 base 10), then check for the \x0D \x0A combination.
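
The check described above can be sketched roughly as follows, assuming Python 3 and a file read in binary mode. The function names are made up for illustration and are not from any poster in this thread.

```python
def strip_high_bit(data: bytes) -> bytes:
    """Clear bit 7 of every byte, i.e. byte & 0x7F."""
    return bytes(b & 0x7F for b in data)


def has_crlf(data: bytes) -> bool:
    """True if the CR LF pair (0x0D 0x0A) occurs anywhere in the data."""
    return b"\r\n" in data


if __name__ == "__main__":
    # 0x8D and 0x8A are CR and LF with the high bit set.
    damaged = b"line one\x8d\x8aline two"
    print(has_crlf(damaged))                  # pair is hidden by the high bits
    print(has_crlf(strip_high_bit(damaged)))  # pair is visible after masking
```

Note that masking hides the evidence of corruption, so it may be worth recording whether any byte had the high bit set before filtering.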

Well, ASCII defines \x0D as the carriage return code, and \x0A as line feed.
It is unix that is wrong, not Microsoft (don't get me wrong, I know
Microsoft has often redefined what it likes invalidly).  If you
open the file with 'U', Python will return lines w/o the \r character
whether or not they started with it, equally well on both unix and
Microsoft systems.
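
For readers on Python 3, a hedged sketch of the same idea: the 'U' mode flag belongs to Python 2 (it was removed in Python 3.11), and text mode with newline=None, which is the default, gives the universal-newline behaviour instead. The helper name below is hypothetical.

```python
import os
import tempfile


def read_lines_universal(path):
    """Read a text file, accepting \\n, \\r\\n and \\r as line endings.

    encoding="ascii" makes any byte above 0x7F raise UnicodeDecodeError,
    which doubles as a crude corruption check.
    """
    with open(path, "r", encoding="ascii", newline=None) as f:
        return [line.rstrip("\n") for line in f]


if __name__ == "__main__":
    # Write a small file that mixes all three terminator styles.
    fd, path = tempfile.mkstemp()
    os.write(fd, b"dos\r\nunix\nold mac\r")
    os.close(fd)
    print(read_lines_universal(path))
    os.remove(path)
```

Because the translation happens on read, this approach cannot tell you which terminator the file actually used; for that the file has to be opened in binary mode.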

Yep - but if you are on Microsoft systems you will usually need the \r.

Remove them and open the file in Notepad to see what I mean.
Wordpad handles the lack of \r OK. Handles larger files too.

Many moons ago the high order bit was used as a
parity bit, but few communication systems do that these days, so
anything with the high bit set is likely corruption.
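
If high-bit bytes are treated as a sign of corruption, one way to locate them is sketched below, assuming Python 3. The function name and the limit parameter are illustrative choices only.

```python
def high_bit_offsets(data: bytes, limit: int = 10) -> list:
    """Return the offsets of the first `limit` bytes that have bit 7 set."""
    offsets = []
    for i, b in enumerate(data):
        if b & 0x80:
            offsets.append(i)
            if len(offsets) >= limit:
                break
    return offsets


if __name__ == "__main__":
    print(high_bit_offsets(b"ok\x80ok\xff"))  # offsets 2 and 5
    print(high_bit_offsets(b"plain text\n"))  # empty list
```

When only a yes/no answer is needed, Python 3.7 and later also provide bytes.isascii().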


OH?  How did one transfer binary files over the phone?
I used PIP or Kermit and it got there just fine, high bits and all. Mail and other so called "text only" programs CAN (but not necessarily do) use 7bit transfer protocols. Can we say MIME? FTP transfers high bit just fine too.
Set protocols to 8,1 and none. (8bit, 1 stop, no parity)
As to how his 3rd party ASCII files are generated? He does not know, I do not know, we do not know (or care), so test before use. Filter out the high bits, remove all control characters except cr,lf and perhaps keep the ff too, then test what's left.
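
A minimal sketch of the filter described above, assuming Python 3. The names are hypothetical, and the choice to keep FF is the "perhaps" from the text. Following the description literally means TAB (0x09) is dropped as well, which may or may not be wanted.

```python
# Control characters to keep: LF, FF, CR.
KEEP_CONTROLS = {0x0A, 0x0C, 0x0D}


def clean_ascii(data: bytes) -> bytes:
    """Mask the high bit, then drop every control character except CR, LF, FF."""
    out = bytearray()
    for b in data:
        b &= 0x7F  # filter out the high bit
        if b in KEEP_CONTROLS or 0x20 <= b < 0x7F:
            out.append(b)  # printable ASCII or an allowed control
    return bytes(out)


if __name__ == "__main__":
    raw = b"ab\x00c\x8d\x8a\x07\x0c\t\x7f"
    print(clean_ascii(raw))
```

What is left can then be tested for line terminators.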

                ASCII
cr - carriage return       ^M    x0D   \r
lf - line feed             ^J    x0A   \n
ff - form feed (new page)  ^L    x0C   \f


.... Intel uses one order and the SUN and the internet another. The
BIG/Little ending confuses many. Intel reverses the order of multibyte
numerics. Thus- Small machine has big ego or largest byte value last.
Big Ending. Big machine has small ego.
Little Ending. Some coders get the 0D0A backwards, some don't. You might want to test both.
(2^32)(2^24)(2^16)(2^8)  4 bytes correct math order  little ending
Intel stores them (2^8)(2^16)(2^24)(2^32)   big ending
SUN/Internet stores them in correct math order.
Python will use \r\n (0D0A) and \n\r (0A0D) correctly.

This is the most confused summary of byte sex I've ever read.
There is no such thing as "correct math order" (numbers are numbers).

"...number are numbers..." Nope! Numbers represented as characters may be in ASCII but you should take a look at IBM mainframes. They use EBCDIC and the 'numbers' are different bit patterns. Has anyone taken the time to read the IEEE floating point specs? To an electronic calculating machine, internally everything is a bit. Bytes are a group of bits and the CPU structure determines what a given bit pattern is. The computer has no notion of number, character or program instruction. It only knows what it is told. Try this - set the next instruction (jump) to a data value and watch the machine try to execute it as a program instruction. (I assume you can program in assembly. If not - don't tell because 'REAL programmers do assembly'. I think the last time I used it was 1980 or so. The program ran until the last of the hardware died and replacements could not be found. The client hired another to write for the new machines and closed shop shortly after. I think the owner was tired and found an excuse to retire. :)


The '\n\r' vs. '\r\n' has _nothing_ to do with little-endian vs.
big-endian.  By the way, there are great arguments for each order,
and no clear winner.

I don't care. Not the point. Point is some people get it fouled up and cause others problems. Test for both. You will save yourself a great deal of trouble in the long run.
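
Testing for both orders can be sketched as below, assuming Python 3 and binary input. The function names are illustrative. Note that a run such as \n\r\n is ambiguous, and this sketch simply pairs bytes from left to right.

```python
import re
from collections import Counter

# Try the two-byte pairs first, then fall back to a lone CR or LF.
_TERMINATOR = re.compile(rb"\r\n|\n\r|\r|\n")


def count_terminators(data: bytes) -> Counter:
    """Count each line-terminator style found in the data."""
    return Counter(m.group() for m in _TERMINATOR.finditer(data))


def is_unix_text(data: bytes) -> bool:
    """True if every terminator in the data is a bare LF."""
    return set(count_terminators(data)) <= {b"\n"}


if __name__ == "__main__":
    print(count_terminators(b"a\r\nb\n\rc\n"))
    print(is_unix_text(b"a\nb\n"))
```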

Network order was defined for sending numbers
across a wire, the idea was that you'd unpack them to native order
as you pulled the data off the wire.
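
As an illustration of packing to network order and unpacking on receipt, here is a small sketch using Python's standard struct module (Python 3 assumed). The value and helper names are arbitrary. It also shows that a CR LF pair is a sequence of single bytes, so byte order does not apply to it.

```python
import struct


def to_network(value: int) -> bytes:
    """Pack an unsigned 32-bit integer in network (big-endian) order."""
    return struct.pack("!I", value)


def from_network(data: bytes) -> int:
    """Unpack 4 bytes received in network order into an int."""
    return struct.unpack("!I", data)[0]


if __name__ == "__main__":
    value = 0x0A0B0C0D
    print(struct.pack(">I", value).hex())  # 0a0b0c0d (big-endian)
    print(struct.pack("<I", value).hex())  # 0d0c0b0a (little-endian)
    print(to_network(value).hex())         # 0a0b0c0d (same as big-endian)
    print(from_network(to_network(value)) == value)
    # Text is a stream of single bytes: CR LF stays 0d 0a on any machine.
    print(b"\r\n".hex())                   # 0d0a
```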

"... sending BINARY FORMATTED numbers..." (versus character - type'able)

Network order was defined to reduce machine time. Since the servers that worked day in and day out were SUN, SUN order won. I haven't used EBCDIC in so long I really don't remember for sure but it seems to me they used SUN order before SUN was around. Same for the VAX, I think.


The '\n\r' vs. '\r\n' differences harken back to the days when they were
format effectors (carriage return moved the carriage to the extreme
left, line feed advanced the paper).  You needed both to properly
position the print head.

Yep. There wasn't enough intelligence in the old printers to 'cook' the stream.

ASCII uses the pair, and defined the effect
of each.

Actually the Teletype people defined most of the \x00 - \x1f concepts.
If I remember the trivia correctly - original teletype was 6 bit bytes. Bit pattern was neither ASCII nor EBCDIC. Both of those adopted the teletype control-character concept.

As ASCII was being worked out, MIT even defined a "line
starve" character to move up one line just as line feed went down one.
The order of the format effectors most used was '\r\n' because the
carriage return involved the most physical motion on many devices, and
the vertical motion time of the line feed could happen while the
carriage was moving.

True. My experiment with reversing the two instructions would sometimes cause the printer to malfunction. One of my first 'black boxes' (filters) included instructions to see and correct the "wrong" pattern. Then I had to modify it to allow pure binary to get 'pictures' on the dot matrix types.

After that, you often added padding bytes (typically ASCII NUL ('\x00') or DEL ('\x7F')) to allow the hardware
time to finish before you did the spacing and printing.


If I remember correctly:
ASCII NULL   x00      In my opinion, NULL should be none set :)
IBM NULL     x80      IBM card  80 Cols
Sperry-Rand  x90      S/R Card  90 Cols

Trivia question:
Why is a byte 8 bits?

Ans: people have 10 fingers and the hardware to handle Morse code (single wire - serial transfers) needed timers. 1-start, 8 data, 1-stop makes it a count by ten. Burroughs had 10 bits but counting by 12s just didn't come 'naturally'.
That was the best answer I've heard to date. In reality - who knows?

'...padding...'
I never did. Never had to. Printers I used had enough buffer to avoid that practice. Thirty-two character buffer seemed to be enough to disallow overflow. Of course we were using 300 to 1200 BAUD and DTR (pin 19 in most cases) -OR- the RTS and CTS pair of wires to control flow since ^S/^Q could be a valid dot matrix byte(s). Same for hardwired PIP or Kermit transfers.


--Scott David Daniels
scott.dani...@acm.org


