On 11/27/2018 03:05 AM, Guy Dunphy wrote:
It was a core part of the underlying philosophy that html would NOT allow any kind of fixed formatting. The reasoning was that it could be displayed on any kind of system, so it had to be free-format and quite abstract.

That's one of the reasons that I like HTML as much as I do.

Which is great, until you actually want to represent a real printed page, or book. Like Postscript can. Thus html was doomed to be inadequate for capture of printed works.

I feel like trying to accurately represent fixed page layout in HTML is a questionable idea. I would think that it would be better to use a different type of file.

That was a disaster. There wasn't any real reason it could not be both. Just an academic's insistence on enforcing his ideology. Then of course, over time html has morphed to include SOME forms of absolute layout, because there was a real demand for that. But the result is a hodge-podge.

I don't think that HTML can reproduce fixed page layout like PostScript and PDF can. It can make a close approximation. But I don't think HTML can get there. Nor do I think it should.

Yes, it should be capable of that. But not enforce 'only that way'.

I question if people are choosing to use HTML to store documentation because it's so popular and then getting upset when they want to do things that HTML is not meant to do. Or, in some cases, things HTML is actually meant /not/ to do.

Use the tool for the job. Don't alter the wrong tool for your particular job.

IMHO true page layout doesn't belong in HTML. Loosely laying out the same content in approximately the same layout is okay.

By 'html' I mean the kludge of html-css-js. The three-cat herd. (Ignoring all the _other_ web cats.) Now it's way too late to fix it properly with patches.

I don't agree with that. HTML (and XML) has markup that can be used, and changed, to define how the HTML is meant to be interpreted.

The fact that people don't do so correctly is mostly independent of the fact that it has the ability. I say mostly because there is some small amount of wiggle room for discussing whether the functionality actually works or not.

I meant there's no point trying to determine why they were so deluded, and failed to recognise that maybe some users (Ed) would want to just type two spaces.

I /do/ believe that there /is/ a point in trying to understand why someone did what they did.

now 'we' (the world) are stuck with it for legacy compatibility reasons.

Our need to be able to read it does not translate to our need to continue to use it.

Any extensions have to be retro-compatible.

I disagree.

I see zero reason why we couldn't come up with something new and completely different.

Granted, there should be ways to translate from one to the other. Much like how ASCII and EBCDIC are still in use today.

What I'm talking about is not that. It's about how to create a coding scheme that serves ALL the needs we are now aware of. (Just one of which is for old ASCII files to still make sense.) This involves both re-definition of some of the ASCII control codes, AND defining sequential structure standards. E.g. UTF-8 is a sequential structure. So are all the html and css codings, all programming languages, etc. There's a continuum of encoding...structure...syntax. The ASCII standard didn't really consider that continuum.

I don't think that ASCII was even trying to answer / solve the problems that you're talking about.

ASCII was a solution for a different problem for a different time.

There is no reason we can't move on to something else.

Which exceptions would those be? (That weren't built on top of ASCII!)

It is subject to the meaning of "back to the roots" and not worth taking more time.

I assume you're thinking that ASCII serves just fine for program source code?

I'm not personally aware of any cases where ASCII limits programming languages. But my ignorance does not preclude that situation from existing.

I do believe that there are a number of niche programming languages (if you will) that store things as binary data (I'm thinking PLCs and the likes) but occasionally have said data represented (as a hexadecimal dump) in ASCII. But the fact that ASCII can or can't easily display the data is immaterial to the system being programmed.

I have long wondered if there are computer languages that aren't rooted in English / ASCII. I feel like it's rather pompous to assume that all programming languages are rooted in English / ASCII. I would hope that there are programming languages that are more specific to the region of the world they were developed in. As such, I would expect that they would be stored in something other than ASCII.

Could the sequence of bytes be displayed as ASCII? Sure. Would it make much sense? Not likely.

This is a bandwagon/normalcy bias effect. "Everyone does it that way and always has, so it must be good."

Nope, not for me.

It may be the case for some people. But I actively try to avoid such biases. Or if I do use them, I acknowledge that they are biases so that others can overcome them.

Sigh. Well, I can't go into that without revealing more than I wish to atm.

Fair.

I will say that I don't think there's any reason why English based programming languages can't be encoded in Morse code, either American or International. Sure, it would be a variable bit length word, but it would work. Nothing mandates that ASCII is used. ASCII is used by convention. But nothing states that that convention can't be changed. Hence why some embedded applications use something else that's more convenient for them.
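
Something like this rough Python sketch, maybe. (The table is truncated to a few letters; it's only meant to illustrate variable-length code words, not be a real encoder.)

    # Tiny excerpt of International Morse, enough for the example.
    MORSE = {"A": ".-", "E": ".", "O": "---", "S": "..."}

    def to_morse(text):
        # Variable-length code words; the space separator is what keeps
        # them decodable, since Morse is not prefix-free ("E" is "." and
        # "S" is "...").
        return " ".join(MORSE[c] for c in text.upper())

    print(to_morse("sos"))   # ... --- ...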

You're making my point for me. Of course there are many ways to interpret existing codes to achieve this effect. Some use control codes, others overload functionality on printable characters. eg html with < and >.

I disagree.

My point is the base coding scheme doesn't allocate a SPECIFIC mechanism for doing this.

I think there are ASCII codes that could, or should, have been used for that.

The result is a briar-patch of competing ad-hoc methods. Hence the 'babel' I'm referring to, in every matter where ASCII didn't define needed functionality.

I don't believe that the fact that people didn't use it for one reason or another is ASCII's fault.

Exactly. Because ASCII does not provide a specific coding. It didn't occur to those drafting the standard. Same as with all the other...

I believe that ASCII did provide control codes that could have been used.

I also question how much of the fact that the control codes weren't used was / is related to the fact that most people don't have keys on their keyboard for them. Thus many people chose to use different keys ~> bytes to perform the needed function.

And so every different devel project that needed it, added some kludge on top. This is what I'm saying: ASCII has no facility for this, but we need a basic coding scheme that does (and is still ASCII-compatible.)

How would you encode the same string of characters, "Lincoln", used in both the name of the person speaking, Abraham Lincoln, and the phrase they said, "I drive a Lincoln Town Car". Would you have completely different ways of encoding "Lincoln" in each of the contexts? Or would you have a way to indicate which context is applied to the sequence of seven characters / words of memory?

If it's the latter, you must have some way to switch between the two contexts. This could be one of the ASCII control codes, or it could be an overload of one (or sequence of) characters.
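
To make the first option concrete, here's a rough Python sketch using ASCII's Shift Out / Shift In control codes (0x0E / 0x0F) to delimit the "speaker" context. To be clear: ASCII defines those codes, but this particular use of them is something I made up for the example.

    SO, SI = b"\x0e", b"\x0f"   # ASCII Shift Out / Shift In

    def encode(speaker, phrase):
        # The same seven bytes spell "Lincoln" in both contexts; the
        # control codes, not the characters, carry the distinction.
        return SO + speaker.encode("ascii") + SI + phrase.encode("ascii")

    record = encode("Abraham Lincoln", "I drive a Lincoln Town Car")
    assert record.count(b"Lincoln") == 2   # identical bytes, two contexts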

I believe that ASCII is a standard (one of many) that defines how to represent characters / control codes in a binary pattern. I also believe that any semantics or meanings beyond the context of a single character / control code is outside of the scope of ASCII or comparable standards.

I believe such semantic meaning sits on top of the underlying character set. Hence file format.

Doesn't matter. The English alphabet (or that of any other human language) naturally does not have protocols to concisely represent data types.

So if I apply (what I understand to be) your logic, I can argue that the English language is (similarly) flawed.

That's no reason to not build such things into the character coding scheme used in computational machinery.

I agree there is need for such data. I do not agree that it belongs in the character encoding. I believe that it belongs in file formats that sit on top of character encoding.

Does a file format need additional characters that are outside of the typical language so that the file format can contain typical characters without overloading? Sure. That's where the control codes come in.

The project consists of several parts. One is to define an extension of ASCII (with a different name, that I'm not going to mention for fear of pre-emptive copyright bullshit.) Other parts relate to other areas in comp-sci, in the same manner of 'see what happens if one starts from scratch.'

Why do you need to define them as an extension of ASCII? Rather, why not define it completely anew and slough off the old? I don't see any reason why something brand new can't be completely its own. The only requirement I see is a way to convert between the old and the new.

So, you're saying a text encoding scheme should not have any way to represent such things? Why not?

I believe that the letter "A", be it bold and / or italic and / or underlined, is still the letter "A". The formatting and display of it does not change the semantic meaning of the letter "A". As such, I don't think that the different forms of the letter "A" should be encoded as different letters.

I do think that file formats should have a way to encode the different formats of the same semantic letter, "A". If a control code is needed to indicate a formatting change, so be it. ASCII has some codes that can be used for this. Or other character sets can have more codes to do the same thing.
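
ANSI terminal escape sequences are an existing example of exactly that layering: ASCII's ESC control code (0x1B) plus a convention built on top of it. A quick Python demonstration (works in any ANSI-capable terminal):

    ESC = "\x1b"   # ASCII Escape control code

    def sgr(code, text):
        # ANSI "Select Graphic Rendition": ESC [ <code> m ... ESC [ 0 m
        return f"{ESC}[{code}m{text}{ESC}[0m"

    # Bold, italic, underline: the byte for "A" (0x41) never changes;
    # only the surrounding control sequences do.
    print(sgr(1, "A"), sgr(3, "A"), sgr(4, "A"))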

The ASCII printable character set does not have adornments, BECAUSE it is purely a representation of the alphabet and other symbols. That's one of its failings, since all 'extras' have to be implemented by ad-hoc improvisations.

I think the fact that all four forms of "A" are same ASCII byte is a good thing.

It's both good and bad that programmers are free to implement their ideas how they see fit. Requiring all programmers to use the same thing to represent Italic underlined text would be limiting.

I'm pretty sure you've missed the whole point. The ASCII definition 'avoided responsibility' thus making itself inadequate. Html, postscript, and other typographic conventions layer that stuff on top, messily and often in proprietary ways.

We can agree to disagree.

Then you never tried to represent a series of printed pages in html. Can be sort-of done but is a pain.

I would not choose to accurately represent printed pages in HTML. That's not what HTML is meant for.

I would choose a page layout language to represent printed pages.

ASCII doesn't understand 'lines' either. It understands physical print heads. Hence 'carriage return' and 'line feed'. Resulting in the CR/CR-LF/LF wars for text files where a 'new line' was needed.

I don't consider that to be a war. I consider it to be three different file formats (each with its own way of encoding a new line). And LOTS of ignorance about the fact.

It is trivial to convert between the formats. Trivial enough that many editors automatically detect the format that's being used and behave appropriately for the detected format.
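
Trivial as in a couple of string replacements. A Python sketch that normalizes all three conventions to LF:

    def normalize_newlines(data):
        # CRLF (DOS), lone CR (classic Mac), lone LF (Unix) -> LF.
        # The two-byte form has to be handled first.
        return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")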

Even in format-flowed text there is a typographic need for 'new line'. It means 'no matter where the current line ends, drop down one line and start at the left.'
Like I'm typing here.

I'll agree that there is a new line in the underlying text that makes up the format=flowed line.

But I believe that format=flowed text is a collection of lines (paragraphs) that are stored using format=flowed encoding. Each line (paragraph) is independent of the others. As such, the "new line" that you speak of is outside the scope of format=flowed text. Or rather, the "new line" that you speak of means the end of one format=flowed line (paragraph) and the start of another (assuming it also uses format=flowed).
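
In other words, something like this minimal Python sketch of RFC 3676's core rule (ignoring space-stuffing, quoting, and signature lines): a trailing space means the line flows into the next; its absence ends the paragraph.

    def unflow(lines):
        # RFC 3676: a line ending in a space "flows" into the next one.
        paragraphs, current = [], ""
        for line in lines:
            if line.endswith(" "):
                current += line
            else:
                paragraphs.append(current + line)
                current = ""
        if current:
            paragraphs.append(current)
        return paragraphs

    unflow(["This is one ", "flowed paragraph.", "And a second one."])
    # -> ["This is one flowed paragraph.", "And a second one."]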

A paragraph otoh is like that, but with extra vertical space separating from above. Because ASCII does not have these _absolutely_fundamental_ codes, is why html has to have <br> and <p>.

I suspect that even if ASCII did have a specific-purpose code, either people wouldn't use it and / or HTML would ignore it too, with its white space compaction philosophy.

Not to get into the whole </p> argument.

I'm not going there.  I don't think it's germane to this discussion.

Note that including facility for real newline and paragraph symbols in the basic coding scheme, doesn't _force_ the text to be hard formatted. That's a display mode option.

Much like HTML's philosophy to compact white space?

Sigh. Like two spaces in succession being interpreted to do something special?

I'm not aware of any special meaning for two spaces in succession. I am aware of shenanigans that different applications do to preserve the multiple spaces. Ergo this conversation.

I'm also aware that the two spaces after punctuation is a relatively modern thing, introduced by (I believe) typewriters as a mechanism to avoid a mechanical problem. A convention that persisted into computers. A convention that some modern devices actively thwart. E.g. the iPad / iPhone turning two spaces into a period and a space, ready for a new sentence.

You know in type layout there are typically special things that happen for paragraphs but not for newlines?

Nope.  I've done very little with typography / layout.

You don't see any problem with overloading a pair of codes of one type, to mean something else?

It depends on what the file format is. If the file format expects everything to be a discrete (8-bit) word / byte, then yes, there can be complications ~> problems with requiring use of two. If the file format does not make such expectations, and instead relies on chords of (8-bit) words / bytes, then I don't see any overt problem. The biggest issue will be ensuring that chords are interpreted properly.


Factors to consider:

- Ergonomics of typing. It _should_ be possible to directly type reasonably typographically formatted text, with minimal keystrokes. One can type html, but it's far from optimal. There are many other conventions. None arising from ASCII, because it lacks _everything_ necessary.

I don't believe that the way that text is typed must be directly related to how it's stored and / or presented.

I'm curious what you've had difficulty with over the years related to new lines, paragraphs, page breaks, formatting, etc.

- Efficiency of the file/stream encoding. Allowing for infinitely extensible character sets, embedded specifications of glyph appearances (fonts), layout, and dynamic elements.

Yes. This is part of a file format and not part of the character encoding. Files have been storing binary data that is completely independent of the encoding for years. Granted, transferring said files between systems can run into complications.

- Efficiency and complexity of code to deal with constructing, processing and displaying texts.

I've LONG been a fan of paying it forward to make things easier to use in the long run. I'd much rather have more complex code to make my day to day life easier and more convenient.

Sure. Now you think of trying to construct a digital representation of a printed work with historical significance. So it MUST NOT dynamically reformat. Otoh it might be a total simulation of a physical object/book, page turn physics and all.

I would never try to digitally reproduce such a work in HTML. I would copy contents into HTML so that it is easily accessible via HTML. But I would never expect that representation to accurately reproduce the original work.

Ha ha... consider: how does the Tab function work on typewriters? What does pressing the Tab key actually do?

Based on memory, the tab key advances to the next tab stop that the typewriter / operator has configured.

Note: The tab key itself has no knowledge of where said tab stop is located.

ASCII has a Tab code, yes. It does NOT have other things required for actual use of tabular columns.

The typewriter's tab key "does NOT have other things required for actual use of tabular columns" either. Other parts of the typewriter do.

Similarly, the text editor has "other things required for actual use of tabular columns".

So, the Tab functionality is completely broken in ASCII. That was actually a really bad error on their part. They didn't need foresight, they just goofed.

I disagree.  See above for why.

Typewriters have had a working Tab function since 1897.

I've been able to actually use tabular columns for years, if not decades, on computers without any problem.

Specifically, ASCII does not provide any explicit means to set and clear an array of tabular positions (whether absolute or proportional.)

I disagree.

Again, ASCII is a way to encode characters and is independent of how those characters are used.

I could easily use Device Control 1 to tell a device that I'm sending it information that it should use to program / configure a tab stop distance.

Quite similar to how the tab key on the typewriter does not tell the typewriter how far to advance. Instead I have to use other keys / buttons / knobs on the typewriter to define where the tab stop should be. The tab simply advances to the next tab stop.
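
For example (a rough sketch; ASCII supplies the DC1 code, 0x11, but the framing around it here is entirely my own invention for illustration):

    DC1 = b"\x11"   # ASCII Device Control 1

    def set_tab_stop(column):
        # Hypothetical framing: DC1 + decimal column + ";" terminator.
        # ASCII defines DC1 itself; this usage is a made-up convention.
        return DC1 + str(column).encode("ascii") + b";"

    # Program stops at columns 8 and 32, then send tabbed text.
    stream = set_tab_stop(8) + set_tab_stop(32) + b"bob\ted\n"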

Hence html has to implement tables, grid systems, etc. But it SHOULD be possible to type columnar text (with tabs) exactly and as ergonomically as one would on a typewriter.

First, HTML's white space compaction will disagree.  (For better or worse).

Second, tabs are 8 characters by convention. A convention that can very easily be redefined.

As such, it's somewhere between impractical and impossible to rely on the following content to appear the same on a computer or typewriter without specifying what the tab stop is. A tab stop of 32 will align things. A tab stop of 8 will not.

bob<tab>ed
abcdefghijklmnopqrstuvwxyz<tab>0123456789

IMHO this is a flaw with the concept of tab and not the aforementioned implementations.
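
To make the stop-width dependence concrete (a quick Python sketch; fixed stops every N columns, which is what most terminals and editors assume):

    def expand_tabs(line, stop):
        out, col = [], 0
        for ch in line:
            if ch == "\t":
                pad = stop - (col % stop)   # distance to the next stop
                out.append(" " * pad)
                col += pad
            else:
                out.append(ch)
                col += 1
        return "".join(out)

    # With stops every 8 columns the fields misalign; every 32, they align.
    for line in ["bob\ted", "abcdefghijklmnopqrstuvwxyz\t0123456789"]:
        print(expand_tabs(line, 32))

(Python's built-in str.expandtabs(stop) implements the same rule.)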

Why would I be talking of the binary code of the tab character?

Your comment was about what is done when the character is encountered, whereas the overarching discussion is about ASCII, which is a standard for how to encode characters, not what is done when a character is encountered.

Sigh. You'll have to wait.

Fair enough.

ASCII is not solely a 'character encoding scheme', since it also has the control codes. But those implement far less functionality than we need.

Sorry, when I said "character encoding scheme", I meant characters and control codes. Thus, asking too much of the {character,control code} encoding scheme.

Now tell me why you think the fundamental coding standard should not be the same as used in file formats. You're used to those being different things (since ASCII is missing so much), but it doesn't have to be so.

I think of ASCII as being a way to encode characters / control codes. Conversely I think of file formats as being a way to define how the encoded characters / control codes are used in concert with each other.

The file format builds on top of the character / control code encoding.

There you go again, assuming 'styling' has no place in the base coding scheme.

Correct. I believe that styling is part of a file format, not the underlying character / control code encoding.

You keep assuming that a basic coding scheme should contain nothing but the common printable characters. Despite ASCII already containing more than that.

No, I do not. I can understand why you may think that. Please allow me to clarify.

I believe that a basic coding scheme contains printable characters and control codes.

Sorry for omitting the "control codes", which are part of the defined ASCII standard.

Also tell me why there should not be a printable character specifically meaning "Start of comment" (and variants, line or block comments, terminators, etc.)

I don't perceive a need for a control code that means "start of comment". Much less the aforementioned variants.

I do perceive a need for a file format that uses characters and / or control codes to represent that.

You are just used to doing it a traditional way, and not wondering if there might be better ways.

Nope.  That's a false statement.

I have long pontificated about a file format that would make it easy to structure text (what I typically deal with) such that it's easy to reference (include) part of it in other files. I've stared at DocBook and a few other file formats that do include meta-data about the text such that things can be referenced. All of the file formats that I looked at re-used ASCII. But nothing stopped them from using their own binary encoding. Much the way that I believe Microsoft Word documents do.

Suffice it to say that I'm not just using the traditional methods. I'm actively looking for and evaluating alternatives to see if they will work better for me than what I'm currently doing.

You think that, because all your life you've been typing /* comment */ or whatever.

No I have not. I've been using different comment characters / chords of characters for years.

In truth, the ASCII committee just forgot.

I disagree.

Oh well.

I am willing to entertain discussions of the need of additional control characters. But I expect such discussions to include why the file format can't re-use a different control character and why it's necessary to define another one. (Think Word document's binary format.)

You're going to need to wait a few years, till you see the end product.

Okay.

That bit of text I quoted is a very, very brief points list. Detailed discussion of all this stuff is elsewhere, and I _can't_ post it now, since that would seriously damage the project's practical potential. (Economic reasons.)

Fair enough.

Column multiplier significance. That's a different thing from the nature of '0' as a symbol. At present there is no symbol meaning 'this is not information.'

Why would there be a "this is not information" in a field that is specifically meant to hold information?

I can see a use for "this information is unavailable". But null is typically used for that (and other things).

Never mind, it's difficult to grasp without a discussion of the implications for very large mass storage device structure. And I'm not going there now.

Okay.

That sounds like it borders on file systems and / or database structures. Which I consider to be a higher layer than file format, used to differentiate different files / records.

It wasn't then, but the lack of it is our problem now.

I disagree.

I don't think that this is a character / control code encoding problem.

I think this is a file format problem.

UTF-8 is multi-byte binary, of a specific type. Just ONE type. No extensibility.

I find the lack of extensibility questionable. Granted I don't know much about UTF<anything>. But I do think that I routinely see new characters / glyphs added. So either it's got some modicum of extensibility, or people are simply defining previously undefined / unused code points.
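
From what I can tell it's the latter, and that's by design: Unicode assigns new code points, and UTF-8's variable-width scheme already covers them. A quick Python illustration:

    # One fixed scheme yields 1-, 2-, 3- and 4-byte sequences; newly
    # assigned code points need no change to the encoding itself.
    for ch in ("A", "é", "€", "🙂"):
        print(f"U+{ord(ch):04X} -> {ch.encode('utf-8').hex(' ')}")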

??? Are you deliberately being obtuse?

No.

I'm saying that we have multiple ways to encode binary data (pictures, sound, programs, you name it) such that it can safely be represented in printable ASCII characters:

 · Base 16
 · Base 32
 · Base 64
 · UUEncode
 · Quoted-Printable

I'm sure there are more, but that's just what comes to mind at the moment.
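
E.g. a lossless round trip through one of them, using Python's standard base64 module:

    import base64

    blob = bytes(range(256))                        # arbitrary binary data
    text = base64.b64encode(blob).decode("ascii")   # printable ASCII only
    assert base64.b64decode(text) == blob           # nothing lost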

MIME structures allow us to include multiple different types of content in the same printable ASCII file.
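
Which is easy enough to demonstrate with Python's standard email library (a minimal sketch):

    from email.message import EmailMessage

    msg = EmailMessage()
    msg.set_content("A plain text part.")
    msg.add_attachment(bytes(range(256)), maintype="application",
                       subtype="octet-stream", filename="blob.bin")
    # The result is multipart/mixed: readable text plus binary data,
    # base64-encoded so the whole message stays printable ASCII.
    print(msg)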

I've worked with Lotus Notes files (I don't remember the extension) that easily stored complex data structures. Things like a document with embedded video, sound, programs, pictures, links to other files, other files themselves, could all easily be put into a single file.

The point is to attempt to formulate a new standard that allows all this, in one well defined, extensible way that permits all future potential cases. We do know how to do this now.

I feel like the Lotus Notes file was extremely extensible.

But that's a file format, not a character / control code encoding scheme.

No. People who do scan captures of documents will understand that. They face the choice: keep the document as page images (can't text search), or OCR'd text (losing the page's visual soul.)

My understanding is that there are multiple file types (formats) that can already do that.

I believe it's possible to include the document as page image -and- include OCR'd text so that it is searchable.

I feel confident that an Epub can include an HTML page that includes the image(s) and ALT value on IMG tags.

I bet there are a number of other options too.

But it should be possible to do BOTH,

Agreed.  I think that it is possible to do both.

in one file structure

That sounds dangerously close to a file format. Or a LOT closer to a file format than a character / control code encoding scheme.

if there was a defined way to link elements in the symbolic text to words and characters in the images.

I believe there likely is, or can be.

I wonder if an image map would work in an Epub or Microsoft's HTML Archive files.

You'll say 'this is file format territory.'

Yep.

True…

:-)

…at the moment,

What will change such that the file formats that exist today won't exist, or won't be able to continue to do this, in the future?

Or why will what works today stop working in the future?

but only because the basic coding scheme lacks any such capability.

Even if the new encoding scheme that you're working on (which you can't talk about) does include these capabilities, that does not preclude the current file formats from continuing to work in the future.

You realise ASCII doesn't do that?

Sorry, I was talking within the context of ASCII.

I believe that any computer that uses ASCII (and doesn't do a translation internally) does represent a capital A as binary 01000001.

If that is not the case, please enlighten me.
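
(For whatever it's worth, it's a one-liner to check:)

    assert format(ord("A"), "08b") == "01000001"   # ASCII capital A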

Something got lost there. "^W' ??

Sorry, ^W is how unix geeks represent Control-W, which is a readline key sequence to erase the last word.

So I was effectively asking "What is a 2D plane?".

Surely you understand that point. English: left to right, secondary flow: downwards. Many other cultural variants exist.

Yes, I understand that English is primarily left to right, and secondarily top to bottom. However, that is within a defined 2D plane. (Or page as I was calling it.)

My real point was to ask what defines a 2D plane (page)? Is it its size? Is it how much text can fit on it? What point size is the text on it?

A 2D plane (page) is rather nebulous without context to define it.

Huh? This is pretty random.

I was making a comparison to a defined 2D plane (page) that can hold a finite amount of information / text at a given point size.

I was then wondering if there was a similar definition of a unit for oral speech.

Not after ASCII became a standard - unless you were using a language that needed more or different characters. I.e. most of the world's population.

EBCDIC is still quite popular, even here in the US. Well, at least in IBM shops. I hear that it's also popular in other mainframe shops around the world that want to interoperate with IBM mainframes.

Unicode / UTF-* are also gaining traction.

Thus I think the other encoding methods are making a difference.  ;-)

Hah. In fact, the ability to represent unlimited-length numeric objects is one of the essentials of an adequate coding scheme. ASCII doesn't.

I disagree. ASCII does numbers just as well as they are taught to kindergartners in the English-speaking world, where numbers are a collection of individual characters, 0 - 9.

Granted, that's not the same thing as a single word of memory holding a 64-bit number. But, humans don't have tens / hundreds / thousands of different numbers representing different values that are their own discrete character. Instead, humans use different sets of digits to represent different values in different places.

The whole 'x-bits long words' is one of the hangups of computing architectures too.

Sure. I think doing something like humans do might be more scalable. But then we could get by with probably 4 or 5 bit representations of numbers. Binary Coded Decimal comes to mind. }:-)
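
E.g. a packed-BCD sketch in Python, two decimal digits per byte, which incidentally handles numbers of any length:

    def to_bcd(number):
        # Pack two decimal digits per byte, 4 bits each (packed BCD).
        digits = [int(d) for d in number]
        if len(digits) % 2:
            digits.insert(0, 0)          # pad to an even digit count
        return bytes((hi << 4) | lo
                     for hi, lo in zip(digits[0::2], digits[1::2]))

    assert to_bcd("1234").hex() == "1234"   # nibbles mirror the digits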

But that's another story.

Agree.

You're describing Chinese language programming. Though you didn't realise. And yes... :) A capable encoding scheme, and computing architecture built on it, would allow such a thing.

Something that is most decidedly outside of the scope of what ASCII was meant to solve.

Point? Not practical.

It might not be practical for most day-to-day computing. But I do think that there are merits to it for specific use cases.

The coding scheme has to be compatible with the existing cultural schemes and existing literature. (All of them.)

Why does the coding scheme have to be compatible? Why can't it be completely different, as long as there is a way to translate between them?

What began as my general interest in the evolution of information encoding schemes, gained focus as more and more instances of early mistakes became apparent. Eventually it spawned a deliberate project to evaluate 'starting over.'

Therein lies some critical meta-data. You have a purpose behind what you're doing, which happens to seem related to deriding ASCII.

Like this:

* Revisit the development history of computing science, identifying points at which, in hindsight, major conceptual shortcomings became cemented into foundations upon which today's practices rest.

* Evaluate how those conceptual pitfalls could have been avoided, given understandings arrived at later in computing science.

* Integrate all those improvements holistically, creating a virtual 'alternate timeline' of computing evolution, as if Computing Science had evolved with prescience of future conceptual advances and practical needs. Aiming to arrive at an information processing and computing architecture, that is what we'd already have now if we knew what we were doing from the start.

Learning from others' mistakes is usually a good thing.

Hey, we totally agree on something! I *HATE* PDF, and the Adobe DRM-flyblown horse it rode in on. When I scan tech documents, for lack of anything more acceptable I structure the page images in html and wrap as a RAR-book. Unfortunately few know of this method.

~chuckle~

There *was* at one point a freeware utility for deconstructing PDF files and analysing their structure. I forget the name just now. It apparently was Borged by the forces of evil, and can no longer be found. Anyone have a copy?

I've been able to get raw data out of PDFs before. But it's usually so badly broken that it's difficult if not impossible to make it practical to use. I'm usually better off just retyping what I want.

No, they are not intrinsically different things. It just seems that way from the viewpoint of convention because ASCII lacks so many structural features that file (and stream) formats have to implement on their own. (And so everyone does them differently.)

I disagree.

ASCII is a common way of encoding characters and control codes in the same binary pattern.

File formats are what collections of ASCII characters / control codes mean / do.

Ha, wait till (eventually - if ever) you see the real thing. I'm having lots of fun with it. Result is like 'alien tech.'

Please don't blame me for not holding my breath.

Soon. Few weeks. Got to get some stuff out of the way first. I have way too many projects.

:-)



--
Grant. . . .
unix || die
