Can we agree that there are far more general ways to store data than anything currently in common use, and that CSV and cousins like TSV are, in a sense, a subset of the others? Trees, arbitrary graphs and many other complex data structures are often encountered as in-memory objects while a program is running. Many are not trivial to store.
But some are, if all you see are table-like constructs such as matrices and data.frames. Any rectangular data format with umpteen rows and N columns can trivially be stored in many other formats, especially ones that allow some columns to have NA values. The other format would simply have one major category per row, each containing one component per column; a missing component represents an NA. Is there any reason JSON or XML cannot include the contents of any CSV with headers, without loss of info?

Going the other way is harder. Note that a data.frame type of structure often imposes restrictions on a CSV and requires everything in a column to be of the same type, or coercible to a common type. (Well, not always, as with list columns in R.) But given some arbitrary structure in XML, can you look at all possible labels and, if it is not too complex, make a CSV with one or more columns for every possible need?

It can be a problem if, say, a record for an Author allows multiple co-authors. Normal books may let you get by with multiple columns (mostly containing NA) with names like author1, author2, author3, ... But scientific papers seemingly allow oodles of authors, and any time you update the data you may need yet another column. And, of course, processing data where many columns have the same meaning is a bit of a pain.

Data structures can also be nested multiple levels deep, and at some point CSV is not a reasonable fit unless you play database games and make multiple tables that you can store, retrieve and combine to answer complex queries, as in many relational database systems. Yes, each such table can be a CSV. But if you give someone a hammer, they tend to stop using thumbtacks or other tools.

The real question is what kind of data makes good sense for an application. If a nice rectangular format works, great.
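To make the two directions concrete, here is a minimal sketch in Python using only the standard csv and json modules. Everything named in it is made up for illustration: the sample data, the function names csv_to_json, json_to_csv and flatten_authors, the "columns"/"rows" JSON layout, and the assumption that an empty CSV field means NA. It is one possible way to do this, not the way.

```python
import csv
import io
import json

NA = ""  # assumption: an empty CSV field stands for NA

# Hypothetical sample data, invented for this sketch.
CSV_TEXT = (
    "title,author1,author2,author3\n"
    'Book A,"Smith, Ann",,\n'
    'Book B,"Jones, Bob","Lee, Cy",\n'
)


def csv_to_json(text):
    """CSV -> JSON: one object per row, NA fields simply left out.

    The column list is stored too, so a column that is NA in every
    row (author3 here) is not silently lost.
    """
    reader = csv.DictReader(io.StringIO(text))
    rows = [
        {key: value for key, value in row.items() if value not in (None, NA)}
        for row in reader
    ]
    return json.dumps({"columns": reader.fieldnames, "rows": rows}, indent=2)


def json_to_csv(doc):
    """JSON -> CSV: a key missing from a row is written back as NA."""
    data = json.loads(doc)
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=data["columns"], restval=NA, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(data["rows"])
    return out.getvalue()


# The harder direction: nested records with any number of authors.
NESTED = [
    {"title": "Book A", "authors": ["Smith, Ann"]},
    {"title": "Paper B", "authors": ["Jones, Bob", "Lee, Cy", "Kim, Di"]},
]


def flatten_authors(records):
    """Spread each author list over author1..authorN columns.

    N is set by the longest list, so one new record with more
    authors changes the column layout for every row.
    """
    width = max((len(rec["authors"]) for rec in records), default=0)
    columns = ["title"] + [f"author{i + 1}" for i in range(width)]
    rows = []
    for rec in records:
        row = {"title": rec["title"]}
        for i, name in enumerate(rec["authors"]):
            row[f"author{i + 1}"] = name
        rows.append(row)
    return columns, rows


if __name__ == "__main__":
    print(csv_to_json(CSV_TEXT))
    print(json_to_csv(csv_to_json(CSV_TEXT)))
    print(flatten_authors(NESTED)[0])
```

The rectangular-to-JSON direction round-trips only because the column list travels with the rows. The flattening direction has no such fixed shape: the header depends on whichever record happens to have the most authors.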
Even if not, the Author problem above can fairly easily be handled by making the author column something like a character string you compose as "Last1, First1; Last2, First2; Last3, First3" and that fits fine in a CSV but can be taken apart in your software if looking for any book by a particular author. Not optimal, but a workaround I am sure is used.

But using the most abstract and complex storage method is very often overkill and, unless you are very good at it, may well be a fairly slow and even error-prone way to solve a problem.

-----Original Message-----
From: Python-list <python-list-bounces+avigross=verizon....@python.org> On Behalf Of Chris Angelico
Sent: Thursday, September 23, 2021 9:27 AM
To: Python <python-list@python.org>
Subject: Re: XML Considered Harmful

On Thu, Sep 23, 2021 at 10:55 PM Mats Wichmann <m...@wichmann.us> wrote:
>
> On 9/22/21 10:31, Dennis Lee Bieber wrote:
>
> > If you control both the data generation and the data
> > consumption, finding some format ...
>
> This is really the key. I rant at people seeming to believe that csv
> is THE data interchange format, and it's about as bad as it gets at
> that, if you have a choice. xml is noisy but at least (potentially)
> self-documenting, and ought to be able to recover from certain errors.
> The problem with csv is that a substantial chunk of the world seems to
> live inside Excel, and so data is commonly both generated in csv so it
> can be imported into excel and generated in csv as a result of
> exporting from excel, so the parts often are *not* in your control.
>
> Sigh.

The only people who think that CSV is *the* format are people who habitually live in spreadsheets. People who move data around the internet, from program to program, are much more likely to assume that JSON is the sole format. Of course, there is no single ultimate data interchange format, but JSON is a lot closer to one than CSV is.
(Or to be more precise: any such thing as a "single ultimate data interchange format" will be so generic that it isn't enough to define everything. For instance, "a stream of bytes" is a universal data interchange format, but that's not ultimately a very useful claim.)

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list