On 21/09/2021 13.12, Michael F. Stemper wrote:
> If XML is not the way to package data, what is the recommended approach?

Well, there have been a lot of ideas put forth on this thread, many more than I expected. I'd like to thank everyone who took the time to contribute.

Most of the reasons given for avoiding XML appear to be along the lines of "XML has all of these different options that it supports." However, it seems that I could ignore 99% of those things and just use a teeny subset of its capabilities. For instance, if I modeled a fuel like this:

  <Fuel name="Montana Sub-Bituminous">
    <uom>ton</uom>
    <price>21.96</price>
    <heat_content>18.2</heat_content>
  </Fuel>

and a generating unit like this:

  <Generator name="Skunk Creek 1">
    <IHRcurve name="normal">
      <point P="63" IHR="8.513"/>
      <point P="105" IHR="8.907"/>
      <point P="241" IHR="9.411"/>
      <point P="455" IHR="10.202"/>
    </IHRcurve>
    <IHRcurve name="constrained">
      <point P="63" IHR="8.514"/>
      <point P="103" IHR="9.022"/>
      <point P="223" IHR="9.511"/>
      <point P="415" IHR="10.102"/>
    </IHRcurve>
  </Generator>

why would the fact that I could have chosen, instead, to model the unit of measure as an attribute of the fuel, or its name as a sub-element matter? Once the modeling decision has been made, all of the decisions that might have been would seem to be irrelevant.

Some years back, IEC's TC57 came up with CIM[1]. This nailed down a lot of decisions. The fact that other decisions could have been made doesn't seem to keep utilities from going forward with it as an enterprise-wide data model.

My current interests are not anywhere so expansive, but it seems that the situations are at least similar:
1. Look at an endless range of options for a data model.
2. Pick one.
3. Run with it.

To clearly state my (revised) question:

  Why does the existence of XML's many options cause a problem for my use case?

Other reactions:

Somebody pointed out that some approaches would require that I climb a learning curve. That's appreciated, although learning new things is always good.

NestedText looks cool, and a lot like YAML.
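
As a rough sketch of what reading that teeny subset of XML might look like in Python, assuming the standard library's xml.etree.ElementTree and the Fuel and Generator elements sketched earlier in this message. The function names (read_fuel, read_curves) and the dictionary layout are hypothetical, chosen only for illustration. Every value comes back from the parser as a string, so each conversion to a number is spelled out rather than guessed.

```python
import xml.etree.ElementTree as ET

# Sample input copied from the elements sketched earlier in this message.
FUEL_XML = """
<Fuel name="Montana Sub-Bituminous">
  <uom>ton</uom>
  <price>21.96</price>
  <heat_content>18.2</heat_content>
</Fuel>
"""

GENERATOR_XML = """
<Generator name="Skunk Creek 1">
  <IHRcurve name="normal">
    <point P="63" IHR="8.513"/>
    <point P="105" IHR="8.907"/>
    <point P="241" IHR="9.411"/>
    <point P="455" IHR="10.202"/>
  </IHRcurve>
  <IHRcurve name="constrained">
    <point P="63" IHR="8.514"/>
    <point P="103" IHR="9.022"/>
    <point P="223" IHR="9.511"/>
    <point P="415" IHR="10.102"/>
  </IHRcurve>
</Generator>
"""


def read_fuel(text):
    """Return a dict for one <Fuel> element (hypothetical layout)."""
    fuel = ET.fromstring(text)
    return {
        "name": fuel.get("name"),
        "uom": fuel.findtext("uom"),
        # ElementTree returns strings; conversion to float is explicit.
        "price": float(fuel.findtext("price")),
        "heat_content": float(fuel.findtext("heat_content")),
    }


def read_curves(text):
    """Return the unit name and a dict mapping curve name to (P, IHR) pairs."""
    unit = ET.fromstring(text)
    curves = {}
    for curve in unit.findall("IHRcurve"):
        curves[curve.get("name")] = [
            (float(pt.get("P")), float(pt.get("IHR")))
            for pt in curve.findall("point")
        ]
    return unit.get("name"), curves


fuel = read_fuel(FUEL_XML)
unit_name, curves = read_curves(GENERATOR_XML)
print(fuel["name"], fuel["uom"], fuel["price"])
print(unit_name, sorted(curves))
```

Only attribute lookup, child-element text, and findall are used here; none of XML's other options (namespaces, DTDs, entities, mixed content) come into play once the modeling decision has been made.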
Having not gotten around to playing with YAML yet, I was surprised to learn that it tries to guess data types. This sounds as if it could lead to the same type of problems that led to the names of some genes being turned into dates.

It was suggested that I use an RDBMS, such as sqlite3, for the input data. I've used sqlite3 for real-time data exchange between concurrently-running programs. However, I don't see syntax like:

  sqlite> INSERT INTO Fuels
     ...> (name,uom,price,heat_content)
     ...> VALUES ("Montana Sub-Bituminous", "ton", 21.96, 13.65);

as being nearly as readable as the XML that I've sketched above. Yeah, I could write a program to do this, but that doesn't really change anything, since I'd still need to get the data into the program.

(Changing a value would be even worse, requiring the dreaded UPDATE statement, instead of five seconds in vi.)

Many of the problems listed for CSV, which come from its lack of standardization, seem similar to those given for XML. "Commas or tabs?" "How are new-lines represented?" If I were to use CSV, I'd be able to just pick answers. However, fitting hierarchical data into rows/columns just seems wrong, so I doubt that I'll end up going that way.

As far as disambiguating authors, I believe that most journals are now expecting an ORCID[2] (which doesn't help with papers published before that came around).

As far as use of XML to store program state, I wouldn't ever consider that. As noted above, I've used an RDBMS to do so. It handles all of the concurrency issues for me. The current use case is specifically for raw, static input.

Fascinating to find out that XML was originally designed to mark up text, especially legal text.

It was nice to be reminded of what Matt Parker looked like when he had hair.

[1] <https://en.wikipedia.org/wiki/Common_Information_Model_(electricity)>
[2] <https://orcid.org/>

--
Michael F. Stemper
Psalm 82:3-4