You are confusing data representation with data presentation. The flaws in Excel are NOT issues with the data format. So long as the format clearly and consistently represents the content, the representation is "good".
If you want to overcome limitations in Excel's presentation (import, interpretation), then that's an Excel issue. You can overcome it by doing the import manually and explicitly asserting the data type of each column, or you can create something more custom.
Realize that, more and more, your data is likely to be consumed by some other data science tool (R, Python/numpy/pandas, etc.), and you quickly see how pushing Excel issues into the data representation layer is a losing proposition.
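For instance, here's a minimal pandas sketch of that kind of explicit import; the file and column names are hypothetical:

    import pandas as pd

    # Assert each column's type at import time instead of letting the
    # reader guess; names are placeholders for whatever the header holds.
    df = pd.read_csv(
        "capture.csv",
        dtype={"power_db": "float64", "flag": "int64", "source": "string"},
        parse_dates=["timestamp"],  # only this column is treated as a date
    )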
---
Jim Melton
-----Original Message-----
Sent: Thursday, May 4, 2023 01:40
To: discuss-gnuradio@gnu.org
Subject: [EXTERNAL] Re: Getting GPS data into stream
Hey Marcus,
as you say, for a lot of science you don't get high rates – so I'm really less worried about that. I'm more worried about Excel interpreting some singular data point as a date; or, as soon as we involve textual data, all the fun with encodings, quoting/delimiting/escaping… (not to mention that an Excel set to German might interpret different things as numbers than a North American one).
I wish there were just one good CSV standard that tools adhered to. Alas, that's not the case, and Excel especially has a habit of autoconverting input and losing data at that point.
So, I'm looking for an alternative that has these well-defined constraints and isn't as focused on hierarchical data (JSON, YAML, XML), isn't far too verbose despite being excellent to query with command-line tools (XML), and isn't completely impossible to parse correctly, as human or parser, in its full beauty (YAML)… Just some tabular data notation that's textual, appendable, and not a guessing game for the reading tool.
We could just canonicalize calling all our files
marcusdata.utf8.textalwaysquoted.iso8601.headerspecifies_fieldname_parentheses_type.csv
but even that wouldn't solve the issue of Excel seeing an unquoted 12.2021 and deciding the field is about Christmases past.
So, maybe we just do some rootless JSON format that starts with a SigMF object describing the file and its columns, and then is basically just a sequence of JSON arrays:
[ 1.212e-1, 0, "Müller", 24712388823 ]
[ 1.444e-2, 1, "📡🔭 \"👽\"!", 11111111111 ]
[ 2.0115e-1, 0, "Cygnus-B", 0 ]
(I'm not even sure that's not valid JSON; gut feeling tells me we should be putting [] around the whole document, but we don't want that for streaming purposes. ECMA-404 doesn't seem to *forbid* it.)
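Incidentally, one JSON text per line is basically the JSON Lines convention. A quick Python sketch of what writing and reading that could look like; the header keys here are made up for illustration, not actual SigMF fields:

    import json

    # Hypothetical header object; keys are illustrative, not real SigMF.
    header = {
        "global": {"version": "0.1"},
        "columns": [
            {"name": "power", "type": "float"},
            {"name": "flag", "type": "int"},
            {"name": "source", "type": "str"},
            {"name": "timestamp_ns", "type": "int"},
        ],
    }

    rows = [
        [1.212e-1, 0, "Müller", 24712388823],
        [2.0115e-1, 0, "Cygnus-B", 0],
    ]

    # One JSON text per line: header first, then one array per record.
    # Appending later records is just appending lines.
    with open("observations.jsonl", "w", encoding="utf-8") as f:
        f.write(json.dumps(header, ensure_ascii=False) + "\n")
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")

    # Reading back: json.loads() per line, no guesswork about types.
    with open("observations.jsonl", encoding="utf-8") as f:
        meta = json.loads(next(f))
        data = [json.loads(line) for line in f]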
That way, we get the metadata in a format that's easy for simpler tools to skip, but trivial to parse with the right tools (I've grown to like `jq`), and the data in a well-defined format. Sure, you still can't dump that into Excel, but you know what, if it comes down to it, we can have a Python script that takes these files and actually converts them to valid XLSX without the misconversion footguns, and that same tool could also run in a browser for those having a hard time executing Python on their machines.
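A minimal sketch of that converter, assuming the JSON-lines layout and the illustrative header from above, and using openpyxl:

    import json
    from openpyxl import Workbook  # pip install openpyxl

    wb = Workbook()
    ws = wb.active

    with open("observations.jsonl", encoding="utf-8") as f:
        header = json.loads(next(f))  # first line: the metadata object
        ws.append([col["name"] for col in header["columns"]])
        for line in f:
            # Values arrive already typed from json.loads(), so Excel
            # never gets a chance to decide that 12.2021 is a date.
            ws.append(json.loads(line))

    wb.save("observations.xlsx")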
Cheers,
Marcus
On 03.05.23 23:05, Marcus D. Leech wrote:
On 03/05/2023 16:51, Marcus Müller wrote:
Do agree, but I really don't like CSV: too underspecified a format, too many ways it comes back to bite you (aside from a thousand SDR users writing emails that their PC can't keep up with writing a few MS/s of CSV…)
I like CSV because you can hand your data files to someone who doesn't have a complete suite of astrophysics tools, and they can slurp it into Excel and play with it.
How important is plain-textness in your applications?
I (and many others in my community) tend to throw ad-hoc tools at data from ad-hoc experiments. In the past, I used a lot of AWK to post-process data, and these days, I use a lot of Python. Text-based formats lend themselves well to this kind of processing. Rates are quite low, typically; like logging an integrated power spectrum a few times a minute, for example.
There are other observing modes where text-based formats aren't quite so obvious, like pulsar observations, where filterbank outputs might be recorded at 10s of kHz and then post-processed with any of a number of pulsar tools.
In all of this, part of the "science" is extracted in "real-time" and part in
post-processing.
Best,
Marcus