Re: proposal to allow loaddata to read from stdin [was: manage.py loaddata fails to report error]

Russell Keith-Magee Fri, 17 Apr 2009 04:58:49 -0700

On Fri, Apr 17, 2009 at 3:56 PM, Phil Mocek
<pmocek-list-django-us...@mocek.org> wrote:
>
> However, due to long-standing conventions, the filename suffix
> ("extension" in MS-DOS parlance) often provides a clue about the format
> of the data within.  These conventions allow loaddata to make a
> reasonable guess about the content of a file based on its name.  That
> may be a useful practice, but it's still just a guess.


This is true. However, I'm not in the habit of labelling my XML files
.json, so in circumstances like this one, the extension can be a very
useful clue. It only becomes problematic when you have extensions like
.doc, which could mean a pure text document, an RTF document, an MS
Word document (which is itself any number of possible formats), or
anything else that someone has decided to call a "doc". This isn't a
problem that exists for fixture formats, so we're safe.

I would also repeat that we're not talking about filenames here -
we're talking about fixture labels. When you call "loaddata
foobar.json", you're not saying "load the file foobar.json". You're
saying "find all the JSON fixtures called foobar, looking in
app1/fixtures, app2/fixtures, FIXTURE_DIRS, and the current working
directory". Filenames are the degenerate case of the fixture labeling
system.
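
To make the label idea concrete, here is a rough sketch of how a label
could expand into candidate paths. It is not Django's loader code: the
helper name candidate_paths, its signature, and the two-format list are
assumptions made up for the illustration.

```python
import os


def candidate_paths(label, fixture_dirs, formats=("json", "xml")):
    """Expand a fixture label into every path that might hold it.

    Hypothetical helper: the name, signature and format list are
    assumptions for illustration, not Django's actual implementation.
    """
    name, ext = os.path.splitext(label)
    ext = ext.lstrip(".")
    if ext in formats:
        # "foobar.json" -> only JSON fixtures named "foobar"
        wanted = [ext]
    else:
        # "foobar" -> fixtures named "foobar" in any known format
        name, wanted = label, list(formats)
    return [
        os.path.join(directory, f"{name}.{fmt}")
        for directory in fixture_dirs
        for fmt in wanted
    ]
```

Called with "foobar.json" and two directories, this yields one
candidate per directory; called with plain "foobar", it yields one
candidate per directory per format.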

> Whether input is read from standard input or from a file whose name was
> passed on the command line, loaddata will not know what is in the file
> until the file is read.

This is patently untrue, unless you are applying an exceptionally
gnostic interpretation of the word "know". Django has a very large
test suite, with a large number of test fixtures. On top of that, I
have a huge number of fixtures in test cases for personal and work
projects, and I know many other people with similar fixture
collections. "Knowing" the format of those fixtures has not yet proven
to be a problem. XML fixtures are named .xml. JSON fixtures are named
.json. This is neither confusing, nor surprising, nor problematic.

> In either
> case, loaddata would be engaging in risky behavior if it used the data
> without some verification that it is of a particular format.

Well, if you pass JSON data in and request parsing as XML, you're
going to find the parser choking pretty quickly. Fixture loading all
happens in transactions, and the parsers are pretty strict, so if you
start seeing errors, the database won't get anything loaded.
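
As a small illustration of that point (a sketch using the standard
library's json and sqlite3 modules, not Django's fixture loader; the
table names and the load_as_json helper are invented for the example),
a strict parser plus a transaction leaves nothing behind when the
format is wrong:

```python
import json
import sqlite3

# XML content deliberately handed to a JSON loader.
XML_FIXTURE = '<?xml version="1.0"?><django-objects version="1.0"></django-objects>'


def load_as_json(conn, text):
    """Parse text as JSON and insert rows, all inside one transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("INSERT INTO log (note) VALUES ('load started')")
        for obj in json.loads(text):  # raises ValueError on non-JSON input
            conn.execute("INSERT INTO item (name) VALUES (?)", (obj["name"],))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (name TEXT)")
conn.execute("CREATE TABLE log (note TEXT)")
conn.commit()

try:
    load_as_json(conn, XML_FIXTURE)
except ValueError as exc:
    print("parser rejected input:", exc)

# The insert made before the parse error was rolled back with the rest.
rows_after_failure = conn.execute("SELECT COUNT(*) FROM log").fetchone()[0]
```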

> A --format option would achieve parity with the syntax of loaddata's
> counterpart, dumpdata, and would allow the caller to explicitly state
> what loaddata should expect. This would provide more information to
> loaddata about the format of the data than a filename provides, because
> the best the filename can do is imply something about the format of the
> data.  Validly-formatted data still wouldn't be guaranteed, but the
> program would then know what the caller intends for the format to be,
> and that is more information than it would have were it provided with
> nothing but the stream of data and a filename.

I agree that there would be parity with dumpdata, and I can see how
--format could be useful in the context of a stdin input mode for
loaddata. It would certainly be more reliable than trying to invent
magic file format detection methods.

However, I flat out reject the idea that:

loaddata --format=xml foobar.xml

provides better formatting guarantees than:

loaddata foobar.xml

in any real fixture loading situation. I also reject the idea that
using a command line argument allows the parser to know what format
the caller "intends" any better than the original author of a fixture
does when they name their fixture, knowing the way it will be used by
Django.

> If it's reasonable for the program to predict data format based on
> filename, then it would seem to be reasonable in the absence of a
> filename to predict the data format based on whatever the default format
> is documented to be.  This would again provide parity with the dumpdata
> command, which prints data in its default format, JSON, when a format is
> not explicitly selected by the caller.  In either case, it would be best
> for some sort of internal check to occur before loaddata does anything
> with the stream of data.

There is the start of an idea here, but there are several significant
edge cases. To pick a few easy targets: what happens in the following
cases? Why is your proposed behaviour _not_ surprising when compared
with the existing behaviour without the --format option?

loaddata --format=json foobar.xml

loaddata --format=json foobar
(when there is a foobar.xml fixture available)

loaddata --format=json foobar.json whizbang.xml
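
These cases can be spelled out with a short sketch that only flags
where a --format value and a label agree or disagree. It deliberately
does not decide what loaddata should do in each case; the classify
helper and the KNOWN_FORMATS tuple are hypothetical names used for
illustration.

```python
import os

KNOWN_FORMATS = ("json", "xml")  # assumed set of serializer names


def classify(format_opt, label):
    """Describe how a --format value relates to one fixture label."""
    ext = os.path.splitext(label)[1].lstrip(".")
    if ext not in KNOWN_FORMATS:
        # e.g. "foobar": the label itself says nothing about format
        return "label has no format"
    if ext == format_opt:
        return "agree"
    # e.g. --format=json with "foobar.xml"
    return "conflict"
```

Any real proposal would have to say what happens for each "conflict"
and "label has no format" result.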

> My inclination would be for it to adopt the POSIX convention of reading
> from a particular file if one is specified on the command line, from
> standard input if the filename specified on the command line is "-",
> and from standard input if no file is specified on the command line.
> Thus, the rules would be:
>
>    if filename specified on cmdline:
>        if that filename is "-":
>            read from stdin
>        else:
>            read from file with name specified
>    else:
>        read from stdin
>
> Note that this means anything on standard input is ignored if the caller
> provides a filename.  This maintains backwards-compatibility with the
> current behavior.

Again - the user isn't providing a filename, they're providing a
label. However, I don't see anything fundamentally wrong with this
idea.
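
For reference, the quoted rules could be written roughly like this
(a sketch only; choose_sources is a hypothetical helper, and it treats
each argument as a label rather than a filename):

```python
def choose_sources(args):
    """Map command-line arguments to input sources.

    Returns a list of ("stdin", None) or ("label", name) tuples,
    following the rules quoted above. Hypothetical helper, not part
    of loaddata.
    """
    if not args:
        # Nothing on the command line: read from standard input.
        return [("stdin", None)]
    # "-" means standard input; anything else is a fixture label.
    return [("stdin", None) if arg == "-" else ("label", arg) for arg in args]
```

Note that this only settles where the data comes from. It says nothing
about what format that data is in.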

> Once the program decides where to get input, it can move on to deciding
> what to do with that input based on all the information available to it.
> These seem to be very different tasks, and I think it's important to
> maintain a distinction between them.

To be clear, my problems with taking input from stdin are entirely
linked to the second of these two problems. I don't have any
particular objection to taking loaddata input from stdin per se. My
objections lie with how the format of this data will be determined.
Using --format is one approach, but we need to be very clear how the
--format directive is interpreted for the existing use cases (in
particular, during unspecified format fixture discovery, such as
initial_data).

> For an example of the usefulness of using loaddata with standard input,
> consider the act of loading the database of a Django-based Web
> application with the content of the database of an instance of the same
> application running elsewhere:
>
>    ssh remotehost /path/to/manage.py dumpdata | /path/to/local/manage.py loaddata

I have no difficulty seeing _how_ you would plug these pipes together.
I have a slight difficulty imagining _why_ you would want to. That
isn't to say your idea is bad - it's just not something I've found
myself in a position of thinking "oh, I wish I could do that".

Yours,
Russ Magee %-)
