Re: GSoC Meta refactor: Bikeshedding time!!

Ivan Kharlamov Wed, 20 Aug 2014 05:43:04 -0700

On 08/20/2014 04:28 PM, Ivan Kharlamov wrote:
> On 08/20/2014 03:52 PM, Ivan Kharlamov wrote:
>> On 08/20/2014 12:46 PM, Marc Tamlyn wrote:
>>> I'd say ArrayField is a straight up data field at the moment. It stores
>>> 0-1 lists of data. It's no different to CommaSeparatedIntegerField
>>> (seriously, why does that exist...)
>>>
>>> *If* PG gets the relevant update that will allow `integer[] references`
>>> (i.e. ArrayField(ForeignKey)) then this would be different, and would be
>>> more like a m2m field.
>>>
>>> There is an argument that it's 0-N anyway, but in the implementation
>>> both within Django and in the database I don't think the distinction is
>>> useful at this point, from an ORM point of view in any case. From a forms
>>> point of view it's quite different.
>>>
>>>
>>> On 20 August 2014 09:19, Russell Keith-Magee wrote:
>>>
>>>
>>>     On Mon, Aug 18, 2014 at 6:03 PM, Anssi Kääriäinen wrote:
>>>
>>>         On Monday, August 18, 2014 7:45:17 AM UTC+3, Russell Keith-Magee
>>>         wrote:
>>>
>>>             I understand what you're driving at here, and I've had
>>>             similar thoughts over the course of the SoC. The catch is
>>>             that this makes the API for get_fields() fairly complicated.
>>>
>>>             If every field fits into one specific type, then
>>>             get_fields() just requires a single boolean flag (do I
>>>             include fields of type X) for each field type. We can also
>>>             easily add new field types by adding new booleans to the API.
>>>
>>>             However, if a field fits into multiple categories, then it's
>>>             impossible (or, at least, exceedingly complicated) to make a
>>>             single call to get_fields() that will specify all your field
>>>             requirements. "Get me all non-virtual data fields" requires
>>>             "virtual=False, data=True, m2m=False", but "Get all virtual
>>>             data fields that represent m2ms" requires "virtual=True,
>>>             data=False, m2m=True". You can't pass in both sets of
>>>             arguments at the same time, so you either have to make
>>>             multiple calls to get_fields(), or you have to invent some
>>>             sort of query syntax for get_fields() that allows union
>>>             queries. 
>>>
>>>             Plus, at the end of the day, get_fields() is abstracted
>>>             behind highly cached and optimised properties for key
>>>             lookups. These properties are effectively a cached call to
>>>             get_fields() with a specific set of arguments - so even if
>>>             get_fields() doesn't expose a "one category per field"
>>>             requirement, the API will require, at some level, names that
>>>             have clear (and preferably non-overlapping) membership.
>>>
>>>
>>>         If fields are in multiple categories then users will want to do
>>>         the full range of set operations on the categories. Encoding
>>>         that into the API doesn't sound promising.
>>>
>>>
>>>                 I don't think users actually want to get fields based on
>>>                 the suggested categorization. I feel we get an easier to
>>>                 use and more flexible API if we have higher level
>>>                 categories and allow fields to match multiple
>>>                 categories. As a practical example if I want all
>>>                 relation fields, that is going to be hard using the
>>>                 suggested API. Getting all relation fields is a more
>>>                 realistic use case than getting related virtual objects.
>>>
>>>
>>>             Quite probably true. As a point of interest, the current (as
>>>             in, 1.6) API actually doesn't differentiate between category
>>>             (a) "pure data" and category (b) "relating data (i.e., FK)"
>>>             fields - if you ask for "data fields" you get pure data
>>>             *and* foreign keys. So, at least as far as Django's own
>>>             usage is concerned, you're correct in saying that the
>>>             taxonomy I've described isn't fully required.
>>>
>>>             Daniel's survey of internal usage reveals that there are
>>>             three use cases for getting a list of fields in Django's
>>>             internal API:
>>>
>>>              * Get all data and m2m fields (i.e., categories  a, b, and
>>>             d). This is effectively "all fields on *this* model"
>>>
>>>              * Get all data, m2m, related objects, related m2m, and
>>>             virtual fields (i.e., categories a, b, d, f, g, h, i -
>>>             excluding c and e because Django doesn't currently have any
>>>             fields of this type). This is "all fields on this model, or
>>>             related to this model"
>>>
>>>              * Get all m2m fields (i.e., category d)
>>>              
>>>             So - at the very least, we need names to describe those
>>>             three groups. My intention with describing a richer taxonomy
>>>             is to try and give names to other groupings of interest. 
>>>
>>>                 If we want every field to match one and only one
>>>                 category, then we need to redefine the categories
>>>                 to make sure ForeignKeys as virtual fields are possible,
>>>                 and that more esoteric custom join based fields fit
>>>                 into the categorization.
>>>
>>>
>>>             Agreed - that's why I threw this out there for discussion :-)
>>>
>>>             Properties like "data", "virtual", "external", "related",
>>>             "relating" - these are high level concepts describing the
>>>             way a field manifests. However, that doesn't mean we need to
>>>             expose these properties as part of the formal API.
>>>
>>>             Part of the underlying problem here -- let's say we roll out
>>>             Django 1.7 with some version of this API, and in 1.8,
>>>             foreign key fields change to become virtual. That
>>>             effectively becomes backwards incompatible for queries that
>>>             are sensitive to a "virtual" flag; but it doesn't change the
>>>             underlying need to identify that a field is a foreign key.
>>>             We need to capture the latter use case, but not necessarily
>>>             the former.
>>>
>>>          
>>>         Could we go with a minimal API for get_fields()? Instead of
>>>         having categorization on the get_fields() API, we could provide
>>>         field flags for the categories. With field flags it is
>>>         straightforward to filter the return list of get_fields(). As an
>>>         example, fetching those fields which are relations but which
>>>         aren't virtual: [f for f in get_fields() if f.relational and not
>>>         f.virtual]. If this path is taken, then I am not sure how
>>>         minimal the get_fields() API should be. We likely need flags
>>>         for at least whether the field is defined on the local model,
>>>         a parent, or some remote model.
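
The flag-based filtering suggested above can be sketched without Django at all. This is only a minimal illustration: the `Field` class, its `relational`, `virtual` and `local` attributes, and the stand-in `get_fields()` are hypothetical names that mirror the wording of the paragraph, not an actual Django API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    # Hypothetical stand-in for a model field; not Django's Field class.
    name: str
    relational: bool = False
    virtual: bool = False
    local: bool = True


def get_fields():
    # Hypothetical minimal API: return every field, with no filter arguments.
    return [
        Field("title"),
        Field("author", relational=True),
        Field("content_object", relational=True, virtual=True),
        Field("created", local=False),
    ]


# Relations that are not virtual, as in the comprehension quoted above.
concrete_relations = [f for f in get_fields() if f.relational and not f.virtual]

# Callers combine flags freely instead of relying on keyword arguments.
local_non_relations = [f for f in get_fields() if f.local and not f.relational]

print([f.name for f in concrete_relations])
print([f.name for f in local_non_relations])
```

With this shape, any set operation on the categories is an ordinary Python expression on the caller's side, so get_fields() itself does not have to grow a query syntax.
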
>>>
>>>         As for changing ForeignKey to virtual field plus concrete field
>>>         representation - I just realized this will be backwards
>>>         incompatible no matter what we do regarding categorization. A
>>>         get_fields() call that includes all fields will return separate
>>>         author (virtual) and author_id (concrete) fields after the
>>>         split. I am not sure what we can do about this. It would be very
>>>         unfortunate if we can't refactor the way ForeignKeys work due to
>>>         the meta API. Any ideas how we can avoid the backwards
>>>         compatibility trap?
>>>
>>>
>>>     I think Daniel and I might have come up with a way to meet both
>>>     these requirements - a minimalist API for get_fields, with at least
>>>     some protection against the known incoming backwards compatibility
>>>     issue.
>>>
>>>     The summary so far: it appears that a complex taxonomy isn't
>>>     especially helpful - firstly, because any complex taxonomy is going
>>>     to have edge cases that are hard to categorize, but also because a
>>>     complex taxonomy leads to a much more complex internal API that is
>>>     going to be prone to backwards compatibility problems.
>>>
>>>     So - instead of worrying about 'virtual' and other properties like
>>>     that, let's look at why the _meta API is fundamentally used - to get
>>>     a list of fields that need to be handled in data processing. This
>>>     primarily means forms, but other forms of serialisation are also
>>>     included. In these use cases, there are always going to be per-field
>>>     differences (even a CharField and an IntegerField require *slightly*
>>>     different handling), so we won't focus on internal representations,
>>>     storage mechanisms, or anything like that. Instead, let's focus on
>>>     cardinality - a field represents some sort of data that has a
>>>     cardinality with the object on which it is stored. If something has
>>>     cardinality 1, you can display a single field. If it's cardinality
>>>     N, you need to display a list, or some sort of inline.
>>>
>>>     This results in 3 categories that are mutually exclusive:
>>>
>>>     a) "Data fields": Fields of cardinality 0-1:
>>>
>>>      * A CharField stores 0 or 1 strings (0 is the case of a nullable
>>>     field).
>>>
>>>      * An IntegerField stores 0 or 1 integers.
>>>
>>>      * A FileField stores 0 or 1 file paths.
>>>
>>>      * An ImageField stores 0 or 1 file paths - although in being
>>>     modified, it might modify some other fields.
>>>
>>>      * A ForeignKey stores 0 or 1 references to another object. 
>>>
>>>      * A GenericForeignKey stores 0 or 1 references to another object.
>>>
>>>      * A notional "DocumentField" on a NoSQL store references 0 or 1
>>>     external documents.
>>>
>>>     b) "ManyToMany Fields": Fields that are locally defined that
>>>     represent a cardinality 0-N relationship with another object:
>>>
>>>      * Many to Many fields store 0-N references to a second model.
>>>
>>>     c) "Related Objects": Fields that represent a cardinality 0-N
>>>     relationship with this object, but aren't locally defined:
>>>
>>>      * The 'related' side of a ForeignKey
>>>
>>>      * The 'related' side of a ManyToMany
>>>
>>>      * A GenericRelation representing the reverse side of a
>>>     GenericForeignKey
>>>
>>>     These three types are mutually exclusive - you either have
>>>     cardinality 1 *or* cardinality N, not both; and you're either
>>>     locally defined on this object or you're not. I can't think of an
>>>     example of "cardinality 1 data that isn't defined on this object",
>>>     but it would fit into this taxonomy if it were needed; I also can't
>>>     think of a field definition that would span models.
>>>
>>>     In addition to this basic classification, a field can be marked as
>>>     "hidden". The immediate use for this is to hide the related_name='+'
>>>     case of a FK or M2M. Looking forward, it would be used to mask
>>>     fields that exist, but aren't intended to be user visible - for
>>>     example, in the potential future case where a ForeignKey is split in
>>>     two, or a Composite Key, there would be a "hidden" integer field (or
>>>     fields) storing the actual data, and a virtual (but non-hidden)
>>>     field that is the public API for manipulating the relationship. This
>>>     would also be backwards compatible, because the "visible" field list
>>>     hasn't changed.
>>>
>>>     Fields are also tracked according to their parentage; this is used
>>>     by tools interacting with inheritance relationships to know which
>>>     fields are actually on this model, and which are inherited from a
>>>     base class.
>>>
>>>     This yields the following formal API for _meta:
>>>
>>>      * get_fields(data, many_to_many, related, include_hidden,
>>>     include_parents)
>>>
>>>      * @property data_fields (=> get_fields(data=True,
>>>     many_to_many=False, related=False, include_hidden=False,
>>>     include_parents=True))
>>>
>>>      * @property many_to_many_fields (=> get_fields(data=False,
>>>     many_to_many=True, related=False, include_hidden=False,
>>>     include_parents=True))
>>>
>>>      * @property related_objects (=> get_fields(data=False,
>>>     many_to_many=False, related=True, include_hidden=False,
>>>     include_parents=True))
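
One way to picture the proposed API is the plain-Python sketch below. It is not the GSoC implementation: `Field`, `Options`, the `category` string and the default argument values are all assumptions made for illustration (the proposal lists only the argument names). It relies on the three categories being mutually exclusive, so each field carries exactly one category, and the three properties are cached calls to get_fields() with fixed arguments.

```python
from dataclasses import dataclass
from functools import cached_property


@dataclass(frozen=True)
class Field:
    # Hypothetical stand-in for a model field; not Django's Field class.
    name: str
    category: str            # exactly one of "data", "many_to_many", "related"
    hidden: bool = False
    inherited: bool = False  # True if the field comes from a parent model


class Options:
    """Hypothetical stand-in for a model's _meta object."""

    def __init__(self, fields):
        self._fields = tuple(fields)

    # The default values here are an assumption, not part of the proposal.
    def get_fields(self, data=True, many_to_many=False, related=False,
                   include_hidden=False, include_parents=True):
        wanted = set()
        if data:
            wanted.add("data")
        if many_to_many:
            wanted.add("many_to_many")
        if related:
            wanted.add("related")
        return tuple(
            f for f in self._fields
            if f.category in wanted
            and (include_hidden or not f.hidden)
            and (include_parents or not f.inherited)
        )

    @cached_property
    def data_fields(self):
        return self.get_fields(data=True, many_to_many=False, related=False,
                               include_hidden=False, include_parents=True)

    @cached_property
    def many_to_many_fields(self):
        return self.get_fields(data=False, many_to_many=True, related=False,
                               include_hidden=False, include_parents=True)

    @cached_property
    def related_objects(self):
        return self.get_fields(data=False, many_to_many=False, related=True,
                               include_hidden=False, include_parents=True)


meta = Options([
    Field("id", "data", inherited=True),
    Field("title", "data"),
    Field("author", "data"),                  # public side of a split ForeignKey
    Field("author_id", "data", hidden=True),  # hidden storage side
    Field("tags", "many_to_many"),
    Field("comment_set", "related"),
])

print([f.name for f in meta.data_fields])
print([f.name for f in meta.get_fields(data=True, many_to_many=True)])
```

In this sketch the hidden `author_id` entry stays out of `data_fields`, which is the backwards-compatibility argument made for the "hidden" flag, and a single call with two flags set returns the union of two categories.
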
>>>
>>>     Does this sound any more sane as an API?
>>>      
>>>     My one lingering question is whether the "many_to_many"
>>>     name/category is too explicit. I can conceive how an ArrayField
>>>     could be considered a data field (it stores 0-1 arrays of data), or
>>>     a "many_to_many" field (because it stores 0-N instances of some
>>>     data). This all hinges on whether the definition for that field
>>>     category is that it is a relationship with another *model*, or if
>>>     it's just cardinality N data. It's trivial to call it a Data field
>>>     and just leave it at that, but I'm wondering if there might be
>>>     benefit in broadening the definition of "many_to_many".
>>>
>>>     Russ %-)
>>
>> When I look at this situation from the point of view of forms, there are
>>
>> 1. Fields of cardinality 0-1
>> 2. Fields of cardinality 0-N
>>
>> and
>>
>> a. Fields that do not represent reference to another model (object)
>> b. Fields that represent reference to another model (object)
>>
>> 1. and 2. are mutually exclusive; a. and b. are also mutually exclusive.
>>
>> IMO, this way the future Django form would not need to care whether the
>> field is m2m or ArrayField(ForeignKey) or ListField(EmbeddedModelField)
>> because all of them would be 2.&b.
>>
>> One may also want to add two mutually-exclusive subcategories to b:
>>
>> b1. Relationship is locally defined
>> b2. Relationship is not locally defined.
> 
> To add more examples to my proposition:
> 
> 1) CharField(), IntegerField(), FileField(), ImageField()
> 
>     are all members of both: a. and 1.
> 
> 2) ArrayField(), DictionaryField()
> 
>     are all members of both: a. and 2.
> 
> 3) ForeignKey(), GenericForeignKey(), EmbeddedModelField(),
> GenericRelation()
> 
>     are all members of both: b. and 1.
> 
> 4) ManyToManyField(), ArrayField(ForeignKey), ListField(EmbeddedModelField)
> 
>     are all members of both: b. and 2.
> 
> 
> As Collin Anderson wrote about "virtual" fields on 08/18/2014 07:12 PM:
> 
>> Also, I think we should avoid discriminating between "virtual" and
>> non-virtual (as with local vs parent). Why should it matter how a field
>> is stored in the database? I think the distinction will make it harder
>> to use non-relational databases.
> 
> One may want to expand his statement and say that the form, ideally,
> should not care whether the field relationship is locally defined or not.
> 
> Which is not to say that b1 and b2 subcategories are not useful at all,
> but they should not be needed in form representations.


Excuse me for posting multiple emails at a time, but I'd like to make a
correction:
It just occurred to me that I misused the term 'cardinality'. The best
way to correct myself is to replace this:

1. Fields of cardinality 0-1
2. Fields of cardinality 0-N

with this:

1. Fields that can have 0-1 values.
2. Fields that can have 0-N values.
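
To make the two axes concrete, here is a rough sketch of how a form layer might use them. Everything in it is hypothetical: `FieldKind`, the `KINDS` table and the control labels returned by `form_control()` are illustration only, and the field names are just strings taken from the examples earlier in this thread, not imports from Django.

```python
from typing import NamedTuple


class FieldKind(NamedTuple):
    many_values: bool        # False: 0-1 values (1.), True: 0-N values (2.)
    references_model: bool   # False: no reference (a.), True: reference (b.)


# Field names are plain strings from the examples in this thread.
KINDS = {
    "CharField": FieldKind(False, False),
    "IntegerField": FieldKind(False, False),
    "ArrayField": FieldKind(True, False),
    "ForeignKey": FieldKind(False, True),
    "ManyToManyField": FieldKind(True, True),
    "ArrayField(ForeignKey)": FieldKind(True, True),
    "ListField(EmbeddedModelField)": FieldKind(True, True),
}


def form_control(kind):
    # Hypothetical labels: the form only looks at the two flags,
    # never at the concrete field class or how it is stored.
    if kind.references_model:
        return "multiple choice" if kind.many_values else "single choice"
    return "list of inputs" if kind.many_values else "single input"


for name, kind in KINDS.items():
    print(f"{name}: {form_control(kind)}")
```

The point of the sketch is that ManyToManyField, ArrayField(ForeignKey) and ListField(EmbeddedModelField) all end up with the same control, because all three are 2. and b.
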


Thanks for the brilliant work and best regards,
Ivan

> 
>>>     -- 
>>>     You received this message because you are subscribed to the Google
>>>     Groups "Django developers" group.
>>>     To unsubscribe from this group and stop receiving emails from it,
>>>     send an email to [email protected]
>>>     <mailto:[email protected]>.
>>>     To post to this group, send email to
>>>     [email protected]
>>>     <mailto:[email protected]>.
>>>     Visit this group at http://groups.google.com/group/django-developers.
>>>     To view this discussion on the web visit
>>>     
>>> https://groups.google.com/d/msgid/django-developers/CAJxq84_OcibE72RKB9T60BJW9AtY8_YYhmhM5dXH36TtW3KsYw%40mail.gmail.com
>>>     
>>> <https://groups.google.com/d/msgid/django-developers/CAJxq84_OcibE72RKB9T60BJW9AtY8_YYhmhM5dXH36TtW3KsYw%40mail.gmail.com?utm_medium=email&utm_source=footer>.
>>>
>>>     For more options, visit https://groups.google.com/d/optout.
>>>
>>>
>>
> 

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/django-developers.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-developers/53F497B1.1010303%40gmail.com.
For more options, visit https://groups.google.com/d/optout.
