Hi internals,

As you are surely aware, serialization in PHP is a big mess. Said mess is
caused by some fundamental issues in the serialization format, and
exacerbated by the existence of the Serializable interface. Fixing the
serialization format is likely not possible at this point, but we can
replace Serializable with a better alternative and I'd like to start a
discussion on that.

The problem is essentially that Serializable::serialize() is expected to
return a string, which is generally obtained by recursively calling
serialize() in the Serializable::serialize() implementation. This
serialize() call shares state information with the outer serialize(), to
ensure that two references to the same object (or the same reference) will
continue referring to a single object/reference after serialization.

This causes two big issues:

First, the implementation is highly order-dependent. If
Serializable::serialize() contains multiple calls to serialize(), then
calls to unserialize() have to be repeated **in the same order** in
Serializable::unserialize(), otherwise unserialization may fail or be
corrupted. In particular this means that using parent::serialize() and
parent::unserialize() is unsafe. (See also
https://bugs.php.net/bug.php?id=66052 and linked bugs.)
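
To make the ordering problem concrete, here is a minimal sketch of the
usual pattern. The class and property names (Base, Child, $a, $b) are made
up for illustration. Child::serialize() runs the parent's nested
serialize() call before its own, while Child::unserialize() has to run its
own unserialize() call first, so the two directions visit the nested calls
in different orders. With plain scalars this goes unnoticed; once the
payloads contain objects or references shared between them, the
back-references may no longer line up.

```php
<?php
// Hypothetical classes, for illustration only.
class Base implements Serializable
{
    public $a;

    public function serialize()
    {
        // Nested call: shares back-reference state with the outer serialize().
        return serialize($this->a);
    }

    public function unserialize($data)
    {
        $this->a = unserialize($data);
    }
}

class Child extends Base
{
    public $b;

    public function serialize()
    {
        // The array is built first, so parent::serialize() (nested call 1)
        // runs BEFORE the serialize() of the array itself (nested call 2).
        return serialize([$this->b, parent::serialize()]);
    }

    public function unserialize($data)
    {
        // Here the order is reversed: the array (call 2) has to be decoded
        // before the parent payload (call 1) is even available.
        [$b, $parentData] = unserialize($data);
        $this->b = $b;
        parent::unserialize($parentData);
    }
}

$child = new Child();
$child->a = 1;
$child->b = 2;

// Scalars survive the round trip; shared objects are where this breaks down.
$copy = unserialize(serialize($child));
```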

Second, the existence of Serializable introduces security issues that we
cannot fix. Allowing the execution of PHP code during unserialization is
unsafe, and even innocuous-looking code is easily exploited. We have
recently mitigated __wakeup() based attacks by delaying __wakeup() calls
until the end of the unserialization. We cannot do the same for
Serializable::unserialize() calls, as their design strictly requires the
unserialization context to still be active during the call. Similarly,
Serializable prevents an up-front validation pass of the serialized string,
as the format used for Serializable objects is user-defined.

The delayed __wakeup() mitigation mentioned in the previous point also
interacts badly with Serializable: because __wakeup() calls are delayed to
the end of the unserialization, Serializable::unserialize() sees objects
that have not been woken up yet. (See also
https://bugs.php.net/bug.php?id=74436.)

In the end, everything comes down to the fact that Serializable requires
nested serialization calls with context sharing.

The alternative mechanism (__sleep + __wakeup) does not have these issues
(anymore), but it is not sufficiently flexible for general use: Notably,
__sleep() allows you to limit which properties are serialized, but the
properties still have to actually exist on the object.
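
As a small sketch of that limitation (the Report class and its properties
are hypothetical): __sleep() can only return names of properties, so it
works well for leaving something out, but it offers no way to hand the
engine a computed or restructured representation unless that value is
first stored in a real property.

```php
<?php
// Hypothetical class, for illustration only.
class Report
{
    public $rows = [];
    public $cache = [];   // derived data that can be rebuilt at any time

    public function __sleep()
    {
        // Only property NAMES can be returned here, not values.
        return ['rows'];
    }

    public function __wakeup()
    {
        // Reset the derived state after unserialization.
        $this->cache = [];
    }
}

$report = new Report();
$report->rows = [1, 2];
$report->cache = ['sum' => 3];

$payload = serialize($report);   // contains "rows" but not "cache"
$copy = unserialize($payload);
```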

I'd like to propose the addition of a new mechanism which essentially works
the same way as Serializable, but uses arrays instead of strings and does
not share context. I'm not sure about the naming (RealSerializable,
anyone?), so I'll just go with magic methods __serialize() and
__unserialize() for now:

    public function __serialize() : array;
    public function __unserialize(array $data) : void;

From a userland perspective the implementation should be the same as for
Serializable methods, but with interior serialize()/unserialize() calls
stripped out. Right now, Serializable implementations usually already do
something like "return serialize([ ... ])". This would change to just
"return [ ... ]", moving the serialize()/unserialize() call into the
engine, where we can perform it safely and robustly.
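
A minimal sketch of what this could look like in userland, assuming an
engine that implements the proposed methods. The User class and its
properties are made up for illustration; the comments show the
Serializable-style code that each method would replace.

```php
<?php
// Hypothetical class; assumes the proposed __serialize()/__unserialize().
class User
{
    private $name;
    private $roles;

    public function __construct(string $name, array $roles)
    {
        $this->name = $name;
        $this->roles = $roles;
    }

    // Serializable version: return serialize([$this->name, $this->roles]);
    public function __serialize(): array
    {
        return ['name' => $this->name, 'roles' => $this->roles];
    }

    // Serializable version: [$this->name, $this->roles] = unserialize($data);
    public function __unserialize(array $data): void
    {
        $this->name = $data['name'];
        $this->roles = $data['roles'];
    }

    public function getName(): string
    {
        return $this->name;
    }

    public function getRoles(): array
    {
        return $this->roles;
    }
}

// No nested serialize()/unserialize() calls in user code any more.
$copy = unserialize(serialize(new User('alice', ['admin'])));
```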

The new methods should reuse the "O" serialization format, rather than
introducing a new one. This allows a measure of interoperability with
previous PHP versions, which can still decode serialized strings from newer
versions using __wakeup().

If an object has both __wakeup() and __unserialize(), then __unserialize()
should be called. If an object implements both Serializable::unserialize()
and __unserialize(), then we should invoke one or the other based on
whether "C" or "O" serialization is used.
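
To illustrate the intended interaction, here is a sketch with a
hypothetical Point class that defines both hooks. It assumes an engine
implementing the proposal; the $restoredBy property exists only to record
which hook ran.

```php
<?php
// Hypothetical class; assumes the proposed __serialize()/__unserialize().
class Point
{
    public $x = 0;
    public $y = 0;
    public $restoredBy = 'nothing';

    public function __serialize(): array
    {
        return ['x' => $this->x, 'y' => $this->y];
    }

    public function __unserialize(array $data): void
    {
        $this->x = $data['x'];
        $this->y = $data['y'];
        $this->restoredBy = '__unserialize';
    }

    // Kept as a hook for versions that do not know __unserialize().
    public function __wakeup()
    {
        $this->restoredBy = '__wakeup';
    }
}

$point = new Point();
$point->x = 1;
$point->y = 2;

// Under the proposal the payload uses the "O" format, not "C".
$payload = serialize($point);

// With both hooks present, __unserialize() should win over __wakeup().
$copy = unserialize($payload);
```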

Thoughts?

Nikita
