Re: [v8-users] Preliminary RFC: Stabilizing the V8 script compiler cached data format

Jace Mogill Fri, 23 Jul 2021 17:02:33 -0700

All,

Since this topic has already prompted so much work and discussion, I'll 
chime in as an opinionated but passive community member whose project is 
also ultimately limited by its ability to reason about the storage sequence 
of V8 objects.


Startup time is often a driving consideration, but any program which spends 
a substantial portion of execution time converting JSON into a native V8 
representation would benefit from having alternatives at runtime.

I am the author of the Extended Memory Semantics module (
https://github.com/mogill/ems/) which enables a JS (or Python or C) program 
with petabytes of persistent data to start with no overhead because none of 
the data is read from storage or parsed from JSON to native representations 
until it is referenced at runtime.  The downside, of course, is that the 
parsing cost is paid over and over again at runtime.  Depending on the use 
case, this approach is somewhere between ideal and pathological.
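
A minimal sketch of that trade-off, written in plain Python rather than with the actual EMS API. The class name `LazyJSONStore` and its `parse_count` counter are hypothetical and exist only for illustration: values stay as raw JSON text, nothing is parsed at construction, and every reference pays the conversion again.

```python
import json


class LazyJSONStore:
    """Hypothetical store that keeps every value as raw JSON text."""

    def __init__(self, raw_records):
        # No parsing happens here, so startup does no conversion work.
        self._raw = raw_records
        self.parse_count = 0

    def __getitem__(self, key):
        # Each reference converts JSON text to a native object again.
        self.parse_count += 1
        return json.loads(self._raw[key])


store = LazyJSONStore({"a": '{"x": 1}', "b": "[1, 2, 3]"})

# Reading three elements of the same record parses that record three times.
total = sum(store["b"][i] for i in range(3))
print(total, store.parse_count)
```

A workload that touches each record once sees this as nearly free; a loop like the one above is the pathological end of the range.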

Complementary data- and task-parallel JSON lexing implementations already 
exist (https://github.com/simdjson/simdjson, 
https://github.com/mogill/parallel-xml2json) but there's no way to store 
the results in a way V8 can use, and thus an extra copy in/out step is 
needed.
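
To make the copy in/out step concrete, here is a small Python analogy (not V8 code). It assumes, purely for illustration, that an external lexer has left three native-endian float64 values in a buffer. The first path copies them into runtime-owned objects; the second reads and writes the same bytes through a typed view.

```python
import struct

# Assumed layout for illustration: three native-endian float64 values,
# as an external lexer might leave them in a shared buffer.
buffer = bytearray(struct.pack("3d", 1.5, 2.5, 3.5))

# Copy-in: the runtime builds its own objects from the buffer contents.
copied = list(struct.unpack("3d", buffer))

# In place: a typed view over the same bytes, with no second copy.
view = memoryview(buffer).cast("d")

# A write through the view lands in the shared buffer; the copy is unaffected.
view[0] = 10.0
in_place_first = struct.unpack_from("d", buffer, 0)[0]
```

The view only works because both sides agree on the element type and layout, which is exactly the agreement that is missing between V8 and external producers today.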

I see several "bookend" options which may be combined to varying degrees:
  - A lowest common denominator data storage sequence is defined by V8
  - Applications gain the ability to introspect V8's object storage 
sequences at runtime
  - Applications can tell V8 how data is stored in memory and V8 can adapt 
to an existing storage sequence
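
As a rough analogy for the last two options, Python's `ctypes` lets an application declare a layout, inspect field offsets at runtime, and overlay that layout on storage that already exists. Nothing below reflects how V8 stores objects; the `Point` structure and its fields are hypothetical.

```python
import ctypes


class Point(ctypes.Structure):
    # Hypothetical application-defined layout.
    _fields_ = [
        ("x", ctypes.c_double),
        ("y", ctypes.c_double),
        ("tag", ctypes.c_uint32),
    ]


# Introspection: map each field name to its (offset, size) in bytes.
layout = {
    name: (getattr(Point, name).offset, getattr(Point, name).size)
    for name, _ in Point._fields_
}

# Adaptation: overlay the declared layout on existing storage without copying.
storage = bytearray(ctypes.sizeof(Point))
p = Point.from_buffer(storage)
p.x, p.y, p.tag = 1.0, 2.0, 7
```

Writes through `p` go straight into `storage`, so another consumer that knows the same layout can read them without a conversion pass.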

I'd be only too happy to keep some form of this conversation going if it 
results in alternatives to copy in/out semantics for all V8 
data.

               -J


On Thursday, July 22, 2021 at 7:18:26 PM UTC-4 Vitali wrote:

> Hi Leszek,
>
> Apologies for the delayed reply - I've been a bit swamped at work the past 
> couple of days. Thank you for the excellent details & we'll align our plans 
> accordingly. Some replies inline.
>
> I've replied privately to Jacob's concern as I don't want to derail this 
> conversation.
>
> On Tue, Jul 20, 2021 at 3:19 AM Leszek Swirski <les...@chromium.org> 
> wrote:
>
>> Hi Vitali,
>>
>> Stabilising the cached data format as-is is pretty challenging; the cache 
>> as written is pretty much a direct field-by-field serialisation of the 
>> internal data structures, so freezing the cache would mean freezing the 
>> shapes of those internal objects, effectively making the internal fields an 
>> API-level guarantee. Furthermore, it's a backdoor to a stable bytecode 
>> format, which is something we've also pushed back on as it severely limits 
>> our ability to work on the interpreter; if we wanted to have a slightly 
>> weaker constraint of at least guaranteeing backwards compatibility with old 
>> bytecode, we'd have to vastly expand our test suite with old bytecodes in 
>> order to try to maintain this backwards compatibility, and even then I'm 
>> not sure we could fully guarantee it if there's some edge case not covered in 
>> the test suite. Same story with porting code caches from older to newer 
>> versions; such a port would require a mapping from old to new, which would 
>> require a) some sort of log of what old fields/bytecodes translate to what 
>> new ones, and b) heavy testing to make sure that this mapping is valid. 
>> This is a big security problem; the deserialisation is pretty dumb (for 
>> performance reasons), and just spits out data onto the V8 heap without e.g. 
>> checking if the number of fields match. Having bugs in the old->new 
>> mapping, or in the backwards compatibility, would open up a whole Pandora's 
>> box of security issues, where one deleted field in an edge case that tests 
>> don't cover would become an out-of-bounds write widget.
>>
>> Given that this would greatly increase our development complexity 
>> (maintaining a stable API is already a lot of trouble for us), would be a 
>> big source of security issues, and I don't expect it to provide much 
>> benefit for Chrome (since we expect websites to change more often than 
>> Chrome versions), I don't see us either working on (or accepting patches 
>> for) a stable or even upgradeable cache.
>>
>> I'd be curious to know if you've actually observed/measured script parse 
>> time being a big problem, or whether you're more seeing issues due to lazy 
>> function compilation time. We've done a lot of work on parse time in recent 
>> years, so it's not as slow as (some) people assume. 
>>
> What's the best way to measure script parse time vs lazy function 
> compilation time? It's been a few months since I last looked at this so my 
> memory is a bit hazy on whether it was instantiating 
> v8::ScriptCompiler::Source, v8::ScriptCompiler::CompileUnboundScript, or 
> the combined time of both (although I suspect both count as script parse 
> time?). I do recall that on my laptop, using the code cache basically 
> halved the time on larger scripts of what I was measuring & I suspect I 
> would have looked at the overall time to instantiate the isolate with a 
> script (it was a no-op on smaller scripts, so I suspect we're talking about 
> script parse time).
>
> FWIW, if it's helpful, when I profiled a stress test of isolate 
> construction on my machine with a release build, I saw V8 spending a lot of 
> time deserializing the snapshot (seemingly once for the isolate & then 
> again for the context). Breakdown of the flamegraph:
> * ~22% of total runtime to run NewContextFromSnapshot. Within that ~5% of 
> total runtime was spent just decompressing the snapshot & the rest was 
> deserializing it (17%). I thought there was only 1 snapshot. Couldn't the 
> decompression happen once in V8System instead?
> * 9% of total runtime spent decompressing the snapshot for the isolate (in 
> other words 14% of total runtime was spent decompressing the snapshot).
>
> In our use-case we construct a lot of isolates in the same process. I'm 
> curious if there are opportunities to extend V8 to utilize COW to reduce the 
> memory & CPU impact of deserializing the snapshot multiple times. Is my 
> guess correct that deserialization is actually doing non-trivial things 
> like relocating objects or do you think there's a 0-copy approach that can 
> be taken with serializing/deserializing the snapshot so that it's prebuilt 
> in the right format (perhaps even without any compression)?
>
> With respect to compression, do you think that maybe the snapshot could be 
> moved to being provided when V8System is constructed so that all isolates 
> deserialize out of the same decompressed snapshot?
>
> Apologies if these questions are nonsensical. I'm still trying to learn 
> how the internals of V8 hook up together.
>  
>
>> We're also prototyping a potential stable & standardisable snapshot 
>> format for the results of partial script execution, which could help you if 
>> you're seeing large script "setup" code being an issue, but it wouldn't 
>> store compiled bytecode (for the above reasons).
>>
>> I appreciate that this might be a disappointing answer for you, but 
>> having flexibility with internal objects and bytecode is one of the things 
>> that allows us to stay performant and secure.
>>
> I fully understand. I'm definitely interested in the snapshot format since 
> presumably anything that helps the web here will also help us. Is there a 
> paper I can reference to read up more on the proposal? I've seen a few in 
> the wild from the broader JS community but nothing about V8's plans here. I 
> have no idea if that will help our workload but it's certainly something 
> we're open to exploring.
>
> Thanks,
> Vitali
>
> - Leszek
>>
>> On Monday, July 19, 2021 at 9:00:52 PM UTC+2 lewis....@gmail.com wrote:
>>
>>> Hi Vitali,
>>>
>>> I’m neither from the V8 team, nor an expert in this subject matter. Just 
>>> wanted to drop an interesting project: Hermes - https://hermesengine.dev 
>>> , a JavaScript engine by Facebook that is tailored for fast startup times. 
>>> It does this by precompiling JavaScript into bytecode at build time.
>>>
>>> So something like this should be possible maybe.
>>>
>>> Best,
>>> Joe
>>>
>>> On Mon, Jul 19, 2021 at 9:32 PM Vitali Lovich <vlo...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I wanted to kick off a discussion and solicit some thoughts on whether 
>>>> it would be operationally feasible to try to stabilize the cached data 
>>>> format of the compiler.
>>>>
>>>> The context is that I work on Cloudflare Workers. We'd like to increase 
>>>> the script size we allow our customers to upload, but we have concerns 
>>>> about the performance impact that will have (specifically script parse 
>>>> time). One mitigation for this would be to leverage the script compiler's 
>>>> cached data & generate the cache whenever the user uploads a script. This 
>>>> way we can precompute the cached data on upload & deliver it alongside the 
>>>> script.
>>>>
>>>> Unfortunately, this approach has a major stumbling block which is that 
>>>> we track V8 releases as they're published. That means our V8 version 
>>>> changes roughly every week which would (at best) necessitate us 
>>>> regenerating the cache for all the scripts on a weekly basis. This adds 
>>>> scalability & implementation complexity concerns (especially since we may 
>>>> have multiple versions of V8 running at one time).
>>>>
>>>> I'm not looking to discuss implementation specific details, but more 
>>>> trying to get an overview of the opinions from the talented V8 team.
>>>>
>>>>    - I haven't actually examined yet what the structure of the code 
>>>>    cache actually looks like. Are there prohibitive technical blockers 
>>>>    that can't really be resolved that make this a non-starter?
>>>>    - Are there meaningful maintenance/security/implementation 
>>>>    concerns? I'm assuming there are very good reasons why the data is 
>>>>    version locked.
>>>>    - It's not necessarily a requirement to freeze it for all time 
>>>>    (although that would of course be ideal). What is the cadence for this 
>>>>    format actually changing (vs no-op version bumps for safety)? Would it 
>>>>    be possible to stabilize within a major V8 release (8->9, 9->10, etc) 
>>>>    or for 6 month periods?
>>>>    - If stabilizing is truly impossible (as I suspect it probably is), 
>>>>    would it be technically feasible to implement a cheaper "upgrade" that 
>>>>    converts the previous code cache to the current one? It's not ideal, 
>>>>    but it could significantly reduce the costs needed to upgrade many 
>>>>    scripts at once
>>>>
>>>> I suspect that any improvement here would also apply to Chrome in the 
>>>> form of a more consistent performance experience after an upgrade.
>>>>
>>>> We do have a fallback plan that's workable within the current 
>>>> architecture, but it's got some downsides that would be neat to bypass by 
>>>> stabilizing the format. Appreciate any feedback/insights anyone can offer.
>>>>
>>>> Thanks,
>>>> Vitali
>>>>
>>>> -- 
>>>> -- 
>>>> v8-users mailing list
>>>> v8-u...@googlegroups.com
>>>> http://groups.google.com/group/v8-users
>>>> --- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "v8-users" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to v8-users+u...@googlegroups.com.
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/v8-users/CAF8PYMgNXRdvW16Sb%3DwRaU21XGcMG3eBgkz_ey65%2BX7DdQ0a6g%40mail.gmail.com
>>>>  
>>>> <https://groups.google.com/d/msgid/v8-users/CAF8PYMgNXRdvW16Sb%3DwRaU21XGcMG3eBgkz_ey65%2BX7DdQ0a6g%40mail.gmail.com?utm_medium=email&utm_source=footer>
>>>> .
>>>>
>

