On 04/18/2017 06:21 AM, Ashijeet Acharya wrote:
> On Tue, Apr 18, 2017 at 01:59, John Snow <js...@redhat.com> wrote:
> > On 04/15/2017 04:38 AM, Ashijeet Acharya wrote:
> > > Hi,
> > >
> > > Some of you are already aware, but for the benefit of the open
> > > list, this mail is regarding the task mentioned here:
> > > http://wiki.qemu-project.org/ToDo/Block/DmgChunkSizeIndependence
> >
> > OK, so the idea here is that we should be able to read portions of
> > chunks instead of buffering entire chunks, because chunks can be
> > quite large and an unverified DMG file should not be able to cause
> > QEMU to allocate large portions of memory.
> >
> > Currently, QEMU has a maximum chunk size and it will not open DMG
> > files that have chunks that exceed that size, correct?
>
> Yes, it has an upper limit of 64MiB at the moment and refuses to
> cater to anything beyond that.
>
> > > I had a chat with Fam regarding this and he suggested a solution
> > > where we fix the output buffer size to a max of, say, 64K and keep
> > > inflating until we reach the end of the input stream. We extract
> > > the required data when we enter the desired range and discard the
> > > rest. Fam, however, termed this only a "quick fix".
> >
> > So it looks like your problem now is how to allow reads to subsets
> > while tolerating zipped chunks, right?
>
> Yes.
>
> > We can't predict where the data we want is going to appear
> > mid-stream, but I'm not that familiar with the DMG format, so what
> > does the data look like and how do we seek to it in general?
>
> If I understood correctly what you meant: the data is divided into
> three types of chunks,
> a) uncompressed
> b) zlib compressed
> c) bz2 compressed
>
> All these chunks appear in random order depending on the file.
>
> ATM we decompress the whole chunk into a buffer and read it sector by
> sector until we have what we need or we run out of output in that
> chunk.
>
> If you meant something else there, let me know.
>
> > We've got the mish blocks stored inside of the ResourceFork
> > (right?), and
>
> I haven't understood yet what a ResourceFork is, but it's safe to say
> from what I know that mish blocks do appear inside resource forks and
> contain all the required info about the chunks.
>
> > each mish block contains one or more chunk records. So given any
> > offset into the virtual file, we at least know which chunk it
> > belongs to, but thanks to zlib, we can't just read the bits we care
> > about.
> >
> > (Correct so far?)
>
> Absolutely.
>
> > > The ideal fix would obviously be if we could somehow predict the
> > > exact location inside the compressed stream relative to the
> > > desired offset in the output decompressed stream, such as a
> > > specific sector in a chunk. Unfortunately this is not possible
> > > without doing a first pass over the decompressed stream, as
> > > answered in the zlib FAQ here:
> > > http://zlib.net/zlib_faq.html#faq28
> >
> > Yeah, I think you need to start reading the data from the beginning
> > of each chunk -- but it depends on the zlib data. It COULD be broken
> > up into different pieces, but there's no way to know without
> > scanning it in advance.
>
> Hmm, that's the real issue I am facing. MAYBE break it up like this
> (see the sketch below):
>
> a) inflate till the required starting offset in one go
> b) save the access point and discard the undesired data
> c) proceed by inflating one sector at a time, and stop if we hit the
>    chunk's end or the request's end
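>
> Roughly, in terms of the raw zlib calls, I picture something like the
> following (an untested sketch; chunk_read_sector() and everything in
> it is invented, nothing from the current dmg.c):
>
> #include <stdint.h>
> #include <zlib.h>
>
> #define SECTOR_SIZE 512
>
> /* zs must already be inflateInit()ed and fed the chunk's compressed
>  * data via next_in/avail_in by the caller. */
> static int chunk_read_sector(z_stream *zs, uint64_t skip_bytes,
>                              uint8_t *sector_buf)
> {
>     uint8_t scratch[SECTOR_SIZE];
>     int ret;
>
>     /* (a)+(b): inflate and discard until the desired offset */
>     while (skip_bytes > 0) {
>         unsigned int want = skip_bytes < sizeof(scratch)
>                             ? (unsigned int)skip_bytes
>                             : (unsigned int)sizeof(scratch);
>         zs->next_out = scratch;
>         zs->avail_out = want;
>         ret = inflate(zs, Z_NO_FLUSH);
>         if (ret != Z_OK && ret != Z_STREAM_END) {
>             return -1;
>         }
>         skip_bytes -= want - zs->avail_out;
>         if (ret == Z_STREAM_END && skip_bytes > 0) {
>             return -1;   /* chunk ended before the offset */
>         }
>     }
>
>     /* (c): inflate exactly one sector; zs itself now doubles as the
>      * saved access point for the next sequential request. */
>     zs->next_out = sector_buf;
>     zs->avail_out = SECTOR_SIZE;
>     ret = inflate(zs, Z_NO_FLUSH);
>     return (ret == Z_OK || ret == Z_STREAM_END) ? 0 : -1;
> }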
>
> > (Unrelated:
> >
> > Do we have a zlib format driver?
> >
> > It might be cute to break up such DMG files and offload zlib
> > optimization to another driver, like this:
> >
> > [dmg]-->[zlib]-->[raw]
> >
> > And we could pretend that each zlib chunk in this file is virtually
> > its own zlib "file" and access it with modified offsets as
> > appropriate. Any optimizations we make could just apply to this
> > driver.
> >
> > [anyway...])
>
> Are you thinking about implementing zlib just like we have bz2
> implemented currently?
>
> > Pre-scanning for these sync points is probably a waste of time, as
> > there's no way to know (*I THINK*) how big each sync-block would be
> > decompressed, so there's still no way this helps you seek within a
> > compressed block...
>
> I think we can predict that actually, because we know the number of
> sectors present in that chunk and each sector's size too. So...
>
> > > AFAICT after reading the zran.c example in zlib, the above
> > > mentioned ideal fix would ultimately lead us to decompress the
> > > whole chunk in steps at least once to maintain an access point
> > > lookup table. This solution is better if we get several random
> > > access requests over different read requests; otherwise it ends up
> > > being equal to the fix suggested by Fam, plus some extra effort
> > > needed in building and maintaining access points.
> >
> > Yeah, probably not worth it overall... I have to imagine that most
> > uses of DMG files are for iso-like cases for installers, where
> > accesses are going to be either sequential (or mostly sequential)
> > and most data will not be read twice.
>
> Exactly, if we are sure that there will be no requests to read the
> same data twice, it's a completely wasted effort. But I am not aware
> of the use cases of DMG since I only learned about it last week. So
> maybe someone can enlighten me on those if possible?
>
> > I could be wrong, but that's my hunch.
> >
> > Maybe you can cache the state of the INFLATE process such that once
> > you fill the cache with data, we can simply resume the INFLATE
> > procedure when the guest almost inevitably asks for the next
> > subsequent bytes.
> >
> > That'd probably be efficient /enough/ in most cases without having
> > to worry about a metadata cache for zlib blocks or a literal data
> > cache for inflated data.
>
> Yes, I have a similar approach in mind: inflate one sector at a time,
> save the offset in the compressed stream, and treat it as an access
> point for the next one.
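>
> Something like this is what I have in mind for keeping the state
> around between requests (again an untested sketch; DMGChunkCache and
> all the names in it are invented):
>
> #include <stdbool.h>
> #include <stdint.h>
> #include <zlib.h>
>
> /* One cached inflate state, so a sequential read can resume where
>  * the previous one stopped instead of re-inflating the whole
>  * chunk. */
> typedef struct DMGChunkCache {
>     bool valid;
>     uint32_t chunk;       /* chunk the live z_stream belongs to */
>     uint64_t out_offset;  /* decompressed bytes produced so far */
>     z_stream zs;          /* inflateInit()ed, input already fed */
> } DMGChunkCache;
>
> static int cached_read(DMGChunkCache *c, uint32_t chunk,
>                        uint64_t offset, uint8_t *buf,
>                        unsigned int bytes)
> {
>     int ret;
>
>     if (!c->valid || c->chunk != chunk || c->out_offset != offset) {
>         /* Miss: rewind and skip forward again, as in the earlier
>          * sketch. */
>         if (inflateReset(&c->zs) != Z_OK) {
>             return -1;
>         }
>         c->chunk = chunk;
>         /* ... re-feed zs.next_in/avail_in with the compressed chunk
>          *     and inflate-and-discard 'offset' bytes here ... */
>         c->out_offset = offset;
>     }
>
>     c->zs.next_out = buf;
>     c->zs.avail_out = bytes;
>     ret = inflate(&c->zs, Z_NO_FLUSH);
>     if (ret != Z_OK && ret != Z_STREAM_END) {
>         c->valid = false;
>         return -1;
>     }
>     c->out_offset = offset + (bytes - c->zs.avail_out);
>     c->valid = true;
>     return 0;
> }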

Right, just save whatever zlib library state you need to save and
resume inflating. Probably the most reasonable way to go for v1. As
long as you can avoid re-inflating prior data in a chunk when
possible, this is probably good.

> > Or maybe I'm full of crap, I don't know -- I'd probably try a few
> > approaches and see which one empirically worked better.
> >
> > > I have not explored the bzip2 compressed chunks yet, but have
> > > naively assumed that we will face the same situation there?
> >
> > Not sure.
>
> I will look it up :)
>
> Stefan/Kevin, do you have any other preferred solution in mind?
> Because I am more or less getting inclined towards starting to
> inflate one sector at a time and submitting a v1.
>
> Ashijeet
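
For the bzip2 chunks mentioned above: libbz2 exposes the same kind of
streaming interface as zlib (a bz_stream with next_in/avail_in and
next_out/avail_out), so the same skip-and-resume approach should carry
over. A minimal, untested sketch of the equivalent inner call
(bz2_read_some() is an invented name):

#include <bzlib.h>

/* Decompress the next 'len' bytes of a bz2 chunk, analogous to the
 * inflate() calls above; bzs must be BZ2_bzDecompressInit()ed and fed
 * the compressed chunk via next_in/avail_in beforehand. */
static int bz2_read_some(bz_stream *bzs, char *out, unsigned int len)
{
    int ret;

    bzs->next_out = out;
    bzs->avail_out = len;
    ret = BZ2_bzDecompress(bzs);
    return (ret == BZ_OK || ret == BZ_STREAM_END) ? 0 : -1;
}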