On 04/15/2017 04:38 AM, Ashijeet Acharya wrote:
> Hi,
>
> Some of you are already aware but for the benefit of the open list,
> this mail is regarding the task mentioned
> Here -> http://wiki.qemu-project.org/ToDo/Block/DmgChunkSizeIndependence
>
OK, so the idea here is that we should be able to read portions of
chunks instead of buffering entire chunks, because chunks can be quite
large and an unverified DMG file should not be able to cause QEMU to
allocate large amounts of memory. Currently, QEMU has a maximum chunk
size and it will not open DMG files that have chunks exceeding that
size, correct?

> I had a chat with Fam regarding this and he suggested a solution where
> we fix the output buffer size to a max of say "64K" and keep inflating
> until we reach the end of the input stream. We extract the required
> data when we enter the desired range and discard the rest. Fam however
> termed this as only a "quick fix".
>

So it looks like your problem now is how to allow reads to subsets
while tolerating zipped chunks, right? We can't predict where the data
we want is going to appear mid-stream, but I'm not that familiar with
the DMG format, so what does the data look like and how do we seek to
it in general?

We've got the mish blocks stored inside the ResourceFork (right?), and
each mish block contains one or more chunk records. So given any offset
into the virtual file, we at least know which chunk it belongs to
(there's a rough lookup sketch at the bottom of this mail), but thanks
to zlib, we can't just read the bits we care about. (Correct so far?)

> The ideal fix would obviously be if we can somehow predict the exact
> location inside the compressed stream relative to the desired offset
> in the output decompressed stream, such as a specific sector in a
> chunk. Unfortunately this is not possible without doing a first pass
> over the decompressed stream as answered on the zlib FAQ page
> Here -> http://zlib.net/zlib_faq.html#faq28
>

Yeah, I think you need to start reading the data from the beginning of
each chunk -- but it depends on the zlib data. It COULD be broken up
into different pieces, but there's no way to know without scanning it
in advance.

(Unrelated: Do we have a zlib format driver? It might be cute to break
up such DMG files and offload zlib optimization to another driver, like
this:

[dmg]-->[zlib]-->[raw]

We could pretend that each zlib chunk in this file is virtually its own
zlib "file" and access it with modified offsets as appropriate. Any
optimizations we make could then just apply to this driver.
[anyway...])

Pre-scanning for these sync points is probably a waste of time, as
there's no way to know (*I THINK*) how big each sync-block would be
once decompressed, so there's still no way this helps you seek within
a compressed block...

> AFAICT after reading the zran.c example in zlib, the above mentioned
> ideal fix would ultimately lead us to decompress the whole chunk in
> steps at least once to maintain an access point lookup table. This
> solution is better if we get several random access requests over
> different read requests, otherwise it ends up being equal to the fix
> suggested by Fam plus some extra effort needed in building and
> maintaining access points.
>

Yeah, probably not worth it overall... I have to imagine that most
uses of DMG files are iso-like cases for installers, where accesses
are going to be sequential (or mostly sequential) and most data will
not be read twice. I could be wrong, but that's my hunch.

Maybe you can cache the state of the INFLATE process, such that once
we fill the cache with data, we can simply resume the INFLATE
procedure when the guest almost inevitably asks for the next bytes in
sequence.
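Something along these lines is what I'm picturing -- a rough, untested
sketch, and every name in it (InflateCache, cached_inflate_read, the
64K scratch size) is invented for illustration rather than taken from
dmg.c. The idea: keep the z_stream alive between reads, fast-forward
through a bounded scratch buffer on a cache miss, and just keep
inflating on a sequential hit:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <zlib.h>

typedef struct InflateCache {
    z_stream zs;        /* live zlib state, positioned mid-chunk */
    uint32_t chunk;     /* chunk the state belongs to */
    uint64_t out_pos;   /* decompressed offset reached so far */
    bool valid;
} InflateCache;

/*
 * Read 'bytes' decompressed bytes at decompressed offset 'offset'
 * within 'chunk', whose compressed data is in in_buf/in_len.  The
 * cache must start out zeroed.  Returns 0 on success, -1 on error.
 * (Assumes the compressed chunk fits in a uInt; real code would feed
 * the input in pieces.)
 */
static int cached_inflate_read(InflateCache *c, uint32_t chunk,
                               const uint8_t *in_buf, uint64_t in_len,
                               uint64_t offset,
                               uint8_t *out, uint64_t bytes)
{
    uint8_t discard[64 * 1024];
    int ret;

    /* Restart only for a different chunk or if we're already past
     * 'offset'; otherwise the cached state can be fast-forwarded. */
    if (!c->valid || c->chunk != chunk || c->out_pos > offset) {
        if (c->valid) {
            inflateEnd(&c->zs);
            c->valid = false;
        }
        memset(&c->zs, 0, sizeof(c->zs));
        if (inflateInit(&c->zs) != Z_OK) {
            return -1;
        }
        c->zs.next_in = (Bytef *)in_buf;
        c->zs.avail_in = (uInt)in_len;
        c->chunk = chunk;
        c->out_pos = 0;
        c->valid = true;
    }

    /* Fast-forward: inflate into scratch space until we hit 'offset' */
    while (c->out_pos < offset) {
        uint64_t gap = offset - c->out_pos;
        c->zs.next_out = discard;
        c->zs.avail_out = gap < sizeof(discard) ? (uInt)gap
                                                : (uInt)sizeof(discard);
        uInt before = c->zs.avail_out;
        ret = inflate(&c->zs, Z_NO_FLUSH);
        c->out_pos += before - c->zs.avail_out;
        if (ret != Z_OK) {
            goto fail;      /* error, or the chunk ended too early */
        }
    }

    /* Inflate straight into the caller's buffer; leave the z_stream
     * live so the next sequential read resumes right here. */
    c->zs.next_out = out;
    c->zs.avail_out = (uInt)bytes;
    while (c->zs.avail_out > 0) {
        ret = inflate(&c->zs, Z_NO_FLUSH);
        if (ret == Z_STREAM_END) {
            break;
        }
        if (ret != Z_OK) {
            goto fail;
        }
    }
    if (c->zs.avail_out > 0) {
        goto fail;          /* chunk ended before 'bytes' arrived */
    }
    c->out_pos += bytes;
    return 0;

fail:
    inflateEnd(&c->zs);
    c->valid = false;
    return -1;
}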
That'd probably be efficient /enough/ in most cases without having to
worry about a metadata cache for zlib blocks or a literal data cache
for inflated data. Or maybe I'm full of crap, I don't know -- I'd
probably try a few approaches and see which one empirically worked
better.

> I have not explored the bzip2 compressed chunks yet but have naively
> assumed that we will face the same situation there?
>

Not sure.

> I would like the community's opinion on this and add their suggestions
> if possible to give me some new thinking points.
>
> Thanks
> Ashijeet
>
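Oh, and for completeness on the chunk-record point above: the
offset-to-chunk mapping itself ought to just be a search over the
table we parse out of the mish blocks. Another untested sketch -- the
field and function names here are invented (IIRC the real dmg.c keeps
parallel arrays rather than a struct per chunk, but the shape is the
same):

#include <stddef.h>
#include <stdint.h>

typedef struct Chunk {
    uint64_t out_offset;   /* decompressed offset this chunk covers */
    uint64_t out_length;   /* decompressed length of the chunk */
    uint64_t in_offset;    /* where its compressed data lives on disk */
    uint64_t in_length;    /* compressed length */
    uint32_t type;         /* raw / zlib / bzip2 / zero-fill / ... */
} Chunk;

/* Map a decompressed (guest-visible) offset to the chunk containing
 * it, assuming the table is sorted by out_offset.  Returns the chunk
 * index, or -1 if the offset falls outside every chunk. */
static int find_chunk(const Chunk *chunks, size_t n, uint64_t offset)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (offset < chunks[mid].out_offset) {
            hi = mid;
        } else if (offset >= chunks[mid].out_offset +
                             chunks[mid].out_length) {
            lo = mid + 1;
        } else {
            return (int)mid;   /* offset falls inside chunks[mid] */
        }
    }
    return -1;
}

Once you have the chunk index, the offset within that chunk's
decompressed data is just offset - chunks[i].out_offset, which is the
'offset' you'd feed to the inflate helper sketched earlier.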