Seth Willits wrote:

In my app, I import data from potentially very large files. In the first pass, I simply mmap'd the entire file, created a string using CFStringCreateWithBytesNoCopy, and went about my business. This works great until it hits the address-space limit when running as a 32-bit process, so now in the second pass I want to rework it a bit to only mmap a chunk (128 MB) at a time.

Now, if it were simply binary data, I could chop up the file however I wanted, but since the file I'm processing is actually a huge *text* file, I need to mmap an appropriate range so creating the string doesn't fail because a multi-byte character was split down the middle.

Change the buffer management.

Add a cushion to your mmap'ed chunk, say 1 MB, so you mmap in 129 MB at a time. When parsing the first 128 MB, everything proceeds normally, and there are no worries about splitting a multi-byte character. You can parse bytes after 128 MB because they're safely represented in the cushion area.

When the get-next-string starting position moves into the cushion area, then you re-mmap the next chunk (advance by 128 MB, i.e. buffer minus cushion) and reposition your pointers in the buffer. Then you have about 128 MB of no worries again.
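The window arithmetic above can be sketched in C roughly as follows. This is only an illustration under stated assumptions: the names window_t, window_at, and map_window are hypothetical, and the sketch assumes the chunk size is a multiple of the system page size (128 MB is, on common systems), since mmap requires a page-aligned file offset.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/mman.h>

/* Hypothetical description of one mapped window of the file. */
typedef struct {
    off_t  offset;  /* file offset where the mapping starts */
    size_t length;  /* bytes to map: chunk + cushion, clipped at end of file */
} window_t;

/* Compute the index-th window. Windows advance by `chunk` bytes, so each
 * one overlaps the start of the next by `cushion` bytes. */
window_t window_at(off_t index, off_t file_size, off_t chunk, off_t cushion)
{
    assert(index >= 0 && chunk > 0 && cushion >= 0);

    window_t w;
    w.offset = index * chunk;
    assert(w.offset < file_size);

    off_t remaining = file_size - w.offset;
    off_t want = chunk + cushion;
    w.length = (size_t)(remaining < want ? remaining : want);
    return w;
}

/* Map one window read-only. Returns MAP_FAILED on error.
 * Assumes w.offset is page-aligned (true when chunk is a page multiple). */
void *map_window(int fd, window_t w)
{
    return mmap(NULL, w.length, PROT_READ, MAP_PRIVATE, fd, w.offset);
}
```

When the parse position within the current window reaches `chunk`, munmap the old window, map window_at(index + 1, ...), and subtract `chunk` from the position. With a 128 MB chunk and a 1 MB cushion, each mapping is at most 129 MB.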

Choose a cushion size suitable for the maximum length of a multi-byte sequence. There's no magic to 1 MB, if something smaller suffices. And don't forget the combining character forms, where multiple multi-byte "characters" should remain together.
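As an illustration of why a small cushion can be enough, here is a sketch that assumes the file is UTF-8 (the original question does not say which encoding is in use). A UTF-8 sequence is at most 4 bytes, so a lead byte is never more than 3 bytes before a continuation byte. The function name utf8_safe_end is hypothetical, and it only protects individual encoded code points; it does not keep combining sequences together.

```c
#include <assert.h>
#include <stddef.h>

/* Given a tentative cut position `limit` in `buf`, return the largest
 * position <= limit that does not fall inside a UTF-8 sequence.
 * buf[limit] must be readable, which the cushion guarantees.
 * Assumes well-formed UTF-8; on malformed input it backs up at most 3 bytes. */
size_t utf8_safe_end(const unsigned char *buf, size_t limit)
{
    assert(buf != NULL);

    size_t end = limit;
    size_t back = 0;

    /* Continuation bytes have the bit pattern 10xxxxxx. If the byte at the
     * cut is one, the cut splits a character: step back to its lead byte. */
    while (end > 0 && back < 3 && (buf[end] & 0xC0) == 0x80) {
        end--;
        back++;
    }
    return end;
}
```

The bytes in [0, end) can then be handed to the string-creation call, and the next piece starts at `end`.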

  -- GG
