Thanks, this makes it much clearer to me. Some further discussion of writing a custom SIGSEGV handler can be found at https://stackoverflow.com/questions/24351359/mmap-for-remote-file. However, I will likely not go down that rabbit hole. :)
I will instead consider the recommendation to split the problem up into multiple messages with externally handled framing/indexing.

On Thursday, 16 August 2018 at 20:32:10 UTC+2, Kenton Varda wrote:
>
> Hi Björn,
>
> The easiest way to make this work would be to have the data set live on a
> network filesystem (e.g. NFS) or block device (e.g. NBD, iSCSI) which you
> can mount on your local system and then use mmap().
>
> If mounting a remote filesystem is not an option, it is technically
> possible to do everything in userspace instead -- but it's tricky.
> Essentially, you can implement a memory mapping entirely in userspace by
> writing your own signal handler for SIGSEGV. At startup, you would create
> an anonymous memory mapping that is at least the size of your remote file,
> and is marked to prohibit reading. When your program attempts to read from
> this space, a SIGSEGV signal is raised. In your signal handler, you look at
> what address the code was trying to access (from si_addr in the siginfo_t),
> you fetch the appropriate page from the remote server, you map that page
> into the right place in local memory, and then you mark it as readable. On
> return from the signal handler, the code continues on with the newly-mapped
> data.
>
> This is, of course, pretty advanced systems hacking, and unfortunately I
> don't know of a library that does it for you (though I bet one exists...
> somewhere).
>
> Otherwise, you need to split your data into smaller pieces that your
> application knows how to fetch explicitly as needed...
>
> -Kenton
>
> On Thu, Aug 16, 2018 at 10:32 AM, <[email protected]> wrote:
>
>> Hi,
>>
>> I'm investigating using Cap'n Proto as the basis for a format containing
>> a large collection of r-tree indexed data. The typical access pattern would
>> be to query the index, resulting in a set of nodes in the tree. The
>> collection of data would be physically clustered on node indices so that
>> one can efficiently seek and read the data items for the searched node
>> indexes.
>>
>> The recommendation for random access has been to simply use mmap, which I
>> assume would work well in this case, but AFAIK it's something that is only
>> used for files readily available on attached block storage. However, in
>> this case the full dataset might very well be too large to keep locally
>> and the preferred access method would be streaming access over network with
>> the same pattern of random access using index searches.
>>
>> I'm a C++ novice and I fail to understand if something remotely like this
>> can be done already with the reference C++ implementation. Indeed, I have
>> not even been able to understand if it supports sequential streaming access
>> of a part of a message - it seems assumed that a message is fully read into
>> RAM, except when using mmap, which would then be the only way to partially
>> read a message (sequential or random). But I do not want to give up yet,
>> perhaps there is something I'm missing?
>>
>> Regards,
>>
>> Björn
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "Cap'n Proto" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> Visit this group at https://groups.google.com/group/capnproto.
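
For anyone who does want to explore the SIGSEGV route described in Kenton's quoted reply, here is a rough sketch of the idea, assuming Linux and a single-threaded program. The names `fetchRemotePage`, `onSegv` and `mapRemote` are made up for illustration, and `fetchRemotePage` is only a stand-in that fills each page with a letter instead of talking to a server. Unlike the description above, it does not map a separately fetched page into place; it simply makes the faulting page writable, fills it, and then marks it read-only. A real network fetch inside a signal handler would need far more care (async-signal-safety, threads, error handling).

```cpp
#include <sys/mman.h>
#include <signal.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for "fetch the page from the remote server":
// fills the page with 'A' for page 0, 'B' for page 1, and so on.
static void fetchRemotePage(std::size_t pageIndex, unsigned char* dst,
                            std::size_t pageSize) {
  std::memset(dst, static_cast<int>('A' + pageIndex % 26), pageSize);
}

// State shared with the signal handler (this sketch supports one mapping).
static unsigned char* gBase = nullptr;
static std::size_t gSize = 0;
static std::size_t gPageSize = 0;

// SIGSEGV handler: if the fault is inside our region, populate the page.
static void onSegv(int, siginfo_t* info, void*) {
  auto addr = reinterpret_cast<std::uintptr_t>(info->si_addr);
  auto base = reinterpret_cast<std::uintptr_t>(gBase);
  if (addr < base || addr >= base + gSize) {
    // Not our region: restore the default action so that the faulting
    // instruction, when retried, crashes the process as usual.
    signal(SIGSEGV, SIG_DFL);
    return;
  }
  std::size_t pageIndex = (addr - base) / gPageSize;
  unsigned char* page = gBase + pageIndex * gPageSize;
  // Make the page writable so it can be filled in.
  if (mprotect(page, gPageSize, PROT_READ | PROT_WRITE) != 0) {
    signal(SIGSEGV, SIG_DFL);
    return;
  }
  fetchRemotePage(pageIndex, page, gPageSize);
  // Mark it read-only; on return, the faulting read is retried and succeeds.
  mprotect(page, gPageSize, PROT_READ);
}

// Reserves an inaccessible region of at least `size` bytes and installs
// the handler. Returns nullptr on failure.
const unsigned char* mapRemote(std::size_t size) {
  assert(gBase == nullptr && "this sketch supports a single mapping");
  gPageSize = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
  gSize = (size + gPageSize - 1) / gPageSize * gPageSize;  // round up to pages

  // Anonymous mapping with no access rights: any read raises SIGSEGV.
  void* p = mmap(nullptr, gSize, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) return nullptr;
  gBase = static_cast<unsigned char*>(p);

  struct sigaction sa;
  std::memset(&sa, 0, sizeof sa);
  sa.sa_sigaction = onSegv;
  sa.sa_flags = SA_SIGINFO;  // needed to receive si_addr in siginfo_t
  sigemptyset(&sa.sa_mask);
  if (sigaction(SIGSEGV, &sa, nullptr) != 0) return nullptr;
  return gBase;
}
```

After calling `mapRemote(n)`, reading any byte of the returned pointer triggers the handler the first time its page is touched; later reads of the same page go straight to memory. Pages are never evicted in this sketch, so a real implementation would also need a policy for unmapping pages again.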
