Hi David,
On 7/16/2020 11:44 AM, David Storrs wrote:
On Thu, Jul 16, 2020 at 10:09 AM George Neuner <gneun...@comcast.net
<mailto:gneun...@comcast.net>> wrote:
The problem seems under-specified. Can you say more about the
real purpose?
Basic version: It's a peer-to-peer encrypted swarmed file sharing
system that presents like Dropbox on the front end (i.e. "make a
change to the filesystem on peer A and peers B-Z will replicate that
change") and works something like Bittorrent on the back end in that
files are sent in chunks but it offers functionality that Bittorrent
does not, such as encrypted transfer, WoT authentication, etc.
Interesting. So I'm guessing your problem is to (compactly) represent
the state of the shared space.
Do you plan on having index servers, or are you aiming for a fully
distributed solution? And, if distributed, do you want each node to
maintain its own state picture of the shared space, or were you thinking
that nodes could just snoop admin broadcasts looking for mention of data
they don't currently have? [Your question about how to pair / collapse
messages suggests you might be considering a snoopy solution.]
Asking because keeping a state picture has scalability issues, a snoopy
solution has complexity issues, and (depending on latency) both have
issues with performing unnecessary work. In any event, I have some
suggestions.
Snoopy is the more interesting case. You start with a queue of file
operations to be done as gleaned from the admin messages - mkdir, rmdir,
fetch a file, delete a file, etc. - in whatever order the messages were
received.
Separately, you maintain a (hash table) mapping from pathnames to a list
of queue nodes that operate on that object. The map should use weak
references so that nodes can safely be removed from the queue and
discarded without also needing to update the map. If queue processing
gets to some operation first, any map reference to it will dissolve (be
replaced by #f).
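Roughly, in Racket, the bookkeeping might look something like this (all
names here are mine, and it is only a sketch, not a design you have to
follow):

#lang racket

;; One op per admin message.
(struct op (kind path) #:transparent)   ; kind: 'mkdir, 'rmdir, 'fetch, 'delete, ...

;; Pending operations in arrival order.  A boxed list keeps the sketch
;; short; a real queue structure would scale better.
(define pending (box '()))

;; pathname -> list of weak-boxes of ops.  make-weak-hash holds its
;; *keys* weakly, not its values, hence the explicit weak-boxes.
(define path->ops (make-hash))

(define (remember! o)
  (hash-update! path->ops (op-path o)
                (lambda (bs) (cons (make-weak-box o) bs))
                '()))

(define (live-ops path)
  ;; Dead boxes dereference to #f; keep only the live ops.
  (filter values (map weak-box-value (hash-ref path->ops path '()))))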
When a message is received, you look up the pathname in the map, and if a
complementary operation is found in the queue, you remove and discard
it. [You can also remove references in the map or just let them
dissolve depending on your handling.] Then simply discard the message.
Otherwise you queue whatever operation the message indicates and add a
reference to the queue node under the object's pathname in the map.
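Continuing the sketch above, pairing an incoming message against the
queue might look like this (complementary? is only an illustration; the
real table of pairings is your call):

(define (complementary? a b)
  (match* ((op-kind a) (op-kind b))
    [('fetch 'delete) #t] [('delete 'fetch) #t]
    [('mkdir 'rmdir)  #t] [('rmdir 'mkdir)  #t]
    [(_ _)            #f]))

(define (handle-message msg-op)
  (define partner
    (findf (lambda (o) (complementary? o msg-op))
           (live-ops (op-path msg-op))))
  (cond
    [partner
     ;; Drop the queued op and discard the message; the map entries
     ;; pointing at the dropped op will dissolve on their own.
     (set-box! pending (remq partner (unbox pending)))]
    [else
     (set-box! pending (append (unbox pending) (list msg-op)))
     (remember! msg-op)]))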
Extra complexity comes in noticing when a map entry (pathname) has no
operations left in the queue. Weak references don't just disappear -
they are changed to #f when the referenced object is no longer
reachable. However, AFAICT there is no hash table variant that permits
weak reference values, so you have to use weak-boxes, and those
continue to exist even after the objects they reference are gone.
Useless map entries will need to be identified and removed somehow.
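One way to notice those dead entries is to sweep the map periodically,
or whenever the queue processor runs; again, the names are mine:

(define (sweep-dead-entries!)
  ;; Snapshot the keys first so the table can be mutated safely.
  (for ([path (in-list (hash-keys path->ops))])
    (define live (filter weak-box-value (hash-ref path->ops path '())))
    (if (null? live)
        (hash-remove! path->ops path)
        (hash-set! path->ops path live))))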
Modeling the filesystem can be done rather simply with a trie in which
folders are represented by mutable hash tables and files by structures.
You can combine this with the operation queue above, but in this case
lookups can be done in the trie and queue references kept in the trie
nodes. And the trie provides a snapshot of the current state which may
be useful for other purposes.
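A sketch of that, with invented field names - each folder is a mutable
hash mapping a component name to either another folder hash or a file
struct:

(struct file-entry (size content-hash queued-ops) #:mutable #:transparent)

(define fs-root (make-hash))

;; Walk a list of path components, e.g. '("photos" "2020" "a.jpg"),
;; optionally creating missing folders along the way.
(define (walk dir parts #:create? [create? #f])
  (cond
    [(null? parts) dir]
    [(not (hash? dir)) #f]        ; ran into a file before the end
    [else
     (define child
       (hash-ref dir (car parts)
                 (lambda ()
                   (and create?
                        (let ([sub (make-hash)])
                          (hash-set! dir (car parts) sub)
                          sub)))))
     (and child (walk child (cdr parts) #:create? create?))]))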
The trick in either case is processing latency: you don't want to wait
too long, but if you really want to avoid unnecessary work you need to
delay performing file operations long enough that complementary messages
are likely to be received.
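One simple way to get that delay: stamp each op on arrival and have the
queue processor skip entries younger than some settling window (the
5-second figure below is just a placeholder):

(define settle-window 5)   ; seconds

(struct timed-op op (arrived) #:transparent)   ; extends op with an arrival time

(define (ripe? o)
  ;; Only act once a complementary message has had a chance to arrive.
  (>= (- (current-seconds) (timed-op-arrived o)) settle-window))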
What if messages are lost permanently, e.g., due to hardware crash?
What if you receive a create but a corresponding delete or update is
lost - then your information / picture of the file system state is
wrong.
What if you receive a file delete without a corresponding create? In
the absence of other information, can you even assume there *was* a
create?
If these messages are sent in response to user actions, can they ever
be sent mistakenly?
The ultimate answer to these questions is "If things get out of sync
in a way that the system cannot resolve, it will be flagged for a
human to resolve." There are things we do that mitigate them -- for
example, a write-ahead log for messages received from peers -- but we
acknowledge that we cannot resolve 100% of situations automatically.
Neither can any other file replication service (Dropbox, Box.com, etc.).
Also relevantly, differences are reconciled across multiple peers. If
there are 5 peers in your replication set and the other 4 agree that
there should be a file at path P but you don't have one, then it's safe
to assume that you missed a File-Create message. And yes, that comes
with issues of its own (Q: What if it was deleted on your machine and
none of the others got your File-Delete because you crashed before
sending it? A: Worst case, the file gets recreated and the user
deletes it again. Also, in response to a File-Delete, move files to a
Trash folder rather than actually deleting them for a certain period of
time)
but again we fall back to human resolution.
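A tiny sketch of that majority check (peer-has-file? stands in for
however you query a peer's view of path P):

(define (probably-missed-create? path peers)
  (define yes (count (lambda (p) (peer-has-file? p path)) peers))
  (> yes (/ (length peers) 2)))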
Are you considering eavesdropping on multicast? That may work fine
on a LAN ... but for wide area the variability in UDP complicates
maintaining a consistent global state picture. For a real replication
system I think you really will want a reliable, causal delivery
mechanism. And if large scalability is an issue you will want an
adaptive topology that limits the number of connections.
YMMV. Hope this sparks a good idea.
George