On Sep 5, 12:13 am, Stefan Behnel <[EMAIL PROTECTED]> wrote: > Hi, > > this discussion seems pretty much off-topic for a Python list. > > > > castironpi wrote: > > In an XML file, entries are stored in serial, sort of like this. > > > AAA BBB CCC DDD > > > Or more recognizably, > > > <A><B><C>something</C><D>something</D></B></A> > > > Point is, to change <C>something</C> to <C>something else</C>, you > > have to recopy everything after that. > > > AAA BBB CCC DDD > > AAA BBBb CCC DDD > > > requires 7 writes, 'b CCC DDD', not 1. > > > I want to use a simple tree structure to store: > > > 0 A-> None, 1 > > 1 B-> None, 2 > > 2 C-> 3, None > > 3 D-> None, None > > > Each node maps to 'Next, Child', or more accurately, 'Next Sibling, > > First Child'. > > Do I understand you right: you want to access serialised XML data
alex23 correctly pointed out last night that XML is always serialized. What I mean and possibly what you mean is, a serialized tree structure. > as a memory > mapped file and operate directly on the byte data? Yes. Byte data containing both strings, and the structure of the tree. > You would still have to > decode the complete byte sequence to parse it and build your index structure > in that case. No, it's saved in an index structure; it's already in one. To find a/ b/c, read a, read its children to b, read b, read its children to c, read c. > How do you plan to store the pointers to a node's next sibling/child? And how > do you keep them updated over insertions/deletions? You update them when you insert them. To add 'c' to a/b, allocate c, initialize c, read a, read its children to b, read b, read its children to the end, add c. When you delete c from a/b/c, however, any references to c that you have in any programs become invalid. Don't delete it if you have them. The bytes are marked in the file to be no longer in use, which marking takes up some of the space in the file. > That, plus the copy > overhead in a sequential file, will be very costly on each change. No. That's the point of a dynamic structure. <abc><def> are not stored in memory as 10 consecutive characters. The file is not strictly speaking XML. See below for what they're stored as. > > You get constant time updates to contents, and log-time searches. > > Every XML tree structure gives you log-time searches. But how do you achieve > constant time updates in a sequential file? You don't use a sequential file. > Stefan I stated earlier that each node would look roughly like: tag_offset, first_attr, text_offset, tail_offset, first_child, prev_sibling, next_sibling, parent Attributes would look like: key_offset, value_offset, prev_attr, next_attr, node All these fields are integers that contain an offset into the file. Simplified: 0 Reserved 1 A.Tag 7 (points to 7 below) 2 A.FirstChild 4 (etc.) 3 A.Contents 0 (None) 4 B.Tag 11 5 B.FirstChild 0 6 B.Contents 15 7 3abc 11 3def 15 5ghijk This translates to: <abc><def>ghijk</def></abc> But isn't stored that way. The records are all the same size. The 'Tag' field of a record that starts at offset J is at offset J. The 'FirstChild' field of a record that starts at offset K is at offset K+ 1. 'tag_offset', 'text_offset', 'key_offset', and 'value_offset' contain offsets of variable-length strings into the file ( 7, 11, 15 ). The rest contain offsets of further structures. There's extra space usage not only in the structure of the tree, but in the record-keeping of what bytes are available for use, and which are in use. You have to grow the file size yourself (like growing an array) when you need to; the file system won't for you in mmap. (This means the alloc-free module I'm looking at will need modifications.) I'll reemphasize the value of constant-time insertions to a file though. -- http://mail.python.org/mailman/listinfo/python-list