2010/9/18 Dennis Lee Bieber <wlfr...@ix.netcom.com>:
> On Fri, 17 Sep 2010 10:44:43 +0200, Vlastimil Brom
> <vlastimil.b...@gmail.com> declaimed the following in
> gmane.comp.python.general:
>
>> Ok, thanks for confirming my suspicion :-),
>> Now I have to decide whether I shall use my custom data structure,
>> where I am on my own, or whether using an sql database in such a
>> non-standard way has some advantages...
>> The main problem indeed seems to be the fact, that I consider the
>> tagged texts to be the primary storage format, whereas the database is
>> only means for accessing it more conveniently.
>>
>
> I suspect part of your difficulty is in trying to fit everything
> into a single relation (table).
>
> Looking back at your ancient "format for storing textual data (for
> an edition) - formatting and additional info" post, I'd probably move
> your so-called tags into one relation -- where the tag type is, itself,
> data...
>
> Without seeing an actual data sample (and pseudo-DDL):
>
> create table texts
> (
>     ID autoincrement primary key,
>     text varchar
> );
>
> create table tags
> (
>     ID autoincrement primary key,
>     textID integer foreign key references texts(ID),
>     tagtype char,
>     start integer,
>     end integer,
>     supplement varchar
> );
>
> I'd really have to see samples (more than one line) of the raw
> input, and the desired information...
>
> --
> Wulfraed Dennis Lee Bieber AF6VN
> wlfr...@ix.netcom.com HTTP://wlfraed.home.netcom.com/
>
> --
> http://mail.python.org/mailman/listinfo/python-list
>

Thanks for the elaboration. I am sure I am missing some of the more
advanced features of SQL. (Would the schema above also work in sqlite,
given that there are (probably?) no real type restrictions on data?)

The "markup" format as well as the requirements haven't changed since
those old posts; one sample of the tagged text is in one of the
follow-up posts to that thread:
http://mail.python.org/pipermail/python-list/2008-May/540773.html

In principle, the tags have the form <tag_name some tag value>; from
that text index on, this tag-value combination is assigned, until
<tag_name another value> or </tag_name> is encountered. Arbitrary
combinations of tags, including overlapping ones, are possible (nesting
of the same tag is not possible; a new value directly replaces the
previous one).

Different texts may have (partly) differing tags, which I'd prefer to
handle generically, without having to adapt the queries themselves.

After the tagged text is parsed, the plain text and the corresponding
"database" are created; the latter maps the text indices to the tag
names with their values. Querying the data should make it possible to
get the "tagset" for a given text index and, conversely, to find the
indices matching given tag-value combinations. (Actually, the text
ranges matching those criteria would be even better, but these are
easily obtained with bisect.)

(Judging from its specification, mxTextTools looks similar, but it
seemed rather low-level and quite heavyweight for the given task.)

Thanks in advance for any suggestions,
Vlastimil Brom
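
To make the above a bit more concrete, here is a minimal sketch of how the
tag rules described in this message might be combined with the two-table
layout from the quoted pseudo-DDL, using Python's standard sqlite3 module.
It is only an illustration under several assumptions: the function names
(parse_tagged, build_db, tagset_at, ranges_for), the SAMPLE string, and the
column names start_idx and end_idx are all made up here (end is an SQL
keyword, hence the renaming); the regular expression assumes that tag names
consist of word characters and that tag values contain no angle brackets.
As for the typing question, sqlite treats declared column types as
affinities rather than strict constraints, and an INTEGER PRIMARY KEY
column is assigned automatically, so no separate autoincrement keyword is
needed in this sketch.

```python
import re
import sqlite3

# Assumed tag syntax: <name value> opens or replaces, </name> closes.
TAG_RE = re.compile(r"<(/?)(\w+)(?:\s+([^<>]*))?>")


def parse_tagged(tagged):
    """Return (plain_text, ranges); each range is (tag, value, start, end)."""
    pieces, ranges = [], []
    open_tags = {}   # tag name -> (value, start index in the plain text)
    pos = 0          # current index in the plain text
    last = 0         # end of the previous tag in the tagged text
    for m in TAG_RE.finditer(tagged):
        chunk = tagged[last:m.start()]
        pieces.append(chunk)
        pos += len(chunk)
        last = m.end()
        closing, name, value = m.groups()
        # Both </name> and a new <name value> end the current value
        # (same-name tags replace each other instead of nesting).
        if name in open_tags:
            old_value, start = open_tags.pop(name)
            if pos > start:
                ranges.append((name, old_value, start, pos))
        if not closing:
            open_tags[name] = ((value or "").strip(), pos)
    chunk = tagged[last:]
    pieces.append(chunk)
    pos += len(chunk)
    # Tags never closed explicitly run to the end of the text.
    for name, (value, start) in open_tags.items():
        if pos > start:
            ranges.append((name, value, start, pos))
    return "".join(pieces), ranges


def build_db(plain, ranges):
    """Store one text and its tag ranges; return (connection, text id)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE texts (id INTEGER PRIMARY KEY, text TEXT);
        CREATE TABLE tags (
            id INTEGER PRIMARY KEY,
            text_id INTEGER REFERENCES texts(id),
            tagtype TEXT,       -- the tag name is itself data
            start_idx INTEGER,  -- inclusive
            end_idx INTEGER,    -- exclusive
            supplement TEXT     -- the tag value
        );
    """)
    text_id = con.execute(
        "INSERT INTO texts (text) VALUES (?)", (plain,)).lastrowid
    con.executemany(
        "INSERT INTO tags (text_id, tagtype, start_idx, end_idx, supplement)"
        " VALUES (?, ?, ?, ?, ?)",
        [(text_id, name, start, end, value)
         for name, value, start, end in ranges])
    return con, text_id


def tagset_at(con, text_id, index):
    """Return {tag name: value} for all tags covering the given index."""
    rows = con.execute(
        "SELECT tagtype, supplement FROM tags"
        " WHERE text_id = ? AND start_idx <= ? AND ? < end_idx",
        (text_id, index, index))
    return dict(rows.fetchall())


def ranges_for(con, text_id, tagtype, value):
    """Return the (start, end) ranges carrying the given tag-value pair."""
    rows = con.execute(
        "SELECT start_idx, end_idx FROM tags"
        " WHERE text_id = ? AND tagtype = ? AND supplement = ?"
        " ORDER BY start_idx",
        (text_id, tagtype, value))
    return rows.fetchall()


SAMPLE = ("<book 1><chap 1>In the <hl>beginning</hl> was"
          "<chap 2> the word</book>")
plain, ranges = parse_tagged(SAMPLE)
con, text_id = build_db(plain, ranges)
print(plain)                                  # In the beginning was the word
print(tagset_at(con, text_id, 10))            # book 1, chap 1 and hl apply
print(ranges_for(con, text_id, "chap", "2"))  # [(20, 29)]
```

Because each row already stores a start and an end index, the matching
text ranges come straight out of the query and no bisect step is needed in
this variant; the tag names live in the tagtype column, so texts with
differing tags do not require different queries.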