Apologies for the long email. I would appreciate some advice on designing a database that could become quite complicated. The purpose of the database is to serve as a source for bibliometric research.
Some background. I work with various sources:

* The largest source consists of tagged, delimited text files. The database built from these files currently takes up about 230 GB. I receive weekly updates of new and/or corrected data, and I use a Python script to parse the files and insert the data into the database.

* The other sources are much smaller (up to about 300 000 records in CSV files or Microsoft Access databases) and come in various formats. These sources are also updated from time to time.

My aim is to have a database that can easily accommodate the raw data while also providing a user-friendly structure for the researchers using the data. I am working on a multilevel database:

* Level 0: where the data from the various sources enter. The structure mirrors that of each source. No editing takes place on this level other than applying the updates from the source.

* Level 1: where the data from level 0 are made available in a common structure that is better suited to the kind of research we do, but still kept separate for each source. Not all the data is important on this level, so not all the fields from level 0 are represented here. On this level data can be added or modified and value added (with some extra tables for source-specific value-added data). The level is populated from level 0 in such a way that the value-added data do not get overwritten (see the first sketch at the end of this mail). I found a valuable article for the basic design of this level: Mallig, Nicolai, "A relational database for bibliometric analysis", Journal of Informetrics 4(4): 564-580, October 2010, DOI: 10.1016/j.joi.2010.06.007.

* Level 2: an abstraction layer in which queries spanning the various sources are run. This is a read-only layer.

I have considered using views for a large part of level 1, but that does not seem practical: queries involving many millions of records (there are at present about 43 000 000 articles in the article table) take a long time to execute, especially when aggregates are involved. I want to conserve space on the one hand and speed up queries on the other, and sometimes I think it is not possible to have both.

Instead of views, a series of tables created from standard queries may speed things up, so I have thought of creating temporary tables that are repopulated periodically (each night, or every few hours) from those standard queries (see the second sketch at the end of this mail). Some of the queries will make use of full text searching, since they involve large fields such as article abstracts.

Basically my request is for advice on how to make this database as fast as possible, with as few instances of duplicated data as possible, while providing both for the updates on level 0 and the value-added editing on level 1.
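To make the level 0 to level 1 step a bit more concrete, here is a rough sketch of the kind of load I have in mind. All the table and column names are invented for the example, and I am using PostgreSQL-flavoured SQL with psycopg2 purely as an illustration:

# Sketch of the level 0 -> level 1 load for one source, written so that it
# cannot overwrite value-added data.  All names are invented for the example.
import psycopg2

REFRESH_SQL = """
-- 1. Update rows that already exist in level 1 (source-derived columns only).
UPDATE level1.source_a_article AS a
SET    title    = raw.title,
       pub_year = raw.pub_year,
       abstract = raw.abstract
FROM   level0.source_a_raw AS raw
WHERE  a.source_id = raw.article_id;

-- 2. Insert rows that are new since the last update.
INSERT INTO level1.source_a_article (source_id, title, pub_year, abstract)
SELECT raw.article_id, raw.title, raw.pub_year, raw.abstract
FROM   level0.source_a_raw AS raw
LEFT   JOIN level1.source_a_article AS a ON a.source_id = raw.article_id
WHERE  a.source_id IS NULL;
"""

def refresh_level1(dsn):
    # Both statements run in a single transaction.  The value-added data
    # live in separate tables (e.g. level1.source_a_note) keyed on
    # source_id, so nothing in this step can touch them.
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(REFRESH_SQL)

if __name__ == "__main__":
    refresh_level1("dbname=bibliometrics")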
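For level 2, the idea is that a scheduled job (cron) rebuilds each summary table from its standard query instead of computing the aggregate in a view at query time, trading a small amount of duplicated (derived) data for speed. Again only a rough sketch, with invented names and the same PostgreSQL/psycopg2 assumption:

# Sketch of the periodic refresh of one level 2 summary table.
# Names are invented for the example; meant to be run from cron.
import psycopg2

SUMMARY_SQL = """
DELETE FROM level2.articles_per_year;

INSERT INTO level2.articles_per_year (source, pub_year, n_articles)
SELECT 'source_a', pub_year, count(*)
FROM   level1.source_a_article
GROUP  BY pub_year
UNION ALL
SELECT 'source_b', pub_year, count(*)
FROM   level1.source_b_article
GROUP  BY pub_year;
"""

def refresh_summaries(dsn):
    # Delete and reload happen in one transaction, so queries against the
    # summary table never see it empty or half-populated.
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(SUMMARY_SQL)

if __name__ == "__main__":
    refresh_summaries("dbname=bibliometrics")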
Regards
Johann

--
Because experiencing your loyal love is better than life itself, my lips will praise you. (Psalm 63:3)