[GENERAL] Postgres and data warehouses
I'm looking at making a data warehouse to address our rapidly spiralling report query times against the OLTP database. I'm looking first at what it would take to make this a real-time data warehouse, as opposed to batch-driven.

One approach I've seen used to achieve real-time data warehousing is middleware that intercepts all database writes and echoes them to a program that rolls up the data warehouse values and updates the facts, dimensions and so on on the fly. Another approach is to use triggers on the tables of interest that journal the changes to separate tables, which are then processed by a batch job to achieve the same thing.

One problem with the trigger-on-the-transactional-database approach is that if there is ever a problem with the trigger, the main transaction is affected. I'm not sure if that is avoidable with proper exception handling in the trigger code? It also means a lot of trigger code to maintain, and another schema to maintain (the journalled changes), so there are several drawbacks.

I've had a quick search and haven't seen this approach used yet, but I was thinking the asynchronous replication of Postgres databases could be used as a streaming journal of changes to be processed by a data warehouse. The other approach that suggests itself is WAL file shipping. I've not dug into the async replication protocol yet; before I do so I just wanted some brief feedback on whether I'm on the wrong track or not, and whether there's some better approach I should be looking at first instead.

Any feedback much appreciated.

Regards,
Nigel
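For what it's worth, the trigger-journaling idea, and the exception-handling question, can be sketched roughly as below. All table and column names are invented for illustration; this is not a tested implementation:

```sql
-- Hypothetical journal table fed by an AFTER trigger on "orders".
CREATE TABLE orders_journal (
    journal_id bigserial PRIMARY KEY,
    op         char(1)     NOT NULL,              -- 'I', 'U' or 'D'
    changed_at timestamptz NOT NULL DEFAULT now(),
    row_data   json        NOT NULL
);

CREATE OR REPLACE FUNCTION journal_orders() RETURNS trigger AS $$
BEGIN
    BEGIN
        INSERT INTO orders_journal (op, row_data)
        VALUES (left(TG_OP, 1),
                to_json(CASE WHEN TG_OP = 'DELETE' THEN OLD ELSE NEW END));
    EXCEPTION WHEN OTHERS THEN
        -- Swallowing the error keeps the main transaction alive,
        -- at the cost of silently losing this journal entry.
        RAISE WARNING 'journal insert failed: %', SQLERRM;
    END;
    RETURN NULL;  -- AFTER trigger: return value is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_journal_trg
    AFTER INSERT OR UPDATE OR DELETE ON orders
    FOR EACH ROW EXECUTE PROCEDURE journal_orders();
```

The inner BEGIN/EXCEPTION block does answer the "can the main transaction survive a trigger failure" question, but note it sets up a subtransaction for every row, which has a measurable cost on a busy OLTP table.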
Re: [GENERAL] Postgres and data warehouses
* Nigel Gardiner (nigelgardi...@gmail.com) wrote:
> I've had a quick search and haven't seen this approach used yet, but I was
> thinking, the asynchronous replication of Postgres databases could be used
> as a streaming journal of changes to be processed by a data warehouse. The
> other approach that suggests itself is WAL file shipping. I've not dug into
> the async rep protocol yet, before I do so I just wanted to get some brief
> feedback on whether I'm on the wrong track or not, and if there's some
> better approach I should be looking at first instead.

Using a streaming replica for data warehouse queries is quite common. The issue there is if you want to change the data structure or store data on the replica, because a streaming replica based on WAL shipping is read-only. You can create FDW tables which you can write to (even on the replica), or you can create an independent database which has FDW tables to the replica. They have their own pros and cons, of course.

Thanks!

Stephen
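The FDW arrangement Stephen describes might look roughly like this on an independent reporting database. Host, database, credentials and the table layout are all placeholders:

```sql
-- Sketch only: server, user and table definitions are hypothetical.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER replica_srv
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'replica.example.com', dbname 'oltp');

CREATE USER MAPPING FOR CURRENT_USER
    SERVER replica_srv
    OPTIONS (user 'report_reader', password 'secret');

-- A read-only window onto an OLTP table, living alongside local,
-- writable warehouse tables (facts, dimensions, aggregates).
CREATE FOREIGN TABLE orders_remote (
    order_id    bigint,
    customer_id bigint,
    amount      numeric,
    created_at  timestamptz
) SERVER replica_srv OPTIONS (table_name 'orders');
```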
Re: [GENERAL] Postgres and data warehouses
On Sun, 8 Mar 2015 11:20:16 -0400 Stephen Frost wrote:
> * Nigel Gardiner (nigelgardi...@gmail.com) wrote:
> > I've had a quick search and haven't seen this approach used yet, but I was
> > thinking, the asynchronous replication of Postgres databases could be used
> > as a streaming journal of changes to be processed by a data warehouse. [...]
>
> Using a streaming replica for data warehouse queries is quite common.
> The issue there is if you want to change the data structure or store
> data on the replica because a streaming replica based on WAL shipping is
> read-only. You can create FDW tables which you can write to (even on
> the replica) or you can create an independent database which has FDW
> tables to the replica. They have their own pros and cons, of course.

Another option is to replicate using Slony instead of streaming, which allows you to create additional tables on the replica that are read/write, in addition to triggers that only fire on the replica. It's complicated, but pretty damn powerful.

--
Bill Moran

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
[GENERAL] Benchmarking partitioning triggers and rules
I wrote a quick benchmarking script to test various partitioning strategies. The code is here:

https://gist.github.com/timuckun/954ab6bdce36fa14bc1c

I was astonished at the variability of the timings between the different variations. The test data contained 270K records. I did a simple insert without any triggers, with three different trigger variations, and with a rule. The results were like this:

clean   0.00  0.00  0.00  (  3.119498)
func_1  0.00  0.00  0.00  (  7.435094)
func_2  0.00  0.00  0.00  ( 28.427617)
func_3  0.00  0.00  0.00  ( 18.348554)
rule    0.00  0.00  0.00  (  2.901931)

A clean insert took 3.1 seconds; routing through a rule took less time! A simple INSERT INTO table_name VALUES (NEW.*) doubled the time it takes to insert the records. Using an EXECUTE with an interpolated table name took TEN TIMES longer! Making a slight change to the EXECUTE took a third off the execution time. WTF???

This has left me both baffled and curious. If changing little things like this makes a huge difference, what else can I do to make this even faster? Would using a different language help? Is JavaScript, Python or Perl faster? Is there some other syntax I can use? I tried this:

EXECUTE 'INSERT INTO ' || quote_ident(partition_name) || ' VALUES (' || NEW.* || ')'

but that gave me an error.
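The concatenation fails because a record like NEW can't be interpolated into a string that way. The usual workaround, passing NEW as a parameter with USING so the row is never round-tripped through text, can be sketched like this (untested against the gist above; the partition-naming scheme and the created_at column are invented):

```sql
-- Sketch of a partition-routing trigger body. Passing NEW via USING
-- avoids text interpolation and re-parsing of the row value.
CREATE OR REPLACE FUNCTION route_to_partition() RETURNS trigger AS $$
DECLARE
    partition_name text :=
        TG_TABLE_NAME || '_' || to_char(NEW.created_at, 'YYYY_MM');
BEGIN
    EXECUTE format('INSERT INTO %I SELECT ($1).*', partition_name)
    USING NEW;
    RETURN NULL;  -- BEFORE trigger: suppress the insert into the parent
END;
$$ LANGUAGE plpgsql;
```

This may also bear on the timing variability: dynamic SQL built by string concatenation forces Postgres to convert the row to text and re-parse it for every insert, whereas the parameterized form does not.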
Re: [GENERAL] Postgres and data warehouses
On 3/8/2015 7:40 AM, Nigel Gardiner wrote:
> I'm looking at making a data warehouse to address our rapidly spiralling
> report query times against the OLTP. I'm looking first at what it would
> take to make this a real-time data warehouse, as opposed to batch-driven.

We use a hybrid architecture. We have a 'republisher' process that repeatedly slurps new data from the OLTP database and sends it to the back end databases, using a 'publish/subscribe' messaging bus. Several back end databases subscribe to this data, and their subscriber processes insert the incoming data into those OLAP and various other reporting databases. This way the reporting databases can have completely different schemas, optimized for their needs, and different retention requirements than the OLTP database.

This republisher is usually within a few seconds of live new data. In our case it's made fairly easy to track 'new' because all our OLTP transactions are event-oriented.

--
john r pierce                                      37N 122W
somewhere on the middle of the left coast
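The details of John's setup are his own, but the "track 'new' via event-oriented transactions" idea can be sketched generically: if the OLTP side appends to an event table, a republisher can poll with a remembered high-water mark. Table and column names here are invented:

```sql
-- Hypothetical append-only event table on the OLTP side. The republisher
-- remembers the highest event_id it has published so far and repeatedly
-- fetches everything newer, in order:
SELECT event_id, event_type, payload, occurred_at
FROM events
WHERE event_id > 123456   -- last published id, kept by the republisher
ORDER BY event_id
LIMIT 1000;
```

Because the event log is append-only and ordered, the republisher never needs to diff tables to find changes, which is what keeps it within a few seconds of live data.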
Re: [GENERAL] Postgres and data warehouses
I strongly suspect many would like further details on your implementation. A user conference session, well recorded, perhaps?

> On Mar 8, 2015, at 4:57 PM, John R Pierce wrote:
>
>> On 3/8/2015 7:40 AM, Nigel Gardiner wrote:
>> I'm looking at making a data warehouse to address our rapidly spiralling
>> report query times against the OLTP. I'm looking first at what it would
>> take to make this a real-time data warehouse, as opposed to batch-driven.
>
> we use a hybrid architecture. we have a 'republisher' process that
> repeatedly slurps new data from the OLTP database and sends it to the back
> end databases, using a 'publish/subscribe' messaging bus. several back end
> databases subscribe to this data, and their subscriber processes insert the
> incoming data into those OLAP and various other reporting databases. this
> way the reporting databases can have completely different schemas optimized
> for their needs, and have different retention requirements than the OLTP
> database.
>
> this republisher is usually within a few seconds of live new data. in our
> case its made fairly easy to track 'new' because all our OLTP transactions
> are event-oriented.
Re: [GENERAL] Weight BLOB objects in postgreSQL? How?
Thanks. I mean, how heavy it is, in kB.

2015-03-04 12:17 GMT-03:00 John R Pierce:
> On 3/4/2015 7:03 AM, María Griensu wrote:
>> I need to figure out how I can weigh BLOB objects in a table of a DB. I'm
>> not an expert on these topics, so I appreciate any help you can give me.
>
> postgres's equivalent of BLOB is BYTEA. I'm not sure what you mean by
> 'weight' here.
>
> --
> john r pierce                                      37N 122W
> somewhere on the middle of the left coast

--
Ing. María Dovale
Medical IT - Griensu S.A.
Av. Colón 636 - Piso 8
X5000EPT - Córdoba
Tel: +54 (351) 4247979 Int. 22
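If "weight" means the on-disk size, Postgres can report it directly. Assuming a bytea column named `data` in a table `files` (both names invented for this example):

```sql
-- octet_length() gives the raw byte count; pg_column_size() gives the
-- size as actually stored, which may be smaller due to TOAST compression.
SELECT id,
       octet_length(data)                    AS raw_bytes,
       pg_column_size(data)                  AS stored_bytes,
       round(octet_length(data) / 1024.0, 1) AS raw_kb
FROM files;

-- For large objects (the lo facility) rather than bytea, sum the chunks:
SELECT loid, sum(octet_length(data)) AS bytes
FROM pg_largeobject
GROUP BY loid;
```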
[GENERAL] Creating composite keys from csv
Hi all,

I'm a student journalist working on a project for our student paper which lists salaries and positions for every staff member at the university. We received the data from an FOI request, but the university is refusing to give us primary keys for the data.

The issue we've run into is that if there are two staff members with the same name (and there are), our current web app adds their salaries together and considers them one person. Luckily, we can create a composite key if we combine the name column with the total salary column. Unfortunately, the format of the data we have makes it more difficult than that (of course!), because some employees hold multiple paying positions. Here's some example data:

Name,       Position,         Salary,  Total Salary, ...
Jane Doe,   Dean,             100.000, 148.000, ...
John Locke, Custodian,        30.000,  30.000, ...
Jane Doe,   Academic Adviser, 48.000,  148.000, ...
Jane Doe,   Trainer,          46.000,  46.000, ...

Basically, what we'd like to do is create a serial primary key, but instead of having it increment on every row, it needs to check the name and total salary columns and only increment if that person doesn't already exist. If they do exist, it should just assign the previously created number to the column.

However, our team is small, between us we have very little experience working with databases, and we haven't found a way to accomplish this goal yet. In fact, we may be trying to solve this the wrong way entirely. So, to put it succinctly: how would you approach this problem? What are our options? Do we need to write a script to clean the data into separate CSV tables before we import it to Postgres, or is this something we can do in Postgres?

We'd really appreciate any help you all may be able to offer.

Best!
Eli Murray
Re: [GENERAL] Postgres and data warehouses
On 3/8/2015 8:24 PM, Rob Sargent wrote:
> I strongly suspect many would like further details on your implementation.
> A user conference session well recorded perhaps?

the details are proprietary and quite specific to our workload and requirements.

--
john r pierce                                      37N 122W
somewhere on the middle of the left coast
Re: [GENERAL] Postgres and data warehouses
I expected the latter, but … some proprietors like to brag :) Standing down.

> On Mar 8, 2015, at 9:56 PM, John R Pierce wrote:
>
> On 3/8/2015 8:24 PM, Rob Sargent wrote:
>> I strongly suspect many would like further details on your implementation.
>> A user conference session well recorded perhaps?
>
> the details are proprietary and quite specific to our workload and
> requirements.
Re: [GENERAL] Creating composite keys from csv
On 3/8/2015 11:49 PM, Eli Murray wrote:
> Hi all, I'm a student journalist working on a project for our student paper
> which lists salaries and positions for every staff member at the university.
> We received the data from an FOI request but the university is refusing to
> give us primary keys for the data. The issue we've run into is that if there
> are two staff members with the same name (and there are) our current web app
> adds their salaries together and considers them one person. Now, luckily, we
> can create a composite key if we combine their name column with their salary
> column. Unfortunately, the format of the data we have makes it more
> difficult than that (of course!) because some employees can hold multiple
> paying positions. [...]

Take a look at the windowing functions:
http://www.postgresql.org/docs/9.4/static/functions-window.html

Roxanne

--
[At other schools] I think the most common fault in general is to teach
students how to pass exams instead of teaching them the science.
Donald Knuth
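To make the window-function pointer concrete: a window function can assign the "person number" described above after a plain import, with no pre-cleaning script. Table and column names below are guesses based on the sample data:

```sql
-- Assuming the CSV was imported as-is into a staging table:
CREATE TABLE staff_raw (
    name         text,
    position     text,
    salary       numeric,
    total_salary numeric
);

-- dense_rank() numbers each distinct (name, total_salary) pair 1, 2, 3, ...
-- so all rows for the same person share one id: exactly the "only
-- increment for a new person" behaviour being asked for.
SELECT dense_rank() OVER (ORDER BY name, total_salary) AS person_id,
       name, position, salary, total_salary
FROM staff_raw;
```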
Re: [GENERAL] Creating composite keys from csv
Eli Murray wrote:
> Basically, what we'd like to do is create a serial primary key but instead
> of having it increment every row, it needs to check the name and total
> salary columns and only increment if that person doesn't already exist.

So you already have a PK, (Name, Total Salary), but the source data doesn't provide a usable surrogate key to operate with.

I would create Person and Person-Position tables, and after importing the CSV data to a staging area, write a query to insert any unknown (Name, Total Salary) records into Person with a serial PK field. You can then join Person back onto Staging using (Name, Total Salary), but now include the PK, and select just the PK, Position, and Salary fields, which you can then add to the Person-Position table.

You now have a Person table with (PK, Name, Total Salary) and a Person-Position table with (PK, Position, Role Salary), and you can discard the imported CSV data.

This solves the explicit problem, given the assumption that (Name, Total Salary) is indeed a uniquely identifying constraint.

David J.
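That recipe translates to SQL fairly directly. This is a sketch with invented table and column names, assuming the raw CSV rows sit in a table called `staging`:

```sql
-- Person gets the surrogate key; (name, total_salary) is the natural key.
CREATE TABLE person (
    person_id    serial PRIMARY KEY,
    name         text    NOT NULL,
    total_salary numeric NOT NULL,
    UNIQUE (name, total_salary)
);

CREATE TABLE person_position (
    person_id integer NOT NULL REFERENCES person,
    position  text    NOT NULL,
    salary    numeric NOT NULL
);

-- 1. Insert each distinct person once; the serial assigns the key.
INSERT INTO person (name, total_salary)
SELECT DISTINCT name, total_salary
FROM staging;

-- 2. Join staging back to person to attach the key to every position row.
INSERT INTO person_position (person_id, position, salary)
SELECT p.person_id, s.position, s.salary
FROM staging s
JOIN person p USING (name, total_salary);
```

After step 2 the staging table can be dropped, as suggested above.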
Re: [GENERAL] Creating composite keys from csv
On 3/8/2015 10:32 PM, David G Johnston wrote:
> This solves the explicit problem given the assumption that (Name, Total
> Salary) is indeed a uniquely identifying constraint.

that constraint seems flawed to me.

--
john r pierce                                      37N 122W
somewhere on the middle of the left coast