[GENERAL] Postgres and data warehouses

2015-03-08 Thread Nigel Gardiner
I'm looking at making a data warehouse to address our rapidly spiralling
report query times against the OLTP. I'm looking first at what it would
take to make this a real-time data warehouse, as opposed to batch-driven.

One approach I've seen used to achieve real-time data warehousing is to
have middleware that intercepts all database writes and echoes them
to a program that rolls up the data warehouse values and updates the facts,
dimensions and so on, on the fly. Another approach is to use triggers on
the tables of interest that journal the changes to dedicated tables, which
then get processed by a batch job to achieve the same thing.
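The sort of journaling trigger I have in mind would look roughly like this
(table and column names are hypothetical, and DELETE handling is omitted
for brevity):

CREATE TABLE journal (
    id         bigserial   PRIMARY KEY,
    table_name text        NOT NULL,
    op         text        NOT NULL,
    row_data   json        NOT NULL,
    changed_at timestamptz NOT NULL DEFAULT now()
);

CREATE OR REPLACE FUNCTION journal_change() RETURNS trigger AS $$
BEGIN
    -- record which table changed, how, and the full new row
    INSERT INTO journal (table_name, op, row_data)
    VALUES (TG_TABLE_NAME, TG_OP, row_to_json(NEW));
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- one trigger per table of interest, e.g. a hypothetical "orders" table:
CREATE TRIGGER orders_journal
    AFTER INSERT OR UPDATE ON orders
    FOR EACH ROW EXECUTE PROCEDURE journal_change();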

One of the problems with the trigger-on-the-transactional-database approach
is that if there is ever a problem with the trigger, the main transaction
is affected. I'm not sure whether that is avoidable with proper exception
handling in the trigger code, along the lines of the sketch below. It also
means a lot of trigger code to maintain, and another schema to maintain
(the journalled changes), so there are several drawbacks.
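By exception handling I mean something like the following, which demotes a
journaling failure to a warning instead of aborting the caller's transaction
(though the BEGIN/EXCEPTION sub-block costs a subtransaction per row, which
is not free):

CREATE OR REPLACE FUNCTION journal_change_safe() RETURNS trigger AS $$
BEGIN
    BEGIN
        INSERT INTO journal (table_name, op, row_data)
        VALUES (TG_TABLE_NAME, TG_OP, row_to_json(NEW));
    EXCEPTION WHEN OTHERS THEN
        -- swallow the error so the main OLTP transaction survives
        RAISE WARNING 'journal trigger failed: %', SQLERRM;
    END;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;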

I've had a quick search and haven't seen this approach used yet, but I was
thinking the asynchronous replication of Postgres databases could be used
as a streaming journal of changes to be processed by a data warehouse. The
other approach that suggests itself is WAL file shipping. I've not dug into
the async rep protocol yet; before I do, I just wanted to get some brief
feedback on whether I'm on the wrong track or not, and whether there's some
better approach I should be looking at instead.

Any feedback much appreciated.

Regards,
Nigel


Re: [GENERAL] Postgres and data warehouses

2015-03-08 Thread Stephen Frost
* Nigel Gardiner (nigelgardi...@gmail.com) wrote:
> I've had a quick search and haven't seen this approach used yet, but I was
> thinking the asynchronous replication of Postgres databases could be used
> as a streaming journal of changes to be processed by a data warehouse. The
> other approach that suggests itself is WAL file shipping. I've not dug into
> the async rep protocol yet; before I do, I just wanted to get some brief
> feedback on whether I'm on the wrong track or not, and whether there's some
> better approach I should be looking at instead.

Using a streaming replica for data warehouse queries is quite common.
The issue arises if you want to change the data structure or store
data on the replica, because a streaming replica based on WAL shipping is
read-only.  You can create FDW tables which you can write to (even on
the replica), or you can create an independent database which has FDW
tables pointing at the replica.  They have their own pros and cons, of course.
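For the second option, the setup is just postgres_fdw pointed at the
replica; a sketch with made-up names, one foreign table per table you want
to expose:

-- on the independent warehouse database:
CREATE EXTENSION postgres_fdw;

CREATE SERVER oltp_replica FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'replica.example.com', port '5432', dbname 'oltp');

CREATE USER MAPPING FOR CURRENT_USER SERVER oltp_replica
    OPTIONS (user 'warehouse', password 'secret');

CREATE FOREIGN TABLE orders (
    order_id   bigint,
    customer   text,
    amount     numeric,
    created_at timestamptz
) SERVER oltp_replica OPTIONS (schema_name 'public', table_name 'orders');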

Thanks!

Stephen




Re: [GENERAL] Postgres and data warehouses

2015-03-08 Thread Bill Moran
On Sun, 8 Mar 2015 11:20:16 -0400
Stephen Frost  wrote:

> * Nigel Gardiner (nigelgardi...@gmail.com) wrote:
> > I've had a quick search and haven't seen this approach used yet, but I was
> > thinking the asynchronous replication of Postgres databases could be used
> > as a streaming journal of changes to be processed by a data warehouse. The
> > other approach that suggests itself is WAL file shipping. I've not dug into
> > the async rep protocol yet; before I do, I just wanted to get some brief
> > feedback on whether I'm on the wrong track or not, and whether there's some
> > better approach I should be looking at instead.
> 
> Using a streaming replica for data warehouse queries is quite common.
> The issue arises if you want to change the data structure or store
> data on the replica, because a streaming replica based on WAL shipping is
> read-only.  You can create FDW tables which you can write to (even on
> the replica), or you can create an independent database which has FDW
> tables pointing at the replica.  They have their own pros and cons, of course.

Another option is to replicate using Slony instead of streaming replication,
which allows you to create additional read/write tables on the replica, as
well as triggers that fire only on the replica. It's complicated, but
pretty damn powerful.

-- 
Bill Moran




[GENERAL] Benchmarking partitioning triggers and rules

2015-03-08 Thread Tim Uckun
I wrote a quick benchmarking script to test various partitioning
strategies. The code is here:

https://gist.github.com/timuckun/954ab6bdce36fa14bc1c

I was astonished at the variability of the timings between the different
variations.

The test data contained 270K records.  I did a simple INSERT INTO without
any triggers, with three different trigger variations, and with a rule.  The
results were like this (the figure in parentheses is elapsed wall-clock
seconds, which is what I quote below):

clean    0.00   0.00   0.00  (  3.119498)
func_1   0.00   0.00   0.00  (  7.435094)
func_2   0.00   0.00   0.00  ( 28.427617)
func_3   0.00   0.00   0.00  ( 18.348554)
Rule     0.00   0.00   0.00  (  2.901931)

A clean insert took 3.1 seconds; going through a rule took even less time!
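For context, a partition-redirect rule is shaped roughly like this (table
names here are made up); a rule is expanded into the statement at
query-rewrite time, so there is no per-row function call the way there is
with a trigger:

CREATE RULE route_insert AS
ON INSERT TO measurements
DO INSTEAD
    INSERT INTO measurements_2015_03 VALUES (NEW.*);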

A simple INSERT INTO table_name VALUES (NEW.*) inside the trigger doubled
the time it takes to insert the records.  Using an EXECUTE with an
interpolated table name took TEN TIMES longer!  Making a slight change to
the EXECUTE took a third off that execution time.  WTF???

This has left me both baffled and curious.  If changing little things like
this makes a huge difference, what else can I do to make this even faster?

Would using a different language help?  Is JavaScript, Python or Perl
faster?  Is there some other syntax I can use?  I tried this:

EXECUTE 'INSERT INTO ' || quote_ident(partition_name) || ' VALUES (' ||
NEW.* || ')'

but that gave me an error.
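(Presumably the error is because NEW.* expands to a list of columns, not a
single value you can concatenate into a string.  Passing the whole row as an
EXECUTE parameter, roughly as below, should avoid that, though I haven't
benchmarked it:)

EXECUTE format('INSERT INTO %I SELECT ($1).*', partition_name)
USING NEW;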


Re: [GENERAL] Postgres and data warehouses

2015-03-08 Thread John R Pierce

On 3/8/2015 7:40 AM, Nigel Gardiner wrote:
> I'm looking at making a data warehouse to address our rapidly
> spiralling report query times against the OLTP. I'm looking first at
> what it would take to make this a real-time data warehouse, as opposed
> to batch-driven.


we use a hybrid architecture.  we have a 'republisher' process that
repeatedly slurps new data from the OLTP database and sends it to the
back end databases, using a 'publish/subscribe' messaging bus.  several
back end databases subscribe to this data, and their subscriber
processes insert the incoming data into those OLAP and various other
reporting databases.  this way the reporting databases can have
completely different schemas optimized for their needs, and have
different retention requirements than the OLTP database.

this republisher is usually within a few seconds of live new data.  in
our case it's made fairly easy to track 'new' because all our OLTP
transactions are event-oriented.
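a minimal sketch of the tracking query, with made-up names (the real thing
is proprietary):

-- hypothetical: every OLTP write lands in an append-only events table
-- keyed by a bigserial, so 'new' is just 'ids we haven't published yet'
PREPARE fetch_new (bigint) AS
    SELECT event_id, payload
      FROM events
     WHERE event_id > $1
     ORDER BY event_id;

-- each republisher cycle: EXECUTE fetch_new(last_published_id);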



--
john r pierce  37N 122W
somewhere on the middle of the left coast





Re: [GENERAL] Postgres and data warehouses

2015-03-08 Thread Rob Sargent
I strongly suspect many would like further details on your implementation. A
user conference session, well recorded, perhaps?

> On Mar 8, 2015, at 4:57 PM, John R Pierce  wrote:
> 
>> On 3/8/2015 7:40 AM, Nigel Gardiner wrote:
>> I'm looking at making a data warehouse to address our rapidly spiralling 
>> report query times against the OLTP. I'm looking first at what it would take 
>> to make this a real-time data warehouse, as opposed to batch-driven.
> 
> we use a hybrid architecture.  we have a 'republisher' process that
> repeatedly slurps new data from the OLTP database and sends it to the back
> end databases, using a 'publish/subscribe' messaging bus.  several back end
> databases subscribe to this data, and their subscriber processes insert the
> incoming data into those OLAP and various other reporting databases.  this
> way the reporting databases can have completely different schemas optimized
> for their needs, and have different retention requirements than the OLTP
> database.
> 
> this republisher is usually within a few seconds of live new data.  in our
> case it's made fairly easy to track 'new' because all our OLTP transactions
> are event-oriented.
> 
> 
> -- 
> john r pierce  37N 122W
> somewhere on the middle of the left coast
> 
> 
> 




Re: [GENERAL] Weight BLOB objects in postgreSQL? How?

2015-03-08 Thread María Dovale
Thanks. I mean: how heavy it is, in kB.
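For a bytea column, something like this is what I'm after (table and column
names made up):

SELECT id,
       octet_length(data)                    AS bytes,         -- raw size
       pg_column_size(data)                  AS stored_bytes,  -- on disk, after TOAST compression
       round(octet_length(data) / 1024.0, 1) AS kb
  FROM blobs;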

2015-03-04 12:17 GMT-03:00 John R Pierce :

>  On 3/4/2015 7:03 AM, María Griensu wrote:
>
> I need to figure out how I can weigh BLOB objects in a table of a DB; I'm
> not an expert on these topics, so I appreciate any help you can give me.
>
>
> postgres's equivalent of BLOB is BYTEA. I'm not sure what you mean by
> 'weight' here.
>
> --
> john r pierce  37N 122W
> somewhere on the middle of the left coast
>
>


-- 
Ing. María Dovale
Medical IT - Griensu S.A.
Av. Colón 636 - Piso 8
X5000EPT - Córdoba
Tel: +54 (351) 4247979 Int. 22




[GENERAL] Creating composite keys from csv

2015-03-08 Thread Eli Murray
Hi all,

I'm a student journalist working on a project for our student paper which
lists salaries and positions for every staff member at the university. We
received the data from an FOI request but the university is refusing to
give us primary keys for the data.

The issue we've run into is that if there are two staff members with the
same name (and there are), our current web app adds their salaries together
and considers them one person. Luckily, we can create a composite key
by combining the name column with the total salary column. Unfortunately,
the format of the data we have makes it more difficult than that (of
course!), because some employees hold multiple paying positions.

Here's some example data:

Name, Position, Salary, Total Salary, ...
Jane Doe, Dean, 100.000, 148.000, ...
John Locke, Custodian, 30.000, 30.000, ...
Jane Doe, Academic Adviser, 48.000, 148.000, ...
Jane Doe, Trainer, 46.000, 46.000, ...

Basically, what we'd like to do is create a serial primary key but instead
of having it increment every row, it needs to check the name and total
salary columns and only increment if that person doesn't already exist. If
they do exist, it should just assign the previously created number to the
column. However, our team is small and between us we have very little
experience working with databases and we haven't found a way to accomplish
this goal yet. In fact, we may be trying to solve this in the wrong way
entirely.

So, to put it succinctly, how would you approach this problem? What are our
options? Do we need to write a script to clean the data into separate csv
tables before we import it to postgres, or is this something we can do in
postgres? We'd really appreciate any help you all may be able to offer.

Best!
Eli Murray


Re: [GENERAL] Postgres and data warehouses

2015-03-08 Thread John R Pierce

On 3/8/2015 8:24 PM, Rob Sargent wrote:

> I strongly suspect many would like further details on your implementation.
> A user conference session, well recorded, perhaps?


the details are proprietary and quite specific to our workload and 
requirements.




--
john r pierce  37N 122W
somewhere on the middle of the left coast





Re: [GENERAL] Postgres and data warehouses

2015-03-08 Thread Rob Sargent
I expected the latter but … some proprietors like to brag :)

Standing down.

> On Mar 8, 2015, at 9:56 PM, John R Pierce  wrote:
> 
> On 3/8/2015 8:24 PM, Rob Sargent wrote:
>> I strongly suspect many would like further details on your implementation. A
>> user conference session, well recorded, perhaps?
> 
> the details are proprietary and quite specific to our workload and 
> requirements.
> 
> 
> 
> -- 
> john r pierce  37N 122W
> somewhere on the middle of the left coast
> 
> 
> 





Re: [GENERAL] Creating composite keys from csv

2015-03-08 Thread Roxanne Reid-Bennett

On 3/8/2015 11:49 PM, Eli Murray wrote:

> Hi all,
>
> I'm a student journalist working on a project for our student paper
> which lists salaries and positions for every staff member at the
> university. We received the data from an FOI request but the
> university is refusing to give us primary keys for the data.
>
> The issue we've run into is that if there are two staff members with
> the same name (and there are) our current web app adds their salaries
> together and considers them one person. Now, luckily, we can create a
> composite key if we combine their name column with their salary
> column. Unfortunately, the format of the data we have makes it more
> difficult than that (of course!) because some employees can hold
> multiple paying positions.


Take a look at the window functions:

http://www.postgresql.org/docs/9.4/static/functions-window.html
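For example, dense_rank() hands out one number per distinct (name, total
salary) pair, which is exactly the "only increment for a new person"
behaviour (the staging table name here is made up):

SELECT dense_rank() OVER (ORDER BY name, total_salary) AS person_id,
       name, position, salary, total_salary
  FROM staging;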

Roxanne


> Here's some example data:
>
> Name, Position, Salary, Total Salary, ...
> Jane Doe, Dean, 100.000, 148.000, ...
> John Locke, Custodian, 30.000, 30.000, ...
> Jane Doe, Academic Adviser, 48.000, 148.000, ...
> Jane Doe, Trainer, 46.000, 46.000, ...
>
> Basically, what we'd like to do is create a serial primary key but
> instead of having it increment every row, it needs to check the name
> and total salary columns and only increment if that person doesn't
> already exist. If they do exist, it should just assign the previously
> created number to the column. However, our team is small and between
> us we have very little experience working with databases and we
> haven't found a way to accomplish this goal yet. In fact, we may be
> trying to solve this in the wrong way entirely.
>
> So, to put it succinctly, how would you approach this problem? What
> are our options? Do we need to write a script to clean the data into
> separate csv tables before we import it to postgres, or is this
> something we can do in postgres? We'd really appreciate any help you
> all may be able to offer.
>
> Best!
> Eli Murray





--
[At other schools] I think the most common fault in general is to teach 
students how to pass exams instead of teaching them the science.
Donald Knuth





Re: [GENERAL] Creating composite keys from csv

2015-03-08 Thread David G Johnston
Eli Murray wrote
> Basically, what we'd like to do is create a serial primary key but instead
> of having it increment every row, it needs to check the name and total
> salary columns and only increment if that person doesn't already exist.

So you already have a PK, (Name, Total Salary), but the source data doesn't
provide a usable surrogate key to operate with.

I would create Person and Person-Position tables and, after importing the CSV
data to a Staging area, write a query to insert any unknown (Name, Total
Salary) records into Person with a serial PK field.

You can then join Person back onto Staging using (Name, Total Salary), but
now include the PK, and select just the PK, Position, and Salary fields,
which you can then add to the Person-Position table.

You now have a Person table with (PK, Name, Total Salary) and a
Person-Position table with (Person PK, Position, Position Salary), and you
can discard the imported CSV data.
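In sketch form, with illustrative names and types:

-- load the CSV into a staging area:
CREATE TABLE staging (
    name         text,
    position     text,
    salary       numeric,
    total_salary numeric
);
-- in psql: \copy staging FROM 'salaries.csv' WITH (FORMAT csv, HEADER)

CREATE TABLE person (
    person_id    serial PRIMARY KEY,
    name         text    NOT NULL,
    total_salary numeric NOT NULL,
    UNIQUE (name, total_salary)
);

CREATE TABLE person_position (
    person_id integer NOT NULL REFERENCES person,
    position  text    NOT NULL,
    salary    numeric NOT NULL
);

-- one Person row per distinct (Name, Total Salary):
INSERT INTO person (name, total_salary)
SELECT DISTINCT name, total_salary FROM staging;

-- join the staging rows back to pick up the surrogate key:
INSERT INTO person_position (person_id, position, salary)
SELECT p.person_id, s.position, s.salary
  FROM staging s
  JOIN person  p USING (name, total_salary);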

This solves the explicit problem given the assumption that (Name, Total
Salary) is indeed a uniquely identifying constraint.

David J.








Re: [GENERAL] Creating composite keys from csv

2015-03-08 Thread John R Pierce

On 3/8/2015 10:32 PM, David G Johnston wrote:

> This solves the explicit problem given the assumption that (Name, Total
> Salary) is indeed a uniquely identifying constraint.


that constraint seems flawed to me.


--
john r pierce  37N 122W
somewhere on the middle of the left coast


