2014-11-02 19:16 GMT-02:00 Mike Wilson :
> Thanks for the information Greg.
>
> Unfortunately modifying the application stack this close to the holiday
> season won’t be an option so I’m left with:
>1) Trying to optimize the settings I have for the query mix I have.
>2) Optimize any long running DML queries (if any) to prevent lag due to
> locks.
>3) Getting a better understanding of “what” causes lag.
>
> #3 will probably be central to at least minimizing lag during heavy DML
> load. If anyone has a good resource to describe when a slave would start
> to lag potentially that would help me hunt for the cause. I know long
> running DML on the master may cause lag but I’m uncertain as to the
> specifics of why. During periods of lag we do have more DML than usual
> running against the master but the queries themselves are very quick
> although there might be 20-30 DML operations per second against some of our
> central tables that store user account information. Even under heavy DML
> the queries still return in under a second. Possibly a large volume of of
> short running DML cause replication lag issues for large tables (~20M)?
>
> Thanks again for your help. BDR looks interesting but probably too
> cutting edge for my client.
>
> Mike Wilson
>
>
>
>
> On Nov 2, 2014, at 12:33 PM, Greg Spiegelberg
> wrote:
>
> Hi Mike,
>
> Sounds very familiar. Our master fans out to 16 slaves (cascading) and we
> had great success with segregating database queries to different slaves and
> some based on network latency. I'd suggest, if possible, alter the
> application to use the slave for simple SELECT's and FUNCTION's performing
> SELECT-like only work while limiting those applications and queries that
> perform DML to the master (obviously). If the load on the slave increases
> too much, spin up another slave. I'd mention from experience that it could
> be the load on the slave that is giving the appearance of replication lag.
> This is what led us to having (1) slave per application.
>
> There is also the BDR multi-master available in 9.4beta if you're wanting
> to live on the edge.
>
> -Greg
>
> On Sat, Nov 1, 2014 at 4:33 PM, Mike Wilson wrote:
>
>> I have two 9.3.4 PG instances that back a large internet website that has
>> very seasonal traffic and can generate large query loads. My instances are
>> in a master-slave streaming replication setup and are stable and in
>> general perform very well. The only issues we have with the boxes is that
>> when the master is busy the slave may start to lag excessively. I can give
>> specifics as to what heavily loaded means and additionally the
>> postgresql.conf for both boxes but my basic questions are:
>>* What causes streaming replication lag to increase?
>>* What parameters can be tuned to reduce streaming replication lag?
>>* Can a loaded slave affect lag adversely?
>>* Can increasing max_wal_senders help reduce lag?
>>
>> The reason I ask this is that as mentioned above the servers are stable
>> and are real troopers in general as they back a very popular web site that
>> puts the master under heavy seasonal load at times. At those times though
>> we see an almost exponential growth in streaming replication lag compared
>> to load on the master.
>>
>> For example, the master is a very beefy Solaris:
>>* 4 Recent Intel Zeons (16 physical cores)
>>* 256 GB of ECC RAM
>>* 12 TB of ZFS (spindle and SSD internal storage)
>>* DB on disk size is 2TB
>>* ZFS ARC cache of roughly 250G.
>>* ZFS ARC2/ZIL configured for SSD’s (this is awesome by the way)
>>
>> Basic PG Config:
>>shared_buffers = 2GB
>>work_mem = 128MB
>>max_connections = 1700 (supports roughly 100 web servers)
>>wal_keep_segments = 256 (roughly enough for 24 hours of operation
>> under heavy load)
>>wal_sender_timeout = 60s
>>replication_timeout=(not set)
>>wal_receiver_status_interval=10s
>>max_wal_senders=6
>>* wal archiving is off
>>* 98% of the queries on the master complete in under 500ms.
>>* No hung or very long running queries in general.
>>
>> The master on a normal day maintains a load of about 0.5, during which
>> replication lag to the slave is in hundreds milliseconds. When the
>> production db server is heavily hit though the load may go as high as 4 on
>> the master and the streaming replication lag may increase to more than 2
>> hours relatively quickly. Load on the slave is generally below 1 even when
>> the master is heavily loaded. The traffic to the master is primarily read
>> with about 10% DML (new users, purchase records, etc). DML statements
>> increase proportionally when under load though. The master and slave are
>> connected via dedicated 10G fiber link and even under heavy load the
>> utilization of the link is nowhere near close to saturation. BTW, the
>> slave does run some reported related queries throughout the day that might
>> take up to a minute to complete.
>>