* PG has pgbench; *maybe* you can hack it to work on Oracle.
* If you want to know how well an RDBMS will work on your workload, then you must provide it with a simulated workload. Right?
* AWS RDS Postgresql has a dashboard that *might* be similar to AWR. Or it might not...
* We departitioned because SELECT statements were *slow*. All partitions were scanned, even when the partition key was specified in the WHERE clause.
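On that last point, partition pruning is easy to verify directly with EXPLAIN. Below is a minimal sketch; the table, columns, and dates are invented for illustration. Whether pruning happens depends on how the partition key appears in the WHERE clause: a plain comparison on the key lets the planner prune, while wrapping the key in an expression forces a scan of every partition.

    -- Invented daily range-partitioned table, for illustration only.
    CREATE TABLE txn (
        txn_id bigint        NOT NULL,
        txn_ts timestamptz   NOT NULL,
        amount numeric(12,2)
    ) PARTITION BY RANGE (txn_ts);

    CREATE TABLE txn_2023_12_09 PARTITION OF txn
        FOR VALUES FROM ('2023-12-09') TO ('2023-12-10');
    CREATE TABLE txn_2023_12_10 PARTITION OF txn
        FOR VALUES FROM ('2023-12-10') TO ('2023-12-11');

    -- A plain range comparison on the partition key can be pruned:
    -- the plan should show only txn_2023_12_10 being scanned.
    EXPLAIN (COSTS OFF)
    SELECT * FROM txn
    WHERE txn_ts >= '2023-12-10' AND txn_ts < '2023-12-11';

    -- Wrapping the key in an expression defeats pruning, and every
    -- partition is scanned even though the key appears in WHERE:
    EXPLAIN (COSTS OFF)
    SELECT * FROM txn
    WHERE date_trunc('day', txn_ts) = '2023-12-10';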
On Sun, Dec 10, 2023 at 8:45 AM veem v <veema0...@gmail.com> wrote:

> Thank you so much, Ron. I have some more doubts related to this.
>
> We were wondering whether there is any utility in PG with which we can generate a large sample data volume, which we could load into our on-premise Oracle Exadata box and also into Aurora PostgreSQL in the cloud, to compare read and write performance. Does any such utility exist, so we can quickly get a large data volume? But I think what you are saying is that we should use real data to get the actual (or closest possible) benchmarks; correct me if I'm wrong.
>
> We used to look at the data dictionary views (the AWR views) in Oracle to see current and historical performance statistics such as CPU, IO, memory usage, object-level contention etc. Do we have such a thing available in Aurora PostgreSQL, so we can monitor the load test, get some idea of how well it goes, and see how much capacity is available or whether we are saturating it?
>
> When you said "*Beware of excessive partitioning. We had to "departition" most tables, because of the impact on performance.*": as I understand it, partitioning reduces IO while reading, because partition pruning means less data is scanned, and while writing there is only minimal overhead in identifying the partition the INSERTed/UPDATEd/DELETEd rows will land in. So can you please help me understand what exact performance impact you are highlighting here? Are you talking about the size of the partitions or the total count of partitions? In our case we are planning either daily or hourly range partitions, depending on data volume; I am not sure whether there is a sweet spot for partition size in PostgreSQL. If you are pointing to a high partition count, then in our case, persisting ~90 days of data, a transaction table would have ~90 daily range partitions or ~2160 hourly range partitions in Aurora PostgreSQL. It would be helpful if you could explain what exact performance impact you faced with partitioning.
>
> *"Use ora2pg to export views in the Oracle database. It's very easy; a small EC2 VM running Linux with enough disk space lets you automate the extraction from Oracle and importation into AWS Postgresql."*
>
> I need to explore this a bit more, I believe. We have an on-premise Oracle database, so we can move data directly to Aurora PostgreSQL in the cloud. Another thing: we have some sample data available in AWS Snowflake, but I am not sure whether some mechanism is available to move that data to Aurora PostgreSQL.
>
> On Sun, 10 Dec 2023 at 02:27, Ron Johnson <ronljohnso...@gmail.com> wrote:
>
>> I don't know anything about Aurora; I only have experience with RDS Postgresql.
>>
>> We successfully migrated from on-prem Oracle (12c, I think) to RDS Postgresql 12, and were very happy: little down time (I take pride in designing/implementing that; thanks, ora2pg!), with lower disk usage (8 TB, down to 5 TB) and lower CPU usage.
>>
>> I'm not sure what the TPS was in Oracle, but the server level "we" are on (48 vCPU / 132 GB RAM, I think, with 10,000 IOPS; I'm no longer with that company and don't remember the specifics) is over-powered for *our* needs.
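On veem's first two questions above (quickly generating a large sample data volume, and an AWR-style view of database activity), here is a minimal sketch of the PostgreSQL-side options. The table, columns, and row count are invented for illustration; pg_stat_statements has to be enabled first (both RDS and Aurora PostgreSQL support it via the parameter group plus CREATE EXTENSION); and the dashboard Ron mentions is presumably RDS Performance Insights.

    -- Fabricate bulk test rows with generate_series into an invented table.
    CREATE TABLE txn_sample (
        txn_id  bigint PRIMARY KEY,
        txn_ts  timestamptz NOT NULL,
        acct_id integer     NOT NULL,
        amount  numeric(12,2)
    );

    INSERT INTO txn_sample (txn_id, txn_ts, acct_id, amount)
    SELECT g,
           now() - random() * interval '90 days',
           (random() * 1000000)::int,
           round((random() * 10000)::numeric, 2)
    FROM generate_series(1, 10000000) AS g;   -- adjust the volume as needed

    -- Rough AWR-style "top SQL by total time" report from pg_stat_statements
    -- (column names as of PostgreSQL 13+; version 12 uses total_time).
    SELECT query, calls, total_exec_time, mean_exec_time, rows
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10;

Other cumulative views (pg_stat_activity, pg_stat_user_tables, pg_stat_bgwriter) cover sessions, per-table IO, and checkpoint activity, but there is no built-in AWR-style periodic snapshot report; on AWS, Performance Insights and CloudWatch cover the host- and wait-level metrics.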
>> You're going to have to spin up a full-sized instance, import a *lot* of real data(*) into a production-identical schema, and then run your batch load process using test data (or copies of real batches). That's the only way you'll *really* know.
>>
>> Beware of excessive partitioning. We had to "departition" most tables, because of the impact on performance.
>>
>> (*) Use ora2pg to export views in the Oracle database. It's *very* easy; a small EC2 VM running Linux with enough disk space lets you automate the extraction from Oracle and importation into AWS Postgresql.
>>
>> On Sat, Dec 9, 2023 at 3:36 PM veem v <veema0...@gmail.com> wrote:
>>
>>> Thank you so much for the response.
>>>
>>> Got your point. We will check whether we really need details or just summaries for the historical data, but it looks like we will need detailed transaction data for ~2 years at least.
>>>
>>> My understanding was that AWS has two different offerings, and "Aurora PostgreSQL" is more performant and costlier than "RDS PostgreSQL". I will double-check this, though.
>>>
>>> However, how do we prove that RDS/Aurora PostgreSQL is going to serve our OLTP requirement here, similar to the current on-premise Oracle Exadata? For the OLTP use case we are expecting ~15K TPS write and ~2K TPS read, and the response times for the UI queries are expected to be sub-second. But yes, as I mentioned, individual transactions will be batched before being written to the database, so this should create less resource consumption and contention.
>>>
>>> To test whether Aurora PostgreSQL can cater to the above needs (with the expected performance at nominal cost), how should we test it? Since we won't be able to test everything right away, should we test basic read and write performance and benchmark that, to gain some confidence and go ahead with development?
>>>
>>> Say, for example, one transaction consists of ~8 INSERTs. We could create a sample target table on Aurora PostgreSQL with the required indexes/constraints, run those inserts from multiple threads (for concurrency) using BlazeMeter at a set TPS, and see/compare the response time, CPU, IO, and memory usage of the PostgreSQL database. Similarly, to measure read performance, we could run multiple SELECT queries from BlazeMeter and compare the response times. Is this the correct approach for validating the database here, or does some other approach exist?
>>>
>>> Another question coming to mind: I have read in the past that VACUUM can be a problem in PostgreSQL. Is it going to give trouble in Aurora PostgreSQL too, for such a highly transactional read/write system? How do we test/validate that?
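For the load-test question above, pgbench can drive a custom script against the RDS/Aurora endpoint as an alternative (or complement) to BlazeMeter. Below is a minimal sketch; every table name, column, and number is invented for illustration, the two tables are assumed to already exist, and "$DB_URI" stands in for your connection string.

    -- txn_test.sql: a custom pgbench script.  Run it with something like
    --   pgbench -n -f txn_test.sql -c 64 -j 8 -T 300 -P 10 "$DB_URI"
    -- and watch the reported TPS and latency while also watching
    -- CPU/IO/memory on the instance.
    -- Assumes hypothetical tables txn_header(acct_id, txn_ts, amount)
    -- and txn_detail(acct_id, txn_ts, line_no, amount) already exist.

    \set acct random(1, 1000000)
    \set amt  random(1, 1000000)

    BEGIN;
    INSERT INTO txn_header (acct_id, txn_ts, amount)
    VALUES (:acct, now(), :amt / 100.0);
    INSERT INTO txn_detail (acct_id, txn_ts, line_no, amount)
    VALUES (:acct, now(), 1, :amt / 100.0);
    INSERT INTO txn_detail (acct_id, txn_ts, line_no, amount)
    VALUES (:acct, now(), 2, :amt / 100.0);
    -- ...add the rest of the 7-8 per-transaction inserts you want to model...
    COMMIT;

The same harness works for the read side: point -f at a script of representative SELECTs. None of this replaces a test against a production-identical schema with real batches, as Ron says above; it just gives an early, cheap signal.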
>>> On Sun, 10 Dec 2023 at 01:29, Ron Johnson <ronljohnso...@gmail.com> wrote:
>>>
>>>> On Sat, Dec 9, 2023 at 2:13 PM veem v <veema0...@gmail.com> wrote:
>>>>
>>>>> Ron Johnson <ronljohnso...@gmail.com> wrote:
>>>>>
>>>>>> "OK" is relative, but it's what we did in a similar situation: two years of data on-line and 5 years of data in compressed files in S3. (We're required to keep data for 7 years, but they *never* ask for records more than 2 years old. If they ever do, we'll manually load the relevant data back into PG.)
>>>>>>
>>>>>> (I can't imagine that querying billions of unindexed flat-file records via S3 would be fast.)
>>>>>>
>>>>>> How often do end users look for data more than 90 days old? Two years old?
>>>>>>
>>>>>> How quickly does the old data need to be available?
>>>>>>
>>>>>> Maybe a four-tiered system of PG-Snowflake-S3-S3_Glacier would balance speed and cost (or maybe Snowflake would just drive up costs).
>>>>>
>>>>> Thank you so much, Ron.
>>>>>
>>>>> When you said "*two years of data on-line and 5 years of data in compressed files in S3*", do you mean two years of data in Aurora PostgreSQL, which will be the OLTP database, and the rest dumped as-is into an S3 bucket, and whenever we need to query that data we simply read those S3 files and load them back into Aurora PostgreSQL?
>>>>
>>>> (RDS Postgresql, not Aurora, but that's beside the point.)
>>>>
>>>> Yes.
>>>>
>>>> But keep reading...
>>>>
>>>>> In the currently running Oracle Exadata system, we have SIX months of data (~200 TB), and the transaction tables are all range-partitioned on a daily basis. Of that, ~2 months of data gets queried frequently and the other ~4 months less frequently. However, in the target architecture we want to build in the cloud, there are requirements for the analytics/data science team to query ~3 years of history. Beyond ~3 years we may need the data only rarely.
>>>>>
>>>>> We were initially thinking of having one database serve both the OLTP and OLAP use cases (somewhat like Oracle Exadata currently does for us), but it seems we don't have such an option on AWS: PostgreSQL would serve the OLTP use case, whereas Snowflake would serve the OLAP use case.
>>>>>
>>>>> So do you suggest using both databases: recent transaction data for the last 3 months streamed to Aurora PostgreSQL; data from 3 months to 3 years parked in Snowflake, which will serve the OLAP/analytics use case; and data from 3 years to 10 years kept in S3 (as Parquet or Iceberg) so that Snowflake can query it directly when needed?
>>>>>
>>>>> OR
>>>>>
>>>>> Do you suggest keeping just the last ~3 months of data in Aurora PostgreSQL and everything else in Snowflake, considering that it stores data in a compressed format and its storage is cheap ($23 per TB per month)?
>>>>>
>>>>> A few colleagues are pointing to Databricks for the analytics use case. Is that a good option here?
>>>>
>>>> I can't answer that without knowing what the end users actually need (details, or just summaries of historical data, in different tiers).
>>>>
>>>> You all will have to do the cost:benefit analysis of different architectures.
>>>>
>>>>> On Sat, 9 Dec 2023 at 16:43, veem v <veema0...@gmail.com> wrote:
>>>>>
>>>>>> Hello All,
>>>>>>
>>>>>> Although this isn't exactly specific to open-source PostgreSQL, I want to ask the question here to get the views of colleagues with decades of experience in the database world. We would like some guidance on whether the design below looks okay for our customer use case.
>>>>>>
>>>>>> We currently have financial-systems transaction data streaming into Oracle Exadata (X9) on-premise. This database supports processing of 400 million transactions per day. A single transaction for us is a combination of 7-8 inserts into different transaction tables, with indexes, unique constraints etc. defined on them. Transactions are processed/committed in batches (~1000 batch size) in the database, and the system persists data for ~6 months.
>>>>>> We do have all sorts of OLAP applications (daily/monthly batch reports) running on the same database, along with some user-facing UI applications showing customer transactions. So it is currently serving a hybrid workload and is a one-stop solution for all use cases.
>>>>>>
>>>>>> Many of the applications are moving from on-premise to the AWS cloud as part of a modernization journey, with AWS as the chosen cloud partner. The product is also expected to expand to more regions, so this system is expected to serve an increased transaction volume. We also have a requirement to persist transaction data for ~10 years, so it is available for analytics/data science use cases.
>>>>>>
>>>>>> So the team is thinking of splitting it into two parts:
>>>>>>
>>>>>> 1) An OLTP-type use case, in which we persist/write the transaction data quickly and show it to the UI-related apps in near real time (the quickest possible time). This database will store at most 60-90 days of transaction data. I am not sure whether there is an Oracle Exadata equivalent on AWS, so the team is planning to use/experiment with Aurora PostgreSQL. Please correct me if there are other options we should consider instead.
>>>>>>
>>>>>> 2) Then move the data beyond ~90 days into another database, or into S3 object storage, which will keep it for ~10 years and make it queryable through the necessary APIs. That is supposed to cater to the OLAP/analytics/data science use cases.
>>>>>>
>>>>>> Is the above design okay? Also, regarding the second point, i.e. persisting the historical data (in a queryable state), should we go for a database like Snowflake, or just keep it on S3 as-is and make it queryable through APIs? Please advise.
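On the ~90-day roll-off in point 1, one common pattern with declarative partitioning is: detach the aged daily partition, export it, push the file to S3, and drop the local table. A minimal sketch, assuming a daily range-partitioned table like the txn example near the top of this thread; the partition name and file path below are invented.

    -- Detaching a partition is a quick metadata-level operation that
    -- removes the day from the OLTP table without rewriting rows.
    ALTER TABLE txn DETACH PARTITION txn_2023_09_10;

    -- psql's \copy streams the detached partition to a local file,
    -- which can then be compressed and uploaded to S3 (for example
    -- with the aws CLI) for the analytics tier or a later reload.
    \copy txn_2023_09_10 TO 'txn_2023_09_10.csv' WITH (FORMAT csv, HEADER)

    -- Once the export is verified, drop the partition locally.
    DROP TABLE txn_2023_09_10;

Loading an old day back (Ron's "we'll manually load the relevant data back into PG") is roughly the reverse: create an empty table with the parent's structure, \copy the file back in, and ATTACH PARTITION it for the matching date range.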