Your results are close enough to mine, I think, to prove the point. And, I agree that the EDB benchmark is not necessary reflective of a real-world scenario. However, the cache I'm referring to is PG's shared_buffer cache. You can see the first run of the select causing a lot of disk reads. The second identical run, reads purely from shared_buffers. What I don't understand is, why does a slightly different select from the *same* table during the same session cause shared_buffers to be blown out and re-read?? I will see if I can try YCSB next week (I'm in workshops all week...) Thanks!
On Monday, March 14, 2016 3:34 AM, Dmitry Dolgov <9erthali...@gmail.com> wrote: Hi, Paul I agree with Oleg, EDB benchmarks are strange sometimes. I did the same benchmarks several months ago. I never noticed the cache influence back then, so I tried to reproduce your situation now (on a 5*10^6 records although). I started to play with db cache (using `echo 3 > /proc/sys/vm/drop_cache`), and I see difference in time execution for two subsequent queries, but `explain` info are almost identical, e.g. `shared hit & read`: ```benchmark=# explain (buffers, analyze, verbose) select data from json_tables where data @> '{"name": "AC3 Case Red"}'; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on public.json_tables (cost=102.74..19001.47 rows=4999 width=1257) (actual time=740.556..215956.655 rows=454546 loops=1) Output: data Recheck Cond: (json_tables.data @> '{"name": "AC3 Case Red"}'::jsonb) Rows Removed by Index Recheck: 2114606 Heap Blocks: exact=31624 lossy=422922 Buffers: shared hit=1371 read=455551 -> Bitmap Index Scan on json_tables_idx (cost=0.00..101.49 rows=4999 width=0) (actual time=731.010..731.010 rows=454547 loops=1) Index Cond: (json_tables.data @> '{"name": "AC3 Case Red"}'::jsonb) Buffers: shared hit=1371 read=1005 Planning time: 6.352 ms Execution time: 216075.830 ms(11 rows) benchmark=# explain (buffers, analyze, verbose) select data from json_tables where data @> '{"name": "AC3 Case Red"}'; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on public.json_tables (cost=102.74..19001.47 rows=4999 width=1257) (actual time=222.476..10692.703 rows=454546 loops=1) Output: data Recheck Cond: (json_tables.data @> '{"name": "AC3 Case Red"}'::jsonb) Rows Removed by Index Recheck: 2114606 Heap Blocks: exact=31624 lossy=422922 Buffers: shared hit=1371 read=455551 -> Bitmap Index Scan on json_tables_idx (cost=0.00..101.49 rows=4999 width=0) (actual time=214.736..214.736 rows=454547 loops=1) Index Cond: (json_tables.data @> '{"name": "AC3 Case Red"}'::jsonb) Buffers: shared hit=1371 read=1005 Planning time: 0.089 ms Execution time: 10767.739 ms(11 rows)``` But I see almost the same execution time from mongodb `explain` (216075ms for pg and 177784ms for mongo, which isn't so much I think): ```DBQuery.shellBatchSize = 10000000000; db.json_tables.find({"name": "AC3 Case Red"}).explain(true){ "queryPlanner" : { "plannerVersion" : 1, "namespace" : "benchmark.json_tables", "indexFilterSet" : false, "parsedQuery" : { "name" : { "$eq" : "AC3 Case Red" } }, "winningPlan" : { "stage" : "FETCH", "inputStage" : { "stage" : "IXSCAN", "keyPattern" : { "name" : 1 }, "indexName" : "name_1", "isMultiKey" : false, "isUnique" : false, "isSparse" : false, "isPartial" : false, "indexVersion" : 1, "direction" : "forward", "indexBounds" : { "name" : [ "[\"AC3 Case Red\", \"AC3 Case Red\"]" ] } } }, "rejectedPlans" : [ ] }, "executionStats" : { "executionSuccess" : true, "nReturned" : 454546, "executionTimeMillis" : 177784, "totalKeysExamined" : 454546, "totalDocsExamined" : 454546, "executionStages" : { "stage" : "FETCH", "nReturned" : 454546, "executionTimeMillisEstimate" : 175590, "works" : 454547, "advanced" : 454546, "needTime" : 0, "needYield" : 0, "saveState" : 8638, "restoreState" : 8638, "isEOF" : 1, "invalidates" : 0, "docsExamined" : 454546, "alreadyHasObj" : 0, "inputStage" : { "stage" : "IXSCAN", "nReturned" : 454546, "executionTimeMillisEstimate" : 700, "works" : 454547, "advanced" : 454546, "needTime" : 0, "needYield" : 0, "saveState" : 8638, "restoreState" : 8638, "isEOF" : 1, "invalidates" : 0, "keyPattern" : { "name" : 1 }, "indexName" : "name_1", "isMultiKey" : false, "isUnique" : false, "isSparse" : false, "isPartial" : false, "indexVersion" : 1, "direction" : "forward", "indexBounds" : { "name" : [ "[\"AC3 Case Red\", \"AC3 Case Red\"]" ] }, "keysExamined" : 454546, "dupsTested" : 0, "dupsDropped" : 0, "seenInvalidated" : 0 } }, "allPlansExecution" : [ ] }, "serverInfo" : { "host" : "ip-172-30-0-236", "port" : 27017, "version" : "3.2.4", "gitVersion" : "e2ee9ffcf9f5a94fad76802e28cc978718bb7a30" }, "ok" : 1}``` I not missed anything, am I right? Are you sure that it took much more time for PostgreSQL?Besides, everything is fine for queries with more small results (while the query {"name": "AC3 Case Red"} is almost 1/10 of entire dataset): ```=# insert into json_tables values('{"name": "test name"}'::jsonb); =# explain (buffers, analyze, verbose) select data from json_tables where data @> '{"name": "test name"}'; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------- Bitmap Heap Scan on public.json_tables (cost=62.75..18965.16 rows=5000 width=1257) (actual time=0.020..0.021 rows=1 loops=1) Output: data Recheck Cond: (json_tables.data @> '{"name": "test name"}'::jsonb) Heap Blocks: exact=1 Buffers: shared hit=5 -> Bitmap Index Scan on json_tables_idx (cost=0.00..61.50 rows=5000 width=0) (actual time=0.011..0.011 rows=1 loops=1) Index Cond: (json_tables.data @> '{"name": "test name"}'::jsonb) Buffers: shared hit=4 Planning time: 1.164 ms Execution time: 0.045 ms(10 rows)``` As far as I know there isn't much to do about caching. I don't know if it's appropriate, but you can manually warm-up the cache (something like `cat /var/lib/postgresql/9.5/main/base/*/* > /dev/null`). On 14 March 2016 at 00:30, Oleg Bartunov <obartu...@gmail.com> wrote: On Mar 11, 2016 4:40 PM, "Paul Jones" <p...@cmicdo.com> wrote: > > I have been running the EDB benchmark that compares Postgres and MongoDB. > I believe EDB ran it against PG 9.4 and Mongo 2.6. I am running it > against PG 9.5.1 and Mongo 3.2 with WiredTiger storage using 10,000,000 > JSON records generated by the benchmark. It looks like Mongo is winning, > and apparently because of its cache management.Dmitry was working on the same > benchmarks. I think edb benchmark is broken by design. Better, use ycsb > benchmarks. I hope, Dmitry will share his results.> > The first queries on both run in ~30 min. And, once PG fills its cache, > it whips Mongo on repeats of the *same* query (vmstat shows no disk > reads for PG). > > However, when different query on the same table is issued to both, > vmstat shows that PG has to read the *entire* table again, and it takes > ~30 min. Mongo does a lot of reads initially but after about 5 minutes, > it stops reading and completes the query, most likely because it is > using its cache very effectively. > > Host: Virtual Machine > 4 CPUs > 16 Gb RAM > 200 Gb Disk > RHEL 6.6 > > PG: 9.5.1 compiled from source > shared_buffers = 7GB > effectve_cache_size = 12GB > > Mongo: 3.2 installed with RPM from Mongo > > In PG, I created the table by: > > CREATE TABLE json_tables > ( > data JSONB > ); > > After loading, it creates the index: > > CREATE INDEX json_tables_idx ON json_tables USING GIN (data jsonb_path_ops); > > After a lot of experimentation, I discovered that the benchmark was not > using PG's index, so I modified the four queries to be: > > SELECT data FROM json_tables WHERE data @> '{"brand": "ACME"}'; > SELECT data FROM json_tables WHERE data @> '{"name": "Phone Service Basic > Plan"}'; > SELECT data FROM json_tables WHERE data @> '{"name": "AC3 Case Red"}'; > SELECT data FROM json_tables WHERE data @> '{"type": "service"}'; > > Here are two consecutive explain analyze for PG, for the same query. > No functional difference in the plans that I can tell, but the effect > of PG's cache on the second is dramatic. > > If anyone has ideas on how I can get PG to more effectively use the cache > for subsequent queries, I would love to hear them. > > ------- > > benchmark=# explain analyze select data from json_tables where data @> > '{"name": "AC3 Case Red"}'; > > QUERY PLAN > > > ----------------------------------------------------------------------------------------------------------------------------------------- > > Bitmap Heap Scan on json_tables (cost=113.50..37914.64 rows=10000 width=1261) > (actual time=2157.118..1259550.327 rows=909091 loops=1) > Recheck Cond: (data @> '{"name": "AC3 Case Red"}'::jsonb) > Rows Removed by Index Recheck: 4360296 > Heap Blocks: exact=37031 lossy=872059 > -> Bitmap Index Scan on json_tables_idx (cost=0.00..111.00 rows=10000 > width =0) (actual time=2141.250..2141.250 rows=909091 loops=1) > Index Cond: (data @> '{"name": "AC3 Case Red"}'::jsonb) > Planning time: 291.932 ms > Execution time: 1259886.920 ms > (8 rows) > > Time: 1261191.844 ms > > benchmark=# explain analyze select data from json_tables where data @> > '{"name": "AC3 Case Red"}'; > QUERY PLAN > > --------------------------------------------------------------------------------------------------------------------------------------- > > Bitmap Heap Scan on json_tables (cost=113.50..37914.64 rows=10000 > width=1261) (actual time=779.261..29815.262 rows=909091 loops=1) > Recheck Cond: (data @> '{"name": "AC3 Case Red"}'::jsonb) > Rows Removed by Index Recheck: 4360296 > Heap Blocks: exact=37031 lossy=872059 > -> Bitmap Index Scan on json_tables_idx (cost=0.00..111.00 rows=10000 > width =0) (actual time=769.081..769.081 rows=909091 loops=1) > Index Cond: (data @> '{"name": "AC3 Case Red"}'::jsonb) > Planning time: 33.967 ms > Execution time: 29869.381 ms > > (8 rows) > > Time: 29987.122 ms > > > -- > Sent via pgsql-general mailing list (pgsql-general@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-general