Loading table with indexed jsonb field is stalling

2019-05-17 Thread Will Hartung
I am trying to load data into a table with a jsonb field that is indexed
as gin (field jsonb_path_ops).

It's a large table, and I'm loading it 100K rows at a time. Eventually, it
would be 30M+ rows in the table.

Originally I simply loaded the table and then tried to create the index,
but it never finished.

So, I'm trying to load it incrementally.

I have 2.8M rows in the table so far. The jsonb field size is, on average,
1600 bytes, with the largest (of the 2.8M loaded) at 1930 bytes. Simply, these
are not large structures.

The first batches took varying times to load: most under 1 minute, some as
long as half an hour.

The current file is "stuck", pushing past 20hrs so far.

The VM only has 4G of RAM; it is certainly "busy", but it is not swapping
(not at the OS level).

Here is a recent top:

top - 11:34:01 up 1 day,  1:49,  2 users,  load average: 5.84, 4.94, 4.52
Tasks: 103 total,   1 running,  59 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  1.0 sy,  0.0 ni,  0.0 id, 95.3 wa,  0.0 hi,  3.7 si,  0.0 st
KiB Mem :  4040212 total,   152336 free,   181792 used,  3706084 buff/cache
KiB Swap:  4194300 total,  4189948 free, 4352 used.  3443628 avail Mem

Postgres is pretty much at its default configuration; I have not tweaked any of
the memory settings (such as work_mem).
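
For reference, the settings in question can be checked from psql with the
standard SHOW command, e.g.:

SHOW shared_buffers;
SHOW work_mem;
SHOW maintenance_work_mem;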

My Mac OS host isn’t that busy either, but the VM adds some load, and it's
not thrashing.

I am loading the file in 100K-row chunks; here are the times each respective
chunk has taken to actually load:

0:46s
3:17s
8:12s
9:54s
14:09s
12:07s
18:50s
9:01s
25:28s
38:49s
25:24s
1:21s
0:47s
0:32s
0:39s
0:31s
0:31s
0:28s
0:29s
0:28s
0:21s
0:27s
0:36s
0:22s
0:27s
0:20s
0:21s
0:19s
2:16:21s  <— the last to date, but this was yesterday, now it's past 20hrs

It stalled early on, then sped back up, and has now run headlong into the
cliff it's currently stuck at.

The only reason this is on a Linux VM is that I saw similar behavior
running native Postgres 9.6 on Mac OS (the host). It didn’t make any sense
that Mac OS would be causing this, but, who knows. Try it and see.

Since the start of the load of the stalled piece, something has consumed
over 800M of storage. I can't say what; I did not check in any finer detail
where the storage was going.

I do not understand why this suddenly falls over a cliff. The JSONs are not
large, so I don’t see how any individual one could crush the memory
subsystem. I have to assume that PG is somehow internally thrashing or
paging or something. I appreciate that the db is not tuned, but I would not
expect that it would struggle so to create this index, with these values,
and such small JSON payloads.

Also, it’s not unique to the 29th piece. I’ve tried in the past to skip the
offending chunks, and it still failed. This has been happening for some time
(months), but I keep putting it aside.

Any insight is helpful. My biggest fear is that, should it ever come to that,
we will not be able to reload this table during some future crisis.

Thanks.


Re: Loading table with indexed jsonb field is stalling

2019-05-20 Thread Will Hartung
I sent these earlier, but they didn’t get to the list, so I’m resending — sorry 
about that.

> On May 17, 2019, at 7:53 PM, Tim Cross  wrote:
> 
> Which version of postgres?

 PostgreSQL 10.8 (Ubuntu 10.8-0ubuntu0.18.04.1) on x86_64-pc-linux-gnu, 
compiled by gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0, 64-bit

> 
> How are you loading the data? (application, psql, pg_restore) using
> (insert, copy)?


I have a SQL file I’m using filled with statements like:

\copy eis_entity from program 'gzcat /u2/hdhdata/xxx/out1000.txt.gz';
select now();
\copy eis_entity from program 'gzcat /u2/hdhdata/xxx/out1001.txt.gz';
select now();

Thanks



Re: Loading table with indexed jsonb field is stalling

2019-05-20 Thread Will Hartung
I sent this earlier, but it did not make it to the list, so I’m resending it.

> On May 18, 2019, at 6:55 AM, Adrian Klaver  wrote:

> Is the jsonb field the only field in the table?
> 
> Can we see the table schema?

# \d eis_entity
                         Table "public.eis_entity"
          Column           |           Type           | Modifiers
---------------------------+--------------------------+-----------
 eis_entity_key            | numeric                  | not null
 eis_identifier_domain_key | numeric                  | not null
 global_id                 | character varying(80)    | not null
 local_id                  | character varying(80)    | not null
 eis_status_key            | numeric                  | not null
 eis_entity_type_key       | numeric                  | not null
 eis_entity_domain_key     | numeric                  | not null
 all_trait_hash            | numeric(10,0)            |
 searchable_trait_hash     | numeric(10,0)            |
 created_date              | timestamp with time zone |
 traits                    | jsonb                    |
 current_revision          | numeric(8,0)             | not null
 current_version_label     | character varying(10)    | not null
Indexes:
    "eis_ent_traits" gin (traits jsonb_path_ops)

Thanks.



Re: Loading table with indexed jsonb field is stalling

2019-05-20 Thread Will Hartung
I sent this earlier, but it did not make it to the list, so I’m resending it.

> On May 18, 2019, at 11:02 AM, Tom Lane  wrote:
> 
> In addition to the questions about what PG version you're using, is
> the backend process that's doing the load actually consuming CPU time,
> or is it just sitting?

It’s consuming both CPU and disk space.

Thanks.





Re: Loading table with indexed jsonb field is stalling

2019-05-20 Thread Will Hartung


> On May 20, 2019, at 11:13 AM, Adrian Klaver  wrote:
> 
> What do the below show:
> 
> 1) ps ax | grep postgres
> 
> 2) As superuser:
>   select * from pg_stat_activity ;

I’ll restart the process and report back.

Thanks.







Re: Loading table with indexed jsonb field is stalling

2019-05-20 Thread Will Hartung


> On May 20, 2019, at 11:12 AM, Tom Lane  wrote:
> 
> Hmm, so not a deadlock (plus you're running a version that has that
> bug fixed).  It sounds like you might have hit an actual bug, perhaps
> one that causes it to endlessly re-split an index page, or something
> along that line.  But it's going to be hard to find the cause without
> a test case.  I don't suppose you can anonymize your data to the
> point where you'd have a publishable test case?

That would take some time; thankfully it seems reproducible with a few million
rows.

I’ve moved the files around a little bit (not aggressively) and it still 
stalled, so I don’t think it’s something specific to the data. But we’ll have 
to play with it and see.

Thanks.





Re: Loading table with indexed jsonb field is stalling

2019-05-20 Thread Will Hartung

> On May 20, 2019, at 11:13 AM, Adrian Klaver  wrote:

> What do the below show:
> 
> 1) ps ax | grep postgres

$ ps -ax | grep postgres
 1171 ?        S      0:04 /usr/lib/postgresql/10/bin/postgres -D /var/lib/postgresql/10/main -c config_file=/etc/postgresql/10/main/postgresql.conf
 1420 ?        Ds     0:21 postgres: 10/main: checkpointer process
 1421 ?        Ss     0:38 postgres: 10/main: writer process
 1422 ?        Ss     0:22 postgres: 10/main: wal writer process
 1423 ?        Ss     0:04 postgres: 10/main: autovacuum launcher process
 1424 ?        Ss     0:04 postgres: 10/main: stats collector process
 1425 ?        Ss     0:00 postgres: 10/main: bgworker: logical replication launcher
15917 tty1     S      0:00 su - postgres
16300 ?        Rs    10:39 postgres: 10/main: postgres willtest 10.35.60.64(54594) COPY
16444 ?        Ss     0:08 postgres: 10/main: autovacuum worker process   willtest waiting
16633 tty1     S+     0:00 /usr/lib/postgresql/10/bin/psql
16641 ?        Ss     0:00 postgres: 10/main: postgres postgres [local] idle


> 2) As superuser:
>   select * from pg_stat_activity ;

(trimming the output to the relevant columns)

 datid | datname  |  pid  | wait_event_type |     wait_event      | state  | backend_xid | backend_xmin |                 query                  |    backend_type
-------+----------+-------+-----------------+---------------------+--------+-------------+--------------+----------------------------------------+---------------------
       |          |  1425 | Activity        | LogicalLauncherMain |        |             |              |                                        | background worker
       |          |  1423 | Activity        | AutoVacuumMain      |        |             |              |                                        | autovacuum launcher
 16384 | willtest | 16444 | Lock            | page                | active |         624 |          623 | autovacuum: ANALYZE public.eis_entity  | autovacuum worker
 13051 | postgres | 16889 |                 |                     | active |             |          623 | select * from pg_stat_activity;        | client backend
 16384 | willtest | 16300 | LWLock          | WALWriteLock        | active |         623 |          612 | COPY  eis_entity FROM STDIN ;          | client backend
       |          |  1421 | Activity        | BgWriterMain        |        |             |              |                                        | background writer
       |          |  1420 |                 |                     |        |             |              |                                        | checkpointer
       |          |  1422 |                 |                     |        |             |              |                                        | walwriter
(8 rows)


It’s been running for

Re: Loading table with indexed jsonb field is stalling

2019-05-20 Thread Will Hartung


> On May 20, 2019, at 2:36 PM, Ron  wrote:

> I think you answered this earlier, but does the same stalling happen when 
> indexes are dropped?

No, the data loads fine.

The way I originally stumbled upon this was that I had offloaded the data for 
some other kinds of testing.

There are several indexes in production, but for my local testing, I only had 
one on the pk.

Then, one day, I wanted to look for something, and decided I’d build the json 
index “real quick”.

Being me, it wasn’t until a few days later I’m like “why is my machine busy, 
what’s that grinding sound” (I rarely hear my drive, most work is done on SSD).

Then I realized that the index was still being built, several days later.

“Huh”

And I left it again.

I was, again, not really paying attention, and didn’t realize until 2 weeks 
after I had started that it was STILL running.

At that point I killed it.

Then I decided to break the file up and load it in chunks, see “how far” it was 
getting, etc. It did not take long for it to get stuck.

I tried it on different versions of PG, now on different OS’s (albeit one is in 
a VM). It wasn’t a crisis, just a curiosity.

But we had a scary event a couple of weeks ago that could have forced us to 
reload the table; thankfully it didn’t come to that, and we recovered. But it 
got me thinking about poking at it again.

A colleague is trying to pg_restore our production DB for a conversion test to 
AWS Aurora, and it’s been stuck for 43 hours on a 68M-row table. So, it’s 
systemic with our data, and I believe that we cannot currently reload this 
table. Which is scary.

And here we are.






Re: Loading table with indexed jsonb field is stalling

2019-05-20 Thread Will Hartung


> On May 20, 2019, at 2:55 PM, Ron  wrote:

> And it pathologically loads even when there's just a PK on the numeric field?

Yea, that works fine.

> Drop all indexes, load data, recreate indexes?

No, I use the incremental load as it gives a much better indication of when the 
process has gone off the deep end.

> But pg_restore only creates indexes AFTER the data loads.

Yes. Demonstrating that it doesn’t work either incrementally or all at once.

(No, I have no confidence that the Aurora build will ever finish in any 
reasonable amount of time.)






Re: Loading table with indexed jsonb field is stalling

2019-05-20 Thread Will Hartung

> On May 20, 2019, at 4:39 PM, Adrian Klaver  wrote:
> 
> So the issue is building the index(s) not loading the data, correct?

Correct.

Since I have been exploring this, I have not been able to successfully create a 
gin jsonb_path_ops index on a jsonb field, either by creating a new index on a 
table with existing data or by loading data into an empty table with the index 
already in place.

The create index does not finish on the existing data, and loading data does 
not finish when the index pre-exists.

We currently have a 68M row table loaded in an Aurora instance on AWS (I do not 
know the instance/machine size), and the index has been building for 48 hours 
now.

The pg_restore is currently working on that index:


CREATE INDEX eis_ent_traits
  ON public.eis_entity
  USING gin
  (traits jsonb_path_ops);




Re: Loading table with indexed jsonb field is stalling

2019-05-20 Thread Will Hartung

> On May 20, 2019, at 4:27 PM, Ron  wrote:
> 
> I'm confused.  You wrote above that loading without indexes and with just the 
> PK works just fine; if you really need it loaded in Aurora or production, 
> just drop the indexes beforehand?

Because the jsonb_path_ops index does not rebuild, and it’s the heart of that 
table.

The table is essentially worthless without that index to us.




Re: Loading table with indexed jsonb field is stalling

2019-05-20 Thread Will Hartung


> On May 20, 2019, at 5:15 PM, Ron  wrote:
> 
> Are there a sufficiently small number of elements in each traits object that 
> you can do something like this, on the UNINDEXED table?
> SELECT traits->element1, traits->element2, count(*)
> from eis_entry
> group by traits->element1, traits->element2;
> 
> Sure it would run for a long time, but might show you where the problem lies.

I don’t quite understand what you’re trying to achieve here. Is the intent 
essentially to access each individual element within the json in the select 
statement? There are at least 50 elements in this thing, and the structure is 
not flat, nor is it static and consistent. Similar, sure. Perfectly congruent, 
no.






Re: Loading table with indexed jsonb field is stalling

2019-05-20 Thread Will Hartung


> On May 20, 2019, at 5:14 PM, Adrian Klaver  wrote:
> 
> Well looks like you are down to Tom's suggestion of creating a test case. 
> Given that it seems confined to the jsonb field and corresponding index, I 
> would think that is all that is needed for the test case. Start with some 
> smaller subset, say 10,000 rows and work up till you start seeing an issue.

This will take quite some work, and I wouldn’t attempt it with less than 5M 
rows to load.






Re: Bulk inserts into two (related) tables

2019-05-21 Thread Will Hartung

> On May 21, 2019, at 9:56 AM, Rich Shepard  wrote:
> 
> I could insert all new rows into the organizations table, then produce a
> list of the org_id and org_name to manually insert the foreign key in the
> related people table. Time and effort intense.

You can either use a script for the entire process, or convert the people 
table to INSERT statements that have a SELECT for the foreign key as part of 
the insert:

INSERT INTO PEOPLE (id, name, org_id)
VALUES (1, 'Bob', (SELECT org_id FROM org WHERE org_name = 'Main Office'));
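
Or, as a rough sketch of doing both inserts in one statement with a 
data-modifying CTE (the table and column names here are illustrative, not from 
your schema):

-- Sketch only: illustrative table/column names.
WITH new_org AS (
    INSERT INTO organizations (org_name)
    VALUES ('Main Office')
    RETURNING org_id
)
INSERT INTO people (name, org_id)
SELECT p.name, new_org.org_id
FROM (VALUES ('Bob'), ('Alice')) AS p(name)
CROSS JOIN new_org;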




Re: Loading table with indexed jsonb field is stalling

2019-05-21 Thread Will Hartung


> On May 20, 2019, at 5:31 PM, Tom Lane  wrote:
> 
> Well, you're the only one who's seen this problem, and none of the
> rest of us have any idea how to reproduce it.  So if you want something
> to get done in a timely fashion, it's up to you to show us a test case.

So, we had success.

We looked into maintenance_work_mem.

By default, this value is 64MB.

Our test on AWS was against a small machine, 2 CPU, 16G.

My VM is however many cores it gets (not that it matters) and 4G of RAM.

My VM used the 64MB value for maintenance_work_mem. The AWS small VM used 247M.

We killed the rebuild on the small machine; it was pushing 67 hours.

I was running a load on my machine, and it was still making progress overnight, 
but it was at something like 17 hours. It was moving, but it had only loaded 
2.2M rows in that time.

We grabbed one of the huge AWS instances. 64 CPU, 488G of ram. Just, silly.

But, that’s fine — I’m good with silly.

Its maintenance_work_mem was ~8G.

And we loaded all of my files on that instance in about 5.5 hours, about 9M 
rows per hour.

So, obviously, maintenance_work_mem was the smoking gun. Since I don’t know the 
details of GIN indexing, it’s not clear to me how maintenance_work_mem is 
utilized during GIN index builds, but obviously 64M is “not enough”, nor is 
247M. And 8G is certainly enough.

We’re cautious about just setting these values “to 11”, because multiple 
connections can utilize them at once, so it seems to me that it’s important 
that they be “generous enough” but not overly generous.

So this is good news: no bug, and, perhaps, left to its own devices, the DB 
would have eventually built this index. Whether it would have done so before 
universal heat death is a different question.

Can anyone discuss how maintenance_work_mem is utilized during GIN index 
creation? On our production systems, this value is set to 1G, and we don’t seem 
to have any problems with day-to-day work. This is a very busy table, and we 
have thousands of inserts/updates daily which seem to proceed well enough. I 
have not tried to rebuild this index on this system, so I can’t say whether 1G 
is enough to rebuild it efficiently or not. But it’s efficient enough for our 
transaction load.

Now, that said, should I ever be in that situation of having to recover this 
table like this, I’d have no problem cranking that value up high since it would 
be the only real connection on the system anyway.
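
For example (just a sketch: the value is illustrative, and the index definition 
is the one from earlier in the thread), the setting can be raised for just the 
one session doing the rebuild:

SET maintenance_work_mem = '8GB';  -- session-local, illustrative value

CREATE INDEX eis_ent_traits
  ON public.eis_entity
  USING gin
  (traits jsonb_path_ops);

RESET maintenance_work_mem;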

But I’m still curious how the memory is utilized during index builds just to 
have a better understanding of the nuances of the system.

Thanks all for your help.






Re: Questions about btree_gin vs btree_gist for low cardinality columns

2019-05-30 Thread Will Hartung
It makes sense for specialized indexes to basically compact themselves by 
removing the prefixes. Basically each node says “all the words that start with 
‘the’ (the, them, their, there, thespian) are this way”. So you’ll see a node 
“the”, and beneath that you might find the node “spian” (thespian with the 
‘the’ removed). These indexes are more compact and better suited for text (vs, 
say, random GUIDs), because text has notable properties of information density 
and such.

For just numbers (especially random numbers), there’s no real value.

This is why “scanning the index” doesn’t work well on these kinds of structures 
compared to a classic B-Tree. Nor does it work well with a HASH index, since 
hash indexes store the hash, not the key value; there’s nothing of value there 
to scan.

> 
> 2. 
> 
> Lack of index only scans is definitely a downside.  However, I see basically 
> identical performance, but way less memory and space usage, for gin indexes.  
> In terms of read-only performance, if index only scans are not a factor, why 
> not always recommend btree_gin indexes instead of regular btree for low 
> cardinality fields, which will yield similar performance but use far, far 
> less space and resources?

I simply can’t speak to the applicability of indexes to low-cardinality 
columns, as that relates to different issues not directly related to searching 
the index.

If you have 10M rows with a “STATUS” column of 1 or 2, and an index on that 
column, then you have a 2-node index with a bazillion row pointers. Some 
systems (I can’t speak to PG in this regard) degenerate in this kind of use 
case, since the index is more or less designed to work great on unique 
identifiers rather than low-cardinality values. The representation of large 
record-pointer lists may just not be handled well as an edge case.

Historically, I try to avoid low-cardinality indexes, simply because I’ve had 
problems in the past (yes, decades ago, but lessons learned). In those cases I 
force high cardinality (in this case, maybe create a combined index of STATUS 
and ID, which turns an ostensibly low-value index into a unique one). This is 
very important for heavily weighted indexes (like where most of the statuses 
are 1, but you mostly just want to look for the 2’s). But then it’s a matter of 
convincing the optimizer that the index isn’t utterly worthless no matter what 
you query for, rather than just table-scanning anyway.
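
As a sketch of that combined-index idea (the table and column names are purely 
illustrative):

-- Sketch only: illustrative names. Pairing the low-cardinality status with the
-- unique id makes the index entries unique while still supporting status lookups.
CREATE INDEX orders_status_id ON orders (status, id);

-- e.g. hunting for the rare status = 2 rows
SELECT id FROM orders WHERE status = 2;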

Regards,

Will Hartung