Feature request for adoptive indexes

2021-10-25 Thread Hayk Manukyan
Hi everyone. I want to do some feature request regarding indexes, as far as
I know this kind of functionality doesn't exists in Postgres. Here is my
problem :
I need to create following indexes:
Create index job_nlp_year_scan on ingest_scans_stageing
(`job`,`nlp`,`year`,`scan_id`);
Create index job_nlp_year_issue_flag on ingest_scans_stageing
(`job`,`nlp`,`year`,`issue_flag`);
Create index job_nlp_year_sequence on ingest_scans_stageing
(`job`,`nlp`,`year`,`sequence`);
As you can see the first 3 columns are the same (job, nlp, year). so if I
create 3 different indexes db should manage same job_nlp_year structure 3
times.
The Data Structure that I think which can be efficient in this kind of
scenarios is to have 'Adaptive Index'  which will be something like
Create index job_nlp_year on ingest_scans_stageing
(`job`,`nlp`,`year`,(`issue_flag`,`scan_id`, `sequence`));
And depend on query it will use or job_nlp_year_scan  or
job_nlp_year_issue_flag , or job_nlp_year_sequence ( job, nlp, year and one
of ( `issue_flag` , `scan_id` ,  `sequence` )
For more description please feel free to refer me


Re: Feature request for adoptive indexes

2021-10-26 Thread Hayk Manukyan
ok. here is the deal if I have the following index with 6 column

CREATE INDEX ON job_nlp_year_scan (job, nlp, year, scan_id, issue_flag,
sequence);

I need to specify all 6 columns in where clause in order to fully use this
index.
It will not be efficient in cases when I have 4 condition in where clause
also I should follow the order of columns.
In case of INCLUDE the 3 columns just will be in index but will not be
structured as index so it will have affect only if In select I will have
that 6 columns nothing more.

In my case I have table with ~15 columns
In my application  I have to do a lot of queries with following where
clauses

1. where  job =  and nlp =  and year = 
and SCAN_ID = 
2. where  job =  and nlp =  and year = 
and ISSUE_FLAG = 
3. where  job =  and nlp =  and year = 
and SEQUENCE = 

I don't want to index just on  job, nlp, year because for each  job, nlp,
year I have approximately 5000-7000 rows ,
overall table have ~50m rows so it is partitioned by job as well.  So if I
build 3 separate indexes it will be huge resource.
So I am thinking of having one index which will be job, nlp, year and the
4-th layer will be other columns not just included but also in B-tree
structure.
To visualize it will be something like this:
[image: image.png]
The red part is ordinary index with nested b-trees ant the yellow part is
adaptive part so depends on
where clause optimizer can decide which direction (leaf, b-tree whatever)
to chose.
In this case I will have one index and will manage red part only once for
all three cases.
Those it make sense ?
If you need more discussion we can have short call I will try to explain
you in more detailed way.

best regards

пн, 25 окт. 2021 г. в 19:33, Tomas Vondra :

> Hi,
>
> On 10/25/21 16:07, Hayk Manukyan wrote:
> > Hi everyone. I want to do some feature request regarding indexes, as far
> as
> > I know this kind of functionality doesn't exists in Postgres. Here is my
> > problem :
> > I need to create following indexes:
> >  Create index job_nlp_year_scan on ingest_scans_stageing
> > (`job`,`nlp`,`year`,`scan_id`);
> >  Create index job_nlp_year_issue_flag on ingest_scans_stageing
> > (`job`,`nlp`,`year`,`issue_flag`);
> >  Create index job_nlp_year_sequence on ingest_scans_stageing
> > (`job`,`nlp`,`year`,`sequence`);
> > As you can see the first 3 columns are the same (job, nlp, year). so if I
> > create 3 different indexes db should manage same job_nlp_year structure 3
> > times.
> > The Data Structure that I think which can be efficient in this kind of
> > scenarios is to have 'Adaptive Index'  which will be something like
> > Create index job_nlp_year on ingest_scans_stageing
> > (`job`,`nlp`,`year`,(`issue_flag`,`scan_id`, `sequence`));
> > And depend on query it will use or job_nlp_year_scan  or
> > job_nlp_year_issue_flag , or job_nlp_year_sequence ( job, nlp, year and
> one
> > of ( `issue_flag` , `scan_id` ,  `sequence` )
> > For more description please feel free to refer me
>
> It's not very clear what exactly would the "adaptive index" do, except
> that it'd have all three columns. Clearly, the three columns can't be
> considered for ordering etc. but need to be in the index somehow. So why
> wouldn't it be enough to either to create an index with all six columns?
>
> CREATE INDEX ON job_nlp_year_scan (job, nlp, year, scan_id, issue_flag,
> sequence);
>
> or possibly with the columns just "included" in the index:
>
> CREATE INDEX ON job_nlp_year_scan (job, nlp, year) INCLUDE (scan_id,
> issue_flag, sequence);
>
> If this does not work, you either need to explain more clearly what
> exactly the adaptive indexes does, or show queries that can't benefit
> from these existing features.
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: Feature request for adoptive indexes

2021-10-29 Thread Hayk Manukyan
Hi all
First of all thank you all for fast and rich responses, that is really nice.
I don't have that deep knowledge of how postgres  works under the hood so I
will try to explain more user side.
I want to refer for some points mentioned above.
 - First INCLUDE statement mostly eliminates the necessity to refer to a
clustered index or table to get columns that do not exist in the index. So
filtering upon columns in INCLUDE statement will not be performant. It can
give some very little performance if we include additional columns but it
is not in level to compare with indexed one. I believe this not for this
case
-  Tomas Vondra's Assumption that adaptive should be something between this
two
1) (job, nlp, year, sequence)
2) (job, nlp, year, scan_id, issue_flag, sequence)
is completely valid. I have made fairly small demo with this index
comparison and as I can see the difference is noticeable. Here is git repo
and results

,
I had no much time to do significant one sorry for that ))
 -  regarding data structure side of things by Pavel Borisov.
I also think that different data structure will be needed. Not sure exactly
at this point which kind of data structure but I will try to explain it
here.


best regards


ср, 27 окт. 2021 г. в 20:10, Peter Geoghegan :

> On Wed, Oct 27, 2021 at 1:02 AM Pavel Borisov 
> wrote:
> > AFAIK Gin is lossy for phrase queries as we don't store word position in
> the posting list. For purely logical queries, where position doesn't
> matter, it's not lossy.
>
> GIN is always lossy, in the sense that it provides only a
> gingetbitmap() routine -- there is no gingettuple() routine. I believe
> that this is fundamental to the overall design of GIN. It would be
> very difficult to add useful gingettuple() functionality now, since
> GIN already relies on lossiness to avoid race conditions.
>
> Here's an example of the problems that "adding gingettuple()" would
> run into: Today, an index's pending list entries can be merged
> concurrently with the entry tree, without worrying about returning the
> same tuples twice. This is only safe/correct because GIN only supports
> bitmap index scans. Without that, you need some other mechanism to
> make it safe -- ISTM you must "logically lock" the index structure,
> using ARIES/KVL style key value locks, or something along those lines.
>
> --
> Peter Geoghegan
>


Re: Feature request for adoptive indexes

2021-11-01 Thread Hayk Manukyan
I agree with the above mentioned.
The only concern I have is that we compare little wrong things.
For read we should compare
 (job, nlp, year, sequence) AND (job, nlp, year, Scan_ID) and (job, nlp,
year,  issue_flag  ) VS  (job, nlp, year, sequence, Scan_ID, issue_flag)
OR  (job, nlp, year INCLUDE(sequence, Scan_ID, issue_flag) )
Because our proposed index for reading should be closer to a combination of
those 3 and we have current solutions like index on all or with Include
statement.
We should try to find a gap between these three cases.
For DML queries
 (job, nlp, year, sequence, Scan_ID, issue_flag) OR  (job, nlp, year
INCLUDE(sequence, Scan_ID, issue_flag) ) VS  (job, nlp, year, sequence) AND
(job, nlp, year, Scan_ID) and (job, nlp, year,  issue_flag  )
Because again the proposed index should be just one and cover all 3
separate ones.

If you agree with these cases I will try to find a bigger time frame to
compare these two cases deeper.
The issue is not high prio but I strongly believe it can help and can be
nice feature for even more complicated cases.

Best regards.




вс, 31 окт. 2021 г. в 21:33, Tomas Vondra :

>
>
> On 10/31/21 16:48, Pavel Borisov wrote:
> >4 columns: 106 ms
> >6 columns: 109 ms
> >
> > So there's like 3% difference between the two cases, and even that
> > might
> > be just noise. This is consistent with the two indexes being about
> the
> > same size.
> >
> > I also don't think we can get great speedup in the mentioned case, so it
> > is not urgently needed of course. My point is that it is just nice to
> > have a multicolumn index constructed on stacked trees constructed on
> > separate columns, not on the index tuples as a whole thing.
>
> Well, I'd say "nice to have" features are pointless unless they actually
> give tangible benefits (like speedup) to users. I'd bet no one is going
> to implement and maintain something unless it has such benefit, because
> they have to weight it against other beneficial features.
>
> Maybe there are use cases where this would be beneficial, but so far we
> haven't seen one. Usually it's the OP who presents such a case, and a
> plausible way to improve it - but it seems this thread presents a
> solution and now we're looking for an issue it might solve.
>
> > At least there is a benefit of sparing shared memory if we don't need
> > to cache index tuples of several similar indexes, instead caching one
> > "compound index". So if someone wants to propose this thing I'd
> > support it provided problems with concurrency, which were mentioned
> > by Peter are solved.
> >
>
> The problem with this it assumes the new index would use (significantly)
> less space than three separate indexes. I find that rather unlikely, but
> maybe there is a smart way to achieve that (certainly not in detail).
>
> I don't want to sound overly pessimistic and if you have an idea how to
> do this, I'd like to hear it. But it seems pretty tricky, particularly
> if we assume the suffix columns are more variable (which limits the
> "compression" ratio etc.).
>
> > These problems could be appear easy though, as we have index tuples
> > constructed in a similar way as heap tuples. Maybe it could be easier if
> > we had another heap am, which stored separate attributes (if so it could
> > be useful as a better JSON storage method than we have today).
> >
>
> IMO this just moved the goalposts somewhere outside the solar system.
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: Feature request for adoptive indexes

2021-11-02 Thread Hayk Manukyan
Tomas Vondra
> Are you suggesting those are not the actual best/worst cases and we
> should use some other indexes? If yes, which ones?

I would say yes.
In my case I am not querying only sequence column.
I have the following cases which I want to optimize.
1. Select * from Some_table where job =  and nlp = 
and year =  and  *scan_id =  *
2. Select * from Some_table where job =  and nlp = 
and year =  and  *Issue_flag =  *
3. Select * from Some_table where job =  and nlp = 
and year =  and  *sequence =  *
Those are queries that my app send to db that is why I said that from *read
perspective* our *best case* is 3 separate indexes for
 *(job, nlp, year, sequence)* AND *(job, nlp, year, Scan_ID)* and *(job,
nlp, year,  issue_flag)*  and any other solution like
 (job, nlp, year, sequence, Scan_ID, issue_flag) OR  (job, nlp, year )
INCLUDE(sequence, Scan_ID, issue_flag)  OR just (job, nlp, year) can be
considered as* worst case *
I will remind that in real world scenario I have ~50m rows and about *~5k
rows for each (job, nlp, year )*
>From *write perspective* as far as we want to have only one index our* best
case* can be considered any of
*(job, nlp, year, sequence, Scan_ID, issue_flag)* OR * (job, nlp, year )
INCLUDE(sequence, Scan_ID, issue_flag) *
and the* worst case* will be having 3 separate queries like in read
perspective
(job, nlp, year, sequence) AND (job, nlp, year, Scan_ID) and (job, nlp,
year,  issue_flag)

So I think the comparison that we did is not right because we are comparing
different/wrong things.

For right results we need to compare this two cases when we are doing
random queries with 1,2,3  and random writes.


вт, 2 нояб. 2021 г. в 01:16, Tomas Vondra :

>
>
> On 11/1/21 21:06, Robert Haas wrote:
> > On Tue, Oct 26, 2021 at 11:11 AM Tomas Vondra
> >  wrote:
> >> If I get what you propose, you want to have a "top" tree for (job, nlp,
> >> year), which "splits" the data set into subsets of ~5000-7000 rows. And
> >> then for each subset you want a separate "small" trees on each of the
> >> other columns, so in this case three trees.
> >>
> >> Well, the problem with this is pretty obvious - each of the small trees
> >> requires separate copies of the leaf pages. And remember, in a btree the
> >> internal pages are usually less than 1% of the index, so this pretty
> >> much triples the size of the index. And if you insert a row into the
> >> index, it has to insert the item pointer into each of the small trees,
> >> likely requiring a separate I/O for each.
> >>
> >> So I'd bet this is not any different from just having three separate
> >> indexes - it doesn't save space, doesn't save I/O, nothing.
> >
> > I agree. In a lot of cases, it's actually useful to define the index
> > on fewer columns, like (job, nlp, year) or even just (job, nlp) or
> > even just (job) because it makes the index so much smaller and that's
> > pretty important. If you have enough duplicate entries in a (job, nlp,
> > year) index to justify create a (job, nlp, year, sequence) index, the
> > number of distinct (job, nlp, year) tuples has to be small compared to
> > the number of (job, nlp, year, sequence) tuples - and that means that
> > you wouldn't actually save much by combining your (job, nlp, year,
> > sequence) index with a (job, nlp, year, other-stuff) index. As you
> > say, the internal pages aren't the problem.
> >
> > I don't intend to say that there's no possible use for this kind of
> > technology. Peter G. says that people are writing research papers
> > about that and they probably wouldn't be doing that unless they'd
> > found some case where it's a big win. But this example seems extremely
> > unconvincing.
> >
>
> I actually looked at the use case mentioned by Peter G, i.e. merged
> indexes with master-detail clustering (see e.g. [1]), but that seems
> like a rather different thing. The master-detail refers to storing rows
> from multiple tables, interleaved in a way that allows faster joins. So
> it's essentially a denormalization tool.
>
> Perhaps there's something we could learn about efficient storage of the
> small trees, but I haven't found any papers describing that (I haven't
> spent much time on the search, though).
>
> [1] Algorithms for merged indexes, Goetz Graefe
>  https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.140.7709
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>


Re: Feature request for adoptive indexes

2021-11-05 Thread Hayk Manukyan
Hi All

I did final research and saw that the difference between best and worst
cases is indeed really small.
I want to thank you guys for your time and efforts.

Best regards.


вт, 2 нояб. 2021 г. в 18:04, Pavel Borisov :

> вт, 2 нояб. 2021 г. в 16:04, Hayk Manukyan :
>
>> Tomas Vondra
>> > Are you suggesting those are not the actual best/worst cases and we
>> > should use some other indexes? If yes, which ones?
>>
>> I would say yes.
>> In my case I am not querying only sequence column.
>> I have the following cases which I want to optimize.
>> 1. Select * from Some_table where job =  and nlp = 
>> and year =  and  *scan_id =  *
>> 2. Select * from Some_table where job =  and nlp = 
>> and year =  and  *Issue_flag =  *
>> 3. Select * from Some_table where job =  and nlp = 
>> and year =  and  *sequence =  *
>> Those are queries that my app send to db that is why I said that from *read
>> perspective* our *best case* is 3 separate indexes for
>>  *(job, nlp, year, sequence)* AND *(job, nlp, year, Scan_ID)* and *(job,
>> nlp, year,  issue_flag)*  and any other solution like
>>  (job, nlp, year, sequence, Scan_ID, issue_flag) OR  (job, nlp, year )
>> INCLUDE(sequence, Scan_ID, issue_flag)  OR just (job, nlp, year) can be
>> considered as* worst case *
>> I will remind that in real world scenario I have ~50m rows and about *~5k
>> rows for each (job, nlp, year )*
>>
>
>  So you get 50M rows /5K rows = 10K times selectivity, when you select on
> job =  and nlp =  and year =  which is
> enormous. Then you should select some of the 5K rows left, which is
> expected to be pretty fast on bitmap index scan or INCLUDE column
> filtering. It confirms Tomas's experiment
>
>   pgbench -n -f q4.sql -T 60
>
> 106 ms vs 109 ms
>
> fits your case pretty well. You get absolutely negligible difference
> between best and worst case and certainly you don't need anything more than
> just plain index for 3 columns, you even don't need INCLUDE index.
>
> From what I read I suppose that this feature indeed doesn't based on the
> real need. If you suppose it is useful please feel free to make and post
> here some measurements that proves your point.
>
>
>
>
> --
> Best regards,
> Pavel Borisov
>
> Postgres Professional: http://postgrespro.com <http://www.postgrespro.com>
>