Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-07 Thread Dimitri Fontaine
On Thursday, February 7, 2008, Greg Smith wrote:
> On Wednesday, February 6, 2008, Dimitri Fontaine wrote:
>> In other cases, a logical line is a physical line, so we start after first
>> newline met from given lseek start position, and continue reading after the
>> last lseek position until a newline.
>
> Now you're talking.  Find a couple of split points that way, fine-tune the
> boundaries a bit so they rest on line termination points, and off you go.

I was thinking of not even reading the file content from the controller
thread, just deciding the splitting points in bytes (0..ST_SIZE/4,
ST_SIZE/4+1..2*ST_SIZE/4, etc.) and letting each reading thread fine-tune by
beginning to process input only after having read the first newline, etc.
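
To make the arithmetic concrete, here is a minimal sketch of what the
controller thread would compute (illustrative Python, not actual pgloader
code; the helper name is made up):

import os

def nominal_split_points(path, parts):
    # Inclusive byte ranges, as in 0..ST_SIZE/4, ST_SIZE/4+1..2*ST_SIZE/4 above.
    # The controller never reads the file; each reading thread later fine-tunes
    # its own start by skipping to the first newline inside its range.
    size = os.stat(path).st_size
    step = size // parts
    return [(0 if i == 0 else i * step + 1,
             size - 1 if i == parts - 1 else (i + 1) * step)
            for i in range(parts)]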

And while we're still at the design board, I'm also thinking of adding a
per-section parameter (with a possible global default value),
split_file_reading, which defaults to False and which you'll have to set to
True for pgloader to behave the way we're talking about.

When split_file_reading = False and section_threads != 1, pgloader will have to
manage several processing threads per section but only one file-reading
thread, handing the read input to the processing threads in a round-robin fashion.
In the future the choice of processing thread will possibly (another knob) be
smarter than that, as soon as we get CE support into pgloader.
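
As a sketch of what that round-robin hand-off could look like (illustrative
Python, not pgloader code; one queue per processing thread is an assumption):

import queue

# e.g. worker_queues = [queue.Queue() for _ in range(section_threads)]

def dispatch_round_robin(lines, worker_queues):
    # Single file-reading thread feeding N processing threads in turn; a
    # smarter, CE-aware chooser could later replace the simple modulo below.
    for n, line in enumerate(lines):
        worker_queues[n % len(worker_queues)].put(line)
    for q in worker_queues:
        q.put(None)   # signal each worker that the input is exhausted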

When split_file_reading = True and section_threads != 1, pgloader will have to
manage several processing threads per section, each one responsible for
reading its own part of the file, with the processing boundaries discovered at
reading time. Adding CE support in this case means managing two separate
thread pools per section, one responsible for split file reading and another
responsible for data buffering and routing (COPY to the partition instead of
to the parent table).

In both cases, maybe pgloader would also need a separate thread for COPYing
the buffer to the server, allowing it to continue preparing the next buffer in
the meantime?
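
If it helps to picture the overlap, a small sketch (illustrative Python;
send_buffer stands in for whatever actually issues the COPY):

import queue
import threading

def copy_worker(copy_queue, send_buffer):
    # Dedicated thread: COPY one prepared buffer at a time to the server.
    while True:
        buf = copy_queue.get()
        if buf is None:
            break          # the preparing thread signals it is done
        send_buffer(buf)

# copy_queue = queue.Queue(maxsize=1)   # at most one buffer waiting
# threading.Thread(target=copy_worker, args=(copy_queue, send_buffer)).start()
# The preparing thread keeps building the next buffer and calls
# copy_queue.put(buf) when it is ready, so preparation and COPY overlap.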

This will need some re-architecturing of pgloader, but it seems worth it (I'm
not entirely sold on the two-thread-pools idea, though, and this last
continue-reading-while-copying idea still has to be examined).
Some of the work needing to be done is by now quite clear to me, but part of
it still needs its share of design time. As usual though, the real hard part
is knowing exactly what we want to get done, and we're showing good progress
here :)

Greg's behavior:
max_threads   = N 
max_parallel_sections = 1
section_threads   = -1
split_file_reading= True

Simon's behaviour:
max_threads   = N
max_parallel_sections = 1   # I don't think Simon wants parallel sections
section_threads   = -1
split_file_reading= False

Comments?
-- 
dim




[PERFORM] Index Scan Backward + check additional condition before heap access

2008-02-07 Thread Markus Bertheau
Hi,

(PostgreSQL 8.3)

I'm trying to optimize one of the most often used queries in our system:

(Full minimized pastable schema and data below.)

create table feeds_users (
user_id int references users(id) not null,
feed_id int references feeds(id) not null,
unique(user_id, feed_id)
);

create table items (
id serial primary key,
feed_id int references feeds(id) not null,
title text,
pub_date timestamp
);

create index items_feed_id_idx on items(feed_id);
create index items_pub_date_idx on items(pub_date);
create index items_pub_date_feed_id_idx on items(pub_date, feed_id);
create index feeds_users_feed_id on feeds_users(feed_id);

-- Query variant 1:

EXPLAIN ANALYZE SELECT i.* FROM items i WHERE feed_id IN (
SELECT f.feed_id FROM feeds_users f WHERE f.user_id = ?)
ORDER BY pub_date DESC
LIMIT 20 OFFSET 100;

-- Query variant 2:
EXPLAIN ANALYZE SELECT i.* FROM items i
JOIN feeds_users f ON i.feed_id = f.feed_id AND f.user_id = ?
ORDER BY pub_date DESC
LIMIT 20 OFFSET 100;

The table items contains 700000 rows, feeds_users 99000, feeds 1 and users
1000. The number of feeds per user is distributed logarithmically, i.e.
there are many users with few or no feeds and some users with many feeds.

In reality, 99% of the rows in items are being inserted in pub_date order,
and the correlation of user_id in feeds_users is not 1 (it is 1 with the
test data).

I need this query to work blisteringly fast for the common case, and at
least not too slow for extreme cases. Extreme cases are:
 * no feeds for a user
 * very few feeds for a user, with the top 20 items spread over >10% of the
   items table
 * a normal number of feeds for a user, but a big offset (1000 or 1 or
   10). The maximum offset could be capped in the application if needed,
   but a solution without that would be preferred.

The common case is that the needed rows are within the first (by pub_date
desc) <1% of items.

I ran some tests of both query variants on a Pentium M 1.6 GHz notebook with
1280 MB RAM, shared_buffers = 32MB, temp_buffers = 8MB, work_mem = 8MB. Three
different user_ids were used for testing; the needed rows for each user are
either 1) nonexistent, 2) spread over 18% of the table, or 3) spread over
0.064% of the table. I also tried statistics targets of 10 and 100 for the
two columns in feeds_users. Two query variants were used, one with an inner
join and one with IN. I got 4 different plans in all. Results:

no.   stat   user_id   item rows      result rows   query     plan   time
      target            scanned        w/o limit     variant
1     10     370                      0             in        1      2 ms
2     10     370                      0             join      2      15000 ms
3     10     49        46855 (18%)    630           in        1      2300 ms
4     10     49        46855 (18%)    630           join      2      2300 ms
5     10     109       448 (0.064%)   206780        in        1      6 ms
6     10     109       448 (0.064%)   206780        join      2      9 ms
7     100    370                      0             in        3      0.2 ms
8     100    370                      0             join      2      16500 ms
9     100    49        46855 (18%)    630           in        4      10 ms
10    100    49        46855 (18%)    630           join      2      2300 ms
11    100    109       448 (0.064%)   206780        in        1      6 ms
12    100    109       448 (0.064%)   206780        join      2      9 ms

Plans below. Now the questions:

Do the differences in characteristics between the test data and the real data
somehow invalidate these numbers?

I observe that the query variant with IN is faster in all cases. What is the
difference between them that leads to plans being chosen that differ so much
performance-wise?

Can I somehow get items_pub_date_feed_id_idx to be used? ISTM that scanning
that index in pub_date desc order and using the same index to test for a
needed feed_id would be faster than accessing the heap for each tuple.

With a statistics target of 100, a different, very much faster plan was chosen
for query no. 9 than for no. 3. How is the statistics target to be determined
such that the faster plan is chosen? Am I going to have to increase the
statistics target as one or more tables receive more rows?
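
(For reference, the per-column statistics target mentioned above is the one
set like this, with 100 as an example value, followed by an ANALYZE:)

ALTER TABLE feeds_users ALTER COLUMN user_id SET STATISTICS 100;
ALTER TABLE feeds_users ALTER COLUMN feed_id SET STATISTICS 100;
ANALYZE feeds_users;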

Thanks

Markus

Plans:
1 (for no. 3)
 Limit  (cost=1304.78..1565.74 rows=20 width=27) (actual time=2121.866..2377.740 rows=20 loops=1)
   ->  Nested Loop IN Join  (cost=0.00..57984.39 rows= width=27) (actual time=9.856..2377.421 rows=120 loops=1)
         ->  Index Scan Backward using items_pub_date_idx on items i  (cost=0.00..37484.20 rows=700071 width=27) (actual time=0.131..1152.933 rows=127337 loops=1)
         ->  Index Scan using feeds_users_user_id_key on feeds_users f  (cost=0.00..0.29 rows=1 width=4) (actual time=0.006..0.006 rows=0 loops=127337)
               Index Cond: ((f.user_id = 49) AND (f.feed_id = i.feed_id))
 Total runtime: 2377.899 ms
(6 rows)

2 (for no. 4)
 Limit  (cost=542.78..651.33 rows=20 width=27) (actual time=2133.759..2393.259

[PERFORM] index usage on arrays

2008-02-07 Thread andrew klassen
I am using Postgres 8.2.5. 
 
I have a table whose rows contain a variable-length array with a known
maximum size.
I was doing selects on the array elements using an ANY match. The performance
was not too good as my table got bigger, so I added an index on the array.
That didn't help, since the select was not using it. I saw a thread in the
mailing lists stating the index wouldn't be used.

So I created indices on the individual array elements and now do a select
on each element separately and then combine the matches using OR.
This did substantially increase the select performance. However, it may
be difficult to maintain this approach over time as the maximum array
size may increase dramatically and forming the query will become tedious.

Is there any alternative to what I am currently doing, other than creating a
row for each array element, i.e. stop using an array and use a separate row
for each array index? The reason I didn't want to take that approach is that
there are other columns in the row that would be duplicated needlessly.
 
Thanks, Andrew


  



Re: [PERFORM] index usage on arrays

2008-02-07 Thread Tom Lane
andrew klassen <[EMAIL PROTECTED]> writes:
> Is there any alternative to what I am currently doing, other than creating a
> row for each array element,

Since (I think) 8.2, you could create a GIN index on the array column
and then array overlap (&&) would be indexable.  GIN has some
performance issues if the table is heavily updated, though.
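
For illustration, the kind of index and query being described would look like
this (table and column names are made up):

CREATE INDEX my_table_arr_gin_idx ON my_table USING gin (arr_col);

-- matches rows whose array shares at least one element with the given list
SELECT * FROM my_table WHERE arr_col && ARRAY[42, 43];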

regards, tom lane



Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-07 Thread Kenneth Marshall
On Thu, Feb 07, 2008 at 12:06:42PM -0500, Greg Smith wrote:
> On Thu, 7 Feb 2008, Dimitri Fontaine wrote:
>
>> I was thinking of not even reading the file content from the controller
>> thread, just decide splitting points in bytes (0..ST_SIZE/4 -
>> ST_SIZE/4+1..2*ST_SIZE/4 etc) and let the reading thread fine-tune by
>> beginning to process input after having read first newline, etc.
>
> The problem I was pointing out is that if chunk#2 moved forward a few bytes 
> before it started reading in search of a newline, how will chunk#1 know 
> that it's supposed to read up to that further point?  You have to stop #1 
> from reading further when it catches up with where #2 started.  Since the 
> start of #2 is fuzzy until some reading is done, what you're describing 
> will need #2 to send some feedback to #1 after they've both started, and 
> that sounds bad to me.  I like designs where the boundaries between threads 
> are clearly defined before any of them start and none of them ever talk to 
> the others.
>

As long as both processes understand the start condition, there
is no problem. p1 starts at the beginning and processes through the chunk2
offset until it reaches the start condition. p2 starts loading from the
chunk2 offset plus the amount needed to reach the start condition, ...

DBfile|---|--x--|x|-x--|
  x chunk1--->
 x chunk2>
 x chunk3--->...

As long as both pieces use the same test, they will each process
non-overlapping segments of the file and still process 100% of the
file.

Ken

>> In both cases, maybe it would also be needed for pgloader to be able to 
>> have a
>> separate thread for COPYing the buffer to the server, allowing it to 
>> continue
>> preparing next buffer in the meantime?
>
> That sounds like a V2.0 design to me.  I'd only chase after that level of 
> complexity if profiling suggests that's where the bottleneck really is.
>
> --
> * Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD
>


Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-07 Thread Greg Smith

On Thu, 7 Feb 2008, Dimitri Fontaine wrote:


> I was thinking of not even reading the file content from the controller
> thread, just decide splitting points in bytes (0..ST_SIZE/4 -
> ST_SIZE/4+1..2*ST_SIZE/4 etc) and let the reading thread fine-tune by
> beginning to process input after having read first newline, etc.


The problem I was pointing out is that if chunk#2 moved forward a few bytes 
before it started reading in search of a newline, how will chunk#1 know 
that it's supposed to read up to that further point?  You have to stop #1 
from reading further when it catches up with where #2 started.  Since the 
start of #2 is fuzzy until some reading is done, what you're describing 
will need #2 to send some feedback to #1 after they've both started, and 
that sounds bad to me.  I like designs where the boundaries between 
threads are clearly defined before any of them start and none of them ever 
talk to the others.



> In both cases, maybe it would also be needed for pgloader to be able to have a
> separate thread for COPYing the buffer to the server, allowing it to continue
> preparing next buffer in the meantime?


That sounds like a V2.0 design to me.  I'd only chase after that level of 
complexity if profiling suggests that's where the bottleneck really is.


--
* Greg Smith [EMAIL PROTECTED] http://www.gregsmith.com Baltimore, MD



Re: [PERFORM] index usage on arrays

2008-02-07 Thread Oleg Bartunov

Andrew,
what are your queries? Have you seen contrib/intarray and GIN indexes?


On Thu, 7 Feb 2008, andrew klassen wrote:


> I am using Postgres 8.2.5.
>
> I have a table whose rows contain a variable-length array with a known
> maximum size.
> I was doing selects on the array elements using an ANY match. The performance
> was not too good as my table got bigger, so I added an index on the array.
> That didn't help, since the select was not using it. I saw a thread in the
> mailing lists stating the index wouldn't be used.
>
> So I created indices on the individual array elements and now do a select
> on each element separately and then combine the matches using OR.
> This did substantially increase the select performance. However, it may
> be difficult to maintain this approach over time as the maximum array
> size may increase dramatically and forming the query will become tedious.
>
> Is there any alternative to what I am currently doing, other than creating a
> row for each array element, i.e. stop using an array and use a separate row
> for each array index? The reason I didn't want to take that approach is that
> there are other columns in the row that would be duplicated needlessly.
>
> Thanks, Andrew


 




Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: [EMAIL PROTECTED], http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-07 Thread Mark Lewis
> > I was thinking of not even reading the file content from the controller
> > thread, just decide splitting points in bytes (0..ST_SIZE/4 -
> > ST_SIZE/4+1..2*ST_SIZE/4 etc) and let the reading thread fine-tune by
> > beginning to process input after having read first newline, etc.
> 
> The problem I was pointing out is that if chunk#2 moved forward a few bytes 
> before it started reading in search of a newline, how will chunk#1 know 
> that it's supposed to read up to that further point?  You have to stop #1 
> from reading further when it catches up with where #2 started.  Since the 
> start of #2 is fuzzy until some reading is done, what you're describing 
> will need #2 to send some feedback to #1 after they've both started, and 
> that sounds bad to me.  I like designs where the boundaries between 
> threads are clearly defined before any of them start and none of them ever 
> talk to the others.

I don't think that any communication is needed beyond the beginning of
the threads.  Each thread knows that it should start at byte offset X
and end at byte offset Y, but if Y happens to be in the middle of a
record then just keep going until the end of the record.  As long as the
algorithm for reading past the end marker is the same as the algorithm
for skipping past the beginning marker then all is well.
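
A minimal sketch of that rule (illustrative Python, assuming the file is
opened in binary mode and records are newline-terminated lines; not actual
pgloader code):

def read_chunk(f, start, end):
    # Own every line that *starts* in [start, end): skip whatever partial
    # line `start` lands in (the previous chunk finishes it), then keep
    # reading whole lines as long as they begin before `end`; the last one
    # may run past `end`.  With every reader applying the same rule, the
    # chunks neither overlap nor leave gaps.
    if start == 0:
        f.seek(0)
    else:
        f.seek(start - 1)
        f.readline()
    while f.tell() < end:
        line = f.readline()
        if not line:
            break
        yield line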

-- Mark Lewis



Re: [PERFORM] Benchmark Data requested --- pgloader CE design ideas

2008-02-07 Thread Matthew

On Thu, 7 Feb 2008, Greg Smith wrote:
> The problem I was pointing out is that if chunk#2 moved forward a few bytes
> before it started reading in search of a newline, how will chunk#1 know that
> it's supposed to read up to that further point?  You have to stop #1 from
> reading further when it catches up with where #2 started.  Since the start of
> #2 is fuzzy until some reading is done, what you're describing will need #2
> to send some feedback to #1 after they've both started, and that sounds bad
> to me.


It doesn't have to be fuzzy at all. Both threads will presumably be able 
to use the same algorithm to work out where the boundary is, therefore 
they'll get the same result. No need to pass back information.


Matthew

--
There is something in the lecture course which may not have been visible so
far, which is reality   -- Computer Science Lecturer
