Greg Smith wrote:
pavelbaros wrote:
I am also waiting for approval for my repository named
"materialized_view" on git.postgresql.org, so I could publish
completed parts.
Presuming that you're going to wander there and get assigned what
looks like an official repo name for this project is a
bit...optimistic. I would recommend that you publish to something
like github instead (you can fork http://github.com/postgres/postgres
), and if the work looks good enough that it gets picked up by the
community maybe you migrate it onto the main site eventually.
git.postgresql.org is really not setup to be general hosting space for
everyone who has a PostgreSQL related project; almost every repo on
there belongs to someone who has already been a steady project
contributor for a number of years.
Yes, you're true, I'm kind of newbe in this kind of project and
specially in PostgreSQL. But I think it is best way to get into
PostgreSQL. When I chose my bachelor thesis I did not know I could
participate GSoC or try to make it commitable. Anyway I will make repo
on github, so everybody could look at it, as soon as posible.
<http://github.com/pbaros/postgres>
(Switching to boilerplate mode for a paragraph...) You have picked a
PostgreSQL feature that is dramatically more difficult than it appears
to be, and I wouldn't expect you'll actually finish even a fraction of
your goals in a summer of work. You're at least in plentiful
company--most students do the same. As a rule, if you see a feature
on our TODO list that looks really useful and fun to work on, it's
only still there because people have tried multiple times to build it
completely but not managed to do so because it's harder than it
appears. This is certainly the case with materialized views.
You've outlined a reasonable way to build a prototype that does a
limited implementation here. The issue is what it will take to extend
that into being production quality for the real-world uses of
materialized views. How useful your prototype is depends on how well
it implements a subset of that in a way that will get used by the
final design.
The main hidden complexity in this particular project relates to
handling view refreshes. The non-obvious problem is that when the
view updates, you need something like a SQL MERGE to really handle
that in a robust way that doesn't conflict with concurrent access to
queries against the materialized view. And work on MERGE support is
itself blocked behind the fact that PostgreSQL doesn't have a good way
to lock access to a key value that doesn't exist yet--what other
databases call key range locking. See the notes for "Add SQL-standard
MERGE/REPLACE/UPSERT command" at http://wiki.postgresql.org/wiki/Todo
for more information.
You can work around that to build a prototype by grabbing a full table
lock on the materialized view when updating it, but that's not a
production quality solution. Solving that little detail is actually
more work than the entire project you've outlined. Your suggested
implementation--"In function CloseIntoRel executor swap relfilenode's
of temp table and original table and finally delete temp table"--is
where the full table lock is going to end up at. The exact use cases
that need materialized views cannot handle a CLUSTER-style table
recreation each time that needs an exclusive lock to switchover, so
that whole part of your design is going to be a prototype that doesn't
work at all like what needs to get built to make this feature
committable. It's also not a reasonable assumption that you have
enough disk space to hold a second copy of the MV in a production system.
For now I know it is not commitable in actual state, but for my thesis
it is enough and I know it will not be commitable with this design at
all. In case of GSoC it will depends on the time I will be able to spend
on it, if I will consider some other design.
Once there's a good way to merge updates, how to efficiently generate
them against the sort of large data sets that need materalized
views--so you just write out the updates rather than a whole new
copy--is itself a large project with a significant quantity of
academic research to absorb before starting. Dan Colish at Portland
State has been playing around with prototypes for the specific problem
of finding a good algorithm for view refreshing that is compatible
with PostgreSQL's execution model. He's already recognized the table
lock issue here and for the moment is ignoring that part. I don't
have a good feel yet for how long the targeted update code will take
to mature, but based on what I do know I suspect that little detail is
also a larger effort than the entire scope you're envisioning.
There's a reason why the MIT Press compendium "Materialized Views:
Techniques, Implementations, and Applications" is over 600 pages
long--I hope you've already started digging through that material.
I would like to start to dig through that, but I'm in a hurry now. I
already have made a small research on MV as part of my thesis. I also
plan to continue study PostgreSQL and Materialized Views more into the
depth, preferably as my master thesis. But I realize MV feature
commitable to PostgreSQL is not project for one person, of course.
Now, with all that said, that doesn't mean there's not a useful
project for you buried in this mess. The first two steps in your plan:
1) create materialized view
2) change rewriter
Include building a prototype grammer, doing an initial executor
implementation, and getting some sort of rewriter working. That is
potentially good groundwork to lay here. I would suggest that you
completely drop your step 3:
3) create command that takes snapshot (refresh MV)
Because you cannot built that in a way that will be useful (and by
that I mean committable quality) until there's a better way to handle
updates than writing a whole new table and grabbing a full relation
lock to switch to it. To do a good job just on the first two steps
should take at least a whole summer anyway--there's a whole stack of
background research needed I haven't seen anyone do yet, and that
isn't on your plan yet. There is a precedent for taking this
approach. After getting stalled trying to add the entirety of easy
partitioning to PostgreSQL, the current scope has been scaled back to
just trying to get the syntax and on-disk structure right, then finish
off the implementation. See
http://wiki.postgresql.org/wiki/Table_partitioning to get an idea how
that's been broken into those two major chunks.
Anyway thanks for all of your advices and help.
best regards,
Pavel Baros
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers