Re: GSoc 2015 | COUCHDB-1743 Make the view server & protocol faster

Jan Lehnardt Mon, 16 Mar 2015 14:34:15 -0700

Dear Buddhika,

thank you for your interest in CouchDB and the CouchDB View Server!

This is an area where you can make significant contributions to CouchDB.

It is also a little bit involved, but you seem to have all the skills
required to pull this off :)

I’m happy to mentor you.

> On 16 Mar 2015, at 10:03, Buddhika Jayawardhana <[email protected]> wrote:
> 
> Hi,
> I am an Undergraduate of Department of Computer Science and Engineering
> University of Moratuwa. I have been subscribed to couchdb mailing list
> since months and I have been trying to learn some Erlang to work with
> couchdb. I noticed project  "COUCHDB-1743 Make the view server & protocol
> faster" is related to GSoC. I am willing to submit a project proposal for
> this project.
> 
> I have theoretical knowledge in software process, design patterns, and
> other Engineering concepts. I've been using 'java', 'C++' for high-level
> programming and 'C', a little bit of assembly for low-level programming and
> PHP and JavaScript  for web development. Also I have sound knowledge  on
> Erlang. I would be much thankful if you can guide to get familiar with the
> project as soon as possible.
> 
> Here are the problems in my mind
> 
>   - Are the other programming languages that I should get familiar with?

Erlang and JavaScript will do, some knowledge of C to understand the
current system will help.

>   - What are the technologies I should get familiar with?

General knowledge of Unix/POSIX fundamentals (processes, fds, stdio etc.)
will be required. Windows equivalent APIs too (but not strictly a
requirement just yet).

>   - I can work 40 hours per week for the project. Would that be enough to
>   successfully complete the project?

I can’t estimate whether you’d be able to complete this 100%, but I’m sure
that this is enough time to make a significant contribution that the
community can then take and finish up, should you not get to the end. I.e.
don’t worry too much about this :)

>   - What are the other resources that I should read before submitting the
>   proposal?

Familiarity with the CouchDB source can’t hurt, nor can more in-depth
knowledge of Erlang: http://learnyousomeerlang.com is a great free resource,
and the main Erlang docs are worth a read as well, as are the print books
available from various publishers.

It will definitely also help to read through the CouchDB Guide: 
http://guide.couchdb.org

Some parts of it have already been integrated into http://docs.couchdb.org,
which you should also read, especially the bits about Design Documents, Views
and List, Show, Validation, Filter and Update functions.

In addition, check out the query_server_spec, it codifies the current query
server protocol:

https://github.com/apache/couchdb/blob/master/test/view_server/query_server_spec.rb
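
To give a feel for the request/response style that spec describes, here is a
minimal sketch, assuming one JSON command per line in and one JSON reply per
line out. It only models "reset", "add_fun" and "map_doc", does no sandboxing
at all, and the error reply shape is a placeholder of mine. createViewServer
and handleLine are made-up names, not anything from share/server.

```javascript
// Sketch only: a toy line-based handler, not the real CouchDB code.
function createViewServer() {
  let mapFuns = [];  // compiled map functions, in the order they were added
  let emitted = [];  // rows collected for the map function currently running

  const emit = (key, value) => { emitted.push([key, value]); };

  function handleLine(line) {
    const [cmd, ...args] = JSON.parse(line);
    switch (cmd) {
      case "reset":
        mapFuns = [];
        return JSON.stringify(true);
      case "add_fun": {
        // Compile the source with `emit` in scope. No isolation here.
        const compile = new Function("emit", "return (" + args[0] + ");");
        mapFuns.push(compile(emit));
        return JSON.stringify(true);
      }
      case "map_doc": {
        // One list of [key, value] rows per registered map function.
        const results = mapFuns.map((fun) => {
          emitted = [];
          fun(args[0]);
          return emitted;
        });
        return JSON.stringify(results);
      }
      default:
        // Placeholder error shape (assumption, not taken from the spec).
        return JSON.stringify({ error: "unknown_command", reason: String(cmd) });
    }
  }

  return { handleLine };
}

const server = createViewServer();
const replies = [
  '["reset"]',
  '["add_fun", "function (doc) { emit(doc._id, doc.n * 2); }"]',
  '["map_doc", {"_id": "a", "n": 21}]',
].map((line) => server.handleLine(line));

console.log(replies.join("\n"));
```

Note how the caller has to wait for each reply before it can send the next
command. That is the part a streaming protocol would change.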

> Hope you will guide me through the project.

Again, thanks for taking an interest in this! :)

To get things rolling, here’s my rough idea for how this could play out:

Generally, there are three components: the Erlang part, the JavaScript part,
and the JavaScript runtime, couchjs.

Together, we call these the Query Server or View Server.

The Erlang part lives in https://github.com/apache/couchdb-couch-mrview

The JavaScript part lives in 
https://github.com/apache/couchdb/tree/master/share/server

The current JavaScript runtime is Spidermonkey. We have our own C-wrapper
around Spidermonkey, to make it a CLI tool that talks stdio:

  https://github.com/apache/couchdb-couch/tree/master/priv/couch_js

We’d generally like to move away from the custom C-wrapped Spidermonkey and
have V8 be the execution engine. We’d also like to get away from having to
maintain C/C++. It’d probably be simplest to use Node.js as a wrapper,
because then many more people can contribute to this. Also, Node.js is good
at streaming protocols, so it is a natural fit.

Here is how I would start:

1. Create a new Query Server that *only* handles Show, List, Filter, Validation
   and Update functions as that is a lot simpler on both the Erlang and
   JavaScript side.

2. As part of 1: Design a new Query Server protocol that works in a streaming
   fashion. The current one is request/response based and both sides are waiting
   for one another while one of them is doing actual work. It’d be nice if both
   could just keep working on whatever they need to do.

3. Once 1. and 2. are in place and working correctly, expand the new Query
   Server to also handle Views. At this point, adding view support should not
   be too complicated anymore.
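
For the streaming part of step 2, a minimal sketch of what the receiving side
could look like, assuming newline-delimited JSON as the framing. The framing,
the "stream" and "cmd" fields, and the name createLineDecoder are all
assumptions for illustration, not a proposed wire format. The point is only
that chunks can arrive split or merged in any way, and that messages for
different requests can be handed off as soon as they are complete.

```javascript
// Sketch only: turn arbitrary chunks into complete JSON messages.
// Chunks may split or merge messages, so keep the unfinished tail around.
function createLineDecoder(onMessage) {
  let buffer = "";
  return function push(chunk) {
    buffer += chunk;
    let newline;
    while ((newline = buffer.indexOf("\n")) !== -1) {
      const line = buffer.slice(0, newline);
      buffer = buffer.slice(newline + 1);
      if (line.trim() !== "") onMessage(JSON.parse(line));
    }
  };
}

const received = [];
const push = createLineDecoder((msg) => received.push(msg));

// Three chunks that do not line up with message boundaries.
push('{"stream":"s1","cmd":"show","doc":{"_id"');
push(':"a"}}\n{"stream":"s2","cmd":"filter"}\n{"stre');
push('am":"s1","cmd":"end"}\n');

console.log(received.map((m) => m.stream + ":" + m.cmd).join(" "));
// prints "s1:show s2:filter s1:end"
```

In a real Node.js process, push would be fed from the stdin data events. I
left that out to keep the sketch self-contained.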

Things to watch out for:

- map/reduce functions for CouchDB views need to be “pure”, i.e. we need to
  guarantee they stay the same unless CouchDB can see any changes (and then
  invalidate the view index). This means we need some extra isolation of the
  JS execution, and some limitation or observation of the require() system.

  There is a project that demonstrated we can do this. Jason Smith has run this,
  but I can’t seem to find it on his GitHub. Jason, do you have any pointers?

- A couchjs process can be used for multiple databases and different access
  control can be configured per database. Data MUST NOT leak between
  databases. E.g. errors that are thrown when requesting a view result on
  database A must not show any process state data that comes from database B
  (and vice versa).

- The current system works much like CGI. A single process can handle one
  request at a time; if there are two concurrent requests, a new process is
  spawned. The new Query Server should be able to handle multiple concurrent
  requests. But there will be a time when a single process is saturated; at
  that point, we should be able to spawn more Query Servers to help with the
  load. In the 1./2./3. list above, I’d either solve this upfront, or after
  3., depending on what you are more comfortable with. It might be easier to
  get started without this, but it might be harder to add later and easier
  overall to have thought this through upfront.

- Windows stdio can be troublesome, beware :)

- Windows process handling can also be troublesome, that’s why we are using
  https://github.com/apache/couchdb-couch/tree/master/priv/spawnkillable to
  kill/reap couchjs processes there. Not sure we still need this when we use
  Node.js, but worth checking out.

- I had a bit of time last year to experiment with streaming Erlang/Node.js
  communication. It worked fine, but I didn’t get very far (the JavaScript
  part just echoes commands back to Erlang). The projects could help as
  inspiration:

  https://github.com/janl/couch_query_server2
  https://github.com/janl/node-couch-query-server2
  (key code is in src/couch_query_server2_sup.erl)

  It uses the Erlang pid as a stream marker so we can interleave requests.

  Please excuse the lack of a README or other instructions!
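
The stream-marker idea from the last point can be sketched on the JavaScript
side roughly like this. It is not the code from those repositories. The
message shape {marker, chunk, done} and the names createDemux, open and
deliver are made up for illustration, and the marker strings only stand in
for Erlang pids. Each marker gets its own buffer, and a chunk with an unknown
marker is dropped rather than routed somewhere it does not belong.

```javascript
// Sketch only: demultiplex interleaved response chunks by stream marker.
function createDemux() {
  const streams = new Map(); // marker -> { chunks, onDone }

  function open(marker, onDone) {
    if (streams.has(marker)) throw new Error("marker already in use: " + marker);
    streams.set(marker, { chunks: [], onDone });
  }

  function deliver(message) {
    const stream = streams.get(message.marker);
    if (!stream) return false; // unknown marker: drop, never misroute
    if (message.chunk !== undefined) stream.chunks.push(message.chunk);
    if (message.done) {
      streams.delete(message.marker);
      stream.onDone(stream.chunks.join(""));
    }
    return true;
  }

  return { open, deliver };
}

const demux = createDemux();
const finished = {};
demux.open("<0.101.0>", (body) => { finished.a = body; });
demux.open("<0.102.0>", (body) => { finished.b = body; });

// Two responses interleaved on one channel; B completes before A.
[
  { marker: "<0.101.0>", chunk: "head-A," },
  { marker: "<0.102.0>", chunk: "head-B," },
  { marker: "<0.102.0>", chunk: "tail-B", done: true },
  { marker: "<0.101.0>", chunk: "tail-A", done: true },
].forEach((message) => demux.deliver(message));

console.log(finished.a, "|", finished.b);
// prints "head-A,tail-A | head-B,tail-B"
```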

This is all I have for now. Other folks may want to chime in with their
opinions :)

If you have any more questions, let me know. If you want to take this into
JIRA, let’s open a new ticket.

Best
Jan
-- 

> Thank You.
> 
> -- 
> *Buddhika Jayawardhana*
> Undergraduate | Department of Computer Science & Engineering
> University of Moratuwa
> *[email protected] <[email protected]>* | LinkedIn
> <http://lk.linkedin.com/in/buddhikajay/>

-- 
Professional Support for Apache CouchDB:
http://www.neighbourhood.ie/couchdb-support/
