Hi,

I am Priyanka Sharma, a master's student at Vrije Universiteit, Amsterdam. My major is "parallel and distributed systems". I am interested in participating in GSoC 2010 with Cassandra, and I would like to implement the "demo application for Cassandra" project. I have pasted my proposal (not yet final) below. I tried to send it as an attachment, but there was a problem; the list may be filtering attachments.
You can also find the proposal and my CV online:

Proposal (organized and easier to read): http://www.few.vu.nl/~psa220/gsoc-proposal.pdf
CV: http://www.few.vu.nl/~psa220/priyanka_cv.pdf

I would appreciate your comments and feedback so that I can improve the proposal.

========================================================================================================

Cassandra GSoC 2010: Demo application for Cassandra
---------------------------------------------------

Name and Email Address: Priyanka Sharma, psa...@few.vu.nl, sharmapriyan...@gmail.com
Chat/IM IDs and Networks: psha...@irc.freenode.net

Bio, Resumé, or C.V.
--------------------

I strongly believe in learning through experimentation and am conscious of my responsibility to contribute effectively to my endeavors. I relish working in teams and am confident in my system-level programming skills. I am always keen to contribute to open source projects.

My interest in research and open source projects led me to work on Security-Enhanced Linux (SELinux, role-based access control). I extended the SELinux framework, and this project led to two international IEEE publications (for links, see my resume).

Currently, I am pursuing a master's degree in "parallel and distributed systems", and I have explored the area of distributed systems and databases quite well. I have worked on several distributed systems, such as the Plan 9 OS and other systems developed at INRIA, such as Telex. This has given me first-hand insight into real issues such as consistency, scalability, and fault tolerance. I am also writing a position paper on Cassandra in which I compare it with other data storage systems such as Bigtable and Dynamo, which will no doubt help me in this project.

I have just started using Cassandra, and I find it very interesting because of its ease of use and because it is not "just" a key/value store.
It has many properties that are useful, interesting, and different from other data storage models. This increased my motivation to work with Cassandra, and I believe that my in-depth study of, and hands-on experience with, distributed systems and storage systems make me an ideal candidate for this project. I also participated in GSoC 2009 with the Plan 9 from Bell Labs group and completed it successfully.

Please find my complete resume attached to this email or at http://www.few.vu.nl/~psa220/priyanka_cv.pdf

Project Title and Description
-----------------------------

Many large-scale applications run on Cassandra, for example at Facebook, Twitter, and Digg, but they do not show how they store their data in Cassandra. We need a small and simple application that demonstrates the features of Cassandra, explains how it differs from other distributed storage systems, and shows why so many applications are migrating to Cassandra today. For example, Cassandra supports "quorum" operations, in which (N/2)+1 of the N replicas must acknowledge a request; this provides consistency while keeping writes fast, because a write does not have to wait for every replica. Cassandra also uses "eventual consistency" to make data consistent (as does Amazon Dynamo).

A wiki is an application that deals with bulk data; it is essentially an encyclopedia. Managing such a large and changing body of data requires a lot of effort at the storage level. Maintaining indexes on different keys such as category, author, date, and ID adds more complexity and is very challenging. Such a system requires a distributed database with very efficient search and indexing facilities. Cassandra is a good fit here, as it provides good performance for indexing and searching.

Approach
---------

1) Implement a simple and clean demo application: A wiki is a sensible choice for a simple application. It will provide the main text editing feature, login, and additional features such as system information, user preferences, and recent changes.
Any user can edit pages, with the exception of some pages that can only be modified by authenticated users, i.e. users who are logged in (give solid example here). I may use Python or PHP to implement this application.

2) Use the Thrift API: To store data in Cassandra, the application needs an API through which it can talk to Cassandra. We will use the Thrift API, the most stable and popular one, to interact with Cassandra.

3) Store data in Cassandra: I will work out the best way to store the data in Cassandra so that it can be read and written efficiently. This means defining the columns, super columns, column families, and keyspace, and arranging them so that data retrieval is efficient and performs well. I also have to implement indexing for reads and writes that performs well.

4) Add some showcase features to the application: I will add features that showcase Cassandra. I will add a search feature to the wiki, as Cassandra is well suited to searches. I have to think about how to implement search internally, for example whether to search on super columns, so implementing an efficient search algorithm will be challenging. Another feature I would add is categories: each topic would belong to some "Category", and the user could also restrict a search to a particular category. An example would be "find the documents in a particular category that were changed in the last week", which joins two groups.

5) Implement group-based queries: I will provide group query results using the get_slice() functionality of the Thrift API. For example, if users want to see their change logs on a per-month or per-week basis, I can query Cassandra through the Thrift API (e.g. get_slice) on the basis of the key. This returns results quickly and also gives the user more flexibility.
6) Testing and demonstration:

6.1) Test and demonstrate the application: Once all of the above is in place, it is important to test that every feature of the application works as specified. Then I have to demonstrate some of the benefits of a system that uses Cassandra internally. This requires some case studies and a comparative study with other databases. I will test whether this system performs better than other systems for the same type of application.

6.2) Test on multiple nodes: Next, I will test the application with Cassandra deployed on multiple nodes. I will repeat the same read and write tests and compare the results with the performance of other distributed databases for the same kind of application.

Timeline
--------

April 20 - May 23: Community bonding! Use this time to read about and understand all the features that could be provided in the application to make it effective (in the sense of showcasing Cassandra).

May 24 - June 06: Implementation begins! Implement a simple wiki application with basic features such as editing documents and login. I may use PHP or Python for the implementation.

June 07 - June 13: Integrate the wiki application with Thrift to use Cassandra as the backend.

June 14 - June 30: Find and implement the best way to represent the data in storage.

July 1 - July 15: Add showcase features such as search and category search to the application.

July 16: Mid-term deliverables: Working implementation of an application running on Cassandra, which also provides some ...

July 17 - July 30: Implement group-based queries for the user profile, such as the latest change logs. Then implement the "join query" feature, where the user can search on category plus user-based data.

July 31 - Aug 15: Test the application! Do a comparative study with other databases. If the application still lacks features, add them.

Aug 16 - Aug 29: Test the application with Cassandra deployed on multiple nodes.

Aug 30: Final deliverable: A complete, working application running on Cassandra.

--
Thanks & Regards
Priyanka Sharma