Ben Swartzlander wrote:
Thanks to gouthamr for doing these writeups and for recording!

We had a great turnout at the manila Fishbowl and working sessions.
Important notes and Action Items are below:

===========================
Fishbowl 1: Race Conditions
===========================
Thursday 27th Oct / 11:00 - 11:40 / AC Hotel - Salon Barcelona - P1
Etherpad: https://etherpad.openstack.org/p/ocata-manila-race-conditions
Video: https://www.youtube.com/watch?v=__P7zQobAQw

Gist:
* We have some race conditions that have worsened over time:
* Deleting a share while snapshotting the share
* Two simultaneous delete-share calls
* Two simultaneous create-snapshot calls
* Though the end result of these race conditions is usually not
terrible, they can leave resources in untenable states, requiring
administrative cleanup in the worst case
* Any type of resource interaction must be protected in the database
with a test-and-set using the appropriate status fields
* Any test-and-set must be protected with a lock
* Locks must not be held over long-running tasks, e.g., RPC casts,
driver invocations, etc.
* We need more granular state transitions: micro/transitional states
must be added per resource and judiciously used for state locking
* Ex: Shares need a 'snapshotting' state
* Ex: Share servers need states to signify setup phases, a la nova
compute instances
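The pattern the notes describe — test-and-set under a lock, with the lock released before any long-running work — can be sketched as follows. This is a minimal in-memory stand-in (names and the 'snapshotting' micro-state are illustrative; real manila code would do this against the database):

```python
import threading

# In-memory stand-in for the shares table; keys are share IDs,
# values are status strings such as 'available' or 'snapshotting'.
shares = {"share-1": "available"}
state_lock = threading.Lock()  # protects only the test-and-set


def begin_snapshot(share_id):
    """Atomically move a share into the transitional 'snapshotting' state.

    Returns False if the share is busy (another operation won the race).
    """
    with state_lock:
        if shares.get(share_id) != "available":
            return False               # test failed: someone else got here first
        shares[share_id] = "snapshotting"  # set: claim the share
    # The lock is released *before* any long-running work (RPC cast,
    # driver invocation), so other requests are not blocked behind it.
    return True


def finish_snapshot(share_id):
    with state_lock:
        shares[share_id] = "available"
```

Two simultaneous create-snapshot calls then serialize cleanly: the first `begin_snapshot` returns True, the second returns False until `finish_snapshot` runs.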

Just something I've always wondered, and I know it's not an easy answer, but are there any ideas on why such concurrency issues keep getting discovered so late in the software lifecycle, instead of at design time? It's probably not just a manila question, but it strikes me as a confusing pattern that keeps popping up.

Discussion Item:
* Locks in the manila-api service (or specifically, extending usage of
locks across all manila services)
* Desirable because:
* Adding test-and-set logic at the database layer may render the code
unmaintainably complicated compared to using locking abstractions
(oslo.concurrency / tooz)
* Cinder has evolved an elegant test-and-set solution, but we may not
be able to benefit from that implementation because of the inability to
do multi-table updates and because the code references OVO, which
manila does not yet support.
* Un-desirable because:
* Most distributors (Red Hat/SUSE/Kubernetes-based/MOS) want to run
more than one API service in active-active H/A.
* If a true distributed locking mechanism isn't used/supported, the
current file-locks would be useless in the above scenario.
* Running file locks on shared file systems is a possibility, but it
adds configuration/setup burden
* Keeping all the locks in the share service would allow the API
service to scale out, and the share manager is really where things are
going wrong
* With a limited form of test-and-set, atomic state changes can still be
achieved for the API service.
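The "limited form of test-and-set" can be illustrated with a conditional UPDATE: the WHERE clause is the test, and the affected row count tells a caller whether it won the race. Because the database performs the UPDATE atomically, no process-level lock is needed, which is why this stays safe even with multiple active-active API services sharing one database. A minimal sqlite3 sketch (table and column names are illustrative, not manila's actual schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shares (id TEXT PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO shares VALUES ('share-1', 'available')")


def claim_for_deletion(conn, share_id):
    """Compare-and-swap the status column inside the database.

    The UPDATE only matches when the status is still 'available', so at
    most one of several simultaneous delete-share calls can succeed.
    """
    cur = conn.execute(
        "UPDATE shares SET status = 'deleting' "
        "WHERE id = ? AND status = 'available'",
        (share_id,),
    )
    return cur.rowcount == 1  # one row updated => this caller won the race
```

With two simultaneous delete-share calls, one caller gets True and proceeds; the other gets False and can return an error instead of corrupting state.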

Agreed:
* File locks will not help

Action Items:
(bswartz): Will propose a spec for the locking strategy
(volunteers): Act on the spec ^ and help add more transitional states
and locks (or test-and-set if any)
(gouthamr): state transition diagrams for shares/share
instances/replicas, access rules / instance access rules
(volunteers): Review ^ and add state transition diagrams for
snapshots/snapshot instances, share servers
(mkoderer): will help with determining race conditions within
manila-share with tests

=====================================
Fishbowl 2: Data Service / Jobs Table
=====================================
Thursday 27th Oct / 11:50 - 12:30 / AC Hotel - Salon Barcelona - P1
Etherpad:
https://etherpad.openstack.org/p/ocata-manila-data-service-jobs-table
Video: https://www.youtube.com/watch?v=Sajy2Qjqbmk

Will https://review.openstack.org/#/c/260246/ help here instead?

It's the equivalent of:

http://docs.openstack.org/developer/taskflow/jobs.html

Something to think about...


Gist:
* Currently, a synchronous RPC call is made from the API to the
share-manager/data-service that's performing a migration to get the
progress of a migration
* We need a way to record progress of long running tasks: migration,
backup, data copy etc.
* We need to introduce a jobs table so that the respective service
performing the long running task can write to the database and the API
relies on the database
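A jobs table along these lines might look like the following sketch. The schema is a guess at what such a spec could contain, not the actual proposal: the service doing the long-running work writes its progress to the table, and the API answers progress queries from the database instead of making a synchronous RPC call.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE jobs (
        id       TEXT PRIMARY KEY,
        resource TEXT,     -- e.g. the share being migrated
        kind     TEXT,     -- 'migration', 'backup', 'data_copy', ...
        state    TEXT,     -- 'running', 'done', 'error'
        progress INTEGER   -- percent complete
    )
""")

# The data service records its job and keeps updating its own progress...
conn.execute(
    "INSERT INTO jobs VALUES ('job-1', 'share-1', 'migration', 'running', 0)")
conn.execute("UPDATE jobs SET progress = 40 WHERE id = 'job-1'")


# ...and the API reads progress from the table; no RPC call is needed.
def get_progress(conn, job_id):
    row = conn.execute(
        "SELECT state, progress FROM jobs WHERE id = ?", (job_id,)
    ).fetchone()
    return {"state": row[0], "progress": row[1]}
```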

Discussion Items:
* There was a suggestion to extend the jobs table to all tasks on the
share: snapshotting, creating share from snapshot, extending, shrinking,
etc.
* We agreed not to do this because the table could easily grow out of
control, and there isn't a solid use case for registering all jobs.
Asynchronous user messages may be a better answer to this feature request
* "restartable" jobs would benefit from the jobs table
* service heartbeats could be used to react to services dying while
running long running jobs
* When running the data service in active-active mode, the jobs of a
service that goes down can be taken over by another data service
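The heartbeat idea above can be sketched as: each service periodically stamps a timestamp, and a peer treats a job as orphaned (eligible for takeover) once the owner's stamp goes stale. Names and the timeout are illustrative:

```python
import time

HEARTBEAT_TIMEOUT = 2.0  # seconds; illustrative, would be configurable

heartbeats = {}  # service name -> last time it checked in


def beat(service, now=None):
    """Record a heartbeat; `now` is injectable for testing."""
    heartbeats[service] = time.time() if now is None else now


def is_alive(service, now=None):
    now = time.time() if now is None else now
    last = heartbeats.get(service)
    return last is not None and (now - last) < HEARTBEAT_TIMEOUT


def orphaned_jobs(jobs, now=None):
    """Jobs whose owning data service has stopped heartbeating, and
    whose work could be taken over by another active-active service."""
    return [job for job, owner in jobs.items() if not is_alive(owner, now)]
```

A surviving data service would poll `orphaned_jobs` and claim any results, ideally with a test-and-set on the jobs table so that two survivors do not both adopt the same job.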

Action Items:
(ganso): Will determine the structure of the jobs table model in his spec
(ganso): Will determine the benefit of the data service reacting to
additions in the database rather than acting upon RPC requests

=====================================
Working Sessions 1: High Availability
=====================================
Thursday 27th Oct / 14:40 - 15:20 / CCIB - Centre de Convencions
Internacional de Barcelona - P1 - Room 130
Etherpad: https://etherpad.openstack.org/p/ocata-manila-high-availability
Video: https://www.youtube.com/watch?v=xFk8ShK6qxU

Gist:
* We have a patch to introduce the tooz abstraction library to manila;
it currently creates a tooz coordinator for the manila-share service and
demonstrates replacing oslo.concurrency locks with tooz locks:
https://review.openstack.org/#/c/318336/
* The heartbeat seems to have issues, needs debugging
* The owner/committer has tested this patch with both the file driver
and Kazoo/ZooKeeper as tooz backends. We need to test other tooz backends
* Distributors do not package dependencies for all tooz backends
* We plan to introduce leader election via tooz, and to use it for
cleanups and for designating the service that performs polling
(migration, replication of shares and snapshots, share server cleanup)
* Code needs to be written to integrate the use of tooz/dlm via the
manila devstack plugin so it can be gate tested

Action Items:
(gouthamr): Will document how to set up tooz with 2 or more share services
(bswartz): Will set up a sub group of contributors to code/test H/A
solutions in this release


<cut>

-Josh
