Hi Sharon Lucas,
I initially posted this in the discussion boards under my thread:
"concurrent background commands and monitoring their execution", but got no
reply.
I reckon we missed each other because you were lacking vital information
about the project, so here we go:
Thanks for all your help so far.
Since we're in such different time zones - IMHO it can be more productive to
simply lay it all out, even if it takes a couple of messages, so that we each
get something useful the next day. I hope you agree and that I'm not too much
trouble.
Perhaps what is most needed here is a clear view of the direction/requirements:
1. The stress-test system is mimicking real world stress by intelligently
issuing commands at set intervals:
* mimicking real world behaviors (the following are example numbers for
illustration):
* I've checked and found that a query happens 7,200,000 times a year.
* taking into account 250 working days and 8 working hours per day
(those will eventually be parameter/variables for the testcase), I figure I
need to trigger a query every 1 second.
* of course this will give me an even spread, whereas in reality there
are peak hours (and even peak minutes - for example right after lunch) and very
quiet hours (nights and weekends, where my "formula" says 0 but that's not
really the case)
* a good approach is to allow "condensing" (even to: "fire 300
queries simultaneously" to simulate the minute right after lunch) and
"spreading", such that all-in-all it will sum to 7,200,000 queries a year.
* for that purpose a "ticks" queue looks like the most promising approach:
I can "spread" the handling of ticks or take them more densely and still know
I'm on target (7,200,000).
* on the other hand I must safeguard against overstraining my
stresstesters; otherwise bad test results may come from running too many
concurrent commands on that stresstest computer.
* for this, I thought about a "respool" of slots
* each stresstester has a fixed amount of concurrent command slots it
can handle, primarily by hardware resources.
* a main "queries" respool is defined, where each computer is
registered with its slots
* each command (logically "a command"; this can be a series of
commands or a STAX job) takes a slot and runs its course in the background.
* for example, a query can take 60 seconds (while a query "tick" is
once each second).
* when the command finishes, the slot is returned.
* the slot mechanism will allow easy checking of test environment
strain, i.e. if I see the ticks queue build up "monotonously"/"to a high
degree" I know I need faster and/or more stresstesters (as opposed to a
momentary buildup from intentionally slowing down tick handling to allow
simultaneous invocation of multiple ticks at once)
* I was oversimplifying a bit on the types of commands: there are
different types of activities to be simulated, for example: queries (discussed
above), stores, loads, reports, syncs and so on, to name a few.
* this translates into different slot respools for the different
command types.
* stresstesters will have slots of multiple command types at
different mixes (i.e. 5-stores and 50-queries)
* there will be different ticks for different activities (for
example, there might be 240,000 stores a year, translated into once every 30
seconds on average, using the same 250 working days/8 working hours formula
from the example above).
* Finally, a result report should be gathered from all commands issued;
for example, for each query: how much time did it take, how many results were
returned, etc.
* for starters a CSV output is good enough.
* it is also possible to issue log messages at runtime (to one of the
User* levels) with average times/50th percentiles/90th percentiles etc. for
the different activities.
* at the end of the day, perhaps a STAX Monitor extension for runtime
display will be done.
* Last and currently least: some activities (i.e. scenarios) are to be
played in exact order, requiring sharing of runtime information between
running jobs. For this I thought about the VAR service, but as I'm not
handling it right now we can ignore it for the time being. It is explained
further in the "design" below to some extent.
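To make the arithmetic above concrete, here is a minimal Python sketch
(illustrative only - the function name and defaults are mine, not part of the
system) that derives the average tick interval from a yearly activity count
using the 250-working-days/8-working-hours formula:

```python
def tick_interval_seconds(yearly_count, working_days=250, hours_per_day=8):
    """Average seconds between two activities of one type, spread evenly
    over the working time in a year (example formula from the mail)."""
    working_seconds = working_days * hours_per_day * 3600
    return working_seconds / yearly_count

# Example numbers from above:
queries_interval = tick_interval_seconds(7_200_000)  # -> 1.0 second
stores_interval = tick_interval_seconds(240_000)     # -> 30.0 seconds
```

With these example numbers the working year is exactly 7,200,000 seconds, so
queries land once a second and stores once every 30 seconds, matching the
figures above.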
2. The second goal of the system is to be as robust and indifferent as
possible to different environments:
* this system will be used by both R&D engineers at a very small scale
(their single computer for both the tested application and issuing
queries/stores/whatever)
* QA at medium scales for application sanity tests on VMs
* at very large scales by the platform team trying to size a site or
test hardware/storage compatibility, thus using real-fast-storage and
real-fast-servers and trying real sites profiles.
* so the system must allow all this, be indifferent, and require as
little configuration as possible (ideally none).
3. Lastly (and I think this is what you were referring to in your questions
at the beginning): also compile a report from the tested application(s)
servers for both software and primarily hardware stats:
* first and foremost, this is a completely different STAF task to be run
in parallel, regardless of the actual test job.
* at a set interval (say each minute) this will check the internal tested
application statistics and the hardware it is running on for memory/CPU/network
etc.
* also storage and possibly other intermediaries such as network/fabric
switches, etc. will be checked.
* perhaps a good time to note that the tested application is usually on
multiple servers (examples: main server, compression server, speech server, web
frontend server, database server, sometimes other "application" servers, backup
server, etc.), so all of those should be checked.
* initially a CSV file is good enough for all those metrics, to allow
easy cross-referencing with the "clients" metrics (the CSV file mentioned
above of query times, etc.). As with the client metrics, at the end a STAX
Monitor extension may be the way to go for runtime control.
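As a rough sketch of what that parallel stats task might write (all names here
are hypothetical: `get_metrics` is a stub standing in for the real per-server
checks, and the server roles are the examples from above):

```python
import csv
import time

# Example server roles from the mail; a real site would have more.
SERVERS = ["main", "compression", "web-frontend", "database"]

def get_metrics(server):
    # Stub: the real task would query the server (via STAF or OS tools)
    # for memory/CPU/network and application-internal statistics.
    return {"cpu_pct": 0.0, "mem_mb": 0.0}

def sample_once(writer, timestamp):
    """Append one row per server to the stats CSV."""
    for server in SERVERS:
        m = get_metrics(server)
        writer.writerow([timestamp, server, m["cpu_pct"], m["mem_mb"]])

with open("server_stats.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["time", "server", "cpu_pct", "mem_mb"])
    sample_once(writer, time.time())  # the real task would loop once a minute
```

The timestamp column is what makes cross-referencing with the client-side CSV
of query times straightforward.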
________________________________
How about the following design (which I more or less pictured above in the
general explanations/requirements/direction):
1. have a respool of slots, added by participating STAF computers in the test
environment upon their startup; this is so as not to overstrain my
stresstesters, affecting their results. For now, setting this up manually
suffices; later I can add another STAX job that adds/removes slots on the fly
based on the resources actually available.
2. of course there will be multiple respools by command types.
* so as not to overstrain the computer handling the respool allocations,
I thought about designating different stresstesters as responsible for
different pools.
* for example, stresstester1 has the queries respool, stresstester2 has
the stores respool, and so on
* of course this does not mean all store slots are on (in the example
above) stresstester2 - only their registration. So whenever I want to store, I
need to ask stresstester2 for a slot, which could actually be on stresstester1.
* this should "fold" nicely, i.e. for the platform requirements there
might be anywhere from 15 to 150 stresstesters with multiple "registries" for
slots by command type, while on a single developer's desk there will, of
course, be only one: his own computer.
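The slot/respool accounting in points 1-2 could be sketched like this in plain
Python (the host names and slot counts are just the example mix from above; a
real setup would register these with the STAF RESPOOL service instead):

```python
import threading

class SlotPool:
    """One pool per command type; each slot remembers which
    stresstester actually provides the capacity."""

    def __init__(self):
        self._slots = {}              # command type -> list of host names
        self._lock = threading.Lock()

    def register(self, command_type, host, count):
        """A stresstester registers `count` slots of one command type."""
        with self._lock:
            self._slots.setdefault(command_type, []).extend([host] * count)

    def acquire(self, command_type):
        """Take a slot; returns the host to run on, or None if drained."""
        with self._lock:
            slots = self._slots.get(command_type, [])
            return slots.pop() if slots else None

    def release(self, command_type, host):
        """Return a slot to the pool when the command finishes."""
        with self._lock:
            self._slots[command_type].append(host)

# Example mix from the mail: 5 stores and 50 queries on stresstester1,
# plus 50 more query slots on stresstester2.
pool = SlotPool()
pool.register("queries", "stresstester1", 50)
pool.register("stores", "stresstester1", 5)
pool.register("queries", "stresstester2", 50)
```

When `acquire` returns None the pool is exhausted - exactly the "queue
buildup" signal described above that says the environment is strained.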
3. use the (external) timer service to post "ticks" at set intervals.
4. use my own queue (actually multiple queues) to listen to those ticks, as
depicted in "Sample STAX Job 3 - Creating a STAF Handle and Using its Queue";
this way I can achieve the following:
* monitor queue buildup if not enough slots are available
* ability to do peak times/concurrent load while still keeping the
total overall yearly activity count ("ticks").
5. "ticks" behaviour:
* at each "tick" received, I can decide (perhaps at random, based on
test runtime, time of day or whatever) anywhere from issuing a single relevant
command, to doing nothing (and thus letting the queue build up intentionally),
to running as many commands as are on the queue at once (a peak point).
* alternatively, I may opt to intentionally (by testcase parameter) run
300 concurrent commands at 13:00 (the back-from-lunch rush minute), thus even
going into "overdraft" on the queue, and letting it build up again before
taking further actions.
* for every decision to act, a relevant resource will be taken from the
appropriate resource pool, and a command (or series of commands, stax job,
etc.) will be dispatched in the background.
* upon completion of that command, the resource will return to the
relevant pool.
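The per-tick decision in point 5 boils down to something like this sketch
(Python for illustration only; the mode names and the 300 peak size are just
my labels for the examples above, and a plain deque stands in for the STAF
handle queue):

```python
from collections import deque

def drain(queue, mode, peak_size=300):
    """Pop ticks off the queue and return how many commands to dispatch
    this wake-up. Each dispatched command would take a slot from the
    relevant respool and run in the background."""
    if mode == "idle":
        return 0                      # do nothing; backlog builds on purpose
    want = 1 if mode == "steady" else peak_size  # "peak": rush-minute burst
    taken = 0
    while queue and taken < want:
        queue.popleft()
        taken += 1
    return taken
```

The yearly total stays honest because ticks are only ever consumed, never
invented: whatever the idle periods leave behind, the peaks drain.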
6. "activity" task:
* register a testcase (for example, on queries: PASS by a time/results
formula which takes into account the amount of answers returned vs. the time
it took)
* each activity type (queries, stores, etc.) and stresstester will be
shown in a block, allowing holding all queries, for example, or just the
queries on stresstester1, or even all activities on stresstester1 (queries,
stores and whatever - I know this is a bit of a problem as it's not
hierarchical). This allows adding more slots at runtime by increasing that
stresstester's capacity (more memory, for instance; easy if it's a VM).
* create a results entry in the end report (which, for now, will be CSV).
The rest of the goals are not handled right now (such as the ability to do
things in a specific order, share information between components, or take
metrics from hardware [or application, Oracle, etc.] and compile another
report, to name a few).
My aim is to do a PoC of about 30% (the mechanism and perhaps one or two
activities) for starters.
In the design above I have the following questions:
1. What do you recommend, now that you (if you've actually read it all -
more than I can say for myself :) ) know exactly what's at hand?
2. Is the proposed design feasible?
3. Is VAR "thread safe"? Can I use it to share runtime data between multiple
running tasks?
* If so, why, by the way, do you have STAXGlobal?
* If not, how can I share (relatively) vast runtime information? I'm
asking for the later stage of doing things in a particular order (example:
store something, allow it to move to the backup server and be deleted from the
main application server [by application configuration, for example after 10
minutes], then query it, then load it [now it is actually being fetched from
the backup server], etc.)
4. Similarly, how would you suggest "thread safe"ly writing results from
multiple stresstesters? Of course I can combine their results at a later
stage, but what I figured would be nicer is utilizing the Log service to do
this:
* is the Log service "thread safe"?
* is it possible to give it a custom format, so as to create a CSV file
instead of a plain text file?
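If the Log service turns out not to fit, one common workaround for question 4
(a sketch only, not a claim about STAF itself) is to funnel all results
through a single writer thread over a queue, so only one thread ever touches
the CSV file:

```python
import csv
import queue
import threading

results = queue.Queue()  # stresstester threads put result rows here

def writer_thread(path):
    """Sole owner of the CSV file: drains the queue until a None sentinel."""
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["host", "activity", "seconds", "result_count"])
        while True:
            row = results.get()
            if row is None:          # sentinel: shut down cleanly
                return
            w.writerow(row)

t = threading.Thread(target=writer_thread, args=("results.csv",))
t.start()
results.put(["stresstester1", "query", 1.7, 42])  # example result row
results.put(None)
t.join()
```

Since every producer just enqueues, the "thread safe" question moves from the
file (or Log service) to `queue.Queue`, which handles the locking itself.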
5. Most importantly: how can I both run background commands (a process,
multiple processes, STAX jobs, etc.) and have them update my testcase results
(in a "pretty" way)? This is where I'm at now and what this whole thread is
about.
I'm sorry in advance for the very lengthy message; take some comfort in the
fact that it probably took me longer to write it...
I hope this huge message will answer everything you need to know to give good
and proper advice.
Much obliged, Nitzan.
P.S.
This is one proposed design; the other one has to do with held parallel
"slots" (either via sem events or held blocks), as explained in
http://sourceforge.net/p/staf/discussion/104046/thread/a5100142/#98ec in the
attached pool3.xml and pool4.xml.
Can you please provide advice before I go too far in the wrong direction?
Thanks again, Nitzan.
_______________________________________________
staf-users mailing list
staf-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/staf-users