Hi Sharon Lucas,
I initially posted this in the discussion boards under my thread:
"concurrent background commands and monitoring their execution", but got no
reply.
I reckon we missed each other because you were lacking vital information
about the project, so here we go:
Thanks for all your help so far.
Since we're in such different time zones - IMHO it can be more productive to
simply lay it all out, even if it takes a couple of messages, so that we each
get something useful the next day. I hope you agree and that I'm not too much
trouble.
Perhaps what is most needed here is a clear view of the direction/requirements:
1. The stress-test system is mimicking real world stress by intelligently
issuing commands at set intervals:
* mimicking real world behaviors (the following are example numbers for
illustration):
* I've checked and found that a query happens 7,200,000 times a year.
* taking into account 250 working days and 8 working hours per day
(those will eventually be parameter/variables for the testcase), I figure I
need to trigger a query every 1 second.
* of course this will give me an even spread, whereas in reality there
are peak hours (and even peak minutes - for example right after lunch) and very
quiet hours (nights and weekends, where my "formula" says 0 but that's not
really the case)
* a good approach is to allow "condensing" (even to: "fire 300
queries simultaneously" to simulate the minute right after lunch) and
"spreading", such that all-in-all it will sum to 7,200,000 queries a year.
* for that purpose a "ticks" queue looks like the most promising approach:
I can "spread" the handling of ticks or take them more densely and still know
I'm on target (7,200,000).
* on the other hand I must safeguard against overstraining my
stresstesters; otherwise bad test results may come from running too many
concurrent commands on that stresstest computer.
* for this, I thought about a "respool" of slots
* each stresstester has a fixed amount of concurrent command slots it
can handle, primarily by hardware resources.
* a main "queries" respool is defined, where each computer is
registered with its slots
* each command (logically "a command"; this can be a series of
commands or a STAX job) takes a slot and runs its course in the background.
* for example, a query can take 60 seconds (while a query "tick" is
once each second).
* when the command finishes, the slot is returned.
* the slot mechanism will allow easy checking of test environment
strain, i.e. if I see the ticks queue build up "monotonously"/"to a high
degree" I know I need faster and/or more stresstesters (as opposed to a
momentary buildup from intentionally slowing down tick handling to allow
simultaneous invocation of multiple ticks at once)
* I was oversimplifying a bit on the types of commands: there are
different types of activities to be simulated, for example: queries (discussed
above), stores, loads, reports, syncs and so on, to name a few.
* this translates into different slot respools for the different
command types.
* stresstesters will have slots of multiple command types at
different mixes (i.e. 5-stores and 50-queries)
* there will be different ticks for different activities (for
example, there might be 240,000 stores a year, translated into once every 30
seconds on average, using the same 250 working days/8 working hours formula
from the example above).
* Finally, a result report should be gathered from all commands issued;
for example, for each query: how much time did it take, how many results were
returned, etc.
* for starters a CSV output is good enough.
* it is also possible to issue log messages at runtime (to one of the
User* levels) with average times/50th percentiles/90th percentiles etc. for
the different activities.
* at the end of the day, perhaps a STAX Monitor extension for runtime
display will be done.
* Last and currently least: some activities (i.e. scenarios) are to be
played in exact order, requiring sharing of runtime information between
running jobs. For this I thought about the VAR service, but as I'm not
handling it right now we can ignore it for the time being. It is explained
further in the "design" below to some extent.
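To make the arithmetic above concrete, here is a minimal Python sketch
(illustrative only - the function name and defaults are mine, not part of the
system) that derives the average tick interval from a yearly activity count
using the 250-working-days/8-working-hours formula:

```python
def tick_interval_seconds(yearly_count, working_days=250, hours_per_day=8):
    """Average seconds between two activities of one type, spread evenly
    over the working time in a year (example formula from the mail)."""
    working_seconds = working_days * hours_per_day * 3600
    return working_seconds / yearly_count

# Example numbers from above:
queries_interval = tick_interval_seconds(7_200_000)  # -> 1.0 second
stores_interval = tick_interval_seconds(240_000)     # -> 30.0 seconds
```

With these example numbers the working year is exactly 7,200,000 seconds, so
queries land once a second and stores once every 30 seconds, matching the
figures above.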
2. The second goal of the system is to be as robust and indifferent as
possible to different environments:
* this system will be used by both R&D engineers at a very small scale
(their single computer for both the tested application and issuing
queries/stores/whatever)
* QA at medium scales for application sanity tests on VMs
* at very large scales by the platform team trying to size a site or
test hardware/storage compatibility, thus using real-fast-storage and
real-fast-servers and trying real sites profiles.
* so the system must allow all this, be indifferent, and require as
little configuration as possible (ideally none).
3. Lastly (and I think this is what you were referring to in your questions
at the beginning): also compile a report from the tested application(s)
servers for both software and primarily hardware stats:
* first and foremost, this is a completely different STAF task to be run
in parallel, regardless of the actual test job.
* at a set interval (say each minute) this will check the internal tested
application statistics and the hardware it is running on for memory/CPU/network
etc.
* also storage and possibly other intermediaries such as network/fabric
switches, etc. will be checked.
* perhaps a good time to note that the tested application is usually on
multiple servers (examples: main server, compression server, speech server, web
frontend server, database server, sometimes other "application" servers, backup
server, etc.), so all of those should be checked.
* initially a CSV file is good enough for all those metrics, to allow
easy cross-referencing with the "clients" metrics (the CSV file mentioned
above of query times, etc.). As with the client metrics, at the end a STAX
Monitor extension may be the way to go for runtime control.
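As a rough sketch of what that parallel stats task might write (all names here
are hypothetical: `get_metrics` is a stub standing in for the real per-server
checks, and the server roles are the examples from above):

```python
import csv
import time

# Example server roles from the mail; a real site would have more.
SERVERS = ["main", "compression", "web-frontend", "database"]

def get_metrics(server):
    # Stub: the real task would query the server (via STAF or OS tools)
    # for memory/CPU/network and application-internal statistics.
    return {"cpu_pct": 0.0, "mem_mb": 0.0}

def sample_once(writer, timestamp):
    """Append one row per server to the stats CSV."""
    for server in SERVERS:
        m = get_metrics(server)
        writer.writerow([timestamp, server, m["cpu_pct"], m["mem_mb"]])

with open("server_stats.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["time", "server", "cpu_pct", "mem_mb"])
    sample_once(writer, time.time())  # the real task would loop once a minute
```

The timestamp column is what makes cross-referencing with the client-side CSV
of query times straightforward.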
________________________________
How about the following design (which I more or less pictured above in the
general explanations/requirements/direction):
1. have a respool of slots, added by participating STAF computers in the test
environment upon their startup; this is so as not to overstrain my
stresstesters, affecting their results. For now, setting this up manually
suffices; later I can add another STAX job that adds/removes slots on the fly
based on the resources actually available.
2. of course there will be multiple respools by command types.
* so as not to overstrain the computer handling the respool allocations,
I thought about designating different stresstesters as responsible for
different pools.
* for example, stresstester1 has the queries respool, stresstester2 has
the stores respool, and so on
* of course this does not mean all store slots are on (in the example
above) stresstester2 - only their registration. So whenever I want to store, I
need to ask stresstester2 for a slot, which could actually be on stresstester1.
* this should "fold" nicely, i.e. for the platform requirements there
might be anywhere from 15 to 150 stresstesters with multiple "registries" for
slots by command type, while on a single developer's desk there will, of
course, be only one: his own computer.
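The slot/respool accounting in points 1-2 could be sketched like this in plain
Python (the host names and slot counts are just the example mix from above; a
real setup would register these with the STAF RESPOOL service instead):

```python
import threading

class SlotPool:
    """One pool per command type; each slot remembers which
    stresstester actually provides the capacity."""

    def __init__(self):
        self._slots = {}              # command type -> list of host names
        self._lock = threading.Lock()

    def register(self, command_type, host, count):
        """A stresstester registers `count` slots of one command type."""
        with self._lock:
            self._slots.setdefault(command_type, []).extend([host] * count)

    def acquire(self, command_type):
        """Take a slot; returns the host to run on, or None if drained."""
        with self._lock:
            slots = self._slots.get(command_type, [])
            return slots.pop() if slots else None

    def release(self, command_type, host):
        """Return a slot to the pool when the command finishes."""
        with self._lock:
            self._slots[command_type].append(host)

# Example mix from the mail: 5 stores and 50 queries on stresstester1,
# plus 50 more query slots on stresstester2.
pool = SlotPool()
pool.register("queries", "stresstester1", 50)
pool.register("stores", "stresstester1", 5)
pool.register("queries", "stresstester2", 50)
```

When `acquire` returns None the pool is exhausted - exactly the "queue
buildup" signal described above that says the environment is strained.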
3. use the (external) timer service to post "ticks" at set intervals.
4. use my own queue (actually multiple queues) to listen to those ticks, as
depicted in "Sample STAX Job 3 - Creating a STAF Handle and Using its Queue";
this way I can achieve the following:
* monitor queue buildup if not enough slots are available
* ability to do peak times/concurrent load while still keeping the
total overall yearly activity count ("ticks").
5. "ticks" behaviour:
* at each "tick" received, I can decide (perhaps at random, based on
test runtime, time of day or whatever) anywhere from issuing a single relevant
command, to doing nothing (and thus letting the queue build up intentionally),
to running as many commands as are on the queue at once (a peak point).
* alternatively, I may opt to intentionally (by testcase parameter) run
300 concurrent commands at 13:00 (the back-from-lunch rush minute), thus even
going into "overdraft" on the queue, and letting it build up again before
taking further actions.
* for every decision to act, a relevant resource will be taken from the
appropriate resource pool, and a command (or series of commands, stax job,
etc.) will be dispatched in the background.
* upon completion of that command, the resource will return to the
relevant pool.
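The per-tick decision in point 5 boils down to something like this sketch
(Python for illustration only; the mode names and the 300 peak size are just
my labels for the examples above, and a plain deque stands in for the STAF
handle queue):

```python
from collections import deque

def drain(queue, mode, peak_size=300):
    """Pop ticks off the queue and return how many commands to dispatch
    this wake-up. Each dispatched command would take a slot from the
    relevant respool and run in the background."""
    if mode == "idle":
        return 0                      # do nothing; backlog builds on purpose
    want = 1 if mode == "steady" else peak_size  # "peak": rush-minute burst
    taken = 0
    while queue and taken < want:
        queue.popleft()
        taken += 1
    return taken
```

The yearly total stays honest because ticks are only ever consumed, never
invented: whatever the idle periods leave behind, the peaks drain.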
6. "activity" task:
* register a testcase (for example, on queries: PASS by a time/results
formula which takes into account the amount of answers returned vs. the time
it took)
* each activity type (queries, stores, etc.) and stresstester will be
shown in a block, allowing holding all queries, for example, or just the
queries on stresstester1, or even all activities on stresstester1 (queries,
stores and whatever - I know this is a bit of a problem as it's not
hierarchical). This allows adding more slots at runtime by increasing that
stresstester's capacity (more memory, for instance; easy if it's a VM).
* create a results entry in the end report (which, for now, will be CSV).
The rest of the goals are not handled right now (such as the ability to do
things in a specific order, share information between components, or take
metrics from hardware [or application, Oracle, etc.] and compile another
report, to name a few).
My aim is to do a PoC of about 30% (the mechanism and perhaps one or two
activities) for starters.
In the design above I have the following questions:
1. What do you recommend, now that you (if you've actually read it all -
more than I can say for myself :) ) know exactly what's at hand?
2. Is the proposed design feasible?
3. Is VAR "thread safe"? Can I use it to share runtime data between multiple
running tasks?
* If so, why, by the way, do you have STAXGlobal?
* If not, how can I share (relatively) vast runtime information? I'm
asking for the later stage of doing things in a particular order (example:
store something, allow it to move to the backup server and be deleted from the
main application server [by application configuration, for example after 10
minutes], then query it, then load it [now it is actually being fetched from
the backup server], etc.)
4. Similarly, how would you suggest "thread safe"ly writing results from
multiple stresstesters? Of course I can combine their results at a later
stage, but what I figured would be nicer is utilizing the Log service to do
this:
* is the Log service "thread safe"?
* is it possible to give it a custom format, so as to create a CSV file
instead of a plain text file?
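If the Log service turns out not to fit, one common workaround for question 4
(a sketch only, not a claim about STAF itself) is to funnel all results
through a single writer thread over a queue, so only one thread ever touches
the CSV file:

```python
import csv
import queue
import threading

results = queue.Queue()  # stresstester threads put result rows here

def writer_thread(path):
    """Sole owner of the CSV file: drains the queue until a None sentinel."""
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["host", "activity", "seconds", "result_count"])
        while True:
            row = results.get()
            if row is None:          # sentinel: shut down cleanly
                return
            w.writerow(row)

t = threading.Thread(target=writer_thread, args=("results.csv",))
t.start()
results.put(["stresstester1", "query", 1.7, 42])  # example result row
results.put(None)
t.join()
```

Since every producer just enqueues, the "thread safe" question moves from the
file (or Log service) to `queue.Queue`, which handles the locking itself.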
5. Most importantly: how can I both run background commands (a process,
multiple processes, STAX jobs, etc.) and have them update my testcase results
(in a "pretty" way)? This is where I'm at now and what this whole thread is
about.
I'm sorry in advance for the very lengthy message; take some comfort in the
fact that it probably took me longer to write it...
I hope this huge message will answer everything you need to know to give good
and proper advice.
Much obliged, Nitzan.
P.S.
This is one proposed design; the other one has to do with held parallel
"slots" (either via sem events or held blocks), as explained in
http://sourceforge.net/p/staf/discussion/104046/thread/a5100142/#98ec in the
attached pool3.xml and pool4.xml.
Can you please provide advice before I go too far in the wrong direction?
Thanks again, Nitzan.
_______________________________________________
staf-users mailing list
staf-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/staf-users