There is a lot of discussion going on in the PR for ARROW-4313 itself; I
would like to bring some of the high-level questions here for discussion.
First of all, many thanks, Tanya, for the work you are doing.
Related to the dashboard internals, I would like to set a scope and stick
to it, so that we do not waste any effort and get maximum value from the
work we are doing on the dashboard.
One thing that IMHO we are missing is a clear statement of which
requirements the work (the DDL) is being done against, and within what
scope. For me there are several things:
1. We want continuous *validated* performance tracking against check-ins
to catch performance regressions and progressions. Validated means that
the running environment is isolated enough that the stddev (assuming the
distribution is normal) is as close to 0 as possible. That means both
hardware and software should be fixed and unchanging, so that the code
under test is the only variable being measured.
2. The benchmark framework (google/benchmark) can report the needed data
for each benchmark in textual or JSON format, with a preamble containing
information about the machine on which the benchmarks are run (a sketch
of collecting that output follows this list).
3. So with the environments set up and regular runs in place, you have
all the artifacts, though not in a very digestible format. The reason to
set up a dashboard is to make that data easy to consume, so we can track
the performance of the various parts from a historical perspective, and
much more nicely, with visualizations.
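
To make point 2 concrete, here is a minimal sketch (in Python) of how a
collector script might invoke a google/benchmark binary with JSON output
and pull out both the machine preamble and the per-benchmark timings. The
binary name is a placeholder, not an actual Arrow target, and which fields
we keep would be dictated by the DDL.

#!/usr/bin/env python3
"""Sketch: run a google/benchmark binary with JSON output and extract
the machine "context" preamble plus the per-benchmark timings."""
import json
import subprocess

BENCHMARK_BINARY = "./arrow-compute-benchmark"  # placeholder binary name

# --benchmark_format=json makes google/benchmark emit JSON on stdout,
# including a "context" object that describes the host machine.
result = subprocess.run(
    [BENCHMARK_BINARY, "--benchmark_format=json"],
    capture_output=True, text=True, check=True,
)
report = json.loads(result.stdout)

context = report["context"]  # num_cpus, mhz_per_cpu, cpu_scaling_enabled, ...
print("CPUs:", context["num_cpus"], "@", context["mhz_per_cpu"], "MHz")
print("CPU scaling enabled:", context["cpu_scaling_enabled"])

for bench in report["benchmarks"]:
    print(bench["name"], bench["real_time"], bench["time_unit"],
          "iterations:", bench["iterations"])

The same JSON is what an ingestion script would translate into rows for
the database.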
And here are the scope restrictions I have in mind:
- Disallow entering one-off benchmark runs into the central repo, as they
do not mean much in the context of continuous and statistically relevant
measurements. What information do you get if someone reports a single run?
You do not know how cleanly it was done and, more importantly, whether it
can be reproduced elsewhere. That is why, whether the result is better,
worse, or the same, you cannot compare it with the data already in the DB.
- Mandate that contributors have a dedicated environment for measurements.
Otherwise they can use TeamCity to run the benchmarks, parse the data, and
publish it on their own site. Data that enters the Arrow performance DB
becomes Arrow community-owned data, and it becomes the community's job to
answer why certain things are better or worse.
- Because the number of CPU/GPU/accelerator variants is huge, we cannot
satisfy every need upfront and create a DB that covers all possible
configurations. I think we should have simple CPU and GPU configs now,
even if they are not perfect. By simple I mean the basic brand string;
that should be enough. Storing all the detailed info in the DB does not
make sense; in my experience you never use it, you use the CPUID/brand
name to look up whatever info you need (a sketch of capturing the brand
string follows this list).
- Scope and requirements will change over time, and going big now will
make things complicated later. So I think it will be beneficial to get
something up and running quickly, develop a better understanding of our
needs and gaps, and go from there.
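
For the machine identification above, capturing just the brand string
could be as simple as the following sketch (Linux-centric; the dictionary
keys are purely illustrative, not a committed schema):

"""Sketch: record only the machine name and the CPU brand string."""
import platform
import re

def cpu_brand_string():
    # On Linux the brand string is the "model name" field in /proc/cpuinfo.
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                m = re.match(r"model name\s*:\s*(.*)", line)
                if m:
                    return m.group(1).strip()
    except OSError:
        pass
    # Fallback for non-Linux hosts: whatever the interpreter reports.
    return platform.processor() or platform.machine()

machine_record = {
    "machine_name": platform.node(),  # acts as the machine's "user id"
    "cpu_brand": cpu_brand_string(),
}
print(machine_record)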
The needed infrastructure is already up on AWS, so as soon as we resolve
the DNS and key-exchange issues we can launch.

-Areg.

-----Original Message-----
From: Tanya Schlusser [mailto:ta...@tickel.net] 
Sent: Thursday, February 7, 2019 4:40 PM
To: dev@arrow.apache.org
Subject: Re: Benchmarking dashboard proposal

Late, but there's a PR now with first-draft DDL ( 
https://github.com/apache/arrow/pull/3586).
Happy to receive any feedback!

I tried to think about how people would submit benchmarks, and added a 
Postgraphile container for http-via-GraphQL.
If others have strong opinions on the data modeling please speak up because I'm 
more a database user than a designer.

I can also help with benchmarking work in R/Python given guidance/a 
roadmap/examples from someone else.

Best,
Tanya

On Mon, Feb 4, 2019 at 12:37 PM Tanya Schlusser <ta...@tickel.net> wrote:

> I hope to make a PR with the DDL by tomorrow or Wednesday night—DDL 
> along with a README in a new directory `arrow/dev/benchmarking` unless 
> directed otherwise.
>
> A "C++ Benchmark Collector" script would be super. I expect some 
> back-and-forth on this to identify naïve assumptions in the data model.
>
> Attempting to submit actual benchmarks is how to get a handle on that. 
> I recognize I'm blocking downstream work. Better to get an initial PR 
> and some discussion going.
>
> Best,
> Tanya
>
> On Mon, Feb 4, 2019 at 10:10 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> hi folks,
>>
>> I'm curious where we currently stand on this project. I see the 
>> discussion in https://issues.apache.org/jira/browse/ARROW-4313 -- 
>> would the next step be to have a pull request with .sql files 
>> containing the DDL required to create the schema in PostgreSQL?
>>
>> I could volunteer to write the "C++ Benchmark Collector" script that 
>> will run all the benchmarks on Linux and collect their data to be 
>> inserted into the database.
>>
>> Thanks
>> Wes
>>
>> On Sun, Jan 27, 2019 at 12:20 AM Tanya Schlusser <ta...@tickel.net>
>> wrote:
>> >
>> > I don't want to be the bottleneck and have posted an initial draft 
>> > data model in the JIRA issue
>> https://issues.apache.org/jira/browse/ARROW-4313
>> >
>> > It should not be a problem to get content into a form that would be 
>> > acceptable for either a static site like ASV (via CORS queries to a 
>> > GraphQL/REST interface) or a codespeed-style site (via a separate 
>> > schema organized for Django)
>> >
>> > I don't think I'm experienced enough to actually write any 
>> > benchmarks though, so all I can contribute is backend work for this task.
>> >
>> > Best,
>> > Tanya
>> >
>> > On Sat, Jan 26, 2019 at 7:37 PM Wes McKinney <wesmck...@gmail.com>
>> wrote:
>> >
>> > > hi folks,
>> > >
>> > > I'd like to propose some kind of timeline for getting a first 
>> > > iteration of a benchmark database developed and live, with 
>> > > scripts to enable one or more initial agents to start adding new 
>> > > data on a daily / per-commit basis. I have at least 3 physical 
>> > > machines where I could immediately set up cron jobs to start 
>> > > adding new data, and I could attempt to backfill data as far back as 
>> > > possible.
>> > >
>> > > Personally, I would like to see this done by the end of February 
>> > > if not sooner -- if we don't have the volunteers to push the work 
>> > > to completion by then please let me know as I will rearrange my 
>> > > priorities to make sure that it happens. Does that sounds reasonable?
>> > >
>> > > Please let me know if this plan sounds reasonable:
>> > >
>> > > * Set up a hosted PostgreSQL instance, configure backups
>> > > * Propose and adopt a database schema for storing benchmark 
>> > > results
>> > > * For C++, write script (or Dockerfile) to execute all 
>> > > google-benchmarks, output results to JSON, then adapter script
>> > > (Python) to ingest into database
>> > > * For Python, similar script that invokes ASV, then inserts ASV 
>> > > results into benchmark database
>> > >
>> > > This seems to be a pre-requisite for having a front-end to 
>> > > visualize the results, but the dashboard/front end can hopefully 
>> > > be implemented in such a way that the details of the benchmark 
>> > > database are not too tightly coupled
>> > >
>> > > (Do we have any other benchmarks in the project that would need 
>> > > to be inserted initially?)
>> > >
>> > > Related work to trigger benchmarks on agents when new commits 
>> > > land in master can happen concurrently -- one task need not block 
>> > > the other
>> > >
>> > > Thanks
>> > > Wes
>> > >
>> > > On Mon, Jan 21, 2019 at 11:14 AM Wes McKinney 
>> > > <wesmck...@gmail.com>
>> wrote:
>> > > >
>> > > > Sorry, copy-paste failure:
>> > > https://issues.apache.org/jira/browse/ARROW-4313
>> > > >
>> > > > On Mon, Jan 21, 2019 at 11:14 AM Wes McKinney 
>> > > > <wesmck...@gmail.com>
>> > > wrote:
>> > > > >
>> > > > > I don't think there is one but I just created
>> > > > >
>> > >
>> https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E
>> > > > >
>> > > > > On Mon, Jan 21, 2019 at 10:35 AM Tanya Schlusser <
>> ta...@tickel.net>
>> > > wrote:
>> > > > > >
>> > > > > > Areg,
>> > > > > >
>> > > > > > If you'd like help, I volunteer! No experience benchmarking 
>> > > > > > but
>> tons
>> > > > > > experience databasing—I can mock the backend (database + 
>> > > > > > http)
>> as a
>> > > > > > starting point for discussion if this is the way people 
>> > > > > > want to
>> go.
>> > > > > >
>> > > > > > Is there a Jira ticket for this that i can jump into?
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > On Sun, Jan 20, 2019 at 3:24 PM Wes McKinney <
>> wesmck...@gmail.com>
>> > > wrote:
>> > > > > >
>> > > > > > > hi Areg,
>> > > > > > >
>> > > > > > > This sounds great -- we've discussed building a more
>> full-featured
>> > > > > > > benchmark automation system in the past but nothing has 
>> > > > > > > been
>> > > developed
>> > > > > > > yet.
>> > > > > > >
>> > > > > > > Your proposal about the details sounds OK; the single 
>> > > > > > > most
>> > > important
>> > > > > > > thing to me is that we build and maintain a very general
>> purpose
>> > > > > > > database schema for building the historical benchmark 
>> > > > > > > database
>> > > > > > >
>> > > > > > > The benchmark database should keep track of:
>> > > > > > >
>> > > > > > > * Timestamp of benchmark run
>> > > > > > > * Git commit hash of codebase
>> > > > > > > * Machine unique name (sort of the "user id")
>> > > > > > > * CPU identification for machine, and clock frequency (in
>> case of
>> > > > > > > overclocking)
>> > > > > > > * CPU cache sizes (L1/L2/L3)
>> > > > > > > * Whether or not CPU throttling is enabled (if it can be
>> easily
>> > > determined)
>> > > > > > > * RAM size
>> > > > > > > * GPU identification (if any)
>> > > > > > > * Benchmark unique name
>> > > > > > > * Programming language(s) associated with benchmark (e.g. 
>> > > > > > > a
>> > > benchmark
>> > > > > > > may involve both C++ and Python)
>> > > > > > > * Benchmark time, plus mean and standard deviation if
>> available,
>> > > else NULL
>> > > > > > >
>> > > > > > > (maybe some other things)
>> > > > > > >
>> > > > > > > I would rather not be locked into the internal database
>> schema of a
>> > > > > > > particular benchmarking tool. So people in the community 
>> > > > > > > can
>> just
>> > > run
>> > > > > > > SQL queries against the database and use the data however 
>> > > > > > > they
>> > > like.
>> > > > > > > We'll just have to be careful that people don't DROP 
>> > > > > > > TABLE or
>> > > DELETE
>> > > > > > > (but we should have daily backups so we can recover from 
>> > > > > > > such
>> > > cases)
>> > > > > > >
>> > > > > > > So while we may make use of TeamCity to schedule the runs 
>> > > > > > > on
>> the
>> > > cloud
>> > > > > > > and physical hardware, we should also provide a path for 
>> > > > > > > other
>> > > people
>> > > > > > > in the community to add data to the benchmark database on
>> their
>> > > > > > > hardware on an ad hoc basis. For example, I have several
>> machines
>> > > in
>> > > > > > > my home on all operating systems (Windows / macOS / 
>> > > > > > > Linux,
>> and soon
>> > > > > > > also ARM64) and I'd like to set up scheduled tasks / cron
>> jobs to
>> > > > > > > report in to the database at least on a daily basis.
>> > > > > > >
>> > > > > > > Ideally the benchmark database would just be a PostgreSQL
>> server
>> > > with
>> > > > > > > a schema we write down and keep backed up etc. Hosted
>> PostgreSQL is
>> > > > > > > inexpensive ($200+ per year depending on size of 
>> > > > > > > instance;
>> this
>> > > > > > > probably doesn't need to be a crazy big machine)
>> > > > > > >
>> > > > > > > I suspect there will be a manageable amount of 
>> > > > > > > development
>> > > involved to
>> > > > > > > glue each of the benchmarking frameworks together with 
>> > > > > > > the
>> > > benchmark
>> > > > > > > database. This can also handle querying the operating 
>> > > > > > > system
>> for
>> > > the
>> > > > > > > system information listed above
>> > > > > > >
>> > > > > > > Thanks
>> > > > > > > Wes
>> > > > > > >
>> > > > > > > On Fri, Jan 18, 2019 at 12:14 AM Melik-Adamyan, Areg 
>> > > > > > > <areg.melik-adam...@intel.com> wrote:
>> > > > > > > >
>> > > > > > > > Hello,
>> > > > > > > >
>> > > > > > > > I want to restart/attach to the discussions for 
>> > > > > > > > creating
>> Arrow
>> > > > > > > benchmarking dashboard. I want to propose performance
>> benchmark
>> > > run per
>> > > > > > > commit to track the changes.
>> > > > > > > > The proposal includes building infrastructure for 
>> > > > > > > > per-commit
>> > > tracking
>> > > > > > > comprising of the following parts:
>> > > > > > > > - Hosted JetBrains for OSS 
>> > > > > > > > https://teamcity.jetbrains.com/
>> as a
>> > > build
>> > > > > > > system
>> > > > > > > > - Agents running in cloud both VM/container 
>> > > > > > > > (DigitalOcean,
>> or
>> > > others)
>> > > > > > > and bare-metal (Packet.net/AWS) and on-premise(Nvidia 
>> > > > > > > boxes?)
>> > > > > > > > - JFrog artifactory storage and management for OSS 
>> > > > > > > > projects
>> > > > > > > https://jfrog.com/open-source/#artifactory2
>> > > > > > > > - Codespeed as a frontend
>> https://github.com/tobami/codespeed
>> > > > > > > >
>> > > > > > > > I am volunteering to build such system (if needed more 
>> > > > > > > > Intel
>> > > folks will
>> > > > > > > be involved) so we can start tracking performance on 
>> > > > > > > various
>> > > platforms and
>> > > > > > > understand how changes affect it.
>> > > > > > > >
>> > > > > > > > Please, let me know your thoughts!
>> > > > > > > >
>> > > > > > > > Thanks,
>> > > > > > > > -Areg.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > >
>>
>
