Yes. Testing all the timezones is not needed.

Xiao

On Mon, Oct 8, 2018 at 8:36 AM Maxim Gekk <maxim.g...@databricks.com> wrote:

> Hi All,
>
> I believe we should also take into account what we test. For example, I
> don't think it makes sense to check all timezones for JSON/CSV
> functions/datasources, because those timezones are just passed to external
> libraries. So the same code is involved in testing each of the 650
> timezones; we basically just spend time and resources on testing the
> external libraries.
>
> I mean the PRs: https://github.com/apache/spark/pull/22657 and
> https://github.com/apache/spark/pull/22379#discussion_r223039662
>
> Maxim Gekk
>
> Technical Solutions Lead
>
> Databricks B. V.  <http://databricks.com/>
>
>
> On Mon, Oct 8, 2018 at 4:49 PM Sean Owen <sro...@gmail.com> wrote:
>
>> If the problem is simply reducing the wall-clock time of tests, then
>> even before we get to this question, I'm advocating:
>>
>> 1) try simple parallelization of tests within the suite. In this
>> instance there's no reason not to test these in parallel and get an 8x
>> or 16x speedup from cores. This assumes, I believe correctly, that the
>> machines aren't generally near 100% utilization.
>> 2) explicitly choose a smaller, more directed set of cases to test
>>
>> Randomly choosing test cases with a fixed seed is basically 2), except
>> the cases aren't chosen for any particular reason. You can vary the
>> seed, but as a rule the same random subset of tests is always chosen.
>> That could be fine if there's no reason at all to prefer some cases over
>> others, but I am guessing any wild guess at the most important subset of
>> cases to test is better than random.
>>
>> I'm trying 1) right now instead in these several cases.
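Sean's option 1) could be sketched as follows (a minimal Python illustration, not actual Spark test code; `check_timezone` is a hypothetical stand-in for the real per-timezone assertion logic):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-timezone check; stands in for the real assertion logic
# that would otherwise run sequentially for every timezone.
def check_timezone(tz):
    return len(tz) > 0  # placeholder predicate

TIMEZONES = ["UTC", "America/Los_Angeles", "Asia/Shanghai", "Europe/Berlin"]

# Run the per-timezone cases in parallel instead of one after another;
# when the machine is not already saturated, this can approach an
# N-worker speedup without dropping any cases.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(check_timezone, TIMEZONES))

assert all(results)
```

The same idea applies in Scala test suites via parallel collections or a suite-level parallel execution setting; the point is that all 650 cases still run, just concurrently.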
>> On Mon, Oct 8, 2018 at 9:24 AM Xiao Li <lix...@databricks.com> wrote:
>> >
>> > For this specific case, I do not think we should test all the
>> timezones. If this were fast, I would be fine leaving it unchanged.
>> However, it is very slow, so I would even prefer reducing the tested
>> timezones to a smaller number, or just hardcoding some specific time zones.
>> >
>> > In general, I like Reynold’s idea of including the seed value and
>> adding the seed to the test case name. This can help us reproduce
>> failures.
>> >
>> > Xiao
>> >
>> > On Mon, Oct 8, 2018 at 7:08 AM Reynold Xin <r...@databricks.com> wrote:
>> >>
>> >> I'm personally not a big fan of doing it that way in the PR. It is
>> perfectly fine to employ randomized tests, and in this case it might even
>> be fine to just pick a couple of different timezones, as was done in the
>> PR, but we should:
>> >>
>> >> 1. Document in the code comment why we did it that way.
>> >>
>> >> 2. Use a seed and log the seed, so any test failure can be reproduced
>> deterministically. For this one, it'd be better to pick the seed from an
>> environment variable; if the variable is not set, fall back to a random
>> seed.
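The seed handling described here could be sketched like this (a Python illustration; `SPARK_TEST_SEED` is a hypothetical variable name, and the real Spark suites are Scala):

```python
import os
import random

# Pick the seed from an environment variable when set; otherwise choose one
# at random. Log it either way, so any failure can be reproduced
# deterministically by re-running with the same seed.
seed = int(os.environ.get("SPARK_TEST_SEED", random.randrange(2**31)))
print(f"Using random seed: {seed}")  # surfaces the seed in the test log

rng = random.Random(seed)
ALL_TIMEZONES = ["UTC", "PST", "CET", "JST", "EST", "IST"]

# A reproducible random subset of cases; embedding the seed in the test
# name (as suggested in this thread) makes the failing run easy to re-run.
subset = rng.sample(ALL_TIMEZONES, 3)
test_name = f"timezone subset (seed={seed})"
```

Re-running with the same `SPARK_TEST_SEED` yields the same subset, which is what makes a randomized failure debuggable.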
>> >>
>> >>
>> >>
>> >> On Mon, Oct 8, 2018 at 3:05 PM Sean Owen <sro...@gmail.com> wrote:
>> >>>
>> >>> Recently, I've seen 3 pull requests that try to speed up a test suite
>> >>> that tests a bunch of cases by randomly choosing different subsets of
>> >>> cases to test on each Jenkins run.
>> >>>
>> >>> There's disagreement about whether this is a good approach to improving
>> >>> test runtime. Here's a discussion on one that was committed:
>> >>> https://github.com/apache/spark/pull/22631/files#r223190476
>> >>>
>> >>> I'm flagging it for more input.
>> >>>
>> >>> ---------------------------------------------------------------------
>> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
