> On May 28, 2015, at 4:37 AM, Michael McCandless <[email protected]> wrote:
>
> Do you have any sense of whether the higher IOPS/throughput of NVMe
> SSDs vs "mere" SATA III matters for time to run Lucene's/Solr's tests?
I paid extra for the NVMe SSD because I thought it could matter, given its multi-channel architecture and bus speed, but I haven't done any testing. I have a 5-year-old SATA III SSD in another system that I could transplant and do comparisons with.

I ran the trunk all-Lucene-Solr-tests job on an HDD[1] I have on the new system - here's the third run of the Jenkins job: <http://jenkins.sarowe.net/job/Lucene-Solr-tests-trunk-HDD>. Holy crap, it only takes 11-ish minutes!!!: <http://jenkins.sarowe.net/job/Lucene-Solr-tests-trunk-HDD/buildTimeTrend>. I wonder how much the system drive being an SSD helps here?

> Also, how did you parallelize the running of the tests? Just the
> normal top-level "ant test"? Or one "ant test" under lucene and one
> under solr, running at once?

Four Ant invocations, each using a number of JVMs (roughly) tuned to complete at close to the same time:

- Lucene core and Lucene test-framework (2 JVMs)
- All other Lucene modules (9 JVMs)
- Solr core (12 JVMs)
- Solrj and Solr contribs (3 JVMs)

Here's the script I use on Jenkins: <http://jenkins.sarowe.net/configfiles/show?id=org.jenkinsci.plugins.managedscripts.ScriptConfig1431809910785>.

I had to provide a separate Ivy resolution cache for each JVM (hacked in for now); otherwise they would sometimes overwrite each other's stuff and get confused.

I also "tail -f" all 4 JVMs' output and munge the "==> source-file <==" header to instead show the current module being tested. I strip out blank lines to compress the output and more clearly show module output transitions. You can see an example of that here: <http://jenkins.sarowe.net/job/Lucene-Solr-tests-trunk/217/consoleText>

I'm seeing lots of builds that have stalled tests using this script, e.g.
<http://jenkins.sarowe.net/job/Lucene-Solr-tests-trunk/185/consoleText>, and no apparent pattern or set of tests that stall, so I'm thinking of switching to separate Jenkins jobs for each of the parallel Ant jobs, and increasing the number of Jenkins executors from one to four.

Steve

[1] HDD: <http://www.newegg.com/Product/Product.aspx?Item=N82E16822178338>

> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Wed, May 27, 2015 at 9:29 AM, Steve Rowe <[email protected]> wrote:
>> SSD: http://www.newegg.com/Product/Product.aspx?Item=N82E16820167299
>> CPU: http://www.newegg.com/Product/Product.aspx?Item=N82E16819117404
>> M/B: http://www.newegg.com/Product/Product.aspx?Item=N82E16813132518
>> RAM: http://www.newegg.com/Product/Product.aspx?Item=N82E16820231820
>>
>> The mem wasn't listed as supported by the mobo manufacturer, and it isn't
>> detected at its full speed (2800MHz), so it's currently running at 2400MHz
>> ("overclocked" from the detected 2100MHz, I think). The CPU isn't
>> overclocked from its stock 3GHz, but I got a liquid cooler, thinking I'd
>> experiment (I haven't much yet). Even without overclocking, the fans spin
>> faster when all the cores are busy, and it's quite the little space heater.
>>
>> I installed Debian 8, but had to fix the installer in a couple of places
>> because it didn't know about the new NVMe device naming scheme:
>>
>> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=785147
>> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=785149
>>
>> I also upgraded to the 4.0 Linux kernel, since Debian 8 ships with the 3.16
>> kernel, and 3.19 contains a bunch of NVMe improvements.
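For the curious, the shape of that four-way split is roughly the sketch below. The module directories, log/cache locations, and the `ANT` override are illustrative guesses, not the actual managed script (that's the Jenkins link above); only the JVM counts come from the description earlier in the thread. `ivy.default.ivy.user.dir` is one way to give each invocation its own Ivy cache:

```shell
#!/bin/bash
# Rough sketch of the four parallel Ant invocations described above.
# Module directories and cache/log paths are hypothetical; the real
# managed script lives in the Jenkins config linked earlier.
ANT=${ANT:-ant}
CHECKOUT=${CHECKOUT:-$PWD}

# Run one suite: give it its own Ivy cache so parallel resolves don't
# clobber each other, and log to its own file for later "tail -f".
run_suite() {
  local name=$1 jvms=$2 dir=$3
  ( cd "$CHECKOUT/$dir" &&
    $ANT test \
      -Dtests.jvms="$jvms" \
      -Divy.default.ivy.user.dir="/tmp/ivy-cache-$name" \
      > "/tmp/$name.log" 2>&1 )
}

main() {
  # JVM counts (roughly) tuned so all four finish at about the same time.
  run_suite lucene-core    2  lucene/core &
  run_suite lucene-modules 9  lucene      &
  run_suite solr-core      12 solr/core   &
  run_suite solrj-contribs 3  solr/solrj  &
  wait  # block until all four Ant invocations complete
}
```

Calling `main` launches all four in parallel; the per-suite log files are what the output munging below the fold works on.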
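The "tail -f" munging could look something like the sketch below; the log-file naming is a guess, not the actual Jenkins setup. It rewrites tail's "==> file <==" headers into a module marker and drops blank lines so module transitions stand out:

```shell
#!/bin/bash
# Sketch of the log munging described above: turn tail's
# "==> /path/to/module.log <==" headers into "[module]" markers and
# strip blank lines. The /tmp log layout is an assumption.
munge_logs() {
  sed -e 's|^==> .*/\([^/]*\)\.log <==$|[\1]|' -e '/^$/d'
}
# Typical use (not run here):
#   tail -f /tmp/lucene-core.log /tmp/solr-core.log | munge_logs
```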
>>
>> And I turned "swappiness" down to zero (thanks to Mike:
>> <http://blog.mikemccandless.com/2011/04/just-say-no-to-swapping.html>) after
>> seeing a bunch of test stalls while running the Lucene monster tests with 4
>> JVMs (that takes about 2 hours, but I can get it down to 90 minutes or so by
>> splitting the two tests in Test2BSortedDocValues out into their own suites -
>> I'll make an issue).
>>
>> Steve
>>
>>> On May 27, 2015, at 5:08 AM, Anshum Gupta <[email protected]> wrote:
>>>
>>> 8-real-core Haswell-E with 64GB DDR4 memory and an NVMe 750-series SSD.
>>>
>>> Can run *all* of the Lucene and Solr tests in 10 minutes by running
>>> multiple ant jobs in parallel!
>>>
>>> On Wed, May 27, 2015 at 1:17 AM, Ramkumar R. Aiyengar
>>> <[email protected]> wrote:
>>> Curious.. sarowe, what's the spec?
>>>
>>> On 26 May 2015 20:41, "Anshum Gupta" <[email protected]> wrote:
>>> The last bunch of fixes seems to have fixed this. The tests passed on a
>>> Jenkins instance that had failing tests earlier.
>>> Thanks, Steve Rowe, for lending the super-powerful machine that runs the
>>> entire suite in 8 min!
>>>
>>> I've seen about 20 runs on that box, and also runs on Policeman Jenkins,
>>> with no issues related to this test since the last commit, so I've
>>> back-ported this to 5x as well.
>>>
>>> On Tue, May 26, 2015 at 9:19 AM, Chris Hostetter <[email protected]>
>>> wrote:
>>>
>>> : Right, as I said, we weren't hitting this issue due to our Kerberos conf.
>>> : file. i.e. the only thing that was different on our machines as compared
>>> : to everyone else, and moving that conf file got the tests to fail for me.
>>> : It now fails fairly consistently without the patch (from SOLR-7468) and
>>> : has been looking good with the patch.
>>>
>>> That smells like the kind of thing that should have some "assume sanity
>>> checks" built into it.
>>>
>>> Given:
>>> * the test sets up a special env before running the test
>>> * the test assumes the specific env will exist
>>> * the user's machine may already have env properties set before running ant
>>>   that affect the expected special env
>>>
>>> Therefore: before the test does the setup of the special env, it should
>>> sanity-check that the user's basic env doesn't have any properties that
>>> violate the "basic" situation.
>>>
>>> So, a hypothetical example based on what little I understand of the
>>> authentication stuff: if the purpose of a test is to prove that some
>>> requests work with (or w/o) kerberos authentication, then before doing any
>>> setup of a "mock" kerberos env (or before setting up a "mock" situation
>>> where no authentication is required), the test should verify that there
>>> isn't already an existing kerberos env (or, if possible, "unset" whatever
>>> env/properties define that env).
>>>
>>> A trivial example of a similar situation is the script engine tests --
>>> TestBadConfigs.testBogusScriptEngine: the purpose of the test is to
>>> ensure that a solrconfig.xml file that refers to a script engine (by
>>> name) which is not installed on the machine will produce an expected error
>>> at Solr init.
>>> Before doing the Solr init, we have a whitebox assume that
>>> asks the JVM directly if a script engine with that name already exists.
>>>
>>>
>>> -Hoss
>>> http://www.lucidworks.com/
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>>
>>> --
>>> Anshum Gupta
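Hoss's "assume sanity check" idea, expressed as a pre-flight sketch one might run before kicking off the test jobs: fail fast if the user's environment already defines a Kerberos setup that would violate the tests' assumptions. The variables and file paths checked here are illustrative examples (only `KRB5_CONFIG` is a standard Kerberos variable; `KRB5_DEFAULT_CONF` is a made-up override for this sketch), not what Solr's tests actually inspect:

```shell
#!/bin/bash
# Illustrative pre-flight check in the spirit of the "assume sanity
# checks" above: before a test sets up a mock Kerberos env, verify the
# user's environment doesn't already define one. Variables and paths
# here are examples, not what the Solr test framework inspects.
env_sanity_check() {
  local problems=0
  if [ -n "${KRB5_CONFIG:-}" ]; then
    echo "KRB5_CONFIG is set ($KRB5_CONFIG); tests assume no Kerberos env" >&2
    problems=1
  fi
  if [ -f "${KRB5_DEFAULT_CONF:-/etc/krb5.conf}" ]; then
    echo "found a Kerberos conf file; tests may behave differently" >&2
    problems=1
  fi
  return $problems
}
```

In the real tests this check would live in the test framework as an assume (so the suite is skipped or the env is unset, rather than failing mysteriously later).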
