Re: short jenkins downtime -- trying to get to the bottom of the git fetch timeouts

Josh Rosen Fri, 17 Oct 2014 23:07:31 -0700

I think that the fix was applied.  Take a look at 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21874/consoleFull


Here, I see a fetch command that mentions this specific PR branch rather than 
the wildcard that we had before:

 > git fetch --tags --progress https://github.com/apache/spark.git +refs/pull/2840/*:refs/remotes/origin/pr/2840/* # timeout=15
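
For comparison with the wildcard refspec quoted further down, here is a minimal sketch of how the per-PR refspec is put together. The helper name pr_refspec is hypothetical (it is not part of Jenkins or the ghprb plugin), and the script only prints the fetch command rather than running it:

```shell
#!/usr/bin/env bash
# Hypothetical helper (illustration only): build a refspec limited to one PR,
# instead of the wildcard +refs/pull/*:refs/remotes/origin/pr/*.
pr_refspec() {
    local pr_id="$1"
    # Reject anything that is not a plain, non-empty number.
    case "$pr_id" in
        ''|*[!0-9]*) echo "invalid PR id: '$pr_id'" >&2; return 1 ;;
    esac
    printf '+refs/pull/%s/*:refs/remotes/origin/pr/%s/*\n' "$pr_id" "$pr_id"
}

# Print (do not run) the fetch command for PR 2840 from the build above.
echo git fetch --tags --progress https://github.com/apache/spark.git "$(pr_refspec 2840)"
```

The narrowed refspec asks the remote only for the refs under refs/pull/2840/, which is the same shape as the command in the console output above.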

Do you have an example of a Spark PRB build that’s still failing with the old 
fetch failure?

- Josh
On October 17, 2014 at 11:03:14 PM, Davies Liu ([email protected]) wrote:

How can we know the changes have been applied? I checked several
recent builds, and they all use the original configs.

Davies  

On Fri, Oct 17, 2014 at 6:17 PM, Josh Rosen <[email protected]> wrote:  
> FYI, I edited the Spark Pull Request Builder job to try this out. Let’s see  
> if it works (I’ll be around to revert if it doesn’t).  
>  
> On October 17, 2014 at 5:26:56 PM, Davies Liu ([email protected]) wrote:  
>  
> One finding is that all the timeouts happened with this command:
>  
> git fetch --tags --progress https://github.com/apache/spark.git  
> +refs/pull/*:refs/remotes/origin/pr/*  
>  
> I'm thinking that this may be an expensive call; we could try a
> cheaper one:
>  
> git fetch --tags --progress https://github.com/apache/spark.git  
> +refs/pull/XXX/*:refs/remotes/origin/pr/XXX/*  
>  
> where XXX is the pull request ID.
>
> The configuration supports parameters [1], so we could put this in:
>
> +refs/pull/${ghprbPullId}/*:refs/remotes/origin/pr/${ghprbPullId}/*
>  
> I have not tested this yet, could you give this a try?  
>  
> Davies  
>  
>  
> [1]  
> https://wiki.jenkins-ci.org/display/JENKINS/GitHub+pull+request+builder+plugin
>   
>  
> On Fri, Oct 17, 2014 at 5:00 PM, shane knapp <[email protected]> wrote:  
>> actually, nvm, you have to run that command from our servers to affect
>> our limit. run it all you want from your own machines! :P
>>  
>> On Fri, Oct 17, 2014 at 4:59 PM, shane knapp <[email protected]> wrote:  
>>  
>>> yep, and i will tell you guys ONLY if you promise to NOT try this  
>>> yourselves... checking the rate limit also counts as a hit and increments  
>>> our numbers:  
>>>  
>>> # curl -i https://api.github.com/users/whatever 2> /dev/null | egrep ^X-Rate
>>> X-RateLimit-Limit: 60  
>>> X-RateLimit-Remaining: 51  
>>> X-RateLimit-Reset: 1413590269  
>>>  
>>> (yes, that is the exact url that they recommended on the github site lol)  
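
A minimal sketch of how those headers could be summarized, assuming they have already been captured as text (for example from the curl call above). The function name rate_limit_summary is hypothetical, and the sample input is the three header lines quoted above, so nothing here contacts GitHub:

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize X-RateLimit-* headers read from stdin.
rate_limit_summary() {
    awk -F': *' '
        { sub(/\r$/, "") }                               # strip CR from HTTP line endings
        tolower($1) == "x-ratelimit-limit"     { limit = $2 }
        tolower($1) == "x-ratelimit-remaining" { remaining = $2 }
        END {
            if (limit == "" || remaining == "") {
                print "no rate limit headers found"
                exit 1
            }
            printf "used %d of %d requests, %d remaining\n", limit - remaining, limit, remaining
        }'
}

# Sample input: the header values quoted in this thread.
printf 'X-RateLimit-Limit: 60\nX-RateLimit-Remaining: 51\nX-RateLimit-Reset: 1413590269\n' \
    | rate_limit_summary
```

With the figures quoted further down (10-20 builds per hour, 2-4 hits per build), the load would be roughly 20-80 requests per hour, so a limit of 60 could plausibly be exceeded in a busy hour, though the ~7 builds in the hour of the 10:57am failure would not get there.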
>>>  
>>> so, earlier today, we had a spark build fail w/a git timeout at 10:57am,  
>>> but there were only ~7 builds run that hour, so that points to us NOT  
>>> hitting the rate limit... at least for this fail. whee!  
>>>  
>>> is it beer-thirty yet?  
>>>  
>>> shane  
>>>  
>>>  
>>>  
>>> On Fri, Oct 17, 2014 at 4:52 PM, Nicholas Chammas <  
>>> [email protected]> wrote:  
>>>  
>>>> Wow, thanks for this deep dive Shane. Is there a way to check if we are  
>>>> getting hit by rate limiting directly, or do we need to contact GitHub  
>>>> for that?  
>>>>  
>>>> On Friday, October 17, 2014, shane knapp <[email protected]> wrote:
>>>>
>>>>> quick update:
>>>>>  
>>>>> here are some stats i scraped over the past week of ALL pull request
>>>>> builder projects and timeout failures. due to the large number of spark
>>>>> ghprb jobs, i don't have great records earlier than oct 7th. the data is
>>>>> current up until ~230pm today:
>>>>>  
>>>>> spark and new spark ghprb total builds vs git fetch timeouts:  
>>>>> $ for x in 10-{09..17}; do
>>>>>     passed=$(grep $x SORTED.passed | grep -i spark | wc -l);
>>>>>     failed=$(grep $x SORTED | grep -i spark | wc -l);
>>>>>     let total=passed+failed;
>>>>>     fail_percent=$(echo "scale=2; $failed/$total" | bc | sed "s/^\.//g");
>>>>>     line="$x -- total builds: $total\tp/f: $passed/$failed\tfail%: $fail_percent%";
>>>>>     echo -e $line;
>>>>>   done
>>>>> 10-09 -- total builds: 140 p/f: 92/48 fail%: 34%  
>>>>> 10-10 -- total builds: 65 p/f: 59/6 fail%: 09%  
>>>>> 10-11 -- total builds: 29 p/f: 29/0 fail%: 0%  
>>>>> 10-12 -- total builds: 24 p/f: 21/3 fail%: 12%  
>>>>> 10-13 -- total builds: 39 p/f: 35/4 fail%: 10%  
>>>>> 10-14 -- total builds: 7 p/f: 5/2 fail%: 28%  
>>>>> 10-15 -- total builds: 37 p/f: 34/3 fail%: 08%  
>>>>> 10-16 -- total builds: 71 p/f: 59/12 fail%: 16%  
>>>>> 10-17 -- total builds: 26 p/f: 20/6 fail%: 23%  
>>>>>  
>>>>> all other ghprb builds vs git fetch timeouts:  
>>>>> $ for x in 10-{09..17}; do
>>>>>     passed=$(grep $x SORTED.passed | grep -vi spark | wc -l);
>>>>>     failed=$(grep $x SORTED | grep -vi spark | wc -l);
>>>>>     let total=passed+failed;
>>>>>     fail_percent=$(echo "scale=2; $failed/$total" | bc | sed "s/^\.//g");
>>>>>     line="$x -- total builds: $total\tp/f: $passed/$failed\tfail%: $fail_percent%";
>>>>>     echo -e $line;
>>>>>   done
>>>>> 10-09 -- total builds: 16 p/f: 16/0 fail%: 0%  
>>>>> 10-10 -- total builds: 46 p/f: 40/6 fail%: 13%  
>>>>> 10-11 -- total builds: 4 p/f: 4/0 fail%: 0%  
>>>>> 10-12 -- total builds: 2 p/f: 2/0 fail%: 0%  
>>>>> 10-13 -- total builds: 2 p/f: 2/0 fail%: 0%  
>>>>> 10-14 -- total builds: 10 p/f: 10/0 fail%: 0%  
>>>>> 10-15 -- total builds: 5 p/f: 5/0 fail%: 0%  
>>>>> 10-16 -- total builds: 5 p/f: 5/0 fail%: 0%  
>>>>> 10-17 -- total builds: 0 p/f: 0/0 fail%: 0%  
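
The percentage step in the loops above can also be done with shell integer arithmetic. This is a minimal sketch, not the command that produced the tables: the function name fail_percent is hypothetical, and it adds a guard for days with no builds, where $failed/$total would otherwise ask bc to divide by zero:

```shell
#!/usr/bin/env bash
# Hypothetical helper: failure percentage as an integer (rounded down),
# or "n/a" when there were no builds that day.
fail_percent() {
    local passed="$1" failed="$2"
    local total=$((passed + failed))
    if [ "$total" -eq 0 ]; then
        echo "n/a"
        return 0
    fi
    echo "$((100 * failed / total))%"
}

fail_percent 92 48   # counts from the 10-09 spark row
fail_percent 0 0     # counts from the 10-17 non-spark row
```

Rounding down matches what bc does with scale=2, so 92 passed and 48 failed gives 34%, as in the first table.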
>>>>>  
>>>>> note: the 15th was the day i rolled back to the earlier version of the
>>>>> git plugin. it doesn't seem to have helped much, so i'll probably bring
>>>>> us back up to the latest version soon.
>>>>> also note: rocking some floating point math on the CLI! ;)  
>>>>>  
>>>>> i also compared the distribution of git timeout failures vs time of day,
>>>>> and there appears to be no correlation. the failures are pretty evenly
>>>>> distributed over each hour of the day.
>>>>>  
>>>>> we could be hitting the rate limit due to the ghprb hitting github a
>>>>> couple of times for each build, but we're averaging ~10-20 builds per
>>>>> hour (a build hits github 2-4 times, from what i can tell). i'll have to
>>>>> look more in to this on monday, but suffice to say we may need to move
>>>>> from unauthorized https fetches to authorized requests. this means
>>>>> retrofitting all of our jobs. yay! fun! :)
>>>>>  
>>>>> another option is to have local mirrors of all of the repos. the
>>>>> problem w/this is that there might be a window where changes haven't
>>>>> made it to the local mirror and tests run against it. more fun stuff to
>>>>> think about...
>>>>>  
>>>>> now that i have some stats, and a list of all of the times/dates of the
>>>>> failures, i will be drafting my email to github and firing that off
>>>>> later today or first thing monday.
>>>>>  
>>>>> have a great weekend everyone!  
>>>>>  
>>>>> shane, who spent way too much time on the CLI and is ready for some beer.
>>>>>  
>>>>> On Thu, Oct 16, 2014 at 1:04 PM, Nicholas Chammas <  
>>>>> [email protected]> wrote:  
>>>>>  
>>>>>> On Thu, Oct 16, 2014 at 3:55 PM, shane knapp <[email protected]>  
>>>>>> wrote:  
>>>>>>  
>>>>>>> i really, truly hate non-deterministic failures.  
>>>>>>  
>>>>>>  
>>>>>> Amen bruddah.  
>>>>>>  
>>>>>  
>>>>>  
>>>  
>  
