Re: transaction failed after retry limit

bOR_ Tue, 23 Dec 2008 01:14:00 -0800

Some more testing:

When I change the do-year function to skip virus evolution,
(doseq [a (shuffle-java [birth death infect infect])] ...), the program
speeds up quite a bit
and reaches the stuck state after 306600 years. 9 * 126288 behaviours
is quite close to 4 * 306600 behaviours.
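
To make that comparison concrete, here is a small sketch of the arithmetic. It assumes that one "behaviour" is one function applied to one individual per year, so only the number of functions in the doseq and the year count matter; the helper name is made up for illustration and is not part of eden.clj.

```clojure
;; Hypothetical helper: behaviour calls each individual has received by `year`,
;; assuming every function in the doseq runs once per individual per year.
(defn behaviours-per-individual
  [fns-per-year year]
  (* fns-per-year year))

;; Original do-year (9 behaviours), stuck at year 126288.
(def with-evolution (behaviours-per-individual 9 126288))

;; Reduced do-year (birth, death, infect, infect), stuck at year 306600.
(def without-evolution (behaviours-per-individual 4 306600))

;; Print both totals and their ratio to see how close they are.
(println with-evolution without-evolution
         (double (/ without-evolution with-evolution)))
```

The two totals come out within roughly 10% of each other, which is what suggests the failure depends on the number of transactions rather than on simulated time.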


306600   living:    918   infected:  0
pro alleles in population:  3    1.3411985456837219    (1564 259 13)
tap alleles in population:  3    1.8165208806652846    (1218 610 8)
mhc alleles in population:  1    1.0    (1836)
"Elapsed time: 2063.223 msecs"

Second run: stuck at 306596.

I'm trying to pinpoint the exact number where things go wrong. If I
remove the birth and death functions and just have 10 infect
functions in the doseq, the population size is always 1000 (birth is
the only function that has a (> (rand) 0.5) before its dosync,
but if I remove birth, I need to remove death as well).

stuck at
107360   living:    1000   infected:  1000
with a do-year doseq of 10 infects: (doseq [a (shuffle-java [infect
infect infect infect infect infect infect infect infect infect])] ...)

107360 * 10 * 1000 = 1073600000.

1073600000 happens to be almost exactly half of 2**31
(2**30 = 1073741824). So with the change in svn,
http://code.google.com/p/clojure/source/detail?r=1181,
I should be able to run simulations that are 5 billion times bigger.
That should do ;).
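
For anyone who wants to check the numbers, a minimal sketch follows. It assumes one transaction per infect call per individual, and it assumes that r1181 widens the counter from a 32-bit int to a 64-bit long; the thread implies that but does not spell it out.

```clojure
;; Transactions run by the 10-infect simulation before it got stuck:
;; years x infects per year x population size.
(def transactions (* 107360 10 1000))

(def two-pow-30 (bit-shift-left 1 30))
(def two-pow-31 (bit-shift-left 1 31))

;; How close the transaction count is to 2^30, as a fraction.
(println transactions two-pow-30
         (double (/ transactions two-pow-30)))

;; Assumption: the counter goes from a signed 32-bit to a signed 64-bit value.
;; The headroom factor is then the ratio of the two maximum values.
(def headroom (quot Long/MAX_VALUE Integer/MAX_VALUE))
(println headroom)
```

Under those assumptions the headroom factor is a bit over 4 billion, the same order of magnitude as the estimate above.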

Not sure what I could change to avoid exhausting lastPoint though.

On Dec 23, 8:46 am, bOR_ <boris.sch...@gmail.com> wrote:
> Seems fairly reproducible.
>
> With a population of 1000 people:
>
> 126288   living:    939   infected:  933
>  ave VL:  5.5225080385852285
> pro alleles in population:  6    3.9313139960273147    (621 467 404
> 360 24 2)
> tap alleles in population:  4    2.7024232960074537    (830 718 317
> 13)
> mhc alleles in population:  5    3.8634863880746124    (706 490 288
> 280 114)
> "Elapsed time: 28014.93 msecs"
>
> With a population of 10,000 people:
> 12626   living:    9068   infected:  8980
>  ave VL:  5.234532293986656
> pro alleles in population:  22    4.068465156597464    (8164 2554 1903
> 1143 1022 717 712 483 422 391 274 160 84 74 9 9 5 3 3 2 1 1)
> tap alleles in population:  21    3.2913665666051255    (9429 2163
> 1495 1461 891 596 487 443 423 361 289 49 23 8 6 4 2 2 2 1 1)
> mhc alleles in population:  17    11.51415636214343    (2636 2010 1914
> 1834 1400 1322 1260 1234 1162 998 894 506 454 292 198 20 2)
> "Elapsed time: 984358.517 msecs"
>
> So it actually happens 10 times earlier with 10,000 people than with
> 1000. Puzzling.
>
> On Dec 22, 3:33 pm, bOR_ <boris.sch...@gmail.com> wrote:
>
> > * So far it has happened in both instances where I ran the simulation for
> > more than 100k simulated years, so while this is reproducible, it does
> > take a number of hours to get there. I can see if I can get the effect
> > faster with a smaller population or something.
>
> > * When I start the simulation, the memory usage is 2.4% of the
> > available memory (16 GB), and it is happily running on 8 Intel(R)
> > Xeon(R) X5482 CPUs @ 3.20GHz (from 'top').
>
> > * inc-year:
>
> > (defn inc-year
> >   [_]
> >   (dosync (commute year inc)))
>
> > * Whole source is here:
> > http://clojure.googlegroups.com/web/eden.clj?gsc=rQ4WoRYAAAB68Q78LH5o...
>
> > * gather indeed scans all refs, but it is only called once every 1000
> > years, and right after an 'await', so I figured everything should have
> > been free by then.
>
> > On Dec 22, 2:56 pm, Rich Hickey <richhic...@gmail.com> wrote:
>
> > > On Dec 22, 7:41 am, bOR_ <boris.sch...@gmail.com> wrote:
>
> > > > Hi all,
>
> > > > Long post, but it boils down to this: I'm running into a "transaction
> > > > failed after retry limit" error after running my simulation for a couple of
> > > > hours. I chatted briefly with fyuryu in #clojure, and am now pasting
> > > > some of the hopefully relevant information into this post. I hope someone
> > > > can shed some light. The recommendation of fyuryu was to use 'await-for'
> > > > rather than await, but I'm a bit worried that that is just a way to
> > > > ignore some underlying problem.
>
> > > > I've the simulation still online and in limbo (long live emacs --
> > > > daemon), so I can answer additional questions.
>
> > > > I'll paste part of the program, the output, the agent-errors and some
> > > > additional things I tried below.
>
> > > Generally, you can get retry limit failures when a long-running
> > > transaction contends for the same refs as short-running transactions.
> > > It is hard to see what is going on with your sim without all the
> > > source.
>
> > > How many cores?
> > > What is the memory utilization?
> > > Do you have any blocking calls anywhere?
> > > What does inc-year do?
>
> > > Calls like 'gather' in a dosync can cause congestion, as I presume it
> > > does a scan of all refs?
>
> > > > I started mucking with it a bit more and found that I can't change a
> > > > single ref. Everything seems to be locked. If I make 'death' do a
> > > > println each time it is tried, I see that it is indeed trying to apply
> > > > itself to ref 1 several thousand times.
>
> > > I don't like the sound of that. If you could create a reproducible
> > > test case I'll chase it down.
>
> > > Rich