Hi Elena,

It's on a new branch in the same repository; you can see it here:
https://github.com/pabloem/Kokiri/tree/file_correlation
I changed the whole simulator.py file. I made sure to comment the header of
every function, but there are not many in-line comments. Let me know if you
need more clarification.

Regards
Pablo

On Sun, Jun 29, 2014 at 6:57 PM, Elena Stepanova <ele...@montyprogram.com> wrote:

> Hi Pablo,
>
> On 29.06.2014 6:25, Pablo Estrada wrote:
>
>> Hi Elena and all,
>> I guess I should admit that my excitement was a bit too much; but I'm also
>> definitely not 'jumping' into this strategy. As I said, I am trying to use
>> the lessons learned from all the experiments to make the best predictions.
>>
>> That being said, a strong point of the new strategy is that rather than
>> purely using failure rate to predict failure rate, it uses more data to
>> make its predictions - and its predictions are more consistent. On the
>> 3k-training and 2k-predicting simulations its advantage is not so apparent
>> (they fare similarly, with the 'standard' strategy being the best one),
>> but it becomes more evident with longer prediction periods.
>>
>> I ran tests with 20k training rounds and 20k prediction rounds, and the
>> new strategy fared a lot better. I have attached charts comparing the two.
>> We can observe that with a running set of 500, the original algorithm had
>> a very nice recall of almost 95% in shorter tests, but it falls to less
>> than 50% with longer testing (and it must be a lot lower if we average
>> only the last couple of thousand runs, rather than all 20k simulation runs
>> together).
>
> Okay, thanks, it looks much more convincing.
> Indeed, as we already discussed before, the problem with the previous
> strategy or implementation is that the recall deteriorates quickly after
> you stop using complete results as the learning material, and start taking
> into account only simulated results (which is what would be happening in
> real life). If the new method helps to solve this problem, it's worth
> looking into.
>
> You mentioned before that you pushed the new code, where is it located?
> I'd like to look at it before making any further conclusions.
>
> Regards,
> Elena
>
>> Since the goal of the project is to provide consistent long-term test
>> optimization, we would want to take all we can learn from the new strategy
>> and improve the consistency of the recall over long-term simulation.
>>
>> Nevertheless, I agree that there are important lessons in the original
>> strategy, particularly that >90% recall in shorter prediction periods.
>> That's why I'm still tuning and testing.
>>
>> Again, all advice and observations are welcome.
>> Hope everyone is having a nice weekend.
>> Pablo
>>
>> On Sun, Jun 29, 2014 at 12:53 AM, Elena Stepanova <ele...@montyprogram.com>
>> wrote:
>>
>>> Hi Pablo,
>>>
>>> Could you please explain why you consider the new results better? I don't
>>> see any obvious improvement.
>>>
>>> As I understand from the defaults, previously you were running tests with
>>> 2000 training rounds and 3000 simulation rounds, and you already had
>>> ~70% on 300 runs and ~80% on 500 runs, see your email of June 19,
>>> no_options_simulation.jpg.
>>>
>>> Now you have switched the limits: you are running with 3000 training and
>>> 2000 simulation rounds. It makes a big difference; if you re-run tests
>>> with the old algorithm with the new limits, you'll get +10% easily, so RS
>>> 300 will be around the same 80%, and RS 500 should be even higher,
>>> pushing 90%, while now you have barely 85%.
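For reference, "recall" in these simulations is the fraction of real failures
that the limited running set actually catches. Below is a minimal sketch of
that computation, with hypothetical function and variable names (the actual
simulator.py differs):

    # Sketch only: measures recall of a prioritization strategy over a
    # sequence of test_runs. Names here are illustrative, not the real code.
    def simulation_recall(test_runs, choose_running_set, running_set_size):
        """test_runs: iterable of (candidate_tests, failed_tests) pairs.
        choose_running_set: strategy callback returning the chosen subset."""
        caught = 0
        total = 0
        for candidate_tests, failed_tests in test_runs:
            running_set = choose_running_set(candidate_tests, running_set_size)
            caught += len(set(failed_tests) & set(running_set))
            total += len(failed_tests)
        return float(caught) / total if total else 1.0

Averaging this only over the last few thousand runs, as suggested above, would
show how much the recall has degraded by the end of a long simulation.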
>>>
>>> Before jumping onto the new algorithm, please provide a comparison of
>>> the old and new approaches with equal pre-conditions and parameters.
>>>
>>> Thanks,
>>> Elena
>>>
>>> On 28.06.2014 6:44, Pablo Estrada wrote:
>>>
>>>> Hi all,
>>>> well, as I said, I have incorporated a very simple weighted failure rate
>>>> into the strategy, and I have found quite encouraging results. The
>>>> recall looks better than in earlier tests. I am attaching two charts
>>>> with data compiled from runs with 3000 training rounds and 2000
>>>> simulation rounds (5000 test runs analyzed in total):
>>>>
>>>>    - The recall by running set size (as shown, it reaches 80% with 300
>>>>      tests)
>>>>    - The index of failure in the priority queue (running set: 500,
>>>>      training 3000, simulation 2000)
>>>>
>>>> It is interesting to look at chart number 2:
>>>> The first 10 or so places have a very high count of found failures.
>>>> These most likely come from repeated failures (tests that failed in the
>>>> previous run and were caught in the next one). The next ones have a skew
>>>> to the right, and these come from the file-change model.
>>>>
>>>> I am glad of these new results : ). I have a couple of new ideas to try
>>>> to push the recall a bit further up, but I wanted to show the progress
>>>> first. Also, I will do a thorough code review before any new changes, to
>>>> make sure that the results are valid. Interestingly enough, in this new
>>>> strategy the code is simpler.
>>>> Also, I will run a test with a longer-term period (20,000 training,
>>>> 20,000 simulation), to see if the recall degrades as time passes and we
>>>> miss more failures.
>>>>
>>>> Regards!
>>>> Pablo
>>>>
>>>> On Fri, Jun 27, 2014 at 4:48 PM, Pablo Estrada <polecito...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello everyone,
>>>>>
>>>>> I spent the last couple of days working on a new strategy to calculate
>>>>> the relevance of a test. The results are not sufficient by themselves,
>>>>> but I believe they point to an interesting direction. This strategy
>>>>> uses the rate of co-occurrence of events to estimate the relevance of a
>>>>> test, and the events that it uses are the following:
>>>>>
>>>>>    - File edits since the last run
>>>>>    - Test failure in the last run
>>>>>
>>>>> The strategy has two stages:
>>>>>
>>>>>    1. Training stage
>>>>>    2. Executing stage
>>>>>
>>>>> In the training stage, it goes through the available data and does the
>>>>> following:
>>>>>
>>>>>    - If test A failed:
>>>>>       - It counts and stores all the files that were edited since the
>>>>>         last test_run (the last test_run depends on BRANCH, PLATFORM,
>>>>>         and other factors)
>>>>>       - If test A also failed in the previous test run, it counts that
>>>>>         as well
>>>>>
>>>>> In the executing stage, the training algorithm is still applied, but
>>>>> the decision of whether a test runs is based on its relevance. The
>>>>> relevance is calculated as the sum of the following:
>>>>>
>>>>>    - The percentage of times the test has failed in two subsequent
>>>>>      test_runs, multiplied by whether the test failed in the previous
>>>>>      run (if the test didn't fail in the previous run, this quantity
>>>>>      is 0)
>>>>>    - For each file that was edited since the last test_run, the
>>>>>      percentage of times that the test has failed after this file was
>>>>>      edited
>>>>>
>>>>> (The explanation is a bit clumsy; I can clear it up if you wish.)
>>>>> The results have been neither too good nor too bad. With a running set
>>>>> of 200 tests, a training phase of 3000 test runs, and an executing
>>>>> stage of 2000 test runs, I have achieved a recall of 0.50. It's not too
>>>>> great, nor too bad.
>>>>>
>>>>> Nonetheless, while running tests, I found something interesting:
>>>>>
>>>>>    - I removed the first factor of the relevance. I decided not to care
>>>>>      about whether a test failed in the previous test run and only used
>>>>>      the file-change factor. Naturally, the recall decreased, from 0.50
>>>>>      to 0.39 (the decrease was not too big)... and the distribution of
>>>>>      failed tests in the priority queue had a good skew towards the
>>>>>      front of the queue (so it seems that the files help somewhat to
>>>>>      indicate the likelihood of a failure). I attached this chart.
>>>>>
>>>>> An interesting problem that I encountered was that about 50% of the
>>>>> test_runs have no file changes and no test failures, and so the
>>>>> relevance of all tests is zero. Here is where the original strategy (a
>>>>> weighted average of failures) could be useful, so that even if we don't
>>>>> have any information to guess which tests to run, we just go ahead and
>>>>> run the ones that have failed the most, recently.
>>>>>
>>>>> I will work on mixing the two strategies a bit in the next few days,
>>>>> and see what comes of that.
>>>>>
>>>>> By the way, I pushed the code to github. The code is completely
>>>>> different, so it may be better to wait until I have new results soon.
>>>>>
>>>>> Regards!
>>>>> Pablo
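To make the two stages described above concrete, here is a minimal sketch of
the relevance calculation as described in the email. This is not the actual
simulator.py code: the class, method, and counter names are hypothetical, and
the normalizing denominators are a guess (the real code may normalize the
percentages differently).

    from collections import defaultdict

    class FileCorrelationSketch(object):
        """Illustrative sketch of the described co-occurrence strategy."""

        def __init__(self):
            self.failures = defaultdict(int)         # test -> observed failures
            self.repeat_failures = defaultdict(int)  # test -> failures right after a failure
            self.file_cofailures = defaultdict(lambda: defaultdict(int))  # test -> file -> count

        def train(self, test, failed_now, failed_previous_run, edited_files):
            """Training stage: update co-occurrence counts for one test_run."""
            if not failed_now:
                return
            self.failures[test] += 1
            if failed_previous_run:
                self.repeat_failures[test] += 1
            for f in edited_files:
                self.file_cofailures[test][f] += 1

        def relevance(self, test, failed_previous_run, edited_files):
            """Executing stage: repeated-failure rate (only if the test failed
            in the previous run) plus per-file co-failure rates for the files
            edited since the last test_run."""
            total = self.failures[test]
            if total == 0:
                return 0.0
            score = 0.0
            if failed_previous_run:
                score += float(self.repeat_failures[test]) / total
            for f in edited_files:
                # denominator is an assumption; see the caveat above
                score += float(self.file_cofailures[test][f]) / total
            return score

As the email notes, when nothing failed in the previous run and no correlated
files were edited, this score is zero for every test; that is the case where a
fallback to the original weighted failure rate ordering would be useful.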