First, thanks for looking into this.  Rather than answer inline, I will 
just comment on a few things:

* Unfortunately, the tests use real data because they try to pick up 
problems with the code, not test the performance.  However, some of the 
tests do run against only data from the repository.  I will not make my 
personal training data available publicly, but I gave Matthew Flatt 
access to it and he can run those tests.

* There are two additional tests that can be run directly.  They are on the 
"ah/perf-test" branch and are named "test/t-db-insert-activity.rkt" and 
"test/t-df-mean-max.rkt" -- the second one sets up a minimal `df-mean-max` 
test with some real data.  Interestingly, `df-mean-max` runs faster in 
RacketCS in that test, which is not what I found when I ran the other tests 
-- this needs more investigation.

* I was aware that using `in-list`, `in-vector`, etc. has better 
performance characteristics, but I prefer not to use them unless necessary: 
it is much nicer when code works with a variety of container formats as 
much as possible.  I have occasionally changed data representation, and 
Racket will not pick up that a function which uses `in-list` is now being 
passed a vector; this will result in a run time error instead.  In fact, 
most of the tests just run the code in the hope of triggering contract 
violations.
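
To illustrate the trade-off, here is a minimal sketch.  The names 
`sum-generic` and `sum-list-only` are made up for this example and are not 
part of ActivityLog2:

```racket
#lang racket/base

;; Hypothetical helpers, for illustration only.

;; Generic iteration: accepts any sequence (list, vector, ...).
(define (sum-generic xs)
  (for/sum ([x xs]) x))

;; Specialized iteration: faster, but only accepts lists.
(define (sum-list-only xs)
  (for/sum ([x (in-list xs)]) x))

(sum-generic '(1 2 3))        ; works on a list
(sum-generic (vector 1 2 3))  ; also works on a vector
(sum-list-only '(1 2 3))      ; works on a list

;; Passing a vector is only detected when the loop actually runs:
(with-handlers ([exn:fail:contract?
                 (lambda (e) (displayln (exn-message e)))])
  (sum-list-only (vector 1 2 3)))
```

The last call compiles without complaint and only fails with a contract 
violation at run time, which is the situation described above.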

Which brings me to the last point:  the BAVG set that `get-mean-max-bounds` 
receives will have about 95 items in it; it will be this list [1].  If I 
used a data set of 10 million items in BAVG, I would have bigger 
problems: the `spline` function, which also uses this BAVG set, would run 
out of memory, as it would try to construct a matrix of 10^14 items, which 
would take about 720 terabytes of memory (assuming floating point numbers).
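
As a rough check of that estimate, the arithmetic can be sketched as below. 
It assumes a dense square matrix with one entry per pair of BAVG items and 
8 bytes per floating point number; both are assumptions for this 
illustration, not measurements of what `spline` actually allocates:

```racket
#lang racket/base

;; Assumptions: dense n x n matrix, one 8-byte flonum per entry.
(define n 10000000)                     ; 10 million items in BAVG
(define entries (* n n))                ; 10^14 matrix entries
(define bytes (* entries 8))            ; 8 * 10^14 bytes
(define tib (/ bytes (expt 1024.0 4)))  ; size in units of 2^40 bytes

(printf "~a entries, ~a bytes, roughly ~a TiB\n" entries bytes tib)
```

With these assumptions the result is a little over 720 when counted in 
units of 2^40 bytes.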

More realistic values, which I am aiming for, are:

* input data processing functions, such as `df-mean-max`, should work 
reasonably fast for a data-set of 3600 to 10800 items, which corresponds to 
activities that last 1 to 3 hours.  They should also work acceptably for 
data sets of 21600 items (an Ironman bike split).
* data plot and summary functions should work reasonably fast for data sets 
of 100 to 5000 items (depending on the case) -- in fact my code goes to 
great lengths to ensure that the data that is passed to any plot function 
is reasonably small, so plots are displayed in 1 second or less.
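
If someone wants to time code at these sizes without my data, something 
along these lines might be a starting point.  It reuses the fake data shape 
from gustavo's message quoted below; `make-fake-data` and `count-valid` are 
hypothetical names for this sketch, and `count-valid` is only a stand-in for 
whichever real function is being measured:

```racket
#lang racket/base

;; Fake data: a list of 2-element vectors, with about 1% of the
;; entries being (vector #f #f).
(define (make-fake-data n)
  (for/list ([_ (in-range n)])
    (if (< (random) .01)
        (vector #f #f)
        (vector (- (random) .5) (- (random) .5)))))

;; Hypothetical stand-in for the function under test: counts the items
;; whose second slot is not #f.  Replace it with the real function.
(define (count-valid data)
  (for/sum ([item data] #:when (vector-ref item 1)) 1))

;; Time the stand-in at the data set sizes mentioned above.
(for ([n (in-list '(100 5000 3600 10800 21600))])
  (define data (make-fake-data n))
  (printf "n = ~a: " n)
  (time (void (count-valid data))))
```

Random data will not reproduce the characteristics of real activities, so 
timings from a sketch like this are only a rough guide.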

I am always looking for feedback, and if you want to spend some time 
diagnosing performance issues, I can set up more individual tests, or I can 
let you know what realistic data sets are so you can set up tests of your 
own.

Best Regards,
Alex.

[1]  
https://github.com/alex-hhh/ActivityLog2/blob/72598c184caf541c4e2a60aa93e19159b10563af/rkt/data-frame/meanmax.rkt#L60

On Tuesday, February 5, 2019 at 9:49:12 PM UTC+8, gustavo wrote:
>
> I have been trying a few variations of the code. It would be nice to have 
> a test branch that uses only the data in the repository. I used some fake 
> data instead.
>
> For the tests, I used the function `get-mean-max-bounds` 
> https://github.com/alex-hhh/ActivityLog2/blob/master/rkt/data-frame/meanmax.rkt#L409
>  
> with this data 
>
>   (define fake-data2
>     (for/list ([_ (in-range 10000000)])
>       (if (< (random) .01)
>          (vector #f #f)
>          (vector (- (random) .5) (- (random) .5)))))
>
> so, I tested with
>
>   (time (get-mean-max-bounds fake-data2))
>
>
>
> *** The main time improvement was changing 
>   (for ([b bavg] #:when (vector-ref b 1))
>     ...)
> to
>   (for ([b (in-list bavg)] #:when (vector-ref b 1))
>     ...)
>
> This at least doubles the speed. In the microbenchmark, the new duration 
> is 40%-50% of the original duration.
>
> IIUC, in all functions you know the type of sequence of the arguments, so 
> my advice is to add in-list or in-vector to each and every for in the whole 
> file (or project).
>
> This is a good general recommendation. With in-list or in-vector or 
> in-range, the generated code is very efficient. Without them, the code has 
> to create a generic object to track the iteration, and the code is much 
> slower.
>
>
> *** I tried eliminating the set! and using for/fold instead. The problem 
> is that the code is slower :(. In general it's better to avoid mutable 
> variables, but in this case removing them makes the program slower. We 
> should take a look at the internal code of Racket and try to fix it, 
> because in a perfect world the version without set! should be faster. 
> Meanwhile, keep the current version...
>
>
> *** I tried replacing the for and set! with an explicit loop. Something 
> like
>      (let loop ([bavg bavg] [min-x #f] [max-x #f]  [min-y #f] [max-y #f])
>        ...)
>
> With this change, there is an additional 5% improvement in the speed, but 
> the legibility is reduced too much. So this is faster than the version with 
> for and in-list, but I recommend keeping the legible version.
>
>
> *** I tried replacing the initial value of min-x and friends with +inf.0, 
> and removing the if in the updates. I'm convinced this is a good idea, but 
> the change in speed is negligible.
>
>
>
>
> In conclusion, try adding in-list, in-vector and in-range wherever you 
> can.
>
> Gustavo
>
>
>
>
> On Thu, Jan 31, 2019 at 9:58 AM Alex Harsanyi <alexha...@gmail.com> wrote:
>
>>
>> On Thursday, January 31, 2019 at 9:23:39 AM UTC+8, Matthew Flatt wrote:
>>>
>>> > I would be happy to help you identify where the performance 
>>> > degradation between Racket 7.1 and CS is when running these tests. 
>>>
>>> Small examples that illustrate slowness in a specific subsystem are 
>>> always helpful. I can't always make the subsystem go faster right away, 
>>> but sometimes. 
>>>
>>>
>> I timed some key functions in my application to understand which parts of 
>> Racket CS are slow.  I did a write-up in the Gist listed below, but the 
>> result seems to be that even functions that run Racket only code with no IO 
>> or calls into C libraries run slower in Racket CS.  Code that calls into 
>> the database library to run SQL insert queries runs significantly slower.  
>> The only things which were faster in Racket CS were one "Racket only" 
>> function, `df-histogram` and a function which retrieved data from an SQL 
>> query, `df-read/sql`
>>
>> https://gist.github.com/alex-hhh/1ebc1c83b68ee4620a70fc30d2caa6a3
>>
>> Alex.
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "Racket Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to racket-users...@googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
