[PATCH] Move slow path out of 'scm_get_byte_or_eof' et al

2013-04-02 Thread Mark H Weaver
Andy suggested this.  Okay to push to stable-2.0?

Mark


From 0b70eba61ee26654f442199e7cc49838c5b0030e Mon Sep 17 00:00:00 2001
From: Mark H Weaver 
Date: Tue, 2 Apr 2013 05:33:24 -0400
Subject: [PATCH] Move slow path out of 'scm_get_byte_or_eof' et al.

Suggested by Andy Wingo.

* libguile/inline.h (scm_get_byte_or_eof, scm_peek_byte_or_eof): Keep
  only the fast path here, with fallback to 'scm_i_get_byte_or_eof' and
  'scm_i_peek_byte_or_eof'.

* libguile/ports.c (scm_i_get_byte_or_eof, scm_i_peek_byte_or_eof): New
  internal functions.

* libguile/ports.h (scm_i_get_byte_or_eof, scm_i_peek_byte_or_eof): Add
  prototypes.
---
 libguile/inline.h |   44 ++--
 libguile/ports.c  |   41 +
 libguile/ports.h  |2 ++
 3 files changed, 53 insertions(+), 34 deletions(-)

diff --git a/libguile/inline.h b/libguile/inline.h
index 88ba7f7..17d8a0c 100644
--- a/libguile/inline.h
+++ b/libguile/inline.h
@@ -96,50 +96,26 @@ scm_is_string (SCM x)
 SCM_INLINE_IMPLEMENTATION int
 scm_get_byte_or_eof (SCM port)
 {
-  int c;
   scm_t_port *pt = SCM_PTAB_ENTRY (port);
 
-  if (pt->rw_active == SCM_PORT_WRITE)
-    /* may be marginally faster than calling scm_flush.  */
-    scm_ptobs[SCM_PTOBNUM (port)].flush (port);
-
-  if (pt->rw_random)
-    pt->rw_active = SCM_PORT_READ;
-
-  if (pt->read_pos >= pt->read_end)
-    {
-      if (SCM_UNLIKELY (scm_fill_input (port) == EOF))
-	return EOF;
-    }
-
-  c = *(pt->read_pos++);
-
-  return c;
+  if (SCM_LIKELY ((pt->rw_active == SCM_PORT_READ || !pt->rw_random)
+                  && pt->read_pos < pt->read_end))
+    return *pt->read_pos++;
+  else
+    return scm_i_get_byte_or_eof (port);
 }
 
 /* Like `scm_get_byte_or_eof' but does not change PORT's `read_pos'.  */
 SCM_INLINE_IMPLEMENTATION int
 scm_peek_byte_or_eof (SCM port)
 {
-  int c;
   scm_t_port *pt = SCM_PTAB_ENTRY (port);
 
-  if (pt->rw_active == SCM_PORT_WRITE)
-    /* may be marginally faster than calling scm_flush.  */
-    scm_ptobs[SCM_PTOBNUM (port)].flush (port);
-
-  if (pt->rw_random)
-    pt->rw_active = SCM_PORT_READ;
-
-  if (pt->read_pos >= pt->read_end)
-    {
-      if (SCM_UNLIKELY (scm_fill_input (port) == EOF))
-	return EOF;
-    }
-
-  c = *pt->read_pos;
-
-  return c;
+  if (SCM_LIKELY ((pt->rw_active == SCM_PORT_READ || !pt->rw_random)
+                  && pt->read_pos < pt->read_end))
+    return *pt->read_pos;
+  else
+    return scm_i_peek_byte_or_eof (port);
 }
 
 SCM_INLINE_IMPLEMENTATION void
diff --git a/libguile/ports.c b/libguile/ports.c
index becdbed..76ec83f 100644
--- a/libguile/ports.c
+++ b/libguile/ports.c
@@ -1439,6 +1439,47 @@ scm_fill_input (SCM port)
   return scm_ptobs[SCM_PTOBNUM (port)].fill_input (port);
 }
 
+/* Slow-path fallback for 'scm_get_byte_or_eof' in inline.h */
+int
+scm_i_get_byte_or_eof (SCM port)
+{
+  scm_t_port *pt = SCM_PTAB_ENTRY (port);
+
+  if (pt->rw_active == SCM_PORT_WRITE)
+    scm_flush (port);
+
+  if (pt->rw_random)
+    pt->rw_active = SCM_PORT_READ;
+
+  if (pt->read_pos >= pt->read_end)
+    {
+      if (SCM_UNLIKELY (scm_fill_input (port) == EOF))
+	return EOF;
+    }
+
+  return *pt->read_pos++;
+}
+
+/* Slow-path fallback for 'scm_peek_byte_or_eof' in inline.h */
+int
+scm_i_peek_byte_or_eof (SCM port)
+{
+  scm_t_port *pt = SCM_PTAB_ENTRY (port);
+
+  if (pt->rw_active == SCM_PORT_WRITE)
+    scm_flush (port);
+
+  if (pt->rw_random)
+    pt->rw_active = SCM_PORT_READ;
+
+  if (pt->read_pos >= pt->read_end)
+    {
+      if (SCM_UNLIKELY (scm_fill_input (port) == EOF))
+	return EOF;
+    }
+
+  return *pt->read_pos;
+}
 
 /* scm_lfwrite
  *
diff --git a/libguile/ports.h b/libguile/ports.h
index 53d5081..54bf595 100644
--- a/libguile/ports.h
+++ b/libguile/ports.h
@@ -328,6 +328,8 @@ scm_i_default_port_conversion_handler (void);
 /* Use HANDLER as the default conversion strategy for future ports.  */
 SCM_INTERNAL void
 scm_i_set_default_port_conversion_handler (scm_t_string_failed_conversion_handler);
+SCM_INTERNAL int scm_i_get_byte_or_eof (SCM port);
+SCM_INTERNAL int scm_i_peek_byte_or_eof (SCM port);
 
 SCM_API SCM scm_port_conversion_strategy (SCM port);
 SCM_API SCM scm_set_port_conversion_strategy_x (SCM port, SCM behavior);
-- 
1.7.10.4



Re: array-copy! is slow & array-map.c (was: Extremly slow for format & string-join)

2013-04-02 Thread Daniel Llorens

On Apr 1, 2013, at 19:15, Daniel Llorens wrote:

> Third, none of the above are causing the slowness of array-copy!. I noticed 
> that there's a double indirection in racp(). The second patch removes it. 
> Actually this double indirection goes on all over array-map.c and I don't 
> understand why it's needed...

This patch does the same for array-fill!. The improvement is similar.
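
For concreteness, here is a rough sketch of the two access patterns being
discussed.  The function names are hypothetical and this is not the actual
racp()/array-fill! code from array-map.c; it only illustrates what dropping
the second lookup buys:

#include <libguile.h>

/* "Double indirection": every store goes through the generic
   array-handle layer, which re-dispatches on the backing storage
   type for each element.  (Sketch only; rank-1 case.)  */
static void
fill_via_handle (scm_t_array_handle *h, SCM fill)
{
  const scm_t_array_dim *dim = scm_array_handle_dims (h);
  ssize_t i, n = dim->ubnd - dim->lbnd + 1;

  for (i = 0; i < n; i++)
    scm_array_handle_set (h, i * dim->inc, fill);
}

/* Single lookup: fetch the writable elements once, then do plain
   strided stores (valid here only for SCM-vector-backed arrays).  */
static void
fill_via_elements (scm_t_array_handle *h, SCM fill)
{
  const scm_t_array_dim *dim = scm_array_handle_dims (h);
  SCM *elts = scm_array_handle_writable_elements (h);
  ssize_t i, n = dim->ubnd - dim->lbnd + 1;

  for (i = 0; i < n; i++)
    elts[i * dim->inc] = fill;
}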

Before:

scheme@(guile-user)> (define a (make-typed-array 'a *unspecified* 100 10))
scheme@(guile-user)> ,time (array-fill! a #\v)
;; 1.211000s real time, 1.20s run time.  0.00s spent in GC.

After:

scheme@(guile-user)> ,time (array-fill! a #\v)
;; 0.855000s real time, 0.85s run time.  0.00s spent in GC.

Regards

Daniel




0004-Remove-double-indirection-in-array-fill.patch
Description: Binary data


Re: [PATCH] Move slow path out of 'scm_get_byte_or_eof' et al

2013-04-02 Thread Mike Gran
> From: Mark H Weaver 
>  
>Andy suggested this.  Okay to push to stable-2.0?

Hi Mark,
 
I can't imagine any possible way that this patch could make
performance worse, but it is in the very core of the ports.
 
I don't think you can get away without at least a bit of
profiling.
 
-Mike



Re: array-copy! is slow & array-map.c (was: Extremly slow for format & string-join)

2013-04-02 Thread Daniel Llorens

On Apr 2, 2013, at 12:19, Daniel Llorens wrote:

> On Apr 1, 2013, at 19:15, Daniel Llorens wrote:
> 
>> Third, none of the above are causing the slowness of array-copy!. I noticed 
>> that there's a double indirection in racp(). The second patch removes it. 
>> Actually this double indirection goes on all over array-map.c and I don't 
>> understand why it's needed...
> 
> This patch does the same for array-fill!. The improvement is similar.

These two patches do it for array-map!.

The first patch avoids cons for the 1-argument case. The second patch removes 
the double indirection for the first two arguments in every case. Since there's 
some work done inside the loop, the improvement is smaller than for array-copy! 
or array-fill!.
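
To make the first change concrete, here is a hypothetical sketch of the
one-argument step (not the actual array-map.c loop; 'h' is a handle over the
source array and 'pos' the current element position):

#include <libguile.h>

/* Before: a fresh one-element argument list is consed for every
   element and handed to the generic apply path.  */
static SCM
map1_step_with_cons (SCM proc, scm_t_array_handle *h, ssize_t pos)
{
  return scm_apply_0 (proc, scm_list_1 (scm_array_handle_ref (h, pos)));
}

/* After: call the procedure directly with one argument, so no
   intermediate list is allocated per element.  */
static SCM
map1_step_direct (SCM proc, scm_t_array_handle *h, ssize_t pos)
{
  return scm_call_1 (proc, scm_array_handle_ref (h, pos));
}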

Before the first patch:

scheme@(guile-user)> (define a (make-array 0. 100 10))
scheme@(guile-user)> (define b (make-array 0. 100 10))
scheme@(guile-user)> (define c (make-array *unspecified* 100 10))
scheme@(guile-user)> (define d (transpose-array (make-array *unspecified* 10 100) 1 0))
scheme@(guile-user)> ,time (array-map! c (const 0.))
;; 0.558498s real time, 0.556974s run time.  0.00s spent in GC.
scheme@(guile-user)> ,time (array-map! c values a)
;; 1.467717s real time, 1.463470s run time.  0.269786s spent in GC.
scheme@(guile-user)> ,time (array-map! c values d)
;; 1.500785s real time, 1.496482s run time.  0.254156s spent in GC.
scheme@(guile-user)> ,time (array-map! c (lambda (a b) (+ a b)) a b)
;; 2.278655s real time, 2.271948s run time.  0.528379s spent in GC.

After the first patch:

scheme@(guile-user)> (define a (make-array 0. 100 10))
scheme@(guile-user)> (define b (make-array 0. 100 10))
scheme@(guile-user)> (define c (make-array *unspecified* 100 10))
scheme@(guile-user)> (define d (transpose-array (make-array *unspecified* 10 100) 1 0))
scheme@(guile-user)> ,time (array-map! c (const 0.))
;; 0.559337s real time, 0.557811s run time.  0.00s spent in GC.
scheme@(guile-user)> ,time (array-map! c values a)
;; 1.038836s real time, 1.035859s run time.  0.134621s spent in GC.
scheme@(guile-user)> ,time (array-map! c values d)
;; 1.041937s real time, 1.039000s run time.  0.125787s spent in GC.
scheme@(guile-user)> ,time (array-map! c (lambda (a b) (+ a b)) a b)
;; 2.294594s real time, 2.287883s run time.  0.518576s spent in GC.

After both patches:

scheme@(guile-user)> (define a (make-array 0. 100 10))
scheme@(guile-user)> (define b (make-array 0. 100 10))
scheme@(guile-user)> (define c (make-array *unspecified* 100 10))
scheme@(guile-user)> (define d (transpose-array (make-array *unspecified* 10 100) 1 0))
scheme@(guile-user)> ,time (array-map! c (const 0.))
;; 0.467621s real time, 0.466343s run time.  0.00s spent in GC.
scheme@(guile-user)> ,time (array-map! c values a)
;; 0.823515s real time, 0.821152s run time.  0.135106s spent in GC.
scheme@(guile-user)> ,time (array-map! c values d)
;; 0.825138s real time, 0.822763s run time.  0.124970s spent in GC.
scheme@(guile-user)> ,time (array-map! c (lambda (a b) (+ a b)) a b)
;; 2.066735s real time, 2.060635s run time.  0.525089s spent in GC.



0005-Avoid-per-element-cons-for-1-arg-case-of-array-map.patch
Description: Binary data


0006-Remove-double-indirection-in-array-map-with-2-args.patch
Description: Binary data



Re: array-copy! is slow & array-map.c (was: Extremly slow for format & string-join)

2013-04-02 Thread Daniel Llorens

This patch is for array-for-each. Only the first argument.

Before:

scheme@(guile-user)> (define a (make-array 1 100 10))
scheme@(guile-user)> ,time (let ((x 0)) (array-for-each (lambda (a) (set! x (+ a x))) a) x)
$8 = 1000
;; 0.621887s real time, 0.620181s run time.  0.00s spent in GC.

After:

scheme@(guile-user)> (define a (make-array 1 100 10))
scheme@(guile-user)> ,time (let ((x 0)) (array-for-each (lambda (a) (set! x (+ a x))) a) x)
$2 = 1000
;; 0.529748s real time, 0.528301s run time.  0.00s spent in GC.

Regards,

Daniel



0007-Remove-double-indirection-for-1st-arg-of-array-for-e.patch
Description: Binary data




Re: array-copy! is slow & array-map.c (was: Extremly slow for format & string-join)

2013-04-02 Thread Daniel Llorens

Hello,

Sorry for the repeated message earlier.

OK, I think it's clear that all these double indirections can be eliminated 
globally by composing the strides of the array descriptor with the stride of 
SCM_I_ARRAY_V (array) and using the composed strides directly in the array 
descriptor. Is there any reason not to do this?
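
As a sketch of that composition (made-up field names, not the real
scm_t_array_dim layout):

#include <stddef.h>

/* Hypothetical descriptor, only to show the arithmetic.  */
struct view { ptrdiff_t base, inc; };

/* The outer array maps index i to position outer.base + i*outer.inc
   inside SCM_I_ARRAY_V (array); that inner view in turn maps position
   j to inner.base + j*inner.inc in the underlying storage.  Folding
   the two gives a single descriptor.  */
static struct view
compose_views (struct view outer, struct view inner)
{
  struct view r;
  r.base = inner.base + inner.inc * outer.base;
  r.inc  = inner.inc * outer.inc;
  return r;
}

/* Element i is then storage[r.base + i * r.inc], with no per-element
   hop through the inner descriptor.  */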

Regards,

Daniel


Re: Extremly slow for format & string-join

2013-04-02 Thread Ludovic Courtès
Mark H Weaver  skribis:

> Indeed, the implementation of 'string-join' was very bad: about O(n^2)

[...]

> Before:
>
> scheme@(guile-user)> ,time (define s (string-join (make-list 1 "test") "-"))
> ;; 0.998800s real time, 0.996677s run time.  0.984885s spent in GC.
> scheme@(guile-user)> ,time (define s (string-join (make-list 10 "test") "-"))
> ;; 98.006569s real time, 97.817077s run time.  97.795970s spent in GC.
>
> After:
>
> scheme@(guile-user)> ,time (define s (string-join (make-list 1 "test") "-"))
> ;; 0.006362s real time, 0.006351s run time.  0.00s spent in GC.
> scheme@(guile-user)> ,time (define s (string-join (make-list 10 "test") "-"))
> ;; 0.028513s real time, 0.028457s run time.  0.022235s spent in GC.
> scheme@(guile-user)> ,time (define s (string-join (make-list 100 "test") "-"))
> ;; 0.303098s real time, 0.302543s run time.  0.289639s spent in GC.
> scheme@(guile-user)> ,time (define s (string-join (make-list 1000 "test") "-"))
> ;; 3.288105s real time, 3.281922s run time.  3.174460s spent in GC.

Nice!

Ludo’.




Re: [PATCH] Move slow path out of 'scm_get_byte_or_eof' et al

2013-04-02 Thread Ludovic Courtès
Mike Gran  skribis:

>> From: Mark H Weaver 
>>  
>>Andy suggested this.  Okay to push to stable-2.0?
>
> Hi Mark,
>  
> I can't imagine any possible way that this patch could make
> performance worse, but it is in the very core of the ports.
>  
> I don't think you can get away without at least a bit of
> profiling.

Yes, it’s most likely the Right Thing, but figures to confirm this would
be nice.

Ludo’.




Re: [PATCH] Move slow path out of 'scm_get_byte_or_eof' et al

2013-04-02 Thread Mark H Weaver
Mike Gran  writes:

> I can't imagine any possible way that this patch could make
> performance worse, but it is in the very core of the ports.
>  
> I don't think you can get away without at least a bit of
> profiling.

Fair enough.  First of all, this patch reduces the size of the built
libguile.so by about 42 kilobytes, and libguile.a by about 96 kilobytes.

As for performance: the updated patches (attached below) slow things
down by about one quarter of one percent on my machine.  The specific
benchmark I did was to call 'read-string' on an 11 megabyte ASCII text
file, with the port-encoding set to UTF-8.  In this case, all the reads
are done using 'scm_get_byte_or_eof' (called from 'get_utf8_codepoint').

So I think this is an acceptable cost for such a great reduction in code
size.  What do you think?

Thanks for nudging me to do the measurement.  To be honest, the original
patch I posted slowed things down by 4.5%, which I found extremely
surprising.  I fixed it by adding an internal 'static' version of
'scm_fill_input'.  So for those who doubt the cost of function calls
through the shared library PLT (which, within a shared library, is
*every* function call to a non-static function), let this be a lesson.
That's the only thing that changed here, and it made more than a 4%
difference, even though the procedure call in question was done only
once per 4 kilobyte buffer!
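
In schematic form the fix looks like this (placeholder names; the attached
patch below does exactly this for 'scm_fill_input' in ports.c):

#include <libguile.h>

static int do_fill_input (SCM port);   /* all the real work lives here */

int
scm_fill_input (SCM port)              /* exported entry point */
{
  return do_fill_input (port);
}

static int
do_fill_input (SCM port)
{
  /* ...former body of scm_fill_input...  Callers inside libguile now
     reach 'do_fill_input' directly, so the PLT indirection is paid
     only by callers outside the shared library.  */
  return EOF;                          /* stub so the sketch compiles */
}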

Here's the raw data:

before:
  12767060  libguile-2.0.a
   6999271  libguile-2.0.so.22.6.0

after:
  12671162  libguile-2.0.a  (96 kbytes smaller)
   6956989  libguile-2.0.so.22.6.0  (42 kbytes smaller)

test file (the contents of 'psyntax.scm' repeated many times):
-rw-r--r-- 1 mhw mhw 11503780 Apr  2 12:46 /home/mhw/BIG-FILE

before:
scheme@(guile-user)> ,time (define s (call-with-input-file "/home/mhw/BIG-FILE" read-string))
;; 2.953214s real time, 2.951423s run time.  0.027767s spent in GC.
scheme@(guile-user)> ,time (define s (call-with-input-file "/home/mhw/BIG-FILE" read-string))
;; 2.942170s real time, 2.940581s run time.  0.011873s spent in GC.
scheme@(guile-user)> ,time (define s (call-with-input-file "/home/mhw/BIG-FILE" read-string))
;; 2.954309s real time, 2.950164s run time.  0.011952s spent in GC.
avg: 2.94738936

after:
scheme@(guile-user)> ,time (define s (call-with-input-file "/home/mhw/BIG-FILE" read-string))
;; 2.964777s real time, 2.957597s run time.  0.012116s spent in GC.
scheme@(guile-user)> ,time (define s (call-with-input-file "/home/mhw/BIG-FILE" read-string))
;; 2.971099s real time, 2.968656s run time.  0.012020s spent in GC.
scheme@(guile-user)> ,time (define s (call-with-input-file "/home/mhw/BIG-FILE" read-string))
;; 2.942377s real time, 2.940015s run time.  0.012085s spent in GC.
avg: 2.95542263

0.27% slower

I've attached the new patches.  Okay to push now?

   Thanks,
 Mark


From 187fa0b9e7ff9b2d6204517a9daa9009245c7511 Mon Sep 17 00:00:00 2001
From: Mark H Weaver 
Date: Tue, 2 Apr 2013 13:33:14 -0400
Subject: [PATCH 1/2] Add a static version of 'scm_fill_input' to ports.c.

* libguile/ports.c (scm_i_fill_input): New static function, containing
  the code that was previously in 'scm_fill_input'.
  (scm_fill_input): Simply call 'scm_i_fill_input'.
  (scm_c_read): Use 'scm_i_fill_input'.
---
 libguile/ports.c |   24 +++-
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/libguile/ports.c b/libguile/ports.c
index becdbed..13a993e 100644
--- a/libguile/ports.c
+++ b/libguile/ports.c
@@ -1419,8 +1419,8 @@ scm_getc (SCM port)
 /* this should only be called when the read buffer is empty.  it
    tries to refill the read buffer.  it returns the first char from
    the port, which is either EOF or *(pt->read_pos).  */
-int
-scm_fill_input (SCM port)
+static int
+scm_i_fill_input (SCM port)
 {
   scm_t_port *pt = SCM_PTAB_ENTRY (port);
 
@@ -1439,6 +1439,12 @@ scm_fill_input (SCM port)
   return scm_ptobs[SCM_PTOBNUM (port)].fill_input (port);
 }
 
+int
+scm_fill_input (SCM port)
+{
+  return scm_i_fill_input (port);
+}
+
 
 /* scm_lfwrite
  *
@@ -1547,8 +1553,8 @@ scm_c_read (SCM port, void *buffer, size_t size)
   if (size == 0)
     return n_read;
 
-  /* Now we will call scm_fill_input repeatedly until we have read the
-     requested number of bytes.  (Note that a single scm_fill_input
+  /* Now we will call scm_i_fill_input repeatedly until we have read the
+     requested number of bytes.  (Note that a single scm_i_fill_input
      call does not guarantee to fill the whole of the port's read
      buffer.) */
   if (pt->read_buf_size <= 1 && pt->encoding == NULL)
     {
       /* The port that we are reading from is unbuffered - i.e. does
 	 not have its own persistent buffer - but we have a buffer,
 	 provided by our caller, that is the right size for the data
-	 that is wanted.  For the following scm_fill_input calls,
+	 that is wanted.  For the following scm_i_

Re: [PATCH] Move slow path out of 'scm_get_byte_or_eof' et al

2013-04-02 Thread Mike Gran
> From: Mark H Weaver 

> Thanks for nudging me to do the measurement.  To be honest, the original
> patch I posted slowed things down by 4.5%, which I found extremely
> surprising.  I fixed it by adding an internal 'static' version of
> 'scm_fill_input'.  So for those who doubt the cost of function calls
> through the shared library PLT (which, within a shared library, is
> *every* function call to a non-static function), let this be a lesson.
> That's the only thing that changed here, and it made more than a 4%
> difference, even though the procedure call in question was done only
> once per 4 kilobyte buffer!

> I've attached the new patches.  Okay to push now?

I don't think I get a vote, but I vote yes anyway.

-Mike