On 2011 Feb 03, at 13:57, glenn andreas wrote:

> Why not just use NSScanner's scanDouble: instead of trying to scan a string 
> that you think is valid and then convert to a double?

Well, because in running my tests I'd observed that using -scanDecimal: 
degraded performance even when it tried and failed to scan a non-number string, 
but I forgot that in real life you usually know enough about the structure of 
the string you're scanning that you rarely do this.

I've since realized that this thing is difficult to optimize in the general 
case.  What I'm actually trying to do is improve the performance of Blake 
Seely's BSJSONAdditions category on NSScanner, where it scans the JSON string 
representing a number value.  Blake used -scanDecimal:, which is bulletproof, 
but is sometimes way slow (read on), and few applications need 38 decimal 
places of accuracy.  The JSON standard allows "e/E" notation to be used in 
number strings, but does not allow for localized decimal separators; i.e. it's 
always ".".  And, in Blake's design, the method will not execute for valid JSON 
if there is not a valid number string at the scan location.  In my current 
project, the number strings are all integers, although many are 64-bit 
integers.  In some cases (accounting/finance??), one may need the 
38-decimal-digit precision of NSDecimal.
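
As a rough illustration of those constraints (the separator is always ".", "e/E" is optional, and most values are plain integers), here is a minimal sketch in plain C, which also compiles as Objective-C.  It is not Blake's code and not the category posted below; the names json_number and parse_json_number are hypothetical, and it assumes the default "C" locale so that strtod() expects "." as the decimal separator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type, for this sketch only. */
typedef struct {
    int       is_integer;  /* 1: int_value is set; 0: dbl_value is set */
    long long int_value;
    double    dbl_value;
} json_number;

/* Parses the number token at the start of s.  Returns the number of
   characters consumed, or 0 if s does not start with a number.
   Assumes the default "C" locale (decimal separator is "."). */
size_t parse_json_number(const char *s, json_number *out) {
    assert(s != NULL && out != NULL);

    /* Leading run of integer characters, and leading run of characters
       that may appear anywhere in a JSON number. */
    size_t int_span = strspn(s, "0123456789+-");
    size_t flt_span = strspn(s, "0123456789+-eE.");
    if (flt_span == 0) {
        return 0;
    }

    char *end = NULL;
    if (int_span == flt_span) {
        /* No ".", "e" or "E" in the token: take the cheap integer path. */
        long long v = strtoll(s, &end, 10);
        if (end != s) {
            out->is_integer = 1;
            out->int_value = v;
            return (size_t)(end - s);
        }
    }

    double d = strtod(s, &end);
    /* Reject "no conversion", and anything strtod() accepts beyond the
       JSON character set, such as "-inf". */
    if (end == s || (size_t)(end - s) > flt_span) {
        return 0;
    }
    out->is_integer = 0;
    out->dbl_value = d;
    return (size_t)(end - s);
}
```

Comparing the two spans decides integer versus floating point without building any intermediate string; the sketch does not validate full JSON number syntax (it would accept ".5", for instance).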

So, here's my last shot in case anyone is following this.  Probably there are 
still corner cases I haven't thought of, but anyhow, this works for me.  With 
the parameters as hard-coded below, the "not-accurately" case runs about 120x 
faster than the "scanDecimal" case.  Anyone needing to optimize the performance 
of number-string scanning can adjust the parameters to fit their project.

One sure conclusion is that, if you plan to scan long strings, hundreds of KB, 
you will greatly improve performance, and possibly eliminate crashes, by 
surrounding -scanDecimal: with its own local autorelease pool.

Another conclusion is that the speedup is highly sensitive to the length of the 
string, which is controlled by the 'strexponent' in main().  For strings of 100 
characters or less, there is not much speed improvement.  However, for longer 
strings the speed improvement is approximately proportional to 2^strexponent, 
that is, to the length of the string.  This makes no sense to me, since, in my 
mind, -scanDecimal: should begin scanning at the scanLocation, scan whatever 
number string is there, and then stop.  It should not even need to know the 
characters before or after.  But probably I'm overlooking some unforeseen 
requirement of some corner case.
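
To put numbers on that, here is a small sketch in plain C of the arithmetic behind the test string.  The helper name doubled_length is hypothetical.  With the 57-character template and strexponent = 10 used below, it gives 58,368 characters, which is the "60K" quoted in the header comment.  If the fast path costs roughly the same per number regardless of string length (an assumption, not something I measured separately), then a speedup proportional to length would mean the cost of each -scanDecimal: call itself grows with the length of the whole string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of a string after it has been appended to itself `doublings`
   times, as the loop in main() does: each pass doubles the length. */
size_t doubled_length(size_t template_length, unsigned doublings) {
    /* Guard against shifting by the full width of size_t or more. */
    assert(doublings < 8 * sizeof(size_t));
    return template_length << doublings;
}
```

For example, doubled_length(strlen(template), 10) with the template from main() returns 57 * 1024 = 58368.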

Thanks again for the feedback on this.

#import <Cocoa/Cocoa.h>

@interface NSScanner (Speedy)

/*!
 @brief    Scans for a string representing a number value, returning
 it by reference if found, and offering an option to improve
 performance if "double precision" is sufficient.
 
 @details  The string to be scanned may be either a plain decimal
 number string such as "-123.456", or a string using scientific
 "e" notation such as "-123e10".
 
 The performance improvement when using accurately:NO is typically
 120x when scanning a string of 60K characters, and autoreleased
 memory allocations are also greatly reduced.  The improvement is
 proportional to the string length in this regime.
 
 This method is optimized assuming that the decimal point in the
 scanned numbers is the dot ".", and never localized to ",".
 
 Invoke this method with NULL as number to simply scan past a string
 representation of a number.
 
 @param    number  Upon return, contains the scanned value as either an
 NSNumber or NSDecimalNumber value.

 @param    accurately  If YES, the 'number' pointed to upon return will
 be an NSDecimalNumber, with 38 decimal digits of precision.  If NO,
 the 'number' pointed to upon return will be an NSNumber whose value
 is only as accurate as the 'binary64' variant defined in the IEEE
 Standard for Floating-Point Arithmetic (IEEE Standard 754).
 
 @result   YES if the receiver scans anything, otherwise NO.
 */
- (BOOL)scanJSONNumber:(NSNumber**)number
            accurately:(BOOL)accurately ;

@end

@implementation NSScanner (Speedy)

- (BOOL)scanJSONNumber:(NSNumber**)number
            accurately:(BOOL)accurately {
    BOOL result = NO ;
    
    if (!accurately) {
        NSCharacterSet* integerSet = [NSCharacterSet
                                      characterSetWithCharactersInString:@"0123456789+-"] ;
        NSCharacterSet* floatSet = [NSCharacterSet
                                    characterSetWithCharactersInString:@"0123456789+-eE."] ;
        NSInteger scanLocation = [self scanLocation] ;
        [self scanCharactersFromSet:integerSet
                         intoString:NULL] ;
        NSInteger nextNonInteger = [self scanLocation] ;
        [self setScanLocation:scanLocation] ;
        [self scanCharactersFromSet:floatSet
                         intoString:NULL] ;
        NSInteger nextNonFloat = [self scanLocation] ;
        [self setScanLocation:scanLocation] ;
        
        BOOL isInteger = (nextNonFloat == nextNonInteger) ;
        
        if (isInteger) {
            long long daLongLong ;
            result = [self scanLongLong:&daLongLong] ;
            // 'number' may be NULL if the caller only wants to scan past
            // the number string, so check before dereferencing.
            if (result && number) {
                *number = [NSNumber numberWithLongLong:daLongLong] ;
            }
        }
        
        if (!result) {
            double daDouble ;
            result = [self scanDouble:&daDouble] ;
            if (result && number) {
                *number = [NSNumber numberWithDouble:daDouble] ;
            }
        }
    }
    else {
        NSDecimal decimal ;
        // This local autorelease pool is quite necessary when parsing
        // strings of 500K characters with dense numbers.  Otherwise,
        // memory allocations go into the gigabytes.
        NSAutoreleasePool* pool = [[NSAutoreleasePool alloc] init] ;
        result = [self scanDecimal:&decimal] ;
        [pool release] ;
        if (result && number) {
            *number = [NSDecimalNumber decimalNumberWithDecimal:decimal] ;
        }
    }
    
    return result ;
}

@end

#define NOT_ACCURATELY 0
#define ACCURATELY 1
#define SCAN_DECIMAL 2


NSTimeInterval TestScanner(
                           NSScanner *scanner,
                           NSInteger how,
                           double expectedChecksum) {
    // We use a local autorelease pool, otherwise -scanDecimal: starts
    // using gigabytes of virtual memory, which makes both the HomeBrew
    // and scanDecimal methods take longer, and slow our Mac.
    NSAutoreleasePool* pool = [[NSAutoreleasePool alloc] init] ;
    
    [scanner setScanLocation:0] ;
    double checksum = 0.0 ;
    NSTimeInterval elapsed = 0.0 ;
    NSInteger lastProgress = 0 ;
    NSInteger progressInterval = 5 ;
    NSInteger nChars = [[scanner string] length] ;
    while (![scanner isAtEnd]) {
        // show progress
        NSInteger currentSeconds = floor(elapsed) ;
        if ((currentSeconds % progressInterval == 0)
            && (currentSeconds > (lastProgress + 1))) {
            // Cast to long so the format matches on both 32- and 64-bit
            NSLog(@"scanned %ld / %ld chars",
                  (long)[scanner scanLocation], (long)nChars) ;
            lastProgress = currentSeconds ;
        }
        
        [scanner scanString:@":"
                 intoString:NULL] ;
        
        NSNumber* number = nil ;
        BOOL didscanJSONNumber = NO ;
        
        NSDate* date = [NSDate date] ;
        switch (how) {
            case NOT_ACCURATELY:
                didscanJSONNumber = [scanner scanJSONNumber:&number
                                         accurately:NO] ;
                break ;
            case ACCURATELY:
                didscanJSONNumber = [scanner scanJSONNumber:&number
                                         accurately:YES] ;
                break ;
            case SCAN_DECIMAL: {
                NSDecimal decimal ;
                // This local autorelease pool is quite necessary when parsing
                // strings of 500K characters with dense numbers.  Otherwise,
                // memory allocations go into the gigabytes.  It is named
                // 'localPool' so that it does not shadow the outer 'pool'.
                NSAutoreleasePool* localPool = [[NSAutoreleasePool alloc] init] ;
                didscanJSONNumber = [scanner scanDecimal:&decimal] ;
                [localPool release] ;
                if (didscanJSONNumber) {
                    number = [NSDecimalNumber decimalNumberWithDecimal:decimal] ;
                }
                break ;
            }
            default:
                break ;
        }
        elapsed += -[date timeIntervalSinceNow] ;
        
        if (didscanJSONNumber) {
            double value = [number doubleValue] ;
            checksum += value ;
        }
        else {
            [scanner setScanLocation:[scanner scanLocation] + 1] ;
        }
    }
    
    BOOL checksumOK = checksum == expectedChecksum ;
    

    NSString* howWord ;
    
    switch (how) {
        case NOT_ACCURATELY:
            howWord = @"not-accurately" ;
            break ;
        case ACCURATELY:
            howWord = @"    accurately" ;
            break ;
        case SCAN_DECIMAL:
            howWord = @"   scanDecimal" ;
            break ;
        default:
            howWord = @"???" ;
            break ;
    }
    
    
    NSLog(@"%@:  time = %9.3e seconds   checksum = %0.16f (%@)",
          howWord,
          elapsed,
          checksum,
          checksumOK ? @"OK" : [NSString stringWithFormat:@"Expected %0.16f",
                                expectedChecksum]) ;
    [pool release] ;
    
    return elapsed ;
}

int main(int argc, char *argv[]) {
    NSAutoreleasePool* pool = [[NSAutoreleasePool alloc] init] ;
    
    // For this test, template should be a string of several numbers,
    // representative of the data in the actual application being modelled,
    // each preceded by a single ":" colon character, which add up to 'sum'. 
    NSString* template =
        @":2:-1:3.056789e6:-3.056789e6:1271193582057:-1271193582057" ;
    double sum = 1.0 ;
    NSMutableString* s = [template mutableCopy] ;
    NSInteger i ;
    NSInteger strexponent = 10 ;
    // Double the string strexponent times, which yields
    // 2^strexponent copies of the template
    for (i=0; i<strexponent; i++) {
        [s appendString:s] ;        
    }
    
    NSLog(@"Scanner string length: %lu chars", (unsigned long)[s length]) ;
    NSScanner* scanner = [[NSScanner alloc] initWithString:s] ;
    
    NSTimeInterval elapsedNotAccurately = 0.0 ;
    NSTimeInterval elapsedAccurately = 0.0 ;
    NSTimeInterval elapsedScanDecimal = 0.0 ;
    
    double checksum = sum * pow(2, strexponent) ;
    
    NSInteger j ;
    // Repeat 3x in order to randomize any advantage or disadvantage
    // which may occur from being first or last, due to "priming"
    // of virtual memory, other processes running, or whatever.
    for (j=0; j<3; j++) {
        elapsedNotAccurately += TestScanner(scanner, NOT_ACCURATELY, checksum) ;
        elapsedAccurately += TestScanner(scanner, ACCURATELY, checksum) ;
        elapsedScanDecimal += TestScanner(scanner, SCAN_DECIMAL, checksum) ;
    }
    NSLog (@"Speed Improvement from scanDecimal to accurately: %0.2f",
           elapsedScanDecimal/elapsedAccurately) ;
    NSLog (@"Speed Improvement from scanDecimal to not-accurately: %0.2f",
           elapsedScanDecimal/elapsedNotAccurately) ;
    
    [scanner release] ;
    [s release] ;
    
    [pool drain] ;
    
    return 0 ;     
}
