Some years ago, this list discussed the difficulties of parsing the comma-separated-value (CSV) file format; Richard Gaskin has a great article about it at http://www.fourthworld.com/embassy/articles/csv-must-die.html

Following that discussion, I came up with some code to parse CSV in Livecode which was significantly faster than the straightforward methods (quoted in the above article). At the time, I put that speed gain down to two factors:

1. a way of looking at the problem "sideways" that enables a different approach
2. a 'clever' use of split + array access

Recently the topic came up again, and I looked at the code again. I now realize that the speed gain came entirely from the first of those two factors, and that using split + arrays was not helpful: Livecode's chunk handling is (in this case) faster than using arrays. My only excuse is that I was new to Livecode, and so I was using techniques I was familiar with from other languages. So I revised the code to use chunk handling rather than split + arrays, and the resulting code runs about 40% faster, with the added benefit of being slightly easier to read and understand. The only slightly mind-bending feature of the new code is the use of

    set the lineDelimiter to quote
    repeat for each line k in pData ....

I find it hard to think about "lines" that aren't actually lines :-)
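
To make that concrete, here is a small sketch of what those "lines" look like when the delimiter is a quote. The handler name and the sample string are made up for illustration; they are not part of the code below.

    command showQuoteChunks
      local tSample, tOut, k
      -- sample record:  a,"b,c",d
      put "a," & quote & "b,c" & quote & ",d" into tSample
      set the lineDelimiter to quote
      repeat for each line k in tSample
        -- chunks 1, 3, 5, ... lie outside quotes; chunks 2, 4, ... lie inside
        put "[" & k & "]" & space after tOut
      end repeat
      answer tOut -- shows the three chunks: [a,] [b,c] [,d]
    end showQuoteChunks

The chunks alternate between text outside quotes and text inside quotes, so a single flag that flips on every pass through the loop is enough to know which kind is being looked at.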

So - for anyone who needs or wants more speed, here's the code

function CSV3Tab pData,pcoldelim
  local tNuData -- converted copy of the data; columns are separated
  --                       by numtochar(29), not tab
  local tReturnPlaceholder -- replaces cr in field data to avoid line
  --                       breaks which would be misread as records;
  --                       replaced later during display
  local tEscapedQuotePlaceholder -- used for keeping track of quotes
  --                       in data
  local tInsideQuoted -- flag set while reading data between quotes
  local k -- current quote-delimited chunk of pData
  --
  put numtochar(11) into tReturnPlaceholder -- vertical tab as
  --                       placeholder
  put numtochar(2)  into tEscapedQuotePlaceholder -- used to simplify
  --                       distinction between quotes in data and those
  --                       used in delimiters
  --
  if pcoldelim is empty then put comma into pcoldelim
  -- Normalize line endings:
  replace crlf with cr in pData          -- Win to UNIX
  replace numtochar(13) with cr in pData -- Mac to UNIX
  --
  -- Put placeholder in escaped quote (non-delimiter) chars:
  replace ("\"&quote) with tEscapedQuotePlaceholder in pData
  replace quote&quote with tEscapedQuotePlaceholder in pData
  --
  put space before pData   -- to avoid ambiguity of starting context
  put False into tInsideQuoted
  set the linedel to quote
  repeat for each line k in pData
    if (tInsideQuoted) then
      replace cr with tReturnPlaceholder in k
      put k after tNuData
      put False into tInsideQuoted
    else
      replace pcoldelim with numtochar(29) in k
      put k after tNuData
      put true into tInsideQuoted
    end if
  end repeat
  --
  delete char 1 of tNuData -- remove the leading space
  replace tEscapedQuotePlaceholder with quote in tNuData
  return tNuData
end CSV3Tab
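
A rough example of calling it, in case that helps. Note that the function as written marks column breaks with numtochar(29) rather than tab, so this sketch swaps in tabs afterwards; the field name "Results" and the sample data are just assumptions for the example.

    on mouseUp
      local tCSV, tConverted
      -- two records; the second has a comma inside a quoted field
      put "name,note" & cr & "Alex," & quote & "likes a,b" & quote into tCSV
      put CSV3Tab(tCSV, comma) into tConverted
      -- column breaks come back as numtochar(29); convert them for display
      replace numtochar(29) with tab in tConverted
      -- any cr inside quoted fields comes back as numtochar(11) (see above)
      put tConverted into field "Results" -- hypothetical field name
    end mouseUp

The comma inside the quoted field survives untouched, while the commas outside quotes become column breaks.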


-- Alex.
