Re: [R] CSV format issues

Guillaume Meurice Tue, 24 Jul 2012 03:49:49 -0700

Dear all

Thanks for the support.


I'll set up a heuristic to catch most of the cases.
For the handful of ambiguous cases, I'll exclude them from the analysis and
send feedback to the software vendor so that they can correct the format of
the CSV file.
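
For the archive, here is a minimal sketch of the kind of heuristic I have in
mind. It assumes exactly two numbers per line and uses only the sign,
exponent and leading-zero rules suggested in the quoted reply below; the
function name parse_line() and the sample lines are illustrative, not the
real instrument output. Lines it cannot resolve come back as NA so they can
be excluded.

```r
# Turn one raw data line into c(Time, Value); NA marks a line to exclude.
parse_line <- function(line) {
  num     <- function(x) suppressWarnings(as.numeric(x))
  join    <- function(int, dec) num(paste0(int, ".", dec))
  signed  <- function(x) grepl("^[+-]", x)    # a decimal part never has a sign
  has_exp <- function(x) grepl("[eE]", x)     # an exponent ends a number
  padded  <- function(x) grepl("^0[0-9]", x)  # "05" is not an integer part

  p <- strsplit(line, ",", fixed = TRUE)[[1]]
  out <- c(NA_real_, NA_real_)
  if (length(p) == 2) {
    # English output, or two integers: nothing to glue together
    out <- c(num(p[1]), num(p[2]))
  } else if (length(p) == 4) {
    # French output with two decimal numbers: glue pieces pairwise
    out <- c(join(p[1], p[2]), join(p[3], p[4]))
  } else if (length(p) == 3) {
    a <- p[1]; b <- p[2]; d <- p[3]
    if (signed(b)) {
      out <- c(num(a), join(b, d))           # b starts the Value
    } else if (signed(d) || has_exp(b) || padded(b)) {
      out <- c(join(a, b), num(d))           # b belongs to the Time
    } else if (padded(d) || has_exp(d)) {
      out <- c(num(a), join(b, d))           # d is likely a decimal part
    }
  }
  if (anyNA(out)) out <- c(NA_real_, NA_real_)
  out
}

lines <- c("17.05,-6.352454E-02", "32,-7,183246E-02",
           "32,05,3,469364E-02", "32,7,8")
m <- vapply(lines, parse_line, numeric(2), USE.NAMES = FALSE)
parsed <- data.frame(Time = m[1, ], Value = m[2, ])
ambiguous <- is.na(parsed$Time)   # TRUE only for "32,7,8" here
```

The rows in parsed[!ambiguous, ] would go into the analysis, and
lines[ambiguous] would be reported back to the vendor. The exponent rule is
only a likelihood, so it is applied last.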

Best,
--
Guillaume



On Jul 23, 2012, at 4:54 PM, peter dalgaard wrote:

> 
> On Jul 23, 2012, at 15:06 , Guillaume Meurice wrote:
> 
>> Dear all, 
>> 
>> I have an encoding problem which I'm not familiar with.
>> Here is the case: 
>> I'm reading data files which may have been generated on a computer with 
>> global settings either in French or in English.
>> 
>> Here is an example of a data file:
>> 
>> * English output
>> Time,Value
>> 17,-0.0753953
>> 17.05,-6.352454E-02
>> 
>> * French output.
>> Time,Value
>> 32,-7,183246E-02
>> 32,05,3,469364E-02
>> 
>> In the first case, I can retrieve both columns by splitting each line 
>> on the comma.
>> In the second case, it's impossible, since the comma (in French) is also 
>> used as the decimal separator. Usually, the French CSV file format adds 
>> quotes to distinguish the comma used as column separator from the comma 
>> used as decimal separator, like the following: 
>> 
>> Time,Value
>> 32,"-7,183246E-02"
>> "32,05","3,469364E-02"
>> 
>> Since I'm expecting 2 numbers, I can decide that if there are 3 commas, the 
>> first two pieces are to be joined, as well as the two remaining ones.
>> But in the case of only two commas, which number is the floating-point one? 
>> (I know that it is the second one, but this is a great source of bugs ...).
>> 
>> The Unix tool "file" returns: 
>> ===
>> $ file P23_RD06_High\ Sensitivity\ DNA\ 
>> Assay_DE04103664_2012-06-27_11-57-29_Sample1.csv 
>> $ P23_RD06_High Sensitivity DNA 
>> Assay_DE04103664_2012-06-27_11-57-29_Sample1.csv: ASCII text, with CRLF line 
>> terminators
>> ===
>> 
>> 
>> Unfortunately, the raw file doesn't contain the precious quotes. Sorry to 
>> bother you with this question, which is not totally related to R (which I'm 
>> using). Do you know if there are any facilities in R to get the data into 
>> the right format?
> 
> As you already observe, there can't be. There's just no way of seeing whether 
> 32,7,8 is 32.7, 8.0 or 32.0, 7.8. That's why the "usual" CSV format in 
> jurisdictions with comma as decimal separator has semicolon as the field 
> separator.
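
For reference, R reads that semicolon-separated layout directly with
read.csv2(). A minimal sketch, assuming the vendor could export in that
form; the txt string is made up from the two French rows above, not taken
from the real file:

```r
# The French sample rewritten the way a comma-decimal locale usually
# exports CSV: ';' between fields, ',' inside numbers.
txt <- "Time;Value
32;-7,183246E-02
32,05;3,469364E-02"

# read.csv2() defaults to sep = ";" and dec = ","
d <- read.csv2(text = txt)
str(d)   # both columns come back as numeric
```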
> 
> You may be able to scrape through with various heuristics, such as
> 
> - a leading 0 likely belongs to a decimal part 
> - if there's an exponent, then there is likely a decimal point
> - there can't be a sign in the decimal part
> - times should be (roughly) equidistant and in increasing sequence
> 
> R is fairly well equipped with tools to let you create code to do this, look 
> at strsplit(), count.fields(), grep() etc., but it will be a fair bit of 
> work, and you may still end up with a handful of truly ambiguous cases.
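
A small sketch of the first triage step with those tools, assuming two
numbers per line; the lines vector stands in for readLines() on the real
file. Counting comma-separated pieces per line already separates the easy
cases from the doubtful ones:

```r
lines <- c("Time,Value", "17.05,-6.352454E-02",
           "32,-7,183246E-02", "32,05,3,469364E-02")

# count.fields() wants a file or connection, so wrap the text in one
con <- textConnection(lines)
n <- count.fields(con, sep = ",")
close(con)

# The same count via strsplit(), for comparison
n2 <- lengths(strsplit(lines, ",", fixed = TRUE))

# 2 pieces: no decimal comma; 4 pieces: both numbers have one;
# 3 pieces: exactly one decimal comma, position unknown
split(lines, n)
```

Only the group with 3 pieces needs the sign, exponent and leading-zero
rules; the other two groups can be parsed mechanically.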
> 
> Ultimately, however, the issue is that someone messed up the collection of 
> data and is now trying to make it your problem. In a consulting situation, that 
> should cost serious extra money. Other options include 
> 
> - redo the data, this time with all computers set to English (if there's an 
> internal storage format, this could be more realistic than you think)
> - return the faulty data collecting software (some time back in the 1990's, 
> the Paradox data base had the same stupid bug of double usage of commas, but 
> it is really unheard of in 2012)
> 
> -- 
> Peter Dalgaard, Professor,
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> Phone: (+45)38153501
> Email: pd....@cbs.dk  Priv: pda...@gmail.com
> 

--
Guillaume Meurice - PhD
Plateforme de Bioinformatique

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
