Dear all, thanks for your support.
I'll set up a heuristic to catch most of the cases. The handful of ambiguous cases I'll exclude from the analysis, and I'll send feedback to the software vendor so that they can correct the format of the CSV file.

Best,
--
Guillaume

On Jul 23, 2012, at 4:54 PM, peter dalgaard wrote:

>
> On Jul 23, 2012, at 15:06 , Guillaume Meurice wrote:
>
>> Dear all,
>>
>> I have an encoding problem that I'm not familiar with.
>> Here is the case:
>> I'm reading data files that may have been generated on a computer with
>> global settings in either French or English.
>>
>> Here is an example of a data file:
>>
>> * English output
>> Time,Value
>> 17,-0.0753953
>> 17.05,-6.352454E-02
>>
>> * French output
>> Time,Value
>> 32,-7,183246E-02
>> 32,05,3,469364E-02
>>
>> In the first case, I can retrieve both columns by splitting each line
>> on the comma separator.
>> In the second case, that's impossible, since the comma is also used (in
>> French) as the decimal separator. Usually, the French CSV file format adds
>> quotes to distinguish the comma used as column separator from the comma
>> used as decimal separator, like the following:
>>
>> Time,Value
>> 32,"-7,183246E-02"
>> "32,05","3,469364E-02"
>>
>> Since I'm expecting 2 numbers, I can decide that if there are 3 commas, the
>> first two pieces are to be joined, as are the two remaining ones.
>> But in the case of only two commas, which number is the floating-point one?
>> (I know that it is the second one, but this is a great source of bugs ...)
>>
>> The Unix tool "file" returns:
>> ===
>> $ file P23_RD06_High\ Sensitivity\ DNA\ Assay_DE04103664_2012-06-27_11-57-29_Sample1.csv
>> $ P23_RD06_High Sensitivity DNA Assay_DE04103664_2012-06-27_11-57-29_Sample1.csv: ASCII text, with CRLF line terminators
>> ===
>>
>>
>> Unfortunately, the raw file doesn't contain the precious quotes. So sorry to
>> bother you with this question, which is not totally related to R (which I'm
>> using).
>> Do you know if there are any facilities in R to get the data into the
>> right format?
>
> As you already observe, there can't be. There's just no way of seeing whether
> 32,7,8 is 32.7, 8.0 or 32.0, 7.8. That's why the "usual" CSV format in
> jurisdictions with comma as decimal separator has semicolon as the field
> separator.
>
> You may be able to scrape through with various heuristics, such as
>
> - a leading 0 likely belongs to a decimal part
> - if there's an exponent, then there is likely a decimal point
> - there can't be a sign in the decimal part
> - times should be (roughly) equidistant and in increasing sequence
>
> R is fairly well equipped with tools to let you create code to do this; look
> at strsplit(), count.fields(), grep() etc., but it will be a fair bit of
> work, and you may still end up with a handful of truly ambiguous cases.
>
> Ultimately, however, the issue is that someone messed up the collection of
> data and is now trying to make it your problem. In a consulting situation,
> that should cost serious extra money. Other options include
>
> - redo the data, this time with all computers set to English (if there's an
>   internal storage format, this could be more realistic than you think)
> - return the faulty data collecting software (some time back in the 1990s,
>   the Paradox database had the same stupid bug of double usage of commas, but
>   it is really unheard of in 2012)
>
> --
> Peter Dalgaard, Professor,
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> Phone: (+45)38153501
> Email: pd....@cbs.dk Priv: pda...@gmail.com

--
Guillaume Meurice - PhD
Plateforme de Bioinformatique
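
For illustration, three of the heuristics from the quoted reply (no sign in a decimal part, a leading 0 belongs to a decimal part, an exponent implies a decimal part) could be sketched in R roughly as follows. This is only a sketch under assumptions, not the code actually used: the function name parse_ambiguous_line is hypothetical, each line is assumed to hold exactly two numbers (Time, Value) in the French format, and the "equidistant times" heuristic is not implemented. Lines with zero or several valid readings are returned as NA so that they can be excluded.

```r
# Hypothetical helper: parse one "Time,Value" line in which the comma is
# both the field separator and the decimal separator.
# Returns c(time, value), or c(NA, NA) if the line is ambiguous or malformed.
parse_ambiguous_line <- function(line) {
  p <- strsplit(line, ",", fixed = TRUE)[[1]]

  # A piece can open a number only if it is an integer with an optional
  # sign and no leading zero (a leading 0 suggests a decimal part).
  is_int_part <- function(x) grepl("^[+-]?(0|[1-9][0-9]*)$", x)
  # A piece can be a decimal part only if it carries no sign; it may end
  # with an exponent (an exponent is taken to imply a decimal part).
  is_dec_part <- function(x) grepl("^[0-9]+([eE][+-]?[0-9]+)?$", x)

  # Build one field from 1 piece (integer) or 2 pieces (integer, decimals).
  field <- function(pieces) {
    if (length(pieces) == 1L && is_int_part(pieces)) {
      as.numeric(pieces)
    } else if (length(pieces) == 2L && is_int_part(pieces[1]) &&
               is_dec_part(pieces[2])) {
      as.numeric(paste(pieces, collapse = "."))
    } else {
      NA_real_
    }
  }

  n <- length(p)
  if (n < 2L || n > 4L) return(c(NA_real_, NA_real_))

  # Try giving the first field 1 piece, then 2 pieces; keep valid readings.
  candidates <- list()
  for (k in 1:2) {
    if (n - k < 1L || n - k > 2L) next
    f <- c(field(p[seq_len(k)]), field(p[(k + 1L):n]))
    if (!any(is.na(f))) candidates[[length(candidates) + 1L]] <- f
  }

  # Accept the line only if exactly one reading survives the heuristics.
  if (length(candidates) == 1L) candidates[[1]] else c(NA_real_, NA_real_)
}

# Usage on the two French lines from the thread plus an ambiguous one.
lines <- c("32,-7,183246E-02", "32,05,3,469364E-02", "32,7,8")
parsed <- t(sapply(lines, parse_ambiguous_line, USE.NAMES = FALSE))
```

With this sketch the two sample lines from the French output each have a single valid reading, while 32,7,8 has two and comes back as NA. Files written with English settings would not need any of this and can be read with read.csv() directly.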