Dear all

thanks for support.

I'll set up an heuristic so as to catch most of the case. 
For the handful of ambiguous case, I'll exclude them from analysis, with a 
feedback to the software editor so as he can correct the format of the csv file.

Bests,
--
Guillaume



Le Jul 23, 2012 à 4:54 PM, peter dalgaard a écrit :

> 
> On Jul 23, 2012, at 15:06 , Guillaume Meurice wrote:
> 
>> Dear all, 
>> 
>> I have some encoding problem which I'm not familiar with.
>> Here is the case : 
>> I'm read data files which can have been generated from a  computer either 
>> with global settings in french or in english.
>> 
>> Here is an exemple ouf data file :
>> 
>> * English output
>> Time,Value
>> 17,-0.0753953
>> 17.05,-6.352454E-02
>> 
>> * French output.
>> Time,Value
>> 32,-7,183246E-02
>> 32,05,3,469364E-02
>> 
>> In the first case, I can totally retrieve both columns, splitting each line 
>> using the comma as a separator.
>> In the second case, it's impossible, since the comma (in french) is also 
>> used to separate decimal. Usually, the CSV french file format add some 
>> quote, to distinguish the comma used as column separator from comma used as 
>> decimal, like the following : 
>> 
>> Time,Value
>> 32,"-7,183246E-02"
>> "32,05","3,469364E-02"
>> 
>> Since I'm expecting 2 numbers, I can set that if there is 3 comma, the first 
>> two number are to be gathered as well as the two lefting ones.
>> But in case of only two comma, which number is the floating one (I know that 
>> it is the second one, but who this is a great source of bugs ...).
>> 
>> the unix tools "file" returns : 
>> ===
>> $ file P23_RD06_High\ Sensitivity\ DNA\ 
>> Assay_DE04103664_2012-06-27_11-57-29_Sample1.csv 
>> $ P23_RD06_High Sensitivity DNA 
>> Assay_DE04103664_2012-06-27_11-57-29_Sample1.csv: ASCII text, with CRLF line 
>> terminators
>> ===
>> 
>> 
>> Unfortunately, the raw file doesn't contains the precious quote. So sorry to 
>> bother with this question which is not totally related to R (which I'm 
>> using). Do you know if there any facilities using R to get the data in the 
>> good format ?
> 
> As you already observe, there can't be. There's just no way of seeing whether 
> 32,7,8 is 32.7, 8.0 or 32.0, 7.8. That's why the "usual" CSV format in 
> jurisdictions with comma as decimal separator has semicolon as the field 
> separator.
> 
> You may be able to scrape through with various heuristics, such as
> 
> - a leading 0 likely belongs to a decimal part 
> - if there's an exponent, then there is likely a decimal point
> - there can't be a sign in the decimal part
> - times should be (roughly) equidistant and in increasing sequence
> 
> R is fairly well equipped with tools to let you create code to do this, look 
> at strsplit(), count.fields(), grep() etc., but it will be a fair bit of 
> work, and you may still end up with a handful of truly ambiguous cases.
> 
> Ultimately, however, the issue is that someone messed up the collection of 
> data and now try to make it your problem. In a consulting situation, that 
> should cost serious extra money. Other options include 
> 
> - redo the data, this time with all computers set to English (if there's an 
> internal storage format, this could be more realistic than you think)
> - return the faulty data collecting software (some time back in the 1990's, 
> the Paradox data base had the same stupid bug of double usage of commas, but 
> it is really unheard of in 2012)
> 
> -- 
> Peter Dalgaard, Professor,
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> Phone: (+45)38153501
> Email: pd....@cbs.dk  Priv: pda...@gmail.com
> 
> 
> 
> 
> 
> 
> 
> 

--
Guillaume Meurice - PhD
Plateforme de Bioinformatique






        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to