Say I have a tab-delimited table that I want to read into R. What should I
expect to happen if some of the entries contain an apostrophe (')? I
thought the file would be read without problems, but that is not what happens.
Instead, everything between two apostrophes, including the tabs and
newlines, is read into a single field, and the rows in between are
swallowed into it. Is this a bug, and is there a fix other than
removing the offending characters?

Example Input file:

testFile.txt:
3499    9031    424823  COP'B2  118094989       XP_422637.2
3499    7955    114454  copb2   50080158        NP_001001940.1
3499    7227    45757   betaCop 24584107        NP_524836.2
3499    7165    1278426 AgaP_AGAP004798 158297839       XP_318012.4
3499    6239    177779  F38E11.5        17540286        NP_501671.1
3499    4896    2540050 sec'27  19113604        NP_596811.1
3499    4932    852740  SEC27   6321301 NP_011378.1
3499    28985   2897447 KLLA0B01958g    50303353        XP_451618.1
3499    33169   4621659 AGOS_AFL118W    45198403        NP_985432.1
3499    148305  2682116 MGG_10504       145615762       XP_366285.2
3499    5141    2709504 NCU07319.1      32414251        XP_327605.1
3499    3702    820842  AT3G15980       30683862        NP_850592.1
3499    3702    841666  AT1G52360       15218215        NP_175645.1
3499    3702    844339  AT1G79990       30699476        NP_178116.2
3499    4530    4340097 Os06g0143900    115466360       NP_001056779.1

testDat <- read.table('testFile.txt',sep='\t')
testDat

     V1     V2      V3
1  3499   9031  424823
2  3499   4932  852740
3  3499  28985 2897447
4  3499  33169 4621659
5  3499 148305 2682116
6  3499   5141 2709504
7  3499   3702  820842
8  3499   3702  841666
9  3499   3702  844339
10 3499   4530 4340097



   V4
1  COPB2\t118094989\tXP_422637.2\n3499\t7955\t114454\tcopb2\t50080158\tNP_001001940.1\n3499\t7227\t45757\tbetaCop\t24584107\tNP_524836.2\n3499\t7165\t1278426\tAgaP_AGAP004798\t158297839\tXP_318012.4\n3499\t6239\t177779\tF38E11.5\t17540286\tNP_501671.1\n3499\t4896\t2540050\tsec27
2  SEC27
3  KLLA0B01958g
4  AGOS_AFL118W
5  MGG_10504
6  NCU07319.1
7  AT3G15980
8  AT1G52360
9  AT1G79990
10 Os06g0143900

          V5             V6
1   19113604    NP_596811.1
2    6321301    NP_011378.1
3   50303353    XP_451618.1
4   45198403    NP_985432.1
5  145615762    XP_366285.2
6   32414251    XP_327605.1
7   30683862    NP_850592.1
8   15218215    NP_175645.1
9   30699476    NP_178116.2
10 115466360 NP_001056779.1
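
A possible explanation and workaround, offered as a sketch rather than a confirmed diagnosis of this particular file: by default read.table() treats both " and ' as quoting characters (quote = "\"'"), so the text between two apostrophes is read as one quoted string. Passing quote = "" turns quote handling off. The example below writes two of the rows above to a temporary file (tmp is a name made up for illustration) and reads them back.

```r
# Two rows from the example above, written with real tab separators.
tmp <- tempfile(fileext = ".txt")   # hypothetical temporary file
writeLines(c("3499\t9031\t424823\tCOP'B2\t118094989\tXP_422637.2",
             "3499\t4896\t2540050\tsec'27\t19113604\tNP_596811.1"),
           tmp)

# quote = "" disables quote processing, so ' is read as an ordinary character.
testDat <- read.table(tmp, sep = "\t", quote = "", stringsAsFactors = FALSE)

dim(testDat)   # number of rows and columns actually read
testDat$V4     # the fields that contain apostrophes
```

If the file really does use double quotes around some fields, quote = "\"" keeps that behaviour while still treating apostrophes as plain text; read.delim() uses that setting by default.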

I would appreciate any feedback.

Thanks,

-Robert

> sessionInfo()
R version 2.12.1 (2010-12-16)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] tools_2.12.1


Robert M. Flight, Ph.D.
University of Louisville Bioinformatics Laboratory
University of Louisville
Louisville, KY

PH 502-852-1809 (HSC)
PH 502-852-0467 (Belknap)
EM robert.fli...@louisville.edu
EM rfligh...@gmail.com

Williams and Holland's Law:
       If enough data is collected, anything may be proven by
statistical methods.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
