On 01/11/2009 7:43 AM, onyourmark wrote:
Hi. I have a huge list called twitter:
It's a list, but more importantly it's a VCorpus and a Corpus. You
should use the functions appropriate to those classes to extract the
strings making up the data, declare their encoding properly (or convert
them to your native encoding), then use read.delim() on a textConnection
to read them in.
Duncan Murdoch
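A minimal sketch of the steps suggested above, not a tested recipe for this corpus. It assumes the records are UTF-8 and that each string holds one complete 14-field record. The column names are hypothetical (the post does not name them), and the `lines` vector stands in for whatever the corpus accessor returns.

```r
# Stand-in for the strings pulled out of the corpus. With tm this might be
# as.character(twitter[[1]]) (an assumption; check your tm version).
lines <- c(
  "547282;06:37:17;21;10;2009;dani_jade18;@Laura_Whyte1 day :p;Huddersfield/Lincoln;United Kingdom;Kirklees;Kirklees;;53.6468475;-1.7727296",
  "547283;06:37:17;21;10;2009;fabiomafra;algu\u00e9m traz mais lenha pro computador da facool? BOM DIA.;Belo Horizonte - MG - BR;Brazil;MG;;;-19.8157306;-43.9542226"
)

# Declare the encoding (assumed to be UTF-8 here) and drop any records that
# are not valid UTF-8, since those are what trigger "invalid input" errors.
Encoding(lines) <- "UTF-8"
lines <- lines[validUTF8(lines)]

# Hypothetical column names, guessed from the sample records.
col_names <- c("id", "time", "day", "month", "year", "user", "text",
               "location", "country", "region", "district", "extra",
               "lat", "lon")

# Read the semicolon-separated records through a text connection.
# quote = "" stops quotes and apostrophes inside tweets from being
# treated as field delimiters.
con <- textConnection(lines)
tweets <- read.delim(con, sep = ";", header = FALSE, quote = "",
                     col.names = col_names, stringsAsFactors = FALSE)
close(con)

str(tweets)
```

Records whose tweet text itself contains a semicolon (for example the escaped `\;` in one of the sample tweets) would be split into too many fields by this sketch and would need separate handling.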
dim(twitter)
NULL
str(twitter)
List of 1
$ :Classes 'PlainTextDocument', 'TextDocument', 'character' atomic
[1:35575] 11999;10:47:14;20;10;2009;ObamaLouverture;Trails Mixed Lessons For
Governance From Campaigner-in-chief: President obama jumps campaign 09
tuesday.. http://bit.ly/2eHMaN;Florida;USA;FL;;;27.6648274;-81.5157535
12210;10:47:37;20;10;2009;David_Stringer;William Hague heading Washington
meets Gen. Jim Jones, Sen. John McCain others. Will Obama team raise
worries EU ties?;London, England;United Kingdom;Greater
London;Westminster;;51.5001524;-0.1262362
12355;10:47:53;20;10;2009;Singsabit;RT @Drudge_Report PAPER: Excuses wearing
thin Obama, media pals... http://tinyurl.com/yfw6cd9;So.
California;USA;CA;;;36.778261;-119.4179324
12407;10:47:59;20;10;2009;obamavideonews;Obama News Obama Afghanistan
troop decision timing (AFP) : AFP - Pres.. http://bit.ly/3KPUr8 #obama
#video;USA;USA;;;;37.09024;-95.712891 ...
.. ..- attr(*, "Author")= chr(0)
.. ..- attr(*, "DateTimeStamp")= POSIXlt[1:9], format: "2009-10-31
04:46:56"
.. ..- attr(*, "Description")= chr(0)
.. ..- attr(*, "Heading")= chr(0)
.. ..- attr(*, "ID")= chr "1"
.. ..- attr(*, "Language")= chr "en"
.. ..- attr(*, "LocalMetaData")= list()
.. ..- attr(*, "Origin")= chr(0)
- attr(*, "CMetaData")=List of 3
..$ NodeID : num 0
..$ MetaData:List of 2
.. ..$ create_date: POSIXlt[1:9], format: "2009-10-31 04:46:56"
.. ..$ creator : Named chr ""
.. .. ..- attr(*, "names")= chr "LOGNAME"
..$ Children: NULL
..- attr(*, "class")= chr "MetaDataNode"
- attr(*, "DMetaData")='data.frame': 1 obs. of 1 variable:
..$ MetaID: num 0
- attr(*, "class")= chr [1:3] "VCorpus" "Corpus" "list"
It contains tweets in many languages. The "columns" are separated by
semicolons. I am using the tm package, and the object is a tm corpus.
It looks like this:
547282;06:37:17;21;10;2009;dani_jade18;@Laura_Whyte1 day
:p;Huddersfield/Lincoln;United
Kingdom;Kirklees;Kirklees;;53.6468475;-1.7727296
547283;06:37:17;21;10;2009;fabiomafra;alguém traz mais lenha pro computador
da facool? BOM DIA.;Belo Horizonte - MG -
BR;Brazil;MG;;;-19.8157306;-43.9542226
547284;06:37:17;21;10;2009;romanotr;Вау, "Репортеры без границ" опубликовали
список стран со свободой слова, из 173 Грузия на 81 месте опережая Украину.
Успехи,успехи...;Portugal Aveiro;Portugal;Aveiro;;;40.6411848;-8.6536169
547285;06:37:18;21;10;2009;Y_T_;Playing: Beth Orton <\;Someone's
Daughter>\;;Kanazawa, Japan;Japan;Ishikawa
Prefecture;;;36.5613254;136.6562051
Error: invalid input
'547286;06:37:18;21;10;2009;Atogey;支æŒä½
,国家需è¦ä»–们,但是国家的未æ¥ä¸èƒ½é 他们…RT
@zuola ￿我觉得 @wenyunc
I want to split it into fields (columns), so I thought I should
convert it to a data frame. I tried:
twitterDF<-as.data.frame(twitter)
Error in sort.list(y) :
invalid input
'547286;06:37:18;21;10;2009;Atogey;支æŒä½
,国家需è¦ä»–们,但是国家的未æ¥ä¸èƒ½é 他们…RT
@zuola ￿我觉得 @wenyunchao
一点都ä¸ä¹è§‚。真æ£çš„ä¹è§‚åº”è¯¥æ˜¯ï¼šä½ å…³æˆ‘åˆæ€Žä¹ˆæ
·ï¼Œåæ£æ”¿æ²»æ–—争ä¸ä¼šä¸¢æŽ‰æ€§å‘½ï¼Œè€å出æ¥åŽæ›´æ˜¯ä¸€æ¡å¥½æ±‰ã€‚北风还是èˆä¸å¾—*霸地ä½ã€è‚‰ã€ä¹¦ã€å¥³äººå’Œç½‘络的,ä¸è¿‡ç‰¢é‡Œä¸ä¼šæä¾›è¿™äº›ã€‚å¦â€¦;山西,浙江;China;Zhejiang;;;28.695035;119.751054'
in 'utf8towcs'
Can anyone suggest what I can do?
P.S. Actually, I would love to remove all the non-English tweets, but I have
no idea how to do that.
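One crude heuristic for the P.S., offered as a sketch only: keep the tweets whose text is pure ASCII. This is not language detection. It would wrongly drop English tweets containing accented names or curly quotes, and it would keep non-English tweets written without accents. The `texts` vector is a hypothetical stand-in for the tweet text column.

```r
# Hypothetical stand-in for the tweet text column.
texts <- c(
  "Obama News Obama Afghanistan troop decision timing",
  "algu\u00e9m traz mais lenha pro computador da facool? BOM DIA.",
  "William Hague heading Washington"
)

# TRUE when every code point in the string is below 128 (plain ASCII).
# utf8ToInt() yields NA for invalid UTF-8, which isTRUE() turns into FALSE.
is_ascii <- vapply(
  texts,
  function(s) isTRUE(all(utf8ToInt(s) < 128L)),
  logical(1),
  USE.NAMES = FALSE
)

# Keep only the ASCII-only tweets as a rough "probably English" subset.
english_guess <- texts[is_ascii]
print(english_guess)
```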
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help