On 01/11/2009 7:43 AM, onyourmark wrote:
Hi. I have a huge list called twitter:

It's a list, but more importantly it's a VCorpus and a Corpus. You should use the functions appropriate to those classes to extract the strings making up the data, declare their encoding properly (or convert them to your native encoding), then use read.delim() on a textConnection to read them in.
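
The steps above can be sketched roughly as follows, using base R only. This is a sketch under assumptions, not a tested recipe for this corpus: `raw_lines` is a hypothetical stand-in for the strings once they have been extracted from the corpus, the two sample rows are shortened versions of rows from the question, and the data is assumed to be UTF-8.

```r
# Stand-in for the strings extracted from the corpus (hypothetical name).
# With the object shown in the question, as.character(twitter[[1]]) might
# return them, but that is an assumption about the tm version in use.
raw_lines <- c(
  "12210;10:47:37;20;10;2009;David_Stringer;William Hague heading Washington;London, England;United Kingdom;Greater London;Westminster;;51.5001524;-0.1262362",
  "547283;06:37:17;21;10;2009;fabiomafra;algu\u00e9m traz mais lenha;Belo Horizonte - MG - BR;Brazil;MG;;;-19.8157306;-43.9542226"
)

# Declare the encoding. This only marks the strings; it does not convert them.
Encoding(raw_lines) <- "UTF-8"

# Replace any bytes that are not valid UTF-8 with their hex codes, so that
# later steps do not stop with an "invalid input" error.
clean_lines <- iconv(raw_lines, from = "UTF-8", to = "UTF-8", sub = "byte")

# Read the semicolon-separated fields through a text connection.
# quote = "" keeps quotation marks inside tweets from being treated as
# field quoting; read.delim() already ignores "#" as a comment character.
con <- textConnection(clean_lines)
tweets <- read.delim(con, sep = ";", header = FALSE, quote = "",
                     stringsAsFactors = FALSE)
close(con)

str(tweets)
```

The columns come back as V1 to V14 because the data has no header row. Tweets that themselves contain a semicolon (such as the escaped `&lt\;` in the sample) would still need separate handling.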

Duncan Murdoch


dim(twitter)
NULL
str(twitter)
List of 1
 $ :Classes 'PlainTextDocument', 'TextDocument', 'character'  atomic
[1:35575] 11999;10:47:14;20;10;2009;ObamaLouverture;Trails Mixed Lessons For
Governance From Campaigner-in-chief: President obama jumps campaign 09 tuesday.. http://bit.ly/2eHMaN;Florida;USA;FL;;;27.6648274;-81.5157535 12210;10:47:37;20;10;2009;David_Stringer;William Hague heading Washington meets Gen. Jim Jones, Sen. John McCain others. Will Obama team raise
worries  EU ties?;London, England;United Kingdom;Greater
London;Westminster;;51.5001524;-0.1262362
12355;10:47:53;20;10;2009;Singsabit;RT @Drudge_Report PAPER: Excuses wearing
thin  Obama, media pals... http://tinyurl.com/yfw6cd9;So.
California;USA;CA;;;36.778261;-119.4179324
12407;10:47:59;20;10;2009;obamavideonews;Obama News Obama   Afghanistan
troop decision timing (AFP) : AFP - Pres.. http://bit.ly/3KPUr8 #obama
#video;USA;USA;;;;37.09024;-95.712891 ...
  .. ..- attr(*, "Author")= chr(0)
  .. ..- attr(*, "DateTimeStamp")= POSIXlt[1:9], format: "2009-10-31 04:46:56"
  .. ..- attr(*, "Description")= chr(0)
  .. ..- attr(*, "Heading")= chr(0)
  .. ..- attr(*, "ID")= chr "1"
  .. ..- attr(*, "Language")= chr "en"
  .. ..- attr(*, "LocalMetaData")= list()
  .. ..- attr(*, "Origin")= chr(0)
 - attr(*, "CMetaData")=List of 3
  ..$ NodeID  : num 0
  ..$ MetaData:List of 2
  .. ..$ create_date: POSIXlt[1:9], format: "2009-10-31 04:46:56"
  .. ..$ creator    : Named chr ""
  .. .. ..- attr(*, "names")= chr "LOGNAME"
  ..$ Children: NULL
  ..- attr(*, "class")= chr "MetaDataNode"
 - attr(*, "DMetaData")='data.frame':   1 obs. of  1 variable:
  ..$ MetaID: num 0
 - attr(*, "class")= chr [1:3] "VCorpus" "Corpus" "list"

It contains tweets but in many languages. The "columns" are separated by
semi-colons. I am using the tm package and it is a "corpus".

It looks like this:

547282;06:37:17;21;10;2009;dani_jade18;@Laura_Whyte1   day
:p;Huddersfield/Lincoln;United
Kingdom;Kirklees;Kirklees;;53.6468475;-1.7727296
547283;06:37:17;21;10;2009;fabiomafra;alguém traz mais lenha pro computador
da facool? BOM DIA.;Belo Horizonte - MG -
BR;Brazil;MG;;;-19.8157306;-43.9542226
547284;06:37:17;21;10;2009;romanotr;Вау, "Репортеры без границ" опубликовали
список стран со свободой слова, из 173 Грузия на 81 месте опережая Украину.
Успехи,успехи...;Portugal Aveiro;Portugal;Aveiro;;;40.6411848;-8.6536169
547285;06:37:18;21;10;2009;Y_T_;Playing: Beth Orton &lt\;Someone's
Daughter&gt\;;Kanazawa, Japan;Japan;Ishikawa
Prefecture;;;36.5613254;136.6562051
Error: invalid input
'547286;06:37:18;21;10;2009;Atogey;æ”¯æŒä½ 
ï¼Œå›½å®¶éœ€è¦ä»–ä»¬ï¼Œä½†æ˜¯å›½å®¶çš„æœªæ¥ä¸èƒ½é ä»–ä»¬â€¦RT
@zuola ￿我觉得 @wenyunc

I want to split it into "fields" (columns), so I thought I should
convert it to a data frame. I tried:

twitterDF<-as.data.frame(twitter)
Error in sort.list(y) : invalid input
'547286;06:37:18;21;10;2009;Atogey;æ”¯æŒä½ 
ï¼Œå›½å®¶éœ€è¦ä»–ä»¬ï¼Œä½†æ˜¯å›½å®¶çš„æœªæ¥ä¸èƒ½é ä»–ä»¬â€¦RT
@zuola ￿我觉得 @wenyunchao
ä¸€ç‚¹éƒ½ä¸ä¹è§‚ã€‚çœŸæ­£çš„ä¹è§‚åº”è¯¥æ˜¯ï¼šä½ å…³æˆ‘åˆæ€Žä¹ˆæ 
·ï¼Œåæ­£æ”¿æ²»æ–—争不会丢掉性命,老子出来后更是一条好汉。北风还是舍不得*霸地位、肉、书、女人和网络的,不过牢里不会提供这些。另…;山西,浙江;China;Zhejiang;;;28.695035;119.751054'
in 'utf8towcs'

Can anyone suggest what I can do?
P.S. Actually, I would love to remove all the non-English tweets but I have
no clue about how to do that.
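
One crude way to approach the P.S. is to keep only tweets whose text is pure ASCII. This is a heuristic sketch, not language detection: it drops English tweets that contain accented or typographic characters, and it keeps non-English tweets that happen to use only ASCII. The vector `tweet_text` is a hypothetical stand-in for the text column.

```r
# Hypothetical stand-in for the tweet text column.
tweet_text <- c(
  "William Hague heading Washington",
  "algu\u00e9m traz mais lenha",   # Portuguese, contains a non-ASCII letter
  "\u0412\u0430\u0443"              # Russian, Cyrillic only
)

# TRUE when every code point in the string is below 128 (plain ASCII).
# utf8ToInt() returns NA for invalid UTF-8; isTRUE() turns that into FALSE.
ascii_only <- vapply(
  tweet_text,
  function(s) isTRUE(all(utf8ToInt(s) < 128L)),
  logical(1),
  USE.NAMES = FALSE
)

# Keep only the tweets that passed the ASCII check.
english_guess <- tweet_text[ascii_only]
print(english_guess)
```

The same logical vector could be used to subset the rows of a data frame once the fields have been read in.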


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
