Hi Bert,

After running sapply(your_dataframe, "class"), it seems R does recognize the field as numeric after all. My suspicion is that the problem lies in how the import_list() function from the rio package reads the data.
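If the type guessing at import time is indeed the culprit, one way to sidestep it might be to read each sheet with readxl directly, using a large guess_max, and stack the results afterwards. What follows is only a minimal sketch under stated assumptions: every sheet has identical columns, and read_all_sheets is a hypothetical helper name, not a function from rio or readxl.

```r
library(readxl)

# Hypothetical helper (not part of rio or readxl): read every sheet of a
# workbook with a large guess_max, then stack the sheets row-wise.
# Assumes all sheets share the same column names.
read_all_sheets <- function(path, guess_max = 1048576) {
  sheets <- excel_sheets(path)  # names of all sheets in the workbook
  pieces <- lapply(sheets, function(s) {
    # A large guess_max lets read_excel() look past leading blank cells
    # before it settles on a column type.
    as.data.frame(read_excel(path, sheet = s, guess_max = guess_max))
  })
  do.call(rbind, pieces)  # row-bind the per-sheet data frames
}
```

Usage would be along the lines of figidat <- read_all_sheets(path). A column that is blank in every row of a given sheet would still be guessed as logical for that sheet; passing col_types to read_excel() explicitly is the stricter alternative.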
Best regards,
Paul

On Tue, Jan 30, 2024 at 14:59, Paul Bernal (<paulberna...@gmail.com>) wrote:

> Hi Bert,
>
> Below is the information you asked me for:
>
> nrow(mydataset)
> [1] 2986276
>
> ########
>
> sapply(mydataset, "class")
> $`Transit Date`
> [1] "POSIXct" "POSIXt"
>
> $`Market Segment`
> [1] "character"
>
> $`Número de Tránsitos`
> [1] "numeric"
>
> $`Tar No`
> [1] "character"
>
> $`Beam Range (Operations)`
> [1] "character"
>
> $`Operational Vessel Ranges Group`
> [1] "character"
>
> $`Rcnst PCUMS`
> [1] "numeric"
>
> $`Toll Amount`
> [1] "numeric"
>
> $Beam
> [1] "numeric"
>
> $Length
> [1] "numeric"
>
> $`Trn Draft (FT)`
> [1] "numeric"
>
> $`Other Income Amt`
> [1] "numeric"
>
> $`Total Other Income Amount`
> [1] "logical"
>
> $`Booking Charges`
> [1] "numeric"
>
> $`Booking Cancellation`
> [1] "logical"
>
> $`Booking Auction`
> [1] "logical"
>
> $`_file`
> [1] "integer"
>
> Hope this helps you understand what I am dealing with.
>
> Cheers,
> Paul
>
> On Tue, Jan 30, 2024 at 14:19, Bert Gunter (<bgunter.4...@gmail.com>)
> wrote:
>
>> Incidentally, "didn't work" is not very useful information. Please tell
>> us exactly what error message or apparently aberrant result you received.
>> Also, what do you get from:
>>
>> sapply(your_dataframe, "class")
>> nrow(your_dataframe)
>>
>> (as I suspect what you think it is, isn't).
>>
>> Cheers,
>> Bert
>>
>> On Tue, Jan 30, 2024 at 11:01 AM Bert Gunter <bgunter.4...@gmail.com>
>> wrote:
>>
>>> "I cannot change the data type from
>>> boolean to numeric. I tried doing dataset$my_field =
>>> as.numeric(dataset$my_field), I also tried to do dataset <-
>>> dataset[complete.cases(dataset), ], didn't work either. "
>>>
>>> Sorry, but all I can say is: huh?
>>>
>>> > dt <- data.frame(a = c(NA,NA, FALSE, TRUE), b = 1:4)
>>> > dt
>>>       a b
>>> 1    NA 1
>>> 2    NA 2
>>> 3 FALSE 3
>>> 4  TRUE 4
>>> > sapply(dt, class)
>>>         a         b
>>> "logical" "integer"
>>> > dt$a <- as.numeric(dt$a)
>>> > dt
>>>    a b
>>> 1 NA 1
>>> 2 NA 2
>>> 3  0 3
>>> 4  1 4
>>> > sapply(dt, class)
>>>         a         b
>>> "numeric" "integer"
>>>
>>> So either I'm missing something or you are. Happy to be corrected and
>>> chastised if the former.
>>>
>>> Cheers,
>>> Bert
>>>
>>>
>>> On Tue, Jan 30, 2024 at 10:41 AM Paul Bernal <paulberna...@gmail.com>
>>> wrote:
>>>
>>>> Dear friend Duncan,
>>>>
>>>> Thank you so much for your kind reply. Yes, that is exactly what is
>>>> happening, there are a lot of NA values at the start, so R assumes
>>>> that the field is of type boolean. The challenge that I am facing is
>>>> that I want to read into R an Excel file that has many sheets (46 in
>>>> this case) but I wanted to combine all 46 sheets into a single
>>>> dataframe (since the columns are exactly the same for all 46 sheets).
>>>> The rio package does this nicely, the problem is that, once I have
>>>> the full dataframe (which amounts to roughly 2.98 million rows
>>>> total), I cannot change the data type from boolean to numeric. I
>>>> tried doing dataset$my_field = as.numeric(dataset$my_field), I also
>>>> tried to do dataset <- dataset[complete.cases(dataset), ], didn't
>>>> work either.
>>>>
>>>> The only thing that worked for me was to take a single sheet and
>>>> through the read_excel function use the guess_max parameter and set
>>>> it to a sufficiently large number (a number >= to the total amount of
>>>> the full merged dataset). I want to automate the merging of the N
>>>> number of Excel sheets so that I don't have to be manually doing it.
>>>> Unless there is a way to accomplish something similar to what rio's
>>>> package function import_list does, that is able to keep the field's
>>>> numeric data type nature.
>>>>
>>>> Cheers,
>>>> Paul
>>>>
>>>> On Tue, Jan 30, 2024 at 12:23, Duncan Murdoch
>>>> (<murdoch.dun...@gmail.com>) wrote:
>>>>
>>>> > On 30/01/2024 11:10 a.m., Paul Bernal wrote:
>>>> > > Dear friends,
>>>> > >
>>>> > > Hope you are doing well. I am currently using R version 4.3.2,
>>>> > > and I have a .xlsx file that has 46 sheets on it. I basically
>>>> > > combined all 46 sheets and read them as a single dataframe in R
>>>> > > using package rio.
>>>> > >
>>>> > > I read a solution using package readxl, as suggested in a
>>>> > > StackOverflow discussion as follows:
>>>> > > df <- read_excel(path = filepath, sheet = sheet_name, guess_max = 100000).
>>>> > > Now, when you have so many sheets (46 in my case) in an Excel
>>>> > > file, the rio methodology is more practical.
>>>> > >
>>>> > > This is what I did:
>>>> > > path = "C:/Users/myuser/Documents/DataScienceF/Forecast_and_Econometric_Analysis_FIGI (4).xlsx"
>>>> > > figidat = import_list(path, rbind = TRUE) #here figidat refers to my dataset
>>>> > >
>>>> > > Now, it successfully imports and merges all records, however, R
>>>> > > interprets some fields as boolean despite their being numeric.
>>>> > >
>>>> > > Here is the structure of the field that is causing me problems
>>>> > > (I apologize for the length):
>>>> > > structure(list(StoreCharges = c(NA, NA, NA, NA, NA, NA, NA, NA,
>>>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>>> > > NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
>>>> > ...
>>>> > > FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, NA, NA,
>>>> > > FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
>>>> > > FALSE, FALSE, FALSE)), class = c("tbl_df", "tbl", "data.frame"
>>>> > > ), row.names = c(NA, -7033L))
>>>> > >
>>>> > > As you can see, when I do the dput, it gives me a bunch of TRUE
>>>> > > and FALSE values, when in reality I have records with value $0,
>>>> > > records with amounts > $0 and also a bunch of blank records.
>>>> > >
>>>> > > Any help will be greatly appreciated.
>>>> >
>>>> > I don't know how read_excel() determines column types, but some
>>>> > functions look only at the first n rows to guess the type. It appears
>>>> > you have a lot of NA values at the start. That is a logical value, so
>>>> > that might be what is going wrong.
>>>> >
>>>> > In read.table() and related functions, you can specify the types of
>>>> > column explicitly. It sounds as though that's what you should do if
>>>> > read_excel() offers that as a possibility.
>>>> >
>>>> > Duncan Murdoch
>>>> >
>>>>
>>>> [[alternative HTML version deleted]]
>>>>
>>>> ______________________________________________
>>>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.