[ https://issues.apache.org/jira/browse/ARROW-17432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617742#comment-17617742 ]
Guillermo Duran edited comment on ARROW-17432 at 10/14/22 2:19 PM:
-------------------------------------------------------------------

Thanks for your suggestion [~fmic]

As for why we have decimal notation in the checklist ID column... that's a good question, but that is simply how the data is exported from the eBird db.
{code:java}
$ head obs.csv
checklist_id,species_code,exotic_category,obs_count,only_presence_reported,only_slash_reported,valid,reviewed,has_media
771514.0,comgra,,1.0,0.0,0.0,1.0,0.0,0.0
771514.0,grycat,,1.0,0.0,0.0,1.0,0.0,0.0
771514.0,bkcchi,,1.0,0.0,0.0,1.0,0.0,0.0
771514.0,sonspa,,3.0,0.0,0.0,1.0,0.0,0.0
771514.0,bnhcow,,1.0,0.0,0.0,1.0,0.0,0.0
771514.0,amegfi,,4.0,0.0,0.0,1.0,0.0,0.0
771514.0,rewbla,,1.0,0.0,0.0,1.0,0.0,0.0
771515.0,yelwar,,4.0,0.0,0.0,1.0,0.0,0.0
771515.0,amecro,,1.0,0.0,0.0,1.0,0.0,0.0
{code}
I could do some regex trickery to strip the ".0" from the csv before _open_dataset()_, but in any case it's odd that the issue only appears when importing the csv with the R Arrow library. If I change that column's type to int32 in the _open_dataset()_ schema, I get a conversion error due to the mismatch.

If I do the subset with the decimal I still get the same weird rows:
{code:java}
sch = arrow::schema(checklist_id = float32(),
                    species_code = string(),
                    exotic_category = float32(),
                    obs_count = float32(),
                    only_presence_reported = float32(),
                    only_slash_reported = float32(),
                    valid = float32(),
                    reviewed = float32(),
                    has_media = float32()
                    )

csv_stream <- open_dataset(csv_file, format = "csv", col_types = sch)

csv_stream |>
  filter(checklist_id == 18543372.0) |>
  arrange(species_code) |>
  collect()

# A tibble: 50 × 9
   checklist_id species_code exotic_category obs_count only_presence_reported only_slash_reported valid reviewed has_media
          <dbl> <chr>                  <dbl>     <dbl>                  <dbl>               <dbl> <dbl>    <dbl>     <dbl>
 1     18543372 altori                    NA         3                      0                   0     1        0         0
 2     18543372 amekes                    NA         1                      0                   0     1        0         0
 3     18543372 amered                    NA        40                      0                   0     1        1         0
 4     18543372 amerob                    NA        30                      0                   0     1        0         0
 5     18543372 amerob                    NA         9                      0                   0     1        0         0
 6     18543372 balori                    NA         9                      0                   0     1        0         0
 7     18543372 blkter                    NA         9                      0                   0     1        0         0
 8     18543372 blkvul                    NA        20                      0                   0     1        0         0
 9     18543372 buggna                    NA         1                      0                   0     1        0         0
10     18543372 buwwar                    NA         1                      0                   0     1        0         0
# … with 40 more rows
# ℹ Use `print(n = ...)` to see more rows
{code}
Even if I filter by an interval, the result is the same:
{code:java}
csv_stream |>
  filter(checklist_id > 18543371 & checklist_id < 18543373) |>
  arrange(species_code) |>
  collect()

# A tibble: 50 × 9
   checklist_id species_code exotic_category obs_count only_presence_reported only_slash_reported valid reviewed has_media
          <dbl> <chr>                  <dbl>     <dbl>                  <dbl>               <dbl> <dbl>    <dbl>     <dbl>
 1     18543372 altori                    NA         3                      0                   0     1        0         0
 2     18543372 amekes                    NA         1                      0                   0     1        0         0
 3     18543372 amered                    NA        40                      0                   0     1        1         0
 4     18543372 amerob                    NA        30                      0                   0     1        0         0
 5     18543372 amerob                    NA         9                      0                   0     1        0         0
 6     18543372 balori                    NA         9                      0                   0     1        0         0
 7     18543372 blkter                    NA         9                      0                   0     1        0         0
 8     18543372 blkvul                    NA        20                      0                   0     1        0         0
 9     18543372 buggna                    NA         1                      0                   0     1        0         0
10     18543372 buwwar                    NA         1                      0                   0     1        0         0
# … with 40 more rows
# ℹ Use `print(n = ...)` to see more rows
{code}
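One thing that could explain the equality and interval filters matching the same rows: float32 cannot represent every integer above 2^24 (16,777,216), so neighbouring ids such as 18543371 and 18543373 may round to the same float32 value as 18543372. A minimal sketch to check this with Arrow casts (the `ids32` name is only illustrative):
{code:java}
# sketch: check whether neighbouring checklist ids collapse to a single
# float32 value; integers above 2^24 are not all exact in float32
library(arrow)

ids32 <- Array$create(c(18543371, 18543372, 18543373))$cast(float32())
ids32$cast(float64())
# if all three print as 18543372, an equality or interval filter on the
# float32 column cannot distinguish these checklists
{code}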
Also, thinking about the possible differences between the schema used to import the csv in Arrow's Python vs R, it seems that both are the "same":
{code:java}
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.compute as pc
import pandas as pd

test_rows_csv = pd.read_csv("/ebird_erd2021/full/obs.csv",
                            nrows = 1000)
sch = pa.Schema.from_pandas(test_rows_csv)

sch
checklist_id: double
species_code: string
exotic_category: double
obs_count: double
only_presence_reported: double
only_slash_reported: double
valid: double
reviewed: double
has_media: double
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 1391
{code}
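One visible difference, though: the pandas-inferred schema above types checklist_id as double (float64), while the R schema used earlier is float32, and float64 keeps integers exact up to 2^53. A minimal sketch of an R schema mirroring the pandas-inferred one (`sch64` and `csv_stream64` are illustrative names):
{code:java}
# sketch: mirror the pandas-inferred schema (double = float64) in R;
# float64 represents integers exactly up to 2^53, so ids near 1.8e7 stay exact
sch64 <- arrow::schema(checklist_id = float64(),
                       species_code = string(),
                       exotic_category = float64(),
                       obs_count = float64(),
                       only_presence_reported = float64(),
                       only_slash_reported = float64(),
                       valid = float64(),
                       reviewed = float64(),
                       has_media = float64())

csv_stream64 <- open_dataset(csv_file, format = "csv",
                             col_types = sch64)
{code}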
> [R] messed up rows when importing large csv into parquet
> --------------------------------------------------------
>
>                 Key: ARROW-17432
>                 URL: https://issues.apache.org/jira/browse/ARROW-17432
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 8.0.0, 9.0.0
>         Environment: R version 4.2.1
> Running in Arch Linux - EndeavourOS
>
> arrow_info()
> Arrow package version: 9.0.0
>
> Capabilities:
>
> dataset    TRUE
> substrait FALSE
> parquet    TRUE
> json       TRUE
> s3         TRUE
> gcs        TRUE
> utf8proc   TRUE
> re2        TRUE
> snappy     TRUE
> gzip       TRUE
> brotli     TRUE
> zstd       TRUE
> lz4        TRUE
> lz4_frame  TRUE
> lzo       FALSE
> bz2        TRUE
> jemalloc   TRUE
> mimalloc   TRUE
>
> Memory:
>
> Allocator jemalloc
> Current   49.31 Kb
> Max        1.63 Mb
>
> Runtime:
>
> SIMD Level          avx2
> Detected SIMD Level avx2
>
> Build:
>
> C++ Library Version  9.0.0
> C++ Compiler           GNU
> C++ Compiler Version 7.5.0
>
> ####
> print(pa.__version__)
> 9.0.0
>             Reporter: Guillermo Duran
>             Priority: Major
>
> This is a weird issue that creates new rows when importing a large csv (56 GB) into parquet in R. It occurred with both R Arrow 8.0.0 and 9.0.0 BUT didn't occur with the Python Arrow library 9.0.0. Due to the large size of the original csv it's difficult to create a reproducible example, but I share the code and outputs.
> The code I use in R to import the csv:
> {code:java}
> library(arrow)
> library(dplyr)
>
> csv_file <- "/ebird_erd2021/full/obs.csv"
> dest <- "/ebird_erd2021/full/obs_parquet/"
>
> sch = arrow::schema(checklist_id = float32(),
>                     species_code = string(),
>                     exotic_category = float32(),
>                     obs_count = float32(),
>                     only_presence_reported = float32(),
>                     only_slash_reported = float32(),
>                     valid = float32(),
>                     reviewed = float32(),
>                     has_media = float32()
>                     )
>
> csv_stream <- open_dataset(csv_file, format = "csv",
>                            schema = sch, skip_rows = 1)
>
> write_dataset(csv_stream, dest, format = "parquet",
>               max_rows_per_file=1000000L,
>               hive_style = TRUE,
>               existing_data_behavior = "overwrite"){code}
> When I load the dataset and check one random _checklist_id_ I get rows that are not part of the _obs.csv_ file. There shouldn't be duplicated species in a checklist but there are ({_}amerob{_} for example)... also note that the duplicated species have different {_}obs_count{_}. 50 species in total in that specific {_}checklist_id{_}.
> {code:java}
> parquet_arrow <- open_dataset(dest, format = "parquet")
>
> parquet_arrow |>
>   filter(checklist_id == 18543372) |>
>   arrange(species_code) |>
>   collect()
>
> # A tibble: 50 × 3
>    checklist_id species_code obs_count
>           <dbl> <chr>            <dbl>
>  1     18543372 altori               3
>  2     18543372 amekes               1
>  3     18543372 amered              40
>  4     18543372 amerob              30
>  5     18543372 amerob               9
>  6     18543372 balori               9
>  7     18543372 blkter               9
>  8     18543372 blkvul              20
>  9     18543372 buggna               1
> 10     18543372 buwwar               1
> # … with 40 more rows
> # ℹ Use `print(n = ...)` to see more rows{code}
> If I use awk to query the csv file with that same checklist id, I get something different:
> {code:java}
> $ awk -F "," '{ if ($1 == 18543372) { print } }' obs.csv
> 18543372.0,rewbla,,60.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,amerob,,30.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,robgro,,2.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,eastow,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,sedwre1,,2.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,ovenbi1,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,buggna,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,reshaw,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,turvul,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,gowwar,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,balori,,9.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,buwwar,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,grycat,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,cangoo,,6.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,houwre,,1.0,0.0,0.0,1.0,0.0,0.0
> 18543372.0,amered,,40.0,0.0,0.0,1.0,1.0,0.0
> 18543372.0,norwat,,2.0,0.0,0.0,1.0,0.0,0.0{code}
> 17 different species and no repetitions... Note that _amerob_ appears on the 2nd line only, with an _obs_count_ of 30.
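A minimal sketch, assuming the `dest` Parquet path from the R code above, that would quantify how widespread these phantom rows are by counting duplicated (checklist_id, species_code) pairs in the written output:
{code:java}
# sketch: eBird checklists should never repeat a species, so any
# (checklist_id, species_code) pair with more than one row is suspect
library(arrow)
library(dplyr)

open_dataset(dest, format = "parquet") |>
  group_by(checklist_id, species_code) |>
  summarise(n_rows = n()) |>
  collect() |>
  filter(n_rows > 1)
{code}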
>
> If I import the csv into parquet using the Python Arrow library as:
> {code:java}
> import pyarrow as pa
> import pyarrow.dataset as ds
> import pyarrow.compute as pc
> import pandas as pd
>
> test_rows_csv = pd.read_csv("/ebird_erd2021/full/obs.csv",
>                             nrows = 1000)
> sch = pa.Schema.from_pandas(test_rows_csv)
>
> csv_file = ds.dataset("/ebird_erd2021/full/obs.csv",
>                       schema = sch,
>                       format = "csv")
>
> ds.write_dataset(csv_file,
>                  "ebird_erd2021/full/obs_parquet_py/",
>                  format = "parquet",
>                  schema = sch,
>                  use_threads = True,
>                  max_rows_per_file = 1000000,
>                  max_rows_per_group = 1000000,
>                  existing_data_behavior = "error"){code}
> And then load it in R doing the same checklist search:
> {code:java}
> parquet_py <- "/ebird_erd2021/full/obs_parquet_py/"
> parquet_arrow <- open_dataset(parquet_py, format = "parquet")
>
> parquet_arrow |>
>   filter(checklist_id == 18543372) |>
>   arrange(species_code) |>
>   select(checklist_id, species_code, obs_count) |>
>   collect()
>
> # A tibble: 17 × 3
>    checklist_id species_code obs_count
>           <dbl> <chr>            <dbl>
>  1     18543372 amered              40
>  2     18543372 amerob              30
>  3     18543372 balori               9
>  4     18543372 buggna               1
>  5     18543372 buwwar               1
>  6     18543372 cangoo               6
>  7     18543372 eastow               1
>  8     18543372 gowwar               1
>  9     18543372 grycat               1
> 10     18543372 houwre               1
> 11     18543372 norwat               2
> 12     18543372 ovenbi1              1
> 13     18543372 reshaw               1
> 14     18543372 rewbla              60
> 15     18543372 robgro               2
> 16     18543372 sedwre1              2
> 17     18543372 turvul               1{code}
> I get exactly what I should: no _species_code_ repeated, and 17 different species.
>
> Due to these differences I guess something weird must be happening in the R Arrow library.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)