Otávio Vasques created ARROW-7727:
-------------------------------------
Summary: Unable to read a ParquetDataset when schema validation is on.
Key: ARROW-7727
URL: https://issues.apache.org/jira/browse/ARROW-7727
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.15.1
Environment: _libgcc_mutex 0.1 main
arrow-cpp 0.15.1 py37h982ac2c_6 conda-forge
attrs 19.3.0 py_0 conda-forge
backcall 0.1.0 py_0 conda-forge
bleach 3.1.0 py_0 conda-forge
boost-cpp 1.70.0 h8e57a91_2 conda-forge
brotli 1.0.7 he1b5a44_1000 conda-forge
bzip2 1.0.8 h516909a_2 conda-forge
c-ares 1.15.0 h516909a_1001 conda-forge
ca-certificates 2019.11.28 hecc5488_0 conda-forge
certifi 2019.11.28 py37_0 conda-forge
decorator 4.4.1 py_0 conda-forge
defusedxml 0.6.0 py_0 conda-forge
double-conversion 3.1.5 he1b5a44_2 conda-forge
entrypoints 0.3 py37_1000 conda-forge
gflags 2.2.2 he1b5a44_1002 conda-forge
glog 0.4.0 he1b5a44_1 conda-forge
grpc-cpp 1.25.0 h213be95_2 conda-forge
icu 64.2 he1b5a44_1 conda-forge
importlib_metadata 1.4.0 py37_0 conda-forge
inflect 4.0.0 py37_1 conda-forge
ipykernel 5.1.4 py37h5ca1d4c_0 conda-forge
ipython 7.11.1 py37h5ca1d4c_0 conda-forge
ipython_genutils 0.2.0 py_1 conda-forge
jaraco.itertools 5.0.0 py_0 conda-forge
jedi 0.16.0 py37_0 conda-forge
jinja2 2.10.3 py_0 conda-forge
jsonschema 3.2.0 py37_0 conda-forge
jupyter_client 5.3.4 py37_1 conda-forge
jupyter_core 4.6.1 py37_0 conda-forge
ld_impl_linux-64 2.33.1 h53a641e_7
libblas 3.8.0 14_openblas conda-forge
libcblas 3.8.0 14_openblas conda-forge
libedit 3.1.20181209 hc058e9b_0
libevent 2.1.10 h72c5cf5_0 conda-forge
libffi 3.2.1 hd88cf55_4
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_4 conda-forge
liblapack 3.8.0 14_openblas conda-forge
libopenblas 0.3.7 h5ec1e0e_6 conda-forge
libprotobuf 3.11.0 h8b12597_0 conda-forge
libsodium 1.0.17 h516909a_0 conda-forge
libstdcxx-ng 9.1.0 hdf63c60_0
lz4-c 1.8.3 he1b5a44_1001 conda-forge
markupsafe 1.1.1 py37h516909a_0 conda-forge
mistune 0.8.4 py37h516909a_1000 conda-forge
more-itertools 8.1.0 py_0 conda-forge
nbconvert 5.6.1 py37_0 conda-forge
nbformat 5.0.4 py_0 conda-forge
ncurses 6.1 he6710b0_1
notebook 6.0.3 py37_0 conda-forge
numpy 1.17.5 py37h95a1406_0 conda-forge
openssl 1.1.1d h516909a_0 conda-forge
pandas 0.25.3 py37hb3f55d8_0 conda-forge
pandoc 2.9.1.1 0 conda-forge
pandocfilters 1.4.2 py_1 conda-forge
parquet-cpp 1.5.1 2 conda-forge
parso 0.6.0 py_0 conda-forge
pexpect 4.8.0 py37_0 conda-forge
pickleshare 0.7.5 py37_1000 conda-forge
pip 20.0.2 py37_0
prometheus_client 0.7.1 py_0 conda-forge
prompt_toolkit 3.0.2 py_0 conda-forge
ptyprocess 0.6.0 py_1001 conda-forge
pyarrow 0.15.1 py37h8b68381_1 conda-forge
pygments 2.5.2 py_0 conda-forge
pyrsistent 0.15.7 py37h516909a_0 conda-forge
python 3.7.6 h0371630_2
python-dateutil 2.8.1 py_0 conda-forge
pytz 2019.3 py_0 conda-forge
pyzmq 18.1.1 py37h1768529_0 conda-forge
re2 2020.01.01 he1b5a44_0 conda-forge
readline 7.0 h7b6447c_5
send2trash 1.5.0 py_0 conda-forge
setuptools 45.1.0 py37_0
six 1.14.0 py37_0 conda-forge
snappy 1.1.7 he1b5a44_1003 conda-forge
sqlite 3.30.1 h7b6447c_0
terminado 0.8.3 py37_0 conda-forge
testpath 0.4.4 py_0 conda-forge
thrift-cpp 0.12.0 hf3afdfd_1004 conda-forge
tk 8.6.8 hbc83047_0
tornado 6.0.3 py37h516909a_0 conda-forge
traitlets 4.3.3 py37_0 conda-forge
uriparser 0.9.3 he1b5a44_1 conda-forge
wcwidth 0.1.8 py_0 conda-forge
webencodings 0.5.1 py_1 conda-forge
wheel 0.33.6 py37_0
xz 5.2.4 h14c3975_4
zeromq 4.3.2 he1b5a44_2 conda-forge
zipp 2.1.0 py_0 conda-forge
zlib 1.2.11 h7b6447c_3
zstd 1.4.4 h3b9ef0a_1 conda-forge
Reporter: Otávio Vasques
Fix For: 0.16.0
I was trying to read a subset of my Parquet files using the ParquetDataset
object with a predefined schema. When it tries to validate the schema,
`to_arrow_schema` is called, and the schema object does not support this
method. I don't know what is happening; here is a sample:
``` python
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import numpy as np

schema = pa.schema([
    pa.field("field1", pa.string()),
    pa.field("field2", pa.string()),
    pa.field("field3", pa.string()),
])
...
# file_groups comes from the elided code above
pq_dataset = pq.ParquetDataset(file_groups[0], schema=schema)
# Raises:
# AttributeError: 'pyarrow.lib.Schema' object has no attribute 'to_arrow_schema'
```
If we check the type of the schema as defined above we get:
```
type(schema)
pyarrow.lib.Schema
```
But the required type according to the docs is `pyarrow.parquet.Schema`. I
don't know how to produce an object of this type, since we are forbidden to
use the Schema constructor directly.
Checking the implementation on GitHub leads directly to this line
[here|https://github.com/apache/arrow/blob/apache-arrow-0.15.1/python/pyarrow/parquet.py#L1097]:
```
dataset_schema = self.schema.to_arrow_schema()
```
Is this a problem in the schema builder or the parquet dataset object?