This is a folder that contains some parquet files. What do you mean?
Can ParquetDatasetFactory only be used for a single file, while
FileSystemDatasetFactory can be used for folders? Or can you tell me how to use
ParquetDatasetFactory correctly? What do I need to make sure of? For example,
what should I pay attention to with the metadata_path parameter? An example
would be best. The reason I want to use ParquetDatasetFactory is that the
FileSystemDatasetFactory process seems to be as follows:
```
FileSystemDatasetFactory ---> get a dataset
dataset->GetFragments() ---> get fragments for the parquet files in the folder
for fragment in fragments ---> construct a ScannerBuilder
builder->Finish() ---> get a scanner
scanner->ToTable() ---> get a table (reads the file into memory)
// I want to filter out some columns before ToTable(), but it seems that only
// Table has a ColumnNames() function
```
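For the column question in the flow above: as far as I can tell, projection is applied on the ScannerBuilder before Finish(), not on the resulting table, so ColumnNames() is not needed on the fragment. A minimal sketch, assuming `dataset` is the `std::shared_ptr<ds::Dataset>` returned by the factory (the column names "label" and "features" are placeholders):

```cpp
#include <arrow/api.h>
#include <arrow/dataset/api.h>

namespace ds = arrow::dataset;

// Given a dataset from any factory, select columns before materializing.
arrow::Result<std::shared_ptr<arrow::Table>> ReadColumns(
    const std::shared_ptr<ds::Dataset>& dataset) {
  ARROW_ASSIGN_OR_RAISE(auto builder, dataset->NewScan());
  // Keep only the listed columns; names here are hypothetical.
  ARROW_RETURN_NOT_OK(builder->Project({"label", "features"}));
  ARROW_ASSIGN_OR_RAISE(auto scanner, builder->Finish());
  return scanner->ToTable();  // only the projected columns are read
}
```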
Is this the wrong way?
My ultimate goal is to use Arrow to read S3 parquet files for TensorFlow
training.
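On the metadata_path question: my understanding is that ParquetDatasetFactory does not discover files in a folder. It expects metadata_path to point at a `_metadata` sidecar file (as written by e.g. Spark or Dask) that aggregates the row-group metadata of every fragment; without such a file, FileSystemDatasetFactory is the right tool. A hedged sketch, where the bucket and paths are made up for illustration:

```cpp
#include <arrow/dataset/api.h>
#include <arrow/filesystem/api.h>

namespace ds = arrow::dataset;
namespace fs = arrow::fs;

// ParquetDatasetFactory wants the path of a `_metadata` file, not a directory.
std::shared_ptr<ds::Dataset> FromMetadataFile(
    std::shared_ptr<fs::FileSystem> filesystem) {
  auto format = std::make_shared<ds::ParquetFileFormat>();
  ds::ParquetFactoryOptions options;
  // Fragment paths stored in _metadata are resolved against this base dir.
  options.partition_base_dir = "my-bucket/dataset";  // hypothetical
  auto factory = ds::ParquetDatasetFactory::Make(
      "my-bucket/dataset/_metadata",                 // hypothetical path
      filesystem, format, options).ValueOrDie();
  return factory->Finish().ValueOrDie();
}
```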
------------------ Original Message ------------------
From: "dev"
<[email protected]>;
Sent: Saturday, April 9, 2022, 11:38 AM
To: "dev"<[email protected]>;
Subject: Re: construct dataset for s3 by ParquetDatasetFactory failed
Is `iceberg-test/warehouse/test/metadata` a parquet file? I only ask
because there is no extension. The commented out
FileSystemDatasetFactory is only accessing bucket_uri so it would
potentially succeed even if the metadata file did not exist.
On Fri, Apr 8, 2022 at 1:48 AM 1057445597 <[email protected]> wrote:
>
> I want to use ParquetDatasetFactory to create a dataset for S3, but it
failed! The error message is as follows
>
>
> /build/apache-arrow-7.0.0/cpp/src/arrow/result.cc:28: ValueOrDie called on
an error: IOError: Path does not exist 'iceberg-test/warehouse/test/metadata'
/lib/x86_64-linux-gnu/libarrow.so.700(+0x10430bb)[0x7f4ee6fe50bb]
/lib/x86_64-linux-gnu/libarrow.so.700(_ZN5arrow4util8ArrowLogD1Ev+0xed)[0x7f4ee6fe52fd]
/lib/x86_64-linux-gnu/libarrow.so.700(_ZN5arrow8internal17InvalidValueOrDieERKNS_6StatusE+0x17e)[0x7f4ee7104a2e]
./example(+0xd97d)[0x564087f3e97d] ./example(+0x8bc2)[0x564087f39bc2]
./example(+0x94c8)[0x564087f3a4c8] ./example(+0x9fb4)[0x564087f3afb4]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f4ee572b0b3]
./example(+0x69fe)[0x564087f379fe] Aborted (core dumped)
>
>
> In the following code snippet, there is a commented-out line that uses
FileSystemDatasetFactory to create a dataset, and it works well. Can't a
dataset be created through a ParquetDatasetFactory?
>
>
> std::shared_ptr<ds::Dataset> GetDatasetFromS3(const std::string& access_key,
>                                               const std::string& secret_key,
>                                               const std::string& endpoint_override,
>                                               const std::string& bucket_uri) {
>   EnsureS3Initialized();
>
>   S3Options s3Options = S3Options::FromAccessKey(access_key, secret_key);
>   s3Options.endpoint_override = endpoint_override;
>   s3Options.scheme = "http";
>
>   std::shared_ptr<S3FileSystem> s3fs = S3FileSystem::Make(s3Options).ValueOrDie();
>
>   std::string path;
>   std::stringstream ss;
>   ss << "s3://" << access_key << ":" << secret_key
>      << "@" << K_METADATA_PATH
>      << "?scheme=http&endpoint_override=" << endpoint_override;
>   auto fs = arrow::fs::FileSystemFromUri(ss.str(), &path).ValueOrDie();
>   // auto fileInfo = fs->GetFileInfo().ValueOrDie();
>
>   auto format = std::make_shared<ParquetFileFormat>();
>
>   // FileSelector selector;
>   // selector.base_dir = bucket_uri;
>
>   // FileSystemFactoryOptions options;
>   ds::ParquetFactoryOptions options;
>
>   std::string metadata_path = bucket_uri;
>
>   ds::FileSource source(bucket_uri, s3fs);
>   // auto factory = ds::ParquetDatasetFactory::Make(source, bucket_uri, fs, format, options).ValueOrDie();
>   auto factory = ds::ParquetDatasetFactory::Make(path, fs, format, options).ValueOrDie();
>
>   // auto factory = FileSystemDatasetFactory::Make(s3fs, selector, format, options).ValueOrDie();
>   return factory->Finish().ValueOrDie();
> }