Have you included the hadoop-aws jar? https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws -Jack
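For reference, a quick way to confirm the jar actually made it onto the driver classpath is a probe like the following (a hypothetical snippet, not from the thread; paste it into the same spark-shell session):

  // Hypothetical classpath probe (not from the thread): confirms whether the
  // hadoop-aws jar pulled in by --packages is visible to the Spark driver.
  try {
    Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")
    println("S3AFileSystem found on the driver classpath")
  } catch {
    case _: ClassNotFoundException =>
      println("hadoop-aws is not on the driver classpath")
  }

Note that this only checks the Spark driver; as Daniel points out further down in the thread, the Hive metastore may also need the same filesystem implementation available on its own classpath.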
On Mon, Aug 16, 2021 at 7:09 PM Lian Jiang <jiangok2...@gmail.com> wrote:
> Jack,
>
> You are right. S3FileIO will not work on minio since minio does not support ACL: https://docs.min.io/docs/minio-server-limits-per-tenant.html
>
> To use iceberg with minio + s3a, I used the script below to launch spark-shell:
>
> /spark/bin/spark-shell --packages $DEPENDENCIES \
>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>     --conf spark.sql.catalog.hive_test.type=hive \
>     --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.hadoop.HadoopFileIO \
>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/warehouse \
>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>     --conf spark.hadoop.fs.s3a.access.key=minio \
>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>
> *The spark code:*
>
> import org.apache.spark.sql.SparkSession
> val values = List(1,2,3,4,5)
>
> val spark = SparkSession.builder().master("local").getOrCreate()
> import spark.implicits._
> val df = values.toDF()
>
> val core = "mytable"
> val table = s"hive_test.mydb.${core}"
> val s3IcePath = s"s3a://east/${core}.ice"
>
> df.writeTo(table)
>   .tableProperty("write.format.default", "parquet")
>   .tableProperty("location", s3IcePath)
>   .createOrReplace()
>
> *Still the same error:*
> java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
>
> What else could be wrong? Thanks for any clue.
>
> On Mon, Aug 16, 2021 at 9:35 AM Jack Ye <yezhao...@gmail.com> wrote:
>
>> Sorry for the late reply, I thought I replied on Friday but the email did not send successfully.
>>
>> As Daniel said, you don't need to set up S3A if you are using S3FileIO.
>>
>> S3FileIO by default reads the default credentials chain to check credential setups one by one:
>> https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html#credentials-chain
>>
>> If you would like to use a specialized credential provider, you can directly customize your S3 client:
>> https://iceberg.apache.org/aws/#aws-client-customization
>>
>> It looks like you are trying to use MinIO to mount an S3A file system? If you have to use MinIO then there is not a way to integrate with S3FileIO right now. (Maybe I am wrong on this, I don't know much about MinIO.)
>>
>> To directly use S3FileIO with HiveCatalog, simply do:
>>
>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>     --conf spark.sql.catalog.hive_test.type=hive \
>>     --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>     --conf spark.sql.catalog.hive_test.warehouse=s3://bucket
>>
>> Best,
>> Jack Ye
>>
>> On Sun, Aug 15, 2021 at 2:53 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>>
>>> Thanks. I prefer S3FileIO as it is recommended by Iceberg. Do you have a sample using the hive catalog, S3FileIO, the Spark API (as opposed to SQL), and S3 access.key and secret.key? It is hard to get all the settings right for this combination without an example. Appreciate any help.
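For what it's worth, a minimal sketch of the combination asked for above (hive catalog + S3FileIO + static keys + a MinIO endpoint), written as a standalone Spark app rather than spark-shell flags. The s3.endpoint, s3.path-style-access, s3.access-key-id and s3.secret-access-key catalog properties are taken from the AWS docs of later Iceberg releases and may not be available in 0.11.1; the iceberg-spark3-runtime and AWS SDK bundle jars are still assumed to be on the classpath (e.g. via --packages):

  // Sketch only: SparkCatalog backed by Hive, with Iceberg's native S3FileIO
  // pointed at MinIO. The s3.* property names are assumed from newer Iceberg
  // AWS docs and may not exist in 0.11.1.
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .master("local")
    .config("spark.sql.catalog.hive_test", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.hive_test.type", "hive")
    .config("spark.sql.catalog.hive_test.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.hive_test.warehouse", "s3://east/warehouse")
    .config("spark.sql.catalog.hive_test.s3.endpoint", "http://192.168.160.5:9000")
    .config("spark.sql.catalog.hive_test.s3.path-style-access", "true")
    .config("spark.sql.catalog.hive_test.s3.access-key-id", "minio")
    .config("spark.sql.catalog.hive_test.s3.secret-access-key", "minio123")
    .getOrCreate()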
>>> On Fri, Aug 13, 2021 at 6:01 PM Daniel Weeks <daniel.c.we...@gmail.com> wrote:
>>>
>>>> So, if I recall correctly, the hive server does need access to check and create paths for table locations.
>>>>
>>>> There may be an option to disable this behavior, but otherwise the fs implementation probably needs to be available to the hive metastore.
>>>>
>>>> -Dan
>>>>
>>>> On Fri, Aug 13, 2021, 4:48 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>
>>>>> Thanks Daniel.
>>>>>
>>>>> After modifying the script to:
>>>>>
>>>>> export AWS_REGION=us-east-1
>>>>> export AWS_ACCESS_KEY_ID=minio
>>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>>
>>>>> ICEBERG_VERSION=0.11.1
>>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"
>>>>>
>>>>> MINIOSERVER=192.168.160.5
>>>>>
>>>>> # add AWS dependency
>>>>> AWS_SDK_VERSION=2.15.40
>>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>>> AWS_PACKAGES=(
>>>>>     "bundle"
>>>>>     "url-connection-client"
>>>>> )
>>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>>> done
>>>>>
>>>>> # start Spark SQL client shell
>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>     --conf spark.sql.catalog.hive_test.type=hive \
>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>
>>>>> I got: MetaException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found. My hive server is not using s3 and should not cause this error. Any ideas? Thanks.
>>>>>
>>>>> On Fri, Aug 13, 2021 at 4:03 PM Daniel Weeks <daniel.c.we...@gmail.com> wrote:
>>>>>
>>>>>> Hey Lian,
>>>>>>
>>>>>> At a cursory glance, it appears that you might be mixing two different FileIO implementations, which may be why you are not getting the expected result.
>>>>>>
>>>>>> When you set --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO, you're actually switching over to the native S3 implementation within Iceberg (as opposed to S3AFileSystem via HadoopFileIO). However, all of the following settings to set up access are then set for the S3AFileSystem (which would not be used with S3FileIO).
>>>>>>
>>>>>> You might try just removing that line, since it should use the HadoopFileIO at that point and may work.
>>>>>>
>>>>>> Hope that's helpful,
>>>>>> -Dan
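A minimal sketch of the configuration Daniel suggests (drop the io-impl override and stay entirely on S3AFileSystem via the default HadoopFileIO), again as a standalone Spark app; it assumes iceberg-spark3-runtime and hadoop-aws (plus its transitive AWS SDK bundle) are on the classpath, e.g. via --packages:

  // Sketch only: no io-impl override, so Iceberg stays on HadoopFileIO and the
  // fs.s3a.* settings below are the ones that actually take effect.
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .master("local")
    .config("spark.sql.catalog.hive_test", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.hive_test.type", "hive")
    .config("spark.sql.catalog.hive_test.warehouse", "s3a://east/warehouse")
    .config("spark.hadoop.fs.s3a.endpoint", "http://192.168.160.5:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minio")
    .config("spark.hadoop.fs.s3a.secret.key", "minio123")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()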
>>>>>> On Fri, Aug 13, 2021 at 3:50 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am trying to create an iceberg table on minio s3 and hive.
>>>>>>>
>>>>>>> *This is how I launch spark-shell:*
>>>>>>>
>>>>>>> # add Iceberg dependency
>>>>>>> export AWS_REGION=us-east-1
>>>>>>> export AWS_ACCESS_KEY_ID=minio
>>>>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>>>>
>>>>>>> ICEBERG_VERSION=0.11.1
>>>>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION"
>>>>>>>
>>>>>>> MINIOSERVER=192.168.160.5
>>>>>>>
>>>>>>> # add AWS dependency
>>>>>>> AWS_SDK_VERSION=2.15.40
>>>>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>>>>> AWS_PACKAGES=(
>>>>>>>     "bundle"
>>>>>>>     "url-connection-client"
>>>>>>> )
>>>>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>>>>> done
>>>>>>>
>>>>>>> # start Spark SQL client shell
>>>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>>>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>>>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/prefix \
>>>>>>>     --conf spark.sql.catalog.hive_test.type=hive \
>>>>>>>     --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>>>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>>>
>>>>>>> *Here is the spark code to create the iceberg table:*
>>>>>>>
>>>>>>> import org.apache.spark.sql.SparkSession
>>>>>>> val values = List(1,2,3,4,5)
>>>>>>>
>>>>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>>>>> import spark.implicits._
>>>>>>> val df = values.toDF()
>>>>>>>
>>>>>>> val core = "mytable8"
>>>>>>> val table = s"hive_test.mydb.${core}"
>>>>>>> val s3IcePath = s"s3a://spark-test/${core}.ice"
>>>>>>>
>>>>>>> df.writeTo(table)
>>>>>>>   .tableProperty("write.format.default", "parquet")
>>>>>>>   .tableProperty("location", s3IcePath)
>>>>>>>   .createOrReplace()
>>>>>>>
>>>>>>> I got an error: "The AWS Access Key Id you provided does not exist in our records."
>>>>>>>
>>>>>>> I have verified that I can log in to the minio UI using the same username and password that I passed to spark-shell via the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY env variables.
>>>>>>> https://github.com/apache/iceberg/issues/2168 is related but does not help me. Not sure why the credentials do not work for iceberg + AWS. Any idea or an example of writing an iceberg table to S3 using the hive catalog would be highly appreciated! Thanks.
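One way to narrow down the credential error above (a sketch, not from the thread) is to ask the AWS SDK v2 default chain which credentials it actually resolves in the same session. Note that the quoted error text is the message AWS S3 itself returns, so with S3FileIO and no endpoint override the client is presumably talking to AWS rather than MinIO, in which case the MinIO keys would be rejected regardless of how they are supplied:

  // Sketch only: check what the AWS SDK v2 default credentials chain resolves.
  // The SDK classes come from the software.amazon.awssdk:bundle artifact
  // already listed in DEPENDENCIES above.
  import software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider

  val creds = DefaultCredentialsProvider.create().resolveCredentials()
  // Should print "minio" if the AWS_ACCESS_KEY_ID env variable is being picked up.
  println(creds.accessKeyId())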