I'm not sure that I'm following why MinIO won't work with S3FileIO.
S3FileIO assumes that the credentials are handled by a credentials provider
outside of S3FileIO. How does MinIO handle credentials?
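
For what it's worth, S3FileIO's default client resolves credentials through
the AWS SDK v2 default chain (env vars, system properties, profile files,
then instance metadata). A quick sanity check from spark-shell is to resolve
the chain directly; this is just a sketch, assuming the awssdk bundle is on
the classpath as in the --packages list downthread:

import software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider

// Resolves credentials the same way the default S3 client would
val creds = DefaultCredentialsProvider.create().resolveCredentials()
println(s"resolved access key: ${creds.accessKeyId()}")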

Ryan

On Mon, Aug 16, 2021 at 7:57 PM Jack Ye <yezhao...@gmail.com> wrote:

> Talked with Lian on Slack; the user is using a Hadoop 3.2.1 + Hive
> (postgres) + Spark + MinIO docker installation. There might be some
> S3A-related dependencies missing on the Hive server side, based on the
> stack trace. Let's see if adding them fixes the issue.
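>
> A quick way to narrow it down (just a sketch) is to run the same class
> lookup in each JVM. If it succeeds in spark-shell but the failure still
> comes back from the metastore, the jar is missing on the Hive side rather
> than the Spark side:
>
> // Throws ClassNotFoundException when hadoop-aws is absent from this JVM's classpath
> Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")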
> -Jack
>
> On Mon, Aug 16, 2021 at 7:32 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>
>> This is my full script launching spark-shell:
>>
>> # add Iceberg dependency
>> export AWS_REGION=us-east-1
>> export AWS_ACCESS_KEY_ID=minio
>> export AWS_SECRET_ACCESS_KEY=minio123
>>
>> ICEBERG_VERSION=0.11.1
>>
>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"
>>
>> MINIOSERVER=192.168.176.5
>>
>>
>> # add AWS dependency
>> AWS_SDK_VERSION=2.15.40
>> AWS_MAVEN_GROUP=software.amazon.awssdk
>> AWS_PACKAGES=(
>>     "bundle"
>>     "url-connection-client"
>> )
>> for pkg in "${AWS_PACKAGES[@]}"; do
>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>> done
>>
>> # start Spark SQL client shell
>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>     --conf spark.sql.catalog.hive_test.type=hive \
>>     --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.hadoop.HadoopFileIO \
>>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/warehouse \
>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
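>>
>> For reference, the s3a:// side can also be exercised directly from the
>> REPL once the shell is up (a minimal sketch using the built-in spark
>> session and the bucket above; it hits the same ClassNotFoundException if
>> hadoop-aws is missing on the driver):
>>
>> val hadoopConf = spark.sparkContext.hadoopConfiguration
>> // Resolving the s3a:// scheme loads S3AFileSystem, so this fails fast if the jar is absent
>> val fs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("s3a://east/"), hadoopConf)
>> fs.listStatus(new org.apache.hadoop.fs.Path("s3a://east/")).foreach(println)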
>>
>>
>> Let me know if anything is missing. Thanks.
>>
>> On Mon, Aug 16, 2021 at 7:29 PM Jack Ye <yezhao...@gmail.com> wrote:
>>
>>> Have you included the hadoop-aws jar?
>>> https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws
>>> -Jack
>>>
>>> On Mon, Aug 16, 2021 at 7:09 PM Lian Jiang <jiangok2...@gmail.com>
>>> wrote:
>>>
>>>> Jack,
>>>>
>>>> You are right. S3FileIO will not work on MinIO since MinIO does not
>>>> support ACLs:
>>>> https://docs.min.io/docs/minio-server-limits-per-tenant.html
>>>>
>>>> To use Iceberg with MinIO + S3A, I used the script below to launch spark-shell:
>>>>
>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>     --conf spark.sql.catalog.hive_test.type=hive \
>>>>     --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.hadoop.HadoopFileIO \
>>>>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/warehouse \
>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>
>>>>
>>>>
>>>> *The spark code:*
>>>>
>>>> import org.apache.spark.sql.SparkSession
>>>> val values = List(1,2,3,4,5)
>>>>
>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>> import spark.implicits._
>>>> val df = values.toDF()
>>>>
>>>> val core = "mytable"
>>>> val table = s"hive_test.mydb.${core}"
>>>> val s3IcePath = s"s3a://east/${core}.ice"
>>>>
>>>> df.writeTo(table)
>>>>     .tableProperty("write.format.default", "parquet")
>>>>     .tableProperty("location", s3IcePath)
>>>>     .createOrReplace()
>>>>
>>>>
>>>> *Still the same error:*
>>>> java.lang.ClassNotFoundException: Class
>>>> org.apache.hadoop.fs.s3a.S3AFileSystem not found
>>>>
>>>>
>>>> What else could be wrong? Thanks for any clue.
>>>>
>>>>
>>>>
>>>> On Mon, Aug 16, 2021 at 9:35 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>>
>>>>> Sorry for the late reply, I thought I replied on Friday but the email
>>>>> did not send successfully.
>>>>>
>>>>> As Daniel said, you don't need to setup S3A if you are using S3FileIO.
>>>>>
>>>>> The S3FileIO by default uses the default credentials chain, which checks
>>>>> credential setups one by one:
>>>>> https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html#credentials-chain
>>>>>
>>>>> If you would like to use a specialized credential provider, you can
>>>>> directly customize your S3 client:
>>>>> https://iceberg.apache.org/aws/#aws-client-customization
>>>>>
>>>>> It looks like you are trying to use MinIO through the S3A file system?
>>>>> If you have to use MinIO, then there is not a way to integrate it with
>>>>> S3FileIO right now (though maybe I am wrong on this; I don't know much
>>>>> about MinIO).
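>>>>>
>>>>> If it does turn out to be possible, a custom AwsClientFactory (per the
>>>>> customization link above) would essentially have to return an S3 client
>>>>> built against the MinIO endpoint. Just a sketch in plain SDK v2 terms,
>>>>> reusing the endpoint and keys from earlier in this thread and assuming
>>>>> path-style access, since MinIO is usually served that way:
>>>>>
>>>>> import java.net.URI
>>>>> import software.amazon.awssdk.auth.credentials.{AwsBasicCredentials, StaticCredentialsProvider}
>>>>> import software.amazon.awssdk.regions.Region
>>>>> import software.amazon.awssdk.services.s3.{S3Client, S3Configuration}
>>>>>
>>>>> // An S3 client pointed at a MinIO endpoint instead of the AWS default
>>>>> val s3 = S3Client.builder()
>>>>>   .endpointOverride(URI.create("http://192.168.176.5:9000"))
>>>>>   .region(Region.US_EAST_1)
>>>>>   .credentialsProvider(StaticCredentialsProvider.create(
>>>>>     AwsBasicCredentials.create("minio", "minio123")))
>>>>>   .serviceConfiguration(S3Configuration.builder()
>>>>>     .pathStyleAccessEnabled(true) // MinIO buckets are addressed path-style
>>>>>     .build())
>>>>>   .build()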
>>>>>
>>>>> To directly use S3FileIO with HiveCatalog, simply do:
>>>>>
>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>     --conf spark.sql.catalog.hive_test.type=hive \
>>>>>     --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>>>>     --conf spark.sql.catalog.hive_test.warehouse=s3://bucket
>>>>>
>>>>> Best,
>>>>> Jack Ye
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Aug 15, 2021 at 2:53 PM Lian Jiang <jiangok2...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks. I prefer S3FileIO as it is recommended by Iceberg. Do you
>>>>>> have a sample using the Hive catalog, S3FileIO, the Spark API (as
>>>>>> opposed to SQL), and S3 access.key and secret.key? It is hard to get
>>>>>> all the settings right for this combination without an example.
>>>>>> Appreciate any help.
>>>>>>
>>>>>> On Fri, Aug 13, 2021 at 6:01 PM Daniel Weeks <
>>>>>> daniel.c.we...@gmail.com> wrote:
>>>>>>
>>>>>>> So, if I recall correctly, the hive server does need access to check
>>>>>>> and create paths for table locations.
>>>>>>>
>>>>>>> There may be an option to disable this behavior, but otherwise the
>>>>>>> fs implementation probably needs to be available to the hive metastore.
>>>>>>>
>>>>>>> -Dan
>>>>>>>
>>>>>>> On Fri, Aug 13, 2021, 4:48 PM Lian Jiang <jiangok2...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks Daniel.
>>>>>>>>
>>>>>>>> After modifying the script to,
>>>>>>>>
>>>>>>>> export AWS_REGION=us-east-1
>>>>>>>> export AWS_ACCESS_KEY_ID=minio
>>>>>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>>>>>
>>>>>>>> ICEBERG_VERSION=0.11.1
>>>>>>>>
>>>>>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"
>>>>>>>>
>>>>>>>> MINIOSERVER=192.168.160.5
>>>>>>>>
>>>>>>>>
>>>>>>>> # add AWS dependency
>>>>>>>> AWS_SDK_VERSION=2.15.40
>>>>>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>>>>>> AWS_PACKAGES=(
>>>>>>>>     "bundle"
>>>>>>>>     "url-connection-client"
>>>>>>>> )
>>>>>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>>>>>> done
>>>>>>>>
>>>>>>>> # start Spark SQL client shell
>>>>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>>>>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>>>>     --conf spark.sql.catalog.hive_test.type=hive \
>>>>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>>>>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>>>>
>>>>>>>> I got: MetaException: java.lang.RuntimeException:
>>>>>>>> java.lang.ClassNotFoundException: Class
>>>>>>>> org.apache.hadoop.fs.s3a.S3AFileSystem not found. My Hive server is
>>>>>>>> not using S3 and should not cause this error. Any idea which
>>>>>>>> dependency I could be missing? Thanks.
>>>>>>>>
>>>>>>>> On Fri, Aug 13, 2021 at 4:03 PM Daniel Weeks <
>>>>>>>> daniel.c.we...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hey Lian,
>>>>>>>>>
>>>>>>>>> At a cursory glance, it appears that you might be mixing two
>>>>>>>>> different FileIO implementations, which may be why you are not 
>>>>>>>>> getting the
>>>>>>>>> expected result.
>>>>>>>>>
>>>>>>>>> When you set: --conf
>>>>>>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO
>>>>>>>>>  you're
>>>>>>>>> actually switching over to the native S3 implementation within 
>>>>>>>>> Iceberg (as
>>>>>>>>> opposed to S3AFileSystem via HadoopFileIO).  However, all of the 
>>>>>>>>> following
>>>>>>>>> settings to setup access are then set for the S3AFileSystem (which 
>>>>>>>>> would
>>>>>>>>> not be used with S3FileIO).
>>>>>>>>>
>>>>>>>>> You might try just removing that line since it should use the
>>>>>>>>> HadoopFileIO at that point and may work.
>>>>>>>>>
>>>>>>>>> Hope that's helpful,
>>>>>>>>> -Dan
>>>>>>>>>
>>>>>>>>> On Fri, Aug 13, 2021 at 3:50 PM Lian Jiang <jiangok2...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I am trying to create an Iceberg table on MinIO S3 with Hive.
>>>>>>>>>>
>>>>>>>>>> *This is how I launch spark-shell:*
>>>>>>>>>>
>>>>>>>>>> # add Iceberg dependency
>>>>>>>>>> export AWS_REGION=us-east-1
>>>>>>>>>> export AWS_ACCESS_KEY_ID=minio
>>>>>>>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>>>>>>>
>>>>>>>>>> ICEBERG_VERSION=0.11.1
>>>>>>>>>>
>>>>>>>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION"
>>>>>>>>>>
>>>>>>>>>> MINIOSERVER=192.168.160.5
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> # add AWS dependency
>>>>>>>>>> AWS_SDK_VERSION=2.15.40
>>>>>>>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>>>>>>>> AWS_PACKAGES=(
>>>>>>>>>>     "bundle"
>>>>>>>>>>     "url-connection-client"
>>>>>>>>>> )
>>>>>>>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>>>>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>>>>>>>> done
>>>>>>>>>>
>>>>>>>>>> # start Spark SQL client shell
>>>>>>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>>>>>>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>>>>>>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/prefix \
>>>>>>>>>>     --conf spark.sql.catalog.hive_test.type=hive \
>>>>>>>>>>     --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>>>>>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>>>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>>>>>>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>>>>>>
>>>>>>>>>> *Here is the spark code to create the iceberg table:*
>>>>>>>>>>
>>>>>>>>>> import org.apache.spark.sql.SparkSession
>>>>>>>>>> val values = List(1,2,3,4,5)
>>>>>>>>>>
>>>>>>>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>>>>>>>> import spark.implicits._
>>>>>>>>>> val df = values.toDF()
>>>>>>>>>>
>>>>>>>>>> val core = "mytable8"
>>>>>>>>>> val table = s"hive_test.mydb.${core}"
>>>>>>>>>> val s3IcePath = s"s3a://spark-test/${core}.ice"
>>>>>>>>>>
>>>>>>>>>> df.writeTo(table)
>>>>>>>>>>     .tableProperty("write.format.default", "parquet")
>>>>>>>>>>     .tableProperty("location", s3IcePath)
>>>>>>>>>>     .createOrReplace()
>>>>>>>>>>
>>>>>>>>>> I got an error: "The AWS Access Key Id you provided does not exist
>>>>>>>>>> in our records."
>>>>>>>>>>
>>>>>>>>>> I have verified that I can log in to the MinIO UI using the same
>>>>>>>>>> username and password that I passed to spark-shell via the
>>>>>>>>>> AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY env variables.
>>>>>>>>>> https://github.com/apache/iceberg/issues/2168 is related but
>>>>>>>>>> does not help me. Not sure why the credentials do not work for
>>>>>>>>>> Iceberg + AWS. Any idea, or an example of writing an Iceberg table
>>>>>>>>>> to S3 using the Hive catalog, would be highly appreciated! Thanks.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

-- 
Ryan Blue
Tabular
