Talked with Lian on Slack. The user is running a Hadoop 3.2.1 + Hive (Postgres
metastore) + Spark + MinIO Docker installation. Based on the stack trace, some
S3A-related dependencies are likely missing on the Hive server side. Let's see
if adding them fixes the issue.
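For reference, a rough sketch of what adding those dependencies on the Hive
metastore side could look like (the container paths are guesses for this
particular image, and the aws-java-sdk-bundle version should be whatever
hadoop-aws 3.2.1 was built against, 1.11.375 if I remember right):

# hypothetical: drop the S3A jars onto the metastore's classpath, then restart it
HADOOP_AWS_VERSION=3.2.1
AWS_BUNDLE_VERSION=1.11.375
cd /opt/hive/lib    # adjust to wherever the Hive image keeps its jars
curl -O https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/$HADOOP_AWS_VERSION/hadoop-aws-$HADOOP_AWS_VERSION.jar
curl -O https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/$AWS_BUNDLE_VERSION/aws-java-sdk-bundle-$AWS_BUNDLE_VERSION.jar
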
-Jack

On Mon, Aug 16, 2021 at 7:32 PM Lian Jiang <jiangok2...@gmail.com> wrote:

> This is my full script launching spark-shell:
>
> # add Iceberg dependency
> export AWS_REGION=us-east-1
> export AWS_ACCESS_KEY_ID=minio
> export AWS_SECRET_ACCESS_KEY=minio123
>
> ICEBERG_VERSION=0.11.1
>
> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"
>
> MINIOSERVER=192.168.176.5
>
>
> # add AWS dependency
> AWS_SDK_VERSION=2.15.40
> AWS_MAVEN_GROUP=software.amazon.awssdk
> AWS_PACKAGES=(
>     "bundle"
>     "url-connection-client"
> )
> for pkg in "${AWS_PACKAGES[@]}"; do
>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
> done
>
> # start Spark SQL client shell
> /spark/bin/spark-shell --packages $DEPENDENCIES \
>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>     --conf spark.sql.catalog.hive_test.type=hive \
>     --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.hadoop.HadoopFileIO \
>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/warehouse \
>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>     --conf spark.hadoop.fs.s3a.access.key=minio \
>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>
>
> Let me know if anything is missing. Thanks.
>
> On Mon, Aug 16, 2021 at 7:29 PM Jack Ye <yezhao...@gmail.com> wrote:
>
>> Have you included the hadoop-aws jar?
>> https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws
>> -Jack
>>
>> On Mon, Aug 16, 2021 at 7:09 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>>
>>> Jack,
>>>
>>> You are right. S3FileIO will not work on MinIO since MinIO does not
>>> support ACLs:
>>> https://docs.min.io/docs/minio-server-limits-per-tenant.html
>>>
>>> To use Iceberg with MinIO + S3A, I used the script below to launch spark-shell:
>>>
>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>     --conf spark.sql.catalog.hive_test.type=hive \
>>>     --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.hadoop.HadoopFileIO \
>>>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/warehouse \
>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>
>>>
>>>
>>> *The spark code:*
>>>
>>> import org.apache.spark.sql.SparkSession
>>> val values = List(1,2,3,4,5)
>>>
>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>> import spark.implicits._
>>> val df = values.toDF()
>>>
>>> val core = "mytable"
>>> val table = s"hive_test.mydb.${core}"
>>> val s3IcePath = s"s3a://east/${core}.ice"
>>>
>>> df.writeTo(table)
>>>     .tableProperty("write.format.default", "parquet")
>>>     .tableProperty("location", s3IcePath)
>>>     .createOrReplace()
>>>
>>>
>>> *Still the same error:*
>>> java.lang.ClassNotFoundException: Class
>>> org.apache.hadoop.fs.s3a.S3AFileSystem not found
>>>
>>>
>>> What else could be wrong? Thanks for any clue.
>>>
>>>
>>>
>>> On Mon, Aug 16, 2021 at 9:35 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> Sorry for the late reply, I thought I replied on Friday but the email
>>>> did not send successfully.
>>>>
>>>> As Daniel said, you don't need to set up S3A if you are using S3FileIO.
>>>>
>>>> The S3FileIO by default reads the default credentials chain to check
>>>> credential setups one by one:
>>>> https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html#credentials-chain
>>>>
>>>> If you would like to use a specialized credential provider, you can
>>>> directly customize your S3 client:
>>>> https://iceberg.apache.org/aws/#aws-client-customization
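>>>>
>>>> For example, something along these lines (a sketch only; the factory class
>>>> name below is just a placeholder for a class you would write yourself by
>>>> implementing org.apache.iceberg.aws.AwsClientFactory so that it returns an
>>>> S3 client built with your preferred credential provider; see the doc above
>>>> for the exact property name):
>>>>
>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>     --conf spark.sql.catalog.hive_test.type=hive \
>>>>     --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>>>     --conf spark.sql.catalog.hive_test.client.factory=com.example.MyAwsClientFactory \
>>>>     --conf spark.sql.catalog.hive_test.warehouse=s3://bucket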
>>>>
>>>> It looks like you are trying to use MinIO through the S3A file system? If
>>>> you have to use MinIO, then there is no way to integrate it with S3FileIO
>>>> right now. (Maybe I am wrong on this; I don't know much about MinIO.)
>>>>
>>>> To directly use S3FileIO with HiveCatalog, simply do:
>>>>
>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>     --conf spark.sql.catalog.hive_test.type=hive \
>>>>     --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>>>     --conf spark.sql.catalog.hive_test.warehouse=s3://bucket
>>>>
>>>> Best,
>>>> Jack Ye
>>>>
>>>>
>>>>
>>>> On Sun, Aug 15, 2021 at 2:53 PM Lian Jiang <jiangok2...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks. I prefer S3FileIO as it is recommended by Iceberg. Do you have
>>>>> a sample using the Hive catalog, S3FileIO, the Spark API (as opposed to
>>>>> SQL), and S3 access.key and secret.key? It is hard to get all the
>>>>> settings right for this combination without an example. Appreciate any
>>>>> help.
>>>>>
>>>>> On Fri, Aug 13, 2021 at 6:01 PM Daniel Weeks <daniel.c.we...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> So, if I recall correctly, the Hive server does need access to check
>>>>>> and create paths for table locations.
>>>>>>
>>>>>> There may be an option to disable this behavior, but otherwise the fs
>>>>>> implementation probably needs to be available to the Hive metastore.
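>>>>>>
>>>>>> A quick way to check whether it is (sketch only; the container name and
>>>>>> jar directory are guesses for your docker setup):
>>>>>>
>>>>>> # look for the S3A filesystem jar and the AWS SDK bundle on the metastore's classpath
>>>>>> docker exec -it hive-metastore ls /opt/hive/lib | grep -i 'hadoop-aws\|aws-java-sdk'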
>>>>>>
>>>>>> -Dan
>>>>>>
>>>>>> On Fri, Aug 13, 2021, 4:48 PM Lian Jiang <jiangok2...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks Daniel.
>>>>>>>
>>>>>>> After modifying the script to,
>>>>>>>
>>>>>>> export AWS_REGION=us-east-1
>>>>>>> export AWS_ACCESS_KEY_ID=minio
>>>>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>>>>
>>>>>>> ICEBERG_VERSION=0.11.1
>>>>>>>
>>>>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"
>>>>>>>
>>>>>>> MINIOSERVER=192.168.160.5
>>>>>>>
>>>>>>>
>>>>>>> # add AWS dependency
>>>>>>> AWS_SDK_VERSION=2.15.40
>>>>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>>>>> AWS_PACKAGES=(
>>>>>>>     "bundle"
>>>>>>>     "url-connection-client"
>>>>>>> )
>>>>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>>>>> done
>>>>>>>
>>>>>>> # start Spark SQL client shell
>>>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>>>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>>>     --conf spark.sql.catalog.hive_test.type=hive \
>>>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>>>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>>>
>>>>>>> I got: MetaException: java.lang.RuntimeException:
>>>>>>> java.lang.ClassNotFoundException: Class
>>>>>>> org.apache.hadoop.fs.s3a.S3AFileSystem not found. My Hive server is not
>>>>>>> using S3, so it should not be causing this error. Any idea what
>>>>>>> dependency I could be missing? Thanks.
>>>>>>>
>>>>>>> On Fri, Aug 13, 2021 at 4:03 PM Daniel Weeks <
>>>>>>> daniel.c.we...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hey Lian,
>>>>>>>>
>>>>>>>> At a cursory glance, it appears that you might be mixing two different
>>>>>>>> FileIO implementations, which may be why you are not getting the
>>>>>>>> expected result.
>>>>>>>>
>>>>>>>> When you set --conf
>>>>>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO,
>>>>>>>> you're actually switching over to the native S3 implementation within
>>>>>>>> Iceberg (as opposed to S3AFileSystem via HadoopFileIO). However, all of
>>>>>>>> the following settings to set up access are then set for the
>>>>>>>> S3AFileSystem (which would not be used with S3FileIO).
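>>>>>>>>
>>>>>>>> To make the two options concrete (just a sketch):
>>>>>>>>
>>>>>>>> # Iceberg-native S3FileIO: data I/O goes through the AWS SDK v2 client,
>>>>>>>> # so the spark.hadoop.fs.s3a.* settings are not used for table data
>>>>>>>> --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO
>>>>>>>>
>>>>>>>> # HadoopFileIO (the default when io-impl is not set): data I/O goes
>>>>>>>> # through S3AFileSystem, which does use the spark.hadoop.fs.s3a.* settings
>>>>>>>> --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.hadoop.HadoopFileIO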
>>>>>>>>
>>>>>>>> You might try just removing that line since it should use the
>>>>>>>> HadoopFileIO at that point and may work.
>>>>>>>>
>>>>>>>> Hope that's helpful,
>>>>>>>> -Dan
>>>>>>>>
>>>>>>>> On Fri, Aug 13, 2021 at 3:50 PM Lian Jiang <jiangok2...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I am trying to create an Iceberg table on MinIO S3 with Hive.
>>>>>>>>>
>>>>>>>>> *This is how I launch spark-shell:*
>>>>>>>>>
>>>>>>>>> # add Iceberg dependency
>>>>>>>>> export AWS_REGION=us-east-1
>>>>>>>>> export AWS_ACCESS_KEY_ID=minio
>>>>>>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>>>>>>
>>>>>>>>> ICEBERG_VERSION=0.11.1
>>>>>>>>>
>>>>>>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION"
>>>>>>>>>
>>>>>>>>> MINIOSERVER=192.168.160.5
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> # add AWS dependency
>>>>>>>>> AWS_SDK_VERSION=2.15.40
>>>>>>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>>>>>>> AWS_PACKAGES=(
>>>>>>>>>     "bundle"
>>>>>>>>>     "url-connection-client"
>>>>>>>>> )
>>>>>>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>>>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>>>>>>> done
>>>>>>>>>
>>>>>>>>> # start Spark SQL client shell
>>>>>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>>>>>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>>>>>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/prefix \
>>>>>>>>>     --conf spark.sql.catalog.hive_test.type=hive \
>>>>>>>>>     --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>>>>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>>>>>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>>>>>
>>>>>>>>> *Here is the spark code to create the iceberg table:*
>>>>>>>>>
>>>>>>>>> import org.apache.spark.sql.SparkSession
>>>>>>>>> val values = List(1,2,3,4,5)
>>>>>>>>>
>>>>>>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>>>>>>> import spark.implicits._
>>>>>>>>> val df = values.toDF()
>>>>>>>>>
>>>>>>>>> val core = "mytable8"
>>>>>>>>> val table = s"hive_test.mydb.${core}"
>>>>>>>>> val s3IcePath = s"s3a://spark-test/${core}.ice"
>>>>>>>>>
>>>>>>>>> df.writeTo(table)
>>>>>>>>>     .tableProperty("write.format.default", "parquet")
>>>>>>>>>     .tableProperty("location", s3IcePath)
>>>>>>>>>     .createOrReplace()
>>>>>>>>>
>>>>>>>>> I got an error: "The AWS Access Key Id you provided does not exist
>>>>>>>>> in our records."
>>>>>>>>>
>>>>>>>>> I have verified that I can log in to the MinIO UI using the same
>>>>>>>>> username and password that I passed to spark-shell via the
>>>>>>>>> AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY env variables.
>>>>>>>>> https://github.com/apache/iceberg/issues/2168 is related but does not
>>>>>>>>> help me. I am not sure why the credentials do not work for Iceberg +
>>>>>>>>> AWS. Any ideas, or an example of writing an Iceberg table to S3 using
>>>>>>>>> the Hive catalog, would be highly appreciated! Thanks.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
