Talked with Lian on Slack; the user is running a Hadoop 3.2.1 + Hive (Postgres metastore) + Spark + MinIO Docker installation. Based on the stack trace, some S3A-related dependencies appear to be missing on the Hive server side. Let's see if adding them fixes the issue. -Jack
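For reference, a minimal sketch of what adding the S3A dependencies on the Hive server side could look like. The lib path and the aws-java-sdk-bundle version are assumptions for a typical Docker image; take the exact bundle version from the POM of the hadoop-aws release you install.

    # Sketch only: put the S3A connector and its AWS SDK bundle on the Hive
    # server's classpath, then restart the metastore / HiveServer2 containers.
    # Paths and versions below are assumptions; adjust to your image and Hadoop build.
    HADOOP_VERSION=3.2.1
    AWS_SDK_BUNDLE_VERSION=1.11.375   # assumed match for hadoop-aws 3.2.x; check its POM

    cd /opt/hive/lib   # assumed Hive lib directory inside the container
    curl -O https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_VERSION}/hadoop-aws-${HADOOP_VERSION}.jar
    curl -O https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/${AWS_SDK_BUNDLE_VERSION}/aws-java-sdk-bundle-${AWS_SDK_BUNDLE_VERSION}.jar

If the metastore also has to reach MinIO when it validates or creates table locations (as Daniel notes further down in the thread), its core-site.xml will need the corresponding fs.s3a.endpoint and credential settings as well.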
On Mon, Aug 16, 2021 at 7:32 PM Lian Jiang <jiangok2...@gmail.com> wrote:

> This is my full script launching spark-shell:
>
> # add Iceberg dependency
> export AWS_REGION=us-east-1
> export AWS_ACCESS_KEY_ID=minio
> export AWS_SECRET_ACCESS_KEY=minio123
>
> ICEBERG_VERSION=0.11.1
>
> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"
>
> MINIOSERVER=192.168.176.5
>
> # add AWS dependency
> AWS_SDK_VERSION=2.15.40
> AWS_MAVEN_GROUP=software.amazon.awssdk
> AWS_PACKAGES=(
>     "bundle"
>     "url-connection-client"
> )
> for pkg in "${AWS_PACKAGES[@]}"; do
>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
> done
>
> # start Spark SQL client shell
> /spark/bin/spark-shell --packages $DEPENDENCIES \
>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>     --conf spark.sql.catalog.hive_test.type=hive \
>     --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.hadoop.HadoopFileIO \
>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/warehouse \
>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>     --conf spark.hadoop.fs.s3a.access.key=minio \
>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>
> Let me know if anything is missing. Thanks.
>
> On Mon, Aug 16, 2021 at 7:29 PM Jack Ye <yezhao...@gmail.com> wrote:
>
>> Have you included the hadoop-aws jar?
>> https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws
>>
>> -Jack
>>
>> On Mon, Aug 16, 2021 at 7:09 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>>
>>> Jack,
>>>
>>> You are right. S3FileIO will not work on MinIO since MinIO does not support ACLs:
>>> https://docs.min.io/docs/minio-server-limits-per-tenant.html
>>>
>>> To use Iceberg with MinIO + S3A, I used the script below to launch spark-shell:
>>>
>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>     --conf spark.sql.catalog.hive_test.type=hive \
>>>     --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.hadoop.HadoopFileIO \
>>>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/warehouse \
>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>
>>> The Spark code:
>>>
>>> import org.apache.spark.sql.SparkSession
>>> val values = List(1,2,3,4,5)
>>>
>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>> import spark.implicits._
>>> val df = values.toDF()
>>>
>>> val core = "mytable"
>>> val table = s"hive_test.mydb.${core}"
>>> val s3IcePath = s"s3a://east/${core}.ice"
>>>
>>> df.writeTo(table)
>>>   .tableProperty("write.format.default", "parquet")
>>>   .tableProperty("location", s3IcePath)
>>>   .createOrReplace()
>>>
>>> Still the same error:
>>>
>>> java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
>>>
>>> What else could be wrong? Thanks for any clue.
>>>
>>> On Mon, Aug 16, 2021 at 9:35 AM Jack Ye <yezhao...@gmail.com> wrote:
>>>
>>>> Sorry for the late reply, I thought I replied on Friday but the email
>>>> did not send successfully.
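One hedged sanity check for a ClassNotFoundException like the one above: confirm that --packages really delivered hadoop-aws to the machine running spark-shell. The path and jar-naming pattern below assume Spark's default Ivy cache; adjust if your environment differs.

    # Sketch only: list the jars fetched by --packages (assumed default Ivy cache).
    ls ~/.ivy2/jars | grep -Ei 'hadoop-aws|aws-java-sdk|awssdk'

    # Check that S3AFileSystem is actually inside the fetched hadoop-aws jar
    # (the groupId_artifactId-version.jar file name is an assumption).
    unzip -l ~/.ivy2/jars/org.apache.hadoop_hadoop-aws-3.2.0.jar | grep S3AFileSystem

If the class is present on the Spark side, the exception is more likely being raised by the Hive metastore, which is what Jack's note at the top of the thread points to.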
>>>> As Daniel said, you don't need to set up S3A if you are using S3FileIO.
>>>>
>>>> By default, S3FileIO reads the default credentials chain and checks the
>>>> credential setups one by one:
>>>> https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html#credentials-chain
>>>>
>>>> If you would like to use a specialized credential provider, you can
>>>> directly customize your S3 client:
>>>> https://iceberg.apache.org/aws/#aws-client-customization
>>>>
>>>> It looks like you are trying to use MinIO through the S3A file system?
>>>> If you have to use MinIO then there is not a way to integrate with
>>>> S3FileIO right now. (Maybe I am wrong on this, I don't know much about MinIO.)
>>>>
>>>> To directly use S3FileIO with HiveCatalog, simply do:
>>>>
>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>     --conf spark.sql.catalog.hive_test.type=hive \
>>>>     --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>>>     --conf spark.sql.catalog.hive_test.warehouse=s3://bucket
>>>>
>>>> Best,
>>>> Jack Ye
>>>>
>>>> On Sun, Aug 15, 2021 at 2:53 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>
>>>>> Thanks. I prefer S3FileIO since it is recommended by Iceberg. Do you have
>>>>> a sample using the Hive catalog, S3FileIO, the Spark API (as opposed to SQL),
>>>>> and S3 access.key and secret.key? It is hard to get all the settings right
>>>>> for this combination without an example. Appreciate any help.
>>>>>
>>>>> On Fri, Aug 13, 2021 at 6:01 PM Daniel Weeks <daniel.c.we...@gmail.com> wrote:
>>>>>
>>>>>> So, if I recall correctly, the Hive server does need access to check
>>>>>> and create paths for table locations.
>>>>>>
>>>>>> There may be an option to disable this behavior, but otherwise the fs
>>>>>> implementation probably needs to be available to the Hive metastore.
>>>>>>
>>>>>> -Dan
>>>>>>
>>>>>> On Fri, Aug 13, 2021, 4:48 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks Daniel.
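For the sample Lian asks for above, a minimal sketch of a spark-shell launch that keeps the Hive catalog and S3FileIO but supplies credentials explicitly. The s3.access-key-id and s3.secret-access-key catalog properties come from newer Iceberg AWS releases and may not exist in 0.11.1 (check the AWS integration docs for your version); on 0.11.1 the default credentials chain, e.g. the exported AWS_* environment variables, is the safer route.

    # Sketch, not a tested command: Hive catalog + S3FileIO with explicit credentials.
    # s3.access-key-id / s3.secret-access-key are catalog properties in newer Iceberg
    # releases; on 0.11.1 rely on the AWS default credentials chain instead.
    /spark/bin/spark-shell --packages $DEPENDENCIES \
        --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
        --conf spark.sql.catalog.hive_test.type=hive \
        --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
        --conf spark.sql.catalog.hive_test.warehouse=s3://east/warehouse \
        --conf spark.sql.catalog.hive_test.s3.access-key-id=minio \
        --conf spark.sql.catalog.hive_test.s3.secret-access-key=minio123

The Spark write itself is unchanged from the df.writeTo(...) snippet shown above; only the catalog configuration differs.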
>>>>>>> After modifying the script to:
>>>>>>>
>>>>>>> export AWS_REGION=us-east-1
>>>>>>> export AWS_ACCESS_KEY_ID=minio
>>>>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>>>>
>>>>>>> ICEBERG_VERSION=0.11.1
>>>>>>>
>>>>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"
>>>>>>>
>>>>>>> MINIOSERVER=192.168.160.5
>>>>>>>
>>>>>>> # add AWS dependency
>>>>>>> AWS_SDK_VERSION=2.15.40
>>>>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>>>>> AWS_PACKAGES=(
>>>>>>>     "bundle"
>>>>>>>     "url-connection-client"
>>>>>>> )
>>>>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>>>>> done
>>>>>>>
>>>>>>> # start Spark SQL client shell
>>>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>>>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>>>     --conf spark.sql.catalog.hive_test.type=hive \
>>>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>>>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>>>
>>>>>>> I got: MetaException: java.lang.RuntimeException:
>>>>>>> java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>>> not found. My Hive server is not using S3, so it should not cause this
>>>>>>> error. Any idea which dependency I could be missing? Thanks.
>>>>>>>
>>>>>>> On Fri, Aug 13, 2021 at 4:03 PM Daniel Weeks <daniel.c.we...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hey Lian,
>>>>>>>>
>>>>>>>> At a cursory glance, it appears that you might be mixing two
>>>>>>>> different FileIO implementations, which may be why you are not getting the
>>>>>>>> expected result.
>>>>>>>>
>>>>>>>> When you set --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO,
>>>>>>>> you're actually switching over to the native S3 implementation within
>>>>>>>> Iceberg (as opposed to S3AFileSystem via HadoopFileIO). However, all of the
>>>>>>>> following settings to set up access are then set for the S3AFileSystem
>>>>>>>> (which would not be used with S3FileIO).
>>>>>>>>
>>>>>>>> You might try just removing that line, since it should use the
>>>>>>>> HadoopFileIO at that point and may work.
>>>>>>>>
>>>>>>>> Hope that's helpful,
>>>>>>>> -Dan
>>>>>>>>
>>>>>>>> On Fri, Aug 13, 2021 at 3:50 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I am trying to create an Iceberg table on MinIO S3 with Hive.
>>>>>>>>> This is how I launch spark-shell:
>>>>>>>>>
>>>>>>>>> # add Iceberg dependency
>>>>>>>>> export AWS_REGION=us-east-1
>>>>>>>>> export AWS_ACCESS_KEY_ID=minio
>>>>>>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>>>>>>
>>>>>>>>> ICEBERG_VERSION=0.11.1
>>>>>>>>>
>>>>>>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION"
>>>>>>>>>
>>>>>>>>> MINIOSERVER=192.168.160.5
>>>>>>>>>
>>>>>>>>> # add AWS dependency
>>>>>>>>> AWS_SDK_VERSION=2.15.40
>>>>>>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>>>>>>> AWS_PACKAGES=(
>>>>>>>>>     "bundle"
>>>>>>>>>     "url-connection-client"
>>>>>>>>> )
>>>>>>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>>>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>>>>>>> done
>>>>>>>>>
>>>>>>>>> # start Spark SQL client shell
>>>>>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>>>>>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>>>>>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/prefix \
>>>>>>>>>     --conf spark.sql.catalog.hive_test.type=hive \
>>>>>>>>>     --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>>>>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>>>>>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>>>>>
>>>>>>>>> Here is the Spark code to create the Iceberg table:
>>>>>>>>>
>>>>>>>>> import org.apache.spark.sql.SparkSession
>>>>>>>>> val values = List(1,2,3,4,5)
>>>>>>>>>
>>>>>>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>>>>>>> import spark.implicits._
>>>>>>>>> val df = values.toDF()
>>>>>>>>>
>>>>>>>>> val core = "mytable8"
>>>>>>>>> val table = s"hive_test.mydb.${core}"
>>>>>>>>> val s3IcePath = s"s3a://spark-test/${core}.ice"
>>>>>>>>>
>>>>>>>>> df.writeTo(table)
>>>>>>>>>   .tableProperty("write.format.default", "parquet")
>>>>>>>>>   .tableProperty("location", s3IcePath)
>>>>>>>>>   .createOrReplace()
>>>>>>>>>
>>>>>>>>> I got the error "The AWS Access Key Id you provided does not exist in
>>>>>>>>> our records."
>>>>>>>>>
>>>>>>>>> I have verified that I can log in to the MinIO UI using the same username
>>>>>>>>> and password that I passed to spark-shell via the AWS_ACCESS_KEY_ID and
>>>>>>>>> AWS_SECRET_ACCESS_KEY environment variables.
>>>>>>>>> https://github.com/apache/iceberg/issues/2168 is related but does not
>>>>>>>>> help me. I am not sure why the credentials do not work for Iceberg + AWS.
>>>>>>>>> Any idea, or an example of writing an Iceberg table to S3 using the Hive
>>>>>>>>> catalog, would be highly appreciated! Thanks.
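As a follow-up to Jack's open question above about MinIO and S3FileIO: newer Iceberg AWS releases expose s3.endpoint and s3.path-style-access catalog properties that can point S3FileIO at a MinIO endpoint; they are likely not available in 0.11.1, where a custom AwsClientFactory (see the client-customization docs linked above) would be needed instead. That would also explain the original error here: without an endpoint override, S3FileIO ignores the fs.s3a.* settings and sends the MinIO credentials to real AWS S3, which rejects them with "The AWS Access Key Id you provided does not exist in our records." A sketch for a newer version, reusing the MinIO address and bucket from the thread:

    # Sketch for a newer Iceberg version (these properties may not exist in 0.11.1):
    # point S3FileIO at the MinIO endpoint instead of AWS S3.
    /spark/bin/spark-shell --packages $DEPENDENCIES \
        --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
        --conf spark.sql.catalog.hive_test.type=hive \
        --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
        --conf spark.sql.catalog.hive_test.warehouse=s3://east/warehouse \
        --conf spark.sql.catalog.hive_test.s3.endpoint=http://$MINIOSERVER:9000 \
        --conf spark.sql.catalog.hive_test.s3.path-style-access=true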