This is my full script launching spark-shell:

# add Iceberg dependency
export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=minio
export AWS_SECRET_ACCESS_KEY=minio123
ICEBERG_VERSION=0.11.1
DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"

MINIOSERVER=192.168.176.5

# add AWS dependency
AWS_SDK_VERSION=2.15.40
AWS_MAVEN_GROUP=software.amazon.awssdk
AWS_PACKAGES=(
    "bundle"
    "url-connection-client"
)
for pkg in "${AWS_PACKAGES[@]}"; do
    DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
done

# start Spark SQL client shell
/spark/bin/spark-shell --packages $DEPENDENCIES \
    --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.hive_test.type=hive \
    --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.hadoop.HadoopFileIO \
    --conf spark.sql.catalog.hive_test.warehouse=s3a://east/warehouse \
    --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
    --conf spark.hadoop.fs.s3a.access.key=minio \
    --conf spark.hadoop.fs.s3a.secret.key=minio123 \
    --conf spark.hadoop.fs.s3a.path.style.access=true \
    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem

Let me know if anything is missing. Thanks. (A few hedged debugging sketches for the classpath, metastore, and credential questions raised below are collected after the quoted thread.)

On Mon, Aug 16, 2021 at 7:29 PM Jack Ye <yezhao...@gmail.com> wrote:

> Have you included the hadoop-aws jar?
> https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws
> -Jack
>
> On Mon, Aug 16, 2021 at 7:09 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>
>> Jack,
>>
>> You are right. S3FileIO will not work on minio since minio does not
>> support ACLs: https://docs.min.io/docs/minio-server-limits-per-tenant.html
>>
>> To use iceberg with minio + s3a, I used the script below to launch spark-shell:
>>
>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>     --conf spark.sql.catalog.hive_test.type=hive \
>>     *--conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.hadoop.HadoopFileIO \*
>>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/warehouse \
>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>
>> *The spark code:*
>>
>> import org.apache.spark.sql.SparkSession
>> val values = List(1,2,3,4,5)
>>
>> val spark = SparkSession.builder().master("local").getOrCreate()
>> import spark.implicits._
>> val df = values.toDF()
>>
>> val core = "mytable"
>> val table = s"hive_test.mydb.${core}"
>> val s3IcePath = s"s3a://east/${core}.ice"
>>
>> df.writeTo(table)
>>   .tableProperty("write.format.default", "parquet")
>>   .tableProperty("location", s3IcePath)
>>   .createOrReplace()
>>
>> *Still the same error:*
>> java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
>>
>> What else could be wrong? Thanks for any clue.
>>
>> On Mon, Aug 16, 2021 at 9:35 AM Jack Ye <yezhao...@gmail.com> wrote:
>>
>>> Sorry for the late reply, I thought I replied on Friday but the email
>>> did not send successfully.
>>>
>>> As Daniel said, you don't need to set up S3A if you are using S3FileIO.
>>>
>>> The S3FileIO by default reads the default credentials chain to check
>>> credential setups one by one:
>>> https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html#credentials-chain
>>>
>>> If you would like to use a specialized credential provider, you can
>>> directly customize your S3 client:
>>> https://iceberg.apache.org/aws/#aws-client-customization
>>>
>>> It looks like you are trying to use MinIO to mount an S3A file system? If
>>> you have to use MinIO, then there is no way to integrate it with S3FileIO
>>> right now. (Maybe I am wrong on this; I don't know much about MinIO.)
>>>
>>> To directly use S3FileIO with HiveCatalog, simply do:
>>>
>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>     --conf spark.sql.catalog.hive_test.type=hive \
>>>     --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>>     --conf spark.sql.catalog.hive_test.warehouse=s3://bucket
>>>
>>> Best,
>>> Jack Ye
>>>
>>> On Sun, Aug 15, 2021 at 2:53 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>
>>>> Thanks. I prefer S3FileIO as it is recommended by iceberg. Do you have
>>>> a sample using the hive catalog, S3FileIO, the spark API (as opposed to SQL),
>>>> and S3 access.key and secret.key? It is hard to get all the settings right for
>>>> this combination without an example. Appreciate any help.
>>>>
>>>> On Fri, Aug 13, 2021 at 6:01 PM Daniel Weeks <daniel.c.we...@gmail.com> wrote:
>>>>
>>>>> So, if I recall correctly, the hive server does need access to check
>>>>> and create paths for table locations.
>>>>>
>>>>> There may be an option to disable this behavior, but otherwise the fs
>>>>> implementation probably needs to be available to the hive metastore.
>>>>>
>>>>> -Dan
>>>>>
>>>>> On Fri, Aug 13, 2021, 4:48 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Daniel.
>>>>>>
>>>>>> After modifying the script to:
>>>>>>
>>>>>> export AWS_REGION=us-east-1
>>>>>> export AWS_ACCESS_KEY_ID=minio
>>>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>>>
>>>>>> ICEBERG_VERSION=0.11.1
>>>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"
>>>>>>
>>>>>> MINIOSERVER=192.168.160.5
>>>>>>
>>>>>> # add AWS dependency
>>>>>> AWS_SDK_VERSION=2.15.40
>>>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>>>> AWS_PACKAGES=(
>>>>>>     "bundle"
>>>>>>     "url-connection-client"
>>>>>> )
>>>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>>>> done
>>>>>>
>>>>>> # start Spark SQL client shell
>>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>>     --conf spark.sql.catalog.hive_test.type=hive \
>>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>>
>>>>>> I got: MetaException: java.lang.RuntimeException:
>>>>>> java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found.
>>>>>> My hive server is not using s3 and should not cause this error. Any ideas? Thanks.
>>>>>>
>>>>>> I got "ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>> not found". Any idea what dependency I could be missing?
>>>>>>
>>>>>> On Fri, Aug 13, 2021 at 4:03 PM Daniel Weeks <daniel.c.we...@gmail.com> wrote:
>>>>>>
>>>>>>> Hey Lian,
>>>>>>>
>>>>>>> At a cursory glance, it appears that you might be mixing two
>>>>>>> different FileIO implementations, which may be why you are not getting the
>>>>>>> expected result.
>>>>>>>
>>>>>>> When you set --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO,
>>>>>>> you're actually switching over to the native S3 implementation within Iceberg
>>>>>>> (as opposed to S3AFileSystem via HadoopFileIO). However, all of the following
>>>>>>> settings to set up access are then set for the S3AFileSystem (which would
>>>>>>> not be used with S3FileIO).
>>>>>>>
>>>>>>> You might try just removing that line, since it should use the
>>>>>>> HadoopFileIO at that point and may work.
>>>>>>>
>>>>>>> Hope that's helpful,
>>>>>>> -Dan
>>>>>>>
>>>>>>> On Fri, Aug 13, 2021 at 3:50 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am trying to create an iceberg table on minio s3 and hive.
>>>>>>>>
>>>>>>>> *This is how I launch spark-shell:*
>>>>>>>>
>>>>>>>> # add Iceberg dependency
>>>>>>>> export AWS_REGION=us-east-1
>>>>>>>> export AWS_ACCESS_KEY_ID=minio
>>>>>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>>>>>
>>>>>>>> ICEBERG_VERSION=0.11.1
>>>>>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION"
>>>>>>>>
>>>>>>>> MINIOSERVER=192.168.160.5
>>>>>>>>
>>>>>>>> # add AWS dependency
>>>>>>>> AWS_SDK_VERSION=2.15.40
>>>>>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>>>>>> AWS_PACKAGES=(
>>>>>>>>     "bundle"
>>>>>>>>     "url-connection-client"
>>>>>>>> )
>>>>>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>>>>>> done
>>>>>>>>
>>>>>>>> # start Spark SQL client shell
>>>>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>>>>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>>>>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/prefix \
>>>>>>>>     --conf spark.sql.catalog.hive_test.type=hive \
>>>>>>>>     --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>>>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>>>>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>>>>
>>>>>>>> *Here is the spark code to create the iceberg table:*
>>>>>>>>
>>>>>>>> import org.apache.spark.sql.SparkSession
>>>>>>>> val values = List(1,2,3,4,5)
>>>>>>>>
>>>>>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>>>>>> import spark.implicits._
>>>>>>>> val df = values.toDF()
>>>>>>>>
>>>>>>>> val core = "mytable8"
>>>>>>>> val table = s"hive_test.mydb.${core}"
>>>>>>>> val s3IcePath = s"s3a://spark-test/${core}.ice"
>>>>>>>>
>>>>>>>> df.writeTo(table)
>>>>>>>>   .tableProperty("write.format.default", "parquet")
>>>>>>>>   .tableProperty("location", s3IcePath)
>>>>>>>>   .createOrReplace()
>>>>>>>>
>>>>>>>> I got an error: "The AWS Access Key Id you provided does not exist
>>>>>>>> in our records."
>>>>>>>>
>>>>>>>> I have verified that I can log in to the minio UI using the same username
>>>>>>>> and password that I passed to spark-shell via the AWS_ACCESS_KEY_ID and
>>>>>>>> AWS_SECRET_ACCESS_KEY env variables.
>>>>>>>> https://github.com/apache/iceberg/issues/2168 is related but does
>>>>>>>> not help me. Not sure why the credentials do not work for iceberg + AWS.
>>>>>>>> Any idea, or an example of writing an iceberg table to S3 using the hive
>>>>>>>> catalog, would be highly appreciated! Thanks.
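
On the ClassNotFoundException and Jack's hadoop-aws question: a quick sanity check is to look inside the jar that --packages actually resolved. This is only a sketch; the Ivy cache location and jar naming below are assumptions based on Spark's default --packages behavior and may differ in your setup.

# Hedged sketch: confirm the resolved hadoop-aws artifact really contains
# org.apache.hadoop.fs.s3a.S3AFileSystem (cache path and jar name are assumptions).
ls ~/.ivy2/jars/ | grep hadoop-aws
unzip -l ~/.ivy2/jars/org.apache.hadoop_hadoop-aws-3.2.0.jar | grep S3AFileSystem

If the class is present in the jar, the error is likely being raised in a different JVM than the Spark driver, for example the Hive metastore, which is what the later replies point at.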
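
Daniel's point that the hive metastore itself checks and creates table location paths means the S3A classes also have to be visible to the metastore JVM, not just to spark-shell. A minimal sketch of one way to do that, assuming a standalone metastore and locally available jars; the paths, versions, and use of HIVE_AUX_JARS_PATH are assumptions, and copying the jars into the metastore's lib directory is an alternative.

# Hedged sketch: make the S3A filesystem and the AWS SDK v1 bundle that
# hadoop-aws 3.2.x builds against visible to the Hive metastore process.
# Jar locations and versions are assumptions; match them to your install.
export HIVE_AUX_JARS_PATH=/opt/jars/hadoop-aws-3.2.0.jar:/opt/jars/aws-java-sdk-bundle-1.11.375.jar
hive --service metastore

The metastore would also need the matching fs.s3a.* endpoint and credential properties in its own hive-site.xml or core-site.xml, i.e. the same values passed to spark-shell above.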
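
On the earlier "AWS Access Key Id you provided does not exist" error: as Daniel and Jack explain, S3FileIO does not read the fs.s3a.* properties at all; it resolves credentials through the AWS SDK default chain, so the exported environment variables are what count. A trivial check that the shell launching spark-shell actually carries them:

# Hedged sketch: with S3FileIO the fs.s3a.* keys are ignored; credentials come
# from the AWS SDK default chain, so verify the environment spark-shell inherits.
env | grep -E 'AWS_(REGION|ACCESS_KEY_ID|SECRET_ACCESS_KEY)'

Even with the variables set, the S3 client built by S3FileIO in these scripts has no endpoint override, so it talks to AWS itself rather than MinIO, which would explain the rejected MinIO credentials and matches Jack's note that MinIO + S3FileIO was not supported at the time.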