Have you included the hadoop-aws jar? https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws -Jack
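For reference, a quick way to confirm the jar actually made it onto the driver classpath is a probe like the following (a hypothetical snippet, not from the thread; paste it into the same spark-shell session):

  // Hypothetical classpath probe (not from the thread): confirms whether the
  // hadoop-aws jar pulled in by --packages is visible to the Spark driver.
  try {
    Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")
    println("S3AFileSystem found on the driver classpath")
  } catch {
    case _: ClassNotFoundException =>
      println("hadoop-aws is not on the driver classpath")
  }

Note that this only checks the Spark driver; as Daniel points out further down in the thread, the Hive metastore may also need the same filesystem implementation available on its own classpath.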
On Mon, Aug 16, 2021 at 7:09 PM Lian Jiang <jiangok2...@gmail.com> wrote:
> Jack,
>
> You are right. S3FileIO will not work on minio since minio does not support ACL: https://docs.min.io/docs/minio-server-limits-per-tenant.html
>
> To use iceberg with minio + s3a, I used the script below to launch spark-shell:
>
> /spark/bin/spark-shell --packages $DEPENDENCIES \
>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>     --conf spark.sql.catalog.hive_test.type=hive \
>     --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.hadoop.HadoopFileIO \
>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/warehouse \
>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>     --conf spark.hadoop.fs.s3a.access.key=minio \
>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>
> *The spark code:*
>
> import org.apache.spark.sql.SparkSession
> val values = List(1,2,3,4,5)
>
> val spark = SparkSession.builder().master("local").getOrCreate()
> import spark.implicits._
> val df = values.toDF()
>
> val core = "mytable"
> val table = s"hive_test.mydb.${core}"
> val s3IcePath = s"s3a://east/${core}.ice"
>
> df.writeTo(table)
>   .tableProperty("write.format.default", "parquet")
>   .tableProperty("location", s3IcePath)
>   .createOrReplace()
>
> *Still the same error:*
> java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
>
> What else could be wrong? Thanks for any clue.
>
> On Mon, Aug 16, 2021 at 9:35 AM Jack Ye <yezhao...@gmail.com> wrote:
>
>> Sorry for the late reply, I thought I replied on Friday but the email did not send successfully.
>>
>> As Daniel said, you don't need to set up S3A if you are using S3FileIO.
>>
>> S3FileIO by default reads the default credentials chain to check credential setups one by one:
>> https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html#credentials-chain
>>
>> If you would like to use a specialized credential provider, you can directly customize your S3 client:
>> https://iceberg.apache.org/aws/#aws-client-customization
>>
>> It looks like you are trying to use MinIO to mount an S3A file system? If you have to use MinIO then there is not a way to integrate with S3FileIO right now. (Maybe I am wrong on this, I don't know much about MinIO.)
>>
>> To directly use S3FileIO with HiveCatalog, simply do:
>>
>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>     --conf spark.sql.catalog.hive_test.type=hive \
>>     --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>     --conf spark.sql.catalog.hive_test.warehouse=s3://bucket
>>
>> Best,
>> Jack Ye
>>
>> On Sun, Aug 15, 2021 at 2:53 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>>
>>> Thanks. I prefer S3FileIO as it is recommended by Iceberg. Do you have a sample using the hive catalog, S3FileIO, the Spark API (as opposed to SQL), and S3 access.key and secret.key? It is hard to get all the settings right for this combination without an example. Appreciate any help.
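For what it's worth, a minimal sketch of the combination asked for above (hive catalog + S3FileIO + static keys + a MinIO endpoint), written as a standalone Spark app rather than spark-shell flags. The s3.endpoint, s3.path-style-access, s3.access-key-id and s3.secret-access-key catalog properties are taken from the AWS docs of later Iceberg releases and may not be available in 0.11.1; the iceberg-spark3-runtime and AWS SDK bundle jars are still assumed to be on the classpath (e.g. via --packages):

  // Sketch only: SparkCatalog backed by Hive, with Iceberg's native S3FileIO
  // pointed at MinIO. The s3.* property names are assumed from newer Iceberg
  // AWS docs and may not exist in 0.11.1.
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .master("local")
    .config("spark.sql.catalog.hive_test", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.hive_test.type", "hive")
    .config("spark.sql.catalog.hive_test.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.hive_test.warehouse", "s3://east/warehouse")
    .config("spark.sql.catalog.hive_test.s3.endpoint", "http://192.168.160.5:9000")
    .config("spark.sql.catalog.hive_test.s3.path-style-access", "true")
    .config("spark.sql.catalog.hive_test.s3.access-key-id", "minio")
    .config("spark.sql.catalog.hive_test.s3.secret-access-key", "minio123")
    .getOrCreate()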
>>> On Fri, Aug 13, 2021 at 6:01 PM Daniel Weeks <daniel.c.we...@gmail.com> wrote:
>>>
>>>> So, if I recall correctly, the hive server does need access to check and create paths for table locations.
>>>>
>>>> There may be an option to disable this behavior, but otherwise the fs implementation probably needs to be available to the hive metastore.
>>>>
>>>> -Dan
>>>>
>>>> On Fri, Aug 13, 2021, 4:48 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>
>>>>> Thanks Daniel.
>>>>>
>>>>> After modifying the script to:
>>>>>
>>>>> export AWS_REGION=us-east-1
>>>>> export AWS_ACCESS_KEY_ID=minio
>>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>>
>>>>> ICEBERG_VERSION=0.11.1
>>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"
>>>>>
>>>>> MINIOSERVER=192.168.160.5
>>>>>
>>>>> # add AWS dependency
>>>>> AWS_SDK_VERSION=2.15.40
>>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>>> AWS_PACKAGES=(
>>>>>     "bundle"
>>>>>     "url-connection-client"
>>>>> )
>>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>>> done
>>>>>
>>>>> # start Spark SQL client shell
>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>     --conf spark.sql.catalog.hive_test.type=hive \
>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>
>>>>> I got: MetaException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found. My hive server is not using s3 and should not cause this error. Any ideas? Thanks.
>>>>>
>>>>> On Fri, Aug 13, 2021 at 4:03 PM Daniel Weeks <daniel.c.we...@gmail.com> wrote:
>>>>>
>>>>>> Hey Lian,
>>>>>>
>>>>>> At a cursory glance, it appears that you might be mixing two different FileIO implementations, which may be why you are not getting the expected result.
>>>>>>
>>>>>> When you set --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO, you're actually switching over to the native S3 implementation within Iceberg (as opposed to S3AFileSystem via HadoopFileIO). However, all of the following settings to set up access are then set for the S3AFileSystem (which would not be used with S3FileIO).
>>>>>>
>>>>>> You might try just removing that line, since it should use the HadoopFileIO at that point and may work.
>>>>>>
>>>>>> Hope that's helpful,
>>>>>> -Dan
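A minimal sketch of the configuration Daniel suggests (drop the io-impl override and stay entirely on S3AFileSystem via the default HadoopFileIO), again as a standalone Spark app; it assumes iceberg-spark3-runtime and hadoop-aws (plus its transitive AWS SDK bundle) are on the classpath, e.g. via --packages:

  // Sketch only: no io-impl override, so Iceberg stays on HadoopFileIO and the
  // fs.s3a.* settings below are the ones that actually take effect.
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .master("local")
    .config("spark.sql.catalog.hive_test", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.hive_test.type", "hive")
    .config("spark.sql.catalog.hive_test.warehouse", "s3a://east/warehouse")
    .config("spark.hadoop.fs.s3a.endpoint", "http://192.168.160.5:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minio")
    .config("spark.hadoop.fs.s3a.secret.key", "minio123")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()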
>>>>>> On Fri, Aug 13, 2021 at 3:50 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am trying to create an iceberg table on minio s3 and hive.
>>>>>>>
>>>>>>> *This is how I launch spark-shell:*
>>>>>>>
>>>>>>> # add Iceberg dependency
>>>>>>> export AWS_REGION=us-east-1
>>>>>>> export AWS_ACCESS_KEY_ID=minio
>>>>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>>>>
>>>>>>> ICEBERG_VERSION=0.11.1
>>>>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION"
>>>>>>>
>>>>>>> MINIOSERVER=192.168.160.5
>>>>>>>
>>>>>>> # add AWS dependency
>>>>>>> AWS_SDK_VERSION=2.15.40
>>>>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>>>>> AWS_PACKAGES=(
>>>>>>>     "bundle"
>>>>>>>     "url-connection-client"
>>>>>>> )
>>>>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>>>>> done
>>>>>>>
>>>>>>> # start Spark SQL client shell
>>>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>>>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>>>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/prefix \
>>>>>>>     --conf spark.sql.catalog.hive_test.type=hive \
>>>>>>>     --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>>>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>>>
>>>>>>> *Here is the spark code to create the iceberg table:*
>>>>>>>
>>>>>>> import org.apache.spark.sql.SparkSession
>>>>>>> val values = List(1,2,3,4,5)
>>>>>>>
>>>>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>>>>> import spark.implicits._
>>>>>>> val df = values.toDF()
>>>>>>>
>>>>>>> val core = "mytable8"
>>>>>>> val table = s"hive_test.mydb.${core}"
>>>>>>> val s3IcePath = s"s3a://spark-test/${core}.ice"
>>>>>>>
>>>>>>> df.writeTo(table)
>>>>>>>   .tableProperty("write.format.default", "parquet")
>>>>>>>   .tableProperty("location", s3IcePath)
>>>>>>>   .createOrReplace()
>>>>>>>
>>>>>>> I got an error: "The AWS Access Key Id you provided does not exist in our records."
>>>>>>>
>>>>>>> I have verified that I can log in to the minio UI using the same username and password that I passed to spark-shell via the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY env variables.
>>>>>>> https://github.com/apache/iceberg/issues/2168 is related but does not help me. Not sure why the credentials do not work for iceberg + AWS. Any idea or an example of writing an iceberg table to S3 using the hive catalog would be highly appreciated! Thanks.
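One way to narrow down the credential error above (a sketch, not from the thread) is to ask the AWS SDK v2 default chain which credentials it actually resolves in the same session. Note that the quoted error text is the message AWS S3 itself returns, so with S3FileIO and no endpoint override the client is presumably talking to AWS rather than MinIO, in which case the MinIO keys would be rejected regardless of how they are supplied:

  // Sketch only: check what the AWS SDK v2 default credentials chain resolves.
  // The SDK classes come from the software.amazon.awssdk:bundle artifact
  // already listed in DEPENDENCIES above.
  import software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider

  val creds = DefaultCredentialsProvider.create().resolveCredentials()
  // Should print "minio" if the AWS_ACCESS_KEY_ID env variable is being picked up.
  println(creds.accessKeyId())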