I'm not sure that I'm following why MinIO won't work with S3FileIO. S3FileIO assumes that the credentials are handled by a credentials provider outside of S3FileIO. How does MinIO handle credentials?
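For context, MinIO speaks the S3 API and authenticates with ordinary access/secret key pairs, so the AWS SDK credential providers work against it; the part that needs special handling is the endpoint. Below is a minimal AWS SDK v2 sketch, using the MinIO host and keys from the scripts later in this thread as assumed values:

import java.net.URI
import software.amazon.awssdk.auth.credentials.{AwsBasicCredentials, StaticCredentialsProvider}
import software.amazon.awssdk.regions.Region
import software.amazon.awssdk.services.s3.{S3Client, S3Configuration}

// Assumed MinIO endpoint and keys, taken from the scripts later in this thread.
val minioClient = S3Client.builder()
  .endpointOverride(URI.create("http://192.168.176.5:9000"))  // point the SDK at MinIO instead of AWS
  .region(Region.US_EAST_1)                                    // a region is still required by the SDK
  .credentialsProvider(
    StaticCredentialsProvider.create(AwsBasicCredentials.create("minio", "minio123")))
  .serviceConfiguration(
    S3Configuration.builder().pathStyleAccessEnabled(true).build())  // MinIO is typically path-style
  .build()

// Quick connectivity check: list the buckets MinIO exposes.
minioClient.listBuckets().buckets().forEach(b => println(b.name()))

Getting S3FileIO to use a client like this would go through Iceberg's AWS client customization hook (linked later in this thread by Jack), not through the fs.s3a.* Hadoop properties.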
Ryan

On Mon, Aug 16, 2021 at 7:57 PM Jack Ye <yezhao...@gmail.com> wrote:

Talked with Lian on Slack, the user is using a hadoop 3.2.1 + hive (postgres) + spark + minio docker installation. There might be some S3A-related dependencies missing on the Hive server side based on the stack trace. Let's see if that fixes the issue.

-Jack

On Mon, Aug 16, 2021 at 7:32 PM Lian Jiang <jiangok2...@gmail.com> wrote:

This is my full script launching spark-shell:

# add Iceberg dependency
export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=minio
export AWS_SECRET_ACCESS_KEY=minio123

ICEBERG_VERSION=0.11.1
DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"
MINIOSERVER=192.168.176.5

# add AWS dependency
AWS_SDK_VERSION=2.15.40
AWS_MAVEN_GROUP=software.amazon.awssdk
AWS_PACKAGES=(
  "bundle"
  "url-connection-client"
)
for pkg in "${AWS_PACKAGES[@]}"; do
  DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
done

# start Spark SQL client shell
/spark/bin/spark-shell --packages $DEPENDENCIES \
  --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.hive_test.type=hive \
  --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.hadoop.HadoopFileIO \
  --conf spark.sql.catalog.hive_test.warehouse=s3a://east/warehouse \
  --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
  --conf spark.hadoop.fs.s3a.access.key=minio \
  --conf spark.hadoop.fs.s3a.secret.key=minio123 \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem

Let me know if anything is missing. Thanks.

On Mon, Aug 16, 2021 at 7:29 PM Jack Ye <yezhao...@gmail.com> wrote:

Have you included the hadoop-aws jar?
https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws

-Jack
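Since the same ClassNotFoundException keeps coming back in this thread, one quick sanity check (a sketch, runnable from the spark-shell prompt) is to ask the driver JVM directly whether the S3A class is on its classpath:

// Throws java.lang.ClassNotFoundException if hadoop-aws is not actually on
// the classpath of this JVM.
Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")

Note this only covers the Spark driver; as Jack points out above, the Hive metastore server is a separate JVM with its own classpath.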
On Mon, Aug 16, 2021 at 7:09 PM Lian Jiang <jiangok2...@gmail.com> wrote:

Jack,

You are right. S3FileIO will not work on minio since minio does not support ACL:
https://docs.min.io/docs/minio-server-limits-per-tenant.html

To use iceberg with minio + s3a, I used the script below to launch spark-shell:

/spark/bin/spark-shell --packages $DEPENDENCIES \
  --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.hive_test.type=hive \
  --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.hadoop.HadoopFileIO \
  --conf spark.sql.catalog.hive_test.warehouse=s3a://east/warehouse \
  --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
  --conf spark.hadoop.fs.s3a.access.key=minio \
  --conf spark.hadoop.fs.s3a.secret.key=minio123 \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem

The spark code:

import org.apache.spark.sql.SparkSession
val values = List(1,2,3,4,5)

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._
val df = values.toDF()

val core = "mytable"
val table = s"hive_test.mydb.${core}"
val s3IcePath = s"s3a://east/${core}.ice"

df.writeTo(table)
  .tableProperty("write.format.default", "parquet")
  .tableProperty("location", s3IcePath)
  .createOrReplace()

Still the same error:
java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

What else could be wrong? Thanks for any clue.

On Mon, Aug 16, 2021 at 9:35 AM Jack Ye <yezhao...@gmail.com> wrote:

Sorry for the late reply, I thought I replied on Friday but the email did not send successfully.

As Daniel said, you don't need to set up S3A if you are using S3FileIO.

The S3FileIO by default reads the default credentials chain, checking credential setups one by one:
https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html#credentials-chain

If you would like to use a specialized credential provider, you can directly customize your S3 client:
https://iceberg.apache.org/aws/#aws-client-customization

It looks like you are trying to use MinIO to mount an S3A file system? If you have to use MinIO then there is not a way to integrate with S3FileIO right now. (Maybe I am wrong on this, I don't know much about MinIO.)

To directly use S3FileIO with HiveCatalog, simply do:

/spark/bin/spark-shell --packages $DEPENDENCIES \
  --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.hive_test.type=hive \
  --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
  --conf spark.sql.catalog.hive_test.warehouse=s3://bucket

Best,
Jack Ye

On Sun, Aug 15, 2021 at 2:53 PM Lian Jiang <jiangok2...@gmail.com> wrote:

Thanks. I prefer S3FileIO as it is recommended by iceberg. Do you have a sample using the hive catalog, S3FileIO, the Spark API (as opposed to SQL), and an S3 access.key and secret.key? It is hard to get all the settings right for this combination without an example. Appreciate any help.
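A minimal sketch of exactly that combination, pieced together from Jack's catalog settings above and the write code already in this thread. It assumes spark-shell was started with the same --packages list used in the scripts above, that a Hive metastore is reachable, and that credentials come from the environment variables the default chain reads (AWS_REGION, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY); the bucket, database, and table names are placeholders:

import org.apache.spark.sql.SparkSession

// Hive catalog with Iceberg's native S3FileIO. No access keys are set here:
// S3FileIO resolves them through the AWS default credentials chain
// (environment variables, system properties, profile files, instance role).
val spark = SparkSession.builder()
  .master("local")
  .config("spark.sql.catalog.hive_test", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.hive_test.type", "hive")
  .config("spark.sql.catalog.hive_test.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
  .config("spark.sql.catalog.hive_test.warehouse", "s3://my-bucket/warehouse")  // placeholder bucket
  .getOrCreate()

import spark.implicits._
val df = List(1, 2, 3, 4, 5).toDF()

// DataFrameWriterV2 against the catalog defined above.
df.writeTo("hive_test.mydb.mytable")
  .tableProperty("write.format.default", "parquet")
  .createOrReplace()

Against MinIO this alone is still not sufficient, because the S3 client that S3FileIO builds points at the standard AWS endpoint; redirecting it to a MinIO host would go through the client customization Jack links above, not through the fs.s3a.* properties.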
On Fri, Aug 13, 2021 at 6:01 PM Daniel Weeks <daniel.c.we...@gmail.com> wrote:

So, if I recall correctly, the hive server does need access to check and create paths for table locations.

There may be an option to disable this behavior, but otherwise the fs implementation probably needs to be available to the hive metastore.

-Dan

On Fri, Aug 13, 2021, 4:48 PM Lian Jiang <jiangok2...@gmail.com> wrote:

Thanks Daniel.

After modifying the script to

export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=minio
export AWS_SECRET_ACCESS_KEY=minio123

ICEBERG_VERSION=0.11.1
DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"
MINIOSERVER=192.168.160.5

# add AWS dependency
AWS_SDK_VERSION=2.15.40
AWS_MAVEN_GROUP=software.amazon.awssdk
AWS_PACKAGES=(
  "bundle"
  "url-connection-client"
)
for pkg in "${AWS_PACKAGES[@]}"; do
  DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
done

# start Spark SQL client shell
/spark/bin/spark-shell --packages $DEPENDENCIES \
  --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.hive_test.type=hive \
  --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
  --conf spark.hadoop.fs.s3a.access.key=minio \
  --conf spark.hadoop.fs.s3a.secret.key=minio123 \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem

I got: MetaException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found. My hive server is not using s3 and should not cause this error. Any idea what dependency I could be missing? Thanks.

On Fri, Aug 13, 2021 at 4:03 PM Daniel Weeks <daniel.c.we...@gmail.com> wrote:

Hey Lian,

At a cursory glance, it appears that you might be mixing two different FileIO implementations, which may be why you are not getting the expected result.

When you set --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO, you're actually switching over to the native S3 implementation within Iceberg (as opposed to S3AFileSystem via HadoopFileIO). However, all of the following settings to set up access are then set for the S3AFileSystem (which would not be used with S3FileIO).

You might try just removing that line, since it should use the HadoopFileIO at that point and may work.

Hope that's helpful,
-Dan
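To make that diagnosis concrete with a rough sketch: the two io-impl choices read completely different configuration, so the fs.s3a.* settings never reach S3FileIO. Its SDK client, left at its defaults, resolves the standard AWS endpoint, where a MinIO-only access key is unknown, which is consistent with the "Access Key Id ... does not exist in our records" error in the original message below. The endpoint and keys here are the assumed MinIO values from the scripts in this thread:

import org.apache.hadoop.conf.Configuration
import software.amazon.awssdk.services.s3.S3Client

// Path 1: HadoopFileIO -> S3AFileSystem, configured entirely via fs.s3a.* keys
// in the Hadoop configuration (these are what --conf spark.hadoop.* flags set).
val hadoopConf = new Configuration()
hadoopConf.set("fs.s3a.endpoint", "http://192.168.160.5:9000")
hadoopConf.set("fs.s3a.access.key", "minio")
hadoopConf.set("fs.s3a.secret.key", "minio123")
hadoopConf.set("fs.s3a.path.style.access", "true")

// Path 2: S3FileIO -> AWS SDK v2 client. It ignores fs.s3a.* entirely; without
// customization it uses the default credentials chain and the standard AWS
// endpoint for the region (assumes AWS_REGION is exported, as in the scripts).
val defaultClient = S3Client.builder().build()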
On Fri, Aug 13, 2021 at 3:50 PM Lian Jiang <jiangok2...@gmail.com> wrote:

Hi,

I am trying to create an iceberg table on minio s3 and hive.

This is how I launch spark-shell:

# add Iceberg dependency
export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=minio
export AWS_SECRET_ACCESS_KEY=minio123

ICEBERG_VERSION=0.11.1
DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION"
MINIOSERVER=192.168.160.5

# add AWS dependency
AWS_SDK_VERSION=2.15.40
AWS_MAVEN_GROUP=software.amazon.awssdk
AWS_PACKAGES=(
  "bundle"
  "url-connection-client"
)
for pkg in "${AWS_PACKAGES[@]}"; do
  DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
done

# start Spark SQL client shell
/spark/bin/spark-shell --packages $DEPENDENCIES \
  --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.hive_test.warehouse=s3a://east/prefix \
  --conf spark.sql.catalog.hive_test.type=hive \
  --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
  --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
  --conf spark.hadoop.fs.s3a.access.key=minio \
  --conf spark.hadoop.fs.s3a.secret.key=minio123 \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem

Here is the spark code to create the iceberg table:

import org.apache.spark.sql.SparkSession
val values = List(1,2,3,4,5)

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._
val df = values.toDF()

val core = "mytable8"
val table = s"hive_test.mydb.${core}"
val s3IcePath = s"s3a://spark-test/${core}.ice"

df.writeTo(table)
  .tableProperty("write.format.default", "parquet")
  .tableProperty("location", s3IcePath)
  .createOrReplace()

I got an error: "The AWS Access Key Id you provided does not exist in our records."

I have verified that I can log in to the minio UI using the same username and password that I passed to spark-shell via the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY env variables. https://github.com/apache/iceberg/issues/2168 is related but does not help me. Not sure why the credential does not work for iceberg + AWS. Any idea or an example of writing an iceberg table to S3 using a hive catalog would be highly appreciated! Thanks.
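One more sanity check worth doing here (a sketch): print which credentials the AWS default chain actually resolves inside the spark-shell JVM, since that is what S3FileIO ends up using:

import software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider

// The default chain tries environment variables, system properties, profile
// files and instance metadata in order; this prints whichever access key wins.
val resolved = DefaultCredentialsProvider.create().resolveCredentials()
println(resolved.accessKeyId())

If this prints the MinIO access key, the credentials themselves are being picked up correctly, and the failure is about where the request is sent rather than what key is sent, which is where the replies above point.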
--
Ryan Blue
Tabular