This is my full script for launching spark-shell:

# add Iceberg dependency
export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=minio
export AWS_SECRET_ACCESS_KEY=minio123

ICEBERG_VERSION=0.11.1
DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"

MINIOSERVER=192.168.176.5


# add AWS dependency
AWS_SDK_VERSION=2.15.40
AWS_MAVEN_GROUP=software.amazon.awssdk
AWS_PACKAGES=(
    "bundle"
    "url-connection-client"
)
for pkg in "${AWS_PACKAGES[@]}"; do
    DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
done

# start Spark SQL client shell
/spark/bin/spark-shell --packages $DEPENDENCIES \
    --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
    --conf spark.sql.catalog.hive_test.type=hive \
    --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.hadoop.HadoopFileIO \
    --conf spark.sql.catalog.hive_test.warehouse=s3a://east/warehouse \
    --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
    --conf spark.hadoop.fs.s3a.access.key=minio \
    --conf spark.hadoop.fs.s3a.secret.key=minio123 \
    --conf spark.hadoop.fs.s3a.path.style.access=true \
    --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
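
As a quick sanity check after the shell comes up, I also run the snippet below to see whether the S3A class is actually on the driver classpath. This is just a sketch: if Class.forName throws, the hadoop-aws jar never made it into this JVM (for example because its version does not match the Hadoop version bundled with this Spark build); if it succeeds, the ClassNotFoundException is presumably coming from another JVM such as the metastore.

// run inside spark-shell
// throws ClassNotFoundException if hadoop-aws / S3A is not on the classpath
val s3aClass = Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")
// print which jar the class was loaded from
// (getCodeSource can be null for classes on the boot classpath)
println(s3aClass.getProtectionDomain.getCodeSource.getLocation)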


Let me know if anything is missing. Thanks.

On Mon, Aug 16, 2021 at 7:29 PM Jack Ye <yezhao...@gmail.com> wrote:

> Have you included the hadoop-aws jar?
> https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws
> -Jack
>
> On Mon, Aug 16, 2021 at 7:09 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>
>> Jack,
>>
>> You are right. S3FileIO will not work on MinIO since MinIO does not
>> support ACLs: https://docs.min.io/docs/minio-server-limits-per-tenant.html
>>
>> To use Iceberg with MinIO + S3A, I used the script below to launch
>> spark-shell:
>>
>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>     --conf spark.sql.catalog.hive_test.type=hive \
>>     --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.hadoop.HadoopFileIO \
>>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/warehouse \
>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>
>>
>>
>> *The spark code:*
>>
>> import org.apache.spark.sql.SparkSession
>> val values = List(1,2,3,4,5)
>>
>> val spark = SparkSession.builder().master("local").getOrCreate()
>> import spark.implicits._
>> val df = values.toDF()
>>
>> val core = "mytable"
>> val table = s"hive_test.mydb.${core}"
>> val s3IcePath = s"s3a://east/${core}.ice"
>>
>> df.writeTo(table)
>>     .tableProperty("write.format.default", "parquet")
>>     .tableProperty("location", s3IcePath)
>>     .createOrReplace()
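>>
>> (For reference, this is the hypothetical read-back I plan to run to verify
>> the table once the write succeeds:)
>>
>> // verify the table contents after a successful write
>> spark.sql(s"SELECT * FROM ${table}").show()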
>>
>>
>> *Still the same error:*
>> java.lang.ClassNotFoundException: Class
>> org.apache.hadoop.fs.s3a.S3AFileSystem not found
>>
>>
>> What else could be wrong? Thanks for any clue.
>>
>>
>>
>> On Mon, Aug 16, 2021 at 9:35 AM Jack Ye <yezhao...@gmail.com> wrote:
>>
>>> Sorry for the late reply, I thought I replied on Friday but the email
>>> did not send successfully.
>>>
>>> As Daniel said, you don't need to set up S3A if you are using S3FileIO.
>>>
>>> The S3FileIO by default uses the default credentials chain, checking
>>> each credential source one by one:
>>> https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html#credentials-chain
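>>>
>>> As a rough sketch (assuming the awssdk bundle jar is on the classpath),
>>> you can check what the default chain resolves to from inside spark-shell:
>>>
>>> import software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider
>>> // resolves env vars, system properties, profile files, etc. in order
>>> val creds = DefaultCredentialsProvider.create().resolveCredentials()
>>> println(creds.accessKeyId())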
>>>
>>> If you would like to use a specialized credential provider, you can
>>> directly customize your S3 client:
>>> https://iceberg.apache.org/aws/#aws-client-customization
>>>
>>> It looks like you are trying to use MinIO to mount an S3A file system?
>>> If you have to use MinIO, then there is not a way to integrate it with
>>> S3FileIO right now. (Maybe I am wrong on this; I don't know much about
>>> MinIO.)
>>>
>>> To directly use S3FileIO with HiveCatalog, simply do:
>>>
>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>     --conf spark.sql.catalog.hive_test.type=hive \
>>>     --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>>     --conf spark.sql.catalog.hive_test.warehouse=s3://bucket
>>>
>>> Best,
>>> Jack Ye
>>>
>>>
>>>
>>> On Sun, Aug 15, 2021 at 2:53 PM Lian Jiang <jiangok2...@gmail.com>
>>> wrote:
>>>
>>>> Thanks. I prefer S3FileIO as it is recommended by Iceberg. Do you have
>>>> a sample using the Hive catalog, S3FileIO, the Spark API (as opposed to
>>>> SQL), and S3 access.key and secret.key? It is hard to get all the
>>>> settings right for this combination without an example. Appreciate any
>>>> help.
>>>>
>>>> On Fri, Aug 13, 2021 at 6:01 PM Daniel Weeks <daniel.c.we...@gmail.com>
>>>> wrote:
>>>>
>>>>> So, if I recall correctly, the hive server does need access to check
>>>>> and create paths for table locations.
>>>>>
>>>>> There may be an option to disable this behavior, but otherwise the fs
>>>>> implementation probably needs to be available to the hive metastore.
>>>>>
>>>>> -Dan
>>>>>
>>>>> On Fri, Aug 13, 2021, 4:48 PM Lian Jiang <jiangok2...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Thanks Daniel.
>>>>>>
>>>>>> After modifying the script to,
>>>>>>
>>>>>> export AWS_REGION=us-east-1
>>>>>> export AWS_ACCESS_KEY_ID=minio
>>>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>>>
>>>>>> ICEBERG_VERSION=0.11.1
>>>>>>
>>>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION,org.apache.iceberg:iceberg-hive-runtime:$ICEBERG_VERSION,org.apache.hadoop:hadoop-aws:3.2.0"
>>>>>>
>>>>>> MINIOSERVER=192.168.160.5
>>>>>>
>>>>>>
>>>>>> # add AWS dependency
>>>>>> AWS_SDK_VERSION=2.15.40
>>>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>>>> AWS_PACKAGES=(
>>>>>>     "bundle"
>>>>>>     "url-connection-client"
>>>>>> )
>>>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>>>> done
>>>>>>
>>>>>> # start Spark SQL client shell
>>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>>     --conf spark.sql.catalog.hive_test.type=hive \
>>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>>
>>>>>> I got: MetaException: java.lang.RuntimeException:
>>>>>> java.lang.ClassNotFoundException: Class
>>>>>> org.apache.hadoop.fs.s3a.S3AFileSystem not found. My Hive server is not
>>>>>> using S3, so it should not cause this error. Any idea what dependency
>>>>>> I could be missing? Thanks.
>>>>>>
>>>>>> On Fri, Aug 13, 2021 at 4:03 PM Daniel Weeks <
>>>>>> daniel.c.we...@gmail.com> wrote:
>>>>>>
>>>>>>> Hey Lian,
>>>>>>>
>>>>>>> At a cursory glance, it appears that you might be mixing two different
>>>>>>> FileIO implementations, which may be why you are not getting the
>>>>>>> expected result.
>>>>>>>
>>>>>>> When you set --conf
>>>>>>> spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO,
>>>>>>> you're actually switching over to the native S3 implementation within
>>>>>>> Iceberg (as opposed to S3AFileSystem via HadoopFileIO). However, all of
>>>>>>> the following settings to set up access are then set for the
>>>>>>> S3AFileSystem (which would not be used with S3FileIO).
>>>>>>>
>>>>>>> You might try just removing that line since it should use the
>>>>>>> HadoopFileIO at that point and may work.
>>>>>>>
>>>>>>> Hope that's helpful,
>>>>>>> -Dan
>>>>>>>
>>>>>>> On Fri, Aug 13, 2021 at 3:50 PM Lian Jiang <jiangok2...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am trying to create an Iceberg table on MinIO S3 and Hive.
>>>>>>>>
>>>>>>>> *This is how I launch spark-shell:*
>>>>>>>>
>>>>>>>> # add Iceberg dependency
>>>>>>>> export AWS_REGION=us-east-1
>>>>>>>> export AWS_ACCESS_KEY_ID=minio
>>>>>>>> export AWS_SECRET_ACCESS_KEY=minio123
>>>>>>>>
>>>>>>>> ICEBERG_VERSION=0.11.1
>>>>>>>>
>>>>>>>> DEPENDENCIES="org.apache.iceberg:iceberg-spark3-runtime:$ICEBERG_VERSION"
>>>>>>>>
>>>>>>>> MINIOSERVER=192.168.160.5
>>>>>>>>
>>>>>>>>
>>>>>>>> # add AWS dependency
>>>>>>>> AWS_SDK_VERSION=2.15.40
>>>>>>>> AWS_MAVEN_GROUP=software.amazon.awssdk
>>>>>>>> AWS_PACKAGES=(
>>>>>>>>     "bundle"
>>>>>>>>     "url-connection-client"
>>>>>>>> )
>>>>>>>> for pkg in "${AWS_PACKAGES[@]}"; do
>>>>>>>>     DEPENDENCIES+=",$AWS_MAVEN_GROUP:$pkg:$AWS_SDK_VERSION"
>>>>>>>> done
>>>>>>>>
>>>>>>>> # start Spark SQL client shell
>>>>>>>> /spark/bin/spark-shell --packages $DEPENDENCIES \
>>>>>>>>     --conf spark.sql.catalog.hive_test=org.apache.iceberg.spark.SparkCatalog \
>>>>>>>>     --conf spark.sql.catalog.hive_test.warehouse=s3a://east/prefix \
>>>>>>>>     --conf spark.sql.catalog.hive_test.type=hive \
>>>>>>>>     --conf spark.sql.catalog.hive_test.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
>>>>>>>>     --conf spark.hadoop.fs.s3a.endpoint=http://$MINIOSERVER:9000 \
>>>>>>>>     --conf spark.hadoop.fs.s3a.access.key=minio \
>>>>>>>>     --conf spark.hadoop.fs.s3a.secret.key=minio123 \
>>>>>>>>     --conf spark.hadoop.fs.s3a.path.style.access=true \
>>>>>>>>     --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
>>>>>>>>
>>>>>>>> *Here is the spark code to create the iceberg table:*
>>>>>>>>
>>>>>>>> import org.apache.spark.sql.SparkSession
>>>>>>>> val values = List(1,2,3,4,5)
>>>>>>>>
>>>>>>>> val spark = SparkSession.builder().master("local").getOrCreate()
>>>>>>>> import spark.implicits._
>>>>>>>> val df = values.toDF()
>>>>>>>>
>>>>>>>> val core = "mytable8"
>>>>>>>> val table = s"hive_test.mydb.${core}"
>>>>>>>> val s3IcePath = s"s3a://spark-test/${core}.ice"
>>>>>>>>
>>>>>>>> df.writeTo(table)
>>>>>>>>     .tableProperty("write.format.default", "parquet")
>>>>>>>>     .tableProperty("location", s3IcePath)
>>>>>>>>     .createOrReplace()
>>>>>>>>
>>>>>>>> I got the error "The AWS Access Key Id you provided does not exist in
>>>>>>>> our records."
>>>>>>>>
>>>>>>>> I have verified that I can log in to the MinIO UI using the same
>>>>>>>> username and password that I passed to spark-shell via the
>>>>>>>> AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY env variables.
>>>>>>>> https://github.com/apache/iceberg/issues/2168 is related but does not
>>>>>>>> help me. Not sure why the credentials do not work for Iceberg + AWS.
>>>>>>>> Any idea, or an example of writing an Iceberg table to S3 using a Hive
>>>>>>>> catalog, would be highly appreciated! Thanks.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
