Issue on Spark on K8s with Proxy user on Kerberized HDFS : Spark-25355

2022-04-29 Thread Pralabh Kumar
Hi dev Team

SPARK-25355 added proxy-user support on K8s. However, the proxy user does
not work on K8s with Kerberized HDFS. It throws the following exception:

22/04/21 17:50:30 WARN Client: Exception encountered while connecting to the server : org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]

Exception in thread "main" java.net.ConnectException: Call From  to  failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown Source)
    at java.base/java.lang.reflect.Constructor.newInstance(Unknown Source)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:831)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:755)
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1501)
    at org.apache.hadoop.ipc.Client.call(Client.java:1443)
    at org.apache.hadoop.ipc.Client.call(Client.java:1353)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
    at com.sun.proxy.$Proxy14.getFileInfo(Unknown Source)

at



On debugging further, we found that the proxy user does not have access to
the delegation tokens on K8s. SparkSubmit.submit explicitly creates the
proxy user, and this user does not hold a delegation token.
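
To make the suspected failure mode concrete, here is a minimal plain-Java
sketch. It is a toy model only: ToyUser, createProxyUser and canAuthenticate
are hypothetical stand-ins written for illustration, not the Hadoop or Spark
classes, and the token string is made up. It assumes, as described above,
that a freshly created proxy identity starts with no delegation tokens, and
it shows only that the lookup succeeds once tokens are made available to that
identity. It does not claim this is how Spark should fix the issue.

```java
import java.util.HashMap;
import java.util.Map;

class ProxyUserTokenSketch {

    /** Toy stand-in for a user identity; NOT Hadoop's UserGroupInformation. */
    static final class ToyUser {
        final String name;
        final ToyUser realUser; // null unless this identity is a proxy user
        final Map<String, String> tokens = new HashMap<>(); // service -> token

        ToyUser(String name, ToyUser realUser) {
            this.name = name;
            this.realUser = realUser;
        }

        /** Assumption from the report: a new proxy identity holds no tokens. */
        static ToyUser createProxyUser(String name, ToyUser realUser) {
            return new ToyUser(name, realUser);
        }

        /** The toy service accepts a caller only if it holds a token for it. */
        boolean canAuthenticate(String service) {
            return tokens.containsKey(service);
        }
    }

    public static void main(String[] args) {
        // The submitting user already holds a (made-up) HDFS delegation token.
        ToyUser submitter = new ToyUser("submitter", null);
        submitter.tokens.put("hdfs", "example-delegation-token");

        // The proxy identity is created fresh, so it has nothing to present.
        ToyUser proxy = ToyUser.createProxyUser("alice", submitter);
        System.out.println(proxy.canAuthenticate("hdfs")); // prints false

        // Once the tokens are made available to the proxy, the check passes.
        proxy.tokens.putAll(submitter.tokens);
        System.out.println(proxy.canAuthenticate("hdfs")); // prints true
    }
}
```

In the toy model the first check fails for the same reason the log reports
"Client cannot authenticate via:[TOKEN, KERBEROS]": the identity making the
call has neither a token nor Kerberos credentials of its own.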


Any help with this would be appreciated.


Regards

Pralabh Kumar


Re: Apache Spark 3.3 Release

2022-04-29 Thread Maciej
Thanks for the update, Max!

Just a small clarification ‒ the following should be moved to RESOLVED:

1. SPARK-37396: Inline type hint files for files in python/pyspark/mllib
2. SPARK-37395: Inline type hint files for files in python/pyspark/ml
3. SPARK-37093: Inline type hints python/pyspark/streaming

On 4/28/22 14:42, Maxim Gekk wrote:
> Hello All,
> 
> I am going to create the first release candidate of Spark 3.3 at the
> beginning of the next week if there are no objections. Below is the list
> of allow-list features and their current status. At the moment, only one
> feature is still in progress, but it can be postponed to the next
> release, I guess:
> 
> IN PROGRESS:
> 
>  1. SPARK-28516: Data Type Formatting Functions: `to_char`
> 
> IN PROGRESS but won't/couldn't be merged to branch-3.3:
> 
>  1. SPARK-37650: Tell spark-env.sh the python interpreter
>  2. SPARK-36664: Log time spent waiting for cluster resources
>  3. SPARK-37396: Inline type hint files for files in python/pyspark/mllib
>  4. SPARK-37395: Inline type hint files for files in python/pyspark/ml
>  5. SPARK-37093: Inline type hints python/pyspark/streaming
> 
> RESOLVED:
> 
>  1. SPARK-32268: Bloom Filter Join
>  2. SPARK-38548: New SQL function: try_sum
>  3. SPARK-38063: Support SQL split_part function
>  4. SPARK-38432: Refactor framework so as JDBC dialect could compile
> filter by self way
>  5. SPARK-34863: Support nested column in Spark Parquet vectorized readers
>  6. SPARK-38194: Make Yarn memory overhead factor configurable
>  7. SPARK-37618: Support cleaning up shuffle blocks from external
> shuffle service
>  8. SPARK-37831: Add task partition id in metrics
>  9. SPARK-37974: Implement vectorized DELTA_BYTE_ARRAY and
> DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
> 10. SPARK-38590: New SQL function: try_to_binary
> 11. SPARK-37377: Refactor V2 Partitioning interface and remove
> deprecated usage of Distribution
> 12. SPARK-38085: DataSource V2: Handle DELETE commands for group-based
> sources
> 13. SPARK-34659: Web UI does not correctly get appId
> 14. SPARK-38589: New SQL function: try_avg
> 15. SPARK-37691: Support ANSI Aggregation Function: percentile_disc
> 16. SPARK-34079: Improvement CTE table scan
> 
> 
> Max Gekk
> 
> Software Engineer
> 
> Databricks, Inc.
> 
> 
> 
> On Fri, Apr 15, 2022 at 4:28 PM Maxim Gekk wrote:
> 
> Hello All,
> 
> Current status of features from the allow list for branch-3.3 is:
> 
> IN PROGRESS:
> 
>  1. SPARK-37691: Support ANSI Aggregation Function: percentile_disc
>  2. SPARK-28516: Data Type Formatting Functions: `to_char`
>  3. SPARK-34079: Improvement CTE table scan
> 
> IN PROGRESS but won't/couldn't be merged to branch-3.3:
> 
>  1. SPARK-37650: Tell spark-env.sh the python interpreter
>  2. SPARK-36664: Log time spent waiting for cluster resources
>  3. SPARK-37396: Inline type hint files for files in
> python/pyspark/mllib
>  4. SPARK-37395: Inline type hint files for files in python/pyspark/ml
>  5. SPARK-37093: Inline type hints python/pyspark/streaming
> 
> RESOLVED:
> 
>  1. SPARK-32268: Bloom Filter Join
>  2. SPARK-38548: New SQL function: try_sum
>  3. SPARK-38063: Support SQL split_part function
>  4. SPARK-38432: Refactor framework so as JDBC dialect could compile
> filter by self way
>  5. SPARK-34863: Support nested column in Spark Parquet vectorized
> readers
>  6. SPARK-38194: Make Yarn memory overhead factor configurable
>  7. SPARK-37618: Support cleaning up shuffle blocks from external
> shuffle service
>  8. SPARK-37831: Add task partition id in metrics
>  9. SPARK-37974: Implement vectorized DELTA_BYTE_ARRAY and
> DELTA_LENGTH_BYTE_ARRAY encodings for Parquet V2 support
> 10. SPARK-38590: New SQL function: try_to_binary
> 11. SPARK-37377: Refactor V2 Partitioning interface and remove
> deprecated usage of Distribution
> 12. SPARK-38085: DataSource V2: Handle DELETE commands for
> group-based sources
> 13. SPARK-34659: Web UI does not correctly get appId
> 14. SPARK-38589: New SQL function: try_avg
> 
> 
> Max Gekk
> 
> Software Engineer
> 
> Databricks, Inc.
> 
> 
> 
> On Mon, Apr 4, 2022 at 9:27 PM Maxim Gekk wrote:
> 
> Hello All,
> 
> Below is the current status of features from the allow list:
> 
> IN PROGRESS:
> 
>  1. SPARK-37396: Inline type hint files for files in
> python/pyspark/mllib
>  2. SPARK-37395: Inline type hint files for files in
> python/pyspark/ml
>  3. SPARK-37093: Inline type hints python/pyspark/streaming
>  4. SPARK-37377: Refactor V2 Partitioning interface and remove
> deprecated usage of Distribution
>  5. SPARK-38085: DataSource V2: Handle DELETE commands for