rubenssoto opened a new issue #3966: URL: https://github.com/apache/hudi/issues/3966
Hello guys, we are starting to deploy Spark on Kubernetes and we need Hudi to sync with our Hive metastore, but we are facing a lot of errors, as you can see in the following logs: [log_2.txt](https://github.com/apache/hudi/files/7516536/log_2.txt)

Our Hive metastore runs in another pod in the same cluster. We created a hive-site.xml, but with Spark on Kubernetes it is not possible to use `$SPARK_HOME/conf`, so we put the file on the classpath instead via `ENV SPARK_CLASSPATH=$SPARK_HOME/cluster-conf`.

The Dockerfile of our Spark image (a sketch of the write path we are testing follows it):

```
# Global args
ARG SPARK_VERSION=3.0.3
ARG HADOOP_VERSION=3.2.0-cloud
ARG SCALA_VERSION=2.12

# Build Spark
FROM openjdk:8-jdk-slim as build

ENV DEBIAN_FRONTEND=noninteractive
ENV M2_HOME=/opt/maven
ENV MAVEN_HOME=/opt/maven
ENV PATH=${PATH}:${M2_HOME}/bin

ARG SPARK_VERSION
ARG HADOOP_VERSION
ARG SCALA_VERSION

RUN apt-get -qq update && \
    apt-get -qq upgrade -y && \
    apt-get -qq install -y git wget curl && \
    wget -q https://dlcdn.apache.org/maven/maven-3/3.8.1/binaries/apache-maven-3.8.1-bin.tar.gz -P /tmp && \
    tar xf /tmp/apache-maven-3.8.1-bin.tar.gz -C /opt && \
    ln -s /opt/apache-maven-3.8.1 /opt/maven && \
    apt-get -qq install gnupg && \
    echo "deb https://repo.scala-sbt.org/scalasbt/debian all main" | tee -a /etc/apt/sources.list.d/sbt.list && \
    echo "deb https://repo.scala-sbt.org/scalasbt/debian /" | tee -a /etc/apt/sources.list.d/sbt_old.list && \
    curl -sL "https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x2EE0EA64E40A89B84B2DF73499E82A75642AC823" | apt-key add && \
    apt-get -qq update && \
    apt-get -qq install sbt && \
    apt-get -qq install -y r-base && \
    apt install -y python && \
    curl https://bootstrap.pypa.io/pip/2.7/get-pip.py -o get-pip.py && \
    python get-pip.py && \
    apt install -y python3 python3-pip && \
    rm -r /usr/lib/python*/ensurepip && \
    pip install --upgrade pip setuptools

RUN cd / && \
    git clone https://github.com/apache/spark.git --branch v${SPARK_VERSION} --single-branch && \
    cd /spark && \
    dev/make-distribution.sh \
        --name hadoop-${HADOOP_VERSION}-scala-${SCALA_VERSION} --pip --tgz -DskipTests \
        -Phadoop-3.2 \
        -Phadoop-cloud \
        -Pkubernetes \
        -Phive && \
    cp spark-${SPARK_VERSION}-bin-hadoop-${HADOOP_VERSION}-scala-${SCALA_VERSION}.tgz /

# Spark image
FROM openjdk:8-jre-slim

ENV BASE_IMAGE openjdk:8-jre-slim

RUN set -ex && \
    sed -i 's/http:/https:/g' /etc/apt/sources.list && \
    apt-get update && \
    ln -s /lib /lib64 && \
    apt install -y bash tini libc6 libpam-modules krb5-user libnss3 wget bzip2 && \
    rm /bin/sh && \
    ln -sv /bin/bash /bin/sh && \
    echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \
    chgrp root /etc/passwd && chmod ug+rw /etc/passwd && \
    rm -rf /var/cache/apt/*

RUN apt-get -qq update && \
    apt-get -qq upgrade -y && \
    apt-get -qq install -y coreutils \
        cron \
        initscripts \
        git \
        curl \
        unixodbc-dev \
        sasl2-bin \
        libsasl2-2 \
        libsasl2-modules \
        libsasl2-dev \
        g++ \
        gcc \
        libspatialindex-dev

ARG SPARK_VERSION
ARG HADOOP_VERSION
ARG SCALA_VERSION

ENV SPARK_HOME=/opt/spark
ENV SPARK_CONF_DIR=$SPARK_HOME/conf
ENV SPARK_CLASSPATH=$SPARK_HOME/cluster-conf
ENV PYTHONHASHSEED=0
ENV CONDA_DIR=/opt/conda
ENV SHELL=/bin/bash
ENV PATH=$PATH:$SPARK_HOME/bin:$CONDA_DIR/bin
ENV M2_HOME=/opt/maven
ENV MAVEN_HOME=/opt/maven
ENV PATH=${PATH}:${M2_HOME}/bin

ARG MINICONDA_VERSION=4.8.3
ARG MINICONDA_MD5=d63adf39f2c220950a063e0529d4ff74
ARG CONDA_VERSION=4.8.3
ARG PYTHON_VERSION=3.7.8
ARG spark_uid=185

# Install Conda (https://github.com/jupyter/docker-stacks/blob/6d42503c684f3de9b17ce92a6b0c952ef2d1ecd8/base-notebook/Dockerfile#L78-L101)
RUN mkdir -p $CONDA_DIR && \
    cd /tmp && \
    wget --quiet https://repo.continuum.io/miniconda/Miniconda3-py38_${MINICONDA_VERSION}-Linux-x86_64.sh && \
    echo "${MINICONDA_MD5} *Miniconda3-py38_${MINICONDA_VERSION}-Linux-x86_64.sh" | md5sum -c - && \
    /bin/bash Miniconda3-py38_${MINICONDA_VERSION}-Linux-x86_64.sh -f -b -p $CONDA_DIR && \
    rm Miniconda3-py38_${MINICONDA_VERSION}-Linux-x86_64.sh && \
    echo "conda ${CONDA_VERSION}" >> $CONDA_DIR/conda-meta/pinned && \
    conda config --system --prepend channels conda-forge && \
    conda config --system --set auto_update_conda false && \
    conda config --system --set show_channel_urls true && \
    conda config --system --set channel_priority strict && \
    if [ ! $PYTHON_VERSION = 'default' ]; then conda install --yes python=$PYTHON_VERSION; fi && \
    conda list python | grep '^python ' | tr -s ' ' | cut -d '.' -f 1,2 | sed 's/$/.*/' >> $CONDA_DIR/conda-meta/pinned && \
    conda install --quiet --yes conda && \
    conda install --quiet --yes pip && \
    conda install --quiet --yes numpy scipy pandas scikit-learn && \
    conda install --quiet --yes -c conda-forge pyarrow && \
    conda update --all --quiet --yes && \
    conda clean --all -f -y

# Install Spark
COPY --from=build /spark-${SPARK_VERSION}-bin-hadoop-${HADOOP_VERSION}-scala-${SCALA_VERSION}.tgz /
RUN tar -xzf /spark-${SPARK_VERSION}-bin-hadoop-${HADOOP_VERSION}-scala-${SCALA_VERSION}.tgz -C /opt/ && \
    ln -s /opt/spark-${SPARK_VERSION}-bin-hadoop-${HADOOP_VERSION}-scala-${SCALA_VERSION} $SPARK_HOME && \
    rm -f /spark-${SPARK_VERSION}-bin-hadoop-${HADOOP_VERSION}-scala-${SCALA_VERSION}.tgz && \
    mkdir -p $SPARK_HOME/work-dir && \
    mkdir -p $SPARK_HOME/spark-warehouse && \
    mkdir -p $SPARK_HOME/cluster-conf

COPY config/* $SPARK_CONF_DIR/
COPY config/* $SPARK_HOME/cluster-conf/
COPY entrypoint.sh /opt/
RUN chmod +x /opt/entrypoint.sh

WORKDIR $SPARK_HOME/work-dir
RUN chmod g+w /opt/spark/work-dir

## Install Hudi
RUN wget -q https://dlcdn.apache.org/maven/maven-3/3.8.3/binaries/apache-maven-3.8.3-bin.tar.gz -P /tmp && \
    tar xf /tmp/apache-maven-3.8.3-bin.tar.gz -C /opt && \
    ln -s /opt/apache-maven-3.8.3 /opt/maven
RUN mvn dependency:copy -Dartifact=org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0 -DoutputDirectory=$SPARK_HOME/jars/
RUN mvn dependency:copy -Dartifact=org.apache.spark:spark-avro_2.12:3.1.2 -DoutputDirectory=$SPARK_HOME/jars/
RUN mvn dependency:copy -Dartifact=org.apache.hudi:hudi-hive-sync:0.9.0 -DoutputDirectory=$SPARK_HOME/jars/

ENTRYPOINT [ "/opt/entrypoint.sh" ]

USER ${spark_uid}
```
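For reference, this is roughly the write path we are testing, as a minimal PySpark sketch. The database, table, bucket, and metastore service names below are illustrative placeholders, not our real values, and we set `spark.hadoop.hive.metastore.uris` explicitly only to rule out classpath problems (with hive-site.xml correctly on the classpath it should be redundant). We use `hms` sync mode on the assumption that only the metastore thrift service, not HiveServer2, is exposed from the pod:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-hive-sync-test")
    # Normally picked up from hive-site.xml; set explicitly here while debugging.
    # "hive-metastore" is a placeholder for the metastore pod's service name.
    .config("spark.hadoop.hive.metastore.uris", "thrift://hive-metastore:9083")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "a", "2021-11-15")], ["id", "value", "dt"])

hudi_options = {
    "hoodie.table.name": "example_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "dt",
    "hoodie.datasource.write.precombine.field": "value",
    # Hive sync: register/update the table in the external metastore.
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",  # talk to the thrift metastore directly
    "hoodie.datasource.hive_sync.database": "example_db",
    "hoodie.datasource.hive_sync.table": "example_table",
    "hoodie.datasource.hive_sync.partition_fields": "dt",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
}

(
    df.write.format("hudi")
    .options(**hudi_options)
    .mode("overwrite")
    .save("s3a://example-bucket/example_table")
)
```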
Could you help me?
