rubenssoto opened a new issue #3966:
URL: https://github.com/apache/hudi/issues/3966


   Hello guys,
   
   We are starting to deploy Spark on Kubernetes and need Hudi to sync with our Hive metastore, but we are hitting a lot of errors, as you can see in the attached log:
   [log_2.txt](https://github.com/apache/hudi/files/7516536/log_2.txt)
   
   Our Hive metastore runs in another pod, but in the same cluster.
   
   We created a hive-site.xml, but with Spark on Kubernetes it is not possible to use $SPARK_HOME/conf, so we put the config directory on the classpath instead:
   
   ENV SPARK_CLASSPATH=$SPARK_HOME/cluster-conf
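
   For reference, the hive-site.xml we ship in that directory looks roughly like this (a minimal sketch; the metastore Service name and port are placeholders for our setup):
   
   ```
   <?xml version="1.0"?>
   <configuration>
     <!-- hypothetical ClusterIP Service in front of the metastore pod -->
     <property>
       <name>hive.metastore.uris</name>
       <value>thrift://hive-metastore.default.svc.cluster.local:9083</value>
     </property>
   </configuration>
   ```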
   
   The Dockerfile for our Spark image:
   
   ```
   # Global args
   ARG SPARK_VERSION=3.0.3
   ARG HADOOP_VERSION=3.2.0-cloud
   ARG SCALA_VERSION=2.12
   
   # Build Spark
   FROM openjdk:8-jdk-slim as build
   
   ENV DEBIAN_FRONTEND=noninteractive
   
   ENV M2_HOME=/opt/maven
   ENV MAVEN_HOME=/opt/maven
   ENV PATH=${PATH}:${M2_HOME}/bin
   
   ARG SPARK_VERSION
   ARG HADOOP_VERSION
   ARG SCALA_VERSION
   
    RUN apt-get -qq update && \
        apt-get -qq upgrade -y && \
        apt-get -qq install -y git wget curl && \
        wget -q https://dlcdn.apache.org/maven/maven-3/3.8.1/binaries/apache-maven-3.8.1-bin.tar.gz -P /tmp && \
        tar xf /tmp/apache-maven-3.8.1-bin.tar.gz -C /opt && \
        ln -s /opt/apache-maven-3.8.1 /opt/maven && \
        apt-get -qq install gnupg && \
        echo "deb https://repo.scala-sbt.org/scalasbt/debian all main" | tee -a /etc/apt/sources.list.d/sbt.list && \
        echo "deb https://repo.scala-sbt.org/scalasbt/debian /" | tee -a /etc/apt/sources.list.d/sbt_old.list && \
        curl -sL "https://keyserver.ubuntu.com/pks/lookup?op=get&search=0x2EE0EA64E40A89B84B2DF73499E82A75642AC823" | apt-key add - && \
        apt-get -qq update && \
        apt-get -qq install sbt && \
        apt-get -qq install -y r-base && \
        apt install -y python && \
        curl https://bootstrap.pypa.io/pip/2.7/get-pip.py -o get-pip.py && \
        python get-pip.py && \
        apt install -y python3 python3-pip && \
        rm -r /usr/lib/python*/ensurepip && \
        pip install --upgrade pip setuptools
   
    RUN cd / && \
        git clone https://github.com/apache/spark.git --branch v${SPARK_VERSION} --single-branch && \
        cd /spark && \
        dev/make-distribution.sh \
            --name hadoop-${HADOOP_VERSION}-scala-${SCALA_VERSION} --pip --tgz -DskipTests \
            -Phadoop-3.2 \
            -Phadoop-cloud \
            -Pkubernetes \
            -Phive && \
        cp spark-${SPARK_VERSION}-bin-hadoop-${HADOOP_VERSION}-scala-${SCALA_VERSION}.tgz /
   
   # Spark image
   FROM openjdk:8-jre-slim
   
    ENV BASE_IMAGE=openjdk:8-jre-slim
   
   RUN set -ex && \
       sed -i 's/http:/https:/g' /etc/apt/sources.list && \
       apt-get update && \
       ln -s /lib /lib64 && \
        apt install -y bash tini libc6 libpam-modules krb5-user libnss3 wget bzip2 && \
       rm /bin/sh && \
       ln -sv /bin/bash /bin/sh && \
       echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \
       chgrp root /etc/passwd && chmod ug+rw /etc/passwd && \
       rm -rf /var/cache/apt/*
   
   RUN apt-get -qq update && \
       apt-get -qq upgrade -y && \
       apt-get -qq install -y coreutils \
                               cron \
                               initscripts \
                               git \
                               curl \
                               unixodbc-dev \
                               sasl2-bin \
                               libsasl2-2 \
                               libsasl2-modules \
                               libsasl2-dev \
                               g++ \
                               gcc \
                               libspatialindex-dev
   
   ARG SPARK_VERSION
   ARG HADOOP_VERSION
   ARG SCALA_VERSION
   
   ENV SPARK_HOME=/opt/spark
   ENV SPARK_CONF_DIR=$SPARK_HOME/conf
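    # cluster-conf holds our Hive config; it is added to SPARK_CLASSPATH because
    # $SPARK_HOME/conf cannot be used with Spark on Kubernetes (see above)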
   ENV SPARK_CLASSPATH=$SPARK_HOME/cluster-conf
   
   ENV PYTHONHASHSEED=0
   ENV CONDA_DIR=/opt/conda
   ENV SHELL=/bin/bash
   
   ENV PATH=$PATH:$SPARK_HOME/bin:$CONDA_DIR/bin
   
   ENV M2_HOME=/opt/maven
   ENV MAVEN_HOME=/opt/maven
   ENV PATH=${PATH}:${M2_HOME}/bin 
   
   ARG MINICONDA_VERSION=4.8.3
   ARG MINICONDA_MD5=d63adf39f2c220950a063e0529d4ff74
   ARG CONDA_VERSION=4.8.3
   ARG PYTHON_VERSION=3.7.8
   
   ARG spark_uid=185
   
    # Install Conda (https://github.com/jupyter/docker-stacks/blob/6d42503c684f3de9b17ce92a6b0c952ef2d1ecd8/base-notebook/Dockerfile#L78-L101)
   RUN mkdir -p $CONDA_DIR && \
       cd /tmp && \
        wget --quiet https://repo.continuum.io/miniconda/Miniconda3-py38_${MINICONDA_VERSION}-Linux-x86_64.sh && \
        echo "${MINICONDA_MD5} *Miniconda3-py38_${MINICONDA_VERSION}-Linux-x86_64.sh" | md5sum -c - && \
        /bin/bash Miniconda3-py38_${MINICONDA_VERSION}-Linux-x86_64.sh -f -b -p $CONDA_DIR && \
       rm Miniconda3-py38_${MINICONDA_VERSION}-Linux-x86_64.sh && \
       echo "conda ${CONDA_VERSION}" >> $CONDA_DIR/conda-meta/pinned && \
       conda config --system --prepend channels conda-forge && \
       conda config --system --set auto_update_conda false && \
       conda config --system --set show_channel_urls true && \
       conda config --system --set channel_priority strict && \
        if [ ! $PYTHON_VERSION = 'default' ]; then conda install --yes python=$PYTHON_VERSION; fi && \
        conda list python | grep '^python ' | tr -s ' ' | cut -d '.' -f 1,2 | sed 's/$/.*/' >> $CONDA_DIR/conda-meta/pinned && \
       conda install --quiet --yes conda && \
       conda install --quiet --yes pip && \
       conda install --quiet --yes numpy scipy pandas scikit-learn && \
       conda install --quiet --yes -c conda-forge pyarrow && \
       conda update --all --quiet --yes && \
       conda clean --all -f -y
   
   # Install Spark 
    COPY --from=build /spark-${SPARK_VERSION}-bin-hadoop-${HADOOP_VERSION}-scala-${SCALA_VERSION}.tgz /
    RUN tar -xzf /spark-${SPARK_VERSION}-bin-hadoop-${HADOOP_VERSION}-scala-${SCALA_VERSION}.tgz -C /opt/ && \
        ln -s /opt/spark-${SPARK_VERSION}-bin-hadoop-${HADOOP_VERSION}-scala-${SCALA_VERSION} $SPARK_HOME && \
        rm -f /spark-${SPARK_VERSION}-bin-hadoop-${HADOOP_VERSION}-scala-${SCALA_VERSION}.tgz && \
        mkdir -p $SPARK_HOME/work-dir && \
        mkdir -p $SPARK_HOME/spark-warehouse && \
        mkdir -p $SPARK_HOME/cluster-conf
   
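    # ship the same config (including hive-site.xml) to both the default conf dir and the classpath dir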
   COPY config/* $SPARK_CONF_DIR/
   COPY config/* $SPARK_HOME/cluster-conf/
   COPY entrypoint.sh /opt/
   RUN chmod +x /opt/entrypoint.sh
   
   WORKDIR $SPARK_HOME/work-dir
   RUN chmod g+w /opt/spark/work-dir
   
    ## Install Hudi
    RUN wget -q https://dlcdn.apache.org/maven/maven-3/3.8.3/binaries/apache-maven-3.8.3-bin.tar.gz -P /tmp && \
        tar xf /tmp/apache-maven-3.8.3-bin.tar.gz -C /opt && \
        ln -s /opt/apache-maven-3.8.3 /opt/maven
   
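    # fetch the Hudi and spark-avro jars from Maven Central straight into Spark's jars dir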
    RUN mvn dependency:copy -Dartifact=org.apache.hudi:hudi-spark3-bundle_2.12:0.9.0 -DoutputDirectory=$SPARK_HOME/jars/
    RUN mvn dependency:copy -Dartifact=org.apache.spark:spark-avro_2.12:3.1.2 -DoutputDirectory=$SPARK_HOME/jars/
    RUN mvn dependency:copy -Dartifact=org.apache.hudi:hudi-hive-sync:0.9.0 -DoutputDirectory=$SPARK_HOME/jars/
   
   
   ENTRYPOINT [ "/opt/entrypoint.sh" ]
   
   USER ${spark_uid}
   ```
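
   For completeness, the write that triggers the Hive sync looks roughly like the sketch below (a minimal PySpark sketch, not our exact job; the table, database, path, and metastore Service name are placeholders, and the option keys follow the Hudi 0.9.0 docs):
   
   ```
   from pyspark.sql import SparkSession
   
   # hypothetical metastore Service; must match hive.metastore.uris in hive-site.xml
   metastore_uri = "thrift://hive-metastore.default.svc.cluster.local:9083"
   
   spark = (
       SparkSession.builder
       .appName("hudi-hive-sync-test")
       .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
       .config("spark.hadoop.hive.metastore.uris", metastore_uri)
       .getOrCreate()
   )
   
   df = spark.createDataFrame(
       [(1, "a", "2021-11-18"), (2, "b", "2021-11-19")],
       ["id", "value", "dt"],
   )
   
   hudi_options = {
       "hoodie.table.name": "my_table",                      # placeholder
       "hoodie.datasource.write.recordkey.field": "id",
       "hoodie.datasource.write.precombine.field": "dt",
       "hoodie.datasource.write.partitionpath.field": "dt",
       "hoodie.datasource.hive_sync.enable": "true",
       "hoodie.datasource.hive_sync.mode": "hms",            # sync through the metastore thrift service
       "hoodie.datasource.hive_sync.database": "default",
       "hoodie.datasource.hive_sync.table": "my_table",
       "hoodie.datasource.hive_sync.partition_fields": "dt",
   }
   
   (
       df.write.format("hudi")
       .options(**hudi_options)
       .mode("append")
       .save("s3a://my-bucket/hudi/my_table")                # placeholder path
   )
   ```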
   
   
   Could you help me?

