Re: Flaky build in GitHub Actions

2021-07-21 Thread Dongjoon Hyun
Thank you, Hyukjin!

Dongjoon.

On Tue, Jul 20, 2021 at 8:53 PM Hyukjin Kwon  wrote:

> I filed a ticket at GitHub. I will share more details when I get a
> response from them.
>
> On Tue, Jul 20, 2021 at 7:30 PM, Hyukjin Kwon wrote:
>
>> Hi all,
>>
>> Looks like there's something going on with the machines in GitHub Actions.
>> The build is now very flaky and keeps dying with what look like
>> out-of-memory symptoms.
>> I will try to take a closer look tomorrow, but it would be great if you
>> could find some time to take a look into it 🙏
>>
>


Fwd: Unpacking and using external modules with PySpark inside k8s

2021-07-21 Thread Mich Talebzadeh
Hi,

I am aware that some fellow members of this dev group were involved in
creating the scripts for running Spark on Kubernetes.

# To build the additional PySpark docker image
$ ./bin/docker-image-tool.sh -r <repo> -t my-tag \
    -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build


The problem I have described is being able to unpack and use packages like
yaml and pandas inside k8s.


I am using


spark-submit --verbose \
   --master k8s://$K8S_SERVER \
   --archives=hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/pyspark_venv.tar.gz \
   --deploy-mode cluster \
   --name pytest \
   --conf spark.kubernetes.namespace=spark \
   --conf spark.executor.instances=1 \
   --conf spark.kubernetes.driver.limit.cores=1 \
   --conf spark.executor.cores=1 \
   --conf spark.executor.memory=500m \
   --conf spark.kubernetes.container.image=${IMAGE} \
   --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-serviceaccount \
   --py-files hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/DSBQ.zip \
   hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/${APPLICATION}


The directory containing the code is zipped as DSBQ.zip, and Spark reads it OK.


However, it says in verbose mode


2021-07-21 17:01:29,038 WARN util.NativeCodeLoader: Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
Unpacking an archive hdfs://50.140.197.220:9000/minikube/codes/pyspark_venv.tar.gz
from /tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/pyspark_venv.tar.gz
to /opt/spark/work-dir/./pyspark_venv.tar.gz


In this case, the application tries to import pandas.


The module ${APPLICATION} has this code


import sys
import os
import pkgutil
import pkg_resources

def main():
    # Show the interpreter's module search path
    print("\n printing sys.path")
    for p in sys.path:
        print(p)
    # Show what PYTHONPATH is set to inside the container
    user_paths = os.environ['PYTHONPATH'].split(os.pathsep)
    print("\n Printing user_paths")
    for p in user_paths:
        print(p)
    v = sys.version
    print("\n python version")
    print(v)
    # List the packages visible to this interpreter
    print("\nlooping over pkg_resources.working_set")
    for r in pkg_resources.working_set:
        print(r)
    import pandas  # the import that fails inside the pod

if __name__ == "__main__":
    main()


The output is shown below

Unpacking an archive hdfs://50.140.197.220:9000/minikube/codes/pyspark_venv.tar.gz
from /tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/pyspark_venv.tar.gz
to /opt/spark/work-dir/./pyspark_venv.tar.gz

 printing sys.path
/tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538
/tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/DSBQ.zip
/opt/spark/python/lib/pyspark.zip
/opt/spark/python/lib/py4j-0.10.9-src.zip
/opt/spark/jars/spark-core_2.12-3.1.1.jar
/usr/lib/python37.zip
/usr/lib/python3.7
/usr/lib/python3.7/lib-dynload
/usr/local/lib/python3.7/dist-packages
/usr/lib/python3/dist-packages

 Printing user_paths
/tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/DSBQ.zip
/opt/spark/python/lib/pyspark.zip
/opt/spark/python/lib/py4j-0.10.9-src.zip
/opt/spark/jars/spark-core_2.12-3.1.1.jar

 python version
3.7.3 (default, Jan 22 2021, 20:04:44)
[GCC 8.3.0]

looping over pkg_resources.working_set
setuptools 57.2.0
pip 21.1.3
wheel 0.32.3
six 1.12.0
SecretStorage 2.3.1
pyxdg 0.25
PyGObject 3.30.4
pycrypto 2.6.1
keyrings.alt 3.1.1
keyring 17.1.1
entrypoints 0.3
cryptography 2.6.1
asn1crypto 0.24.0
Traceback (most recent call last):
  File "/tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/testpackages.py",
line 24, in 
main()
  File "/tmp/spark-05775b32-14f4-42a4-8570-d6b59b99f538/testpackages.py",
line 21, in main
import pandas
ModuleNotFoundError: No module named 'pandas'


I should add that if I go inside the Docker container and do:


185@4a6747d59ff2:/opt/spark/work-dir$ pip3 list
Package   Version
------------- -------
asn1crypto0.24.0
cryptography  2.6.1
entrypoints   0.3
keyring   17.1.1
keyrings.alt  3.1.1
pip   21.1.3
pycrypto  2.6.1
PyGObject 3.30.4
pyxdg 0.25
SecretStorage 2.3.1
setuptools57.2.0
six   1.12.0
wheel 0.32.3


I don't get any external packages!
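
For the record, I get into the running driver pod with something along these
lines, where the pod name is a placeholder for whatever "kubectl get pods -n
spark" reports:

kubectl exec -it <driver-pod-name> -n spark -- /bin/bash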


I opened an SO thread for this as well.


https://stackoverflow.com/questions/68461865/unpacking-and-using-external-modules-with-pyspark-inside-kubernetes


Do I need to hack the Dockerfile to install requirements.txt etc.?
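
For example, something along these lines appended to
./kubernetes/dockerfiles/spark/bindings/python/Dockerfile before rebuilding
the image (an untried sketch; it assumes a requirements.txt is available in
the Docker build context):

# hypothetical additions to the Python bindings Dockerfile
COPY requirements.txt /opt/spark/work-dir/requirements.txt
RUN pip3 install --no-cache-dir -r /opt/spark/work-dir/requirements.txt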


Thanks







*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




-- Forwarded message -
From: Mich Talebzadeh 
Date: Tue, 20 Jul 2021 at 22:51
Subject: Unpacking and using external modules with PySpark inside k8s
To: user @spark 



I have been struggling with this.


Kubernetes (not that matters minikube is working fin

Re: Flaky build in GitHub Actions

2021-07-21 Thread Holden Karau
I noticed that the worker decommissioning suite seems to be running up
against the memory limits, so I'm going to try to get our memory usage down
a bit as well while we wait for GitHub's response. In the meantime, I'm
assuming that if things pass Jenkins we are OK with merging, yes?

On Wed, Jul 21, 2021 at 10:03 AM Dongjoon Hyun 
wrote:

> Thank you, Hyukjin!
>
> Dongjoon.
>
> On Tue, Jul 20, 2021 at 8:53 PM Hyukjin Kwon  wrote:
>
>> I filed a ticket at GitHub. I will share more details when I get a
>> response from them.
>>
>> On Tue, Jul 20, 2021 at 7:30 PM, Hyukjin Kwon wrote:
>>
>>> Hi all,
>>>
>>> Looks like there's something going on with the machines in GitHub Actions.
>>> The build is now very flaky and keeps dying with what look like
>>> out-of-memory symptoms.
>>> I will try to take a closer look tomorrow, but it would be great if you
>>> could find some time to take a look into it 🙏
>>>
>>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Unpacking and using external modules with PySpark inside k8s

2021-07-21 Thread Mich Talebzadeh
I managed to sort this one out.

Please see

https://stackoverflow.com/questions/68461865/unpacking-and-using-external-modules-with-pyspark-inside-kubernetes/68476548#68476548
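
In short, the general venv-pack pattern (this is the approach described in the
Spark docs; the linked answer has the k8s-specific details, and the package
names here are just examples):

python3 -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install pandas pyyaml venv-pack
venv-pack -o pyspark_venv.tar.gz

# ship the packed venv and point Python at the unpacked copy;
# the #environment fragment tells Spark to unpack it under ./environment
export PYSPARK_PYTHON=./environment/bin/python
spark-submit \
  --archives pyspark_venv.tar.gz#environment \
  ...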

HTH






*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.





Re: Flaky build in GitHub Actions

2021-07-21 Thread Hyukjin Kwon
FYI, @Liang-Chi Hsieh is trying to control the memory usage in the test base
at https://github.com/apache/spark/pull/33447, which is looking promising now.
While I don't object to merging things, we would need to closely track how
these tests go in GitHub Actions in his PR (and in the main Apache repo).

On Thu, Jul 22, 2021 at 3:00 AM, Holden Karau wrote:

> I noticed that the worker decommissioning suite seems to be running up
> against the memory limits, so I'm going to try to get our memory usage down
> a bit as well while we wait for GitHub's response. In the meantime, I'm
> assuming that if things pass Jenkins we are OK with merging, yes?
>
> On Wed, Jul 21, 2021 at 10:03 AM Dongjoon Hyun 
> wrote:
>
>> Thank you, Hyukjin!
>>
>> Dongjoon.
>>
>> On Tue, Jul 20, 2021 at 8:53 PM Hyukjin Kwon  wrote:
>>
>>> I filed a ticket at GitHub. I will share more details when I get a
>>> response from them.
>>>
>>> On Tue, Jul 20, 2021 at 7:30 PM, Hyukjin Kwon wrote:
>>>
 Hi all,

 Looks like there's something going on with the machines in GitHub Actions.
 The build is now very flaky and keeps dying with what look like
 out-of-memory symptoms.
 I will try to take a closer look tomorrow, but it would be great if you
 could find some time to take a look into it 🙏

>>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Time to start publishing Spark Docker Images?

2021-07-21 Thread Holden Karau
Hi Folks,

Many other distributed computing projects (https://hub.docker.com/r/rayproject/ray,
https://hub.docker.com/u/daskdev) and ASF projects
(https://hub.docker.com/u/apache) now publish their images to Docker Hub.

We've already got the Docker image tooling in place. I think we'd need to
ask the ASF to grant the PMC permission to publish containers, and to update
the release steps, but I think this could be useful for folks.
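
For context, the existing tooling already covers the build-and-push flow,
roughly along these lines (repo and tag are placeholders):

./bin/docker-image-tool.sh -r <repo> -t <tag> \
  -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build
./bin/docker-image-tool.sh -r <repo> -t <tag> push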

Cheers,

Holden

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Time to start publishing Spark Docker Images?

2021-07-21 Thread Kent Yao
+1

Bests,

Kent Yao @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
A Spark enthusiast.
kyuubi: a unified multi-tenant JDBC interface for large-scale data processing and analytics, built on top of Apache Spark.
spark-authorizer: a Spark SQL extension which provides SQL Standard Authorization for Apache Spark.
spark-postgres: a library for reading data from and transferring data to Postgres / Greenplum with Spark SQL and DataFrames, 10~100x faster.
itatchi: a library that brings useful functions from various modern database management systems to Apache Spark.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org