DataSourceV2 implementation for JDBC sources

2018-12-28 Thread Thomas D'Silva
Hi, I am trying to use the DataSourceV2 API to implement a Spark connector for Apache Phoenix. I am not using JDBCRelation because I want to optimize how partitions are created during reads and to support more complicated filter pushdown. For reading I am using JdbcUtils.resultSetToSpark…
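For reference, here is a minimal sketch of the rough shape such a connector takes on the Spark 2.4 DataSourceV2 API, planning its own JDBC partitions. The class names, the single INT column, and the hard-coded id ranges are all hypothetical; a real connector would derive the schema and split points from table metadata, and would mix SupportsPushDownFilters into the reader for filter pushdown.

    import java.sql.DriverManager
    import java.util.{List => JList}

    import scala.collection.JavaConverters._

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.sources.v2.reader.{DataSourceReader, InputPartition, InputPartitionReader}
    import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
    import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

    // Entry point: spark.read.format("com.example.RangeJdbcSource")
    //   .option("url", jdbcUrl).option("table", "T").load()
    class RangeJdbcSource extends DataSourceV2 with ReadSupport {
      override def createReader(options: DataSourceOptions): DataSourceReader =
        new RangeJdbcReader(options.get("url").get, options.get("table").get)
    }

    class RangeJdbcReader(url: String, table: String) extends DataSourceReader {
      // Single INT column for brevity; a real connector reads table metadata.
      override def readSchema(): StructType =
        StructType(Seq(StructField("id", IntegerType)))

      // The hook this thread is about: the connector controls partitioning
      // itself instead of inheriting JDBCRelation's strategy.
      override def planInputPartitions(): JList[InputPartition[InternalRow]] = {
        val parts: Seq[InputPartition[InternalRow]] = Seq(
          new RangePartition(url, table, 0, 1000),
          new RangePartition(url, table, 1000, 2000))
        parts.asJava
      }
    }

    // Shipped to executors, so it must be serializable (InputPartition is).
    class RangePartition(url: String, table: String, lo: Int, hi: Int)
        extends InputPartition[InternalRow] {
      override def createPartitionReader(): InputPartitionReader[InternalRow] =
        new InputPartitionReader[InternalRow] {
          private val conn = DriverManager.getConnection(url)
          private val rs = conn.createStatement().executeQuery(
            s"SELECT id FROM $table WHERE id >= $lo AND id < $hi")

          override def next(): Boolean = rs.next()
          override def get(): InternalRow = InternalRow(rs.getInt(1))
          override def close(): Unit = conn.close()
        }
    }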

RE: barrier execution mode with DataFrame and dynamic allocation

2018-12-28 Thread Ilya Matiach
Hi Xiangrui, Thank you for the quick reply and the great questions. “How does mmlspark handle dynamic allocation? Do you have a watch thread on the driver to restart the job if there are more workers? And when the number of workers decreases, can training continue without the driver involved?” Currently…
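For readers following along, here is a minimal sketch of barrier execution mode on an RDD in Spark 2.4 (it says nothing about mmlspark's internals). All tasks in a barrier stage are scheduled together or not at all, which is exactly why Spark 2.4 fails fast when a barrier stage is submitted with dynamic allocation enabled, and why this thread needs a workaround:

    import org.apache.spark.BarrierTaskContext
    import org.apache.spark.sql.SparkSession

    object BarrierSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("barrier-sketch").getOrCreate()
        val sc = spark.sparkContext

        // All 4 tasks start together; if the cluster cannot hold 4 concurrent
        // tasks, the stage does not start at all.
        val doubled = sc.parallelize(1 to 8, 4)
          .barrier()
          .mapPartitions { rows =>
            val ctx = BarrierTaskContext.get()
            ctx.barrier() // global synchronization point, like MPI_Barrier
            rows.map(_ * 2)
          }
          .collect()

        println(doubled.mkString(", "))
        spark.stop()
      }
    }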

Re: Add packages for ipython notebook

2018-12-28 Thread Haibo Yan
Thank you for replying, Sean. The error is as follows: Py4JJavaError: An error occurred while calling o49.load. : org.apache.spark.sql.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;…

Add packages for ipython notebook

2018-12-28 Thread Haibo Yan
Dear Spark dev, I am trying to run an IPython notebook with Kafka structured streaming support. I couldn't find a way to load the Kafka package: adding "--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0" to PYSPARK_DRIVER_PYTHON_OPTS did not work, and I even changed my local pyspark script to "exec "${SPARK_H…
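The usual fix for this: --packages is processed by spark-submit, so it is ignored inside PYSPARK_DRIVER_PYTHON_OPTS. For a notebook, put it in PYSPARK_SUBMIT_ARGS instead, keeping pyspark-shell as the final token, e.g. PYSPARK_SUBMIT_ARGS="--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 pyspark-shell". Once the package resolves, the kafka source loads; a minimal sketch of the reader (shown in Scala; the broker address and topic name are placeholders):

    import org.apache.spark.sql.SparkSession

    object KafkaStreamSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("kafka-sketch").getOrCreate()

        // Fails with "Failed to find data source: kafka" unless
        // spark-sql-kafka-0-10 is on the classpath (e.g. via --packages).
        val stream = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
          .option("subscribe", "events")                       // placeholder
          .load()

        val query = stream
          .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
          .writeStream
          .format("console")
          .start()

        query.awaitTermination()
      }
    }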

Package option gets outdated jar when running with "latest"

2018-12-28 Thread Alessandro Liparoti
Hi everyone, I am encountering an annoying issue when running Spark with an external jar dependency downloaded from Maven. This is how we run it: spark-shell --repositories … --packages … When we release a new version with some big change in the API, things start to randomly break for some user…