Raise Jenkins timeout?
I'm seeing jobs killed regularly, presumably because they hit the timeout (210 minutes?):
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.6/3907/console

Possibly related: this master-SBT-2.7 build hasn't passed in weeks:
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/

All of the failures seem to be timeouts, explicit or implicit.

I know the real answer is to make the tests faster, but failing that, can we raise the timeout again, to 4 hours, say? Or have I misread why these jobs are being killed?

If it's somehow load on the Jenkins servers, we could consider getting rid of the two separate Hadoop builds; I think building separately for the two (or even supporting 2.6 specially) serves little purpose.
Re: Raise Jenkins timeout?
++joshrosen

On Mon, Oct 9, 2017 at 1:48 AM, Sean Owen wrote:
> I'm seeing jobs killed regularly, presumably because they hit the timeout
> (210 minutes?) [...]
Re: Raise Jenkins timeout?
I bumped the timeouts up to 255 minutes (to exceed the timeout set in https://github.com/apache/spark/blame/master/dev/run-tests-jenkins.py#L185). Let's see if this resolves the problem.

On Mon, Oct 9, 2017 at 9:30 AM shane knapp wrote:
> ++joshrosen [...]
Re: [VOTE][SPIP] SPARK-22026 data source v2 write path
I'm going to update the proposal: for the last point, although the user-facing API (`df.write.format(...).option(...).mode(...).save()`) mixes data and metadata operations, we are still able to separate them in the data source write API. We can have a mix-in trait `MetadataSupport` with a method `create(options)`, so that data sources can mix in this trait and provide metadata-creation support. Spark will call this `create` method inside `DataFrameWriter.save` if the specified data source has it.

Note that file-format data sources can ignore this new trait and still write data without metadata (they don't have metadata anyway).

With this updated proposal, I'm calling a new vote for the data source v2 write path.

The vote will be up for the next 72 hours. Please reply with your vote:

+1: Yeah, let's go forward and implement the SPIP.
+0: Don't really care.
-1: I don't think this is a good idea because of the following technical reasons.

Thanks!

On Tue, Oct 3, 2017 at 12:03 AM, Wenchen Fan wrote:
> Hi all,
>
> Now that we have merged the infrastructure of the data source v2 read path
> and had some discussion about the write path, I'm sending this email to
> call a vote for the Data Source v2 write path.
>
> The full document of the Data Source API V2 is:
> https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit
>
> The ready-for-review PR that implements the basic infrastructure for the
> write path:
> https://github.com/apache/spark/pull/19269
>
> The Data Source V1 write path asks implementations to write a DataFrame
> directly, which is painful:
> 1. Exposing an upper-level API like DataFrame to the Data Source API is
> bad for maintenance.
> 2. Data sources may need to preprocess the input data before writing,
> e.g., cluster/sort the input by some columns. It's better to do the
> preprocessing in Spark than in the data source.
> 3. Data sources need to take care of transactions themselves, which is
> hard. Different data sources may come up with very similar approaches to
> transactions, leading to a lot of duplicated code.
>
> To solve these pain points, I'm proposing the data source v2 write
> framework, which is very similar to the read framework, i.e.,
> WriteSupport -> DataSourceV2Writer -> DataWriterFactory -> DataWriter.
>
> The Data Source V2 write path follows the existing FileCommitProtocol and
> has task/job-level commit/abort, so that data sources can implement
> transactions more easily.
>
> We can create a mix-in trait for DataSourceV2Writer to specify
> requirements on the input data, like clustering and ordering.
>
> Spark provides a very simple protocol for users to connect to data
> sources. A common way to write a dataframe to a data source is
> `df.write.format(...).option(...).mode(...).save()`.
> Spark passes the options and save mode to the data source, and schedules
> the write job on the input data. The data source should take care of the
> metadata, e.g., the JDBC data source can create the table if it doesn't
> exist, or fail the job and ask users to create the table in the
> corresponding database first. Data sources can define options for users
> to carry metadata information like partitioning/bucketing.
>
> The vote will be up for the next 72 hours. Please reply with your vote:
>
> +1: Yeah, let's go forward and implement the SPIP.
> +0: Don't really care.
> -1: I don't think this is a good idea because of the following technical
> reasons.
>
> Thanks!
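To make the separation concrete, here is a minimal, self-contained Scala sketch of how the proposed `MetadataSupport` mix-in could sit alongside the write-support trait. All names are illustrative stand-ins based on the proposal text above, not the actual Data Source V2 interfaces, and the `save` helper only mimics the dispatch that `DataFrameWriter.save` would perform:

```scala
// Illustrative stand-ins only; these are not the real Spark interfaces.
trait DataSourceV2

trait WriteSupport extends DataSourceV2 {
  // Simplified: the real write path goes through a writer/factory chain.
  def writeData(rows: Seq[Map[String, Any]], options: Map[String, String]): Unit
}

// Sources that manage their own metadata (e.g. a JDBC-style source that can
// create the target table) additionally mix in MetadataSupport.
trait MetadataSupport {
  def create(options: Map[String, String]): Unit
}

object SaveSketch {
  // Toy version of the dispatch DataFrameWriter.save would do:
  // create metadata first if the source supports it, then write the data.
  def save(source: WriteSupport,
           rows: Seq[Map[String, Any]],
           options: Map[String, String]): Unit = {
    source match {
      case m: MetadataSupport => m.create(options)
      case _                  => // e.g. file formats: nothing to create
    }
    source.writeData(rows, options)
  }
}
```

A file-format source would simply not mix in `MetadataSupport` and fall through to the plain write branch.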
Re: [VOTE][SPIP] SPARK-22026 data source v2 write path
I'm adding my own +1 (binding).

On Tue, Oct 10, 2017 at 9:07 AM, Wenchen Fan wrote:
> I'm going to update the proposal: for the last point, although the
> user-facing API mixes data and metadata operations, we are still able to
> separate them in the data source write API. [...]
[SPIP] SPARK-22229: RDMA Accelerated Shuffle Engine
Dear Spark community,

I would like to call for the review of SPARK-22229: "RDMA Accelerated Shuffle Engine". The purpose of the request is to embed an RDMA-accelerated shuffle manager into mainstream Spark. Such an implementation is available as an external plugin as part of the "SparkRDMA" project: https://github.com/Mellanox/SparkRDMA. SparkRDMA has already demonstrated enormous potential for accelerating shuffles seamlessly, in both benchmarks and actual production environments. Adding RDMA capabilities to Spark will be one more important step in enabling lower-level acceleration, as conveyed by the "Tungsten" project.

SparkRDMA will be presented at Spark Summit 2017 in Dublin (https://spark-summit.org/eu-2017/events/accelerating-shuffle-a-tailor-made-rdma-solution-for-apache-spark/).

JIRA ticket: https://issues.apache.org/jira/browse/SPARK-22229
PDF version: https://issues.apache.org/jira/secure/attachment/12891122/SPARK-22229_SPIP_RDMA_Accelerated_Shuffle_Engine_Rev_1.0.pdf

*Overview*

An RDMA-accelerated shuffle engine can provide enormous performance benefits to shuffle-intensive Spark jobs, as demonstrated in the "SparkRDMA" plugin open-source project (https://github.com/Mellanox/SparkRDMA). Using RDMA for shuffle improves CPU utilization significantly and reduces I/O processing overhead by bypassing the kernel and networking stack and avoiding memory copies entirely. Those valuable CPU cycles are then consumed directly by the actual Spark workloads, and help reduce the job runtime significantly. This performance gain has been demonstrated with both the industry-standard HiBench TeraSort (which shows a 1.5x speedup in sorting) and shuffle-intensive customer applications. SparkRDMA will be presented at Spark Summit 2017 in Dublin (https://spark-summit.org/eu-2017/events/accelerating-shuffle-a-tailor-made-rdma-solution-for-apache-spark/).

*Background and Motivation*

Spark's current shuffle engine implementation over "Netty" faces many performance issues often seen in other socket-based applications. Using the standard TCP/IP socket-based communication model for heavy data transfers usually requires copying the data multiple times and going through many system calls in the I/O path. These consume significant amounts of CPU cycles and memory that could otherwise have been assigned to the actual job at hand. This becomes even more critical with latency-sensitive Spark Streaming and Deep Learning applications over SparkML.

RDMA (Remote Direct Memory Access) is already a commodity technology that is supported on most mid-range to high-end network adapter cards, manufactured by various companies. Furthermore, RDMA-capable networks are already offered on public clouds such as Microsoft Azure, and will probably be supported on AWS soon to appeal to MPI users. Existing users of Spark on Microsoft Azure servers can get the benefits of RDMA by running on a suitable instance with this plugin, without needing any application changes.

RDMA provides a unique approach for accessing memory locations over the network, without the need for copying on either the transmitter side or the receiver side. These remote memory read and write operations are enabled by a standard interface that has been part of mainstream Linux releases for many years now. This standardized interface allows direct access to remote memory from user space, while skipping costly system calls in the I/O path.
Due to its many virtues, RDMA has found its way to being a standard data transfer protocol in HPC (High Performance Computing) applications, with MPI being the most prominent. RDMA has traditionally been associated with InfiniBand networks, but with the standardization of RDMA over Converged Ethernet (RoCE), RDMA has been supported and widely used on Ethernet networks for many years now.

Since Spark is all about performing everything in-memory, RDMA seems like a perfect fit for filling in the gap of transferring intermediate in-memory data between the participating nodes. SparkRDMA (https://github.com/Mellanox/SparkRDMA) is an exemplar of how shuffle performance can dramatically improve with the use of RDMA. Today, it is gaining significant traction with many users, and has successfully demonstrated major performance improvements on production applications in high-profile technology companies.

SparkRDMA is a generic and easy-to-use plugin. However, in order to gain wide adoption with its effortless acceleration, it must be integrated into Apache Spark itself. The purpose of this SPIP is to introduce RDMA into mainstream Spark for improving shuffle performance, and to pave the way for further accelerations such as GPUDirect (acceleration with NVIDIA GPUs over CUDA), NVMeoF (NVMe over Fabrics) and more. SparkRDMA will be presented at Spark Summit 2017 in Dublin (https://spark-summit.org/eu-2017/events/accelerating-shuffle-a-tailor-made-rdma-solution-for-apache-spark/).

*Target Personas*

Any Spark user
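Though it is not part of the SPIP text itself, the spark-shell-style Scala sketch below shows how the external SparkRDMA plugin is typically enabled today through Spark's pluggable shuffle-manager setting. Treat the shuffle-manager class name as an assumption taken from the SparkRDMA README; jar distribution details (e.g. --jars or extraClassPath settings) are omitted as deployment-specific:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hedged sketch: point spark.shuffle.manager at the plugin's shuffle manager
// (class name per the SparkRDMA README; it may differ between releases).
// In real deployments this usually lives in spark-defaults.conf or is passed
// with --conf on the spark-submit command line.
val spark = SparkSession.builder()
  .appName("rdma-shuffle-sketch")
  .config("spark.shuffle.manager", "org.apache.spark.shuffle.rdma.RdmaShuffleManager")
  .getOrCreate()

// Any shuffle-heavy stage (this groupBy/count, for example) now exchanges its
// map output through the plugged-in shuffle manager instead of the default
// Netty-based one; no application code changes are needed.
spark.range(0L, 1000000L)
  .groupBy((col("id") % 100).as("bucket"))
  .count()
  .show()
```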
Re: [VOTE][SPIP] SPARK-22026 data source v2 write path
+1

One thing with `MetadataSupport`: it's a bad idea to call it that unless adding new functions to that trait wouldn't break source/binary compatibility in the future.

On Mon, Oct 9, 2017 at 6:07 PM, Wenchen Fan wrote:
> I'm adding my own +1 (binding). [...]
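To spell out the compatibility concern above (my reading of it, not part of the proposal): once third-party sources implement `MetadataSupport`, adding another abstract method to it later forces every implementation to change, so the trait either needs a name broad enough to justify growing, or each capability should get its own narrow mix-in. A hedged Scala sketch with hypothetical names:

```scala
// Illustrative declarations, not the Spark source.
trait MetadataSupport {
  def create(options: Map[String, String]): Unit
}

// If a later release added another abstract method here, e.g.
//   def drop(options: Map[String, String]): Unit
// every existing implementation outside Spark would stop compiling, and jars
// built against the old trait could fail at link time (AbstractMethodError)
// when the new method is called.

// One compatibility-friendlier shape: a separate, narrowly scoped mix-in per
// capability, so new capabilities never touch an already-published trait.
trait DropSupport {
  def drop(options: Map[String, String]): Unit
}

// A hypothetical source opts into exactly the capabilities it supports:
class ExampleJdbcLikeSource extends MetadataSupport with DropSupport {
  override def create(options: Map[String, String]): Unit = {
    // e.g. CREATE TABLE IF NOT EXISTS, driven by the user-supplied options
  }
  override def drop(options: Map[String, String]): Unit = {
    // e.g. DROP TABLE IF EXISTS
  }
}
```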