Re: Joining 3 tables with 17 billion records

2017-11-02 Thread Jörn Franke
Well, this sounds like a lot for “only” 17 billion. However, you can limit the resources of the job so it does not need to take all of them (it might just take a bit longer). Alternatively, did you try using the HBase tables directly in Hive as external tables and doing a simple CTAS? Works better if Hive is on
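The external-table-plus-CTAS approach suggested above can be sketched roughly as below, assuming PySpark with Hive support and the standard Hive HBase storage handler available on the classpath. The table names, columns, and column-family mapping are hypothetical placeholders, not taken from the thread, and this requires a live Hive/HBase cluster to run:

```python
# Sketch only: assumes a cluster with Hive + HBase integration configured.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Expose the HBase table to Hive as an external table via the HBase storage handler.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS table_a_ext (rowkey STRING, payload STRING)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:payload')
    TBLPROPERTIES ('hbase.table.name' = 'table_a')
""")

# Simple CTAS: materialize the HBase-backed data as a Hive-managed Parquet table.
spark.sql("""
    CREATE TABLE table_a_hive STORED AS PARQUET
    AS SELECT * FROM table_a_ext
""")
```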

Re: Joining 3 tables with 17 billion records

2017-11-02 Thread Chetan Khatri
Jorn, This is a kind of one-time load from historical data into the analytical Hive engine. Hive version 1.2.1 and Spark version 2.0.1 with the MapR distribution. Writing every table to Parquet and reading it back could be very time-consuming; currently the entire job could take ~8 hours on 8 nodes with 100 GB of RAM

Re: Joining 3 tables with 17 billion records

2017-11-02 Thread Jörn Franke
Hi, Do you have a more detailed log/error message? Also, can you please give us details on the tables (number of rows, columns, size, etc.)? Is this a one-time thing or something regular? If it is a one-time thing then I would tend more towards putting each table in HDFS (Parquet or ORC) and

Joining 3 tables with 17 billion records

2017-11-02 Thread Chetan Khatri
Hello Spark Developers, I have 3 tables that I am reading from HBase and want to do a join transformation and save to a Hive Parquet external table. Currently my join is failing with a container failed error. 1. Read table A from HBase with ~17 billion records. 2. Repartition on the primary key of table A
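The repartition-before-join idea in steps 1–2 can be illustrated with a small, Spark-free toy in plain Python: hash-partitioning both sides on the join key places matching rows in the same partition, so each partition can be joined independently, which is what repartitioning on table A's primary key buys before the big join. The data and partition count here are made up for illustration:

```python
# Toy illustration (no Spark): co-locate matching keys, then join per partition.

def partition_by_key(rows, num_partitions):
    """Hash-partition (key, value) rows so equal keys land in the same slot."""
    parts = [[] for _ in range(num_partitions)]
    for key, value in rows:
        parts[hash(key) % num_partitions].append((key, value))
    return parts

def join_partitioned(parts_a, parts_b):
    """Join two identically partitioned datasets, one partition at a time."""
    joined = []
    for pa, pb in zip(parts_a, parts_b):  # matching keys share a slot
        lookup = {}
        for k, v in pb:
            lookup.setdefault(k, []).append(v)
        for k, v in pa:
            for w in lookup.get(k, []):
                joined.append((k, v, w))
    return joined

table_a = [(1, "a1"), (2, "a2"), (3, "a3")]
table_b = [(2, "b2"), (3, "b3"), (4, "b4")]
result = join_partitioned(partition_by_key(table_a, 4), partition_by_key(table_b, 4))
print(sorted(result))  # [(2, 'a2', 'b2'), (3, 'a3', 'b3')]
```

At Spark scale the same principle applies: `df.repartition(col)` on both sides of the join key lets each task join a co-partitioned slice instead of shuffling 17 billion rows at join time.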

Re: Kicking off the process around Spark 2.2.1

2017-11-02 Thread Holden Karau
I agree, except in this case we probably want some of the fixes that are going into the maintenance release to be present in the new feature release (like the CRAN issue). On Thu, Nov 2, 2017 at 12:12 PM, Reynold Xin wrote: > Why tie a maintenance release to a feature release? They are supposed

Re: Kicking off the process around Spark 2.2.1

2017-11-02 Thread Reynold Xin
Why tie a maintenance release to a feature release? They are supposed to be independent and we should be able to make a lot of maintenance releases as needed. On Thu, Nov 2, 2017 at 7:13 PM Sean Owen wrote: > The feature freeze is "mid November" : > http://spark.apache.org/versioning-policy.html

Re: Kicking off the process around Spark 2.2.1

2017-11-02 Thread Sean Owen
The feature freeze is "mid November": http://spark.apache.org/versioning-policy.html Let's say... Nov 15? Anybody have a better date? Although it'd be nice to get 2.2.1 out sooner rather than later in any event, and it kind of makes sense to get it out first, they need not go in order. It just might be dist

Re: Kicking off the process around Spark 2.2.1

2017-11-02 Thread Holden Karau
I’m fine with picking a feature freeze, although then we should branch close to that point. Is there interest in still seeing 2.3 try to go out around the nominal schedule? Personally, from a release standpoint, I’d rather see 2.2.1 go out first so we don’t end up with 2.3 potentially going out

Re: Kicking off the process around Spark 2.2.1

2017-11-02 Thread Felix Cheung
I think it would be great to set a feature freeze date for 2.3.0 first, as a minor release. There are a few new things that would be good to have, and then we will likely need time to stabilize before cutting RCs. From: Holden Karau Sent: Thursday, November 2, 201

Spark build is failing in amplab Jenkins

2017-11-02 Thread Pralabh Kumar
Hi Dev, The Spark build is failing in Jenkins: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83353/consoleFull Python versions prior to 2.7 are not supported. Build step 'Execute shell' marked build as failure Archiving artifacts Recording test results ERROR: Step 'Publish JUnit
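The failure message comes from a Python version guard in the build scripts. A minimal sketch of that kind of check (a hypothetical stand-in, not the actual Jenkins step) looks like:

```python
# Sketch of a version guard of the shape that produced the error above:
# refuse interpreters older than 2.7 before running build steps.
import sys

def check_python_version(version_info, minimum=(2, 7)):
    """Return True if the interpreter meets the minimum (major, minor)."""
    return tuple(version_info[:2]) >= minimum

if not check_python_version(sys.version_info):
    sys.exit("Python versions prior to 2.7 are not supported.")
print("python version ok")
```

Running the console's Python against such a guard before kicking off the shell step would surface the unsupported interpreter immediately instead of failing mid-build.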

Re: Kicking off the process around Spark 2.2.1

2017-11-02 Thread Holden Karau
If it’s desired I’d be happy to start on 2.3 once 2.2.1 is finished. On Thu, Nov 2, 2017 at 10:24 AM Felix Cheung wrote: > For the 2.2.1, we are still working through a few bugs. Hopefully it won't > be long. > > > -- > *From:* Kevin Grealish > *Sent:* Thursday, Nove

Re: Kicking off the process around Spark 2.2.1

2017-11-02 Thread Felix Cheung
For the 2.2.1, we are still working through a few bugs. Hopefully it won't be long. From: Kevin Grealish Sent: Thursday, November 2, 2017 9:51:56 AM To: Felix Cheung; Sean Owen; Holden Karau Cc: dev@spark.apache.org Subject: RE: Kicking off the process around Sp

RE: Kicking off the process around Spark 2.2.1

2017-11-02 Thread Kevin Grealish
Any update on expected 2.2.1 (or 2.3.0) release process? From: Felix Cheung [mailto:felixcheun...@hotmail.com] Sent: Thursday, October 26, 2017 10:04 AM To: Sean Owen ; Holden Karau Cc: dev@spark.apache.org Subject: Re: Kicking off the process around Spark 2.2.1 Yes! I can take on RM for 2.2.1.

Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-02 Thread Sean Owen
+0 simply because I don't feel I know enough to have an opinion. I have no reason to doubt the change though, from a skim through the doc. On Wed, Nov 1, 2017 at 3:37 PM Reynold Xin wrote: > Earlier I sent out a discussion thread for CP in Structured Streaming: > > https://issues.apache.org/jira