Re: Spark Improvement Proposals (Internet mail)

2016-10-17 Thread 黄明
There’s no need to compare to Flink’s Streaming Model. Spark should focus more 
on how to go beyond itself.


From the beginning, Spark's success has come from its unified model, which can
satisfy SQL, Streaming, Machine Learning, and Graph jobs... all in one. But from
1.6 to 2.0, the move from the RDD abstraction to DataFrames has brought no
substantial progress to those two important areas (ML & Graph). Most of the work
has gone into SQL and Streaming, which forces Spark to face competition with
Flink. But guys, that is not supposed to be the battle Spark has to fight.


SIP is a good start. Voices from the technical community should be heard and
accepted, not buried in PR threads. Nowadays Spark does not lack committers or
contributors. The right direction and focus areas will decide where it goes,
which competitors it encounters, and ultimately what it can become.

---
Sincerely
Andy

Original message
From: Debasish Das
To: Tomasz Gawęda
Cc: dev@spark.apache.org; Cody Koeninger
Sent: Monday, October 17, 2016, 10:21
Subject: Re: Spark Improvement Proposals (Internet mail)

Thanks Cody for bringing up a valid point... I picked up Spark in 2014 as soon
as I looked into it, since compared to writing Java map-reduce and Cascading
code, Spark made writing distributed code fun... But now that we have gone
deeper with Spark and real-time streaming use cases are becoming more prominent,
I think it is time to bring in a messaging model alongside the batch/micro-batch
API that Spark is good at... akka-streams close integration with Spark's
micro-batching APIs looks like a great direction to stay in the game with Apache
Flink... Spark 2.0 integrated streaming with batch on the assumption that
micro-batching is sufficient to run SQL commands on a stream, but do we really
have time to do SQL processing on streaming data within 1-2 seconds?
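
For concreteness, here is a rough sketch of what I mean by "SQL on a stream
every 1-2 seconds" in Spark 2.0 Structured Streaming; the socket source,
host/port, and word-count query are just placeholders for illustration:

// Minimal Structured Streaming sketch: a streaming aggregation recomputed on a
// 1-second processing-time trigger, i.e. one micro-batch per second.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.ProcessingTime

val spark = SparkSession.builder().appName("micro-batch-latency-sketch").getOrCreate()
import spark.implicits._

// Placeholder input: lines of text from a local socket.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()

// A simple "SQL on the stream" query: running word counts.
val counts = lines.as[String]
  .flatMap(_.split("\\s+"))
  .groupBy($"value")
  .count()

val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .trigger(ProcessingTime("1 second"))  // one micro-batch per second
  .start()

query.awaitTermination()

Whether the micro-batch scheduler can consistently keep every batch inside that
budget as state and load grow is exactly the open question.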

After reading the email chain, I started to look into the Flink documentation,
and if you compare it with the Spark documentation, I think we have major work
to do in detailing Spark's internals so that more people from the community take
an active role in improving the issues and Spark stays strong compared to
Flink.

https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals

https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals

Spark is no longer an engine that works only for micro-batch and batch... We
(and I am sure many others) are pushing Spark as an engine for stream and query
processing... we need to make it a state-of-the-art engine for high-speed
streaming data and user queries as well!

On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda <tomasz.gaw...@outlook.com> wrote:
Hi everyone,

I'm quite late with my answer, but I think my suggestions may help a
little bit. :) Many technical and organizational topics were mentioned,
but I want to focus on the negative posts about Spark and about the "haters".

I really like Spark. Ease of use, speed, a very good community - it's
all here. But every project has to fight on the "framework market"
to stay number 1. I'm following many Spark and Big Data communities;
maybe my mail will inspire someone :)

You (every Spark developer; so far I haven't had enough time to start
contributing to Spark) have done an excellent job. So why are some people
saying that Flink (or another framework) is better, as was posted on
this mailing list? No, not because that framework is better in all
cases. In my opinion, many of these discussions were started after
Flink marketing-like posts. Please look at the StackOverflow "Flink vs "
posts: almost every one is "won" by Flink. The answers sometimes
say nothing about other frameworks; Flink's users (often PMC members) just
post the same information about real-time streaming, about delta
iterations, etc. It looks smart and very often it is marked as the answer,
even if - in my opinion - it doesn't tell the whole truth.


My suggestion: I don't have enough money and knowledge to perform a huge
performance test. Maybe some company that supports Spark (Databricks,
Cloudera? - just saying you're the most visible in the community :) ) could
perform performance tests of:

- streaming engine - Spark will probably lose because of the mini-batch
model, however currently the difference should be much smaller than in
previous versions

- Machine Learning models

- batch jobs

- Graph jobs

- SQL queries

People will see that Spark is evolving and is also a modern framework,
because after reading the posts mentioned above people may think "it is
outdated, the future is in framework X".

Matei Zaharia posted an excellent blog post about how Spark Structured
Streaming beats every other framework in terms of ease of use and
reliability. Performance tests, done in various environments (for
example: a laptop, a small 2-node cluster, a 10-node cluster, a 20-node
cluster), could also be very good marketing material to say "hey, you're
telling us that you're better, but Spark is still faster and is still
getting even faster!". This would be based on facts (just numbers), not
opinions. It would be good for companies, for marketing purposes, and for
every Spark developer.

trying to use Spark applications with modified Kryo

2016-10-17 Thread Prasun Ratn
Hi

I want to run some Spark applications with some changes in Kryo serializer.

Please correct me, but I think I need to recompile spark (instead of
just the Spark applications) in order to use the newly built Kryo
serializer?

I obtained Kryo 3.0.3 source and built it (mvn package install).

Next, I took the source code for Spark 2.0.1 and built it (build/mvn
-X -DskipTests -Dhadoop.version=2.6.0 clean package)

I then compiled the Spark applications.

However, I am not seeing my Kryo changes when I run the Spark applications.

Please let me know if my assumptions and steps are correct.

Thank you
Prasun

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Custom Monitoring of Spark applications

2016-10-17 Thread Nicolae Rosca
Hi all,

I am trying to write a custom Source for counting errors and to output it via
Spark's sink mechanism (CSV or JMX), and I'm having some problems
understanding how this works.

1. I defined the Source, added counters created with MetricRegistry and
registered the Source

> SparkEnv.get().metricsSystem().registerSource(this)


2. Used that counter (I could print out its value in the driver)

3. With CsvSink my counter is reported, but the value is 0!

I have the following questions:
 - I expect that the Codahale Counter is serialized and registered, but because
the objects are different it is not the right counter. I have a version with an
accumulator and it works fine; I'm just a little worried about performance
(and design). Is there another way of doing this? Maybe static fields? (A
minimal sketch of the accumulator approach follows below.)

- When running on YARN, how many sink objects will be created?

- If I create a singleton object and register that counter in Spark, the
counting is right but it never reports from the executors. How can I enable
reporting from executors when running on YARN?

My custom Source:

public class CustomMonitoring implements Source {
>     private MetricRegistry metricRegistry = new MetricRegistry();
>
>     public CustomMonitoring(List<String> counts) {
>         for (String count : counts) {
>             metricRegistry.counter(count);
>         }
>         SparkEnv.get().metricsSystem().registerSource(this);
>     }
>
>     @Override
>     public String sourceName() {
>         return TURBINE_CUSTOM_MONITORING;
>     }
>
>     public MetricRegistry metricRegistry() {
>         return metricRegistry;
>     }
> }


metrics.properties

> *.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
> *.sink.csv.directory=/tmp/csvSink/
> *.sink.csv.period=60
> *.sink.csv.unit=seconds
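
For comparison, here is a minimal sketch of the accumulator-based variant I
mentioned above (Spark 2.0's LongAccumulator; the input path and parsing logic
are just placeholders):

// Minimal sketch of error counting with an accumulator instead of a custom
// metrics Source: updates made on executors are merged back on the driver.
import org.apache.spark.sql.SparkSession

object ErrorCountingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("error-counting-sketch").getOrCreate()
    val sc = spark.sparkContext

    // Named accumulators also show up in the web UI for the stages that use them.
    val parseErrors = sc.longAccumulator("parseErrors")

    val parsed = sc.textFile("/tmp/input").flatMap { line =>
      try {
        Some(line.toInt)
      } catch {
        case _: NumberFormatException =>
          parseErrors.add(1L)  // counted on the executor, merged on the driver
          None
      }
    }

    println(s"parsed ${parsed.count()} records, ${parseErrors.value} parse errors")
    spark.stop()
  }
}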

Thank you,

Nicolae  R.


Re: trying to use Spark applications with modified Kryo

2016-10-17 Thread Steve Loughran

On 17 Oct 2016, at 10:02, Prasun Ratn <prasun.r...@gmail.com> wrote:

Hi

I want to run some Spark applications with some changes in Kryo serializer.

Please correct me, but I think I need to recompile spark (instead of
just the Spark applications) in order to use the newly built Kryo
serializer?

I obtained Kryo 3.0.3 source and built it (mvn package install).

Next, I took the source code for Spark 2.0.1 and built it (build/mvn
-X -DskipTests -Dhadoop.version=2.6.0 clean package)

I then compiled the Spark applications.

However, I am not seeing my Kryo changes when I run the Spark applications.


Kryo versions are very brittle.

You'll need to:

- get an up-to-date/consistent version of Chill, which is where the
transitive dependency on Kryo originates
- rebuild Spark depending on that Chill release

If you want Hive integration, you'll probably also need to rebuild Hive to be
consistent; the main reason Spark has its own Hive fork is exactly that
Kryo version sharing.

https://github.com/JoshRosen/hive/commits/release-1.2.1-spark2

Kryo has repackaged its class locations between versions. This lets the
versions co-exist, but it probably also explains why your apps aren't picking up
the diffs.
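
One quick way to check which Kryo actually ends up on the classpath at runtime
is to print where the classes are loaded from; a rough diagnostic sketch (the
class names assume unshaded Kryo 3.x and Chill jars):

// Diagnostic sketch: print which jars the Kryo and Chill classes are actually
// loaded from, to confirm whether the rebuilt artifacts are on the classpath.
object KryoWhereFrom {
  private def locationOf(className: String): String = {
    val src = Class.forName(className).getProtectionDomain.getCodeSource
    if (src == null) "unknown (bootstrap or shaded?)" else src.getLocation.toString
  }

  def main(args: Array[String]): Unit = {
    println("Kryo:  " + locationOf("com.esotericsoftware.kryo.Kryo"))
    println("Chill: " + locationOf("com.twitter.chill.KryoBase"))
  }
}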

Finally, keep an eye on this GitHub issue:

https://github.com/twitter/chill/issues/252



Re: trying to use Spark applications with modified Kryo

2016-10-17 Thread Prasun Ratn
Thanks a lot Steve!

On Mon, Oct 17, 2016 at 4:59 PM, Steve Loughran  wrote:
>
> On 17 Oct 2016, at 10:02, Prasun Ratn  wrote:
>
> Hi
>
> I want to run some Spark applications with some changes in Kryo serializer.
>
> Please correct me, but I think I need to recompile spark (instead of
> just the Spark applications) in order to use the newly built Kryo
> serializer?
>
> I obtained Kryo 3.0.3 source and built it (mvn package install).
>
> Next, I took the source code for Spark 2.0.1 and built it (build/mvn
> -X -DskipTests -Dhadoop.version=2.6.0 clean package)
>
> I then compiled the Spark applications.
>
> However, I am not seeing my Kryo changes when I run the Spark applications.
>
>
> Kryo versions are very brittle.
>
> You'll need to:
>
> - get an up-to-date/consistent version of Chill, which is where the
> transitive dependency on Kryo originates
> - rebuild Spark depending on that Chill release
>
> If you want Hive integration, you'll probably also need to rebuild Hive to be
> consistent; the main reason Spark has its own Hive fork is exactly that
> Kryo version sharing.
>
> https://github.com/JoshRosen/hive/commits/release-1.2.1-spark2
>
> Kryo has repackaged their class locations between versions. This lets the
> versions co-exist, but probably also explains why your apps aren't picking
> up the diffs.
>
> Finally, keep an eye on this github PR
>
> https://github.com/twitter/chill/issues/252
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark Improvement Proposals

2016-10-17 Thread Cody Koeninger
I think narrowly focusing on Flink or benchmarks is missing my point.

My point is evolve or die.  Spark's governance and organization is
hampering its ability to evolve technologically, and it needs to
change.

On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das  wrote:
> Thanks Cody for bringing up a valid point... I picked up Spark in 2014 as
> soon as I looked into it, since compared to writing Java map-reduce and
> Cascading code, Spark made writing distributed code fun... But now that we
> have gone deeper with Spark and real-time streaming use cases are becoming
> more prominent, I think it is time to bring in a messaging model alongside the
> batch/micro-batch API that Spark is good at... akka-streams close
> integration with Spark's micro-batching APIs looks like a great direction to
> stay in the game with Apache Flink... Spark 2.0 integrated streaming with
> batch on the assumption that micro-batching is sufficient to run SQL
> commands on a stream, but do we really have time to do SQL processing on
> streaming data within 1-2 seconds?
>
> After reading the email chain, I started to look into the Flink documentation,
> and if you compare it with the Spark documentation, I think we have major work
> to do in detailing Spark's internals so that more people from the community
> take an active role in improving the issues and Spark stays strong
> compared to Flink.
>
> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
>
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
>
> Spark is no longer an engine that works only for micro-batch and batch... We
> (and I am sure many others) are pushing Spark as an engine for stream and
> query processing... we need to make it a state-of-the-art engine for high-speed
> streaming data and user queries as well!
>
> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda 
> wrote:
>>
>> Hi everyone,
>>
>> I'm quite late with my answer, but I think my suggestions may help a
>> little bit. :) Many technical and organizational topics were mentioned,
>> but I want to focus on the negative posts about Spark and about the "haters".
>>
>> I really like Spark. Ease of use, speed, a very good community - it's
>> all here. But every project has to fight on the "framework market"
>> to stay number 1. I'm following many Spark and Big Data communities;
>> maybe my mail will inspire someone :)
>>
>> You (every Spark developer; so far I haven't had enough time to start
>> contributing to Spark) have done an excellent job. So why are some people
>> saying that Flink (or another framework) is better, as was posted on
>> this mailing list? No, not because that framework is better in all
>> cases. In my opinion, many of these discussions were started after
>> Flink marketing-like posts. Please look at the StackOverflow "Flink vs "
>> posts: almost every one is "won" by Flink. The answers sometimes
>> say nothing about other frameworks; Flink's users (often PMC members) just
>> post the same information about real-time streaming, about delta
>> iterations, etc. It looks smart and very often it is marked as the answer,
>> even if - in my opinion - it doesn't tell the whole truth.
>>
>>
>> My suggestion: I don't have enough money and knowledge to perform a huge
>> performance test. Maybe some company that supports Spark (Databricks,
>> Cloudera? - just saying you're the most visible in the community :) ) could
>> perform performance tests of:
>>
>> - streaming engine - Spark will probably lose because of the mini-batch
>> model, however currently the difference should be much smaller than in
>> previous versions
>>
>> - Machine Learning models
>>
>> - batch jobs
>>
>> - Graph jobs
>>
>> - SQL queries
>>
>> People will see that Spark is evolving and is also a modern framework,
>> because after reading the posts mentioned above people may think "it is
>> outdated, the future is in framework X".
>>
>> Matei Zaharia posted an excellent blog post about how Spark Structured
>> Streaming beats every other framework in terms of ease of use and
>> reliability. Performance tests, done in various environments (for
>> example: a laptop, a small 2-node cluster, a 10-node cluster, a 20-node
>> cluster), could also be very good marketing material to say "hey, you're
>> telling us that you're better, but Spark is still faster and is still
>> getting even faster!". This would be based on facts (just numbers),
>> not opinions. It would be good for companies, for marketing purposes, and
>> for every Spark developer.
>>
>>
>> Second: real-time streaming. I wrote some time ago about real-time
>> streaming support in Spark Structured Streaming. Some work should be
>> done to make SSS more low-latency, but I think it's possible. Maybe
>> Spark could look at Gearpump, which is also built on top of Akka? I don't
>> know yet; it is a good topic for a SIP. However, I think that Spark should
>> have real-time streaming support. Currently I see many posts/comments
>> saying that "Spark's latency is too high". Spark Streaming is doing a very
>> good job with micro-

Re: cutting 2.0.2?

2016-10-17 Thread Cody Koeninger
SPARK-17841  three-line bugfix that has a week-old PR
SPARK-17812  being able to specify starting offsets is a must-have for
a Kafka MVP in my opinion; it already has a PR
SPARK-17813  I can put in a PR for this tonight if it'll be considered

On Mon, Oct 17, 2016 at 12:28 AM, Reynold Xin  wrote:
> Since 2.0.1, there have been a number of correctness fixes as well as some
> nice improvements to the experimental structured streaming (notably basic
> Kafka support). I'm thinking about cutting 2.0.2 later this week, before
> Spark Summit Europe. Let me know if there are specific things (bug fixes)
> you really want to merge into branch-2.0.
>
> Cheers.
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark Improvement Proposals

2016-10-17 Thread Tomasz Gawęda
Maybe my mail was not clear enough.


I didn't want to write "let's focus on Flink" or any other framework. The idea
with the benchmarks was to show two things:

- why some people are doing bad PR for Spark

- how, in an easy way, we can change that and show that Spark is still on top


No more, no less. Benchmarks will be helpful, but I don't think they're the
most important thing for Spark :) On the Spark main page there is still the
"Spark vs Hadoop" chart. It is important to show that the framework is not the
same old Spark with another API, but much faster and more optimized, comparable
to or even faster than other frameworks.


About real-time streaming, I think it would simply be good to see it in Spark. I
really like the current Spark model, but there are many voices saying "we need
more" - the community should also listen to them and try to help them. With SIPs
it would be easier; I've just posted this example as a "thing that may be
changed with a SIP".


I really like the unification via Datasets, but there are a lot of algorithms
inside - let's keep the API easy, but with a strong background (articles,
benchmarks, descriptions, etc.) that shows that Spark is still a modern
framework.


Maybe now my intention is clearer :) As I said, the organizational ideas were
already mentioned and I agree with them; my mail was just to show some aspects
from my side, that is, from the side of a developer and a person who is trying
to help others with Spark (via StackOverflow or other ways).


Pozdrawiam / Best regards,

Tomasz



From: Cody Koeninger
Sent: October 17, 2016, 16:46
To: Debasish Das
Cc: Tomasz Gawęda; dev@spark.apache.org
Subject: Re: Spark Improvement Proposals

I think narrowly focusing on Flink or benchmarks is missing my point.

My point is evolve or die.  Spark's governance and organization is
hampering its ability to evolve technologically, and it needs to
change.

On Sun, Oct 16, 2016 at 9:21 PM, Debasish Das  wrote:
> Thanks Cody for bringing up a valid point... I picked up Spark in 2014 as
> soon as I looked into it, since compared to writing Java map-reduce and
> Cascading code, Spark made writing distributed code fun... But now that we
> have gone deeper with Spark and real-time streaming use cases are becoming
> more prominent, I think it is time to bring in a messaging model alongside the
> batch/micro-batch API that Spark is good at... akka-streams close
> integration with Spark's micro-batching APIs looks like a great direction to
> stay in the game with Apache Flink... Spark 2.0 integrated streaming with
> batch on the assumption that micro-batching is sufficient to run SQL
> commands on a stream, but do we really have time to do SQL processing on
> streaming data within 1-2 seconds?
>
> After reading the email chain, I started to look into the Flink documentation,
> and if you compare it with the Spark documentation, I think we have major work
> to do in detailing Spark's internals so that more people from the community
> take an active role in improving the issues and Spark stays strong
> compared to Flink.
>
> https://cwiki.apache.org/confluence/display/SPARK/Spark+Internals
>
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Internals
>
> Spark is no longer an engine that works only for micro-batch and batch... We
> (and I am sure many others) are pushing Spark as an engine for stream and
> query processing... we need to make it a state-of-the-art engine for high-speed
> streaming data and user queries as well!
>
> On Sun, Oct 16, 2016 at 1:30 PM, Tomasz Gawęda 
> wrote:
>>
>> Hi everyone,
>>
>> I'm quite late with my answer, but I think my suggestions may help a
>> little bit. :) Many technical and organizational topics were mentioned,
>> but I want to focus on the negative posts about Spark and about the "haters".
>>
>> I really like Spark. Ease of use, speed, a very good community - it's
>> all here. But every project has to fight on the "framework market"
>> to stay number 1. I'm following many Spark and Big Data communities;
>> maybe my mail will inspire someone :)
>>
>> You (every Spark developer; so far I haven't had enough time to start
>> contributing to Spark) have done an excellent job. So why are some people
>> saying that Flink (or another framework) is better, as was posted on
>> this mailing list? No, not because that framework is better in all
>> cases. In my opinion, many of these discussions were started after
>> Flink marketing-like posts. Please look at the StackOverflow "Flink vs "
>> posts: almost every one is "won" by Flink. The answers sometimes
>> say nothing about other frameworks; Flink's users (often PMC members) just
>> post the same information about real-time streaming, about delta
>> iterations, etc. It looks smart and very often it is marked as the answer,
>> even if - in my opinion - it doesn't tell the whole truth.
>>
>>
>> My suggestion: I don't have enough money and knowledge to perform a huge
>> performance test. Maybe some company that supports Spark (Databricks,
>> Cloudera? - 

Re: cutting 2.0.2?

2016-10-17 Thread Erik O'Shaughnessy
I would very much like to see SPARK-16962 included in 2.0.2 as it addresses
unaligned memory access patterns that crash non-x86 platforms.  I believe
this falls in the category of "correctness fix".  We (Oracle SAE) have
applied the fixes for SPARK-16962 to branch-2.0 and have not encountered any
problems on SPARC or x86 architectures attributable to unaligned accesses.
Including this fix will allow Oracle SPARC customers to run Apache Spark
without fear of crashing, expanding the reach of Apache Spark and making my
life a little easier :)

erik.oshaughne...@oracle.com



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/cutting-2-0-2-tp19473p19482.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Fwd: Large variation in spark in Task Deserialization Time

2016-10-17 Thread Pulasthi Supun Wickramasinghe
Hi Devs/All,

I am seeing a huge variation in Spark's Task Deserialization Time for my
collect and reduce operations. While most tasks complete within 100 ms, a few
take more than a couple of seconds, which slows the entire program down. I
have attached a screenshot of the web UI where you can see the variation.


As you can see, the Task Deserialization Time has a max of 7 s and a 75th
percentile of 0.3 seconds.

Does anyone know what can cause these kinds of numbers? Any help
would be greatly appreciated.

Best Regards,
Pulasthi
-- 
Pulasthi S. Wickramasinghe
Graduate Student  | Research Assistant
School of Informatics and Computing | Digital Science Center
Indiana University, Bloomington
cell: 224-386-9035


Indexing w spark joins?

2016-10-17 Thread Michael Segel
Hi,

Apologies if I’ve asked this question before but I didn’t see it in the list 
and I’m certain that my last surviving brain cell has gone on strike over my 
attempt to reduce my caffeine intake…

Posting this to both user and dev because I think the question/topic falls into
both camps.


Again since I’m a relative newbie on spark… I may be missing something so 
apologies up front…


With respect to Spark SQL: in pre-2.0.x, were there only hash joins? In
post-2.0.x you have hash, semi-hash, and sorted-list merge.

For the sake of simplicity… let's forget about cross-product joins…

Has anyone looked at how we could use inverted tables to improve query 
performance?

The issue is that when you have a data sewer (lake), what happens when your
use-case query is orthogonal to how your data is stored? This means full table
scans.
By using secondary indexes, we can reduce this, albeit at the cost of increasing
your storage footprint by the size of the index.

Are there any JIRAs open that discuss this?

Indexes to assist with "predicate pushdown" (using the index when a field in a
WHERE clause is indexed) rather than performing a full table scan.
Indexes to assist in the actual join if the join is on an indexed column?

In the first, using an inverted table to produce a sort ordered set of row keys 
that you would then use in the join process (same as if you produced the subset 
based on the filter.)
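
To make that concrete, here is a rough sketch of emulating this by hand today
(table and column names and paths are made up); it also shows where the approach
hits a wall without real index support:

// Purely illustrative: treat a small (carMake, carModel, primaryKey) projection
// as a hand-rolled "secondary index" for a claims table whose natural layout is
// keyed by (insurance company | claim id).
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder().appName("secondary-index-sketch").getOrCreate()
import spark.implicits._

val claims = spark.read.parquet("/data/claims")   // hypothetical fact table

// Build the "index" once, offline: a narrow projection partitioned by the
// column we expect in WHERE clauses, so those predicates prune partitions.
claims.select($"carMake", $"carModel", $"primaryKey")
  .write.mode("overwrite")
  .partitionBy("carMake")
  .parquet("/data/claims_idx_make_model")

// Query time: resolve the orthogonal predicate against the small index first...
val matchingKeys = spark.read.parquet("/data/claims_idx_make_model")
  .where($"carMake" === "Volvo" && $"carModel" === "S80")
  .select("primaryKey")

// ...then join back to the fact table on its key. Note that with plain files
// Spark still has to read the fact table for this join - which is exactly
// where built-in secondary-index / row-key lookup support would help.
val volvoS80 = claims.join(matchingKeys, Seq("primaryKey"))
volvoS80.agg(avg($"repairCost")).show()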

To put this in perspective… here’s a dummy use case…

CCCis (CCC) is the middle man in the insurance industry. They have a piece of 
software that sits in the repair shop (e.g Joe’s Auto Body) and works with 
multiple insurance carriers.
The primary key in their data is going to be Insurance Company | Claim ID.  
This makes it very easy to find a specific claim for further processing.

Now let's say I want to do some analysis to determine the average cost of
repairing a front-end collision on a Volvo S80?
Or
Break down the number and types of accidents by car manufacturer, model, and
color. (Then see if there is any correlation between car color and the number
and type of accidents.)


As you can see, all of these queries are orthogonal to my storage. So I need
to create secondary indexes to help sift through the data efficiently.

Does this make sense?

Please note: I did some work for CCC back in the late 90's. Any resemblance to
their big data efforts is purely coincidental, and you can replace CCC with
Allstate, Progressive, StateFarm, or some other auto insurance company…

Thx

-Mike




[build system] jenkins downtime for backups delayed by a hung build

2016-10-17 Thread shane knapp
i just noticed that jenkins was still in quiet mode this morning due
to a hung build.  i killed the build, backups happened, and the queue
is now happily building.

sorry for any delay!

shane

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: source for org.spark-project.hive:1.2.1.spark2

2016-10-17 Thread Ryan Blue
Are these changes that the Hive community has rejected? I don't see a
compelling reason to have a long-term Spark fork of Hive.

rb

On Sat, Oct 15, 2016 at 5:27 AM, Steve Loughran 
wrote:

>
> On 15 Oct 2016, at 01:28, Ryan Blue  wrote:
>
> The Spark 2 branch is based on this one:
> https://github.com/JoshRosen/hive/commits/release-1.2.1-spark2
>
>
> Didn't know this had moved. I had an outstanding PR against Patrick's fork
> which should really go in, if not already taken up (HIVE-11720;
> https://github.com/pwendell/hive/pull/2 )
>
>
> IMO I think it would make sense if -somehow- that Hive fork were in the
> ASF; it's got to be in sync with Spark releases, and it's not been ideal for
> me in terms of getting one or two fixes in, the other one being culling
> groovy 2.4.4 as an export
> ( https://github.com/steveloughran/hive/tree/stevel/SPARK-13471-groovy-2.4.4 )
>
> I don't know if the hive team themselves would be up to having it in their
> repo, or if committership logistics would suit it anyway. Otherwise,
> approaching infra@ and asking for a forked repo is likely to work with a
> bit of prodding
>
> rb
>
> On Fri, Oct 14, 2016 at 4:33 PM, Ethan Aubin 
> wrote:
>
>> In an email thread [1] from Aug 2015, it was mentioned that the source
>> to org.spark-project.hive was at
>> https://github.com/pwendell/hive/commits/release-1.2.1-spark .
>> That branch has a 1.2.1.spark version but spark 2.0.1 uses
>> 1.2.1.spark2. Could anyone point me to the repo for 1.2.1.spark2?
>> Thanks --Ethan
>>
>> [https://mail-archives.apache.org/mod_mbox/spark-dev/201508.mbox/%3ca0aa8b38-deee-476a-93ff-92fead06e...@hortonworks.com%3E]
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: source for org.spark-project.hive:1.2.1.spark2

2016-10-17 Thread Sean Owen
IIRC this was all about shading of dependencies, not changes to the source.

On Mon, Oct 17, 2016 at 6:26 PM Ryan Blue  wrote:

> Are these changes that the Hive community has rejected? I don't see a
> compelling reason to have a long-term Spark fork of Hive.
>
> rb
>
> On Sat, Oct 15, 2016 at 5:27 AM, Steve Loughran 
> wrote:
>
>
> On 15 Oct 2016, at 01:28, Ryan Blue  wrote:
>
> The Spark 2 branch is based on this one:
> https://github.com/JoshRosen/hive/commits/release-1.2.1-spark2
>
>
> Didn't know this had moved. I had an outstanding PR against Patrick's fork
> which should really go in, if not already taken up (HIVE-11720;
> https://github.com/pwendell/hive/pull/2 )
>
>
> IMO I think it would make sense if -somehow- that hive fork were in the
> ASF; it's got to be in sync with Spark releases, and its not been ideal for
> me in terms of getting one or two fixes in, the other one being culling
> groovy 2.4.4 as an export (
> https://github.com/steveloughran/hive/tree/stevel/SPARK-13471-groovy-2.4.4
>  )
>
> I don't know if the hive team themselves would be up to having it in their
> repo, or if committership logistics would suit it anyway. Otherwise,
> approaching infra@ and asking for a forked repo is likely to work with a
> bit of prodding
>
> rb
>
> On Fri, Oct 14, 2016 at 4:33 PM, Ethan Aubin 
> wrote:
>
> In an email thread [1] from Aug 2015, it was mentioned that the source
> to org.spark-project.hive was at
> https://github.com/pwendell/hive/commits/release-1.2.1-spark .
> That branch has a 1.2.1.spark version but spark 2.0.1 uses
> 1.2.1.spark2. Could anyone point me to the repo for 1.2.1.spark2?
> Thanks --Ethan
>
> [
> https://mail-archives.apache.org/mod_mbox/spark-dev/201508.mbox/%3ca0aa8b38-deee-476a-93ff-92fead06e...@hortonworks.com%3E
> ]
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
>
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>


Re: cutting 2.0.2?

2016-10-17 Thread Sean Owen
(I don't think 2.0.2 will be released for a while if at all but that's not
what you're asking I think)

It's a fairly safe change, but also isn't exactly a fix in my opinion.
Because there are some other changes to make it all work for SPARC, I think
it's more realistic to look to the 2.1.0 release anyway, which is likely to
come first.



On Mon, Oct 17, 2016 at 4:09 PM Erik O'Shaughnessy <
erik.oshaughne...@oracle.com> wrote:

> I would very much like to see SPARK-16962 included in 2.0.2 as it addresses
> unaligned memory access patterns that crash non-x86 platforms.  I believe
> this falls in the category of "correctness fix".  We (Oracle SAE) have
> applied the fixes for SPARK-16962 to branch-2.0 and have not encountered
> any
> problems on SPARC or x86 architectures attributable to unaligned accesses.
> Including this fix will allow Oracle SPARC customers to run Apache Spark
> without fear of crashing, expanding the reach of Apache Spark and making my
> life a little easier :)
>
> erik.oshaughne...@oracle.com
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/cutting-2-0-2-tp19473p19482.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


[VOTE] Release Apache Spark 1.6.3 (RC1)

2016-10-17 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version
1.6.3. The vote is open until Thursday, Oct 20, 2016 at 18:00 PDT and
passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.6.3
[ ] -1 Do not release this package because ...


The tag to be voted on is v1.6.3-rc1
(7375bb0c825408ea010dcef31c0759cf94ffe5c2)

This release candidate addresses 50 JIRA tickets:
https://s.apache.org/spark-1.6.3-jira

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.3-rc1-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1205/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-1.6.3-rc1-docs/


===========================================
== How can I help test this release?
===========================================
If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions from 1.6.2.
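
For example, if your workload builds with sbt, pointing a resolver at the
staging repository above and rebuilding against 1.6.3 should be enough (a
sketch; adjust the modules and scopes to your project):

// build.sbt sketch: resolve the Spark 1.6.3 RC1 artifacts from the staging repo
resolvers += "spark-1.6.3-rc1-staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1205/"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.3" % "provided",
  "org.apache.spark" %% "spark-sql"  % "1.6.3" % "provided"
)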


===========================================
== What justifies a -1 vote for this release?
===========================================

This is a maintenance release in the 1.6.x series.  Bugs already present in
1.6.2, missing features, or bugs related to new features will not
necessarily block this release.