Re: Extending Scala style checks

2014-10-08 Thread Reynold Xin
Thanks. I added one. On Wed, Oct 8, 2014 at 8:49 AM, Nicholas Chammas wrote: > I've created SPARK-3849: Automate remaining Scala style rules > . > > Please create sub-tasks on this issue for rules that we have not automated > and let's work thro

Re: will/when Spark/SparkSQL will support ORCFile format

2014-10-08 Thread Cheng Lian
The foreign data source API PR also matters here: https://www.github.com/apache/spark/pull/2475 Foreign data sources like ORC can be added more easily and systematically after this PR is merged. On 10/9/14 8:22 AM, James Yu wrote: Thanks Mark! I will keep an eye on it. @Evan, I saw people use bo

Re: spark-ec2 can't initialize spark-standalone module

2014-10-08 Thread Shivaram Venkataraman
There is a check to see whether the init.sh file exists (`if [[ -e $module/init.sh ]]; then`), so it just won't get called. spark-standalone doesn't have an init.sh because we don't have any initialization work to do for it (it's not necessary for all modules to have an init.sh) as the spark m

Re: will/when Spark/SparkSQL will support ORCFile format

2014-10-08 Thread James Yu
Thanks Mark! I will keep an eye on it. @Evan, I saw people use both formats, so I really want to have Spark support ORCFile. On Wed, Oct 8, 2014 at 11:12 AM, Mark Hamstra wrote: > https://github.com/apache/spark/pull/2576 > > > > On Wed, Oct 8, 2014 at 11:01 AM, Evan Chan > wrote: > >> James, >>

Re: Standardized Distance Functions in MLlib

2014-10-08 Thread Yu Ishikawa
Hi Xiangrui, Thank you very much for replying and letting me know that you upgraded breeze to 0.10 yesterday. Sorry that I didn't know that. > We don't want to maintain > another copy of the implementation in MLlib to keep the maintenance > cost low. Both spark and breeze are open-source proje

new jenkins update + tentative release date

2014-10-08 Thread shane knapp
greetings! i've got some updates regarding our new jenkins infrastructure, as well as the initial date and plan for rolling things out: *** current testing/build break whack-a-mole: a lot of out of date artifacts are cached in the current jenkins, which has caused a few builds during my testing t

Fwd: Accumulator question

2014-10-08 Thread Nathan Kronenfeld
I notice that accumulators register themselves with a private Accumulators object. I don't notice any way to unregister them when one is done. Am I missing something? If not, is there any plan for how to free up that memory? I've a case where we're gathering data from repeated queries using some

Re: Parquet schema migrations

2014-10-08 Thread Cody Koeninger
On Wed, Oct 8, 2014 at 3:19 PM, Michael Armbrust wrote: > > I was proposing you manually convert each different format into one > unified format (by adding literal nulls and such for missing columns) and > then union these converted datasets. It would be weird to have union all > try and do thi
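
The manual conversion Michael describes (pad each older format with literal nulls for missing columns, then union) can be sketched roughly as follows. This is plain Python over dictionaries, not the Spark SQL API; the column names, sample rows, and the helper `to_unified` are hypothetical and only illustrate the idea.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[object]]

def to_unified(rows: List[Row], columns: List[str]) -> List[Row]:
    """Project each row onto the unified column list, using None (null)
    for any column the older format does not have."""
    return [{col: row.get(col) for col in columns} for row in rows]

# Hypothetical data: the second format added a nullable "email" column.
v1_rows = [{"id": 1, "name": "a"}]
v2_rows = [{"id": 2, "name": "b", "email": "b@example.com"}]

unified_columns = ["id", "name", "email"]

# Convert each format first, then union the converted datasets.
unioned = to_unified(v1_rows, unified_columns) + to_unified(v2_rows, unified_columns)
```

Doing the padding before the union keeps the union itself simple: every input already has the same columns in the same order.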

spark-ec2 can't initialize spark-standalone module

2014-10-08 Thread Nicholas Chammas
This line in setup.sh initializes several modules, which are defined here. # Install / Init module

Re: Parquet schema migrations

2014-10-08 Thread Michael Armbrust
> > The kind of change we've made that it probably makes most sense to support > is adding a nullable column. I think that also implies supporting > "removing" a nullable column, as long as you don't end up with columns of > the same name but different type. > Filed here: https://issues.apache.org

Re: How to do broadcast join in SparkSQL

2014-10-08 Thread Michael Armbrust
Thanks for the input. We purposefully made sure that the config option did not make it into a release as it is not something that we are willing to support long term. That said we'll try and make this easier in the future either through hints or better support for statistics. In this particular

Re: Standardized Distance Functions in MLlib

2014-10-08 Thread Xiangrui Meng
Hi Yu, We upgraded breeze to 0.10 yesterday. So we can call the distance functions you contributed to breeze easily. We don't want to maintain another copy of the implementation in MLlib to keep the maintenance cost low. Both spark and breeze are open-source projects. We should try our best to avo

Re: will/when Spark/SparkSQL will support ORCFile format

2014-10-08 Thread Mark Hamstra
https://github.com/apache/spark/pull/2576 On Wed, Oct 8, 2014 at 11:01 AM, Evan Chan wrote: > James, > > Michael at the meetup last night said there was some development > activity around ORCFiles. > > I'm curious though, what are the pros and cons of ORCFiles vs Parquet? > > On Wed, Oct 8, 20

Re: will/when Spark/SparkSQL will support ORCFile format

2014-10-08 Thread Evan Chan
James, Michael at the meetup last night said there was some development activity around ORCFiles. I'm curious though, what are the pros and cons of ORCFiles vs Parquet? On Wed, Oct 8, 2014 at 10:03 AM, James Yu wrote: > Didn't see anyone asked the question before, but I was wondering if anyone

will/when Spark/SparkSQL will support ORCFile format

2014-10-08 Thread James Yu
I didn't see anyone ask this question before, but I was wondering if anyone knows whether Spark/SparkSQL will support the ORCFile format soon? ORCFile is getting more and more popular in the Hive world. Thanks, James

Re: Extending Scala style checks

2014-10-08 Thread Nicholas Chammas
I've created SPARK-3849: Automate remaining Scala style rules. Please create sub-tasks on this issue for rules that we have not automated and let's work through them as possible. I went ahead and created the first sub-task, SPARK-3850: Scala styl

Re: Spark on Mesos 0.20

2014-10-08 Thread RJ Nowling
Yep! That's the example I was talking about. Is an error message printed when it hangs? I get: 14/09/30 13:23:14 ERROR BlockManagerMasterActor: Got two different block manager registrations on 20140930-131734-1723727882-5050-1895-1 On Tue, Oct 7, 2014 at 8:36 PM, Fairiz Azizi wrote: > Sure

Re: Unneeded branches/tags

2014-10-08 Thread Nicholas Chammas
So: - tags: can delete - branches: stuck with ‘em Correct? Nick On Wed, Oct 8, 2014 at 1:52 AM, Patrick Wendell wrote: > Actually - weirdly - we can delete old tags and it works with the > mirroring. Nick if you put together a list of un-needed tags I can > delete them. > > On Tue, Oc

Standardized Distance Functions in MLlib

2014-10-08 Thread Yu Ishikawa
Hi all, In my limited understanding of MLlib, it would be a good idea to support various distance functions in some machine learning algorithms. For example, KMeans can currently use only the Euclidean distance metric. I am also working on contributing hierarchical clustering to MLlib (https://issues.apa

Re: How to do broadcast join in SparkSQL

2014-10-08 Thread Jianshi Huang
Ok, currently there's cost-based optimization; however, Parquet statistics are not implemented... What's a good way to join a big fact table with several tiny dimension tables in Spark SQL (1.1)? I wish we could allow a user hint for the join. Jianshi On Wed, Oct 8, 2014 at 2:18 PM, Jiansh

Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-10-08 Thread Liquan Pei
I am working on a PR to leverage the HashJoin trait code to optimize the Left/Right outer join. It has already been tested locally, and I will send out the PR soon after some cleanup. Thanks, Liquan On Wed, Oct 8, 2014 at 12:09 AM, Matei Zaharia wrote: > I'm pretty sure inner joins on Spark SQL alr

Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-10-08 Thread Matei Zaharia
I'm pretty sure inner joins on Spark SQL already build only one of the sides. Take a look at ShuffledHashJoin, which calls HashJoin.joinIterators. Only outer joins do both, and it seems like we could optimize it for those that are not full. Matei On Oct 7, 2014, at 11:04 PM, Haopu Wang wrote
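
The point that only one side needs a hash table, for inner joins and for outer joins that are not full, can be sketched roughly as follows. This is a plain-Python illustration of the general technique, not the ShuffledHashJoin or HashOuterJoin code; the function names and sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Row = Dict[str, object]

def build_hash_table(rows: List[Row], key: str) -> Dict[object, List[Row]]:
    """Group the rows of the build side by their join key."""
    table: Dict[object, List[Row]] = defaultdict(list)
    for row in rows:
        table[row[key]].append(row)
    return table

def hash_inner_join(build_rows: List[Row], stream_rows: List[Row],
                    key: str) -> List[Tuple[Row, Row]]:
    """Inner join: hash only the build side, then stream the other side."""
    table = build_hash_table(build_rows, key)
    out = []
    for s in stream_rows:
        for b in table.get(s[key], []):
            out.append((s, b))
    return out

def hash_left_outer_join(left_rows: List[Row], right_rows: List[Row],
                         key: str) -> List[Tuple[Row, Optional[Row]]]:
    """Left outer join: hash only the right side; every streamed left row
    is emitted, paired with None when it has no match."""
    table = build_hash_table(right_rows, key)
    out: List[Tuple[Row, Optional[Row]]] = []
    for l in left_rows:
        matches = table.get(l[key])
        if matches:
            out.extend((l, r) for r in matches)
        else:
            out.append((l, None))
    return out

# Hypothetical sample data.
orders = [{"cust": 1, "item": "x"}, {"cust": 2, "item": "y"}, {"cust": 3, "item": "z"}]
customers = [{"cust": 1, "name": "Ann"}, {"cust": 2, "name": "Bob"}]

inner = hash_inner_join(customers, orders, "cust")
outer = hash_left_outer_join(orders, customers, "cust")
```

A full outer join is the case that needs more than this, because unmatched rows from both sides must be emitted, so the build side has to track which of its rows were never matched.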