Re: Help choose a GraphFrames logo

2025-01-18 Thread Matei Zaharia
It looks great to me! > On Jan 17, 2025, at 8:56 PM, Felix Cheung wrote: > > Nice > From: Russell Jurney > Sent: Friday, January 17, 2025 2:46:14 PM > To: Mich Talebzadeh > Cc: Ángel ; Russell Jurney > ; Denny Lee ; user > ; graphfra...@googlegroups.com > > Subject: Re: Help choose a Graph

Re: Why Apache Spark doesn't use Calcite?

2020-01-13 Thread Matei Zaharia
I’m pretty sure that Catalyst was built before Calcite, or at least in parallel. Calcite 1.0 was only released in 2015. From a technical standpoint, building Catalyst in Scala also made it more concise and easier to extend than an optimizer written in Java (you can find various presentations abo

Re: Spark 2.4.0 artifact in Maven repository

2018-11-06 Thread Matei Zaharia
Hi Bartosz, This is because the vote on 2.4 has passed (you can see the vote thread on the dev mailing list) and we are just working to get the release into various channels (Maven, PyPI, etc), which can take some time. Expect to see an announcement soon once that’s done. Matei > On Nov 4, 20

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Matei Zaharia
't users prefer to get that notification sooner > rather than later? > > On Mon, Sep 17, 2018 at 12:58 PM Matei Zaharia > wrote: > I’d like to understand the maintenance burden of Python 2 before deprecating > it. Since it is not EOL yet, it might make sense to only deprecate it o

Re: Should python-2 be supported in Spark 3.0?

2018-09-17 Thread Matei Zaharia
I’d like to understand the maintenance burden of Python 2 before deprecating it. Since it is not EOL yet, it might make sense to only deprecate it once it’s EOL (which is still over a year from now). Supporting Python 2+3 seems less burdensome than supporting, say, multiple Scala versions in the

Re: Is there any open source framework that converts Cypher to SparkSQL?

2018-09-16 Thread Matei Zaharia
GraphFrames (https://graphframes.github.io) offers a Cypher-like syntax that then executes on Spark SQL. > On Sep 14, 2018, at 2:42 AM, kant kodali wrote: > > Hi All, > > Is there any open source framework that converts Cypher to SparkSQL? > > Thanks! ---
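For anyone curious what that syntax looks like in practice, here is a minimal Scala sketch of GraphFrames' motif-finding API; the graph data and column values are illustrative, not from this thread:

    import org.graphframes.GraphFrame

    // Assumes a SparkSession named `spark` and the graphframes package on the
    // classpath. Vertices need an "id" column; edges need "src" and "dst".
    val vertices = spark.createDataFrame(Seq(
      ("a", "Alice"), ("b", "Bob"), ("c", "Carol"))).toDF("id", "name")
    val edges = spark.createDataFrame(Seq(
      ("a", "b"), ("b", "c"))).toDF("src", "dst")
    val g = GraphFrame(vertices, edges)

    // Cypher-like motif: two-hop chains x -> y -> z, executed as Spark SQL joins.
    g.find("(x)-[e1]->(y); (y)-[e2]->(z)").show()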

Re: how can I run spark job in my environment which is a single Ubuntu host with no hadoop installed

2018-06-17 Thread Matei Zaharia
Maybe your application is overriding the master variable when it creates its SparkContext. I see you are still passing “yarn-client” as an argument later to it in your command. > On Jun 17, 2018, at 11:53 AM, Raymond Xie wrote: > > Thank you Subhash. > > Here is the new command: > spark-submi

Re: Spark 1.x - End of life

2017-10-19 Thread Matei Zaharia
Hi Ismael, It depends on what you mean by “support”. In general, there won’t be new feature releases for 1.X (e.g. Spark 1.7) because all the new features are being added to the master branch. However, there is always room for bug fix releases if there is a catastrophic bug, and committers can

Re: Kill Spark Streaming JOB from Spark UI or Yarn

2017-08-27 Thread Matei Zaharia
The batches should all have the same application ID, so use that one. You can also find the application in the YARN UI to terminate it from there. Matei > On Aug 27, 2017, at 10:27 AM, KhajaAsmath Mohammed > wrote: > > Hi, > > I am new to spark streaming and not able to find an option to kil

Re: real world spark code

2017-07-25 Thread Matei Zaharia
You can also find a lot of GitHub repos for external packages here: http://spark.apache.org/third-party-projects.html Matei > On Jul 25, 2017, at 5:30 PM, Frank Austin Nothaft > wrote: > > There’s a number of real-world open source Spark applications in the sciences: > > genomics: > > githu

Re: Structured Streaming with Kafka Source, does it work??

2016-11-06 Thread Matei Zaharia
The Kafka source will only appear in 2.0.2 -- see this thread for the current release candidate: https://lists.apache.org/thread.html/597d630135e9eb3ede54bb0cc0b61a2b57b189588f269a64b58c9243@%3Cdev.spark.apache.org%3E . You can try that right now if you want from the staging Maven repo shown th

Re: RESTful Endpoint and Spark

2016-10-06 Thread Matei Zaharia
This is exactly what the Spark SQL Thrift server does, if you just want to access it using JDBC. Matei > On Oct 6, 2016, at 4:27 PM, Benjamin Kim wrote: > > Has anyone tried to integrate Spark with a server farm of RESTful API > endpoints or even HTTP web-servers for that matter? I know it’s
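As a rough sketch of what querying the Thrift server over JDBC looks like (the host, port, table name, and credentials below are placeholders):

    import java.sql.DriverManager

    // Assumes the Spark SQL Thrift server is running on localhost:10000 and
    // the Hive JDBC driver is on the classpath.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection(
      "jdbc:hive2://localhost:10000/default", "user", "")
    val stmt = conn.createStatement()
    val rs = stmt.executeQuery("SELECT count(*) FROM my_table")
    while (rs.next()) println(rs.getLong(1))
    conn.close()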

Re: Is "spark streaming" streaming or mini-batch?

2016-08-23 Thread Matei Zaharia
I think people explained this pretty well, but in practice, this distinction is also somewhat of a marketing term, because every system will perform some kind of batching. For example, every time you use TCP, the OS and network stack may buffer multiple messages together and send them at once; a

Re: unsubscribe

2016-08-10 Thread Matei Zaharia
To unsubscribe, please send an email to user-unsubscr...@spark.apache.org from the address you're subscribed from. Matei > On Aug 10, 2016, at 12:48 PM, Sohil Jain wrote: > > - To unsubscribe e-mail: user-unsubscr...@spark.

Re: Dropping late data in Structured Streaming

2016-08-06 Thread Matei Zaharia
Yes, a built-in mechanism is planned in future releases. You can also drop it using a filter for now but the stateful operators will still keep state for old windows. Matei > On Aug 6, 2016, at 9:40 AM, Amit Sela wrote: > > I've noticed that when using Structured Streaming with event-time win
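A minimal sketch of the interim filter approach, assuming a streaming DataFrame named `events` with an event-time column called `eventTime`:

    import org.apache.spark.sql.functions.{col, expr}

    // Keep only rows whose event time falls within the last hour of processing
    // time. As noted above, stateful operators still keep state for old windows.
    val recent = events.filter(
      col("eventTime") > expr("current_timestamp() - INTERVAL 1 HOUR"))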

Re: The Future Of DStream

2016-07-27 Thread Matei Zaharia
Yup, they will definitely coexist. Structured Streaming is currently alpha and will probably be complete in the next few releases, but Spark Streaming will continue to exist, because it gives the user more low-level control. It's similar to DataFrames vs RDDs (RDDs are the lower-level API for wh

Updated Spark logo

2016-06-10 Thread Matei Zaharia
Hi all, FYI, we've recently updated the Spark logo at https://spark.apache.org/ to say "Apache Spark" instead of just "Spark". Many ASF projects have been doing this recently to make it clearer that they are associated with the ASF, and indeed the ASF's branding guidelines generally require that

Re: Apache Spark Slack

2016-05-16 Thread Matei Zaharia
I don't think any of the developers use this as an official channel, but all the ASF IRC channels are indeed on FreeNode. If there's demand for it, we can document this on the website and say that it's mostly for users to find other users. Development discussions should happen on the dev mailing

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Matei Zaharia
This sounds good to me as well. The one thing we should pay attention to is how we update the docs so that people know to start with the spark.ml classes. Right now the docs list spark.mllib first and also seem more comprehensive in that area than in spark.ml, so maybe people naturally move towa

Re: simultaneous actions

2016-01-17 Thread Matei Zaharia
rkers-become-available basis)? > > On 15 January 2016 at 11:44, Koert Kuipers <mailto:ko...@tresata.com>> wrote: > we run multiple actions on the same (cached) rdd all the time, i guess in > different threads indeed (its in akka) > > On Fri, Jan 15, 2016 at 2:40 PM,

Re: Compiling only MLlib?

2016-01-15 Thread Matei Zaharia
Have you tried just downloading a pre-built package, or linking to Spark through Maven? You don't need to build it unless you are changing code inside it. Check out http://spark.apache.org/docs/latest/quick-start.html#self-contained-applications for how to link to it. Matei > On Jan 15, 2016,

Re: simultaneous actions

2016-01-15 Thread Matei Zaharia
RDDs actually are thread-safe, and quite a few applications use them this way, e.g. the JDBC server. Matei > On Jan 15, 2016, at 2:10 PM, Jakob Odersky wrote: > > I don't think RDDs are threadsafe. > More fundamentally however, why would you want to run RDD actions in > parallel? The idea beh
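A small sketch of this pattern, assuming an existing SparkContext `sc`; the two actions are submitted from separate threads via Scala futures:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // Two actions on the same cached RDD; the scheduler runs both jobs
    // concurrently because each Future submits from its own thread.
    val rdd = sc.parallelize(1 to 1000000).cache()
    val countF = Future { rdd.count() }
    val sumF = Future { rdd.sum() }
    println(Await.result(countF, Duration.Inf))
    println(Await.result(sumF, Duration.Inf))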

Re: Read from AWS s3 with out having to hard-code sensitive keys

2016-01-11 Thread Matei Zaharia
In production, I'd recommend using IAM roles to avoid having keys altogether. Take a look at http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html. Matei > On Jan 11, 2016, at 11:32 AM, Sabarish Sasidharan > wrote: > > If you are on EMR, these can go into your hdfs

Re: How to compile Spark with customized Hadoop?

2015-10-09 Thread Matei Zaharia
You can publish your version of Hadoop to your Maven cache with mvn publish (just give it a different version number, e.g. 2.7.0a) and then pass that as the Hadoop version to Spark's build (see http://spark.apache.org/docs/latest/building-spark.html

Re: Ranger-like Security on Spark

2015-09-03 Thread Matei Zaharia
policies as well? > > Best regards, Daniel. > > > On 03 Sep 2015, at 21:16, Matei Zaharia > <mailto:matei.zaha...@gmail.com>> wrote: > > > > If you run on YARN, you can use Kerberos, be authenticated as the right > > user, etc in the same way as MapRedu

Re: Ranger-like Security on Spark

2015-09-03 Thread Matei Zaharia
If you run on YARN, you can use Kerberos, be authenticated as the right user, etc in the same way as MapReduce jobs. Matei > On Sep 3, 2015, at 1:37 PM, Daniel Schulz > wrote: > > Hi, > > I really enjoy using Spark. An obstacle to sell it to our clients currently > is the missing Kerberos-l

Re: work around Size exceeds Integer.MAX_VALUE

2015-07-09 Thread Matei Zaharia
This means that one of your cached RDD partitions is bigger than 2 GB of data. You can fix it by having more partitions. If you read data from a file system like HDFS or S3, set the number of partitions higher in the sc.textFile, hadoopFile, etc methods (it's an optional second parameter to thos
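Both fixes look roughly like this, assuming an existing SparkContext `sc`; the path and partition counts are illustrative:

    // Ask for more input partitions up front (the optional second parameter
    // mentioned above)...
    val lines = sc.textFile("hdfs:///data/big.log", 2000)

    // ...or split an existing RDD further before caching, so that no single
    // cached partition exceeds 2 GB.
    val smaller = lines.repartition(2000)
    smaller.cache()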

Re: Spark or Storm

2015-06-17 Thread Matei Zaharia
derstand that > 1) There is no global ordering; e.g. an output operation for batch consisting > of offset [4,5,6] can be invoked before the operation for offset [1,2,3] > 2) If you wanted to achieve something similar to what TridentState does, > you'll have to do it yourself (for

Re: Spark or Storm

2015-06-17 Thread Matei Zaharia
This documentation is only for writes to an external system, but all the counting you do within your streaming app (e.g. if you use reduceByKeyAndWindow to keep track of a running count) is exactly-once. When you write to a storage system, no matter which streaming framework you use, you'll have

Re: Equivalent to Storm's 'field grouping' in Spark.

2015-06-03 Thread Matei Zaharia
This happens automatically when you use the byKey operations, e.g. reduceByKey, updateStateByKey, etc. Spark Streaming keeps the state for a given set of keys on a specific node and sends new tuples with that key to that. Matei > On Jun 3, 2015, at 6:31 AM, allonsy wrote: > > Hi everybody, >
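A minimal sketch of that behavior, assuming a DStream of text lines named `lines`:

    // reduceByKey hash-partitions by key, so all tuples with a given key are
    // handled on the same node -- the analogue of Storm's field grouping.
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)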

Re: map - reduce only with disk

2015-06-02 Thread Matei Zaharia
like they do now? > > Thank you! > > 2015-06-02 21:25 GMT+02:00 Matei Zaharia <mailto:matei.zaha...@gmail.com>>: > You shouldn't have to persist the RDD at all, just call flatMap and reduce on > it directly. If you try to persist it, that will try to load the origin

Re: map - reduce only with disk

2015-06-02 Thread Matei Zaharia
"spark.executor.memory", "115g") > conf.set("spark.shuffle.file.buffer.kb", "1000") > > my spark-env.sh: > ulimit -n 200000 > SPARK_JAVA_OPTS="-Xss1g -Xmx129g -d64 -XX:-UseGCOverheadLimit > -XX:-UseCompressedOops" > SPARK

Re: map - reduce only with disk

2015-06-01 Thread Matei Zaharia
As long as you don't use cache(), these operations will go from disk to disk, and will only use a fixed amount of memory to build some intermediate results. However, note that because you're using groupByKey, that needs the values for each key to all fit in memory at once. In this case, if you'r
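A sketch of the no-cache pipeline described above, with illustrative paths and assuming an existing SparkContext `sc`:

    // No cache() anywhere, so data streams disk -> bounded in-memory buffers
    // -> disk. reduceByKey aggregates incrementally, unlike groupByKey, which
    // needs all values for a key in memory at once.
    sc.textFile("hdfs:///input")
      .flatMap(_.split(" "))
      .map(word => (word, 1L))
      .reduceByKey(_ + _)
      .saveAsTextFile("hdfs:///output")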

Re: Spark logo license

2015-05-19 Thread Matei Zaharia
Check out Apache's trademark guidelines here: http://www.apache.org/foundation/marks/ Matei > On May 20, 2015, at 12:02 AM, Justin Pihony wrote: > > What is the license on using the spark logo. Is it free to be used for > displaying commercially? > >

Re: Wish for 1.4: upper bound on # tasks in Mesos

2015-05-19 Thread Matei Zaharia
ome the limit of tasks per job :) > > cheers, > Tom > > On Tue, May 19, 2015 at 10:05 AM, Matei Zaharia <mailto:matei.zaha...@gmail.com>> wrote: > Hey Tom, > > Are you using the fine-grained or coarse-grained scheduler? For the > coarse-grained scheduler, there

Re: Wish for 1.4: upper bound on # tasks in Mesos

2015-05-19 Thread Matei Zaharia
Hey Tom, Are you using the fine-grained or coarse-grained scheduler? For the coarse-grained scheduler, there is a spark.cores.max config setting that will limit the total # of cores it grabs. This was there in earlier versions too. Matei > On May 19, 2015, at 12:39 PM, Thomas Dudziak wrote: >

Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread Matei Zaharia
(Sorry, for non-English people: that means it's a good thing.) Matei > On May 14, 2015, at 10:53 AM, Matei Zaharia wrote: > > ...This is madness! > >> On May 14, 2015, at 9:31 AM, dmoralesdf wrote: >> >> Hi there, >> >> We have released

Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread Matei Zaharia
...This is madness! > On May 14, 2015, at 9:31 AM, dmoralesdf wrote: > > Hi there, > > We have released our real-time aggregation engine based on Spark Streaming. > > SPARKTA is fully open source (Apache2) > > > You can checkout the slides showed up at the Strata past week: > > http://www.s

Re: large volume spark job spends most of the time in AppendOnlyMap.changeValue

2015-05-12 Thread Matei Zaharia
It could also be that your hash function is expensive. What is the key class you have for the reduceByKey / groupByKey? Matei > On May 12, 2015, at 10:08 AM, Night Wolf wrote: > > I'm seeing a similar thing with a slightly different stack trace. Ideas? > > org.apache.spark.util.collection.App

Re: Spark on Windows

2015-04-16 Thread Matei Zaharia
You could build Spark with Scala 2.11 on Mac / Linux and transfer it over to Windows. AFAIK it should build on Windows too, the only problem is that Maven might take a long time to download dependencies. What errors are you seeing? Matei > On Apr 16, 2015, at 9:23 AM, Arun Lists wrote: > > We

Re: Dataset announcement

2015-04-15 Thread Matei Zaharia
Very neat, Olivier; thanks for sharing this. Matei > On Apr 15, 2015, at 5:58 PM, Olivier Chapelle wrote: > > Dear Spark users, > > I would like to draw your attention to a dataset that we recently released, > which is as of now the largest machine learning dataset ever released; see > the fol

Re: IPyhon notebook command for spark need to be updated?

2015-03-20 Thread Matei Zaharia
Feel free to send a pull request to fix the doc (or say which versions it's needed in). Matei > On Mar 20, 2015, at 6:49 PM, Krishna Sankar wrote: > > Yep the command-option is gone. No big deal, just add the '%pylab inline' > command as part of your notebook. > Cheers > > > On Fri, Mar 20,

Re: Querying JSON in Spark SQL

2015-03-16 Thread Matei Zaharia
The programming guide has a short example: http://spark.apache.org/docs/latest/sql-programming-guide.html#json-datasets . Note that once you infer a schema for a JSON dataset, you can also use nested path notation (e.

Re: Berlin Apache Spark Meetup

2015-02-17 Thread Matei Zaharia
Thanks! I've added you. Matei > On Feb 17, 2015, at 4:06 PM, Ralph Bergmann | the4thFloor.eu > wrote: > > Hi, > > > there is a small Spark Meetup group in Berlin, Germany :-) > http://www.meetup.com/Berlin-Apache-Spark-Meetup/ > > Plaes add this group to the Meetups list at > https://spark.

Re: Beginner in Spark

2015-02-06 Thread Matei Zaharia
You don't need HDFS or virtual machines to run Spark. You can just download it, unzip it and run it on your laptop. See http://spark.apache.org/docs/latest/index.html . Matei > On Feb 6, 2015, at 2:58 PM, David Fallside wrote: > > King, consid

Re: Why must the dstream.foreachRDD(...) parameter be serializable?

2015-01-27 Thread Matei Zaharia
I believe this is needed for driver recovery in Spark Streaming. If your Spark driver program crashes, Spark Streaming can recover the application by reading the set of DStreams and output operations from a checkpoint file (see https://spark.apache.org/docs/latest/streaming-programming-guide.htm
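The recovery pattern from the programming guide looks roughly like this; the checkpoint path and batch interval are placeholders:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // The DStream graph, including any foreachRDD functions, is serialized
    // into the checkpoint -- which is why those functions must be serializable.
    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(conf, Seconds(10)) // `conf` assumed defined
      ssc.checkpoint("hdfs:///checkpoints/app")
      // ... define DStreams and output operations here ...
      ssc
    }

    val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/app", createContext _)
    ssc.start()
    ssc.awaitTermination()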

Re: Spark UI and Spark Version on Google Compute Engine

2015-01-17 Thread Matei Zaharia
Unfortunately we don't have anything to do with Spark on GCE, so I'd suggest asking in the GCE support forum. You could also try to launch a Spark cluster by hand on nodes in there. Sigmoid Analytics published a package for this here: http://spark-packages.org/package/9 Matei > On Jan 17, 2015

Re: spark 1.2 compatibility

2015-01-16 Thread Matei Zaharia
The Apache Spark project should work with it, but I'm not sure you can get support from HDP (if you have that). Matei > On Jan 16, 2015, at 5:36 PM, Judy Nash > wrote: > > Should clarify on this. I personally have used HDP 2.1 + Spark 1.2 and have > not seen a problem. > > However official

Re: Spark's equivalent of ShellBolt

2015-01-14 Thread Matei Zaharia
You can use the pipe() function on RDDs to call external code. It passes data to an external program through stdin / stdout. For Spark Streaming, you would do dstream.transform(rdd => rdd.pipe(...)) to call it on each RDD. Matei > On Jan 14, 2015, at 8:41 PM, Umanga Bista wrote: > > > This i
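In sketch form, with a placeholder script path and assuming a DStream named `dstream`:

    // Each RDD's elements are written to the external program's stdin, one per
    // line; its stdout lines become the elements of the resulting RDD.
    val piped = dstream.transform(rdd => rdd.pipe("/path/to/script.sh"))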

Re: Pattern Matching / Equals on Case Classes in Spark Not Working

2015-01-12 Thread Matei Zaharia
Is this in the Spark shell? Case classes don't work correctly in the Spark shell unfortunately (though they do work in the Scala shell) because we change the way lines of code compile to allow shipping functions across the network. The best way to get case classes in there is to compile them int

Fwd: ApacheCon North America 2015 Call For Papers

2015-01-05 Thread Matei Zaharia
FYI, ApacheCon North America call for papers is up. Matei > Begin forwarded message: > > Date: January 5, 2015 at 9:40:41 AM PST > From: Rich Bowen > Reply-To: dev > To: dev > Subject: ApacheCon North America 2015 Call For Papers > > Fellow ASF enthusiasts, > > We now have less than a month

Re: JetS3T settings spark

2014-12-30 Thread Matei Zaharia
This file needs to be on your CLASSPATH actually, not just in a directory. The best way to pass it in is probably to package it into your application JAR. You can put it in src/main/resources in a Maven or SBT project, and check that it makes it into the JAR using jar tf yourfile.jar. Matei >

Re: action progress in ipython notebook?

2014-12-29 Thread Matei Zaharia
Hey Eric, sounds like you are running into several issues, but thanks for reporting them. Just to comment on a few of these: > I'm not seeing RDDs or SRDDs cached in the Spark UI. That page remains empty > despite my calling cache(). This is expected until you compute the RDDs the first time a

Re: When will spark 1.2 released?

2014-12-18 Thread Matei Zaharia
Yup, as he posted before, "An Apache infrastructure issue prevented me from pushing this last night. The issue was resolved today and I should be able to push the final release artifacts tonight." > On Dec 18, 2014, at 10:14 PM, Andrew Ash wrote: > > Patrick is working on the release as we spe

Re: wordcount job slow while input from NFS mount

2014-12-17 Thread Matei Zaharia
ment and the NFS server is running on the same server that Spark > is running on. So basically I mount the NFS on the same bare metal machine. > > Larry > > On Wed, Dec 17, 2014 at 11:42 AM, Matei Zaharia <mailto:matei.zaha...@gmail.com>> wrote: > The problem is v

Re: wordcount job slow while input from NFS mount

2014-12-17 Thread Matei Zaharia
The problem is very likely NFS, not Spark. What kind of network is it mounted over? You can also test the performance of your NFS by copying a file from it to a local disk or to /dev/null and seeing how many bytes per second it can copy. Matei > On Dec 17, 2014, at 9:38 AM, Larryliu wrote: >

Re: Spark SQL Roadmap?

2014-12-13 Thread Matei Zaharia
Spark SQL is already available, the reason for the "alpha component" label is that we are still tweaking some of the APIs so we have not yet guaranteed API stability for it. However, that is likely to happen soon (possibly 1.3). One of the major things added in Spark 1.2 was an external data sou

Re: what is the best way to implement mini batches?

2014-12-11 Thread Matei Zaharia
You can just do mapPartitions on the whole RDD, and then called sliding() on the iterator in each one to get a sliding window. One problem is that you will not be able to slide "forward" into the next partition at partition boundaries. If this matters to you, you need to do something more compli
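A minimal sketch of the simple (per-partition) case, assuming an existing RDD `rdd`:

    // Iterator.sliding yields windows of 3 elements within each partition;
    // as noted above, windows cannot slide across partition boundaries.
    val windows = rdd.mapPartitions(iter => iter.sliding(3, 1))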

Re: dockerized spark executor on mesos?

2014-12-03 Thread Matei Zaharia
I'd suggest asking about this on the Mesos list (CCed). As far as I know, there was actually some ongoing work for this. Matei > On Dec 3, 2014, at 9:46 AM, Dick Davies wrote: > > Just wondered if anyone had managed to start spark > jobs on mesos wrapped in a docker container? > > At present

Re: configure to run multiple tasks on a core

2014-11-26 Thread Matei Zaharia
Instead of SPARK_WORKER_INSTANCES you can also set SPARK_WORKER_CORES, to have one worker that thinks it has more cores. Matei > On Nov 26, 2014, at 5:01 PM, Yotto Koga wrote: > > Thanks Sean. That worked out well. > > For anyone who happens onto this post and wants to do the same, these are

Re: do not assemble the spark example jar

2014-11-25 Thread Matei Zaharia
BTW as another tip, it helps to keep the SBT console open as you make source changes (by just running sbt/sbt with no args). It's a lot faster the second time it builds something. Matei > On Nov 25, 2014, at 8:31 PM, Matei Zaharia wrote: > > You can do sbt/sbt assembly/assemb

Re: do not assemble the spark example jar

2014-11-25 Thread Matei Zaharia
You can do sbt/sbt assembly/assembly to assemble only the main package. Matei > On Nov 25, 2014, at 7:50 PM, lihu wrote: > > Hi, > The spark assembly is time costly. If I only need the > spark-assembly-1.1.0-hadoop2.3.0.jar, do not need the > spark-examples-1.1.0-hadoop2.3.0.jar. How to

Re: Configuring custom input format

2014-11-25 Thread Matei Zaharia
Job has physically been submitted. > > On Tue, Nov 25, 2014 at 5:31 PM, Matei Zaharia <mailto:matei.zaha...@gmail.com>> wrote: > How are you creating the object in your Scala shell? Maybe you can write a > function that directly returns the RDD, without assigning the object

Re: Configuring custom input format

2014-11-25 Thread Matei Zaharia
How are you creating the object in your Scala shell? Maybe you can write a function that directly returns the RDD, without assigning the object to a temporary variable. Matei > On Nov 5, 2014, at 2:54 PM, Corey Nolet wrote: > > The closer I look @ the stack trace in the Scala shell, it appear

Re: Spark SQL - Any time line to move beyond Alpha version ?

2014-11-25 Thread Matei Zaharia
The main reason for the alpha tag is actually that APIs might still be evolving, but we'd like to freeze the API as soon as possible. Hopefully it will happen in one of 1.3 or 1.4. In Spark 1.2, we're adding an external data source API that we'd like to get experience with before freezing it. M

Re: rack-topology.sh no such file or directory

2014-11-19 Thread Matei Zaharia
Your Hadoop configuration is set to look for this file to determine racks. Is the file present on cluster nodes? If not, look at your hdfs-site.xml and remove the setting for a rack topology script there (or it might be in core-site.xml). Matei > On Nov 19, 2014, at 12:13 PM, Arun Luthra wrot

Re: Load json format dataset as RDD

2014-11-16 Thread Matei Zaharia
Spark SQL gives you an RDD of Row objects that you can query similarly to most JSON object libraries. For example, you can use row(0) to access feature 0, then cast it to something like a String, an Int, a Seq, or another Row if it's a nested object. You can also select the fields you want using

Re: closure serialization behavior driving me crazy

2014-11-10 Thread Matei Zaharia
Hey Sandy, Try using the -Dsun.io.serialization.extendedDebugInfo=true flag on the JVM to print the contents of the objects. In addition, something else that helps is to do the following: { val _arr = arr models.map(... _arr ...) } Basically, copy the global variable into a local one. The
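Reconstructed with line breaks, the pattern is as follows (`arr` and `models` are from the thread; `use` is a placeholder for the real map body):

    val result = {
      val _arr = arr                // copy the global into a local val
      models.map(m => use(m, _arr)) // the closure now captures only `_arr`,
                                    // not the enclosing object that holds `arr`
    }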

Re: Kafka version dependency in Spark 1.2

2014-11-10 Thread Matei Zaharia
Just curious, what are the pros and cons of this? Can the 0.8.1.1 client still talk to 0.8.0 versions of Kafka, or do you need it to match your Kafka version exactly? Matei > On Nov 10, 2014, at 9:48 AM, Bhaskar Dutta wrote: > > Hi, > > Is there any plan to bump the Kafka version dependency

Re: Why does this siimple spark program uses only one core?

2014-11-09 Thread Matei Zaharia
Call getNumPartitions() on your RDD to make sure it has the right number of partitions. You can also specify it when doing parallelize, e.g. rdd = sc.parallelize(xrange(1000), 10)) This should run in parallel if you have multiple partitions and cores, but it might be that during part of the pro

Re: weird caching

2014-11-08 Thread Matei Zaharia
It might mean that some partition was computed on two nodes, because a task for it wasn't able to be scheduled locally on the first node. Did the RDD really have 426 partitions total? You can click on it and see where there are copies of each one. Matei > On Nov 8, 2014, at 10:16 PM, Nathan Kr

Re: Breaking the previous large-scale sort record with Spark

2014-11-05 Thread Matei Zaharia
's thanks to all of you folks that we are able to make this > happen." > > Updated blog post: > http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html > > > > > On Fri, Oct 10, 2014 at 7:54 AM, Matei Zaharia > wr

Re: Any "Replicated" RDD in Spark?

2014-11-05 Thread Matei Zaharia
best > way for me to do that? Collect RDD in driver first and create broadcast? Or > any shortcut in spark for this? > > Thanks! > > -Original Message- > From: Shuai Zheng [mailto:szheng.c...@gmail.com] > Sent: Wednesday, November 05, 2014 3:32 PM > To: 'Mat

Re: Spark v Redshift

2014-11-04 Thread Matei Zaharia
read data exported from Redshift into Spark or Hadoop. Matei > On Nov 4, 2014, at 3:51 PM, Matei Zaharia wrote: > > Is this about Spark SQL vs Redshift, or Spark in general? Spark in general > provides a broader set of capabilities than Redshift because it has APIs in > genera

Re: Spark v Redshift

2014-11-04 Thread Matei Zaharia
Is this about Spark SQL vs Redshift, or Spark in general? Spark in general provides a broader set of capabilities than Redshift because it has APIs in general-purpose languages (Java, Scala, Python) and libraries for things like machine learning and graph processing. For example, you might use S

Re: Any "Replicated" RDD in Spark?

2014-11-03 Thread Matei Zaharia
You need to use broadcast followed by flatMap or mapPartitions to do map-side joins (in your map function, you can look at the hash table you broadcast and see what records match it). Spark SQL also does it by default for tables smaller than the spark.sql.autoBroadcastJoinThreshold setting (by d
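A minimal sketch of the broadcast-then-flatMap pattern, assuming an existing SparkContext `sc`; the data is illustrative:

    // The small side becomes a broadcast hash map; the big side looks each key
    // up inside flatMap -- a map-side join with no shuffle of the big table.
    val small = Map("a" -> 1, "b" -> 2)
    val smallBc = sc.broadcast(small)
    val big = sc.parallelize(Seq(("a", "x"), ("b", "y"), ("c", "z")))
    val joined = big.flatMap { case (k, v) =>
      smallBc.value.get(k).map(sv => (k, (v, sv))) // drops unmatched keys
    }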

Re: union of SchemaRDDs

2014-11-01 Thread Matei Zaharia
Thanks Matei. What does unionAll do if the input RDD schemas are not 100% > compatible. Does it take the union of the columns and generalize the types? > > thanks > Daniel > > On Sat, Nov 1, 2014 at 6:08 PM, Matei Zaharia <mailto:matei.zaha...@gmail.com>> wrote: > Try

Re: union of SchemaRDDs

2014-11-01 Thread Matei Zaharia
Try unionAll, which is a special method on SchemaRDDs that keeps the schema on the results. Matei > On Nov 1, 2014, at 3:57 PM, Daniel Mahler wrote: > > I would like to combine 2 parquet tables I have create. > I tried: > > sc.union(sqx.parquetFile("fileA"), sqx.parquetFile("fileB")) >
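Against the thread's example, the fix is one method call (Spark 1.x SchemaRDD API; `sqx` is the SQLContext from the original message):

    // unionAll is defined on SchemaRDD and keeps the schema, unlike sc.union.
    val combined = sqx.parquetFile("fileA").unionAll(sqx.parquetFile("fileB"))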

Re: SparkContext.stop() ?

2014-10-31 Thread Matei Zaharia
You don't have to call it if you just exit your application, but it's useful for example in unit tests if you want to create and shut down a separate SparkContext for each test. Matei > On Oct 31, 2014, at 10:39 AM, Evan R. Sparks wrote: > > In cluster settings if you don't explicitly call sc
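A typical unit-test shape for this, with a local master and an illustrative assertion:

    import org.apache.spark.{SparkConf, SparkContext}

    // Create a local context per test and always stop it, so the next test
    // can create a fresh one.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("test"))
    try {
      assert(sc.parallelize(1 to 10).count() == 10)
    } finally {
      sc.stop()
    }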

Re: Confused about class paths in spark 1.1.0

2014-10-30 Thread Matei Zaharia
ior as providing > "--driver.class.apth" to spark-shell. Correct? If so I will file a bug > report since this is definitely not the case. > > > On Thu, Oct 30, 2014 at 5:39 PM, Matei Zaharia <mailto:matei.zaha...@gmail.com>> wrote: > Try using --jars inst

Re: Confused about class paths in spark 1.1.0

2014-10-30 Thread Matei Zaharia
Try using --jars instead of the driver-only options; they should work with spark-shell too but they may be less tested. Unfortunately, you do have to specify each JAR separately; you can maybe use a shell script to list a directory and get a big list, or set up a project that builds all of the

Re: BUG: when running as "extends App", closures don't capture variables

2014-10-29 Thread Matei Zaharia
Good catch! If you'd like, you can send a pull request changing the files in docs/ to do this (see https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark ), otherwise maybe open an issue on https://issues.

Re: Is Spark in Java a bad idea?

2014-10-28 Thread Matei Zaharia
The overridable methods of RDD are marked as @DeveloperApi, which means that these are internal APIs used by people that might want to extend Spark, but are not guaranteed to remain stable across Spark versions (unlike Spark's public APIs). BTW, if you want a way to do this that does not involv

Re: Is Spark in Java a bad idea?

2014-10-28 Thread Matei Zaharia
A pretty large fraction of users use Java, but a few features are still not available in it. JdbcRDD is one of them -- this functionality will likely be superseded by Spark SQL when we add JDBC as a data source. In the meantime, to use it, I'd recommend writing a class in Scala that has Java-fri

Re: Primitive arrays in Spark

2014-10-21 Thread Matei Zaharia
It seems that ++ does the right thing on arrays of longs, and gives you another one: scala> val a = Array[Long](1,2,3) a: Array[Long] = Array(1, 2, 3) scala> val b = Array[Long](1,2,3) b: Array[Long] = Array(1, 2, 3) scala> a ++ b res0: Array[Long] = Array(1, 2, 3, 1, 2, 3) scala> res0.getClas

Re: Submissions open for Spark Summit East 2015

2014-10-19 Thread Matei Zaharia
BTW several people asked about registration and student passes. Registration will open in a few weeks, and like in previous Spark Summits, I expect there to be a special pass for students. Matei > On Oct 18, 2014, at 9:52 PM, Matei Zaharia wrote: > > After successful events in the

Submissions open for Spark Summit East 2015

2014-10-18 Thread Matei Zaharia
After successful events in the past two years, the Spark Summit conference has expanded for 2015, offering both an event in New York on March 18-19 and one in San Francisco on June 15-17. The conference is a great chance to meet people from throughout the Spark community and see the latest news,

Re: mllib.linalg.Vectors vs Breeze?

2014-10-18 Thread Matei Zaharia
toBreeze is private within Spark, it should not be accessible to users. If you want to make a Breeze vector from an MLlib one, it's pretty straightforward, and you can make your own utility function for it. Matei > On Oct 17, 2014, at 5:09 PM, Sean Owen wrote: > > Yes, I think that's the log
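Such a utility function can be as small as this (dense vectors only, for simplicity):

    import breeze.linalg.{DenseVector => BDV}
    import org.apache.spark.mllib.linalg.Vector

    // Convert an MLlib vector to a Breeze one via the public toArray method.
    def asBreeze(v: Vector): BDV[Double] = new BDV(v.toArray)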

Re: Breaking the previous large-scale sort record with Spark

2014-10-13 Thread Matei Zaharia
nning into a variety of issues. Thanks in advance! > > On Oct 10, 2014 10:54 AM, "Matei Zaharia" wrote: > Hi folks, > > I interrupt your regularly scheduled user / dev list to bring you some pretty > cool news for the project, which is that we've been able to

Re: Blog post: An Absolutely Unofficial Way to Connect Tableau to SparkSQL (Spark 1.1)

2014-10-11 Thread Matei Zaharia
Very cool Denny, thanks for sharing this! Matei On Oct 11, 2014, at 9:46 AM, Denny Lee wrote: > https://www.concur.com/blog/en-us/connect-tableau-to-sparksql > > If you're wondering how to connect Tableau to SparkSQL - here are the steps > to connect Tableau to SparkSQL. > > > > Enjoy! >

Re: add Boulder-Denver Spark meetup to list on website

2014-10-10 Thread Matei Zaharia
Added you, thanks! (You may have to shift-refresh the page to see it updated). Matei On Oct 10, 2014, at 1:52 PM, Michael Oczkowski wrote: > Please add the Boulder-Denver Spark meetup group to the list on the website. > http://www.meetup.com/Boulder-Denver-Spark-Meetup/ > > > Michael Oczko

Breaking the previous large-scale sort record with Spark

2014-10-10 Thread Matei Zaharia
Hi folks, I interrupt your regularly scheduled user / dev list to bring you some pretty cool news for the project, which is that we've been able to use Spark to break MapReduce's 100 TB and 1 PB sort records, sorting data 3x faster on 10x fewer nodes. There's a detailed writeup at http://datab

Re: Convert a org.apache.spark.sql.SchemaRDD[Row] to a RDD of Strings

2014-10-09 Thread Matei Zaharia
A SchemaRDD is still an RDD, so you can just do rdd.map(row => row.toString). Or if you want to get a particular field of the row, you can do rdd.map(row => row(3).toString). Matei On Oct 9, 2014, at 1:22 PM, Soumya Simanta wrote: > I've a SchemaRDD that I want to convert to a RDD that contai

Re: Spark SQL question: why build hashtable for both sides in HashOuterJoin?

2014-10-08 Thread Matei Zaharia
I'm pretty sure inner joins on Spark SQL already build only one of the sides. Take a look at ShuffledHashJoin, which calls HashJoin.joinIterators. Only outer joins do both, and it seems like we could optimize it for those that are not full. Matei On Oct 7, 2014, at 11:04 PM, Haopu Wang wrote

Re: Spark SQL -- more than two tables for join

2014-10-07 Thread Matei Zaharia
The issue is that you're using SQLContext instead of HiveContext. SQLContext implements a smaller subset of the SQL language and so you're getting a SQL parse error because it doesn't support the syntax you have. Look at how you'd write this in HiveQL, and then try doing that with HiveContext.
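The switch is a one-liner in Spark 1.x; the query below is an illustrative three-table join, not the original poster's:

    import org.apache.spark.sql.hive.HiveContext

    // HiveContext parses the fuller HiveQL dialect, so multi-table joins that
    // SQLContext's parser rejects will work here. `sc` is an existing SparkContext.
    val hiveContext = new HiveContext(sc)
    hiveContext.sql(
      "SELECT a.id, b.name, c.total FROM a JOIN b ON a.id = b.id JOIN c ON b.id = c.id")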

Re: Strategies for reading large numbers of files

2014-10-06 Thread Matei Zaharia
The problem is that listing the metadata for all these files in S3 takes a long time. Something you can try is the following: split your files into several non-overlapping paths (e.g. s3n://bucket/purchase/2014/01, s3n://bucket/purchase/2014/02, etc), then do sc.parallelize over a list of such
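One plausible shape of the non-overlapping-paths idea (the message is truncated, so this completion is a guess; the bucket names are placeholders):

    // Build one RDD per prefix, then union them, so no single listing call
    // has to enumerate the whole bucket.
    val prefixes = Seq(
      "s3n://bucket/purchase/2014/01",
      "s3n://bucket/purchase/2014/02")
    val all = sc.union(prefixes.map(p => sc.textFile(p + "/*")))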

Re: Using FunSuite to test Spark throws NullPointerException

2014-10-06 Thread Matei Zaharia
Weird, it seems like this is trying to use the SparkContext before it's initialized, or something like that. Have you tried unrolling this into a single method? I wonder if you just have multiple versions of these libraries on your classpath or something. Matei On Oct 4, 2014, at 1:40 PM, Mari

Re: Multiple spark shell sessions

2014-10-01 Thread Matei Zaharia
You need to set --total-executor-cores to limit how many total cores it grabs on the cluster. --executor-cores is just for each individual executor, but it will try to launch many of them. Matei On Oct 1, 2014, at 4:29 PM, Sanjay Subramanian wrote: > hey guys > > I am using spark 1.0.0+cdh

Re: Spark And Mapr

2014-10-01 Thread Matei Zaharia
It should just work in PySpark, the same way it does in Java / Scala apps. Matei On Oct 1, 2014, at 4:12 PM, Sungwook Yoon wrote: > > Yes.. you should use maprfs:// > > I personally haven't used pyspark, I just used scala shell or standalone with > MapR. > > I think you need to set classpat

Re: run scalding on spark

2014-10-01 Thread Matei Zaharia
Pretty cool, thanks for sharing this! I've added a link to it on the wiki: https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects. Matei On Oct 1, 2014, at 1:41 PM, Koert Kuipers wrote: > well, sort of! we make input/output formats (cascading taps, scalding > sources) a
