Re: Using Java in Spark shell

2016-05-25 Thread Keith
There is no Java shell in Spark. > On May 25, 2016, at 1:11 AM, Ashok Kumar wrote: > > Hello, > > A newbie question. > > Is it possible to use java code directly in spark shell without using maven > to build a jar file? > > How can I switch from scala to java in spark shell? > > Thanks >

Spark 1.4.0 SQL JDBC "partition stride"?

2015-06-21 Thread Keith Freeman
The spark docs section for "JDBC to Other Databases" (https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases) describes the partitioning as "... Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in tab
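
For context, a minimal sketch of how these options fit together (URL, table, and column names are assumed): lowerBound and upperBound only control how the partitionColumn range is split into numPartitions strides; rows outside the range are not filtered out, they just land in the first or last partition.

  // spark is an existing SparkSession; connection details are hypothetical.
  val df = spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://dbhost/mydb")
    .option("dbtable", "orders")
    .option("partitionColumn", "id")   // numeric column to stride over
    .option("lowerBound", "1")         // stride bounds, NOT row filters
    .option("upperBound", "1000000")
    .option("numPartitions", "8")
    .load()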

Spark SQL "partition stride"?

2016-01-11 Thread Keith Freeman
The spark docs section for "JDBC to Other Databases" (https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases) describes the partitioning as "... Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in tab

python rdd.partionBy(): any examples of a custom partitioner?

2015-12-07 Thread Keith Freeman
I'm not a python expert, so I'm wondering if anybody has a working example of a partitioner for the "partitionFunc" argument (default "portable_hash") to rdd.partitionBy()? - To unsubscribe, e-mail: user-unsubscr...@spark.apach
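
In PySpark, partitionFunc is simply any callable mapping a key to an int (the result is taken modulo the partition count). Since most code in these threads is Scala, here is a hedged Scala sketch of the same idea as a custom Partitioner; the class and key logic are illustrative only.

  import org.apache.spark.Partitioner

  // Route string keys to partitions by their first character.
  class FirstLetterPartitioner(override val numPartitions: Int) extends Partitioner {
    def getPartition(key: Any): Int = key match {
      case s: String if s.nonEmpty => math.abs(s.head.toInt) % numPartitions
      case _                       => 0
    }
  }

  // Usage on a pair RDD: pairRdd.partitionBy(new FirstLetterPartitioner(8))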

Re: Long-running job OOMs driver process

2016-11-18 Thread Keith Bourgoin
examples of this today, threaded and not. We were hoping that someone had seen this before and it rang a bell. Maybe there's a setting to clean up info from old jobs that we can adjust. Cheers, Keith. On Thu, Nov 17, 2016 at 9:50 PM Alexis Seigneurin wrote: > Hi Irina, > > I wou

Re: Long-running job OOMs driver process

2016-11-18 Thread Keith Bourgoin
-production data. Yong, that's a good point about the web content. I had forgotten to mention that when I first saw this a few months ago, on another project, I could sometimes trigger the OOM by trying to view the web ui for the job. That's another case I'll try to reproduce. Thank

Library dependencies in Spark

2017-01-10 Thread Keith Turner
I recently wrote a blog post[1] sharing my experiences with using Apache Spark to load data into Apache Fluo. One of the things I cover in this blog post is late binding of dependencies and exclusion of provided dependencies when building a shaded jar. When writing the post, I was unsure about dep

Re:

2017-01-20 Thread Keith Chapman
Hi Jacek, I've looked at SparkListener and tried it, I see it getting fired on the master but I don't see it getting fired on the workers in a cluster. Regards, Keith. http://keith-chapman.com On Fri, Jan 20, 2017 at 11:09 AM, Jacek Laskowski wrote: > Hi, > > (redirecting
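
For reference: SparkListener callbacks run in the driver JVM; executors report task events back to the driver, so the listener never fires "on the workers" themselves. A minimal hedged sketch:

  import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

  // Registered on the driver; receives task-end events from every executor.
  class TaskLogger extends SparkListener {
    override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
      println(s"Task ${taskEnd.taskInfo.taskId} finished on ${taskEnd.taskInfo.host}")
  }

  // sc is an existing SparkContext:
  // sc.addSparkListener(new TaskLogger())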

Having issues reading a csv file into a DataSet using Spark 2.1

2017-03-22 Thread Keith Chapman
args(1)).as[Foo] ds.show } } Compiling the above program gives the error below. I'd expect it to work as it's a simple case class; changing it to as[String] works, but I would like to get the case class to work. [error] /home/keith/dataset/DataSetTest.scala:13: Unable to find encoder for type stored in

Re: Having issues reading a csv file into a DataSet using Spark 2.1

2017-03-22 Thread Keith Chapman
ion} > > object DatasetTest{ > > val spark: SparkSession = SparkSession > .builder() .master("local[8]") > .appName("Spark basic example").getOrCreate() > > import spark.implicits._ > > def main(Args: Array[String]) { > > var x = spark.read.fo
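
The usual cause of this encoder error is the case class being defined inside the enclosing method or object; moving it to the top level and importing spark.implicits._ lets Spark derive the Encoder. A hedged reconstruction (Foo's fields and the csv options are assumed):

  import org.apache.spark.sql.SparkSession

  // The case class must be top-level for encoder derivation to work.
  case class Foo(name: String, count: Int)

  object DatasetTest {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .master("local[8]")
        .appName("Spark basic example")
        .getOrCreate()
      import spark.implicits._

      val ds = spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(args(0))
        .as[Foo]
      ds.show()
    }
  }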

Re: Alternatives for dataframe collectAsList()

2017-04-04 Thread Keith Chapman
As Paul said it really depends on what you want to do with your data, perhaps writing it to a file would be a better option, but again it depends on what you want to do with the data you collect. Regards, Keith. http://keith-chapman.com On Tue, Apr 4, 2017 at 7:38 AM, Eike von Seggern wrote
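
If the intent is to iterate on the driver without materializing the whole result at once, toLocalIterator pulls one partition at a time; writing the result out avoids the driver entirely. A hedged sketch (df is any existing DataFrame, path is illustrative):

  import scala.collection.JavaConverters._

  // One partition at a time on the driver, instead of everything at once:
  df.toLocalIterator().asScala.foreach(println)

  // Or skip the driver and persist the result:
  df.write.parquet("/tmp/result")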

Is there an api in Dataset/Dataframe that does repartitionAndSortWithinPartitions?

2017-06-23 Thread Keith Chapman
/Dataframe instead of RDDs, so my question is: Is there custom partitioning of Dataset/Dataframe implemented in Spark? Can I accomplish the partial sort using mapPartitions on the resulting partitioned Dataset/Dataframe? Any thoughts? Regards, Keith. http://keith-chapman.com

Re: Is there an api in Dataset/Dataframe that does repartitionAndSortWithinPartitions?

2017-06-24 Thread Keith Chapman
Thanks for the pointer Saliya, I'm looking for an equivalent API in dataset/dataframe for repartitionAndSortWithinPartitions; I've already converted most of the RDDs to Dataframes. Regards, Keith. http://keith-chapman.com On Sat, Jun 24, 2017 at 3:48 AM, Saliya Ekanayake wrot

Re: Is there an api in Dataset/Dataframe that does repartitionAndSortWithinPartitions?

2017-06-24 Thread Keith Chapman
Hi Nguyen, This looks promising and seems like I could achieve it using cluster by. Thanks for the pointer. Regards, Keith. http://keith-chapman.com On Sat, Jun 24, 2017 at 5:27 AM, nguyen duc Tuan wrote: > Hi Chapman, > You can use "cluster by" to do what you want. > h
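
For the archive, the Dataset-level equivalent discussed in this thread, as a hedged sketch (df and the "key" column are assumed): repartition by the key and then sort within each partition, which avoids a global sort; "CLUSTER BY key" is the SQL spelling of the same thing.

  import org.apache.spark.sql.functions.col

  val partitionedSorted = df
    .repartition(8, col("key"))          // hash-partition on "key"
    .sortWithinPartitions(col("key"))    // local sort per partition, no extra shuffle

  // SQL equivalent: SELECT * FROM t CLUSTER BY key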

Re: Get full RDD lineage for a spark job

2017-07-21 Thread Keith Chapman
Hi Ron, You can try using the toDebugString method on the RDD, this will print the RDD lineage. Regards, Keith. http://keith-chapman.com On Fri, Jul 21, 2017 at 11:24 AM, Ron Gonzalez wrote: > Hi, > Can someone point me to a test case or share sample code that is able to > extrac

Re: Get full RDD lineage for a spark job

2017-07-21 Thread Keith Chapman
You could also enable it with --conf spark.logLineage=true if you do not want to change any code. Regards, Keith. http://keith-chapman.com On Fri, Jul 21, 2017 at 7:57 PM, Keith Chapman wrote: > Hi Ron, > > You can try using the toDebugString method on the RDD, this will print
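
Both approaches from this thread, sketched (rdd is any existing RDD):

  // Programmatic: print the lineage of one RDD.
  println(rdd.toDebugString)

  // Global, no code changes: log the lineage of RDDs as jobs run.
  //   spark-submit --conf spark.logLineage=true ...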

Re: What are some disadvantages of issuing a raw sql query to spark?

2017-07-25 Thread Keith Chapman
, Keith. http://keith-chapman.com On Tue, Jul 25, 2017 at 12:50 AM, kant kodali wrote: > HI All, > > I just want to run some spark structured streaming Job similar to this > > DS.filter(col("name").equalTo("john")) > .groupBy(functions.window(df1.col(

Re: What are some disadvantages of issuing a raw sql query to spark?

2017-07-25 Thread Keith Chapman
Here is an example of a window lead function, select *, lead(someColumn1) over ( partition by someColumn2 order by someColumn3 asc nulls first) as someName from someTable Regards, Keith. http://keith-chapman.com On Tue, Jul 25, 2017 at 9:15 AM, kant kodali wrote: > How do I Spec
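
The DataFrame-API form of the same query, as a hedged sketch (someTable is an existing DataFrame and the column names follow the SQL above):

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.{col, lead}

  val w = Window
    .partitionBy(col("someColumn2"))
    .orderBy(col("someColumn3").asc_nulls_first)

  val result = someTable.withColumn("someName", lead(col("someColumn1"), 1).over(w))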

A bug in spark or hadoop RPC with kerberos authentication?

2017-08-22 Thread Sun, Keith
System.out.println(sc.toDebugString()); SparkSession sparkSession = SparkSession .builder() .master("yarn-client") //"yarn-client", "local" .config(sc) .appName(SparkEAZDebug.class.getName()) .enableHiveSupport() .getOrCreate(); Thanks very much. Keith

RE: A bug in spark or hadoop RPC with kerberos authentication?

2017-08-23 Thread Sun, Keith
Finally found the root cause and raised a bug in https://issues.apache.org/jira/browse/SPARK-21819 Thanks very much. Keith From: Sun, Keith Sent: August 22, 2017 8:48 To: user@spark.apache.org Subject: A bug in spark or hadoop RPC with kerberos authentication? Hello, I met this very weird

RE: A bug in spark or hadoop RPC with kerberos authentication?

2017-08-23 Thread Sun, Keith
.builder() .master("yarn-client") //"yarn-client", "local" .config(sc) .appName(SparkEAZDebug.class.getName()) .enableHiveSupport()

How to find the temporary views' DDL

2017-10-01 Thread Sun, Keith
columns but not the DDL. Thanks very much. Keith From: Anastasios Zouzias [mailto:zouz...@gmail.com] Sent: Sunday, October 1, 2017 3:05 PM To: Kanagha Kumar Cc: user @spark Subject: Re: Error - Spark reading from HDFS via dataframes - Java Hi, Set the inferschema option to true in spark-csv

Re: update LD_LIBRARY_PATH when running apache job in a YARN cluster

2018-01-17 Thread Keith Chapman
Hi Manuel, You could use the following to add a path to the library search path, --conf spark.driver.extraLibraryPath=PathToLibFolder --conf spark.executor.extraLibraryPath=PathToLibFolder Regards, Keith. http://keith-chapman.com On Wed, Jan 17, 2018 at 5:39 PM, Manuel Sopena

Spark not releasing shuffle files in time (with very large heap)

2018-02-22 Thread Keith Chapman
see GC kicking in more often and the size of /tmp stays under control. Is there any way I could configure spark to handle this issue? One option that I have is to have GC run more often by setting spark.cleaner.periodicGC.interval to a much lower value. Is there a cleaner solution? Regards, Keith.

Re: Spark not releasing shuffle files in time (with very large heap)

2018-02-22 Thread Keith Chapman
My issue is that there is not enough pressure on GC, hence GC is not kicking in fast enough to delete the shuffle files of previous iterations. Regards, Keith. http://keith-chapman.com On Thu, Feb 22, 2018 at 6:58 PM, naresh Goud wrote: > It would be very difficult to tell without know
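
The workaround mentioned above, sketched as configuration (the interval value is illustrative; the default is 30min):

  # Run the context cleaner's periodic GC every 5 minutes so shuffle files
  # whose RDDs are unreachable get cleaned up sooner.
  spark-submit --conf spark.cleaner.periodicGC.interval=5min ...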

Can I get my custom spark strategy to run last?

2018-03-01 Thread Keith Chapman
Hi, I'd like to write a custom Spark strategy that runs after all the existing Spark strategies are run. Looking through the Spark code it seems like the custom strategies are prepended to the list of strategies in Spark. Is there a way I could get it to run last? Regards, Keith. http://
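
For reference, custom strategies are registered through spark.experimental.extraStrategies, and as noted they are tried before the built-ins, with no public hook to append instead. A hedged sketch of the registration (the strategy body is hypothetical; returning Nil defers to the next strategy, so matching narrowly is the usual way to coexist with the built-ins):

  import org.apache.spark.sql.Strategy
  import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
  import org.apache.spark.sql.execution.SparkPlan

  object MyStrategy extends Strategy {
    // Nil means "no match here"; the planner falls through to later strategies.
    def apply(plan: LogicalPlan): Seq[SparkPlan] = Nil
  }

  // spark is an existing SparkSession:
  spark.experimental.extraStrategies = Seq(MyStrategy)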

Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-19 Thread Keith Chapman
Hi Michael, You could either set spark.local.dir through spark conf or java.io.tmpdir system property. Regards, Keith. http://keith-chapman.com On Mon, Mar 19, 2018 at 9:59 AM, Michael Shtelma wrote: > Hi everybody, > > I am running spark job on yarn, and my problem is that the

Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-19 Thread Keith Chapman
Can you try setting spark.executor.extraJavaOptions to have -Djava.io.tmpdir=someValue Regards, Keith. http://keith-chapman.com On Mon, Mar 19, 2018 at 10:29 AM, Michael Shtelma wrote: > Hi Keith, > > Thank you for your answer! > I have done this, and it is working for spark

Re: Running out of space on /tmp file system while running spark job on yarn because of size of blockmgr folder

2018-03-26 Thread Keith Chapman
Hi Michael, sorry for the late reply. I guess you may have to set it through the hdfs core-site.xml file. The property you need to set is "hadoop.tmp.dir" which defaults to "/tmp/hadoop-${user.name}" Regards, Keith. http://keith-chapman.com On Mon, Mar 19, 2018 at 1:05
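
The settings discussed across this thread, collected as one hedged sketch (paths are illustrative):

  # Spark scratch space (shuffle/blockmgr). Note: on YARN this is overridden
  # by yarn.nodemanager.local-dirs, which is why it may appear to be ignored.
  --conf spark.local.dir=/data/tmp

  # JVM temp dir on the executors:
  --conf spark.executor.extraJavaOptions=-Djava.io.tmpdir=/data/tmp

  # Hadoop side, in core-site.xml: hadoop.tmp.dir (default /tmp/hadoop-${user.name})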

Re: GC- Yarn vs Standalone K8

2018-06-11 Thread Keith Chapman
-XX:OnOutOfMemoryError='kill -9 %p' Regards, Keith. http://keith-chapman.com On Mon, Jun 11, 2018 at 8:22 PM, ankit jain wrote: > Hi, > Does anybody know if Yarn uses a different Garbage Collector from Spark > standalone? > > We migrated our application recently from EMR to K8(not using

Pyspark error when converting string to timestamp in map function

2018-08-17 Thread Keith Chapman
not accept object %r in type %s" % (dataType, obj, type(obj))) TypeError: TimestampType can not accept object '2018-03-21 08:06:17' in type Regards, Keith. http://keith-chapman.com

RE: how to use cluster sparkSession like localSession

2018-11-04 Thread Sun, Keith
Hello, I think you can try the below; the reason is that only yarn-client mode is supported for your scenario. master("yarn-client") Thanks very much. Keith From: 张万新 Sent: Thursday, November 1, 2018 11:36 PM To: 崔苗 (Data & AI Product Development Dept.) <0049003...@znv.com> Cc: user Subject: Re: ho

Re: [pyspark 2.3] count followed by write on dataframe

2019-05-20 Thread Keith Chapman
Yes that is correct, that would cause computation twice. If you want the computation to happen only once you can cache the dataframe and call count and write on the cached dataframe. Regards, Keith. http://keith-chapman.com On Mon, May 20, 2019 at 6:43 PM Rishi Shah wrote: > Hi All, >
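
A minimal sketch of the suggested pattern (df is any existing DataFrame, output path illustrative):

  val cached = df.cache()          // lineage is computed once
  val n = cached.count()           // materializes the cache
  cached.write.parquet("/tmp/out") // served from the cache, not recomputed
  cached.unpersist()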

Re: Override jars in spark submit

2019-06-19 Thread Keith Chapman
extraClassPath: the jar file needs to be present on all the executors. Regards, Keith. http://keith-chapman.com On Wed, Jun 19, 2019 at 8:57 PM naresh Goud wrote: > Hello All, > > How can we override jars in spark submit? > We have hive-exec-spark jar which is available as part of de

Re: Sorting tuples with byte key and byte value

2019-07-15 Thread Keith Chapman
execution and memory. I would rather use Dataframe sort operation if performance is key. Regards, Keith. http://keith-chapman.com On Mon, Jul 15, 2019 at 8:45 AM Supun Kamburugamuve < supun.kamburugam...@gmail.com> wrote: > Hi all, > > We are trying to measure the sorting performan

Re: Long-Running Spark application doesn't clean old shuffle data correctly

2019-07-21 Thread Keith Chapman
Hi Alex, Shuffle files in spark are deleted when the object holding a reference to the shuffle file on disk goes out of scope (is garbage collected by the JVM). Could it be the case that you are keeping these objects alive? Regards, Keith. http://keith-chapman.com On Sun, Jul 21, 2019 at 12

Re: Comparative study

2014-07-08 Thread Keith Simmons
boost (even without registering any custom serializers). Keith On Tue, Jul 8, 2014 at 2:58 PM, Robert James wrote: > As a new user, I can definitely say that my experience with Spark has > been rather raw. The appeal of interactive, batch, and in between all > using more or less straight
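
The serializer in question here is presumably Kryo; a hedged sketch of enabling it, with the optional class registration the message alludes to (MyRecord is hypothetical):

  import org.apache.spark.SparkConf

  case class MyRecord(id: Long, name: String)

  // Switching the serializer alone already helps; registering classes avoids
  // Kryo writing full class names with every object.
  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .registerKryoClasses(Array(classOf[MyRecord]))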

Re: Comparative study

2014-07-09 Thread Keith Simmons
Good point. Shows how personal use cases color how we interpret products. On Wed, Jul 9, 2014 at 1:08 AM, Sean Owen wrote: > On Wed, Jul 9, 2014 at 1:52 AM, Keith Simmons wrote: > >> Impala is *not* built on map/reduce, though it was built to replace >> Hive, which is map

Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Keith Simmons
s being mapped into the individual record types without a problem. The immediate cause seems to be a task trying to deserialize one or more SQL case classes before loading the spark uber jar, but I have no idea why this is happening, or why it only happens when I do a join. Ideas? Keith P.S.

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Keith Simmons
ing) as does: case class Record(value: String, key: Int) case class Record2(value: String, key: Int) Let me know if you need anymore details. On Tue, Jul 15, 2014 at 11:14 AM, Michael Armbrust wrote: > Are you registering multiple RDDs of case classes as tables concurrently? > You are

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Keith Simmons
k-core" % "1.0.1" % "provided" On Tue, Jul 15, 2014 at 12:21 PM, Zongheng Yang wrote: > FWIW, I am unable to reproduce this using the example program locally. > > On Tue, Jul 15, 2014 at 11:56 AM, Keith Simmons > wrote: > > Nope. All of them are reg

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Keith Simmons
d.java:701) On Tue, Jul 15, 2014 at 1:05 PM, Michael Armbrust wrote: > Can you print out the queryExecution? > > (i.e. println(sql().queryExecution)) > > > On Tue, Jul 15, 2014 at 12:44 PM, Keith Simmons > wrote: > >> To give a few more details of my environ

Re: Error while running Spark SQL join when using Spark 1.0.1

2014-07-15 Thread Keith Simmons
Cool. So Michael's hunch was correct, it is a thread issue. I'm currently using a tarball build, but I'll do a spark build with the patch as soon as I have a chance and test it out. Keith On Tue, Jul 15, 2014 at 4:14 PM, Zongheng Yang wrote: > Hi Keith & gorenuru,

Re: GraphX : AssertionError

2014-09-22 Thread Keith Massey
The triangle count also failed for me when I ran it on more than one node. There is this assertion in TriangleCount.scala that causes the failure: // double count should be even (divisible by two) assert((dblCount & 1) == 0) That did not hold true when I ran this on multiple nodes,

Hung spark executors don't count toward worker memory limit

2014-10-09 Thread Keith Simmons
executor has exited? Let me know if there's any additional information I can provide. Keith P.S. We're running spark 1.0.2

Re: Hung spark executors don't count toward worker memory limit

2014-10-09 Thread Keith Simmons
launching new job... 14/10/09 20:51:17 INFO Worker: Executor app-20141009204127-0029/1 finished with state KILLED As you can see, the first app didn't actually shut down until two minutes after the new job launched. During that time, I was at double the worker memory limit. Keith On Thu, Oct

Re: Hung spark executors don't count toward worker memory limit

2014-10-13 Thread Keith Simmons
Maybe I should put this another way. If spark has two jobs, A and B, both of which consume the entire allocated memory pool, is it expected that spark can launch B before the executor processes tied to A are completely terminated? On Thu, Oct 9, 2014 at 6:57 PM, Keith Simmons wrote: > Actua

Setting only master heap

2014-10-22 Thread Keith Simmons
We've been getting some OOMs from the spark master since upgrading to Spark 1.1.0. I've found SPARK_DAEMON_MEMORY, but that also seems to increase the worker heap, which as far as I know is fine. Is there any setting which *only* increases the master heap size? Keith

Re: Setting only master heap

2014-10-26 Thread Keith Simmons
use that much memory, and even if there > are many applications it will discard the old ones appropriately, so unless > you have a ton (like thousands) of concurrently running applications > connecting to it there's little likelihood for it to OOM. At least that's > my under
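
For the archive, the usual workaround, hedged because it depends on how the sbin launch scripts assemble JVM flags: SPARK_DAEMON_MEMORY sizes every daemon started on a host, while the per-daemon OPTS variables target one daemon. In spark-env.sh (values illustrative):

  # Applies to master AND workers launched from this host:
  SPARK_DAEMON_MEMORY=2g
  # Master-only JVM flags; an explicit -Xmx here is a common way to raise
  # just the master heap when master and worker share a machine.
  SPARK_MASTER_OPTS="-Xmx4g"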

Loading RDDs in a streaming fashion

2014-12-01 Thread Keith Simmons
(file, stream) => for every 10K records write records to stream and flush } Keith

Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Keith Simmons
should load each file into an rdd with context.textFile(), > flatmap that and union these rdds. > > also see > > http://stackoverflow.com/questions/23397907/spark-context-textfile-load-multiple-files > > > On 1 December 2014 at 16:50, Keith Simmons wrote: > >> Th

Re: Loading RDDs in a streaming fashion

2014-12-01 Thread Keith Simmons
Yep, that's definitely possible. It's one of the workarounds I was considering. I was just curious if there was a simpler (and perhaps more efficient) approach. Keith On Mon, Dec 1, 2014 at 6:28 PM, Andy Twigg wrote: > Could you modify your function so that it streams thro
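
The workaround settled on in this thread, sketched (sc is an existing SparkContext; the parser and paths are hypothetical):

  // One RDD per file, flatMapped to records and unioned. Each file streams
  // through textFile's line iterator rather than being held in memory whole.
  def parseRecords(line: String): Seq[String] = Seq(line)  // placeholder parser

  val files = Seq("/data/a.txt", "/data/b.txt")
  val all = sc.union(files.map(path => sc.textFile(path).flatMap(parseRecords)))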

Re: TriangleCount & Shortest Path under Spark

2014-03-13 Thread Keith Massey
results, but they looked generally right. Not sure if this is the failure you are talking about or not. As far as shortest path, the programming guide had an example that worked well for me under https://spark.incubator.apache.org/docs/latest/graphx-programming-guide.html#pregel-api . Keith On Su

Spark Memory Bounds

2014-05-27 Thread Keith Simmons
decrease, and since each task is processing a single partition and there are a bounded number of tasks in flight, my memory use has a rough upper limit. Keith

Re: Spark Memory Bounds

2014-05-27 Thread Keith Simmons
tasks. Is my understanding correct? Specifically, once a key/value pair is serialized in the shuffle stage of a task, are the references to the raw Java objects released before the next task is started? On Tue, May 27, 2014 at 6:21 PM, Christopher Nguyen wrote: > Keith, do you mean &quo

Re: Spark Memory Bounds

2014-05-28 Thread Keith Simmons
pretty good handle on the overall RDD contribution. Thanks for all the help. Keith On Wed, May 28, 2014 at 6:43 AM, Christopher Nguyen wrote: > Keith, please see inline. > > -- > Christopher T. Nguyen > Co-founder & CEO, Adatao <http://adatao.com> > linkedin.com/in/ctn