There is no Java shell in Spark.
> On May 25, 2016, at 1:11 AM, Ashok Kumar wrote:
>
> Hello,
>
> A newbie question.
>
> Is it possible to use Java code directly in the Spark shell without using
> Maven to build a jar file?
>
> How can I switch from Scala to Java in the Spark shell?
>
> Thanks
>
The spark docs section for "JDBC to Other Databases"
(https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases)
describes the partitioning as "... Notice that lowerBound and upperBound
are just used to decide the partition stride, not for filtering the rows
in table."
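A minimal sketch of such a partitioned JDBC read (the URL, table, column name,
and bounds below are placeholders, not values from the original question):

// assuming an existing SparkSession `spark`
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("dbtable", "events")
  .option("partitionColumn", "id")
  .option("lowerBound", "1")       // with upperBound, only sets the stride across
  .option("upperBound", "1000000") // numPartitions; rows outside the range are not filtered out
  .option("numPartitions", "8")
  .load()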
I'm not a python expert, so I'm wondering if anybody has a working
example of a partitioner for the "partitionFunc" argument (default
"portable_hash") to rdd.partitionBy()?
examples of this today, threaded and not. We were
hoping that someone had seen this before and it rang a bell. Maybe there's
a setting to clean up info from old jobs that we can adjust.
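For example, the UI retention limits look like candidates; a sketch of what that
would mean (the values below are illustrative, not recommendations):

import org.apache.spark.SparkConf

// Bound how much per-job/per-stage state is kept around for finished work.
// The first two settings default to 1000.
val conf = new SparkConf()
  .set("spark.ui.retainedJobs", "200")
  .set("spark.ui.retainedStages", "200")
  .set("spark.worker.ui.retainedExecutors", "100")
  .set("spark.worker.ui.retainedDrivers", "100")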
Cheers,
Keith.
On Thu, Nov 17, 2016 at 9:50 PM Alexis Seigneurin
wrote:
> Hi Irina,
>
> I wou
-production data.
Yong, that's a good point about the web content. I had forgotten to mention
that when I first saw this a few months ago, on another project, I could
sometimes trigger the OOM by trying to view the web ui for the job. That's
another case I'll try to reproduce.
Thank
I recently wrote a blog post[1] sharing my experiences with using
Apache Spark to load data into Apache Fluo. One of the things I cover
in this blog post is late binding of dependencies and exclusion of
provided dependencies when building a shaded jar. When writing the
post, I was unsure about dep
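For readers using sbt, the "provided" exclusion boils down to one line in the
build definition; an illustrative sketch (the version is arbitrary, and the blog
post's own build setup may differ):

// build.sbt fragment (assumes the sbt-assembly plugin builds the shaded jar):
// "provided" keeps spark-core on the compile classpath but out of the assembly,
// since the cluster supplies Spark at runtime.
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.0" % "provided"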
Hi Jacek,
I've looked at SparkListener and tried it; I see it getting fired on the
master, but I don't see it getting fired on the workers in a cluster.
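For reference, a simplified sketch of the kind of listener I'm registering (the
class name and callback body are placeholders; `sc` is an existing SparkContext):

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Minimal listener that logs task completions; registered on the driver.
class TaskEndLogger extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    println(s"task ${taskEnd.taskInfo.taskId} ended in stage ${taskEnd.stageId}")
  }
}

sc.addSparkListener(new TaskEndLogger())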
Regards,
Keith.
http://keith-chapman.com
On Fri, Jan 20, 2017 at 11:09 AM, Jacek Laskowski wrote:
> Hi,
>
> (redirecting
args(1)).as[Foo]
ds.show
}
}
Compiling the above program gives the error below. I'd expect it to work since
it's a simple case class; changing it to as[String] works, but I would like to
get the case class to work.
[error] /home/keith/dataset/DataSetTest.scala:13: Unable to find encoder
for type stored in
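For reference, a minimal self-contained sketch of the pattern I'm trying to get
working (Foo's fields, the input format, and the path are placeholders; my real
code is elided above):

import org.apache.spark.sql.SparkSession

// The case class is declared at the top level so spark.implicits._ can derive
// an encoder for as[Foo].
case class Foo(name: String, value: String)

object DatasetTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[8]")
      .appName("Spark basic example")
      .getOrCreate()
    import spark.implicits._

    val ds = spark.read.option("header", "true").csv(args(1)).as[Foo]
    ds.show()
    spark.stop()
  }
}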
ion}
>
> object DatasetTest{
>
> val spark: SparkSession = SparkSession
> .builder() .master("local[8]")
> .appName("Spark basic example").getOrCreate()
>
> import spark.implicits._
>
> def main(Args: Array[String]) {
>
> var x = spark.read.fo
As Paul said it really depends on what you want to do with your data,
perhaps writing it to a file would be a better option, but again it depends
on what you want to do with the data you collect.
Regards,
Keith.
http://keith-chapman.com
On Tue, Apr 4, 2017 at 7:38 AM, Eike von Seggern
wrote
/Dataframe instead of RDDs, so my question is:
Is there custom partitioning of Dataset/Dataframe implemented in Spark?
Can I accomplish the partial sort using mapPartitions on the resulting
partitioned Dataset/Dataframe?
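Roughly, something along the lines of the following sketch is what I'm after
(`df` and the column names are placeholders):

import org.apache.spark.sql.functions.col

// Hash-partition on the key, then sort within each partition only (no global
// sort); per-partition work could then run via mapPartitions on the result.
val partitioned = df.repartition(col("key")).sortWithinPartitions(col("key"), col("value"))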
Any thoughts?
Regards,
Keith.
http://keith-chapman.com
Thanks for the pointer Saliya. I'm looking for an equivalent API in
Dataset/DataFrame for repartitionAndSortWithinPartitions; I've already
converted most of the RDDs to DataFrames.
Regards,
Keith.
http://keith-chapman.com
On Sat, Jun 24, 2017 at 3:48 AM, Saliya Ekanayake wrot
Hi Nguyen,
This looks promising and seems like I could achieve it using cluster by.
Thanks for the pointer.
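For the archives, a minimal sketch of the suggestion (`df`, the view name, and
the column are placeholders):

// CLUSTER BY is DISTRIBUTE BY + SORT BY: repartition on the key and sort
// within each partition, without a global sort.
df.createOrReplaceTempView("events")
val clustered = spark.sql("SELECT * FROM events CLUSTER BY key")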
Regards,
Keith.
http://keith-chapman.com
On Sat, Jun 24, 2017 at 5:27 AM, nguyen duc Tuan
wrote:
> Hi Chapman,
> You can use "cluster by" to do what you want.
> h
Hi Ron,
You can try using the toDebugString method on the RDD; this will print the
RDD lineage.
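For example (a sketch; the input path and pipeline are placeholders, and `sc` is
an existing SparkContext):

// toDebugString returns the RDD's lineage: the chain of parent RDDs.
val rdd = sc.textFile("hdfs:///some/input").map(_.length)
println(rdd.toDebugString)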
Regards,
Keith.
http://keith-chapman.com
On Fri, Jul 21, 2017 at 11:24 AM, Ron Gonzalez wrote:
> Hi,
> Can someone point me to a test case or share sample code that is able to
> extrac
You could also enable it with --conf spark.logLineage=true if you do not
want to change any code.
Regards,
Keith.
http://keith-chapman.com
On Fri, Jul 21, 2017 at 7:57 PM, Keith Chapman
wrote:
> Hi Ron,
>
> You can try using the toDebugString method on the RDD, this will print
,
Keith.
http://keith-chapman.com
On Tue, Jul 25, 2017 at 12:50 AM, kant kodali wrote:
> HI All,
>
> I just want to run some spark structured streaming Job similar to this
>
> DS.filter(col("name").equalTo("john"))
> .groupBy(functions.window(df1.col(
Here is an example of a window lead function,
select *, lead(someColumn1) over (partition by someColumn2 order by
someColumn3 asc nulls first) as someName from someTable
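The equivalent with the DataFrame API would be roughly (a sketch; `df` and the
column names are the same placeholders as above):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lead}

// Same lead() window: partition by someColumn2, order ascending with nulls first.
val w = Window.partitionBy("someColumn2").orderBy(col("someColumn3").asc_nulls_first)
val withLead = df.withColumn("someName", lead(col("someColumn1"), 1).over(w))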
Regards,
Keith.
http://keith-chapman.com
On Tue, Jul 25, 2017 at 9:15 AM, kant kodali wrote:
> How do I Spec
System.out.println(sc.toDebugString());
SparkSession sparkSessesion= SparkSession
.builder()
.master("yarn-client") //"yarn-client", "local"
.config(sc)
.appName(SparkEAZDebug.class.getName())
.enableHiveSupport()
.getOrCreate();
Thanks very much.
Keith
Finally find the root cause and raise a bug issue in
https://issues.apache.org/jira/browse/SPARK-21819
Thanks very much.
Keith
From: Sun, Keith
Sent: August 22, 2017 8:48
To: user@spark.apache.org
Subject: A bug in spark or hadoop RPC with kerberos authentication?
Hello ,
I met this very weird
.builder()
.master("yarn-client")
//"yarn-client", "local"
.config(sc)
.appName(SparkEAZDebug.class.getName())
.enableHiveSupport()
columns while not the ddl.
Thanks very much.
Keith
From: Anastasios Zouzias [mailto:zouz...@gmail.com]
Sent: Sunday, October 1, 2017 3:05 PM
To: Kanagha Kumar
Cc: user @spark
Subject: Re: Error - Spark reading from HDFS via dataframes - Java
Hi,
Set the inferSchema option to true in spark-csv
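An illustrative sketch with the Spark 2.x built-in CSV reader (the older
spark-csv package takes the same inferSchema option; the path and header
setting are placeholders):

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true") // infer numeric/timestamp types instead of all strings
  .csv("hdfs:///path/to/data.csv")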
Hi Manuel,
You could use the following to add a path to the library search path,
--conf spark.driver.extraLibraryPath=PathToLibFolder
--conf spark.executor.extraLibraryPath=PathToLibFolder
Regards,
Keith.
http://keith-chapman.com
On Wed, Jan 17, 2018 at 5:39 PM, Manuel Sopena
ee GC
kicking in more often and the size of /tmp stays under control. Is there
any way I could configure spark to handle this issue?
One option that I have is to have GC run more often by
setting spark.cleaner.periodicGC.interval to a much lower value. Is there a
cleaner solution?
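A sketch of that workaround, for concreteness (the interval value is
illustrative):

import org.apache.spark.sql.SparkSession

// Lower the ContextCleaner's periodic GC interval (default 30min) so
// unreferenced shuffle files get cleaned up sooner.
val spark = SparkSession.builder()
  .appName("shuffle-cleanup")
  .config("spark.cleaner.periodicGC.interval", "5min")
  .getOrCreate()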
Regards,
Keith.
My issue is that there is not enough pressure on GC, hence GC is not
kicking in fast enough to delete the shuffle files of previous iterations.
Regards,
Keith.
http://keith-chapman.com
On Thu, Feb 22, 2018 at 6:58 PM, naresh Goud
wrote:
> It would be very difficult to tell without know
Hi,
I'd like to write a custom Spark strategy that runs after all the existing
Spark strategies are run. Looking through the Spark code it seems like the
custom strategies are prepended to the list of strategies in Spark. Is
there a way I could get it to run last?
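For context, a sketch of how I'm registering the strategy today (the strategy
itself is a placeholder no-op):

import org.apache.spark.sql.{SparkSession, Strategy}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// Returns Nil so the planner falls through to the built-in strategies.
object NoopStrategy extends Strategy {
  override def apply(plan: LogicalPlan): Seq[SparkPlan] = Nil
}

object RegisterStrategy {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("strategy-example").getOrCreate()
    // extraStrategies are prepended to the planner's strategy list.
    spark.experimental.extraStrategies = Seq(NoopStrategy)
    spark.range(10).count() // planning here consults NoopStrategy first
    spark.stop()
  }
}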
Regards,
Keith.
http://
Hi Michael,
You could either set spark.local.dir through the Spark conf or the
java.io.tmpdir system property.
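For instance (a sketch; the path is a placeholder):

import org.apache.spark.sql.SparkSession

// Point Spark's scratch/spill directory away from /tmp.
val spark = SparkSession.builder()
  .appName("local-dir-example")
  .config("spark.local.dir", "/data/spark-scratch")
  .getOrCreate()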
Regards,
Keith.
http://keith-chapman.com
On Mon, Mar 19, 2018 at 9:59 AM, Michael Shtelma wrote:
> Hi everybody,
>
> I am running spark job on yarn, and my problem is that the
Can you try setting spark.executor.extraJavaOptions to have
-Djava.io.tmpdir=someValue
Regards,
Keith.
http://keith-chapman.com
On Mon, Mar 19, 2018 at 10:29 AM, Michael Shtelma
wrote:
> Hi Keith,
>
> Thank you for your answer!
> I have done this, and it is working for spark
Hi Michael,
sorry for the late reply. I guess you may have to set it through the hdfs
core-site.xml file. The property you need to set is "hadoop.tmp.dir" which
defaults to "/tmp/hadoop-${user.name}"
Regards,
Keith.
http://keith-chapman.com
On Mon, Mar 19, 2018 at 1:05
-XX:OnOutOfMemoryError='kill -9 %p'
Regards,
Keith.
http://keith-chapman.com
On Mon, Jun 11, 2018 at 8:22 PM, ankit jain wrote:
> Hi,
> Does anybody know if Yarn uses a different Garbage Collector from Spark
> standalone?
>
> We migrated our application recently from EMR to K8(not using
ot accept object %r in type %s" % (dataType, obj,
type(obj)))
TypeError: TimestampType can not accept object '2018-03-21 08:06:17' in
type
Regards,
Keith.
http://keith-chapman.com
Hello,
I think you can try the below; the reason is that only yarn-client mode is
supported for your scenario.
master("yarn-client")
Thanks very much.
Keith
From: 张万新
Sent: Thursday, November 1, 2018 11:36 PM
To: 崔苗 (Data & AI Product Development Department) <0049003...@znv.com>
Cc: user
Subject: Re: ho
Yes, that is correct; that would cause the computation to happen twice. If you
want the computation to happen only once, you can cache the dataframe and call
count and write on the cached dataframe.
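A sketch of that (the output path is a placeholder):

// Cache once so count() materializes the data and write() reuses it instead of
// recomputing the whole lineage.
val cached = df.cache()
val rowCount = cached.count()
cached.write.mode("overwrite").parquet("hdfs:///tmp/out")
cached.unpersist()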
Regards,
Keith.
http://keith-chapman.com
On Mon, May 20, 2019 at 6:43 PM Rishi Shah wrote:
> Hi All,
>
traclasspath the jar file needs to be present on all the
executors.
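i.e., something along these lines (a sketch; the jar path is a placeholder):

import org.apache.spark.SparkConf

// Entries here are prepended to the JVM classpath, but the file must already
// exist at that path on every node.
val conf = new SparkConf()
  .set("spark.driver.extraClassPath", "/opt/jars/override.jar")
  .set("spark.executor.extraClassPath", "/opt/jars/override.jar")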
Regards,
Keith.
http://keith-chapman.com
On Wed, Jun 19, 2019 at 8:57 PM naresh Goud
wrote:
> Hello All,
>
> How can we override jars in spark submit?
> We have hive-exec-spark jar which is available as part of de
execution and memory. I would
rather use Dataframe sort operation if performance is key.
Regards,
Keith.
http://keith-chapman.com
On Mon, Jul 15, 2019 at 8:45 AM Supun Kamburugamuve <
supun.kamburugam...@gmail.com> wrote:
> Hi all,
>
> We are trying to measure the sorting performan
Hi Alex,
Shuffle files in spark are deleted when the object holding a reference to
the shuffle file on disk goes out of scope (is garbage collected by the
JVM). Could it be the case that you are keeping these objects alive?
Regards,
Keith.
http://keith-chapman.com
On Sun, Jul 21, 2019 at 12
boost (even without registering any custom
serializers).
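For reference, a sketch of turning Kryo on (registering classes via
spark.kryo.registrator is optional):

import org.apache.spark.{SparkConf, SparkContext}

// Switch the serializer used for shuffles and cached data to Kryo.
val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)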
Keith
On Tue, Jul 8, 2014 at 2:58 PM, Robert James wrote:
> As a new user, I can definitely say that my experience with Spark has
> been rather raw. The appeal of interactive, batch, and in between all
> using more or less straight
Good point. Shows how personal use cases color how we interpret products.
On Wed, Jul 9, 2014 at 1:08 AM, Sean Owen wrote:
> On Wed, Jul 9, 2014 at 1:52 AM, Keith Simmons wrote:
>
>> Impala is *not* built on map/reduce, though it was built to replace
>> Hive, which is map
s being mapped into the individual record types without a problem.
The immediate cause seems to be a task trying to deserialize one or more
SQL case classes before loading the spark uber jar, but I have no idea why
this is happening, or why it only happens when I do a join. Ideas?
Keith
P.S.
ing)
as does:
case class Record(value: String, key: Int)
case class Record2(value: String, key: Int)
Let me know if you need anymore details.
On Tue, Jul 15, 2014 at 11:14 AM, Michael Armbrust
wrote:
> Are you registering multiple RDDs of case classes as tables concurrently?
> You are
k-core" % "1.0.1" %
"provided"
On Tue, Jul 15, 2014 at 12:21 PM, Zongheng Yang
wrote:
> FWIW, I am unable to reproduce this using the example program locally.
>
> On Tue, Jul 15, 2014 at 11:56 AM, Keith Simmons
> wrote:
> > Nope. All of them are reg
d.java:701)
On Tue, Jul 15, 2014 at 1:05 PM, Michael Armbrust
wrote:
> Can you print out the queryExecution?
>
> (i.e. println(sql().queryExecution))
>
>
> On Tue, Jul 15, 2014 at 12:44 PM, Keith Simmons
> wrote:
>
>> To give a few more details of my environ
Cool. So Michael's hunch was correct; it is a thread issue. I'm currently
using a tarball build, but I'll do a Spark build with the patch as soon as
I have a chance and test it out.
Keith
On Tue, Jul 15, 2014 at 4:14 PM, Zongheng Yang wrote:
> Hi Keith & gorenuru,
The triangle count also failed for me when I ran it on more than one node.
There is this assertion in TriangleCount.scala that causes the failure:
// double count should be even (divisible by two)
assert((dblCount & 1) == 0)
That did not hold true when I ran this on multiple nodes,
executor
has exited? Let me know if there's any additional information I can
provide.
Keith
P.S. We're running spark 1.0.2
nching new job...
14/10/09 20:51:17 INFO Worker: Executor app-20141009204127-0029/1 finished
with state KILLED
As you can see, the first app didn't actually shut down until two minutes
after the new job launched. During that time, I was at double the worker
memory limit.
Keith
On Thu, Oct
Maybe I should put this another way. If spark has two jobs, A and B, both
of which consume the entire allocated memory pool, is it expected that
spark can launch B before the executor processes tied to A are completely
terminated?
On Thu, Oct 9, 2014 at 6:57 PM, Keith Simmons wrote:
> Actua
We've been getting some OOMs from the spark master since upgrading to Spark
1.1.0. I've found SPARK_DAEMON_MEMORY, but that also seems to increase the
worker heap, which as far as I know is fine. Is there any setting which
*only* increases the master heap size?
Keith
use that much memory, and even if there
> are many applications it will discard the old ones appropriately, so unless
> you have a ton (like thousands) of concurrently running applications
> connecting to it there's little likelihood for it to OOM. At least that's
> my under
(file, stream) =>
for every 10K records write records to stream and flush
}
Keith
should load each file into an rdd with context.textFile(),
> flatmap that and union these rdds.
>
> also see
>
> http://stackoverflow.com/questions/23397907/spark-context-textfile-load-multiple-files
>
>
> On 1 December 2014 at 16:50, Keith Simmons wrote:
>
>> Th
Yep, that's definitely possible. It's one of the workarounds I was
considering. I was just curious if there was a simpler (and perhaps more
efficient) approach.
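i.e., something along the lines of this sketch (the file paths are placeholders
and `sc` is an existing SparkContext):

// One RDD per file, flatMap each, then union them into a single RDD.
val files = Seq("hdfs:///data/a.txt", "hdfs:///data/b.txt")
val rdds = files.map(path => sc.textFile(path).flatMap(_.split("\\s+")))
val combined = sc.union(rdds)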
Keith
On Mon, Dec 1, 2014 at 6:28 PM, Andy Twigg wrote:
> Could you modify your function so that it streams thro
ults, but they looked generally right. Not sure if this is the
failure you are talking about or not.
As far as shortest path, the programming guide had an example that worked
well for me under
https://spark.incubator.apache.org/docs/latest/graphx-programming-guide.html#pregel-api
.
Keith
On Su
decrease, and since each task
is processing a single partition and there are a bounded number of tasks in
flight, my memory use has a rough upper limit.
Keith
tasks. Is my
understanding correct? Specifically, once a key/value pair is serialized
in the shuffle stage of a task, are the references to the raw Java objects
released before the next task is started?
On Tue, May 27, 2014 at 6:21 PM, Christopher Nguyen wrote:
> Keith, do you mean &quo
pretty good handle on the overall RDD contribution.
Thanks for all the help.
Keith
On Wed, May 28, 2014 at 6:43 AM, Christopher Nguyen wrote:
> Keith, please see inline.
>
> --
> Christopher T. Nguyen
> Co-founder & CEO, Adatao <http://adatao.com>
> linkedin.com/in/ctn