Hello there
I have the same requirement.
I submit a streaming job in yarn-cluster mode.
If I want to shut down this endless YARN application, I have to find the
application id myself and use "yarn application -kill " to kill
the application.
Therefore, if I can get returned application
I am trying to use updateStateByKey but am receiving the following error
(Spark version 1.4.0).
Can someone please point out the possible reason for this error?
*The method
updateStateByKey(Function2<List<V>, Optional<S>, Optional<S>>)
in the type JavaPairDStream is not applicable
for the arguments *
I have never tried this, but there are YARN client APIs that you can use in
your Spark program to get the application id.
Here is the link to the yarn client java doc:
http://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/yarn/client/api/YarnClient.html
getApplications() is the method for your
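A rough sketch of how that could be wired up (assuming the hadoop-yarn-client jar is on the classpath; the application name filter is just an example):

import org.apache.hadoop.yarn.client.api.YarnClient
import org.apache.hadoop.yarn.conf.YarnConfiguration
import scala.collection.JavaConverters._

// Create and start a client against the cluster's ResourceManager.
val yarnClient = YarnClient.createYarnClient()
yarnClient.init(new YarnConfiguration())
yarnClient.start()

// List applications and pick out the one you care about by name.
val apps = yarnClient.getApplications().asScala
apps.filter(_.getName == "my-streaming-app")   // hypothetical application name
  .foreach(report => println(report.getApplicationId))

yarnClient.stop()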
You will have to order the columns properly before writing, or you can
change the column order in the actual table to match your job.
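For example, a Scala sketch (the column order, JDBC URL and credentials below are made up):

import java.util.Properties

val props = new Properties()
props.setProperty("user", "dbuser")        // placeholder credentials
props.setProperty("password", "secret")

// Select the columns in the exact order the target JDBC table expects, then write.
df.select("id", "name", "created_at")      // hypothetical column order
  .write
  .mode("append")
  .jdbc("jdbc:postgresql://dbhost:5432/mydb", "my_table", props)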
Thanks
Best Regards
On Tue, Dec 15, 2015 at 1:47 AM, Bob Corsaro wrote:
> Is there anyway to map pyspark.sql.Row columns to JDBC table columns, or
> do I have
Send the mail to user-unsubscr...@spark.apache.org; read more here:
http://spark.apache.org/community.html
Thanks
Best Regards
On Tue, Dec 15, 2015 at 3:39 AM, Mithila Joshi
wrote:
> unsubscribe
>
> On Mon, Dec 14, 2015 at 4:49 PM, Tim Barthram
> wrote:
>
>> UNSUBSCRIBE Thanks
Which version of Spark are you using? You can test this by opening up a
spark-shell, firing a simple job (sc.parallelize(1 to 100).collect()) and
then accessing
http://sigmoid-driver:4040/api/v1/applications/Spark%20shell/jobs
[image: Inline image 1]
Thanks
Best Regards
On Tue, Dec 15, 2015
Awesome, thanks for the PR Koert!
/Anders
On Thu, Dec 17, 2015 at 10:22 PM Prasad Ravilla wrote:
> Thanks, Koert.
>
> Regards,
> Prasad.
>
> From: Koert Kuipers
> Date: Thursday, December 17, 2015 at 1:06 PM
> To: Prasad Ravilla
> Cc: Anders Arpteg, user
>
> Subject: Re: Large number of conf br
*First you create the HBase configuration:*
val hbaseTableName = "paid_daylevel"
val hbaseColumnName = "paid_impression"
val hconf = HBaseConfiguration.create()
hconf.set("hbase.zookeeper.quorum", "sigmoid-dev-master")
hconf.set("hbase.zookeeper.property.clientPort",
Hi,
I think that people have reported the same issue elsewhere, and this should
be registered as a bug in SPARK
https://forums.databricks.com/questions/2142/self-join-in-spark-sql.html
Regards,
Gourav
On Thu, Dec 17, 2015 at 10:52 AM, Gourav Sengupta wrote:
> Hi Ted,
>
> The self join works
You can broadcast your json data and then do a map side join. This article
is a good start http://dmtolpeko.com/2015/02/20/map-side-join-in-spark/
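A minimal Scala sketch of the idea (assuming the small JSON side is already an RDD of key/value pairs called smallJsonPairs and the big side is bigRdd):

// Collect the small side on the driver and broadcast it as a lookup map.
val smallLookup = sc.broadcast(smallJsonPairs.collectAsMap())

// Map-side join: each task probes the broadcast map locally, so the big RDD is never shuffled.
val joined = bigRdd.flatMap { case (key, value) =>
  smallLookup.value.get(key).map(other => (key, (value, other)))
}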
Thanks
Best Regards
On Wed, Dec 16, 2015 at 2:51 AM, Alexander Pivovarov
wrote:
> I have big folder having ORC files. Files have duration field (e.g
If port 7077 is open to the public on your cluster, that's all it takes to
take over the cluster. You can read a bit about it here
https://www.sigmoid.com/securing-apache-spark-cluster/
You can also look at this small exploit I wrote
https://www.exploit-db.com/exploits/36562/
Thanks
Best Regards
Something like this? This one uses ZLIB compression; you can replace the
decompression logic with a GZIP one in your case.
compressedStream.map(x => {
  val inflater = new Inflater()
  inflater.setInput(x.getPayload)
  val decompressedData = new Array[Byte](x.getPayload.size * 2)
  val count = inflater.inflate(decompressedData)
  inflater.end()
  decompressedData.take(count)
})
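For GZIP-compressed payloads, a sketch along the same lines (still assuming each record exposes getPayload as a byte array) could be:

import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.util.zip.GZIPInputStream

compressedStream.map { x =>
  // Wrap the gzipped payload in a stream and copy the inflated bytes into a buffer.
  val in = new GZIPInputStream(new ByteArrayInputStream(x.getPayload))
  val out = new ByteArrayOutputStream()
  val buffer = new Array[Byte](4096)
  var read = in.read(buffer)
  while (read != -1) {
    out.write(buffer, 0, read)
    read = in.read(buffer)
  }
  in.close()
  out.toByteArray
}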
Did you happen to have a look at this:
https://issues.apache.org/jira/browse/SPARK-9629
Thanks
Best Regards
On Thu, Dec 17, 2015 at 12:02 PM, yaoxiaohua wrote:
> Hi guys,
>
> I have two nodes used as spark master, spark1,spark2
>
> Spark1.4.0
>
> Jdk 1.7 sunjdk
>
>
>
> Now thes
Is there any difference between the following snippets:
val df = hiveContext.createDataFrame(rows, schema)
df.registerTempTable("myTable")
df.cache()
and
val df = hiveContext.createDataFrame(rows, schema)
df.registerTempTable("myTable")
hiveContext.cacheTable("myTable")
-Sahil
---
Hi,
I have a table which is read directly from an S3 location, and even a self join on
that cached table causes the data to be read from S3 again.
The query plan is mentioned below:
== Parsed Logical Plan ==
Aggregate [count(1) AS count#1804L]
Project [user#0,programme_key#515]
Join Inner, Some((p
Thank you, Luciano, Shixiong.
I thought the "_2.11" part referred to the Kafka version - an
unfortunate coincidence.
Indeed
spark-submit --jars spark-streaming-kafka-assembly_2.10-1.5.2.jar
my_kafka_streaming_wordcount.py
OR
spark-submit --packages
org.apache.spark:spark-stream
Hi All,
Imagine I have production Spark Streaming Kafka (direct connection)
subscriber and publisher jobs running which publish and subscribe to (receive)
data from a Kafka topic, and I save one day's worth of data using
dstream.slice to a Cassandra daily table (so I create the daily table before
runnin
Spark 1.5.2
dfOld.registerTempTable("oldTableName")
sqlContext.cacheTable("oldTableName")
//
// do something
//
dfNew.registerTempTable("oldTableName")
sqlContext.cacheTable("oldTableName")
Now when I use the "oldTableName" table I do get the latest contents
from dfNew but do the conte
CacheManager#cacheQuery() is called where:
* Caches the data produced by the logical representation of the given
[[Queryable]].
...
val planToCache = query.queryExecution.analyzed
if (lookupCachedData(planToCache).nonEmpty) {
Is the schema for dfNew different from that of dfOld ?
Cheers
Hi,
the attached DAG shows that for the same table (self join) SPARK is
unnecessarily getting data from S3 for one side of the join, whereas it's
able to use the cache for the other side.
Regards,
Gourav
On Fri, Dec 18, 2015 at 10:29 AM, Gourav Sengupta wrote:
> Hi,
>
> I have a table which is dir
The picture is a bit hard to read.
I did a brief search but haven't found a JIRA for this issue.
Consider logging a SPARK JIRA.
Cheers
On Fri, Dec 18, 2015 at 4:37 AM, Gourav Sengupta
wrote:
> Hi,
>
> the attached DAG shows that for the same table (self join) SPARK is
> unnecessarily getting da
Hello everyone,
I am testing parallel program submission to a standalone cluster.
Everything works all right; the problem is that, for some reason, I can't submit more
than 3 programs to the cluster.
The fourth one, whether submitted via the legacy or the REST endpoint, simply hangs
until one of the first three completes.
I a
If you're really doing a daily batch job, have you considered just using
KafkaUtils.createRDD rather than a streaming job?
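For reference, a batch read with explicit offsets looks roughly like this against the Spark 1.x spark-streaming-kafka API (topic, broker, offsets and output path are placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")   // placeholder broker

// Read a fixed slice of partition 0: offsets [fromOffset, untilOffset).
// You have to persist and advance these offsets yourself between runs.
val fromOffset = 0L
val untilOffset = 100000L
val offsetRanges = Array(OffsetRange("my_topic", 0, fromOffset, untilOffset))

val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, offsetRanges)

rdd.map(_._2).saveAsTextFile("hdfs:///tmp/kafka_batch")   // values only; sink is up to you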
On Fri, Dec 18, 2015 at 5:04 AM, kali.tumm...@gmail.com <
kali.tumm...@gmail.com> wrote:
> Hi All,
>
> Imagine I have a Production spark streaming kafka (direct connection)
>
Hi all,
I am new to Spark and am trying to use log4j for logging in my application,
but somehow the logs are not getting written to the specified file.
I created the application using Maven, and kept the log.properties file in the
resources folder.
The application is written in Scala.
If there is any alte
Hi,
I am running into a performance issue when joining data frames created from Avro
files using the spark-avro library.
The data frames are created from 120K Avro files and the total size is around
1.5 TB.
The two data frames are huge, with billions of records.
The join for these two DataFrames
See this thread:
http://search-hadoop.com/m/q3RTtEor1vYWbsW
which mentioned:
SPARK-11105 Disitribute the log4j.properties files from the client to the
executors
FYI
On Fri, Dec 18, 2015 at 7:23 AM, Kalpesh Jadhav <
kalpesh.jad...@citiustech.com> wrote:
> Hi all,
>
>
>
> I am new to spark, I am
Hi Cody,
KafkaUtils.createRDD totally makes sense. Now I can run my Spark job once every
15 minutes, extract data out of Kafka, and stop. I rely on the Kafka offsets
for incremental data, am I right? So no duplicate data will be returned.
Thanks
Sri
On Fri, Dec 18, 2015 at 2:41 PM, Cody Koeninger
You'll need to keep track of the offsets.
On Fri, Dec 18, 2015 at 9:51 AM, sri hari kali charan Tummala <
kali.tumm...@gmail.com> wrote:
> Hi Cody,
>
> KafkaUtils.createRDD totally make sense now I can run my spark job once in
> 15 minutes extract data out of kafka and stop ..., I rely on kafka o
Hi,
My Spark batch job sometimes seems to hang for a long time before it
starts the next stage or exits. Basically it happens when there is a
mapPartitions/foreachPartition in a stage. Any idea as to why this is
happening?
Thanks,
Swetha
--
View this message in context:
http://apache-spark-user-li
I need some clarification about the documentation. According to the Spark docs,
"the default interval is a multiple of the batch interval that is at least 10
seconds. It can be set by using dstream.checkpoint(checkpointInterval).
Typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream
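For what it's worth, setting that interval on a stateful stream looks roughly like this (ssc and stateDStream are assumed to exist, and the values are illustrative for a 10-second batch interval):

import org.apache.spark.streaming.Seconds

ssc.checkpoint("hdfs:///tmp/checkpoints")   // checkpoint directory; path is a placeholder

// Checkpoint the stateful stream's data every 10 batches (5-10 batch intervals is the usual guidance).
stateDStream.checkpoint(Seconds(100))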
That was exactly the problem, Michael; more details in this post:
http://stackoverflow.com/questions/34184079/cannot-run-queries-in-sqlcontext-from-apache-spark-sql-1-5-2-getting-java-lang
*Matheus*
On Wed, Dec 9, 2015 at 4:43 PM, Michael Armbrust
wrote:
> java.lang.NoSuchMethodError almost alwa
Hi,
I'm trying to configure log4j on an AWS EMR 4.2 Spark cluster for a streaming
job submitted in client mode.
I changed
/etc/spark/conf/log4j.properties
to use a FileAppender. However the INFO logging still goes to console.
Thanks for any suggestions,
--
Nick
From the console:
Thanks Ted!
Yes, the schema might be different or the same.
What would be the answer for each situation?
On Fri, Dec 18, 2015 at 6:02 PM, Ted Yu wrote:
> CacheManager#cacheQuery() is called where:
> * Caches the data produced by the logical representation of the given
> [[Queryable]].
> ...
>
This method in CacheManager:
  private[sql] def lookupCachedData(plan: LogicalPlan): Option[CachedData] = readLock {
    cachedData.find(cd => plan.sameResult(cd.plan))
  }
led me to the following in
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala:
def sam
Can you try the latest 1.6.0 RC which includes SPARK-1 ?
Cheers
On Fri, Dec 18, 2015 at 7:38 AM, Prasad Ravilla wrote:
> Hi,
>
> I am running into performance issue when joining data frames created from
> avro files using spark-avro library.
>
> The data frames are created from 120K avro f
So I looked at the function; my only worry is whether the cache is cleared
when I'm overwriting the cache with the same table name. I did this
experiment and the cache shows the table as not cached, but I want to confirm
this. In addition to not using the old table's values, is it actually
removed/overwrit
From the UI I see two rows for this on a streaming application:
RDD Name                      | Storage Level                     | Cached Partitions | Fraction Cached | Size in Memory | Size in ExternalBlockStore | Size on Disk
In-memory table myColorsTable | Memory Deserialized 1x Replicated | 2                 | 100%            | 728.2 KB       | 0.0 B                      | 0.0 B
In-memory table myColorsTable | Memory Deseria
Could you check the Scala version of your Kafka?
Best Regards,
Shixiong Zhu
2015-12-18 2:31 GMT-08:00 Christos Mantas :
> Thank you, Luciano, Shixiong.
>
> I thought the "_2.11" part referred to the Kafka version - an unfortunate
> coincidence.
>
> Indeed
>
> spark-submit --jars spark-streaming-
When a second attempt is made to cache df3, which has the same schema as the first
DataFrame, you would see the warning below:
scala> sqlContext.cacheTable("t1")
scala> sqlContext.isCached("t1")
res5: Boolean = true
scala> sqlContext.sql("select * from t1").show
+---+---+
| a| b|
+---+---+
| 1| 1|
Looks like you have a reference to some Akka class. Could you post your code?
Best Regards,
Shixiong Zhu
2015-12-17 23:43 GMT-08:00 Pankaj Narang :
> I am encountering below error. Can somebody guide ?
>
> Something similar is one this link
> https://github.com/elastic/elasticsearch-hadoop/issues/29
I created the following data file, data.file:
1 1
1 2
1 3
2 4
3 5
4 6
5 7
6 1
7 2
8 8
The following code:
def parse_line(line):
    tokens = line.split(' ')
    return (int(tokens[0]), int(tokens[1])), 1.0
lines = sc.textFile('./data.file')
linesTest = sc.textFile('./data.file')
During spark-submit when running hive on spark I get:
Exception in thread "main" java.util.ServiceConfigurationError:
org.apache.hadoop.fs.FileSystem: Provider
org.apache.hadoop.hdfs.HftpFileSystem could not be instantiated
Caused by: java.lang.IllegalAccessError: tried to access method
org.apac
Found the issue: a conflict between setting Java options in both
spark-defaults.conf and in the spark-submit command.
--
Nick
From: Afshartous, Nick
Sent: Friday, December 18, 2015 11:46 AM
To: user@spark.apache.org
Subject: Configuring log4j
Hi,
Am trying t
You are right. "checkpointInterval" is only for data checkpointing.
"metadata checkpoint" is done for each batch. Feel free to send a PR to add
the missing doc.
Best Regards,
Shixiong Zhu
2015-12-18 8:26 GMT-08:00 Lan Jiang :
> Need some clarification about the documentation. According to Spark
Hi Saif, have you verified that the cluster has enough resources for all 4
programs?
-Andrew
2015-12-18 5:52 GMT-08:00 :
> Hello everyone,
>
> I am testing some parallel program submission to a stand alone cluster.
> Everything works alright, the problem is, for some reason, I can’t submit
> mor
Hi Roy,
I believe Spark just gets its application ID from YARN, so you can just do
`sc.applicationId`.
-Andrew
2015-12-18 0:14 GMT-08:00 Deepak Sharma :
> I have never tried this but there is yarn client api's that you can use in
> your spark program to get the application id.
> Here is the lin
Hi Antony,
The configuration to enable dynamic allocation is per-application.
If you only wish to enable this for one of your applications, just set
`spark.dynamicAllocation.enabled` to true for that application only. The
way it works under the hood is that the application will start sending requests
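For example, for just that one application it could be set like this (a sketch; the external shuffle service also has to be enabled on the nodes, and the min/max bounds are placeholders):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("my-elastic-app")                        // hypothetical application name
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")        // required by dynamic allocation
  .set("spark.dynamicAllocation.minExecutors", "2")    // placeholder bounds
  .set("spark.dynamicAllocation.maxExecutors", "20")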
I am going to say no, but have not actually tested this. Just going on
this line in the docs:
http://spark.apache.org/docs/latest/configuration.html
spark.driver.extraClassPath (default: none): Extra classpath entries to
prepend to the classpath of the driver.
Note: In client mode, this config mus
Hi Rastan,
Unless you're using off-heap memory or starting multiple executors per
machine, I would recommend the r3.2xlarge option, since you don't actually
want gigantic heaps (100GB is more than enough). I've personally run Spark
on a very large scale with r3.8xlarge instances, but I've been usi
You need to use window functions to get this kind of behavior. Or use max
and a struct (
http://stackoverflow.com/questions/13523049/hive-sql-find-the-latest-record)
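A sketch of both approaches (the id/updated_at/value column names are made up; window functions in 1.x need a HiveContext, and row_number() is the 1.6+ name):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// 1) Window function: rank the rows per key by timestamp and keep the newest one.
val w = Window.partitionBy("id").orderBy(col("updated_at").desc)
val latestViaWindow = df.withColumn("rn", row_number().over(w))
  .where(col("rn") === 1)
  .drop("rn")

// 2) max + struct: the max of (timestamp, value) structs carries along the value of the newest row.
val latestViaStruct = df
  .groupBy("id")
  .agg(max(struct(col("updated_at"), col("value"))).as("latest"))
  .select(col("id"), col("latest.updated_at"), col("latest.value"))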
On Thu, Dec 17, 2015 at 11:55 PM, Timothée Carayol <
timothee.cara...@gmail.com> wrote:
> Hi all,
>
> I tried to do something like
Changing the equality check from “<=>” to “===“ solved the problem. Performance
skyrocketed.
I am wondering why “<=>” causes a performance degradation?
val dates = new RetailDates()
val dataStructures = new DataStructures()
// Reading CSV Format input files -- retailDates
// This DF has 75 records
val
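For context, “<=>” is Spark SQL's null-safe equality (NULL <=> NULL is true), and in releases before 1.6 a join condition written with it was, as far as I understand, not recognized as an equi-join key, so the planner fell back to a much slower join. A sketch with placeholder DataFrames dfA/dfB and key column:

// Plain equality: planned as a standard equi-join; rows whose key is NULL simply don't match.
val fastJoin = dfA.join(dfB, dfA("key") === dfB("key"))

// Null-safe equality: NULL <=> NULL is true, but before Spark 1.6 this condition
// could miss the equi-join code path and degrade performance badly.
val nullSafeJoin = dfA.join(dfB, dfA("key") <=> dfB("key"))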
This is fixed in Spark 1.6.
On Fri, Dec 18, 2015 at 3:06 PM, Prasad Ravilla wrote:
> Changing equality check from “<=>”to “===“ solved the problem.
> Performance skyrocketed.
>
> I am wondering why “<=>” cause a performance degrade?
>
> val dates = new RetailDates()
> val dataStructures = new Da
Hi,
How do we run multiple Spark jobs that take Spark Streaming data as
input, as a workflow in Oozie? We have to run our streaming job first and
then have a workflow of Spark batch jobs to process the data.
Any suggestions on this would be of great help.
Thanks!
--
View this message in c
See https://issues.apache.org/jira/browse/SPARK-7301
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Ambiguous-references-to-a-field-set-in-a-partitioned-table-AND-the-data-tp22325p25740.html
Sent from the Apache Spark User List mailing list archive at Nabbl
Andrew, it's going to be 4 executor JVMs on each r3.8xlarge.
Rastan, you can run a quick test using an EMR Spark cluster on spot instances
and see which configuration works better. Without the tests it is all
speculation.
On Dec 18, 2015 1:53 PM, "Andrew Or" wrote:
> Hi Rastan,
>
> Unless you're using
Hi - I'm running Spark Streaming using PySpark 1.3 in yarn-client mode on CDH
5.4.4. The job sometimes runs a full 24hrs, but more often it fails
sometime during the day.
I'm getting several vague errors that I don't see much about when searching
online:
- py4j.Py4JException: Error while obtaini
Hello experts, I am new to Spark. Can anyone please explain how to fetch
data from an HBase table in Spark (Java)?
Thanks in advance.