Increase your executor memory. You can also play around with increasing the
number of partitions/parallelism, etc.
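For reference, here is a minimal sketch of where those knobs live; the values and the input path below are placeholders, not recommendations:

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder values -- tune them to your own cluster.
val conf = new SparkConf()
  .setAppName("MemoryTuningSketch")
  .set("spark.executor.memory", "8g")        // heap per executor
  .set("spark.default.parallelism", "200")   // default number of tasks per stage

val sc = new SparkContext(conf)

// Increasing the number of partitions for a specific RDD:
val data = sc.textFile("hdfs:///some/input")   // hypothetical path
val repartitioned = data.repartition(200)      // spread the work across more, smaller tasks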
Thanks
Best Regards
On Tue, Feb 17, 2015 at 3:39 AM, Harshvardhan Chauhan
wrote:
> Hi All,
>
>
> I need some help with Out Of Memory errors in my application. I am using
> Spark 1.1
Hi,
I'm trying to connect to Cassandra through PySpark using the
spark-cassandra-connector from datastax based on the work of Mike
Sukmanowsky.
I can use Spark and Cassandra through the datastax connector in Scala just
fine. Where things fail in PySpark is that an exception is raised in
org.apach
BTW we merged this today: https://github.com/apache/spark/pull/4640
This should allow us in the future to address column by name in a Row.
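As a rough sketch of the kind of access that should enable once it ships (the method name follows the direction of that PR and may differ in the released API):

import org.apache.spark.sql.Row

// Hypothetical usage sketch: look a column up by name instead of by position.
def extractRanId(row: Row): String =
  row.getAs[String]("ran_id")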
On Mon, Feb 16, 2015 at 11:39 AM, Michael Armbrust
wrote:
> I can unpack the code snippet a bit:
>
> caper.select('ran_id) is the same as saying "SELECT ra
Hi there,
I am trying to scale up the data size that my application is handling. This
application is running on a cluster with 16 slave nodes. Each slave node has
60GB memory. It is running in standalone mode. The data is coming from HDFS,
which is also on the same local network.
In order to have an under
You can build your own Spark with the -Phive-thriftserver option.
You can publish the jars locally. I hope that would solve your problem.
On Mon, Feb 16, 2015 at 8:54 PM, Marco wrote:
> Ok, so will it be only available for the next version (1.3.0)?
>
> 2015-02-16 15:24 GMT+01:00 Ted Yu :
>
>> I sear
Hello,
I have a simple Kafka Spark Streaming example which I am still developing
in standalone mode.
Here is what is puzzling me: if I build the assembly jar and use
bin/spark-submit to run it, it works fine.
But if I want to run the code from within the IntelliJ IDE, then it will cry
for this erro
Hi Arush,
With your code, I still didn't see the output "Received X flume events"..
bit1...@163.com
From: bit1...@163.com
Date: 2015-02-17 14:08
To: Arush Kharbanda
CC: user
Subject: Re: Re: Question about spark streaming+Flume
Ok, you are missing a letter in foreachRDD.. let me proceed..
bit1...@163.com
From: Arush Kharbanda
Date: 2015-02-17 14:31
To: bit1...@163.com
CC: user
Subject: Re: Question about spark streaming+Flume
Hi
Can you try this
val lines = FlumeUtils.createStream(ssc,"localhost",)
// Prin
Thanks Arush..
With your code, a compile error occurs:
Error:(19, 11) value forechRDD is not a member of
org.apache.spark.streaming.dstream.ReceiverInputDStream[org.apache.spark.streaming.flume.SparkFlumeEvent]
lines.forechRDD(_.foreach(println))
^
From: Arush Kharbanda
Date: 2015-02-17 14
Hi
Can you try this
val lines = FlumeUtils.createStream(ssc,"localhost",)
// Print out the count of events received from this server in each
batch
lines.count().map(cnt => "Received " + cnt + " flume events. at " +
System.currentTimeMillis() )
lines.forechRDD(_.foreach(println))
Than
Hi there,
I am trying to scale up the data size that my application is handling. This
application is running on a cluster with 16 slave nodes. Each slave node
has 60GB memory. It is running in standalone mode. The data is coming from
HDFS, which is also on the same local network.
In order to have an under
Hi,
I am trying the Spark Streaming + Flume example:
1. Code
object SparkFlumeNGExample {
def main(args : Array[String]) {
val conf = new SparkConf().setAppName("SparkFlumeNGExample")
val ssc = new StreamingContext(conf, Seconds(10))
val lines = FlumeUtils.createStream(ssc,"localhost"
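(The snippet above is cut off; for reference, a minimal self-contained sketch of the same skeleton with an output operation and the start/await calls added. The Flume port is a placeholder, since the messages in this thread elide it.)

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object SparkFlumeNGExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SparkFlumeNGExample")
    val ssc = new StreamingContext(conf, Seconds(10))

    // 4141 is a placeholder port -- use the port your Flume Avro sink sends to.
    val lines = FlumeUtils.createStream(ssc, "localhost", 4141)

    // Print the count of events received in each batch.
    lines.count()
      .map(cnt => "Received " + cnt + " flume events at " + System.currentTimeMillis())
      .print()

    // An alternative output operation (note the spelling: foreachRDD).
    lines.foreachRDD(rdd => rdd.foreach(event => println(event)))

    ssc.start()
    ssc.awaitTermination()
  }
}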
Which Scala version did you use to build Spark?
It seems your nscala-time library's Scala version is 2.11,
while the default Spark Scala version is 2.10.
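As an illustration only, a build.sbt sketch of keeping the two aligned; the version numbers here are assumptions, not prescriptions:

// build.sbt sketch -- the point is that the _2.10/_2.11 suffix of every
// dependency must match the Scala version Spark itself was built against.
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark"       %% "spark-core"  % "1.2.1" % "provided",
  // %% appends the Scala binary version, so this resolves to nscala-time_2.10
  "com.github.nscala-time" %% "nscala-time" % "1.8.0"
)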
On Tue Feb 17 2015 at 1:51:47 AM Hammam CHAMSI wrote:
> Hi All,
>
> Thanks in advance for your help. I have timestamp which I need to convert
> to da
Will do. Thanks a lot.
On Mon, Feb 16, 2015 at 7:20 PM, Davies Liu wrote:
> Can you try the example in pyspark-cassandra?
>
> If not, you could create an issue there.
>
> On Mon, Feb 16, 2015 at 4:07 PM, Mohamed Lrhazi
> wrote:
> > So I tried building the connector from:
> > https://github.com/
For the last question, you can trigger a GC in the JVM from Python by:
sc._jvm.System.gc()
On Mon, Feb 16, 2015 at 4:08 PM, Antony Mayi
wrote:
> thanks, that looks promising but can't find any reference giving me more
> details - can you please point me to something? Also is it possible to force
> G
Can you try the example in pyspark-cassandra?
If not, you could create an issue there.
On Mon, Feb 16, 2015 at 4:07 PM, Mohamed Lrhazi
wrote:
> So I tried building the connector from:
> https://github.com/datastax/spark-cassandra-connector
>
> which seems to include the java class referenced in t
So I tried building the connector from:
https://github.com/datastax/spark-cassandra-connector
which seems to include the java class referenced in the error message:
[root@devzero spark]# unzip -l
spark-cassandra-connector/spark-cassandra-connector-java/target/scala-2.10/spark-cassandra-connector-
Thanks, that looks promising, but I can't find any reference giving me more
details - can you please point me to something? Also, is it possible to force GC
from pyspark (as I am using pyspark)?
Thanks, Antony.
On Monday, 16 February 2015, 21:05, Tathagata Das
wrote:
Correct, brute f
This will be fixed by https://github.com/apache/spark/pull/4629
On Fri, Feb 13, 2015 at 10:43 AM, Imran Rashid wrote:
> yeah I thought the same thing at first too, I suggested something equivalent
> w/ preservesPartitioning = true, but that isn't enough. the join is done by
> union-ing the two t
Oh, I don't know. Thanks a lot Davies, gonna figure that out now.
On Mon, Feb 16, 2015 at 5:31 PM, Davies Liu wrote:
> It also needs the Cassandra jar:
> com.datastax.spark.connector.CassandraJavaUtil
>
> Is it included in /spark/pyspark-cassandra-0.1-SNAPSHOT.jar ?
>
>
>
> On Mon, Feb 16, 20
It also needs the Cassandra jar: com.datastax.spark.connector.CassandraJavaUtil
Is it included in /spark/pyspark-cassandra-0.1-SNAPSHOT.jar ?
On Mon, Feb 16, 2015 at 1:20 PM, Mohamed Lrhazi
wrote:
> Yes, I am sure the system can't find the jar.. but how do I fix that... my
> submit command includ
Hi All,
I need some help with Out Of Memory errors in my application. I am using
Spark 1.1.0 and my application is using Java API. I am running my app on
EC2 25 m3.xlarge (4 Cores 15GB Memory) instances. The app only fails
sometimes. Lots of mapToPair tasks are failing. My app is configured to ru
My first problem was somewhat similar to yours. You won't find a whole lot
of JDBC to Spark examples since I think a lot of the adoption for Spark is
from teams already experienced with Hadoop who already have an established
big data solution (so their data is already extracted from whatever
source
Hi Experts,
I have a large table with 54 million records (fact table), being joined
with 6 small tables (dimension tables). The size on disk of the small tables is
within 5k and the record count is in the range of 4 - 200.
All the worker nodes have 32GB of RAM allocated for Spark. I have tried the
belo
Thanks Charles. I just realized a few minutes ago that I neglected to
show the step where I generated the key on the person ID. Thanks for the
pointer on the HDFS URL.
The next step is to process data from multiple RDDs. My data originates from
7 tables in a MySQL database. I used sqoop to create
I cannot comment on the correctness of the Python code. I will assume your
caper_kv is keyed on something that uniquely identifies all the rows that
make up the person's record so your group by key makes sense, as does the
map. (I will also assume all of the rows that comprise a single person's
reco
Yes, I am sure the system can't find the jar.. but how do I fix that... my
submit command includes the jar:
/spark/bin/spark-submit --py-files /spark/pyspark_cassandra.py --jars
/spark/pyspark-cassandra-0.1-SNAPSHOT.jar --driver-class-path
/spark/pyspark-cassandra-0.1-SNAPSHOT.jar /spark/test2.py
an
It seems that the jar for Cassandra is not loaded; you should have
it in the classpath.
On Mon, Feb 16, 2015 at 12:08 PM, Mohamed Lrhazi
wrote:
> Hello all,
>
> Trying the example code from this package
> (https://github.com/Parsely/pyspark-cassandra) , I always get this error...
>
> Can you se
I'm a Spark newbie working on my first attempt to write an ETL
program. I could use some feedback to make sure I'm on the right path.
I've written a basic proof of concept that runs without errors and seems
to work, although I might be missing some issues when this is actually
run on more t
Hello all,
Trying the example code from this package (
https://github.com/Parsely/pyspark-cassandra) , I always get this error...
Can you see what I am doing wrong? From googling around, it seems that
the jar is not found somehow... The Spark log shows the JAR was processed
at least.
Thank
Correct, brute force clean up is not useful. Since Spark 1.0, Spark can do
automatic cleanup of files based on which RDDs are used/garbage collected
by JVM. That would be the best way, but depends on the JVM GC
characteristics. If you force a GC periodically in the driver that might
help you get ri
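(The message is cut off above; a minimal sketch of forcing a periodic GC in the driver, as suggested. The interval is an arbitrary placeholder, and the PySpark equivalent via sc._jvm.System.gc() appears elsewhere in this thread.)

import java.util.concurrent.{Executors, TimeUnit}

// Periodically force a GC in the driver JVM so that unreferenced RDDs and
// broadcasts become eligible for Spark's automatic cleanup.
val gcScheduler = Executors.newSingleThreadScheduledExecutor()
gcScheduler.scheduleAtFixedRate(new Runnable {
  override def run(): Unit = System.gc()
}, 30, 30, TimeUnit.MINUTES)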
I can unpack the code snippet a bit:
caper.select('ran_id) is the same as saying "SELECT ran_id FROM table" in
SQL. It's always a good idea to explicitly request the columns you need
right before using them. That way you are tolerant of any changes to the
schema that might happen upstream.
The n
Is it possible to port WrappedArraySerializer.scala to your app?
Pardon me for not knowing how to integrate Chill with Spark.
Cheers
On Mon, Feb 16, 2015 at 12:31 AM, Tao Xiao wrote:
> Thanks Ted
>
> After searching for a whole day, I still don't know how to let spark use
> twitter chill seri
I am just learning Scala, so I don't actually understand what your code
snippet is doing, but thank you; I will learn more so I can figure it out.
I am new to all of this and still trying to make the mental shift from
normal programming to distributed programming, but it seems to me that
the row
For efficiency, the row objects don't contain the schema, so you can't get
a column by name directly. I usually do a select followed by pattern
matching. Something like the following:
caper.select('ran_id).map { case Row(ranId: String) => }
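A slightly fuller sketch of that pattern (it assumes the caper SchemaRDD and the import sqlContext._ from earlier in the thread; collecting to the driver is purely illustrative):

import org.apache.spark.sql.Row

// Select only the column you need, then pattern match on each Row.
val ranIds: Array[String] =
  caper.select('ran_id)
    .map { case Row(ranId: String) => ranId }
    .collect()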
On Mon, Feb 16, 2015 at 8:54 AM, Eric Bell wrote:
> I
You probably want to mark the HiveContext as @transient, as it's not valid to
use it on the slaves anyway.
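A minimal sketch of that marking, assuming the context lives in a class whose instances end up captured in closures:

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

class StreamingJob(@transient val sc: SparkContext) extends Serializable {
  // @transient keeps the HiveContext out of serialized closures;
  // it is only ever used on the driver anyway.
  @transient lazy val hiveContext = new HiveContext(sc)
}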
On Mon, Feb 16, 2015 at 1:58 AM, Haopu Wang wrote:
> I have a streaming application which registered temp table on a
> HiveContext for each batch duration.
>
> The application runs well in S
We've been using Commons Configuration to pull our properties out of
properties files and system properties (prioritizing system properties over
the others). We add those properties to our Spark conf explicitly, and we use
ArgoPartser to get the command-line argument for which property file to
load.
Is it possible to reference a column from a SchemaRDD using the column's
name instead of its number?
For example, let's say I've created a SchemaRDD from an avro file:
val sqlContext = new SQLContext(sc)
import sqlContext._
val caper=sqlContext.avroFile("hdfs://localhost:9000/sma/raw_avro/caper
Hi All,
Thanks in advance for your help. I have timestamps which I need
to convert to datetime using Scala. A folder contains the three needed
jar files: "joda-convert-1.5.jar joda-time-2.4.jar
nscala-time_2.11-1.8.0.jar"
Using scala REPL and adding the jars: scala -classpath "*.jar"
I can
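(The message trails off above; for what it's worth, a minimal sketch of the conversion with nscala-time, assuming the timestamps are epoch milliseconds. The sample value and format are placeholders.)

// Sketch: epoch-millisecond Long -> DateTime via nscala-time (a joda-time wrapper).
import com.github.nscala_time.time.Imports._

val ts: Long = 1424160000000L           // placeholder epoch-millis value
val dt: DateTime = new DateTime(ts)     // joda DateTime in the default time zone
println(dt.toString("yyyy-MM-dd HH:mm:ss"))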
I haven't actually tried mixing non-Spark settings into the Spark
properties. Instead I package my properties into the jar and use the
Typesafe Config[1] - v1.2.1 - library (along with Ficus[2] - Scala
specific) to get at my properties:
Properties file: src/main/resources/integration.conf
(below
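(The message is cut off before the file contents; as a rough sketch of the approach, with file contents and key names invented purely for illustration:)

// src/main/resources/integration.conf (illustrative contents, not the original's):
//   job {
//     output-dir = "file:///tmp/mymodule/out"
//   }

import com.typesafe.config.ConfigFactory

// load() reads the named config from the classpath; -D system properties
// override values coming from the file.
val config = ConfigFactory.load("integration")
val outputDir = config.getString("job.output-dir")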
How about system properties? Or something like Typesafe Config, which
lets you at least override something in a built-in config file on the
command line, with props or other files.
On Mon, Feb 16, 2015 at 3:38 PM, Emre Sevinc wrote:
> Sean,
>
> I'm trying this as an alternative to what I currently
Sean,
I'm trying this as an alternative to what I currently do. Currently I have
my module.properties file for my module in the resources directory, and
that file is put inside the über JAR file when I build my application with
Maven, and then when I submit it using spark-submit, I can read that
m
Since SparkConf is only for Spark properties, I think it will in
general only pay attention to and preserve "spark.*" properties. You
could experiment with that. In general I wouldn't rely on Spark
mechanisms for your configuration, and you can use any config
mechanism you like to retain your own p
Hello,
I'm using Spark 1.2.1 and have a module.properties file, and in it I have
non-Spark properties, as well as Spark properties, e.g.:
job.output.dir=file:///home/emre/data/mymodule/out
I'm trying to pass it to spark-submit via:
spark-submit --class com.myModule --master local[4] --dep
Ok, so will it be only available for the next version (1.3.0)?
2015-02-16 15:24 GMT+01:00 Ted Yu :
> I searched for 'spark-hive-thriftserver_2.10' on this page:
> http://mvnrepository.com/artifact/org.apache.spark
>
> Looks like it is not published.
>
> On Mon, Feb 16, 2015 at 5:44 AM, Marco wrot
Hello!
I'm a newbie to Spark and I have the following case study:
1. A client sends the following data every 100 ms:
{uniqueId, timestamp, measure1, measure2}
2. Every 30 seconds I would like to correlate the data collected in the
window with some predefined double vector pattern for each given key.
Dear Cheng Hao,
You are right!
After using the HiveContext, the issue is solved.
Thanks,
Wush
2015-02-15 10:42 GMT+08:00 Cheng, Hao :
> Are you using the SQLContext? I think the HiveContext is recommended.
>
>
>
> Cheng Hao
>
>
>
> *From:* Wush Wu [mailto:w...@bridgewell.com]
> *Sent:* Thursd
I searched for 'spark-hive-thriftserver_2.10' on this page:
http://mvnrepository.com/artifact/org.apache.spark
Looks like it is not published.
On Mon, Feb 16, 2015 at 5:44 AM, Marco wrote:
> Hi,
>
> I am referring to https://issues.apache.org/jira/browse/SPARK-4925 (Hive
> Thriftserver Maven Ar
Sean,
In this case, I've been testing the code on my local machine and using
Spark locally, so all the log output was available on my terminal. And
I've used the .print() method to have an output operation, just to force
Spark to execute.
And I was not using foreachRDD, I was only using print() me
Hi,
I am referring to https://issues.apache.org/jira/browse/SPARK-4925 (Hive
Thriftserver Maven Artifact). Can somebody point me (URL) to the artifact
in a public repository ? I have not found it @Maven Central.
Thanks,
Marco
Instead of print, you should do jsonIn.count().print(). A straightforward
approach is to use foreachRDD :)
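A short sketch of the foreachRDD route (in Scala for brevity, while the thread's application is Java; the source and output paths are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("OutputOperationSketch")
val ssc = new StreamingContext(conf, Seconds(30))

// Placeholder source; the thread's jsonIn comes from watching an input directory.
val jsonIn = ssc.textFileStream("file:///tmp/mymodule/in")

// foreachRDD is an output operation, so it forces each batch to actually execute.
jsonIn.foreachRDD { rdd =>
  rdd.saveAsTextFile("file:///tmp/mymodule/out-" + System.currentTimeMillis())
}

ssc.start()
ssc.awaitTermination()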
Thanks
Best Regards
On Mon, Feb 16, 2015 at 6:48 PM, Emre Sevinc wrote:
> Hello Sean,
>
> I did not understand your question very well, but what I do is checking
> the output directory (and
Materialization shouldn't be relevant. The collect by itself doesn't let
you detect whether it happened. Print should print some results to the
console, but on different machines, so it may not be a reliable way to see what
happened.
Yes I understand your real process uses foreachRDD and that's what y
Hello Sean,
I did not understand your question very well, but what I do is check the
output directory (and I have various logger outputs at various stages
showing the contents of an input file being processed, the response from
the web service, etc.).
By the way, I've already solved my problem
How are you deciding whether files are processed or not? It doesn't seem
possible from this code. Maybe it just seems so.
On Feb 16, 2015 12:51 PM, "Emre Sevinc" wrote:
> I've managed to solve this, but I still don't know exactly why my solution
> works:
>
> In my code I was trying to force the S
I've managed to solve this, but I still don't know exactly why my solution
works:
In my code I was trying to force the Spark to output via:
jsonIn.print();
jsonIn being a JavaDStream.
When I removed the code above and added the code below to force the output
operation, hence the execution:
Hello,
I have an application in Java that uses Spark Streaming 1.2.1 in the
following manner:
- Listen to the input directory.
- If a new file is copied to that input directory process it.
- Process: contact a RESTful web service (running also locally and
responsive), send the contents of the
PS this is the real fix to this issue:
https://issues.apache.org/jira/browse/SPARK-5795
I'd like to merge it as I don't think it breaks the API; it actually
fixes it to work as intended.
On Mon, Feb 16, 2015 at 3:25 AM, Bahubali Jain wrote:
> I used the latest assembly jar and the below as sugg
I have a streaming application which registers a temp table on a
HiveContext for each batch duration.
The application runs well in Spark 1.1.0, but I get the error below from
1.1.1.
Do you have any suggestions to resolve it? Thank you!
java.io.NotSerializableException: org.apache.hadoop.hive.conf.
Hi all,
The default OutputCommitter used by RDDs, which is FileOutputCommitter, seems to
require moving files at the commit step, which is not a constant-time operation in
S3, as discussed in
http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3c543e33fa.2000...@entropy.be%3E.
People se
Thanks Ted
After searching for a whole day, I still don't know how to let Spark use
Twitter chill serialization - there are very few documents about how to
integrate Twitter chill into Spark for serialization. I tried the
following, but an exception of "java.lang.ClassCastException:
com.twitter.ch
Hi Jianshi,
When accessing a Hive table with Parquet SerDe, Spark SQL tries to convert
it into Spark SQL's native Parquet support for better performance. And yes,
predicate push-down and column pruning are applied here. In 1.3.0, we'll also
cover the write path except for writing partitioned table.
spark.cleaner.ttl is not the right way - it seems to be really designed for
streaming. Although it keeps the disk usage under control, it also causes loss
of RDDs and broadcasts that are required later, leading to crashes.
Is there any other way? Thanks, Antony.
On Sunday, 15 February 2015, 21:42,