current file structure wherever
it runs in the cluster. Therefore, I am just wondering whether there are any
best practices for this, or whether the plain way is to clone the git repos
across all the nodes in the cluster. Any suggestion is welcome.
Thanks,
RK
> <https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L229>
> On Tue, Aug 23, 2016 4:17 PM, kant kodali <kanth...@gmail.com> wrote:
> @RK you may want to look more deeply if you
> >> RDD contains data but not JVM byte code, i.e., data which is read from the
> >> source and to which transformations have been applied. This is the ideal
> >> case to persist RDDs. As Nirav mentioned, this data will be serialized before persisting t
That’s because of this:
scala> val text =
Array((1,"hNjLJEgjxn"),(2,"lgryHkVlCN"),(3,"ukswqcanVC"),(4,"ZFULVxzAsv"),(5,"LNzOozHZPF"),(6,"KZPYXTqMkY"),(7,"DVjpOvVJTw"),(8,"LKRYrrLrLh"),(9,"acheneIPDM"),(10,"iGZTrKfXNr"))
text: Array[(Int, String)] = Array((1,hNjLJEgjxn), (2,lgryHkVlCN),
(3,ukswqc
On another note, if you have a streaming app, you checkpoint the RDDs so that
they can be accessed in case of a failure. And yes, RDDs are persisted to DISK.
You can access Spark's UI and see them listed under the Storage tab.
If RDDs are persisted in memory, you avoid any disk I/O, so that any look
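For what it's worth, here is a minimal sketch of both ideas (persisting for
reuse, checkpointing for failure recovery); the source, batch interval, and
checkpoint directory below are placeholders, not from your app:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))       // placeholder batch interval
ssc.checkpoint("hdfs:///tmp/checkpoints")             // placeholder checkpoint directory

val lines = ssc.socketTextStream("localhost", 9999)   // placeholder source
lines.persist(StorageLevel.MEMORY_AND_DISK)           // shows up under the Storage tab
lines.checkpoint(Seconds(30))                         // checkpoint the DStream's RDDs periodically

lines.foreachRDD(rdd => println(rdd.count()))
ssc.start()
ssc.awaitTermination()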
).start()
It fails in py4j/protocol.py at the function get_return_value. Has anyone out
there who has played with Sparkling Water faced any issues with H2O + Spark on Cloudera?
Thanks,
RK
results? If this is by design, I am flummoxed.
-RK
You can have a temporary file to capture the data that you would like to
overwrite, and swap that with the existing partition whose data you want to
wipe away. Swapping can be done by a simple rename of the partition, followed
by repairing the table so it picks up the new partition.
I am not sure if that
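Roughly, the rename-and-repair idea looks like this; the table name, paths,
partition column, and the newData DataFrame are all placeholders, and it
assumes an external Hive table with a HiveContext available as sqlContext:

import org.apache.hadoop.fs.{FileSystem, Path}

// stage the replacement data in a temporary location first
newData.write.parquet("/tmp/staging/dt=2016-08-01")   // placeholder path and format

// swap: move the old partition directory aside, rename the staged one into place
val fs = FileSystem.get(sc.hadoopConfiguration)
fs.rename(new Path("/warehouse/my_table/dt=2016-08-01"), new Path("/tmp/old/dt=2016-08-01"))
fs.rename(new Path("/tmp/staging/dt=2016-08-01"), new Path("/warehouse/my_table/dt=2016-08-01"))

// repair the table so the metastore picks up the swapped partition
sqlContext.sql("MSCK REPAIR TABLE my_table")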
I can see a large number of collects happening on the driver and, eventually,
the driver is running out of memory. (I am not sure whether you have persisted
any RDD or DataFrame.) You may want to avoid doing so many collects, or avoid
persisting unwanted data in memory.
To begin with, you may want to
Thanks for the idea, Maciej. The data is roughly 10 gigs.
I'm wondering if there is any way to avoid the collect for each unit operation
and somehow capture all such resultant arrays and collect them at once.
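Something along these lines is one way to read "collect them at once"; units
and processUnit are placeholders for whatever produces the per-unit results,
and the point is a single collect at the end instead of one per unit:

// placeholders: `units` is a local Seq of work items, `processUnit(u)` returns an RDD
val unitResults = units.map(u => processUnit(u))   // Seq[RDD[...]], no per-unit collect
val combined = sc.union(unitResults)               // one distributed dataset
val allAtOnce = combined.collect()                 // single collect at the end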
> On Jul 20, 2016, at 2:52 PM, Maciej Bryński wrote:
>
> RK Aduri,
> Anothe
That -1 is coming from here:
PythonRDD.writeIteratorToStream(inputIterator, dataOut)
dataOut.writeInt(SpecialLengths.END_OF_DATA_SECTION)   // val END_OF_DATA_SECTION = -1
dataOut.writeInt(SpecialLengths.END_OF_STREAM)
dataOut.flush()
> On Jul 21, 2016, at 12:24 PM, Jacek Laskowski wrote:
This has worked for me:
--conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/some/path/search-spark-service-log4j-Driver.properties" \
You may want to try it.
If that doesn't work, then you may use --properties-file.
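For reference, roughly how that looks on a full spark-submit invocation; the
class and jar names here are made up:

spark-submit \
  --class com.example.MyDriver \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:/some/path/search-spark-service-log4j-Driver.properties" \
  my-driver-assembly.jar

With --properties-file you would instead put the spark.* settings into a file
and pass that file to spark-submit.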
Spark version: 1.6.0
So, here is the background:
I have a data frame (Large_Row_DataFrame) which I have created from an
array of row objects and also have another array of unique ids (U_ID) which
I’m going to use to look up into the Large_Row_DataFrame (which is cached)
to do a customized
> Hi,
>
> I have set the driver memory to 10 GB and the job ran with an intermediate
> failure, which was recovered by Spark.
>
> But I still want to know: if the number of parts increases, does the driver
> RAM need to be increased, and what is the ratio of number of parts to RAM?
>
> @RK : I am using cach
Did you check this:
case class Example(name : String, age ; Int)
There is a semicolon where a colon should be; it should have been (age : Int).
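That is, the corrected definition reads:

case class Example(name: String, age: Int)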
Just want to see if this helps.
Are you doing heavy collects and persisting the results? If so, you might
want to parallelize that collection by converting it back to an RDD.
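Something along these lines; the names and partition count are placeholders:

val collected = someRdd.collect()                // big array held on the driver
val backToRdd = sc.parallelize(collected, 100)   // push it back out as a distributed RDD
// continue with distributed operations on backToRdd instead of driver-side loops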
Thanks,
RK
On Tue, Jul 19, 2016 at 12:09 AM, Saurav Sinha
wrote:
> Hi Mich,
>
>1. In what mode are you running
val textFile = sc.textFile("README.md")
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark.saveAsTextFile("output1")
Same error:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
file:/home/user/spark-1.5.1-bin-hadoop2.4/bin/README.md
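One thing worth trying, assuming README.md actually lives in the Spark install
directory rather than in bin/ (the error shows the relative path being resolved
against bin/): point textFile at an absolute path, e.g.

val textFile = sc.textFile("file:///home/user/spark-1.5.1-bin-hadoop2.4/README.md")
val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark.saveAsTextFile("output1")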
You can probably define sliding windows and set larger batch intervals.
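A rough sketch of what that could look like, with placeholder durations and a
placeholder input stream:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(30))           // larger batch interval
val stream = ssc.socketTextStream("localhost", 9999)      // placeholder source
val windowed = stream.window(Seconds(300), Seconds(60))   // 5-minute window, sliding every minute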
Did you try with a different driver memory? Increasing the driver's memory can
be one option. Can you print the GC logs and post the GC times?
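For example, something like this on spark-submit; the memory size is just a
guess, and the -XX flags make the driver JVM print GC details to its stdout:

spark-submit \
  --driver-memory 8g \
  --conf "spark.driver.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  ...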
DataFrames use RDDs as the internal implementation of their structure. A
DataFrame doesn't convert to an RDD; it uses RDD partitions to produce its
logical plan.
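If you want to poke at this yourself, e.g. (the input path is a placeholder):

val df = sqlContext.read.json("people.json")   // placeholder DataFrame
df.explain(true)                               // shows the logical and physical plans
df.rdd.partitions.length                       // the RDD partitions backing the DataFrame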
val count = inputfile.flatMap(line => line.split(" ")).map(word =>
(word,1)).reduceByKey(_ + _);

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:

scala> val count = inputfile.flatMap(line => line.split((" ").map(word =>
(word,1)).reduceByKey(_ + _)
     |
     |
You typed two blank lines.  Starting a new command.
This is what I am getting; how do I solve it?
Regards,
Ramkrishna KT
I am using Spark version 1.5.1 and I am getting errors in my first Spark
program, i.e., word count. Please help me to solve this.
scala> val inputfile = sc.textFile("input.txt")
inputfile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[13] at
textFile at <console>:21

scala> val counts = inputFile
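For what it's worth, the pastes mix "inputfile" and "inputFile" (Scala is
case-sensitive), and the retyped command has a misplaced parenthesis inside
line.split. A minimal corrected version, assuming input.txt actually exists in
the directory where the shell was started (the earlier InvalidInputException
says it was not found there):

val inputfile = sc.textFile("input.txt")
val counts = inputfile.flatMap(line => line.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
counts.saveAsTextFile("wordcount_output")   // placeholder output path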
) =>
{ count > requestThreshold } }
usefulEvents.foreachRDD(filteredEvents => { ... })
Thanks, RK
From: RK
To: User
Sent: Tuesday, March 1, 2016 3:17 PM
Subject: Checkpoint RDD ReliableCheckpointRDD at foreachRDD has different
number of partitions from ori
(103) has different number of partitions
from original RDD MapPartitionsRDD[28050] at reduceByKeyAndWindow at
EventProcessor.scala:45(108)
Has anyone seen this issue, and under what circumstances will this exception occur?
Thanks, RK
If you don't want a few slow tasks to slow down the entire job, you can turn on
speculation.
Here are the speculation settings from Spark Configuration - Spark 1.2.0
Documentation.
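From memory, these are the relevant properties (double-check the defaults
against the docs for your release):

spark.speculation            - false by default; set to true to enable speculative execution
spark.speculation.interval   - how often to check for slow tasks (default 100 ms)
spark.speculation.multiplier - how many times slower than the median a task must be to be speculated (default 1.5)
spark.speculation.quantile   - fraction of tasks that must finish before speculation kicks in (default 0.75)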
The issue is happening when I try to concatenate column values in the query
like "col1+'$'+col3". For some reason, this issue is not manifesting itself
when I do a single IF query.
Is there a concat function in SparkSQL? I can't find anything in the
documentation.
Th
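On the concat question: I believe Hive's concat UDF works if you go through a
HiveContext, and later releases (1.5+) added concat to
org.apache.spark.sql.functions. A rough sketch, reusing the table and column
names from your query (df and hiveContext are placeholders):

// via a HiveContext, using Hive's built-in concat UDF
hiveContext.sql("SELECT concat(col1, '$', col3) FROM my_table")

// or, on Spark 1.5+, with the DataFrame API
import org.apache.spark.sql.functions.{col, concat, lit}
df.select(concat(col("col1"), lit("$"), col("col3")))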
BTW, I am seeing this issue in Spark 1.1.1.
On Sunday, January 4, 2015 7:29 PM, RK wrote:
When I use a single IF statement like "select IF(col1 != "", col1+'$'+col3,
col2+'$'+col3) from my_table", it works fine.
However, when I use a ne
leType)) else if (NOT (col2#2 = )) (CAST((CAST(col2#2,
DoubleType) + CAST($, DoubleType)), DoubleType) + CAST(col3#3, DoubleType))
else $ AS c0#4] Subquery my_table SparkLogicalPlan (ExistingRdd
[DB#0,col1#1,col2#2,col3#3], MappedRDD[97] at getCallSite at DStream.scala:294)
Does Spark SQL not support nested IF queries or is my query incorrect?
Thanks, RK
If you look at your program output closely, you can see the following output.
Lines with a: 24, Lines with b: 15
The exception seems to be happening with Spark cleanup after executing your
code. Try adding sc.stop() at the end of your program to see if the exception
goes away.
On Wedne
I am running the job on 1.1.1.
I will let the job run overnight and send you more info on computation vs GC
time tomorrow.
BTW, do you know what the stage description named "getCallSite at
DStream.scala:294" might mean?
Thanks, RK
On Tuesday, December 30, 2014 6:02 PM, Tat
ob which seems to indicate that this code is fine. Also, the number of waiting
batches is either 0 or 1 even after 8 to 10 hours.
Based on the information that map is as fast as it was at the start of the job
and that there are no waiting batches, I am assuming that the getCallSite
stages correspond to getting data out of Kafka. Is this correct or not? If my
assumption is correct, is there anything that I could do to optimize receiving
data from Kafka? If not, which part of my code needs to be optimized to reduce
the scheduling delay?
Thanks, RK