hi Ted,
thanks for your reply. Is there any other way to do this with Spark 1.3,
such as writing the ORC file manually in a foreachPartition method?
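One rough sketch I am considering instead of writing the files by hand: register the
data as a temp table and insert it into a Hive table stored as ORC via HiveContext
(a sketch only, assuming Hive support is compiled in; table and column names below
are made up):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

val df = rdd.toDF("id", "value")   // rdd: e.g. an RDD[(Int, String)]
df.registerTempTable("tmp_result")
hiveContext.sql("CREATE TABLE IF NOT EXISTS result_orc (id INT, value STRING) STORED AS ORC")
hiveContext.sql("INSERT OVERWRITE TABLE result_orc SELECT id, value FROM tmp_result")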
On Sat, Aug 22, 2015 at 12:19 PM, Ted Yu wrote:
> ORC support was added in Spark 1.4
> See SPARK-2883
>
> On Fri, Aug 21, 2015 at 7:36 PM, dong.yajun
ORC support was added in Spark 1.4
See SPARK-2883
On Fri, Aug 21, 2015 at 7:36 PM, dong.yajun wrote:
> Hi list,
>
> Is there a way to save the RDD result as Orcfile in spark1.3? due to some
> reasons we can't upgrade our spark version to 1.4 now.
>
> --
> *Ric Dong*
>
>
It comes at the start of each task when new data is inserted into Kafka
(the amount of data inserted is very small).
The Kafka topic has 300 partitions - the data inserted is ~10 MB.
Tasks fail and are retried, which succeeds, but after a certain number of failed
tasks the job is killed.
On Sat, Aug 22, 2015 at 2:08 AM, A
You can get the list of all the persisted RDDs using the Spark context...
On Aug 21, 2015 12:06 AM, "Rishitesh Mishra"
wrote:
> I am not sure if you can view all RDDs in a session. Tables are maintained
> in a catalogue, hence it's easier. However, you can see the DAG
> representation, which lists all th
Hi Joseph,
I have used the built-in save as you suggested. The directory gets created,
but the complete model doesn't get written to the file - only a part of the
model is written. Please find attached the part that was written when tested
with the above method.
On 18 August 2015 at
Hi list,
Is there a way to save the RDD result as an ORC file in Spark 1.3? For some
reasons we can't upgrade our Spark version to 1.4 now.
--
*Ric Dong*
I've been reading documentation on accessing offsetRanges and updating ZK
yourself when using DirectKafkaInputDStream (from createDirectStream),
along with the code changes in this PR:
https://github.com/apache/spark/pull/4805.
I'm planning on adding a listener to update ZK (for monitoring purpose
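For reference, the pattern from that PR / the Kafka integration docs looks roughly
like this (a sketch; the actual ZK update is left as a stub):

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

var offsetRanges = Array.empty[OffsetRange]

directKafkaStream.transform { rdd =>
  // capture the offset ranges before any transformation changes the partitioning
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}.foreachRDD { rdd =>
  for (o <- offsetRanges) {
    // update ZK here so existing monitoring tools can see the consumed offsets
    println(s"${o.topic} ${o.partition} ${o.fromOffset} ${o.untilOffset}")
  }
}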
Is there a workaround without updating Hadoop? I would really appreciate it if
someone could explain what Spark is trying to do here and what an easy way to
turn this off would be. Thanks all!
On Fri, Aug 21, 2015 at 11:09 AM, Raghavendra Pandey <
raghavendra.pan...@gmail.com> wrote:
> Did you try with hadoop
Following is a method that retrieves the list of executors registered to a
spark context. It worked perfectly with spark-submit in standalone mode for my
project.
/**
 * A simplified method that just returns the current active/registered
 * executors, excluding the driver.
 * @param sc
 */
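The body is essentially the following (a sketch, assuming the entries returned by
getExecutorMemoryStatus are "host:port" strings and the driver host is available
as spark.driver.host):

import org.apache.spark.SparkContext

def currentActiveExecutors(sc: SparkContext): Seq[String] = {
  // one "host:port" entry per executor, plus one for the driver
  val allExecutors = sc.getExecutorMemoryStatus.keys.map(_.split(":")(0)).toSeq
  val driverHost = sc.getConf.get("spark.driver.host")
  allExecutors.filter(_ != driverHost)
}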
Nathan,
I achieve this using rowNumber. Here is a Python DataFrame example:
from pyspark.sql.window import Window
from pyspark.sql.functions import desc, rowNumber

yourOutputDF = (
    yourInputDF
        # the ordering column is assumed to be "datetime" (per the original question);
        # rows where "first" == 1 are then the oldest view per user
        .withColumn("first", rowNumber()
            .over(Window.partitionBy("userID").orderBy("datetime")))
)
In the test job I am running in Spark 1.3.1 in our stage cluster, I can see
following information on the application stage information:
Metric     Min     25th percentile   Median    75th percentile   Max
Duration   0 ms    1.1 min           1.5 min   1.7 min           3.4 min
GC Time    11 s    16 s              21 s      25 s              54 s
From the GC output log, I can see it is abou
Hi Akhil,
I'm using spark 1.4.1.
The number of executors is not in the command line, nor in
getExecutorMemoryStatus
(I already mentioned that I tried that; it works in spark-shell but not when
executed via spark-submit). I tried looking at "defaultParallelism" too,
it's 112 (7 executors * 16 cores) w
Could you periodically (say every 10 mins) run System.gc() on the driver?
The cleaning up of shuffles is tied to garbage collection.
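Something like this on the driver would do it (a rough sketch; the 10-minute
interval is arbitrary):

val gcThread = new Thread(new Runnable {
  override def run(): Unit = {
    while (true) {
      Thread.sleep(10 * 60 * 1000L)  // every 10 minutes
      System.gc()                    // lets the ContextCleaner clean up old shuffle files
    }
  }
})
gcThread.setDaemon(true)  // don't keep the driver JVM alive just for this thread
gcThread.start()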
On Fri, Aug 21, 2015 at 2:59 AM, gaurav sharma
wrote:
> Hi All,
>
>
> I have a 24x7 running Streaming Process, which runs on 2 hour windowed data
>
> The issue i
That looks like you are choking your Kafka machine. Run top on the Kafka
machines and check the workload; it may be that you are spending too much
time on disk I/O etc.
On Aug 21, 2015 7:32 AM, "Cody Koeninger" wrote:
> Sounds like that's happening consistently, not an occasional network
> prob
Which version spark are you using? There was a discussion happened over
here
http://apache-spark-user-list.1001560.n3.nabble.com/Determine-number-of-running-executors-td19453.html
http://mail-archives.us.apache.org/mod_mbox/spark-user/201411.mbox/%3ccacbyxk+ya1rbbnkwjheekpnbsbh10rykuzt-laqgpdanvhm
It may be that the version of the spark-ec2 script you are using is buggy,
or sometimes AWS has problems provisioning machines.
On Aug 21, 2015 7:56 AM, "Garry Chen" wrote:
> Hi All,
>
> I am trying to lunch a spark ec2 cluster by running
> spark-ec2 --key-pair=key --identity-file=
You can try adding a human-readable entry to the /etc/hosts file of the
worker machine and then set SPARK_LOCAL_IP to this hostname in that
machine's spark-env.sh file.
On Aug 21, 2015 11:57 AM, wrote:
> Hi,
>
> Is it possible in standalone to set up worker ID names? to avo
You can look at spark.streaming.concurrentJobs; by default it runs a
single job. If you set it to 2 then it can run 2 jobs in parallel. It's an
experimental flag, but go ahead and give it a try.
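For example, a sketch of setting it programmatically:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("MyStreamingApp")
  .set("spark.streaming.concurrentJobs", "2")  // allow 2 streaming jobs to run concurrently
val ssc = new StreamingContext(conf, Seconds(10))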
On Aug 21, 2015 3:36 AM, "Sateesh Kavuri" wrote:
> Hi,
>
> My scenario goes like this:
> I have an algorit
Did you try sorting it by datetime and doing a groupBy on the userID?
On Aug 21, 2015 12:47 PM, "Nathan Skone" wrote:
> Raghavendra,
>
> Thanks for the quick reply! I don’t think I included enough information in
> my question. I am hoping to get fields that are not directly part of the
> aggregat
You could also rename them with the names() function.
Unfortunately the API docs don't show an example of that:
https://spark.apache.org/docs/latest/api/R/index.html
On Thu, Aug 20, 2015 at 7:43 PM -0700, "Sun, Rui" wrote:
Hi,
You can create a DataFrame using load.df() with a specified schema.
Something like
Raghavendra,
Thanks for the quick reply! I don’t think I included enough information in my
question. I am hoping to get fields that are not directly part of the
aggregation. Imagine a dataframe representing website views with a userID,
datetime, and a webpage address. How could I find the oldes
Hi Sateesh,
It is interesting to know how you determined that the DStream runs on
a single core. Did you mean receivers?
Coming back to your question, could you not start the disk I/O in a separate
thread, so that the scheduler can go ahead and assign other tasks?
On 21 Aug 2015 16:06, "Sateesh Ka
You've probably hit this bug:
https://issues.apache.org/jira/browse/SPARK-7180
It's fixed in Spark 1.4.1+. Try setting spark.serializer.extraDebugInfo to
false and see if it goes away.
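For example, either pass --conf spark.serializer.extraDebugInfo=false to
spark-submit or set it on the SparkConf:

val conf = new org.apache.spark.SparkConf()
  .set("spark.serializer.extraDebugInfo", "false")  // disables the extra serialization debug path involved in SPARK-7180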
On Fri, Aug 21, 2015 at 3:37 AM, Eugene Morozov
wrote:
> Hi,
>
> I'm using spark 1.3.1 built against hadoop 1
Hi,
Is it possible in standalone to set up worker ID names? to avoid the
worker-19248891237482379-ip..-port ??
Thanks,
Saif
Thanks Reynold, that helps a lot. I'm glad you're involved with that Google
Doc community effort. I think it's because of that doc that the JEP's
wording and scope changed for the better since it originally got
introduced.
Marek
On Fri, Aug 21, 2015 at 11:18 AM, Reynold Xin wrote:
> I'm actuall
I'm actually somewhat involved with the Google Docs you linked to.
I don't think Oracle will remove Unsafe in JVM 9. As you said, JEP 260
already proposes making Unsafe available. Given the widespread use of
Unsafe for performance and advanced functionalities, I don't think Oracle
can just remove
Impact,
You can group the data by key, then sort it by timestamp and take the max to
select the oldest value.
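As a rough DataFrame sketch (df, "key" and "timestamp" are placeholders for your
actual DataFrame and columns; min() picks the oldest timestamp per key, and the
join recovers the rest of the row):

import org.apache.spark.sql.functions.min

val oldestPerKey = df.groupBy("key").agg(min("timestamp").as("oldest_ts"))
val oldestRows = df.join(oldestPerKey,
  df("key") === oldestPerKey("key") && df("timestamp") === oldestPerKey("oldest_ts"))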
On Aug 21, 2015 11:15 PM, "Impact" wrote:
> I am also looking for a way to achieve the reducebykey functionality on
> data
> frames. In my case I need to select one particular row (the oldest,
Did you try with Hadoop version 2.7.1? It is known that s3a works really
well with Parquet in 2.7. They fixed a lot of issues
related to metadata reading there...
On Aug 21, 2015 11:24 PM, "Jerrick Hoang" wrote:
> @Cheng, Hao : Physical plans show that it got stuck on scanning
@Cheng, Hao : Physical plans show that it got stuck on scanning S3!
(table is partitioned by date_prefix and hour)
explain select count(*) from test_table where date_prefix='20150819' and
hour='00';
TungstenAggregate(key=[], value=[(count(1),mode=Final,isDistinct=false)]
TungstenExchange SingleP
I am also looking for a way to achieve the reducebykey functionality on data
frames. In my case I need to select one particular row (the oldest, based on
a timestamp column value) by key.
--
I can't find a way to specify the datediff function to give me the difference
in minutes or seconds. Can anyone please help?
Thanks,
Stefan Panayotov
Sent from my Windows Phone
I believe this was caused by some network configuration on my machines. After
installing VirtualBox, some new network interfaces were installed on the
machines and the Akka software was binding to one of the VirtualBox interfaces
and not the interface that belonged to my Ethernet card. Once I d
That was only true until Spark 1.3. Spark 1.4 can be built with JDK7
and pyspark will still work.
On Fri, Aug 21, 2015 at 8:29 AM, Chen Song wrote:
> Thanks Sean.
>
> So how PySpark is supported. I thought PySpark needs jdk 1.6.
>
> Chen
>
> On Fri, Aug 21, 2015 at 11:16 AM, Sean Owen wrote:
>>
Hi Naveen,
As I mentioned before, the code is private and therefore not accessible. Just
copy and use the snippet that I sent. Copying it here again:
https://github.com/apache/spark/blob/43e0135421b2262cbb0e06aae53523f663b4f959/mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala#L270
Bes
Hi Sunil,
Have you seen this fix in Spark 1.5 that may fix the locality issue?:
https://issues.apache.org/jira/browse/SPARK-4352
On Thu, Aug 20, 2015 at 4:09 AM, Sunil wrote:
> Hello . I am seeing some unexpected issues with achieving HDFS
> data
> locality. I expect the tasks to be ex
Spark 1.4 requires Java 7.
On Fri, Aug 21, 2015, 3:12 PM Chen Song wrote:
> I tried to build Spark 1.4.1 on cdh 5.4.0. Because we need to support
> PySpark, I used JDK 1.6.
>
> I got the following error,
>
> [INFO] --- scala-maven-plugin:3.2.0:testCompile (scala-test-compile-first)
> @ spark-str
No, the messages never end. I have to Ctrl-C out of it.
Garry
From: shahid ashraf [mailto:sha...@trialx.com]
Sent: Friday, August 21, 2015 11:13 AM
To: Garry Chen
Cc: user@spark.apache.org
Subject: Re: Spark ec2 lunch problem
Does the cluster work at the end ?
On Fri, Aug 21, 2015 at 8:25 PM,
Thanks Sean.
So how is PySpark supported? I thought PySpark needed JDK 1.6.
Chen
On Fri, Aug 21, 2015 at 11:16 AM, Sean Owen wrote:
> Spark 1.4 requires Java 7.
>
> On Fri, Aug 21, 2015, 3:12 PM Chen Song wrote:
>
>> I tried to build Spark 1.4.1 on cdh 5.4.0. Because we need to support
>> PySp
Have you seen the Spark SQL paper?:
https://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
On Thu, Aug 20, 2015 at 11:35 PM, Dawid Wysakowicz <
wysakowicz.da...@gmail.com> wrote:
> Hi,
>
> thanks for answers. I have read answers you provided, but I rather look
> for some materials on
Does the cluster work in the end?
On Fri, Aug 21, 2015 at 8:25 PM, Garry Chen wrote:
> Hi All,
>
> I am trying to lunch a spark ec2 cluster by running
> spark-ec2 --key-pair=key --identity-file=my.pem --vpc-id=myvpc
> --subnet-id=subnet-011 --spark-version=1.4.1 launch spark-cl
Hi All,
I am trying to launch a Spark EC2 cluster by running spark-ec2
--key-pair=key --identity-file=my.pem --vpc-id=myvpc --subnet-id=subnet-011
--spark-version=1.4.1 launch spark-cluster, but I keep getting the following
message endlessly. Please help.
Warning: SSH connection error. (Thi
I've been able to almost halve my memory usage with no instability issues.
I lowered my storage.memoryFraction and increased my shuffle.memoryFraction
(essentially swapping them). I set spark.yarn.executor.memoryOverhead to
6GB. And I lowered executor-cores in case other jobs are using the
availab
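For reference, the equivalent settings look roughly like this (a sketch with
illustrative values; memoryOverhead is in MB):

val conf = new org.apache.spark.SparkConf()
  .set("spark.storage.memoryFraction", "0.2")        // lowered from the 0.6 default
  .set("spark.shuffle.memoryFraction", "0.6")        // raised from the 0.2 default
  .set("spark.yarn.executor.memoryOverhead", "6144") // 6GB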
Is there any reliable way to find out the number of executors
programmatically - regardless of how the job is run? A method that
preferably works for Spark standalone, YARN and Mesos, regardless of whether
the code runs from the shell or not?
Things that I tried and don't work:
- sparkContext.getExecuto
Sounds like that's happening consistently, not an occasional network
problem?
Look at the Kafka broker logs
Make sure you've configured the correct kafka broker hosts / ports (note
that direct stream does not use zookeeper host / port).
Make sure that host / port is reachable from your driver an
Have you read this ?
http://stackoverflow.com/questions/22716346/how-to-use-lz4-compression-in-linux-3-11
> On Aug 21, 2015, at 6:57 AM,
> wrote:
>
> Hi all,
>
> I am using pre-compiled spark with hadoop 2.6. LZ4 Codec is not on hadoop’s
> native libraries, so I am not being able to use
I tried to build Spark 1.4.1 on cdh 5.4.0. Because we need to support
PySpark, I used JDK 1.6.
I got the following error,
[INFO] --- scala-maven-plugin:3.2.0:testCompile (scala-test-compile-first)
@ spark-streaming_2.10 ---
java.lang.UnsupportedClassVersionError: org/apache/hadoop/io/LongWritabl
Which version of Spark are you using - the one that comes with DSE 4.7?
We just cannot reproduce it in Spark.

yzhang@localhost>$ more test.spark
val pairs = sc.makeRDD(Seq((0,1),(0,2),(1,20),(1,30),(2,40)))
pairs.reduceByKey((x,y) => x + y).collect

yzhang@localhost>$ ~/spark/bin/spark-shell --master local -i t
Hi all,
I am using pre-compiled Spark with Hadoop 2.6. The LZ4 codec is not in Hadoop's
native libraries, so I am not able to use it.
Can anyone suggest how to proceed? Hopefully I won't have to recompile
Hadoop. I tried changing the --driver-library-path to point directly to the lz4
stand a
Hi Abhishek,
I have even tried that, but rdd2 is empty.
Regards,
Satish
On Fri, Aug 21, 2015 at 6:47 PM, Abhishek R. Singh <
abhis...@tetrationanalytics.com> wrote:
> You had:
>
> > RDD.reduceByKey((x,y) => x+y)
> > RDD.take(3)
>
> Maybe try:
>
> > rdd2 = RDD.reduceByKey((x,y) => x+y)
> > rdd2.ta
You had:
> RDD.reduceByKey((x,y) => x+y)
> RDD.take(3)
Maybe try:
> rdd2 = RDD.reduceByKey((x,y) => x+y)
> rdd2.take(3)
-Abhishek-
On Aug 20, 2015, at 3:05 AM, satish chandra j wrote:
> HI All,
> I have data in RDD as mentioned below:
>
> RDD : Array[(Int),(Int)] = Array((0,1), (0,2),(1,20)
Hi All,
Any inputs on the actual problem statement?
Regards,
Satish
On Fri, Aug 21, 2015 at 5:57 PM, Jeff Zhang wrote:
> Yong, Thanks for your reply.
>
> I tried spark-shell -i , it works fine for me. Not sure the
> different with
> dse spark --master local --jars postgresql-9.4-1201.jar -i
Hello,
I attended the Tungsten-related presentations at Spark Summit (by Josh
Rosen) and at Big Data Scala (by Matei Zaharia). Needless to say, this
project holds great promise for major performance improvements.
At Josh's talk, I heard about the use of sun.misc.Unsafe as a way of
achieving some
Hi,
An exception is thrown when using a HAVING clause with variance or stddev. It
works perfectly when using other aggregate functions (like
sum, count, min, max...).
SELECT SUM(1) AS `sum_number_of_records_ok` FROM
`some_db`.`some_table` `some_table`
GROUP BY 1 HAVING (STDDEV(1) > 0)
SELECT SUM(1) AS `sum_
Hi *,
We are trying to run Spark on top of Mesos using fine-grained mode. While
talking to a few people I came to know that running a Spark job in
fine-grained mode on Mesos is not a good idea.
I could not find anything regarding fine-grained mode getting deprecated,
nor whether coarse-grained mode
2015-08-21 3:17 GMT-07:00 smagadi :
> teenagers .toJSON gives the json but it does not preserve the parent ids
>
> meaning if the input was {"name":"Yin",
> "address":{"city":"Columbus","state":"Ohio"},"age":20}
>
> val x= sqlContext.sql("SELECT name, address.city, address.state ,age FROM
> people
I believe "spark-shell -i scriptFile" is there. We also use it, at least in
Spark 1.3.1.
"dse spark" will just wrap "spark-shell" command, underline it is just invoking
"spark-shell".
I don't know too much about the original problem though.
Yong
Date: Fri, 21 Aug 2015 18:19:49 +0800
Subject: Re:
This is something of a wild guess, but I find that when executors start
disappearing for no obvious reason, this is usually because the YARN
node managers have decided that the containers are using too much memory and
have then terminated the executors.
Unfortunately, to see evidence of this, one needs
Hi,
I'm using spark 1.3.1 built against hadoop 1.0.4 and java 1.7 and I'm
trying to save my data frame to parquet.
The issue I'm stuck on looks like serialization trying to do a pretty weird
thing: it tries to write to an empty array.
The last (per the stack trace) line of Spark code that leads to the excepti
Hi,
My scenario goes like this:
I have an algorithm running in Spark streaming mode on a 4 core virtual
machine. Majority of the time, the algorithm does disk I/O and database
I/O. Question is, during the I/O, where the CPU is not considerably loaded,
is it possible to run any other task/thread so
The ML package provides Machine Learning pipelines so that users can do
machine learning more efficiently.
It's more general to chain StringIndexer with any kind of Estimator.
I think we can make StringIndexer and the reverse process automatic in the
future.
If you want to know your original labels, you ca
Hi Satish,
I don't see where Spark supports "-i", so I suspect it is provided by DSE. In
that case, it might be a bug in DSE.
On Fri, Aug 21, 2015 at 6:02 PM, satish chandra j
wrote:
> HI Robin,
> Yes, it is DSE but issue is related to Spark only
>
> Regards,
> Satish Chandra
>
> On Fri, Aug 21, 2
teenagers.toJSON gives the JSON but it does not preserve the parent IDs,
meaning if the input was {"name":"Yin",
"address":{"city":"Columbus","state":"Ohio"},"age":20}
val x= sqlContext.sql("SELECT name, address.city, address.state ,age FROM
people where age>19 and age <=30 ").toJSON
x.collect
HI Robin,
Yes, it is DSE but issue is related to Spark only
Regards,
Satish Chandra
On Fri, Aug 21, 2015 at 3:06 PM, Robin East wrote:
> Not sure, never used dse - it’s part of DataStax Enterprise right?
>
> On 21 Aug 2015, at 10:07, satish chandra j
> wrote:
>
> HI Robin,
> Yes, below mention
Hi All,
I have a 24x7 running Streaming Process, which runs on 2 hour windowed data
The issue I am facing is that my worker machines are running OUT OF DISK space.
I checked, and the SHUFFLE FILES are not getting cleaned up.
/log/spark-2b875d98-1101-4e61-86b4-67c9e71954cc/executor-5bbb53c1-cee9-4438
Yes, DSE 4.7
Regards,
Satish Chandra
On Fri, Aug 21, 2015 at 3:06 PM, Robin East wrote:
> Not sure, never used dse - it’s part of DataStax Enterprise right?
>
> On 21 Aug 2015, at 10:07, satish chandra j
> wrote:
>
> HI Robin,
> Yes, below mentioned piece or code works fine in Spark Shell but
The easiest option I found is to put the jars in the SPARK CLASSPATH.
On 21 Aug 2015 06:20, "Burak Yavuz" wrote:
> If you would like to try using spark-csv, please use
> `pyspark --packages com.databricks:spark-csv_2.11:1.2.0`
>
> You're missing a dependency.
>
> Best,
> Burak
>
> On Thu, Aug 20, 2015 at 1:0
Please try DataFrame.toJSON; it will give you an RDD of JSON strings.
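For example, with the query from your mail (assuming the people table is
registered as before):

val teenagers = sqlContext.sql(
  "SELECT name FROM people WHERE age >= 13 AND age <= 19")

val teenagersJson = teenagers.toJSON      // an RDD[String], one JSON document per row
teenagersJson.collect().foreach(println)  // prints one JSON string per row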
At 2015-08-21 15:59:43, "smagadi" wrote:
>val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND
>age <= 19")
>
>I need teenagers to be a JSON object rather a simple row .How can we get
>that done ?
>
HI Robin,
Yes, the below-mentioned piece of code works fine in the Spark shell, but when
the same is placed in a script file and executed with -i, it creates an
empty RDD
scala> val pairs = sc.makeRDD(Seq((0,1),(0,2),(1,20),(1,30),(2,40)))
pairs: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[77]
Hi
I am getting the below error in Spark Streaming 1.3 while consuming from Kafka
using the direct Kafka stream. A few tasks are failing in each run.
What is the reason for / solution to this error?
15/08/21 08:54:54 ERROR executor.Executor: Exception in task 262.0 in
stage 130.0 (TID 16332)
java.io.EOF
It seems like you want simultaneous processing of multiple jobs but, at the
same time, serialization of a few tasks within those jobs. I don't know how to
achieve that in Spark.
But why would you worry about the interleaved processing when the data
that is being aggregated in different jobs is per
Have you considered asking this question on
https://groups.google.com/a/lists.datastax.com/forum/#!forum/spark-connector-user
?
Cheers
On Thu, Aug 20, 2015 at 10:57 PM, Samya wrote:
> Hi All,
>
> I need to write an RDD to Cassandra & using the sparkCassandraConnector
> from
> DataStax. My appli
You may have seen this:
http://search-hadoop.com/m/q3RTt31EKP1BTHxY1
On Fri, Aug 21, 2015 at 1:01 AM, smagadi wrote:
> Does spark sql supports XML the same way as it supports json ?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-SQL-support
It's good
Thanks for your reply Michael.
On Thu, Aug 20, 2015 at 11:03 PM, Michael Armbrust
wrote:
> We will probably fix this in Spark 1.6
>
> https://issues.apache.org/jira/browse/SPARK-10040
>
> On Thu, Aug 20, 2015 at 5:18 AM, Aram Mkrtchyan <
> aram.mkrtchyan...@gmail.com> wrote:
>
>> We wa
The OP wants to understand what determines the size of the task code that is
shipped to each executor so it can run the task. I don't know the answer, but I
would be interested to know too.
Sent from my iPhone
> On 21 Aug 2015, at 08:26, oubrik [via Apache Spark User List]
> wrote:
>
> Hi
>
Does Spark SQL support XML the same way as it supports JSON?
--
--
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND
age <= 19")
I need teenagers to be a JSON object rather than a simple row. How can we get
that done?
--
Hi,
I was trying to programmatically specify a schema and apply it to an RDD of
Rows and save the resulting DataFrame as a parquet file, but I got
"java.lang.ClassCastException:
java.lang.String cannot be cast to java.lang.Long" on the last step.
Here's what I did:
1. Created an RDD of Rows from
Hi,
I was trying to programmatically specify a schema and apply it to an RDD of
Rows and save the resulting DataFrame as a parquet file.
Here's what I did:
1. Created an RDD of Rows from RDD[Array[String]]:
val gameId= Long.valueOf(line(0))
val accountType = Long.valueOf(line(1))
val w
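Roughly, the flow I am attempting looks like this (a simplified sketch; field
names as above, everything else made up):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val schema = StructType(Seq(
  StructField("gameId", LongType, nullable = false),
  StructField("accountType", LongType, nullable = false)))

// every field is converted to Long so that it matches the LongType schema
val rowRDD = lines.map(_.split(",")).map(a => Row(a(0).toLong, a(1).toLong))

val df = sqlContext.createDataFrame(rowRDD, schema)
df.saveAsParquetFile("/tmp/out.parquet")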