Thanks
From: Nick Pentreath [mailto:nick.pentre...@gmail.com]
Sent: Tuesday, April 07, 2015 5:52 PM
To: Puneet Kumar Ojha
Cc: user@spark.apache.org
Subject: Re: Difference between textFile Vs hadoopFile (textInoutFormat) on
HDFS data
There is no difference - textFile calls hadoopFile with a Text
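In Spark 1.x the two end up doing the same work; here is a rough sketch of the
equivalent user-side call (assuming sc is a SparkContext, e.g. from spark-shell,
and the path is just a placeholder):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// roughly what sc.textFile(path) does under the hood
val lines = sc.hadoopFile("hdfs:///some/path", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text])
  .map { case (_, text) => text.toString }  // drop the byte offset, keep the line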
Hi Cheng,
I tried both these patches, and it seems they still do not resolve my issue. I
found that most of the time is spent on this line in newParquet.scala:
ParquetFileReader.readAllFootersInParallel(
sparkContext.hadoopConfiguration, seqAsJavaList(leaves), taskSideMetaData)
which needs to read all the files.
I use EMR 3.3.1 which comes with Java 7. Do you think that this may cause
the issue? Did you test it with Java 8?
Hi All, how can I subscribe myself to this group so that every mail sent to
this group comes to me as well?
I already sent a request to user-subscr...@spark.apache.org, but still I am not
getting the mail sent to this group by other people.
Regards
Jeetendra
Check your spam folder or any filters.
On Wed, Apr 8, 2015 at 2:17 PM, Jeetendra Gangele
wrote:
> Hi All how can I subscribe myself in this group so that every mail sent to
> this group comes to me as well.
> I already sent request to user-subscr...@spark.apache.org ,still Iam not
> getting mail sent t
Hi folks,
I am writing to ask how to filter and partition a set of files through Spark.
The situation is that I have N big files (they cannot fit on a single machine),
and each line of the files starts with a category (say Sport, Food, etc.), while
there are actually fewer than 100 categories. I need a progr
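From the part of the question that made it into the archive, one possible
approach is the sketch below (it assumes comma-delimited lines that start with
the category; the paths and category names are placeholders, and sc is a
SparkContext):

val lines = sc.textFile("hdfs:///input/*")                 // the N big files
val byCategory = lines.map { line =>
  val category = line.takeWhile(_ != ',')                  // assumes "Category,rest..." lines
  (category, line)
}
val wanted = Set("Sport", "Food")                          // example filter
val filtered = byCategory.filter { case (cat, _) => wanted.contains(cat) }
// with fewer than 100 categories, hashing into 100 partitions groups all lines
// of a category into the same partition (hash collisions may share a partition)
val partitioned = filtered.partitionBy(new org.apache.spark.HashPartitioner(100))
partitioned.values.saveAsTextFile("hdfs:///output")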
I have a spark stage that has 8 tasks. 7/8 have completed. However 1 task
is failing with Cannot find address
Aggregated Metrics by Executor (columns: Executor ID, Address, Task Time,
Total Tasks, Failed Tasks, Succeeded Tasks, Shuffle Read Size / Records,
Shuffle Write Size / Records, Shuffle Spill (Memory), Shuffle Spill
Spark Version 1.3
Command:
./bin/spark-submit -v --master yarn-cluster --driver-class-path
/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-company-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-company/share/hadoop/yarn/lib/guava-11.0.2.jar:/apache/hadoop-2.4.1
This means the Spark workers exited with code "15"; it is probably nothing
YARN-related itself (unless there are classpath-related problems).
Have a look at the logs of the app/container via the resource manager. You can
also increase the time that logs get kept on the nodes themselves to something
Hi Michael,
In fact, I find that all workers hang while the SQL/DF join is running.
So I picked the master and one of the workers. jstack is the following:
Master
2015-04-08
If you are using a Spark Standalone deployment, make sure you set
WORKER_MEMORY to over 20G, and that you actually have 20G of physical memory.
Yong
> Date: Tue, 7 Apr 2015 20:58:42 -0700
> From: li...@adobe.com
> To: user@spark.apache.org
> Subject: EC2 spark-submit --executor-memory
>
> Dear Spark team,
>
>
To use the HiveThriftServer2.startWithContext, I thought one would use the
following artifact in the build:
"org.apache.spark"%% "spark-hive-thriftserver" % "1.3.0"
But I am unable to resolve the artifact. I do not see it in maven central
or any other repo. Do I need to build Spark and p
Hi guys
I’ve got:
180 days of log data in Parquet.
Each day is stored in a separate folder in S3.
Each day consists of 20-30 Parquet files of 256 MB each.
Spark 1.3 on Amazon EMR
This makes approximately 5000 Parquet files with a total size of 1.5 TB.
My code:
val in = sqlContext.parquetFile(“da
How do I build Spark SQL Avro Library for Spark 1.2 ?
I was following this https://github.com/databricks/spark-avro and was able
to build spark-avro_2.10-1.0.0.jar by simply running sbt/sbt package from
the project root.
But we are on Spark 1.2 and need a compatible spark-avro jar.
Any idea how do
Please email user-subscr...@spark.apache.org
> On Apr 8, 2015, at 6:28 AM, Idris Ali wrote:
>
>
Hi,
We have a cluster running CDH 5.3.2 and Spark 1.2 (which is the current
version in CDH 5.3.2), but we want to try Spark 1.3 without breaking the
existing setup. Is it possible to have Spark 1.3 on the existing setup?
Thanks
I am trying to run a Spark application using spark-submit on a cluster
using Cloudera manager. I get the error
"Exception in thread "main" java.io.IOException: Error in creating log
directory: file:/user/spark/applicationHistory//app-20150408094126-0008"
Adding the below lines in /etc/spark/conf/
You may have seen this thread: http://search-hadoop.com/m/JW1q5SlRpt1
Cheers
On Wed, Apr 8, 2015 at 6:15 AM, Eric Eijkelenboom <
eric.eijkelenb...@gmail.com> wrote:
> Hi guys
>
> *I’ve got:*
>
>- 180 days of log data in Parquet.
>- Each day is stored in a separate folder in S3.
>- Ea
Did anybody by any chance have a look at this bug? It keeps happening to
me, and it's quite blocking. I would like to understand if there's something
wrong in what I'm doing, or whether there's a workaround or not.
Thank you all,
--
Dott. Stefano Parmesan
Backend Web Developer and Data Lover ~
Yes, should be fine since you are running on YARN. This is probably more
appropriate for the cdh-user list.
On Apr 8, 2015 9:35 AM, "roy" wrote:
> Hi,
>
> We have cluster running on CDH 5.3.2 and Spark 1.2 (Which is current
> version in CDH5.3.2), But We want to try Spark 1.3 without breaking
>
It should be noted that I'm a newbie to Spark, so please have patience ...
I'm trying to convert an existing application over to Spark and am running
into some "high level" questions that I can't seem to resolve, possibly
because what I'm trying to do is not supported.
In a nutshell as I process t
+1
Interestingly, I ran into exactly the same issue yesterday. I couldn't
find any documentation about which project to include as a dependency in
build.sbt to use HiveThriftServer2. Would appreciate help.
Mohammed
From: Todd Nist [mailto:tsind...@gmail.com]
Sent: Wednesday, April 8, 2015
Hi folks, I am noticing a pesky and persistent warning in my logs (this is
from Spark 1.2.1):
15/04/08 15:23:05 WARN ShellBasedUnixGroupsMapping: got exception
trying to get groups for user anonymous
org.apache.hadoop.util.Shell$ExitCodeException: id: anonymous: No such user
at org.apach
I will look into this today.
On Wed, Apr 8, 2015 at 7:35 AM, Stefano Parmesan wrote:
> Did anybody by any chance had a look at this bug? It keeps on happening to
> me, and it's quite blocking, I would like to understand if there's something
> wrong in what I'm doing, or whether there's a workarou
Hi All,
In some cases, I get the exception below when I run Spark in local mode (I
haven't seen this in a cluster). This is weird, but it also affects my local
unit test cases (it does not always happen, but usually once per 4-5 runs). From
the stack, it looks like the error happens when creating the context, bu
I am trying to start the worker by:
sbin/start-slave.sh spark://ip-10-241-251-232:7077
In the logs it's complaining about:
Master must be a URL of the form spark://hostname:port
I also have this in spark-defaults.conf
spark.master spark://ip-10-241-251-232:7077
Did I miss
"spark.eventLog.dir" should contain the full HDFS URL. In general,
this should be sufficient:
spark.eventLog.dir=hdfs:/user/spark/applicationHistory
On Wed, Apr 8, 2015 at 6:45 AM, Vijayasarathy Kannan wrote:
> I am trying to run a Spark application using spark-submit on a cluster using
> Cloud
There are a couple of options. Increase timeout (see Spark configuration).
Also see past mails in the mailing list.
Another option you may try (I have a gut feeling that it may work, but I am not
sure) is calling GC on the driver periodically. The cleaning up of stuff is
tied to GCing of RDD objects a
Hi,
Does SparkContext's textFile() method handle files with Unicode characters?
How about files in UTF-8 format?
Going further, is it possible to specify encodings to the method? If not,
what should one do if the files to be read are in some encoding?
Thanks,
arun
Some additional context:
Since I am using features of Spark 1.3.0, I have downloaded Spark 1.3.0 and
used spark-submit from there.
The cluster is still on Spark 1.2.0.
So it looks to me like, at runtime, the executors could not find some
libraries of Spark 1.3.0, even though I ran spark-subm
Spark uses the Hadoop TextInputFormat to read the file. Since Hadoop pretty
much only supports Linux, UTF-8 is the only encoding supported, as it is the
one used on Linux.
If you have data in another encoding, you may want to vote for this
JIRA: https://issues.apache.org/jira/browse/MAPREDUCE-232
Yon
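One possible workaround (a sketch only, assuming the charset is known, the
records are newline-delimited, and sc is a SparkContext): read the raw Text
bytes via hadoopFile and decode them yourself:

import java.nio.charset.Charset
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

val charset = Charset.forName("ISO-8859-1")                // example encoding
val decoded = sc.hadoopFile("hdfs:///path/to/files", classOf[TextInputFormat],
    classOf[LongWritable], classOf[Text])
  // Text.getBytes returns the backing array, so only read up to getLength
  .map { case (_, text) => new String(text.getBytes, 0, text.getLength, charset) }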
Hi all,
I am using Spark Streaming to monitor an S3 bucket for objects that contain
JSON. I want
to import that JSON into Spark SQL DataFrame.
Here's my current code:
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext
import json
from pyspark.sql im
Thanks for the report. We improved the speed here in 1.3.1 so would be
interesting to know if this helps. You should also try disabling schema
merging if you do not need that feature (i.e. all of your files are the
same schema).
sqlContext.load("path", "parquet", Map("mergeSchema" -> "false"))
I think your thread dump for the master is actually just a thread dump for
SBT that is waiting on a forked driver program.
...
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x7fed624ff528> (a java.lang.UNIXProcess)
at java.lang.Obj
Hi Muhammad,
There are lots of ways to do it. My company actually develops a text
mining solution which embeds a very fast Approximate Neighbours solution
(a demo with real time queries on the wikipedia dataset can be seen at
wikinsights.org). For the record, we now prepare a dataset of 4.5
m
Sorry guys. I didn't realize that
https://issues.apache.org/jira/browse/SPARK-4925 was not fixed yet.
You can publish locally in the mean time (sbt/sbt publishLocal).
On Wed, Apr 8, 2015 at 8:29 AM, Mohammed Guller
wrote:
> +1
>
>
>
> Interestingly, I ran into the exactly the same issue yeste
Back to the user list so everyone can see the result of the discussion...
Ah. It all makes sense now. The issue is that when I created the parquet
> files, I included an unnecessary directory name (data.parquet) below the
> partition directories. It’s just a leftover from when I started with
> Mic
Michael,
Thank you!
Looks like the sbt build is broken for 1.3. I downloaded the source code for
1.3, but I get the following error a few minutes after I run “sbt/sbt
publishLocal”
[error] (network-shuffle/*:update) sbt.ResolveException: unresolved dependency:
org.apache.spark#spark-network-co
I am seeing the following, is this because of my maven version?
15/04/08 15:42:22 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
ip-10-241-251-232.us-west-2.compute.internal):
java.io.InvalidClassException: org.apache.spark.Aggregator; local class
incompatible: stream classdesc serialVers
Hi,
I have an RDD with objects containing Joda's LocalDate. When trying to save
the RDD as Parquet, I get an exception. Here is the code:
-
val sqlC = new org.apache.spark.sql.SQLContext(sc)
import sqlC._
myRDD.s
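One possible workaround, sketched under the assumption that the RDD elements
look like a hypothetical case class with a Joda LocalDate field (the names
below are made up): convert the LocalDate to java.sql.Date, which Spark SQL
knows how to store, before creating the DataFrame.

import java.sql.Date
import org.joda.time.LocalDate

case class Event(id: String, date: LocalDate)      // hypothetical original element type
case class EventRow(id: String, date: Date)        // Parquet-friendly version

// assumes myRDD holds Event-like objects
val converted = myRDD.map { e =>
  EventRow(e.id, new Date(e.date.toDate.getTime))  // Joda LocalDate -> java.sql.Date
}
sqlC.createDataFrame(converted).saveAsParquetFile("myrdd.parquet")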
I am trying to unit test some code which takes an existing HiveContext and
uses it to execute a CREATE TABLE query (among other things). Unfortunately
I've run into some hurdles trying to unit test this, and I'm wondering if
anyone has a good approach.
The metastore DB is automatically created in
When I call transform or foreachRDD on a DStream, I keep getting an
error that I have an empty RDD, which makes sense since my batch interval
may be smaller than the rate at which new data is coming in. How do I guard
against it?
Thanks,
Vadim
What version of Java do you use to build ?
Cheers
On Wed, Apr 8, 2015 at 12:43 PM, Mohit Anchlia
wrote:
> I am seeing the following, is this because of my maven version?
>
> 15/04/08 15:42:22 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
> ip-10-241-251-232.us-west-2.compute.internal)
Hi TD,
Thanks for the response. Since you mentioned GC, this got me thinking.
Given that we are running in local mode (all in a single JVM) for now, does
the option "spark.executor.extraJavaOptions" set to
"-XX:+UseConcMarkSweepGC" inside SparkConf object take effect at all before
we use it to cr
It does take effect on the executors, not on the driver. Which is okay,
because the executors have all the data and therefore have GC issues; not so,
usually, for the driver. If you want to be doubly sure, print the JVM flags
(e.g. http://stackoverflow.com/questions/10486375/print-all-jvm-flags)
However, th
Please take a look at
sql/hive/src/main/scala/org/apache/spark/sql/hive/test/TestHive.scala :
protected def configure(): Unit = {
  warehousePath.delete()
  metastorePath.delete()
  setConf("javax.jdo.option.ConnectionURL",
    s"jdbc:derby:;databaseName=$metastorePath;create=true")
Hi Mohammed,
I think you just need to add -DskipTests to your build. Here is how I built
it:
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver
-DskipTests clean package install
build/sbt, however, fails even when only doing package, which should skip the
tests.
I am able to b
Which version of Joda are you using ?
Here is snippet of dependency:tree out w.r.t. Joda :
[INFO] +- org.apache.flume:flume-ng-core:jar:1.4.0:compile
...
[INFO] | +- joda-time:joda-time:jar:2.1:compile
FYI
On Wed, Apr 8, 2015 at 12:53 PM, Patrick Grandjean
wrote:
> Hi,
>
> I have an RDD with
Since we are running in local mode, won't all the executors be in the same
JVM as the driver?
Thanks
NB
On Wed, Apr 8, 2015 at 1:29 PM, Tathagata Das wrote:
> Its does take effect on the executors, not on the driver. Which is okay
> because executors have all the data and therefore have GC issu
I am loading some avro data into spark using the following code:
sqlContext.sql("CREATE TEMPORARY TABLE foo USING com.databricks.spark.avro
OPTIONS (path 'hdfs://*.avro')")
The avro data contains some binary fields that get translated to the
BinaryType data type. I am struggling with how to use
A more generic version of the question below:
Is it possible to append a column to an existing DataFrame at all? I understand
that this is not an easy task in the Spark environment, but is there any
workaround?
Hi,
Thanks a lot for such a detailed response.
On Wed, Apr 8, 2015 at 8:55 PM, Guillaume Pitel
wrote:
> Hi Muhammad,
>
> There are lots of ways to do it. My company actually develops a text
> mining solution which embeds a very fast Approximate Neighbours solution (a
> demo with real time quer
Hi Xiangrui,
I tried running this on my local machine (laptop) and got the same error:
Here is what I did:
1. downloaded the Spark 1.3.0 release version (prebuilt for Hadoop 2.4 and later)
"spark-1.3.0-bin-hadoop2.4.tgz".
2. Ran the following command:
spark-submit --class ALSNew --master local[8
Hi Bojan,
Could you please expand on your idea of how to append to an RDD? I can think of
how to append a constant value to each row of an RDD:
// oldRDD - RDD[Array[String]]
val c = "const"
val newRDD = oldRDD.map(r => c +: r)
But how do I append a custom column to the RDD? Something like:
val colToAppend = sc.ma
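Two sketches using the names from the question (oldRDD: RDD[Array[String]]):
if the column values live in a second RDD, RDD.zip can pair them up, but it
requires both RDDs to have the same number of partitions and the same number
of elements per partition; if the new column only depends on the position,
zipWithIndex avoids the second RDD entirely.

// colToAppend: RDD[String], built so it lines up element-for-element with oldRDD
val withCol = oldRDD.zip(colToAppend).map { case (row, c) => row :+ c }

// position-based column: zipWithIndex is 0-based
val withIndex = oldRDD.zipWithIndex().map { case (row, i) => row :+ i.toString }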
Hi all,
I figured it out! The DataFrames and SQL example in the Spark Streaming docs
was useful.
Best,
Vadim
On Wed, Apr 8, 2015 at 2:38 PM, Vadim Bichutskiy wrote:
> Hi all,
>
> I am using Spark Streaming to monitor an S3 bucket for objects that
> contain JSON. I want
> to import that JSON int
Hi,
If I perform a sortByKey(true, 2).saveAsTextFile("filename") on a cluster,
will the data be sorted per partition, or in total? (And is this
guaranteed?)
Example:
Input 4,2,3,6,5,7
Sorted per partition:
part-0: 2,3,7
part-1: 4,5,6
Sorted in total:
part-0: 2,3,4
part-1: 5,6,7
You could convert the DF to an RDD, then add the new column in a map phase or
in a join, and then convert back to a DF. I know this is not an elegant
solution, and maybe it is not a solution at all. :) But this is the first
thing that popped into my mind.
I am also new to the DF API.
Best
Bojan
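A rough sketch of that round trip in Spark 1.3 (df, the new column name and
the constant value are placeholders):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val withExtra = df.rdd.map(row => Row.fromSeq(row.toSeq :+ "const"))  // append a value to each row
val newSchema = StructType(df.schema.fields :+
  StructField("newCol", StringType, nullable = true))
val newDF = sqlContext.createDataFrame(withExtra, newSchema)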
On Apr 9, 2015 00:37, "olegsh
Thanks TD. I believe that might have been the issue. Will try for a few
days after passing in the GC option on the java command line when we start
the process.
Thanks for your timely help.
NB
On Wed, Apr 8, 2015 at 6:08 PM, Tathagata Das wrote:
> Yes, in local mode they the driver and executor
Yes, in local mode the driver and executor will be the same process, and in
that case the Java options in the SparkConf configuration will not work.
On Wed, Apr 8, 2015 at 1:44 PM, N B wrote:
> Since we are running in local mode, won't all the executors be in the same
> JVM as the driver?
>
>
See the scaladoc from OrderedRDDFunctions.scala :
* Sort the RDD by key, so that each partition contains a sorted range of
the elements. Calling
* `collect` or `save` on the resulting RDD will return or output an
ordered list of records
* (in the `save` case, they will be written to multi
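A small local sketch with the numbers from the question (assuming a
spark-shell sc) that illustrates the "sorted in total" layout:

val sorted = sc.parallelize(Seq(4, 2, 3, 6, 5, 7)).map(x => (x, x)).sortByKey(true, 2)
sorted.keys.glom().collect()
// expected to look roughly like Array(Array(2, 3, 4), Array(5, 6, 7)):
// each partition holds a sorted, non-overlapping range, so saveAsTextFile
// writes part files that together form a totally ordered output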
bq. one is Oracle and the other is OpenJDK
I don't have experience with mixed JDKs.
Can you try using a single JDK?
Cheers
On Wed, Apr 8, 2015 at 3:26 PM, Mohit Anchlia
wrote:
> For the build I am using java version "1.7.0_65" which seems to be the
> same as the one on the spark host. How
Hi Eric - Would you mind trying either disabling schema merging, as Michael
suggested, or disabling the new Parquet data source with
sqlContext.setConf("spark.sql.parquet.useDataSourceApi", "false")
Cheng
On 4/9/15 2:43 AM, Michael Armbrust wrote:
Thanks for the report. We improved the spee
Hey Patrick, Michael and Todd,
Thank you for your help!
As you guys recommended, I did a local install and got my code to compile.
As an FYI, on my local machine the sbt build fails even if I add -DskipTests.
So I used mvn.
Mohammed
From: Patrick Wendell [mailto:patr...@databricks.com]
Sent:
On 4/9/15 3:09 AM, Michael Armbrust wrote:
Back to the user list so everyone can see the result of the discussion...
Ah. It all makes sense now. The issue is that when I created the
parquet files, I included an unnecessary directory name
(data.parquet) below the partition directori
Aah yes. The jsonRDD method needs to walk through the whole RDD to
understand the schema, and does not work if there is no data in it. Checking
that there is data in it using take(1) should work.
TD
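In Scala the guard could look like this sketch (dstream and sqlContext stand
in for the stream and SQLContext in your application):

dstream.foreachRDD { rdd =>
  if (rdd.take(1).nonEmpty) {          // skip empty batches
    val df = sqlContext.jsonRDD(rdd)   // rdd: RDD[String] of JSON documents
    // ... work with df
  }
}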
It's because your tests are running in parallel and you can only have one
context running at a time.
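If the tests run under sbt, one common way to serialize them (a sketch; your
build may differ) is to set this in build.sbt:

// run test suites one at a time, so only one SparkContext exists at once
parallelExecution in Test := false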
I wanted to run groupBy(partition) but this is not working.
Here the first part (key) of pairvendorData will be repeated for multiple
second parts (values). Both are objects; do I need to override equals and
hashCode? Is groupBy fast enough?
JavaPairRDD pairvendorData
= matchRdd.flatMapToPair(new PairFlatMapFunc
Thanks!
arun
On Wed, Apr 8, 2015 at 10:51 AM, java8964 wrote:
> Spark use the Hadoop TextInputFormat to read the file. Since Hadoop is
> almost only supporting Linux, so UTF-8 is the only encoding supported, as
> it is the the one on Linux.
>
> If you have other encoding data, you may want to v
Thanks TD!
> On Apr 8, 2015, at 9:36 PM, Tathagata Das wrote:
>
> Aah yes. The jsonRDD method needs to walk through the whole RDD to understand
> the schema, and does not work if there is no data in it. Checking that there
> is data in it using take(1) should work.
>
> TD
--
We noticed similar perf degradation using Parquet (outside of Spark), and it
happened due to merging of multiple schemas. It would be good to know if
disabling schema merging (if the schema is the same), as Michael suggested,
helps in your case.
On Wed, Apr 8, 2015 at 11:43 AM, Michael Armbrust
wrote:
>
Please take a look at zipWithIndex() of RDD.
Cheers
On Wed, Apr 8, 2015 at 3:40 PM, Jeetendra Gangele
wrote:
> Hi All, I have an RDD of SomeObject and I want to convert it to an RDD of
> (SomeObject, sequence number) pairs, where the sequence number is 1 for the
> first SomeObject, 2 for the second SomeObject, and so on.
>
>
> Regards
> jeet
>
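A sketch of what that looks like (rdd stands for the questioner's RDD of
SomeObject):

// zipWithIndex is 0-based, so add 1 to get 1, 2, 3, ...
val numbered = rdd.zipWithIndex().map { case (obj, i) => (obj, i + 1) }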
The Thrift server doesn't support authentication or Hadoop doAs yet, so
you can simply ignore this warning.
To avoid this, when connecting via JDBC you may specify the user to the
same user who starts the Thrift server process. For Beeline, use "-n
".
On 4/8/15 11:49 PM, Yana Kadiyska wrote:
What is the computation you are doing in the foreachRDD, that is throwing
the exception?
One way to guard against it is to do a take(1) to see whether you get back any
data. If there is none, then don't do anything with the RDD.
TD
On Wed, Apr 8, 2015 at 1:08 PM, Vadim Bichutskiy wrote:
> When I call *
Hi All, I have an RDD of SomeObject and I want to convert it to an RDD of
(SomeObject, sequence number) pairs, where the sequence number is 1 for the
first SomeObject, 2 for the second SomeObject, and so on.
Regards
jeet
For the build I am using java version "1.7.0_65" which seems to be the same
as the one on the spark host. However one is Oracle and the other is
OpenJDK. Does that make any difference?
On Wed, Apr 8, 2015 at 1:24 PM, Ted Yu wrote:
> What version of Java do you use to build ?
>
> Cheers
>
> On We