Re: Join gets stuck in the last stage step

2015-01-08 Thread paja
Just to demonstrate the BIG difference between an ordinary task (id 450) and the last remaining task (id 0). Task table columns: Index | ID | Attempt | Status ▾ | Locality Level | Launch Time | Duration | GC Time | Shuffle Read | Shuffle Spill (Memory) | Shuffle Spill (Disk) | Errors. Row: 0 | 24130 | RUNNING | 2

RE: Spark History Server can't read event logs

2015-01-08 Thread michael.england
Hi Vanzin, I am using the MapR distribution of Hadoop. The history server logs are created by a job with the permissions: drwxrwx--- - 2 2015-01-08 09:14 /apps/spark/historyserver/logs/spark-1420708455212 However, the permissions of the higher directories are mapr:mapr and th

Re: spark-network-yarn 2.11 depends on spark-network-shuffle 2.10

2015-01-08 Thread Aniket Bhatnagar
Actually it does cause builds with SBT 0.13.7 to fail with the error "Conflicting cross-version suffixes". I have raised a defect SPARK-5143 for this. On Wed Jan 07 2015 at 23:44:21 Marcelo Vanzin wrote: > This particular case shouldn't cause problems since both of those > libraries are java-on

Re: Spark Standalone Cluster not correctly configured

2015-01-08 Thread frodo777
Hello everyone. With respect to the configuration problem that I explained before, do you have any idea what is wrong there? The problem in a nutshell: - When more than one master is started in the cluster, all of them are scheduling independently, thinking they are all leaders. - zookeeper

Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Cheng Lian
Hey Xuelin, which data item in the Web UI did you check? On 1/7/15 5:37 PM, Xuelin Cao wrote: Hi, Curious and curious. I'm puzzled by the Spark SQL cached table. Theoretically, the cached table should be a columnar table, and only scan the columns that are included in my SQL. However, in my test,

Spark 1.2.0 ec2 launch script hadoop native libraries not found warning

2015-01-08 Thread critikaled
Hi, I'm facing this error on a Spark EC2 cluster: when a job is submitted it says that native Hadoop libraries are not found. I have checked spark-env.sh and all the folders in the path but am unable to find the problem, even though the folders do contain them. Are there any performance drawbacks if we use i

Several applications share the same Spark executors (or their cache)

2015-01-08 Thread preeze
Hi all, We have a web application that connects to a Spark cluster to trigger some calculation there. It also caches a big amount of data in the Spark executors' cache. To meet high availability requirements we need to run 2 instances of our web application on different hosts. Doing this straightfo

Trying to execute Spark in Yarn

2015-01-08 Thread Guillermo Ortiz
I'm trying to execute Spark from a Hadoop cluster. I have created this script to try it:
#!/bin/bash
export HADOOP_CONF_DIR=/etc/hadoop/conf
SPARK_CLASSPATH=""
for lib in `ls /user/local/etc/lib/*.jar`
do
  SPARK_CLASSPATH=$SPARK_CLASSPATH:$lib
done
/home/spark-1.1.1-bin-hadoop2.4/bin/spark

SPARKonYARN failing on CDH 5.3.0 : container cannot be fetched because of NumberFormatException

2015-01-08 Thread Mukesh Jha
Hi Experts, I am running Spark inside a YARN job. The spark-streaming job is running fine in CDH-5.0.0 but after the upgrade to 5.3.0 it cannot fetch containers with the below errors. Looks like the container id is incorrect and a string is present in a place where it's expecting a number. java.l

Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Xuelin Cao
Hi Cheng, I checked the input data for each stage. For example, in my attached screen snapshot, the input data is 1212.5MB, which is the total amount of the whole table. [image: Inline image 1] And I also checked the input data for each task (in the stage detail page). And the sum of the

Re: Trying to execute Spark in Yarn

2015-01-08 Thread Shixiong Zhu
`--jars` accepts a comma-separated list of jars. See the usage about `--jars` --jars JARS Comma-separated list of local jars to include on the driver and executor classpaths. Best Regards, Shixiong Zhu 2015-01-08 19:23 GMT+08:00 Guillermo Ortiz : > I'm trying to execute Spark from a Hadoop Cl

Re: Trying to execute Spark in Yarn

2015-01-08 Thread Guillermo Ortiz
thanks! 2015-01-08 12:59 GMT+01:00 Shixiong Zhu : > `--jars` accepts a comma-separated list of jars. See the usage about > `--jars` > > --jars JARS Comma-separated list of local jars to include on the driver and > executor classpaths. > > > > Best Regards, > > Shixiong Zhu > > 2015-01-08 19:23 GMT

RE: Several applications share the same Spark executors (or their cache)

2015-01-08 Thread Silvio Fiorito
Rather than having duplicate Spark apps and the web app having a direct reference to the SparkContext, why not use a queue or message bus to submit your requests? This way you're not wasting resources caching the same data in Spark, and you can scale your web tier independently of the Spark tier

Executing Spark, Error creating path from empty String.

2015-01-08 Thread Guillermo Ortiz
When I try to execute my task with Spark it starts to copy the jars it needs to HDFS and it finally fails; I don't know exactly why. I have checked HDFS and it copies the files, so that part seems to work. I changed the log level to debug but there's nothing else to help. What else does Spark n

Eclipse flags error on KafkaUtils.createStream()

2015-01-08 Thread kc66
Hi, I am using Eclipse writing Java code. I am trying to create a Kafka receiver by: JavaPairReceiverInputDStream a = KafkaUtils.createStream(jssc, String.class, Message.class, StringDecoder.class, DefaultDecoder.class, kafkaParams, topics, StorageL

Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Cheng Lian
Weird, which version did you use? Just tried a small snippet in the Spark 1.2.0 shell as follows, and the result shown in the web UI meets the expectation quite well:
import org.apache.spark.sql.SQLContext
import sc._
val sqlContext = new SQLContext(sc)
import sqlContext._
jsonFile("file:///

Re: example insert statement in Spark SQL

2015-01-08 Thread Cheng Lian
Spark SQL supports the Hive insertion statement (Hive 0.14.0-style insertion is not supported though): https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingdataintoHiveTablesfromqueries The small SQL dialect provided in Spark SQL doesn't support insertion ye
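For reference, a minimal sketch of the SELECT-based Hive insertion described above, assuming a Hive-enabled Spark build; the table names are purely illustrative:

    // Hedged sketch: requires a Spark build with Hive support
    import org.apache.spark.sql.hive.HiveContext
    val hiveContext = new HiveContext(sc)
    hiveContext.sql("CREATE TABLE IF NOT EXISTS target_tbl (key INT, value STRING)")
    // INSERT ... SELECT works; the Hive 0.14.0 INSERT ... VALUES form does not
    hiveContext.sql("INSERT INTO TABLE target_tbl SELECT key, value FROM source_tbl")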

Re: SparkSQL support for reading Avro files

2015-01-08 Thread Cheng Lian
This package is moved here: https://github.com/databricks/spark-avro On 1/6/15 5:12 AM, yanenli2 wrote: Hi All, I want to use the SparkSQL to manipulate the data with Avro format. I found a solution at https://github.com/marmbrus/sql-avro . However it doesn't compile successfully anymore with t

Re: Does SparkSQL not support nested IF(1=1, 1, IF(2=2, 2, 3)) statements?

2015-01-08 Thread Cheng Lian
The + operator only handles numeric data types; you may register your own concat function like this:
sqlContext.registerFunction("concat", (s: String, t: String) => s + t)
sqlContext.sql("select concat('$', col1) from tbl")
Cheng On 1/5/15 1:13 PM, RK wrote: The issue is happening when I

Parquet compression codecs not applied

2015-01-08 Thread Ayoub Benali
Hello, I tried to save a table created via the hive context as a parquet file but whatever compression codec (uncompressed, snappy, gzip or lzo) I set via setConf like: setConf("spark.sql.parquet.compression.codec", "gzip") the size of the generated files is always the same, so it seems like

Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Xuelin Cao
Hi Cheng, In your code:
cacheTable("tbl")
sql("select * from tbl").collect()
sql("select name from tbl").collect()
When running the first sql, the whole table is not cached yet. So the input data will be the original json file. After it is cached, the json format data is removed, s
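A minimal sketch of the scenario under discussion (Spark 1.2-era API; the JSON path and column name are assumptions):

    import org.apache.spark.sql.SQLContext
    val sqlContext = new SQLContext(sc)
    val people = sqlContext.jsonFile("hdfs:///tmp/people.json")
    people.registerTempTable("tbl")
    sqlContext.cacheTable("tbl")
    sqlContext.sql("select * from tbl").collect()    // first scan still reads the original JSON and fills the cache
    sqlContext.sql("select name from tbl").collect() // later scans read the in-memory columnar cache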

Re: Executing Spark, Error creating path from empty String.

2015-01-08 Thread Guillermo Ortiz
I was adding some bad jars I guess. I deleted all the jars and copied them again and it works. 2015-01-08 14:15 GMT+01:00 Guillermo Ortiz : > When I try to execute my task with Spark it starts to copy the jars it > needs to HDFS and it finally fails, I don't know exactly why. I have > checked HDFS

Build spark source code with Maven in Intellij Idea

2015-01-08 Thread Todd
Hi, I have imported the Spark source code into IntelliJ IDEA as an SBT project. I tried to do a Maven install in IntelliJ IDEA by clicking Install in the Spark Project Parent POM (root), but it failed. I would like to ask which profiles should be checked. What I want to achieve is starting Spark in the IDE and Hadoop

Re: Build spark source code with Maven in Intellij Idea

2015-01-08 Thread Sean Owen
Popular topic in the last 48 hours! Just about 20 minutes ago I collected some recent information on just this topic into a pull request. https://github.com/apache/spark/pull/3952 On Thu, Jan 8, 2015 at 2:24 PM, Todd wrote: > Hi, > I have imported the Spark source code in Intellij Idea as a SBT

Re: Spark SQL: The cached columnar table is not columnar?

2015-01-08 Thread Cheng Lian
Ah, my bad... You're absolutely right! Just checked how this number is computed. It turned out that once an RDD block is retrieved from the block manager, the size of the block is added to the input bytes. Spark SQL's in-memory columnar format stores all columns within a single partition into a

Spark Project Fails to run multicore in local mode.

2015-01-08 Thread mixtou
I am new to Apache Spark. Now I am trying my first project, "Space Saving Counting Algorithm", and while it works on a single core using .setMaster("local"), it fails when using .setMaster("local[4]") or any number > 1. My code follows: import

Re: ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId

2015-01-08 Thread Spidy
Hi, Can you please explain which settings you changed? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/ERROR-ConnectionManager-Corresponding-SendingConnection-to-ConnectionManagerId-tp17050p21035.html Sent from the Apache Spark User List mailing list ar

Re: Saving partial (top 10) DStream windows to hdfs

2015-01-08 Thread Yana Kadiyska
I'm glad you solved this issue but have a followup question for you. Wouldn't Akhil's solution be better for you after all? I run similar computation where a large set of data gets reduced to a much smaller aggregate in an interval. If you do saveAsText without coalescing, I believe you'd get the s
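A hedged sketch of the coalesce-before-save idea (the DStream name and output path are assumptions):

    // Each interval's small aggregate is shrunk to one partition before writing,
    // so every batch produces a single output file instead of one file per partition
    topTen.foreachRDD { (rdd, time) =>
      rdd.coalesce(1).saveAsTextFile(s"hdfs:///tmp/top10-${time.milliseconds}")
    }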

Re: Registering custom metrics

2015-01-08 Thread Enno Shioji
FYI I found this approach by Ooyala.
/** Instrumentation for Spark based on accumulators.
 *
 * Usage:
 * val instrumentation = new SparkInstrumentation("example.metrics")
 * val numReqs = sc.accumulator(0L)
 * instrumentation.source.registerDailyAccumulator(numReqs, "numReqs")
 * instrument

Re: Registering custom metrics

2015-01-08 Thread Gerard Maas
Very interesting approach. Thanks for sharing it! On Thu, Jan 8, 2015 at 5:30 PM, Enno Shioji wrote: > FYI I found this approach by Ooyala. > > /** Instrumentation for Spark based on accumulators. > * > * Usage: > * val instrumentation = new SparkInstrumentation("example.metrics") > * va

Join RDDs with DStreams

2015-01-08 Thread Asim Jalis
Is there a way to join non-DStream RDDs with DStream RDDs? Here is the use case. I have a lookup table stored in HDFS that I want to read as an RDD. Then I want to join it with the RDDs that are coming in through the DStream. How can I do this? Thanks. Asim

Re: Join RDDs with DStreams

2015-01-08 Thread Gerard Maas
You are looking for dstream.transform(rdd => rdd.join(otherRdd)). The docs contain an example of how to use transform. https://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams -kr, Gerard. On Thu, Jan 8, 2015 at 5:50 PM, Asim Jalis wrote: > Is there a way t
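A minimal sketch of that pattern, assuming a (key, value) lookup file on HDFS and a keyed DStream (names and path are assumptions):

    import org.apache.spark.SparkContext._   // pair-RDD implicits (Spark 1.x)
    val lookup = sc.textFile("hdfs:///data/lookup.csv")
      .map { line => val Array(k, v) = line.split(","); (k, v) }
    // keyedStream: DStream[(String, String)]
    val joined = keyedStream.transform(rdd => rdd.join(lookup))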

Spark Streaming Checkpointing

2015-01-08 Thread Asim Jalis
Since checkpointing in streaming apps happens every checkpoint duration, in the event of failure, how is the system able to recover the state changes that happened after the last checkpoint?

Re: Spark History Server can't read event logs

2015-01-08 Thread Marcelo Vanzin
Hmm. Can you set the permissions of "/apps/spark/historyserver/logs" to 3777? I'm not sure HDFS respects the group id bit, but it's worth a try. (BTW that would only affect newly created log directories.) On Thu, Jan 8, 2015 at 1:22 AM, wrote: > Hi Vanzin, > > I am using the MapR distribution of

Re: Spark History Server can't read event logs

2015-01-08 Thread Marcelo Vanzin
Nevermind my last e-mail. HDFS complains about not understanding "3777"... On Thu, Jan 8, 2015 at 9:46 AM, Marcelo Vanzin wrote: > Hmm. Can you set the permissions of "/apps/spark/historyserver/logs" > to 3777? I'm not sure HDFS respects the group id bit, but it's worth a > try. (BTW that would o

Re: Spark on teradata?

2015-01-08 Thread Reynold Xin
It depends on your use case. If the use case is to extract a small amount of data out of Teradata, then you can use the JdbcRDD, and soon a JDBC input source based on the new Spark SQL external data source API. On Wed, Jan 7, 2015 at 7:14 AM, gen tang wrote: > Hi, > > I have a stupid question: >
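A hedged sketch of the JdbcRDD route (Spark 1.x API; the Teradata JDBC URL, query and bounds are assumptions):

    import java.sql.DriverManager
    import org.apache.spark.rdd.JdbcRDD
    val rows = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:teradata://host/DATABASE=db", "user", "pass"),
      "SELECT id, amount FROM orders WHERE id >= ? AND id <= ?", // the two ?s bound each partition
      1L, 1000000L, 16,                                          // lowerBound, upperBound, numPartitions
      rs => (rs.getLong(1), rs.getDouble(2)))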

Re: Spark on teradata?

2015-01-08 Thread gen tang
Thanks a lot for your reply. In fact, I need to work on almost all the data in teradata (~100T). So, I don't think that jdbcRDD is a good choice. Cheers Gen On Thu, Jan 8, 2015 at 7:39 PM, Reynold Xin wrote: > Depending on your use cases. If the use case is to extract small amount of > data ou

Re: SPARKonYARN failing on CDH 5.3.0 : container cannot be fetched because of NumberFormatException

2015-01-08 Thread Mukesh Jha
On Thu, Jan 8, 2015 at 5:08 PM, Mukesh Jha wrote: > Hi Experts, > > I am running spark inside YARN job. > > The spark-streaming job is running fine in CDH-5.0.0 but after the upgrade > to 5.3.0 it cannot fetch containers with the below errors. Looks like the > container id is incorrect and a stri

Re: SparkSQL support for reading Avro files

2015-01-08 Thread yanenli2
thanks for the reply! Now I know that this package is moved here: https://github.com/databricks/spark-avro -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-support-for-reading-Avro-files-tp20981p21040.html Sent from the Apache Spark User List mailing

Find S3 file attributes by Spark

2015-01-08 Thread rajnish
Hi, We have a file in an AWS S3 bucket that is loaded frequently. When accessing that file from Spark, can we get the file properties by some method in Spark? Regards Raj -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Find-S3-file-attributes-by-Spark-tp21039.ht
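One possible sketch using the Hadoop FileSystem API that ships with Spark (bucket and key are assumptions, and the s3n/s3a scheme depends on your Hadoop version):

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, Path}
    val path = new Path("s3n://my-bucket/data/input.csv")
    val fs = FileSystem.get(new URI("s3n://my-bucket"), sc.hadoopConfiguration)
    val status = fs.getFileStatus(path)
    println(s"size=${status.getLen} bytes, modified=${status.getModificationTime}")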

Re: Spark History Server can't read event logs

2015-01-08 Thread Marcelo Vanzin
Sorry for the noise; but I just remembered you're actually using MapR (and not HDFS), so maybe the "3777" trick could work... On Thu, Jan 8, 2015 at 10:32 AM, Marcelo Vanzin wrote: > Nevermind my last e-mail. HDFS complains about not understanding "3777"... > > On Thu, Jan 8, 2015 at 9:46 AM, Mar

Re: SPARKonYARN failing on CDH 5.3.0 : container cannot be fetched because of NumberFormatException

2015-01-08 Thread Sandy Ryza
Hi Mukesh, Those line numbers in ConverterUtils in the stack trace don't appear to line up with CDH 5.3: https://github.com/cloudera/hadoop-common/blob/cdh5-2.5.0_5.3.0/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/ConverterUtils.java Is it possible

Re: SPARKonYARN failing on CDH 5.3.0 : container cannot be fetched because of NumberFormatException

2015-01-08 Thread Marcelo Vanzin
Just to add to Sandy's comment, check your client configuration (generally in /etc/spark/conf). If you're using CM, you may need to run the "Deploy Client Configuration" command on the cluster to update the configs to match the new version of CDH. On Thu, Jan 8, 2015 at 11:38 AM, Sandy Ryza wrote

Data locality running Spark on Mesos

2015-01-08 Thread mvle
Hi, I've noticed running Spark apps on Mesos is significantly slower compared to stand-alone or Spark on YARN. I don't think it should be the case, so I am posting the problem here in case someone has some explanation or can point me to some configuration options i've missed. I'm running the Line

Re: Spark Project Fails to run multicore in local mode.

2015-01-08 Thread Dean Wampler
Use local[*] instead of local to grab all available cores. Using local just grabs one. Dean On Thursday, January 8, 2015, mixtou wrote: > I am new to Apache Spark, now i am trying my first project "Space Saving > Counting Algorithm" and while it compiles in single core using > .setMaster("local
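A minimal sketch of that change (the app name is an assumption):

    import org.apache.spark.{SparkConf, SparkContext}
    val conf = new SparkConf()
      .setAppName("SpaceSaving")
      .setMaster("local[*]")  // all available cores; "local[4]" pins exactly 4
    val sc = new SparkContext(conf)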

Re: Spark on teradata?

2015-01-08 Thread Evan R. Sparks
Have you taken a look at the TeradataDBInputFormat? Spark is compatible with arbitrary hadoop input formats - so this might work for you: http://developer.teradata.com/extensibility/articles/hadoop-mapreduce-connector-to-teradata-edw On Thu, Jan 8, 2015 at 10:53 AM, gen tang wrote: > Thanks a lo

Re: Data locality running Spark on Mesos

2015-01-08 Thread Tim Chen
How did you run this benchmark, and is there a open version I can try it with? And what is your configurations, like spark.locality.wait, etc? Tim On Thu, Jan 8, 2015 at 11:44 AM, mvle wrote: > Hi, > > I've noticed running Spark apps on Mesos is significantly slower compared > to > stand-alone

Initial State of updateStateByKey

2015-01-08 Thread Asim Jalis
In Spark Streaming, is there a way to initialize the state of updateStateByKey before it starts processing RDDs? I noticed that there is an overload of updateStateByKey that takes an initialRDD in the latest sources (although not in the 1.2.0 release). Is there another way to do this until this fea
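One interim workaround sketch until that overload is released (key names and seed values are assumptions): use the key-aware updateStateByKey variant and fall back to a broadcast map of initial values the first time a key shows up:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.streaming.StreamingContext._   // pair-DStream implicits (Spark 1.x)
    val initial = ssc.sparkContext.broadcast(Map("a" -> 10L, "b" -> 5L)) // assumed seed state
    // keyedStream: DStream[(String, Long)]
    val state = keyedStream.updateStateByKey[Long](
      (iter: Iterator[(String, Seq[Long], Option[Long])]) => iter.map { case (key, values, current) =>
        val previous = current.orElse(initial.value.get(key)).getOrElse(0L)
        (key, previous + values.sum)
      },
      new HashPartitioner(ssc.sparkContext.defaultParallelism),
      true) // remember the partitioner across batches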

Re: ERROR ConnectionManager: Corresponding SendingConnection to ConnectionManagerId

2015-01-08 Thread Aaron Davidson
Do note that this problem may be fixed in Spark 1.2, as we changed the default transfer service to use a Netty-based one rather than the ConnectionManager. On Thu, Jan 8, 2015 at 7:05 AM, Spidy wrote: > Hi, > > Can you please explain which settings did you changed? > > > > -- > View this message

Re: Discrepancy in PCA values

2015-01-08 Thread Xiangrui Meng
The Julia code is computing the SVD of the Gram matrix. PCA should be applied to the covariance matrix. -Xiangrui On Thu, Jan 8, 2015 at 8:27 AM, Upul Bandara wrote: > Hi All, > > I tried to do PCA for the Iris dataset > [https://archive.ics.uci.edu/ml/datasets/Iris] using MLLib > [http://spark.a
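For reference, a minimal MLlib sketch of PCA in Spark (1.2-era API; the file path and parsing are assumptions). MLlib's computePrincipalComponents works from the mean-centered covariance, which is why its output differs from an SVD of the raw Gram matrix:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix
    val rows = sc.textFile("iris.data")                 // assumed: 4 numeric columns + label
      .map(_.split(",").take(4).map(_.toDouble))
      .map(arr => Vectors.dense(arr))
    val mat = new RowMatrix(rows)
    val pc = mat.computePrincipalComponents(2)          // top 2 principal components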

Re: Spark Standalone Cluster not correctly configured

2015-01-08 Thread Josh Rosen
Can you please file a JIRA issue for this? This will make it easier to triage this issue. https://issues.apache.org/jira/browse/SPARK Thanks, Josh On Thu, Jan 8, 2015 at 2:34 AM, frodo777 wrote: > Hello everyone. > > With respect to the configuration problem that I explained before > > D

Is the Thrift server right for me?

2015-01-08 Thread sjbrunst
I'm building a system that collects data using Spark Streaming, does some processing with it, then saves the data. I want the data to be queried by multiple applications, and it sounds like the Thrift JDBC/ODBC server might be the right tool to handle the queries. However, the documentation for th

Spark SQL: Storing AVRO Schema in Parquet

2015-01-08 Thread Jerry Lam
Hi spark users, I'm using spark SQL to create parquet files on HDFS. I would like to store the avro schema into the parquet meta so that non spark sql applications can marshall the data without avro schema using the avro parquet reader. Currently, schemaRDD.saveAsParquetFile does not allow to do t

Getting Output From a Cluster

2015-01-08 Thread Su She
Hello Everyone, Thanks in advance for the help! I successfully got my Kafka/Spark WordCount app to print locally. However, I want to run it on a cluster, which means that I will have to save it to HDFS if I want to be able to read the output. I am running Spark 1.1.0, which means according to th

SparkSQL

2015-01-08 Thread Abhi Basu
I am working with CDH5.2 (Spark 1.0.0) and wondering which version of Spark comes with SparkSQL by default. Also, will SparkSQL come enabled to access the Hive Metastore? Is there an easier way to enable Hive support without having to build the code with various switches? Thanks, Abhi -- Abhi Bas

Re: SparkSQL

2015-01-08 Thread Marcelo Vanzin
Disclaimer: this seems more of a CDH question, I'd suggest sending these to the CDH mailing list in the future. CDH 5.2 actually has Spark 1.1. It comes with SparkSQL built-in, but it does not include the thrift server because of incompatibilities with the CDH version of Hive. To use Hive support,

Re: Implement customized Join for SparkSQL

2015-01-08 Thread Rishi Yadav
Hi Kevin, Say A has 10 ids, so you are pulling data from B's data source only for these 10 ids? What if you load A and B as separate schemaRDDs and then do a join? Spark will optimize the path anyway when the action is fired. On Mon, Jan 5, 2015 at 2:28 AM, Dai, Kevin wrote: > Hi, All > > > > Supp

correct/best way to install custom spark1.2 on cdh5.3.0?

2015-01-08 Thread freedafeng
Could anyone come up with your experience on how to do this? I have created a cluster and installed cdh5.3.0 on it with basically core + Hbase. but cloudera installed and configured the spark in its parcels anyway. I'd like to install our custom spark on this cluster to use the hadoop and hbase s

Re: correct/best way to install custom spark1.2 on cdh5.3.0?

2015-01-08 Thread Marcelo Vanzin
Disclaimer: CDH questions are better handled at cdh-us...@cloudera.org. But the question I'd like to ask is: why do you need your own Spark build? What's wrong with CDH's Spark that it doesn't work for you? On Thu, Jan 8, 2015 at 3:01 PM, freedafeng wrote: > Could anyone come up with your experi

Re: correct/best way to install custom spark1.2 on cdh5.3.0?

2015-01-08 Thread freedafeng
I installed the custom build in standalone mode as normal. The master and slaves started successfully. However, I got an error when I ran a job. It seems to me from the error message that some library was compiled against hadoop1, but my spark was compiled against hadoop2. 15/01/08 23:27:36 INFO ClientC

Re: Problem with StreamingContext - getting SPARK-2243

2015-01-08 Thread Rishi Yadav
you can also access SparkConf using sc.getConf in Spark shell though for StreamingContext you can directly refer sc as Akhil suggested. On Sun, Dec 28, 2014 at 12:13 AM, Akhil Das wrote: > In the shell you could do: > > val ssc = StreamingContext(*sc*, Seconds(1)) > > as *sc* is the SparkContext

Re: Profiling a spark application.

2015-01-08 Thread Rishi Yadav
As per my understanding, RDDs do not get replicated; the underlying data does if it's in HDFS. On Thu, Dec 25, 2014 at 9:04 PM, rapelly kartheek wrote: > Hi, > > I want to find the time taken for replicating an rdd in spark cluster > along with the computation time on the replicated rdd. > > Can some

Re: JavaRDD (Data Aggregation) based on key

2015-01-08 Thread Rishi Yadav
One approach is to first transform this RDD into a PairRDD by taking the field you are going to do aggregation on as the key. On Tue, Dec 23, 2014 at 1:47 AM, sachin Singh wrote: > Hi, > I have a csv file having fields as a,b,c . > I want to do aggregation(sum,average..) based on any field(a,b or c)
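A minimal sketch of that approach (the column layout a,b,c and the aggregation target are assumptions):

    import org.apache.spark.SparkContext._   // pair-RDD implicits (Spark 1.x)
    val rows  = sc.textFile("data.csv").map(_.split(","))
    val pairs = rows.map(cols => (cols(0), cols(2).toDouble))        // key = a, value = c
    val sums  = pairs.reduceByKey(_ + _)                             // sum of c per a
    val avgs  = pairs.mapValues(v => (v, 1L))
      .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
      .mapValues { case (sum, count) => sum / count }                // average of c per a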

Re: correct/best way to install custom spark1.2 on cdh5.3.0?

2015-01-08 Thread Marcelo Vanzin
On Thu, Jan 8, 2015 at 3:33 PM, freedafeng wrote: > I installed the custom as a standalone mode as normal. The master and slaves > started successfully. > However, I got error when I ran a job. It seems to me from the error message > the some library was compiled against hadoop1, but my spark was

Running spark 1.2 on Hadoop + Kerberos

2015-01-08 Thread Manoj Samel
Hi, For running spark 1.2 on Hadoop cluster with Kerberos, what spark configurations are required? Using existing keytab, can any examples be submitted to the secured cluster ? How? Thanks,

Re: Running spark 1.2 on Hadoop + Kerberos

2015-01-08 Thread Marcelo Vanzin
Hi Manoj, As long as you're logged in (i.e. you've run kinit), everything should just work. You can run "klist" to make sure you're logged in. On Thu, Jan 8, 2015 at 3:49 PM, Manoj Samel wrote: > Hi, > > For running spark 1.2 on Hadoop cluster with Kerberos, what spark > configurations are requi

Re: Running spark 1.2 on Hadoop + Kerberos

2015-01-08 Thread Manoj Samel
Please ignore the keytab question for now; the question wasn't fully described. Some old communication (Oct 14) says Spark is not certified with Kerberos. Can someone comment on this aspect? On Thu, Jan 8, 2015 at 3:53 PM, Marcelo Vanzin wrote: > Hi Manoj, > > As long as you're logged in (i.e. you'

Re: correct/best way to install custom spark1.2 on cdh5.3.0?

2015-01-08 Thread freedafeng
I ran the release spark in cdh5.3.0 but got the same error. Anyone tried to run spark in cdh5.3.0 using its newAPIHadoopRDD? command: spark-submit --master spark://master:7077 --jars /opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/jars/spark-examples-1.2.0-cdh5.3.0-hadoop2.5.0-cdh5.3.0.jar ./sp

Re: Running spark 1.2 on Hadoop + Kerberos

2015-01-08 Thread Marcelo Vanzin
On Thu, Jan 8, 2015 at 4:09 PM, Manoj Samel wrote: > Some old communication (Oct 14) says Spark is not certified with Kerberos. > Can someone comment on this aspect ? Spark standalone doesn't support kerberos. Spark running on top of Yarn works fine with kerberos. -- Marcelo --

Re: correct/best way to install custom spark1.2 on cdh5.3.0?

2015-01-08 Thread Marcelo Vanzin
I ran this with CDH 5.2 without a problem (sorry don't have 5.3 readily available at the moment): $ HBASE='/opt/cloudera/parcels/CDH/lib/hbase/\*' $ spark-submit --driver-class-path $HBASE --conf "spark.executor.extraClassPath=$HBASE" --master yarn --class org.apache.spark.examples.HBaseTest /opt/

Re: Getting Output From a Cluster

2015-01-08 Thread Yana Kadiyska
Are you calling the saveAsTextFiles on the DStream? Looks like it. Look at the section called "Design Patterns for using foreachRDD" in the link you sent -- you want to do dstream.foreachRDD(rdd => rdd.saveAs) On Thu, Jan 8, 2015 at 5:20 PM, Su She wrote: > Hello Everyone, > > Thanks in a
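A minimal sketch of that pattern (the DStream name and output path are assumptions):

    wordCounts.foreachRDD { (rdd, time) =>
      rdd.saveAsTextFile(s"hdfs:///user/spark/wordcounts/batch-${time.milliseconds}")
    }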

[ANNOUNCE] Apache Science and Healthcare Track @ApacheCon NA 2015

2015-01-08 Thread Lewis John Mcgibbney
Hi Folks, Apologies for cross posting :( As some of you may already know, @ApacheCon NA 2015 is happening in Austin, TX April 13th-16th. This email is specifically written to attract all folks interested in Science and Healthcare... this is an official call to arms! I am aware that there are man

Cannot save RDD as text file to local file system

2015-01-08 Thread Wang, Ningjun (LNG-NPV)
I tried to save an RDD as a text file to the local file system (Linux) but it does not work. Launch spark-shell and run the following:
val r = sc.parallelize(Array("a", "b", "c"))
r.saveAsTextFile("file:///home/cloudera/tmp/out1")
IOException: Mkdirs failed to create file:/home/cloudera/tmp/out1/_temporary/

Failed to save RDD as text file to local file system

2015-01-08 Thread NingjunWang
I tried to save an RDD as a text file to the local file system (Linux) but it does not work. Launch spark-shell and run the following:
val r = sc.parallelize(Array("a", "b", "c"))
r.saveAsTextFile("file:///home/cloudera/tmp/out1")
IOException: Mkdirs failed to create file:/home/cloudera/tmp/out1/_temporary/

Re: Getting Output From a Cluster

2015-01-08 Thread Su She
Yes, I am calling the saveAsHadoopFiles on the Dstream. However, when I call print on the Dstream it works? If I had to do foreachRDD to saveAsHadoopFile, then why is it working for print? Also, if I am doing foreachRDD, do I need connections, or can I simply put the saveAsHadoopFiles inside the f

Did anyone tried overcommit of CPU cores?

2015-01-08 Thread Xuelin Cao
Hi, I'm wondering whether it is a good idea to overcommit CPU cores on the spark cluster. For example, in our testing cluster, each worker machine has 24 physical CPU cores. However, we are allowed to set the CPU core number to 48 or more in the spark configuration file. As a result,

skipping header from each file

2015-01-08 Thread Hafiz Mujadid
Suppose I give three file paths to the Spark context to read, and each file has its schema in the first row. How can we skip these header lines? val rdd=sc.textFile("file1,file2,file3"); -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/skipping-header-from-each-f

Re: Spark SQL: Storing AVRO Schema in Parquet

2015-01-08 Thread Raghavendra Pandey
I have a similar kind of requirement where I want to push Avro data into Parquet. But it seems you have to do it on your own. There is the parquet-mr project that uses Hadoop to do so. I am trying to write a Spark job to do a similar kind of thing. On Fri, Jan 9, 2015 at 3:20 AM, Jerry Lam wrote: >

Re: Failed to save RDD as text file to local file system

2015-01-08 Thread Raghavendra Pandey
Can you check permissions etc as I am able to run r.saveAsTextFile("file:///home/cloudera/tmp/out1") successfully on my machine.. On Fri, Jan 9, 2015 at 10:25 AM, NingjunWang wrote: > I try to save RDD as text file to local file system (Linux) but it does not > work > > Launch spark-shell and ru

Re: skipping header from each file

2015-01-08 Thread Akhil Das
Did you try something like:
val file = sc.textFile("/home/akhld/sigmoid/input")
val skipped = file.filter(row => !row.contains("header"))
skipped.take(10).foreach(println)
Thanks Best Regards On Fri, Jan 9, 2015 at 11:48 AM, Hafiz Mujadid wrote: > Suppose I give three files paths
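If the header text is not known up front, another hedged option is to read each file whole and drop its first line (reasonable only for small files; the paths are assumptions):

    val noHeaders = sc.wholeTextFiles("file1,file2,file3")
      .flatMap { case (_, content) => content.split("\n").drop(1) }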

Re: RDD Moving Average

2015-01-08 Thread Tobias Pfeiffer
Hi, On Wed, Jan 7, 2015 at 9:47 AM, Asim Jalis wrote: > One approach I was considering was to use mapPartitions. It is > straightforward to compute the moving average over a partition, except for > near the end point. Does anyone see how to fix that? > Well, I guess this is not a perfect use ca
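For plain (non-streaming) RDDs, one hedged option is MLlib's sliding() helper (a developer API in the 1.x line), which takes care of the records near partition boundaries:

    import org.apache.spark.mllib.rdd.RDDFunctions._
    val values    = sc.parallelize(1 to 100).map(_.toDouble)      // sample data is an assumption
    val movingAvg = values.sliding(3).map(w => w.sum / w.size)    // window size 3 is an assumption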

Re: Failed to save RDD as text file to local file system

2015-01-08 Thread VISHNU SUBRAMANIAN
Looks like it is trying to save the file in HDFS. Check if you have set any Hadoop path in your system. On Fri, Jan 9, 2015 at 12:14 PM, Raghavendra Pandey < raghavendra.pan...@gmail.com> wrote: > Can you check permissions etc as I am able to run > r.saveAsTextFile("file:///home/cloudera/tmp/out

Parallel execution on one node

2015-01-08 Thread mikens
Hello, I am new to Spark. I have adapted an example code to do binary classification using logistic regression. I tried it on rcv1_train.binary dataset using LBFGS.runLBFGS solver, and obtained correct loss. Now, I'd like to run code in parallel across 16 cores of my single CPU socket. If I unders

Re: Did anyone tried overcommit of CPU cores?

2015-01-08 Thread Jörn Franke
Hallo, Based on experiences with other software in virtualized environments I cannot really recommend this. However, I am not sure how Spark reacts. You may face unpredictable task failures depending on utilization, tasks connecting to external systems (databases etc.) may fail unexpectedly and th

Re: Spark SQL: Storing AVRO Schema in Parquet

2015-01-08 Thread Raghavendra Pandey
I came across this: http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/. You can take a look. On Fri Jan 09 2015 at 12:08:49 PM Raghavendra Pandey < raghavendra.pan...@gmail.com> wrote: > I have the similar kind of requirement where I want to push avro data into > parquet. But it seems you h

Re: Cannot save RDD as text file to local file system

2015-01-08 Thread Akhil Das
Are you running the program in local mode or in standalone cluster mode? Thanks Best Regards On Fri, Jan 9, 2015 at 10:12 AM, Wang, Ningjun (LNG-NPV) < ningjun.w...@lexisnexis.com> wrote: > I try to save RDD as text file to local file system (Linux) but it does > not work > > > > Launch spark-s

Re: Getting Output From a Cluster

2015-01-08 Thread Akhil Das
saveAsHadoopFiles requires you to specify the output format, which I believe you are not specifying anywhere, and hence the program crashes. You could try something like this:
Class<? extends OutputFormat<?, ?>> outputFormatClass = (Class<? extends OutputFormat<?, ?>>) (Class<?>) SequenceFileOutputFormat.class;
yourStream.saveAsNewAPIHadoopFiles(hdfsUrl,

Re: Join RDDs with DStreams

2015-01-08 Thread Akhil Das
Here's how you do it:
val joined_stream = myStream.transform((x: RDD[(String, String)]) => {
  val prdd = new PairRDDFunctions[String, String](x)
  prdd.join(myRDD)
})
Thanks Best Regards On Thu, Jan 8, 2015 at 10:20 PM, Asim Jalis wrote: > Is there a way to join non-DStream

Re: SparkSQL schemaRDD & MapPartitions calls - performance issues - columnar formats?

2015-01-08 Thread Nathan McCarthy
Any ideas? :) From: Nathan <nathan.mccar...@quantium.com.au> Date: Wednesday, 7 January 2015 2:53 pm To: "user@spark.apache.org" <user@spark.apache.org> Subject: SparkSQL schemaRDD & MapPartitions calls - performance issues - columnar formats? Hi, I

Re: Getting Output From a Cluster

2015-01-08 Thread Su She
1) Thank you everyone for the help once again...the support here is really amazing and I hope to contribute soon! 2) The solution I actually ended up using was from this thread: http://mail-archives.apache.org/mod_mbox/spark-user/201310.mbox/%3ccafnzj5ejxdgqju7nbdqy6xureq3d1pcxr+i2s99g5brcj5e...@m

Re: Set EXTRA_JAR environment variable for spark-jobserver

2015-01-08 Thread Sasi
Thank you Pankaj. We are able to create the Uber JAR (very good for binding all dependency JARs together) and run it on spark-jobserver. One step better than where we were. However, we are now facing SparkException: Job aborted due to stage failure: All masters are unresponsive! Giving up. We may need to rai