BlockNotFoundException when running spark word count on Tachyon

2015-08-25 Thread Todd
I am using Tachyon in the Spark program below, but I encounter a BlockNotFoundException. Does someone know what's wrong, and also is there a guide on how to configure Spark to work with Tachyon? Thanks! conf.set("spark.externalBlockStore.url", "tachyon://10.18.19.33:19998") conf.set("spark.ex
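A minimal sketch of the setup being attempted, assuming the Tachyon master address from the question and a made-up input path; in Spark 1.4/1.5 the external block store is exercised when an RDD is persisted with StorageLevel.OFF_HEAP:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("TachyonWordCount")
      .set("spark.externalBlockStore.url", "tachyon://10.18.19.33:19998")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("hdfs:///input/words.txt")   // input path is an assumption
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.persist(StorageLevel.OFF_HEAP)   // OFF_HEAP blocks go to the external block store
    println(counts.count())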

Re: SparkSQL saveAsParquetFile does not preserve AVRO schema

2015-08-25 Thread storm
Note: In the code (org.apache.spark.sql.parquet.DefaultSource) I've found this: val relation = if (doInsertion) { // This is a hack. We always set nullable/containsNull/valueContainsNull to true // for the schema of a parquet data. val df = sqlContext.createDataFrame(

reduceByKey not working on JavaPairDStream

2015-08-25 Thread Deepesh Maheshwari
Hi, I have applied mapToPair and then a reduceByKey on a DStream to obtain a JavaPairDStream>. I have to apply a flatMapToPair and reduceByKey on the DStream obtained above, but I do not see any logs from the reduceByKey operation. Can anyone explain why this is happening? Find my code below - *

Re: CHAID Decision Trees

2015-08-25 Thread Feynman Liang
For a single decision tree, the closest I can think of is printDebugString, which gives you a text representation of the decision thresholds and paths down the tree. I don't think there's anything in MLlib for visualizing GBTs or random forests On Tue, Aug 25, 2015 at 9:20 PM, Jatinpreet Singh w

Re: How to access Spark UI through AWS

2015-08-25 Thread Justin Pihony
I figured it all out after this: http://apache-spark-user-list.1001560.n3.nabble.com/WebUI-on-yarn-through-ssh-tunnel-affected-by-AmIpfilter-td21540.html The short of it is that I needed to set SPARK_PUBLIC_DNS (not DNS_HOME) = ec2_publicdns. Then the YARN proxy gets in the way, so I needed to go to:

Re:Re: How to increase data scale in Spark SQL Perf

2015-08-25 Thread Todd
I think the answer is no. I only see such messages on the console, and #2 is the thread stack trace. My thinking is that Spark SQL Perf forks many dsdgen processes to generate data when the scale factor is increased, which at last exhausts the JVM. When thread exception is thrown on the console a

Re: How to increase data scale in Spark SQL Perf

2015-08-25 Thread Ted Yu
The error in #1 below was not informative. Are you able to get more detailed error message ? Thanks > On Aug 25, 2015, at 6:57 PM, Todd wrote: > > > Thanks Ted Yu. > > Following are the error message: > 1. The exception that is shown on the UI is : > Exception in thread "Thread-113" Excep

Re: use GraphX with Spark Streaming

2015-08-25 Thread ponkin
Hi, Sure you can. StreamingContext has property /def sparkContext: SparkContext/(see docs ). Think about DStream - main abstraction in Spark Streaming, as a sequence of RDD. Each DStream can be
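A rough sketch of that pattern (sparkConf is assumed to be defined, and edges are assumed to arrive as "src dst" text lines on a socket); each batch RDD is turned into a GraphX graph via the StreamingContext's underlying SparkContext:

    import org.apache.spark.graphx.{Edge, Graph}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sparkConf, Seconds(10))
    val sc  = ssc.sparkContext                      // the underlying SparkContext

    ssc.socketTextStream("localhost", 9999).foreachRDD { rdd =>
      val edges = rdd.map { line =>
        val Array(src, dst) = line.split(" ")
        Edge(src.toLong, dst.toLong, 1)
      }
      val graph = Graph.fromEdges(edges, defaultValue = 0)
      println(s"edges in this batch: ${graph.numEdges}")
    }

    ssc.start()
    ssc.awaitTermination()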

Re: CHAID Decision Trees

2015-08-25 Thread Jatinpreet Singh
Hi Feynman, Thanks for the information. Is there a way to depict decision tree as a visualization for large amounts of data using any other technique/library? Thanks, Jatin On Tue, Aug 25, 2015 at 11:42 PM, Feynman Liang wrote: > Nothing is in JIRA >

Question on take function - Spark Java API

2015-08-25 Thread Pankaj Wahane
Hi community members, > Apache Spark is Fantastic and very easy to learn.. Awesome work!!! > > Question: > > I have multiple files in a folder and the first line in each file is the name > of the asset that the file belongs to. Second line is the csv header row and data > starts from the third row..
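One possible way to keep the asset name (line 1) attached to its data rows (line 3 onward) is sc.wholeTextFiles, which yields one (path, content) pair per file; the directory path below is an assumption:

    val perAssetRows = sc.wholeTextFiles("/data/assets/*.csv").flatMap { case (_, content) =>
      val lines  = content.split("\n")
      val asset  = lines.head.trim        // line 1: asset name
      val header = lines(1)               // line 2: csv header, kept aside here
      lines.drop(2).map(row => (asset, row))
    }
    perAssetRows.take(5).foreach(println)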

Re:Re: How to increase data scale in Spark SQL Perf

2015-08-25 Thread Todd
Thanks Ted Yu. Following are the error message: 1. The exception that is shown on the UI is : Exception in thread "Thread-113" Exception in thread "Thread-126" Exception in thread "Thread-64" Exception in thread "Thread-90" Exception in thread "Thread-117" Exception in thread "Thread-80" Excep

Re: Spark thrift server on yarn

2015-08-25 Thread Udit Mehta
I registered it in a new Spark SQL CLI. Yeah I thought so too about how the temp tables were accessible across different applications without using a job-server. I see that running* HiveThriftServer2.startWithContext(hiveContext) *within the spark app starts up a thrift server. On Tue, Aug 25, 201
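A minimal sketch of that approach (the table name and input path are assumptions); because the thrift server shares the application's HiveContext, the temp table becomes visible to beeline clients:

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    val hiveContext = new HiveContext(sc)
    val df = hiveContext.jsonFile("/data/events.json")   // path is an assumption
    df.registerTempTable("events")
    HiveThriftServer2.startWithContext(hiveContext)      // JDBC/beeline sessions now see "events"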

RE: Spark thrift server on yarn

2015-08-25 Thread Cheng, Hao
Did you register temp table via the beeline or in a new Spark SQL CLI? As I know, the temp table cannot cross the HiveContext. Hao From: Udit Mehta [mailto:ume...@groupon.com] Sent: Wednesday, August 26, 2015 8:19 AM To: user Subject: Spark thrift server on yarn Hi, I am trying to start a spark

Spark thrift server on yarn

2015-08-25 Thread Udit Mehta
Hi, I am trying to start a spark thrift server using the following command on Spark 1.3.1 running on yarn: * ./sbin/start-thriftserver.sh --master yarn://resourcemanager.snc1:8032 --executor-memory 512m --hiveconf hive.server2.thrift.bind.host=test-host.sn1 --hiveconf hive.server2.thrift.port=1

Persisting sorted parquet tables for future sort merge joins

2015-08-25 Thread Jason
I want to persist a large _sorted_ table to Parquet on S3 and then read this in and join it using the Sorted Merge Join strategy against another large sorted table. The problem is: even though I sort these tables on the join key beforehand, once I persist them to Parquet, they lose the information

RE: Protobuf error when streaming from Kafka

2015-08-25 Thread java8964
I am not familiar with CDH distribution, we built spark ourselves. The error means running code generated with Protocol-Buffers 2.5.0 with a protocol-buffers-2.4.1 (or earlier) jar. So there is a protocol-buffer 2.4.1 version somewhere, either in the jar you built, or in the cluster runtime. T

Re: Exclude slf4j-log4j12 from the classpath via spark-submit

2015-08-25 Thread Utkarsh Sengar
Looks like I'm stuck then, as I am using Mesos. Adding these 2 jars to all executors might be a problem for me, so I will probably try to remove the dependency on the otj-logging lib and just use log4j. On Tue, Aug 25, 2015 at 2:15 PM, Marcelo Vanzin wrote: > On Tue, Aug 25, 2015 at 1:50 PM, Utkars

Re: How to access Spark UI through AWS

2015-08-25 Thread Justin Pihony
OK, I figured the horrid look also... the href of all of the styles is prefixed with the proxy data... so, ultimately if I can fix the proxy issues with the links, then I can fix the look also. On Tue, Aug 25, 2015 at 5:17 PM, Justin Pihony wrote: > SUCCESS! I set SPARK_DNS_HOME=ec2_publicdns

Re: How to access Spark UI through AWS

2015-08-25 Thread Justin Pihony
SUCCESS! I set SPARK_DNS_HOME=ec2_publicdns, which makes it available to access the spark ui directly. The application proxy was still getting in the way by the way it creates the URL, so I manually filled in the /stage?id=#&attempt=# and that worked... I'm still having trouble with the css as the

Re: Exclude slf4j-log4j12 from the classpath via spark-submit

2015-08-25 Thread Marcelo Vanzin
On Tue, Aug 25, 2015 at 1:50 PM, Utkarsh Sengar wrote: > So do I need to manually copy these 2 jars on my spark executors? Yes. I can think of a way to work around that if you're using YARN, but not with other cluster managers. > On Tue, Aug 25, 2015 at 10:51 AM, Marcelo Vanzin > wrote: >> >> O

Re: build spark 1.4.1 with JDK 1.6

2015-08-25 Thread Rick Moritz
My local build using rc-4 and java 7 does actually also produce different binaries (for one file only) than the 1.4.0 release artifact available on Central. These binaries also decompile to identical instructions, but this may be due to different versions of javac (within the 7 family) producing di

Re: build spark 1.4.1 with JDK 1.6

2015-08-25 Thread Sean Owen
Hm... off the cuff I wonder if this is because somehow the build process ran Maven with Java 6 but forked the Java/Scala compilers and those used JDK 7. Or some later repackaging process ran on the artifacts and used Java 6. I do see "Build-Jdk: 1.6.0_45" in the manifest, but I don't think 1.4.x ca

Re: build spark 1.4.1 with JDK 1.6

2015-08-25 Thread Rick Moritz
A quick question regarding this: how come the artifacts (spark-core in particular) on Maven Central are built with JDK 1.6 (according to the manifest), if Java 7 is required? On Aug 21, 2015 5:32 PM, "Sean Owen" wrote: > Spark 1.4 requires Java 7. > > On Fri, Aug 21, 2015, 3:12 PM Chen Song wrot

Re: Exclude slf4j-log4j12 from the classpath via spark-submit

2015-08-25 Thread Utkarsh Sengar
So do I need to manually copy these 2 jars on my spark executors? On Tue, Aug 25, 2015 at 10:51 AM, Marcelo Vanzin wrote: > On Tue, Aug 25, 2015 at 10:48 AM, Utkarsh Sengar > wrote: > > Now I am going to try it out on our mesos cluster. > > I assumed "spark.executor.extraClassPath" takes csv

Re: Protobuf error when streaming from Kafka

2015-08-25 Thread Cassa L
Do you think this binary would have issue? Do I need to build spark from source code? On Tue, Aug 25, 2015 at 1:06 PM, Cassa L wrote: > I downloaded below binary version of spark. > spark-1.4.1-bin-cdh4 > > On Tue, Aug 25, 2015 at 1:03 PM, java8964 wrote: > >> Did your spark build with Hive? >>

SparkSQL problem with IBM BigInsight V3

2015-08-25 Thread java8964
Hi, In our production environment, we have a unique problem related to Spark SQL, and I wonder if anyone can give me some idea of the best way to handle this. Our production Hadoop cluster is IBM BigInsight Version 3, which comes with Hadoop 2.2.0 and Hive 0.12. Right now, we build spark 1

Re: How to access Spark UI through AWS

2015-08-25 Thread Justin Pihony
Thanks. I just tried and still am having trouble. It seems to still be using the private address even if I try going through the resource manager. On Tue, Aug 25, 2015 at 12:34 PM, Kelly, Jonathan wrote: > I'm not sure why the UI appears broken like that either and haven't > investigated it myse

Re: Too many files/dirs in hdfs

2015-08-25 Thread Mohit Anchlia
Based on what I've read, it appears that when using spark streaming there is no good way of optimizing the files on HDFS. Spark streaming writes many small files, which is not scalable in apache hadoop. The only other way seems to be to read the files after they have been written and merge them into a bigger file,

Re: How to unit test HiveContext without OutOfMemoryError (using sbt)

2015-08-25 Thread Yana Kadiyska
The PermGen space error is controlled with the MaxPermSize parameter. I run with this in my pom, I think copied pretty literally from Spark's own tests... I don't know what the sbt equivalent is but you should be able to pass it... possibly via SBT_OPTS? org.scalatest sca
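For what it's worth, a possible sbt-side equivalent (a sketch): either export SBT_OPTS="-XX:MaxPermSize=512m" so sbt's own JVM gets the setting, or fork the test JVM in build.sbt, mirroring the Maven -XX:MaxPermSize flag:

    // build.sbt: run tests in a forked JVM with a larger PermGen
    fork in Test := true
    javaOptions in Test ++= Seq("-Xmx2g", "-XX:MaxPermSize=512m", "-XX:ReservedCodeCacheSize=512m")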

Re: Protobuf error when streaming from Kafka

2015-08-25 Thread Cassa L
I downloaded below binary version of spark. spark-1.4.1-bin-cdh4 On Tue, Aug 25, 2015 at 1:03 PM, java8964 wrote: > Did your spark build with Hive? > > I met the same problem before because the hive-exec jar in the maven > itself include "protobuf" class, which will be included in the Spark jar.

RE: Protobuf error when streaming from Kafka

2015-08-25 Thread java8964
Did your spark build with Hive? I met the same problem before because the hive-exec jar in Maven itself includes the "protobuf" classes, which will be included in the Spark jar. Yong Date: Tue, 25 Aug 2015 12:39:46 -0700 Subject: Re: Protobuf error when streaming from Kafka From: lcas...@gmail.com T

Re: Join with multiple conditions (In reference to SPARK-7197)

2015-08-25 Thread Davies Liu
It's good to support this, could you create a JIRA for it and target for 1.6? On Tue, Aug 25, 2015 at 11:21 AM, Michal Monselise wrote: > > Hello All, > > PySpark currently has two ways of performing a join: specifying a join > condition or column names. > > I would like to perform a join using

Re: Protobuf error when streaming from Kafka

2015-08-25 Thread Cassa L
Hi, I am using Spark 1.4 and Kafka 0.8.2.1. As per Google suggestions, I rebuilt all the classes with protobuf-2.5 dependencies. My new protobuf is compiled using 2.5. However, now my spark job does not start. It's throwing a different error. Does Spark or any of its other dependencies use an old protobuf

Re: Spark Streaming Checkpointing Restarts with 0 Event Batches

2015-08-25 Thread Susan Zhang
Sure thing! The main looks like: -- val kafkaBrokers = conf.getString(s"$varPrefix.metadata.broker.list") val kafkaConf = Map( "zookeeper.connect" -> zookeeper, "group.id" -> options.gro
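For comparison, the usual checkpoint-recovery shape is sketched below (checkpointDir, kafkaParams with metadata.broker.list, topics and sparkConf are assumptions); the key point is that the whole DStream graph is built inside the factory passed to getOrCreate:

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(sparkConf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, topics)
      stream.map(_._2).count().print()   // all transformations/output ops go inside the factory
      ssc
    }

    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()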

Re: Spark Streaming Checkpointing Restarts with 0 Event Batches

2015-08-25 Thread Cody Koeninger
Sounds like something's not set up right... can you post a minimal code example that reproduces the issue? On Tue, Aug 25, 2015 at 1:40 PM, Susan Zhang wrote: > Yeah. All messages are lost while the streaming job was down. > > On Tue, Aug 25, 2015 at 11:37 AM, Cody Koeninger > wrote: > >> Are y

Re: Local Spark talking to remote HDFS?

2015-08-25 Thread Roberto Congiu
That's what I'd suggest too. Furthermore, if you use vagrant to spin up VMs, there's a module that can do that automatically for you. R. 2015-08-25 10:11 GMT-07:00 Steve Loughran : > I wouldn't try to play with forwarding & tunnelling; always hard to work > out what ports get used everywhere, an

Re: Spark Streaming Checkpointing Restarts with 0 Event Batches

2015-08-25 Thread Susan Zhang
Yeah. All messages are lost while the streaming job was down. On Tue, Aug 25, 2015 at 11:37 AM, Cody Koeninger wrote: > Are you actually losing messages then? > > On Tue, Aug 25, 2015 at 1:15 PM, Susan Zhang wrote: > >> No; first batch only contains messages received after the second job >> sta

Re: Spark Streaming Checkpointing Restarts with 0 Event Batches

2015-08-25 Thread Cody Koeninger
Are you actually losing messages then? On Tue, Aug 25, 2015 at 1:15 PM, Susan Zhang wrote: > No; first batch only contains messages received after the second job > starts (messages come in at a steady rate of about 400/second). > > On Tue, Aug 25, 2015 at 11:07 AM, Cody Koeninger > wrote: > >>

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-08-25 Thread Feynman Liang
Kristina, Thanks for the discussion. I followed up on your problem and learned that Scala doesn't support multiple implicit conversions in a single expression for complexity reasons. I'm af

Fwd: Join with multiple conditions (In reference to SPARK-7197)

2015-08-25 Thread Michal Monselise
Hello All, PySpark currently has two ways of performing a join: specifying a join condition or column names. I would like to perform a join using a list of columns that appear in both the left and right DataFrames. I have created an example in this question on Stack Overflow
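For reference, the Scala DataFrame API expresses a multi-column join as a composite column expression (a sketch with made-up frames and column names); the request here is an equivalent list-of-columns form in PySpark:

    val joined = left.join(right,
      left("account_id") === right("account_id") && left("day") === right("day"))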

Re: Spark Streaming Checkpointing Restarts with 0 Event Batches

2015-08-25 Thread Susan Zhang
No; first batch only contains messages received after the second job starts (messages come in at a steady rate of about 400/second). On Tue, Aug 25, 2015 at 11:07 AM, Cody Koeninger wrote: > Does the first batch after restart contain all the messages received while > the job was down? > > On Tue

Re: CHAID Decision Trees

2015-08-25 Thread Feynman Liang
Nothing is in JIRA, so AFAIK no; only random forests and GBTs using entropy or Gini for information gain are supported. On Tue, Aug 25, 2015 at 9:39 AM, jatinpreet wrote: > Hi, > > I wish to know if M

How to unit test HiveContext without OutOfMemoryError (using sbt)

2015-08-25 Thread Mike Trienis
Hello, I am using sbt and created a unit test where I create a `HiveContext` and execute some query and then return. Each time I run the unit test the JVM will increase its memory usage until I get the error: Internal error when running tests: java.lang.OutOfMemoryError: PermGen space Exception

Re: Spark Streaming Checkpointing Restarts with 0 Event Batches

2015-08-25 Thread Cody Koeninger
Does the first batch after restart contain all the messages received while the job was down? On Tue, Aug 25, 2015 at 12:53 PM, suchenzang wrote: > Hello, > > I'm using direct spark streaming (from kafka) with checkpointing, and > everything works well until a restart. When I shut down (^C) the f

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-08-25 Thread Kristina Rogale Plazonic
YES PLEASE! :))) On Tue, Aug 25, 2015 at 1:57 PM, Burak Yavuz wrote: > Hmm. I have a lot of code on the local linear algebra operations using > Spark's Matrix and Vector representations > done for https://issues.apache.org/jira/browse/SPARK-6442. > > I can make a Spark package with that cod

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-08-25 Thread Burak Yavuz
Hmm. I have a lot of code on the local linear algebra operations using Spark's Matrix and Vector representations done for https://issues.apache.org/jira/browse/SPARK-6442. I can make a Spark package with that code if people are interested. Best, Burak On Tue, Aug 25, 2015 at 10:54 AM, Kristina R

SparkR: exported functions

2015-08-25 Thread Colin Gillespie
Hi, I've just started playing about with SparkR (Spark 1.4.1), and noticed that a number of the functions haven't been exported. For example, the textFile function https://github.com/apache/spark/blob/master/R/pkg/R/context.R isn't exported, i.e. the function isn't in the NAMESPACE file. This is

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-08-25 Thread Kristina Rogale Plazonic
> > However I do think it's easier than it seems to write the implicits; > it doesn't involve new classes or anything. Yes it's pretty much just > what you wrote. There is a class "Vector" in Spark. This declaration > can be in an object; you don't implement your own class. (Also you can > use "toB

Spark Streaming Checkpointing Restarts with 0 Event Batches

2015-08-25 Thread suchenzang
Hello, I'm using direct spark streaming (from kafka) with checkpointing, and everything works well until a restart. When I shut down (^C) the first streaming job, wait 1 minute, then re-submit, there is somehow a series of 0 event batches that get queued (corresponding to the 1 minute when the job

Re: Exclude slf4j-log4j12 from the classpath via spark-submit

2015-08-25 Thread Marcelo Vanzin
On Tue, Aug 25, 2015 at 10:48 AM, Utkarsh Sengar wrote: > Now I am going to try it out on our mesos cluster. > I assumed "spark.executor.extraClassPath" takes csv as jars the way "--jars" > takes it but it should be ":" separated like a regular classpath jar. Ah, yes, those options are just raw c

Re: Exclude slf4j-log4j12 from the classpath via spark-submit

2015-08-25 Thread Utkarsh Sengar
This worked for me locally: spark-1.4.1-bin-hadoop2.4/bin/spark-submit --conf spark.executor.extraClassPath=/.m2/repository/ch/qos/logback/logback-core/1.1.2/logback-core-1.1.2.jar:/.m2/repository/ch/qos/logback/logback-classic/1.1.2/logback-classic-1.1.2.jar --conf spark.driver.extraClassPath=/.m2

Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-25 Thread Nicholas Chammas
Final chance to fill out the survey! http://goo.gl/forms/erct2s6KRR I'm gonna close it to new responses tonight and send out a summary of the results. Nick On Thu, Aug 20, 2015 at 2:08 PM Nicholas Chammas wrote: > I'm planning to close the survey to further responses early next week. > > If y

Re: build spark 1.4.1 with JDK 1.6

2015-08-25 Thread Eric Friedman
Well, this is very strange. My only change is to add -X to make-distribution and it succeeds: % git diff (spark/spark) *diff --git a/make-distribution.sh b/make-distribution.sh* *index a2b0c43..351fac2 100755* *--- a/make-distribution.sh* *+++ b/make-dist

Re: Spark-Ec2 launch failed on starting httpd spark 141

2015-08-25 Thread Ted Yu
Looks like it is this PR: https://github.com/mesos/spark-ec2/pull/133 On Tue, Aug 25, 2015 at 9:52 AM, Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > Yeah thats a know issue and we have a PR out to fix it. > > Shivaram > > On Tue, Aug 25, 2015 at 7:39 AM, Garry Chen wrote: > > Hi A

Re: How to effieciently write sorted neighborhood in pyspark

2015-08-25 Thread shahid qadri
Any resources on this > On Aug 25, 2015, at 3:15 PM, shahid qadri wrote: > > I would like to implement sorted neighborhood approach in spark, what is the > best way to write that in pyspark. - To unsubscribe, e-mail: user-uns

Re: Local Spark talking to remote HDFS?

2015-08-25 Thread Steve Loughran
I wouldn't try to play with forwarding & tunnelling; always hard to work out what ports get used everywhere, and the services like hostname==URL in paths. Can't you just set up an entry in the windows /etc/hosts file? It's what I do (on Unix) to talk to VMs > On 25 Aug 2015, at 04:49, Dino Fan

Re: Spark (1.2.0) submit fails with exception saying log directory already exists

2015-08-25 Thread Marcelo Vanzin
This probably means your app is failing and the second attempt is hitting that issue. You may fix the "directory already exists" error by setting spark.eventLog.overwrite=true in your conf, but most probably that will just expose the actual error in your app. On Tue, Aug 25, 2015 at 9:37 AM, Varad

Re: Spark-Ec2 lunch failed on starting httpd spark 141

2015-08-25 Thread Shivaram Venkataraman
Yeah, that's a known issue and we have a PR out to fix it. Shivaram On Tue, Aug 25, 2015 at 7:39 AM, Garry Chen wrote: > Hi All, > > I am trying to launch a spark cluster on ec2 with spark 1.4.1 > version. The script finished but getting an error at the end as follows. > What should

[SQL/Hive] Trouble with refreshTable

2015-08-25 Thread Yana Kadiyska
I'm having trouble with refreshTable, I suspect because I'm using it incorrectly. I am doing the following: 1. Create DF from parquet path with wildcards, e.g. /foo/bar/*.parquet 2. use registerTempTable to register my dataframe 3. A new file is dropped under /foo/bar/ 4. Call hiveContext.refres
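The sequence in question, written out as a sketch (table and path names are placeholders, and a HiveContext named hiveContext is assumed):

    val df = hiveContext.read.parquet("/foo/bar/*.parquet")   // 1. DF from wildcard path
    df.registerTempTable("bar_data")                          // 2. register temp table
    // 3. a new parquet file is dropped under /foo/bar/
    hiveContext.refreshTable("bar_data")                      // 4. refresh
    hiveContext.sql("select count(*) from bar_data").show()   // 5. expect the new rows to appear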

Error:(46, 66) not found: type SparkFlumeProtocol

2015-08-25 Thread Muler
I'm trying to build Spark using Intellij on Windows. But I'm repeatedly getting this error spark-master\external\flume-sink\src\main\scala\org\apache\spark\streaming\flume\sink\SparkAvroCallbackHandler.scala Error:(46, 66) not found: type SparkFlumeProtocol val transactionTimeout: Int, val backO

Spark (1.2.0) submit fails with exception saying log directory already exists

2015-08-25 Thread Varadhan, Jawahar
Here is the error yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: Log directory hdfs://Sandbox/user/spark/applicationHistory/application_1438113296105_0302 already exists!) I am using cloudera 5.3.2 with Spark 1.2.0 Any help is appreciated. Th

CHAID Decision Trees

2015-08-25 Thread jatinpreet
Hi, I wish to know if MLlib supports CHAID regression and classification trees. If yes, how can I build them in Spark? Thanks, Jatin -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/CHAID-Decision-Trees-tp24449.html Sent from the Apache Spark User List mail

Re: How to access Spark UI through AWS

2015-08-25 Thread Kelly, Jonathan
I'm not sure why the UI appears broken like that either and haven't investigated it myself yet, but if you instead go to the YARN ResourceManager UI (port 8088 if you are using emr-4.x; port 9026 for 3.x, I believe), then you should be able to click on the ApplicationMaster link (or the History lin

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-08-25 Thread Sean Owen
Yes I get all that too and I think there's a legit question about whether moving a little further down the slippery slope is worth it and if so how far. The other catch here is: either you completely mimic another API (in which case why not just use it directly, which has its own problems) or you d

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-08-25 Thread Kristina Rogale Plazonic
> What about declaring a few simple implicit conversions between the > MLlib and Breeze Vector classes? if you import them then you should be > able to write a lot of the source code just as you imagine it, as if > the Breeze methods were available on the Vector object in MLlib. The problem is tha

Pyspark ImportError: No module named definitions

2015-08-25 Thread YaoPau
I have three modules: *join_driver.py* - driver, imports 'joined_paths_all', then calls some of joined_paths_all's functions for wrangling RDDs *joined_paths_all.py* - all wrangling functions for this project are defined here. Imports 'definitions' *definitions.py* - contains all my regex defi

Re: Spark RDD join with CassandraRDD

2015-08-25 Thread Matt Narrell
I would suggest converting your RDDs to Dataframes (or SchemaRDDs depending on your version) and performing a native join. mn > On Aug 25, 2015, at 9:22 AM, Priya Ch wrote: > > Hi All, > > I have the following scenario: > > There exists a booking table in cassandra, which holds the field
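A minimal sketch of that suggestion, assuming Booking is a case class, both sides share a bookingid column, and the RDD names are placeholders:

    import sqlContext.implicits._

    val bookingsDF = bookingsRDD.toDF()     // e.g. the RDD read from the Cassandra table
    val eventsDF   = streamBatchRDD.toDF()  // the RDD from the current streaming batch
    val joined     = bookingsDF.join(eventsDF, "bookingid")   // Spark 1.4+ usingColumn join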

Spark RDD join with CassandraRDD

2015-08-25 Thread Priya Ch
Hi All, I have the following scenario: There exists a booking table in cassandra, which holds fields like bookingid, passengeName, contact, etc. Now in my spark streaming application, there is one class Booking which acts as a container and holds all the field details - class Booking

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-08-25 Thread Sean Owen
Yes, you're right that it's quite on purpose to leave this API to Breeze, in the main. As you can see the Spark objects have already sprouted a few basic operations anyway; there's a slippery slope problem here. Why not addition, why not dot products, why not determinants, etc. What about declarin

DataFrame Parquet Writer doesn't keep schema

2015-08-25 Thread Petr Novak
Hi all, when I read parquet files with "required" fields aka nullable=false, they are read correctly. Then I save them (df.write.parquet) and read them again, and all my fields are saved and read as optional, aka nullable=true, which means I suddenly have files with incompatible schemas. This happens on 1.3.0

Re: Spark-Ec2 launch failed on starting httpd spark 141

2015-08-25 Thread Ted Yu
Corrected a typo in the subject of your email. What you cited seems to be from worker node startup. Was there other error you saw ? Please list the command you used. Cheers On Tue, Aug 25, 2015 at 7:39 AM, Garry Chen wrote: > Hi All, > > I am trying to lunch a spark cluster on

Spark-Ec2 lunch failed on starting httpd spark 141

2015-08-25 Thread Garry Chen
Hi All, I am trying to launch a spark cluster on ec2 with spark 1.4.1 version. The script finished but getting an error at the end as follows. What should I do to correct this issue? Thank you very much for your input. Starting httpd: httpd: Syntax error on line 199 of /etc/http

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-08-25 Thread Kristina Rogale Plazonic
Well, yes, the hack below works (that's all I have time for), but is not satisfactory - it is not safe, and is verbose and very cumbersome to use, does not separately deal with SparseVector case and is not complete either. My question is, out of hundreds of users on this list, someone must have co

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-08-25 Thread Sonal Goyal
From what I have understood, you probably need to convert your vector to breeze and do your operations there. Check stackoverflow.com/questions/28232829/addition-of-two-rddmllib-linalg-vectors On Aug 25, 2015 7:06 PM, "Kristina Rogale Plazonic" wrote: > Hi all, > > I'm still not clear what is th
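A minimal sketch of that round trip for dense vectors (the SparseVector case would need its own branch):

    import breeze.linalg.{DenseVector => BDV}
    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    def toBreeze(v: Vector): BDV[Double]    = new BDV(v.toArray)
    def fromBreeze(bv: BDV[Double]): Vector = Vectors.dense(bv.toArray)

    val a = Vectors.dense(1.0, 2.0, 3.0)
    val b = Vectors.dense(0.5, 0.5, 0.5)
    val sum  = fromBreeze(toBreeze(a) + toBreeze(b))   // [1.5, 2.5, 3.5]
    val diff = fromBreeze(toBreeze(a) - toBreeze(b))   // [0.5, 1.5, 2.5]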

Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-08-25 Thread Kristina Rogale Plazonic
Hi all, I'm still not clear what is the best (or, ANY) way to add/subtract two org.apache.spark.mllib.Vector objects in Scala. Ok, I understand there was a conscious Spark decision not to support linear algebra operations in Scala and leave it to the user to choose a linear algebra library. But,

Scala: Overload method by its class type

2015-08-25 Thread Saif.A.Ellafi
Hi all, I have SomeClass[TYPE] { def some_method(args: fixed_type_args): TYPE } And at runtime, I create instances of this class with different AnyVal + String types, but the return type of some_method varies. I know I could do this with an implicit object, IF some_method received a type, but
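One sketch of a typeclass-style answer (the argument type is assumed to be String purely for illustration), which lets the return type follow the type parameter without overloading:

    trait Extractor[T] { def extract(args: String): T }

    object Extractor {
      implicit val intExtractor: Extractor[Int] =
        new Extractor[Int] { def extract(args: String) = args.toInt }
      implicit val stringExtractor: Extractor[String] =
        new Extractor[String] { def extract(args: String) = args }
    }

    class SomeClass[T](implicit ev: Extractor[T]) {
      def some_method(args: String): T = ev.extract(args)
    }

    val n: Int    = (new SomeClass[Int]).some_method("42")
    val s: String = (new SomeClass[String]).some_method("42")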

SparkSQL saveAsParquetFile does not preserve AVRO schema

2015-08-25 Thread storm
Hi, I have serious problems with saving DataFrame as parquet file. I read the data from the parquet file like this: val df = sparkSqlCtx.parquetFile(inputFile.toString) and print the schema (you can see both fields are required) root |-- time: long (nullable = false) |-- time_ymdhms: long (n
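One possible workaround (a sketch, not a fix for the underlying behaviour; outputFile is an assumption): capture the schema before writing and re-impose it after reading the file back:

    val df = sparkSqlCtx.parquetFile(inputFile.toString)
    val originalSchema = df.schema                       // nullable = false is still recorded here
    df.saveAsParquetFile(outputFile.toString)

    val reread   = sparkSqlCtx.parquetFile(outputFile.toString)    // fields come back as nullable
    val restored = sparkSqlCtx.createDataFrame(reread.rdd, originalSchema)
    restored.printSchema()                               // reports the original required fields again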

Re: Exception when S3 path contains colons

2015-08-25 Thread Romi Kuntsman
Hello, We had the same problem. I've written a blog post with the detailed explanation and workaround: http://labs.totango.com/spark-read-file-with-colon/ Greetings, Romi K. On Tue, Aug 25, 2015 at 2:47 PM Gourav Sengupta wrote: > I am not quite sure about this but should the notation not be

using Convert function of sql in spark sql

2015-08-25 Thread Rajeshkumar J
Hi All, I want to use the Convert() function in one of my Spark SQL queries. Can anyone tell me whether it is supported or not?

Re: Performance - Python streaming v/s Scala streaming

2015-08-25 Thread Utkarsh Patkar
Thanks for the quick response. I have tried the direct word count python example and it also seems to be slow. Lot of times it is not fetching the words that are sent by the producer. I am using SPARK version 1.4.1 and KAFKA 2.10-0.8.2.0. On Tue, Aug 25, 2015 at 2:05 AM, Tathagata Das wrote: >

Re: Local Spark talking to remote HDFS?

2015-08-25 Thread Dino Fancellu
Tried adding 50010, 50020 and 50090. Still no difference. I can't imagine I'm the only person on the planet wanting to do this. Anyway, thanks for trying to help. Dino. On 25 August 2015 at 08:22, Roberto Congiu wrote: > Port 8020 is not the only port you need tunnelled for HDFS to work. If yo

Re: Exception when S3 path contains colons

2015-08-25 Thread Gourav Sengupta
I am not quite sure about this but should the notation not be s3n://redactedbucketname/* instead of s3a://redactedbucketname/* The best way is to use s3://<>/<>/* Regards, Gourav On Tue, Aug 25, 2015 at 10:35 AM, Akhil Das wrote: > You can change the names, whatever program that is pushing th

Re: Spark Streaming failing on YARN Cluster

2015-08-25 Thread Ramkumar V
Yes, when I see my yarn logs for that particular failed app_id, I got the following error: ERROR yarn.ApplicationMaster: SparkContext did not initialize after waiting for 10 ms. Please check earlier log output for errors. Failing the application. For this error, I need to change the 'SparkCon

Re: How to increase data scale in Spark SQL Perf

2015-08-25 Thread Ted Yu
Looks like you were attaching images to your email which didn't go through. Consider using third party site for images - or paste error in text. Cheers On Tue, Aug 25, 2015 at 4:22 AM, Todd wrote: > Hi, > The spark sql perf itself contains benchmark data generation. I am using > spark shell to

Select some data from Hive (SparkSQL) directly using NodeJS

2015-08-25 Thread Phakin Cheangkrachange
Hi, I just wonder if there's any way that I can get some sample data (10-20 rows) out of Spark's Hive using NodeJs? Submitting a spark job to show 20 rows of data in web page is not good for me. I've set up Spark Thrift Server as shown in Spark Doc. The server works because I can use *beeline* t

How to increase data scale in Spark SQL Perf

2015-08-25 Thread Todd
Hi, The spark sql perf itself contains benchmark data generation. I am using the spark shell to run spark sql perf to generate the data with 10G memory for both driver and executor. When I increase the scale factor to 30 and run the job, I get the following error: When I jstack it to s

Checkpointing in Iterative Graph Computation

2015-08-25 Thread sachintyagi22
Hi, I have stumbled upon an issue with iterative Graphx computation (using v 1.4.1). It goes thusly -- Setup 1. Construct a graph. 2. Validate that the graph satisfies certain conditions. Here I do some assert(*conditions*) within graph.triplets.foreach(). [Notice that this materializes the grap

Re: spark not launching in yarn-cluster mode

2015-08-25 Thread Jeetendra Gangele
When I am launching with yarn-client it is also giving me the below error: bin/spark-sql --master yarn-client 15/08/25 13:53:20 ERROR YarnClientSchedulerBackend: Yarn application has already exited with state FINISHED! Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: E

Re: Spark works with the data in another cluster(Elasticsearch)

2015-08-25 Thread gen tang
Great advice. Thanks a lot, Nick. In fact, if we use the rdd.persist(DISK) command at the beginning of the program to avoid hitting the network again and again, the speed is not influenced a lot. In my case, it is just 1 min more compared to the situation where we put the data in local HDFS. Cheers Gen

Re: Spark works with the data in another cluster(Elasticsearch)

2015-08-25 Thread Nick Pentreath
While it's true locality might speed things up, I'd say it's a very bad idea to mix your Spark and ES clusters - if your ES cluster is serving production queries (and in particular using aggregations), you'll run into performance issues on your production ES cluster. ES-hadoop uses ES scan &

Re: Spark Streaming: Some issues (Could not compute split, block —— not found) and questions

2015-08-25 Thread Akhil Das
You hit block not found issues when your processing time exceeds the batch duration (this happens with receiver-oriented streaming). If you are consuming messages from Kafka then try to use the directStream, or you can also set StorageLevel to MEMORY_AND_DISK with the receiver-oriented consumer. (This mi
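For the receiver-based path, the storage level is passed directly to createStream (a sketch; zkQuorum, groupId and topicMap are placeholders):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    val stream = KafkaUtils.createStream(
      ssc, zkQuorum, groupId, topicMap, StorageLevel.MEMORY_AND_DISK)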

Re: Spark works with the data in another cluster(Elasticsearch)

2015-08-25 Thread Akhil Das
If the data is local to the machine then obviously it will be faster compared to pulling it through the network and storing it locally (either memory or disk etc). Have a look at the data locality

How to effieciently write sorted neighborhood in pyspark

2015-08-25 Thread shahid qadri
I would like to implement sorted neighborhood approach in spark, what is the best way to write that in pyspark. - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.or

Re: Exception when S3 path contains colons

2015-08-25 Thread Akhil Das
You can change the names, whatever program that is pushing the record must follow the naming conventions. Try to replace : with _ or something. Thanks Best Regards On Tue, Aug 18, 2015 at 10:20 AM, Brian Stempin wrote: > Hi, > I'm running Spark on Amazon EMR (Spark 1.4.1, Hadoop 2.6.0). I'm se

Re: Using unserializable classes in tasks

2015-08-25 Thread Akhil Das
Instead of foreach, try to use foreachPartition; that will initialize the connector per partition rather than per record. Thanks Best Regards On Fri, Aug 14, 2015 at 1:13 PM, Dawid Wysakowicz < wysakowicz.da...@gmail.com> wrote: > No the connector does not need to be serializable cause it is con
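A sketch of that shape, with a hypothetical Connector (its name, send and close are assumptions) created on the executor once per partition instead of being serialized from the driver:

    rdd.foreachPartition { records =>
      val connector = new Connector()                     // hypothetical; built on the executor
      records.foreach(record => connector.send(record))   // reused for every record in the partition
      connector.close()
    }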

Re:Re: What does Attribute and AttributeReference mean in Spark SQL

2015-08-25 Thread Todd
Thank you Michael for the detailed explanation; it is clear to me now. Thanks! At 2015-08-25 15:37:54, "Michael Armbrust" wrote: Attribute is the Catalyst name for an input column from a child operator. An AttributeReference has been resolved, meaning we know which input column in particula

Re: spark not launching in yarn-cluster mode

2015-08-25 Thread Yanbo Liang
spark-shell and spark-sql cannot be deployed in "yarn-cluster" mode, because the spark-shell and spark-sql scripts need to run on your local machine rather than in a container of the YARN cluster. 2015-08-25 16:19 GMT+08:00 Jeetendra Gangele : > Hi All i am trying to launch the spark shell with --m

Re: Loading already existing tables in spark shell

2015-08-25 Thread Ishwardeep Singh
Hi Jeetendra, Please try the following in the spark shell. It is like executing an sql command. sqlContext.sql("use ") Regards, Ishwardeep From: Jeetendra Gangele Sent: Tuesday, August 25, 2015 12:57 PM To: Ishwardeep Singh Cc: user Subject: Re: Loading alre
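A short sketch of that in the shell, assuming a database named mydb and a table named bookings:

    sqlContext.sql("use mydb")
    sqlContext.sql("show tables").show()
    sqlContext.table("bookings").show(5)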

Re:Re: Exception throws when running spark pi in Intellij Idea that scala.collection.Seq is not found

2015-08-25 Thread Todd
Thank you guys. Yes, I have fixed the guava, spark core, scala and jetty dependencies, and I can run Pi now. At 2015-08-25 15:28:51, "Jeff Zhang" wrote: As I remember, you also need to change the guava and jetty related dependencies to compile scope if you want to run SparkPi in intellij. On Tue, Aug

spark not launching in yarn-cluster mode

2015-08-25 Thread Jeetendra Gangele
Hi All, I am trying to launch the spark shell with --master yarn-cluster and it's giving the below error. Why is this not supported? bin/spark-sql --master yarn-cluster Error: Cluster deploy mode is not applicable to Spark SQL shell. Run with --help for usage help or --verbose for debug output Regards Je
