Re: thought experiment: use spark ML to real time prediction

2015-11-12 Thread Felix Cheung
+1 on that. It would be useful to use the model outside of Spark. _ From: DB Tsai Sent: Wednesday, November 11, 2015 11:57 PM Subject: Re: thought experiment: use spark ML to real time prediction To: Nirmal Fernando Cc: Andy Davidson , Adrian Tanase , user @spa

Re: How can you sort wordcounts by counts in stateful_network_wordcount.py example

2015-11-12 Thread Amir Rahnama
Hi Ayan, Thanks for the help. Your example is not the streaming one; there we don't have sortByKey. On Wed, Nov 11, 2015 at 11:35 PM, ayan guha wrote: > how about this? > > sorted = running_counts.map(lambda t: (t[1], t[0])).sortByKey() > > Basically swap key and value of the RDD and then sort?

Re: Spark task hangs infinitely when accessing S3 from AWS

2015-11-12 Thread aecc
Any hints?

Re: NullPointerException with joda time

2015-11-12 Thread Romain Sagean
I still can't make the logger work inside a map function. I can use "logInfo("")" in the main but not in the function. Anyway, I rewrote my program to use java.util.Date instead of joda time and I don't have the NPE anymore. I will stick with this solution for the moment, even if I find java Date ugly. Th

Re: Spark task hangs infinitely when accessing S3 from AWS

2015-11-12 Thread Michael Cutler
Reading files directly from Amazon S3 can be frustrating, especially if you're dealing with a large number of input files. Could you please elaborate more on your use-case? Does the S3 bucket in question already contain a large number of files? The implementation of the * wildcard operator in S3 i

Re: Spark task hangs infinitely when accessing S3 from AWS

2015-11-12 Thread Alessandro Chacón
Hi Michael, Thanks for your answer. My path is exactly as you mention: s3://my-bucket/// /*.avro I'm certainly not using wildcards in any part besides the date, so I don't think that could be the issue. The weird thing is that on top of the same data set, randomly in 1 of every 20 jobs one

Re: Spark task hangs infinitely when accessing S3 from AWS

2015-11-12 Thread aecc
Some other stats: The number of files I have in the folder is 48. The number of partitions used when reading data is 7315. The maximum size of a file to read is 14G. The size of the folder is around 270G.

Partitioned Parquet based external table

2015-11-12 Thread Chandra Mohan, Ananda Vel Murugan
Hi, I am using Spark 1.5.1. https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQL.java. I have slightly modified this example to create a partitioned parquet file. Instead of this line schemaPeople.write().parquet("people.parquet"); I use t
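For context, writing a partitioned Parquet file with the DataFrame API (Spark 1.4+) looks roughly like the sketch below; the "age" partition column and the output path are illustrative assumptions, not the poster's actual code:

    // partitionBy writes one sub-directory per distinct value, e.g. .../age=30/part-*.parquet
    schemaPeople.write
      .partitionBy("age")
      .parquet("hdfs:///data/people.parquet")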

Re: thought experiment: use spark ML to real time prediction

2015-11-12 Thread Sean Owen
This is all starting to sound a lot like what's already implemented in Java-based PMML parsing/scoring libraries like JPMML and OpenScoring. I'm not clear it helps a lot to reimplement this in Spark. On Thu, Nov 12, 2015 at 8:05 AM, Felix Cheung wrote: > +1 on that. It would be useful to use the

SparkR cannot handle double-byte characters

2015-11-12 Thread Shige Song
Dear All, I am running Spark 1.5.2 via the SparkR front-end. SparkR returned error messages when I tried to process a very simple toy data set with some Chinese characters in it. The error message looks like this: "> head(df) Error in rawToChar(string) : embedded nul in string: '\xd5\003\024\006

Re: thought experiment: use spark ML to real time prediction

2015-11-12 Thread Nick Pentreath
Yup, currently PMML export, or Java serialization, are the options realistically available. Though PMML may deter some, there are not many viable cross-platform alternatives (with nearly as much coverage). On Thu, Nov 12, 2015 at 1:42 PM, Sean Owen wrote: > This is all starting to sound a lot l

Re: Partitioned Parquet based external table

2015-11-12 Thread Michal Klos
You must add the partitions to the Hive table with something like "alter table your_table add if not exists partition (country='us');". If you have dynamic partitioning turned on, you can do 'msck repair table your_table' to recover the partitions. I would recommend reviewing the Hive document
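In HiveQL, the two approaches mentioned above look roughly like the sketch below (table and partition names are illustrative); either can be issued through a HiveContext:

    hiveContext.sql("ALTER TABLE people ADD IF NOT EXISTS PARTITION (country='us')")
    // or, to scan the table location and recover all partitions at once:
    hiveContext.sql("MSCK REPAIR TABLE people")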

Issue with Spark-redshift

2015-11-12 Thread Hafiz Mujadid
Hi all! I am trying to read data from a Redshift table using the spark-redshift project. Here is my code val conf = new SparkConf().setAppName("TestApp").setMaster("local[*]") val sc = new SparkContext(conf) val sqlContext = new SQLContext(sc) val df1:
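The code above is cut off by the archive; for reference, a typical spark-redshift read per the project's documented options looks like the sketch below, where the JDBC URL, table name, and S3 tempdir are placeholders:

    val df1 = sqlContext.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://host:5439/mydb?user=u&password=p")  // placeholder
      .option("dbtable", "my_table")                                      // placeholder
      .option("tempdir", "s3n://my-bucket/tmp")                           // placeholder, used for UNLOAD
      .load()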

large, dense matrix multiplication

2015-11-12 Thread Eilidh Troup
Hi, I’m trying to multiply a large squarish matrix with its transpose. Eventually I’d like to work with matrices of size 200,000 by 500,000, but I’ve started off first with 100 by 100 which was fine, and then with 10,000 by 10,000 which failed with an out of memory exception. I used MLlib and
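One way to attempt a product A * A^T at this scale is MLlib's distributed BlockMatrix, which multiplies block-by-block; a minimal Scala sketch with made-up dimensions and block sizes:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

    // build a distributed matrix from (rowIndex, vector) records
    val rows = sc.parallelize(0L until 10000L).map { i =>
      IndexedRow(i, Vectors.dense(Array.fill(10000)(util.Random.nextDouble())))
    }
    val a = new IndexedRowMatrix(rows).toBlockMatrix(1024, 1024).cache()
    val gram = a.multiply(a.transpose)  // A * A^T, computed one block pair at a time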

Conf Settings in Mesos

2015-11-12 Thread John Omernik
Hey all, I noticed today that if I use a tgz as my URI for Mesos, I have to repackage it with the conf settings from where I execute, say, pyspark, for the executors to have the right configuration settings. That is... If I take a "stock" tgz from make-distribution.sh, unpack it, and then set

PMML export for Decision Trees

2015-11-12 Thread Niki Pavlopoulou
Hi all, I hope you are well! I would like to know if the PMML export for Decision Trees is planned for Spark 1.6. Regards, Niki.

NPE in Spark Running on Mesos in Fine-grained Mode

2015-11-12 Thread John Omernik
I have stumbled across an interesting (potential) bug. I have an environment that is MapR FS and Mesos. I've posted a bit in the past around getting this setup to work with Spark, Mesos, and MapR, and the Spark community has been helpful. In 1.4.1, I was able to get Spark working in this setup vi

RE: Partitioned Parquet based external table

2015-11-12 Thread Chandra Mohan, Ananda Vel Murugan
Thank you. It works perfectly fine. I enabled dynamic partitioning on my table and then fired “msck repair table your_table” and it works now Regards, Anand.C From: Michal Klos [mailto:michal.klo...@gmail.com] Sent: Thursday, November 12, 2015 6:32 PM To: Chandra Mohan, Ananda Vel Murugan Cc: user

Re: NullPointerException with joda time

2015-11-12 Thread Ted Yu
Even if log4j didn't work, you can still get some clue by wrapping the following call in a try block: currentDate = currentDate.plusDays(1) catching the NPE and rethrowing an exception that shows the value of currentDate Cheers On Thu, Nov 12, 2015 at 1:56 AM, Romain Sagean wrote: >
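i.e., something along these lines (a sketch; the variable name follows the thread):

    try {
      currentDate = currentDate.plusDays(1)
    } catch {
      case npe: NullPointerException =>
        // surface the offending value instead of a bare NPE from inside the task
        throw new IllegalStateException(s"plusDays failed, currentDate = $currentDate", npe)
    }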

metastore_db

2015-11-12 Thread Younes Naguib
Hi all, Is there any documentation on how to set up metastore_db on MySQL in Spark? I did find a load of information, but it all seems to be some "hack" for Spark. Thanks Younes
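For reference, the standard (Hive, not Spark-specific) approach is a hive-site.xml in SPARK_HOME/conf pointing the metastore at MySQL; a sketch with placeholder host and credentials (the MySQL JDBC driver jar must also be on the classpath):

    <configuration>
      <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://mysql-host:3306/metastore?createDatabaseIfNotExist=true</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.jdbc.Driver</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>hive</value>
      </property>
      <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>secret</value>
      </property>
    </configuration>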

RE: How can you sort wordcounts by counts in stateful_network_wordcount.py example

2015-11-12 Thread Young, Matthew T
You can use foreachRDD to get access to the batch API in streaming jobs. From: Amir Rahnama [mailto:amirrahn...@gmail.com] Sent: Thursday, November 12, 2015 12:11 AM To: ayan guha Cc: user Subj

Re: NullPointerException with joda time

2015-11-12 Thread Koert Kuipers
I remember us having issues with joda classes not serializing properly and coming out null "on the other side" in tasks On Thu, Nov 12, 2015 at 10:12 AM, Ted Yu wrote: > Even if log4j didn't work, you can still get some clue by wrapping the > following call with try block: > > currentDate

In Spark application, how to get the passed in configuration?

2015-11-12 Thread java8964
In my Spark application, I want to access the passed-in configuration, but it doesn't work. How should I do that? object myCode extends Logging { // starting point of the application def main(args: Array[String]): Unit = { val sparkContext = new SparkContext() val runtimeEnvironment = sp

graphx - trianglecount of 2B edges

2015-11-12 Thread vinodma
I was attempting to use the GraphX triangle count method on a 2B-edge graph (Friendster dataset on SNAP) and running into an out of memory issue. I have access to a 60-node cluster with 90GB memory and 30 vcores per node. I am using 1000 partitions and the RandomVertexCut. Here’s my submit
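For reference, the shape of such a job in GraphX (the path and partition count are placeholders; TriangleCount in 1.x expects srcId < dstId, hence canonicalOrientation):

    import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}

    val graph = GraphLoader
      .edgeListFile(sc, "hdfs:///data/friendster-edges.txt",  // placeholder path
        canonicalOrientation = true, numEdgePartitions = 1000)
      .partitionBy(PartitionStrategy.RandomVertexCut)
    val triangles = graph.triangleCount().vertices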

Use existing Hive- and SparkContext with SparkR

2015-11-12 Thread Tobias Bockrath
Hello, I developed a Spark application for executing ad-hoc queries on in-memory Hive tables, and am therefore using a Spark and a Hive context. During the startup process of the application, some Hive tables are loaded in-memory within the Hive context automatically. By using the sql() method

Re: Kafka Direct does not recover automatically when the Kafka Stream gets messed up?

2015-11-12 Thread Cody Koeninger
To be blunt, if you care about being able to recover from weird situations, you should be tracking offsets yourself and specifying offsets on job start, not relying on checkpoints. On Tue, Nov 10, 2015 at 3:54 AM, Adrian Tanase wrote: > I’ve seen this before during an extreme outage on the clust
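i.e., persist offsets yourself (ZooKeeper, a database, etc.) after each batch and pass them back on restart through the fromOffsets variant of createDirectStream. A sketch, given a StreamingContext ssc, where the broker, topic, partition, and offset values are placeholders:

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    val kafkaParams = Map("metadata.broker.list" -> "broker:9092")  // placeholder
    // offsets recorded by your own bookkeeping on the last successful batch
    val fromOffsets = Map(TopicAndPartition("events", 0) -> 12345L)
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))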

RE: Cassandra via SparkSQL/Hive JDBC

2015-11-12 Thread Mohammed Guller
Did you mean Hive or Spark SQL JDBC/ODBC server? Mohammed From: Bryan Jeffrey [mailto:bryan.jeff...@gmail.com] Sent: Thursday, November 12, 2015 9:12 AM To: Mohammed Guller Cc: user Subject: Re: Cassandra via SparkSQL/Hive JDBC Mohammed, That is great. It looks like a perfect scenario. Would I

Re: Cassandra via SparkSQL/Hive JDBC

2015-11-12 Thread Bryan Jeffrey
Mohammed, That is great. It looks like a perfect scenario. Would I be able to make the created DF queryable over the Hive JDBC/ODBC server? Regards, Bryan Jeffrey On Wed, Nov 11, 2015 at 9:34 PM, Mohammed Guller wrote: > Short answer: yes. > > > > The Spark Cassandra Connector supports the d

Issue with spark on hive

2015-11-12 Thread rugalcrimson
I referenced Hive on Spark: Getting Started to compile and configure my Spark (1.5.1) and Hive (1.2.1) and executed a query in the Hive CLI, then I got the following error in hive.log: 15/11/13 09:01:37 [stderr-redir

Re: In Spark application, how to get the passed in configuration?

2015-11-12 Thread varun sharma
You must be getting a warning at the start of the application like: Warning: Ignoring non-spark config property: runtime.environment=passInValue. Configs in Spark should start with the *spark.* prefix. So try something like --conf spark.runtime.environment=passInValue. Regards Varun On Thu, Nov 12,
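i.e., roughly the following (the property name follows the thread; the rest is illustrative):

    // spark-submit --conf spark.runtime.environment=passInValue ...
    val sparkContext = new SparkContext()
    val runtimeEnvironment = sparkContext.getConf.get("spark.runtime.environment")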

Re: Upgrading Spark in EC2 clusters

2015-11-12 Thread Nicholas Chammas
spark-ec2 does not offer a way to upgrade an existing cluster, and from what I gather, it wasn't intended to be used to manage long-lasting infrastructure. The recommended approach really is to just destroy your existing cluster and launch a new one with the desired configuration. If you want to u

Re: Cassandra via SparkSQL/Hive JDBC

2015-11-12 Thread Bryan Jeffrey
Yes, I do - I found your example of doing that later in your slides. Thank you for your help! On Thu, Nov 12, 2015 at 12:20 PM, Mohammed Guller wrote: > Did you mean Hive or Spark SQL JDBC/ODBC server? > > > > Mohammed > > > > *From:* Bryan Jeffrey [mailto:bryan.jeff...@gmail.com] > *Sent:* Thu

Re: Upgrading Spark in EC2 clusters

2015-11-12 Thread Jason Rubenstein
Hi, With some minor changes to spark-ec2/spark/init.sh and writing your own "upgrade-spark.sh" script, you can upgrade Spark in place. (Make sure to call not only spark/init.sh but also spark/setup.sh, because the latter uses copy-dir to get your new version of Spark to the slaves.) I wrote one

RE: In Spark application, how to get the passed in configuration?

2015-11-12 Thread java8964
Thanks, it looks like the config has to start with "spark", a very interesting requirement. I am using Spark 1.3.1; I didn't see this warning in the console. Thanks for your help. Yong Date: Thu, 12 Nov 2015 23:03:12 +0530 Subject: Re: In Spark application, how to get the passed in configurat

Re: Cassandra via SparkSQL/Hive JDBC

2015-11-12 Thread Bryan Jeffrey
Mohammed, While you're willing to answer questions, is there a trick to getting the Hive Thrift server to connect to remote Cassandra instances? 0: jdbc:hive2://localhost:1> SET spark.cassandra.connection.host="cassandrahost"; SET spark.cassandra.connection.host="cassandrahost"; +

Re: Cassandra via SparkSQL/Hive JDBC

2015-11-12 Thread Bryan Jeffrey
Answer: In beeline run the following: SET spark.cassandra.connection.host="10.0.0.10" On Thu, Nov 12, 2015 at 1:13 PM, Bryan Jeffrey wrote: > Mohammed, > > While you're willing to answer questions, is there a trick to getting the > Hive Thrift server to connect to remote Cassandra instances? > >

Checkpointing with Kinesis hangs with socket timeouts when driver is relaunched while transforming on a 0 event batch

2015-11-12 Thread Hster Geguri
Hello everyone, We are testing checkpointing against YARN 2.7.1 with Spark 1.5. We are trying to make sure checkpointing works with orderly shutdowns (i.e. yarn application -kill) and unexpected shutdowns, which we simulate with a kill -9. If there is anyone who has successfully tested failover re

RE: Cassandra via SparkSQL/Hive JDBC

2015-11-12 Thread Mohammed Guller
Hi Bryan, Yes, you can query a real Cassandra cluster. You just need to provide the address of the Cassandra seed node. Looks like you figured out the answer. You can also put the C* seed node address in the spark-defaults.conf file under the SPARK_HOME/conf directory. Then you don’t need to m
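i.e., a single line in the defaults file (the address is a placeholder):

    # $SPARK_HOME/conf/spark-defaults.conf
    spark.cassandra.connection.host   10.0.0.10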

Re: thought experiment: use spark ML to real time prediction

2015-11-12 Thread DB Tsai
I think the use-case can be quite different from PMML. By having a Spark-platform-independent ML jar, this can empower users to do the following: 1) PMML doesn't contain all the models we have in mllib. Also, for an ML pipeline trained by Spark, most of the time PMML is not expressive enough to do al

Re: Cassandra via SparkSQL/Hive JDBC

2015-11-12 Thread Bryan Jeffrey
I hesitate to ask further questions, but your assistance is advancing my work much faster than extensive fiddling might. I am seeing the following error when querying: 0: jdbc:hive2://localhost:1> create temporary table cassandraeventcounts using org.apache.spark.sql.cassandra OPTIONS ( keysp
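The statement above is truncated by the archive; per the Spark Cassandra Connector documentation, its full form is roughly the following, with placeholder keyspace and table names:

    CREATE TEMPORARY TABLE cassandraeventcounts
    USING org.apache.spark.sql.cassandra
    OPTIONS (keyspace "my_keyspace", table "eventcounts");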

Re: Upgrading Spark in EC2 clusters

2015-11-12 Thread Augustus Hong
Thanks for the info and the tip! I'll look into writing our own script based on the spark-ec2 scripts. Best, Augustus On Thu, Nov 12, 2015 at 10:01 AM, Jason Rubenstein < jasondrubenst...@gmail.com> wrote: > Hi, > > With some minor changes to spark-ec2/spark/init.sh and writing your own > "upg

RE: Cassandra via SparkSQL/Hive JDBC

2015-11-12 Thread Mohammed Guller
No worries. Happy to help. I don’t think the 1.5 version of the Spark Cassandra connector has been officially released yet. In any case 1.5.0-M1 has been replaced by 1.5.0-M2. Moreover, this version is meant for Spark 1.5.x. Since you are using Spark 1.4, why not use the v1.4 of the SCC? Did yo

Powered by Spark page

2015-11-12 Thread Nate Kupp
Hello, We are using Spark at Thumbtack for all of our big data work. Would love to get added to the powered by spark page: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark Requested info: *Organization name:* Thumbtack *URL:* thumbtack.com *Spark components:* Spark Core, Spark

RE: thought experiment: use spark ML to real time prediction

2015-11-12 Thread Kothuvatiparambil, Viju
I am glad to see DB’s comments; they make me feel I am not the only one facing these issues. If we were able to use MLlib to load the model in web applications (outside the Spark cluster), that would solve the issue. I understand Spark is mainly for processing big data in a distributed mode. But

RE: thought experiment: use spark ML to real time prediction

2015-11-12 Thread darren
I agree 100%. Making the model requires large data and many CPUs. Using it does not. This is a very useful side effect of ML models. If MLlib can't use models outside Spark, that's a real shame. Sent from my Verizon Wireless 4G LTE smartphone Original message From: "Kothuvat

Re: Partitioned Parquet based external table

2015-11-12 Thread Michael Armbrust
Note that if you read in the table using sqlContext.read.parquet(...) or if you use saveAsTable(...) the partitions will be auto-discovered. However, this is not compatible with Hive if you also want to be able to read the data there. On Thu, Nov 12, 2015 at 6:23 AM, Chandra Mohan, Ananda Vel Mur
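i.e., reading the table's root directory is enough; partition columns are inferred from the directory layout (the path is illustrative):

    // .../people.parquet/age=30/... becomes an "age" column automatically
    val people = sqlContext.read.parquet("hdfs:///data/people.parquet")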

Re: NoSuchElementException: key not found

2015-11-12 Thread Ankush Khanna
Hi, Well, I was finally able to figure it out. I was using VectorIndexer with max categories 2 (the min value is also 2) for my features; with an increased dimension of the feature vector I ran into the "no such element found" problem in VectorIndexer. It sounds a bit straightforward now, but I wa

Re: How can you sort wordcounts by counts in stateful_network_wordcount.py example

2015-11-12 Thread Amir Rahnama
Thanks Matthew. According to the documentation, foreachRDD is used for writing, right? I did it like this and it worked: address_by_destination = address.updateStateByKey(updateFunction).transform(lambda rdd: rdd.sortBy(lambda x: x[1], False)) Thanks anyway :) On Thu, Nov 12, 2015 at 4:53 PM

HiveServer2 Thrift OOM

2015-11-12 Thread Yana Kadiyska
Hi folks, I'm starting a HiveServer2 from a HiveContext (HiveThriftServer2.startWithContext(hiveContext)) and then connecting to it via beeline. On the server side, I see the below error, which I think is related to https://issues.apache.org/jira/browse/HIVE-6468 But I'd like to know: 1. why I

RE: HiveServer2 Thrift OOM

2015-11-12 Thread Cheng, Hao
OOM can occur in any place; if most of the memory is used by some `defect`, the exception stack probably doesn’t show the real problem. Most of the time this occurs in the ThriftServer, as far as I know, when people are trying to collect a huge result set; can you confirm that? If it falls into this category, pro

Re: spark-1.5.1 application detail ui url

2015-11-12 Thread Rastan Boroujerdi
I'm having the same issue after upgrading to 1.5.1. This is just an issue with the link on the master UI and I can still access the running application fine via the hostname of the driver. Thanks, Rastan On Thu, Oct 29, 2015 at 7:16 AM, Jean-Baptiste Onofré wrote: > Hi, > > The running applicat

Re: thought experiment: use spark ML to real time prediction

2015-11-12 Thread Andy Davidson
+1 Andy From: darren Date: Thursday, November 12, 2015 at 12:34 PM To: "Kothuvatiparambil, Viju" , DB Tsai , Sean Owen Cc: Felix Cheung , Nirmal Fernando , Andrew Davidson , Adrian Tanase , "user @spark" , Xiangrui Meng , "hol...@pigscanfly.ca" Subject: RE: thought experiment: use spark M

problem with spark.unsafe.offHeap & spark.sql.tungsten.enabled

2015-11-12 Thread tyronecai
Hi, all: I tested spark-1.5.*-bin-hadoop2.6 and found this problem; it’s easy to reproduce. Environment: OS: CentOS release 6.5 (Final) 2.6.32-431.el6.x86_64 JVM: java version "1.7.0_60" Java(TM) SE Runtime Environment (build 1.7.0_60-b19) Java HotSpot(TM) 64-Bit Server VM (build 24.60-b09, mixed

Re: Why there's no api for SparkContext#textFiles to support multiple inputs ?

2015-11-12 Thread Jeff Zhang
Didn't notice that I can pass a comma-separated path in the existing API (SparkContext#textFile), so there's no need for a new API. Thanks all. On Thu, Nov 12, 2015 at 10:24 AM, Jeff Zhang wrote: > Hi Pradeep > > >>> Looks like what I was suggesting doesn't work. :/ > I guess you mean put comma separ

Re: HiveServer2 Thrift OOM

2015-11-12 Thread Yana Kadiyska
Nope, in my case I see it pretty much as soon as the beeline client issues the connect statement. Queries I've run so far are of the "show table" variety and also "select count(*) from mytable" -- i.e. nothing that serializes large amounts of data On Thu, Nov 12, 2015 at 7:44 PM, Cheng, Hao wrot

Re: problem with spark.unsafe.offHeap & spark.sql.tungsten.enabled

2015-11-12 Thread Ted Yu
I tried with master branch. scala> sc.getConf.getAll.foreach(println) (spark.executor.id,driver) (spark.driver.memory,16g) (spark.unsafe.offHeap,true) (spark.driver.host,172.18.128.12) (spark.repl.class.uri,http://172.18.128.12:59780) (spark.sql.tungsten.enabled,true) (spark.fileserver.uri,http://

Re: thought experiment: use spark ML to real time prediction

2015-11-12 Thread Nirmal Fernando
On Fri, Nov 13, 2015 at 2:04 AM, darren wrote: > I agree 100%. Making the model requires large data and many cpus. > > Using it does not. > > This is a very useful side effect of ML models. > > If mlib can't use models outside spark that's a real shame. > Well you can as mentioned earlier. You d

RE: Partitioned Parquet based external table

2015-11-12 Thread Chandra Mohan, Ananda Vel Murugan
My primary interface for accessing the data is going to be Hive. I am planning to use Spark to ingest data (in the future I will use Spark Streaming, but for now it is just Spark SQL). Another group will analyze this data using Hive queries. For this scenario, the earlier suggestion seems to work. Regards

Re: thought experiment: use spark ML to real time prediction

2015-11-12 Thread DB Tsai
This will bring the whole dependencies of spark will may break the web app. Sincerely, DB Tsai -- Web: https://www.dbtsai.com PGP Key ID: 0xAF08DF8D On Thu, Nov 12, 2015 at 8:15 PM, Nirmal Fernando wrote: > > > On Fri, Nov 13, 2015 at 2: