RE: Get the previous state string in Spark streaming

2015-10-19 Thread Chandra Mohan, Ananda Vel Murugan
Thank you, we got it fixed. Regards, Anand.C From: Tathagata Das [mailto:t...@databricks.com] Sent: Friday, October 16, 2015 3:38 PM To: Chandra Mohan, Ananda Vel Murugan Cc: user Subject: Re: Get the previous state string in Spark streaming A simple Javadoc look up in Java List

master dies and worker registration fails with duplicated worker id

2015-10-19 Thread ZhuGe
Hi all: We met a series of weird problems in our standalone cluster with 2 masters (ZK election agent). Q1: First, we find the active master would lose leadership at some point and shut itself down. [INFO 2015-10-17 13:00:15 (ClientCnxn.java:1083)] Client session timed out, have not heard from serv

best way to generate per key auto increment numerals after sorting

2015-10-19 Thread fahad shah
Hi, I wanted to ask what's the best way to achieve per-key auto-increment numerals after sorting, e.g.:

raw file:

    1,a,b,c,1,1
    1,a,b,d,0,0
    1,a,b,e,1,0
    2,a,e,c,0,0
    2,a,f,d,1,0

post-output (the last column is the position number after grouping on first three fields and reverse sorting on last tw
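A minimal Scala sketch of one way to do this (the thread itself is PySpark, so treat this as an outline): group on the first three fields, reverse-sort each group on the last two, and number positions with zipWithIndex. The input path and parsing are assumptions.

    // Group on fields 0-2, reverse-sort each group on fields 4-5,
    // then assign a per-key position number.
    val indexed = sc.textFile("raw.csv")                   // hypothetical path
      .map(_.split(","))
      .map(f => ((f(0), f(1), f(2)), (f(3), f(4).toInt, f(5).toInt)))
      .groupByKey()
      .flatMap { case (key, values) =>
        values.toSeq
          .sortBy { case (_, a, b) => (-a, -b) }           // reverse sort on last two fields
          .zipWithIndex                                    // 0-based position within the group
          .map { case (value, pos) => (key, value, pos) }
      }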

RE: Should I convert json into parquet?

2015-10-19 Thread Ewan Leith
As Jörn says, Parquet and ORC will get you really good compression and can be much faster. There are also some nice additions around predicate pushdown, which can be great if you've got wide tables. Parquet is obviously easier to use, since it's bundled into Spark. Using ORC is described here http:
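For reference, a minimal sketch of the JSON-to-Parquet conversion in Spark 1.5 (paths are hypothetical):

    // Read JSON (schema is inferred), then write the same data as Parquet.
    val events = sqlContext.read.json("hdfs:///data/events.json")
    events.write.parquet("hdfs:///data/events.parquet")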

RE: Spark Streaming - use the data in different jobs

2015-10-19 Thread Ewan Leith
Storing the data in HBase, Cassandra, or similar is possibly the right answer; the other option that can work well is re-publishing the data back into a second queue on RabbitMQ, to be read again by the next job. Thanks, Ewan From: Oded Maimon [mailto:o...@scene53.com] Sent: 18 October 2015 12:49
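A minimal sketch of that republish pattern with the RabbitMQ Java client, assuming stream is a DStream[String] and that the host and queue (names hypothetical) already exist:

    import com.rabbitmq.client.ConnectionFactory

    // Publish each record of every batch back to a second queue,
    // one connection per partition to avoid serializing the client.
    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val factory = new ConnectionFactory()
        factory.setHost("rabbit-host")                    // hypothetical host
        val connection = factory.newConnection()
        val channel = connection.createChannel()
        records.foreach { msg =>
          channel.basicPublish("", "stage2-queue", null, msg.getBytes("UTF-8"))
        }
        channel.close()
        connection.close()
      }
    }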

Re: Incrementally add/remove vertices in GraphX

2015-10-19 Thread mas
Dear All, Any update regarding graph streaming? I want to update, i.e., add vertices and edges after creation of the graph. Any suggestions or recommendations on how to do that? Thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Incrementally-add-remove-vertic
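GraphX graphs are immutable and there is no incremental-update API; the usual workaround is to union new vertex/edge RDDs with the existing ones and rebuild. A minimal sketch, assuming an existing graph: Graph[String, Int]:

    import org.apache.spark.graphx.{Edge, Graph, VertexId}

    // "Adding" vertices/edges means building a new Graph from unioned RDDs.
    val newVertices = sc.parallelize(Seq((100L: VertexId, "new-node")))
    val newEdges    = sc.parallelize(Seq(Edge(1L, 100L, 1)))
    val updated = Graph(
      graph.vertices.union(newVertices),
      graph.edges.union(newEdges))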

Is one batch created by Streaming Context always equal to one RDD?

2015-10-19 Thread vaibhavrtk
-- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Is-one-batch-created-by-Streaming-Context-always-equal-to-one-RDD-tp25117.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

How to give user jars precedence over Spark jars

2015-10-19 Thread YiZhi Liu
I'm trying to read a Thrift object from a SequenceFile, using elephant-bird's ThriftWritable. My code looks like:

    val rawData = sc.sequenceFile[BooleanWritable, ThriftWritable[TrainingSample]](input)
    val samples = rawData.map { case (key, value) => {
      value.setConverter(classOf[TrainingSample])
      va

Spark executor on Mesos - how to set effective user id?

2015-10-19 Thread Eugene Chepurniy
Hi everyone! While we are trying to utilize a Spark on Mesos cluster, we are facing an issue related to the effective Linux user id being used to start executors on Mesos slaves: all executors are trying to use the driver's Linux user id to start on Mesos slaves. Let me explain in detail: the spark driver progr

spark streaming failing to replicate blocks

2015-10-19 Thread Eugen Cepoi
Hi, I am running Spark Streaming 1.4.1 on EMR (AMI 3.9) over YARN. The job is reading data from Kinesis and the batch interval is 30s (I used the same value for the Kinesis checkpointing). In the executor logs I can see every 5 seconds a sequence of stacktraces indicating that the block replication

Re: Spark executor on Mesos - how to set effective user id?

2015-10-19 Thread Jerry Lam
Can you try setting SPARK_USER at the driver? It is used to impersonate users at the executor. So if you have a user set up for launching spark jobs on the executor machines, simply set it to that user name for SPARK_USER. There is another configuration that prevents jobs from being launched with a

Re: Spark Streaming - use the data in different jobs

2015-10-19 Thread Adrian Tanase
+1 for re-publishing to pubsub if there is only transient value in the data. If you need to query the intermediate representation then you will need to use a database. Sharing RDDs in memory should be possible with projects like Spark Job Server, but I think that’s overkill in this scenario. La

Re: Spark executor on Mesos - how to set effective user id?

2015-10-19 Thread SLiZn Liu
Hi Jerry, I think you are referring to --no-switch_user. =) chiling...@gmail.com> wrote on Mon, 2015-10-19 at 21:05: > Can you try setting SPARK_USER at the driver? It is used to impersonate > users at the executor. So if you have user setup for launching spark jobs > on the executor machines, simply se

Re: Should I convert json into parquet?

2015-10-19 Thread Adrian Tanase
For general data access of the pre-computed aggregates (group by) you’re better off with Parquet. I’d only choose JSON if I needed interop with another app stack / language that has difficulty accessing Parquet (e.g. bulk load into a document db…). On a strategic level, both JSON and Parquet are

Re: How does shuffle work in Spark?

2015-10-19 Thread shahid
@all I did partitionBy using the default hash partitioner on data [(1,data),(2,data),...,(n,data)]. The total data was approx 3.5; it showed shuffle write of 50G, and on the next action, e.g. count, it is showing shuffle read of 50G. I don't understand this behaviour, and I think the performance is getting slow with so

Re: How to give user jars precedence over Spark jars

2015-10-19 Thread Ted Yu
Have you tried the following options? --conf spark.driver.userClassPathFirst=true --conf spark.executor.userClassPathFirst=true Cheers On Mon, Oct 19, 2015 at 5:07 AM, YiZhi Liu wrote: > I'm trying to read a Thrift object from SequenceFile, using > elephant-bird's ThriftWritable. My code loo

Re: How to give user jars precedence over Spark jars

2015-10-19 Thread YiZhi Liu
Hi Ted, Unfortunately these two options cause the following failure in my environment: (java.lang.RuntimeException: class org.apache.hadoop.security.JniBasedUnixGroupsMappingWithFallback not org.apache.hadoop.security.GroupMappingServiceProvider, java.lang.RuntimeException: java.lang.RuntimeException:

[Spark MLlib] How to apply spark ml given models for questions with general background

2015-10-19 Thread Zhiliang Zhu
Dear All, I am new to Spark ML. I have a project with a given math model, and I would like to get its optimized solution. It is very similar to a Spark MLlib application. However, the key problem for me is that the given math model does not obviously belong to the models (as clas

Re: How to give user jars precedence over Spark jars

2015-10-19 Thread Ted Yu
Can you use: https://maven.apache.org/plugins/maven-shade-plugin/ to shade the dependencies unique to your project ? On Mon, Oct 19, 2015 at 7:47 AM, YiZhi Liu wrote: > Hi Ted, > > Unfortunately these two options cause following failure in my environment: > > (java.lang.RuntimeException: class

flattening a JSON data structure

2015-10-19 Thread nunomrc
Hi, I am fairly new to Spark and I am trying to flatten the following structure:

    |-- provider: struct (nullable = true)
    |    |-- accountId: string (nullable = true)
    |    |-- contract: array (nullable = true)

And then provider is: root

Re: Spark SQL Thriftserver and Hive UDF in Production

2015-10-19 Thread Deenar Toraskar
Reece, you can do the following: start the spark-shell, register the UDFs in the shell using sqlContext, then start the Thrift Server using startWithContext from the spark-shell: https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver
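A minimal sketch of those steps from the spark-shell (the UDF name and class are hypothetical):

    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

    // In a spark-shell built with Hive, sqlContext is typically a HiveContext.
    val hc = sqlContext.asInstanceOf[HiveContext]
    hc.sql("CREATE TEMPORARY FUNCTION myUdf AS 'com.example.MyUdf'")  // hypothetical UDF
    HiveThriftServer2.startWithContext(hc)

Clients connecting over JDBC/ODBC then share that context and can see the registered function.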

k-prototypes in MLLib?

2015-10-19 Thread Fernando Velasco
Hi everyone! I am a data scientist new to Spark and I am interested in clustering of mixed variables. I am more used to R, where there are implementations like daisy, PAM, etc. It is true that dummy variables along with k-means can perform a nice job on clustering mixed variables, but I find this

Re: Spark handling parallel requests

2015-10-19 Thread Adrian Tanase
To answer your specific question, you can’t push data to Kafka through a socket – you need a smart client library, as the cluster setup is pretty advanced (it also requires ZooKeeper). I bet there are PHP libraries for Kafka, although after a quick search it seems they’re still pretty young. Also –

Re: Spark SQL Thriftserver and Hive UDF in Production

2015-10-19 Thread Todd Nist
From Tableau, you should be able to use the Initial SQL option to support this. So in Tableau add the following to the “Initial SQL”: create function myfunc AS 'myclass' using jar 'hdfs:///path/to/jar'; HTH, Todd On Mon, Oct 19, 2015 at 11:22 AM, Deenar Toraskar wrote: > Reece > > You can

pyspark: results differ based on whether persist() has been called

2015-10-19 Thread peay2
Hi, I am getting some very strange results, where I get different results based on whether or not I call persist() on a data frame or not before materialising it. There's probably something obvious I am missing, as only very simple operations are involved here. Any help with this would be greatly

Re: In-memory computing and cache() in Spark

2015-10-19 Thread Jia Zhan
Hi Sonal, I tried changing spark.executor.memory but nothing changes. It seems when I run locally on one machine, the RDD is cached in driver memory instead of executor memory. Here is a related post online: http://apache-spark-user-list.1001560.n3.nabble.com/Running-Spark-in-Local-Mode-td

Re: In-memory computing and cache() in Spark

2015-10-19 Thread Jia Zhan
Hi Igor, It iteratively conducts reduce((a,b) => a+b), which is the action there. I can see clearly 4 stages (one saveAsTextFile() and three reduce()) in the web UI. Don't know what's going on there that causes the non-intuitive caching behavior. Thanks for the help! On Sun, Oct 18, 2015 at 11:32 PM, Igor

How to calculate row by row and output results in Spark

2015-10-19 Thread Shepherd
Hi all, I am new to Spark and Scala. I have a question about doing calculations. I am using "groupBy" to generate key-value pairs, where the value points to a subset of the original RDD. The RDD has four columns, and each subset RDD may have a different number of rows. For example, the original code looks like this: "va

Re: pyspark: results differ based on whether persist() has been called

2015-10-19 Thread Davies Liu
This should be fixed by https://github.com/apache/spark/commit/a367840834b97cd6a9ecda568bb21ee6dc35fcde Will be released as 1.5.2 soon. On Mon, Oct 19, 2015 at 9:04 AM, peay2 wrote: > Hi, > > I am getting some very strange results, where I get different results based > on whether or not I call p

Re: best way to generate per key auto increment numerals after sorting

2015-10-19 Thread Davies Liu
What's the issue with groupByKey()? On Mon, Oct 19, 2015 at 1:11 AM, fahad shah wrote: > Hi > > I wanted to ask whats the best way to achieve per key auto increment > numerals after sorting, for eg. : > > raw file: > > 1,a,b,c,1,1 > 1,a,b,d,0,0 > 1,a,b,e,1,0 > 2,a,e,c,0,0 > 2,a,f,d,1,0 > > post-o

Re: best way to generate per key auto increment numerals after sorting

2015-10-19 Thread fahad shah
Thanks Davies. groupByKey was throwing the error: "unpack requires a string argument of length 4". Interestingly, I replaced it with sortByKey (which I read also shuffles so that data for the same key are on the same partition) and it ran fine - wondering if this is a bug in groupByKey for Spark 1.3?

Re: pyspark groupbykey throwing error: unpack requires a string argument of length 4

2015-10-19 Thread Davies Liu
Could you simplify the code a little bit so we can reproduce the failure? (may also have some sample dataset if it depends on them) On Sun, Oct 18, 2015 at 10:42 PM, fahad shah wrote: > Hi > > I am trying to do pair rdd's, group by the key assign id based on key. > I am using Pyspark with spark

Re: pyspark groupbykey throwing error: unpack requires a string argument of length 4

2015-10-19 Thread fahad shah
Thanks Davies, sure, I can share the code/data in pm - best fahad On Mon, Oct 19, 2015 at 10:52 AM, Davies Liu wrote: > Could you simplify the code a little bit so we can reproduce the failure? > (may also have some sample dataset if it depends on them) > > On Sun, Oct 18, 2015 at 10:42 PM, fahad

Re: Spark handling parallel requests

2015-10-19 Thread tarek.abouzeid91
Thanks guys for your advice, I will have a look at the custom receivers. Thanks again for your efforts.  --  Best Regards, -- Tarek Abouzeid On Monday, October 19, 2015 6:50 PM, Adrian Tanase wrote: To answer your specific question, you can’t push data to Kafka through a so

Storing Compressed data in HDFS into Spark

2015-10-19 Thread ahaider3
Hi, A lot of the data I have in HDFS is compressed. I noticed that when I load this data into Spark and cache it, Spark unrolls the data like normal but stores it uncompressed in memory. For example, suppose data is an RDD with compressed partitions on HDFS. I then cache the data. When I call d
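One way to keep the cached footprint down is a serialized storage level instead of plain cache(); a minimal sketch, with an assumed path:

    import org.apache.spark.storage.StorageLevel

    // cache() means MEMORY_ONLY (deserialized objects). A serialized level,
    // optionally with spark.rdd.compress=true in the SparkConf, keeps the
    // in-memory size closer to the on-disk size, at some CPU cost on access.
    val data = sc.textFile("hdfs:///data/compressed")   // hypothetical path
    data.persist(StorageLevel.MEMORY_ONLY_SER)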

new 1.5.1 behavior - exception on executor throws ClassNotFound on driver

2015-10-19 Thread gbop
I've been struggling with a particularly puzzling issue after upgrading to Spark 1.5.1 from Spark 1.4.1. When I use the MySQL JDBC connector and an exception (e.g. com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException) is thrown on the executor, I get a ClassNotFoundException on the driver, wh

writing avro parquet

2015-10-19 Thread Alex Nastetsky
Using Spark 1.5.1, Parquet 1.7.0. I'm trying to write Avro/Parquet files. I have this code:

    sc.hadoopConfiguration.set(ParquetOutputFormat.WRITE_SUPPORT_CLASS, classOf[AvroWriteSupport].getName)
    AvroWriteSupport.setSchema(sc.hadoopConfiguration, MyClass.SCHEMA$)
    myDF.write.parquet(outputPath)

Th

Re: How to calculate row by row and output results in Spark

2015-10-19 Thread Ted Yu
Under core/src/test/scala/org/apache/spark, you will find a lot of examples of the map function. FYI On Mon, Oct 19, 2015 at 10:35 AM, Shepherd wrote: > Hi all, I am new in Spark and Scala. I have a question in doing > calculation. I am using "groupBy" to generate key value pair, and the value >

Re: new 1.5.1 behavior - exception on executor throws ClassNotFound on driver

2015-10-19 Thread Ted Yu
The attachments didn't go through. Consider pastebin'ing them. Thanks On Mon, Oct 19, 2015 at 11:15 AM, gbop wrote: > I've been struggling with a particularly puzzling issue after upgrading to > Spark 1.5.1 from Spark 1.4.1. > > When I use the MySQL JDBC connector and an exception (e.g. > com.mys

Re: new 1.5.1 behavior - exception on executor throws ClassNotFound on driver

2015-10-19 Thread Lij Tapel
Sorry, here are the logs and source: The error I see in Spark 1.5.1: http://pastebin.com/86K9WQ5f * full logs here: http://pastebin.com/dfysSh9E What I used to see in Spark 1.4.1: http://pastebin.com/eK3AZQFx * full logs here: http://pastebin.com/iffSFFWW The source and build.sbt: http://pastebin.

RE: Dynamic partition pruning

2015-10-19 Thread Younes Naguib
Done: SPARK-11150 Thanks From: Xiao Li [mailto:gatorsm...@gmail.com] Sent: October-16-15 8:21 PM To: Younes Naguib Cc: Michael Armbrust; user@spark.apache.org Subject: Re: Dynamic partition pruning Hi, Younes, Maybe you can open a JIRA? Thanks, Xiao Li 2015-10-16 12:43 GMT-07:00 Younes Nagui

Differentiate Spark streaming in event logs

2015-10-19 Thread franklyn
Hi, I'm running a job to collect some analytics on Spark jobs by analyzing their event logs. We write the event logs to a single HDFS folder and then pick them up in another job. I'd like to differentiate between regular Spark jobs and Spark Streaming jobs in the event logs; I was wondering if there

Re: new 1.5.1 behavior - exception on executor throws ClassNotFound on driver

2015-10-19 Thread Ted Yu
Lij:

    jar tvf /Users/tyu/.m2/repository//mysql/mysql-connector-java/5.1.31/mysql-connector-java-5.1.31.jar | grep MySQLSyntaxErrorExceptio
    914 Wed May 21 01:42:16 PDT 2014 com/mysql/jdbc/exceptions/MySQLSyntaxErrorException.class
    842 Wed May 21 01:42:18 PDT 2014 com/mysql/jdbc/exceptions/jdbc

Re: How does shuffle work in Spark?

2015-10-19 Thread Adrian Tanase
I don’t know why it expands to 50 GB, but it’s correct to see it both on the first operation (shuffle write) and on the next one (shuffle read). It’s the barrier between the 2 stages. -adrian From: shahid ashraf Date: Monday, October 19, 2015 at 9:53 PM To: Kartik Mathur, Adrian Tanase Cc: use

Re: Differentiate Spark streaming in event logs

2015-10-19 Thread Adrian Tanase
You could try to start the 2/N jobs with a slightly different log4j template, by prepending some job type to all the messages... On 10/19/15, 9:47 PM, "franklyn" wrote: >Hi I'm running a job to collect some analytics on spark jobs by analyzing >their event logs. We write the event logs to a

Re: new 1.5.1 behavior - exception on executor throws ClassNotFound on driver

2015-10-19 Thread Lij Tapel
I have verified that there is only 5.1.34 on the classpath. Funnily enough, I have a repro that doesn't even use mysql so this seems to be purely a classloader issue: source: http://pastebin.com/WMCMwM6T 1.4.1: http://pastebin.com/x38DQY2p 1.5.1: http://pastebin.com/DQd6k818 On Mon, Oct 19, 20

Re: How to calculate row by row and output results in Spark

2015-10-19 Thread Adrian Tanase
Are you by any chance looking for reduceByKey? If you’re trying to collapse all the values in V into an aggregate, that’s what you should be looking at. -adrian From: Ted Yu Date: Monday, October 19, 2015 at 9:16 PM To: Shepherd Cc: user Subject: Re: How to calculate row by row and output results
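For illustration, a minimal sketch of that pattern, assuming an RDD of 4-field tuples keyed on the first field (the field choice is hypothetical):

    // Collapse all values per key into one aggregate without materializing
    // each group in memory, unlike groupBy/groupByKey.
    val totals = rows
      .map(r => (r._1, r._4))      // hypothetical: key on field 1, aggregate field 4
      .reduceByKey(_ + _)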

Re: Spark Streaming scheduler delay VS driver.cores

2015-10-19 Thread Adrian Tanase
Bump on this question – does anyone know what is the effect of spark.driver.cores on the driver's ability to manage larger clusters? Any tips on setting a correct value? I’m running Spark streaming on Yarn / Hadoop 2.6 / Spark 1.5.1. Thanks, -adrian From: Adrian Tanase Date: Saturday, October

Re: Streaming of COAP Resources

2015-10-19 Thread Adrian Tanase
I’m not familiar with your COAP library, but onStart is called only once. You’re only reading the value once, when the custom receiver is initialized. You need to set up a callback, poll a buffer — again, it depends on your COAP client — in short, configure your client to “start listening for changes”
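A minimal sketch of that callback pattern; ObservingCoapClient and its onChange hook are hypothetical stand-ins for whatever your COAP library provides:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    // onStart registers a callback instead of reading the resource once;
    // each update is pushed into Spark via store().
    class CoapReceiver(url: String)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      def onStart(): Unit = {
        val client = new ObservingCoapClient(url)       // hypothetical client
        client.onChange(payload => store(payload))      // hypothetical callback hook
      }

      def onStop(): Unit = {
        // cancel the observe relation / close the client here
      }
    }

    // usage: val updates = ssc.receiverStream(new CoapReceiver("coap://host/resource"))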

Re: new 1.5.1 behavior - exception on executor throws ClassNotFound on driver

2015-10-19 Thread Ted Yu
Interesting. I wonder why existing tests didn't catch that:

    class UnserializableException extends Exception {
    ./core/src/test/scala/org/apache/spark/rpc/RpcEnvSuite.scala
    class DAGSchedulerSuiteDummyException extends Exception
    ./core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scal

Spark SQL Exception: Conf non-local session path expected to be non-null

2015-10-19 Thread YaoPau
I've connected Spark SQL to the Hive Metastore and currently I'm running SQL code via pyspark. Typically everything works fine, but sometimes after a long-running Spark SQL job I get the error below, and from then on I can no longer run Spark SQL commands. I still do have both my sc and my sqlCtx

Dynamic Allocation & Spark Streaming

2015-10-19 Thread robert towne
I have watched a few videos from Databricks/Andrew Or around the Spark 1.2 release and it seemed that dynamic allocation was not yet available for Spark Streaming. I now see SPARK-10955 which is tied to 1.5.2 and allows disabling of Spark Streami

Re: new 1.5.1 behavior - exception on executor throws ClassNotFound on driver

2015-10-19 Thread Lij Tapel
Wow thanks, that's a great place to start digging deeper. Would it be appropriate to file this on JIRA? It makes spark 1.5.1 a bit of a deal breaker for me but I wouldn't mind taking a shot at fixing it given some guidance On Mon, Oct 19, 2015 at 1:03 PM, Ted Yu wrote: > Interesting. > I wonder

Re: Dynamic Allocation & Spark Streaming

2015-10-19 Thread Tathagata Das
Unfortunately the title on the JIRA is extremely confusing. I have fixed it. The reason why dynamic allocation does not work well with streaming is that the heuristic that is used to automatically scale up or down the number of executors works for the pattern of task schedules in batch jobs, not f

Re: Is one batch created by Streaming Context always equal to one RDD?

2015-10-19 Thread Tathagata Das
Each DStream creates one RDD per batch. On Mon, Oct 19, 2015 at 4:39 AM, vaibhavrtk wrote: > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Is-one-batch-created-by-Streaming-Context-always-equal-to-one-RDD-tp25117.html > Sent from the Apache Spar

Re: flattening a JSON data structure

2015-10-19 Thread Michael Armbrust
Quick fix is probably to use Seq[Row] instead of Array (the types that are returned are documented here: http://spark.apache.org/docs/latest/sql-programming-guide.html#data-types). Really, though, you probably want to be using explode. Perhaps something like this would help? import org.apache.spark.
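The snippet is truncated above; a hedged reconstruction of the explode approach, with column names taken from the schema in the original post:

    import org.apache.spark.sql.functions.explode

    // One output row per element of provider.contract.
    val flat = df.select(
      df("provider.accountId").as("accountId"),
      explode(df("provider.contract")).as("contract"))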

Re: writing avro parquet

2015-10-19 Thread Alex Nastetsky
Figured it out ... needed to use saveAsNewAPIHadoopFile, but was trying to use it on myDF.rdd instead of converting it to a PairRDD first. On Mon, Oct 19, 2015 at 2:14 PM, Alex Nastetsky < alex.nastet...@vervemobile.com> wrote: > Using Spark 1.5.1, Parquet 1.7.0. > > I'm trying to write Avro/Parq
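For readers hitting the same thing, a hedged sketch of that fix; MyClass is the Avro-generated class from the post, while rowToMyClass and the Job wiring are assumptions:

    import org.apache.hadoop.mapreduce.Job
    import org.apache.parquet.avro.AvroParquetOutputFormat

    val job = Job.getInstance(sc.hadoopConfiguration)
    AvroParquetOutputFormat.setSchema(job, MyClass.SCHEMA$)

    myDF.rdd
      .map(row => (null: Void, rowToMyClass(row)))   // rowToMyClass: hypothetical Row -> MyClass converter
      .saveAsNewAPIHadoopFile(                       // requires a PairRDD, hence the (key, value) map above
        outputPath,
        classOf[Void],
        classOf[MyClass],
        classOf[AvroParquetOutputFormat],
        job.getConfiguration)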

Spark ML/MLib newbie question

2015-10-19 Thread George Paulson
I have a dataset that's relatively big, but easily fits in memory. I want to generate many different features for this dataset and then run L1 regularized Logistic Regression on the feature enhanced dataset. The combined features will easily exhaust memory. I was hoping there was a way that I coul

Re: Spark 1.5 Streaming and Kinesis

2015-10-19 Thread Phil Kallos
I am currently trying a few code changes to see if I can squash this error. I have created https://issues.apache.org/jira/browse/SPARK-11193 to track progress, hope that is okay! In the meantime, can anyone confirm their ability to run the Kinesis-ASL example using Spark > 1.5.x ? Would be helpful

Concurrency/Multiple Users

2015-10-19 Thread GooniesNeverSayDie
I am trying to connect to an Apache Spark 1.4 server with multiple users. Here is the issue in short form: Connection 1 specifies database test1 at connection time. Show tables shows test1 database tables. Connection 2 specifies database test2 at connection time. Show tables shows test2 database

Re: How to put an object in cache forever in Streaming

2015-10-19 Thread Tathagata Das
That should also get cleaned through the GC, though you may have to explicitly run GC periodically for faster clean-up. RDDs are by definition distributed across executors in parts. When cached, the RDD partitions are cached in memory across the executors. On Fri, Oct 16, 2015 at 6:15 PM, swetha k

Re: Issue in spark batches

2015-10-19 Thread Tathagata Das
If Cassandra is down, does saveToCassandra throw an exception? If it does, you can catch that exception and write your own logic to retry and/or skip the update. Once the foreachRDD function completes, that batch will be internally marked as completed. TD On Mon, Oct 19, 2015 at 5:48 AM, varun sharma
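A minimal sketch of that retry logic, assuming the DataStax spark-cassandra-connector and an RDD whose element type maps onto the (hypothetical) table:

    import com.datastax.spark.connector._

    // Retry the write a few times before giving up; the batch is only marked
    // complete once this foreachRDD body returns.
    stream.foreachRDD { rdd =>
      var attempt = 0
      var saved = false
      while (!saved && attempt < 3) {
        try {
          rdd.saveToCassandra("my_ks", "my_table")   // hypothetical keyspace/table
          saved = true
        } catch {
          case e: Exception =>
            attempt += 1
            Thread.sleep(5000)                       // back off before retrying
        }
      }
    }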

Re: PySpark + Streaming + DataFrames

2015-10-19 Thread Tathagata Das
RDD and DF are not compatible data types, so you cannot return a DF when you have to return an RDD. What you can do instead is return the underlying RDD of the DataFrame via dataframe.rdd. On Fri, Oct 16, 2015 at 12:07 PM, Jason White wrote: > Hi Ken, thanks for replying. > > Unless I'm misunde

Re: Concurrency/Multiple Users

2015-10-19 Thread Michael Armbrust
Unfortunately the implementation of SPARK-2087 didn't have enough tests and got broken in 1.4. In Spark 1.6 we will have a much more solid fix: https://github.com/apache/spark/commit/3390b400d04e40f767d8a51f1078fcccb4e64abd On Mon, Oct 19, 2015 at 2:13 PM, GooniesNeverSayDie wrote: > I am tryin

serialization error

2015-10-19 Thread daze5112
Hi, having some problems with the piece of code I inherited. The error messages I get are: The code runs if I exclude the following line: Any help appreciated. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/serialization-error-tp25131.html Sent fro

Re: PySpark + Streaming + DataFrames

2015-10-19 Thread Jason White
Ah, that makes sense then, thanks TD. The conversion from RDD -> DF involves a `.take(10)` in PySpark, even if you provide the schema, so I was avoiding back-and-forth conversions. I’ll see if I can create a ‘trusted’ conversion that doesn’t involve the `take`. --  Jason On October 19, 2015 at

Re: PySpark + Streaming + DataFrames

2015-10-19 Thread Tathagata Das
Yes, precisely! Also, for other folks who may read this, could reply back with the trusted conversion that worked for you (for a clear solution)? TD On Mon, Oct 19, 2015 at 3:08 PM, Jason White wrote: > Ah, that makes sense then, thanks TD. > > The conversion from RDD -> DF involves a `.take(1
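For the record, here is the Scala-side shape of such a conversion: supplying an explicit schema means no rows need to be sampled for inference. Column names and types are hypothetical:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    // Build the DataFrame from an RDD with a declared schema;
    // no take()/sampling is needed for schema inference.
    val schema = StructType(Seq(
      StructField("word", StringType),
      StructField("count", IntegerType)))
    val df = sqlContext.createDataFrame(
      rdd.map { case (w: String, c: Int) => Row(w, c) },
      schema)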

java TwitterUtils.createStream(): how to create a "user stream"?

2015-10-19 Thread Andy Davidson
Hi, I wrote a little prototype that created a "public stream"; now I want to convert it to read tweets for a large number of explicit users, i.e. to create a "user stream" or a "site stream". According to the Twitter developer doc I should be able to set the "follows" parameter to a list of users I am

Re: serialization error

2015-10-19 Thread Ted Yu
Attachments didn't go through. Mind using pastebin to show the code / error ? Thanks On Mon, Oct 19, 2015 at 3:01 PM, daze5112 wrote: > Hi having some problems with the piece of code I inherited: > > > > > the error messages i get are: > > > the code runs if i exclude the following line: > > >

Re: Spark SQL Exception: Conf non-local session path expected to be non-null

2015-10-19 Thread Ted Yu
A brief search led me to ql/src/java/org/apache/hadoop/hive/ql/session/SessionState.java:

    private static final String HDFS_SESSION_PATH_KEY = "_hive.hdfs.session.path";
    ...
    public static Path getHDFSSessionPath(Configuration conf) {
      SessionState ss = SessionState.get();
      if (ss == null

Filter RDD

2015-10-19 Thread Shepherd
Hi all, I have a very simple question. I have an RDD, say r1, which contains 5 columns, with both String and Int. How can I get a sub-RDD, based on a rule that the second column equals a string (s)? Thanks a lot. -- View this message in context: http://apache-spark-user-list.1001560.n3.

Multiple Spark Streaming Jobs on Single Master

2015-10-19 Thread Augustus Hong
Hi All, Would it be possible to run multiple Spark Streaming jobs on a single master at the same time? I currently have one master node and several worker nodes in the standalone mode, and I used spark-submit to submit multiple Spark Streaming jobs. From what I observed, it seems like only the

Succinct experience

2015-10-19 Thread Younes Naguib
Hi all, Does anyone have any experience with SuccinctRDD? Thanks, Younes

Re: Filter RDD

2015-10-19 Thread Ted Yu
See the filter() method: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L334 Cheers On Mon, Oct 19, 2015 at 4:27 PM, Shepherd wrote: > Hi all, > I have a very simple question. > I have a RDD, saying r1, which contains 5 columns, with both string
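For example, a minimal sketch, assuming r1 holds 5-field tuples and the match value is s:

    // Keep only the rows whose second column equals s.
    val s = "some-value"               // hypothetical match value
    val sub = r1.filter(_._2 == s)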

Re: serialization error

2015-10-19 Thread Andy Huang
That particular line used an object which did not implement "Serializable" On Tue, Oct 20, 2015 at 9:01 AM, daze5112 wrote: > Hi having some problems with the piece of code I inherited: > > > > > the error messages i get are: > > > the code runs if i exclude the following line: > > > any help ap

Why must the 'user' and 'product' in Rating(user: Int, product: Int, rating: Double) (in MLlib's ALS) be Int?

2015-10-19 Thread futureage
Hi, I am learning MLlib's ALS. When I saw the case class Rating(user: Int, product: Int, rating: Double), I wanted to know why 'user' and 'product' are Int. They are just ids; would Long or String not be OK? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabb

Re: Multiple Spark Streaming Jobs on Single Master

2015-10-19 Thread Tathagata Das
You can set the max cores for the first submitted job such that it does not take all the resources from the master. See http://spark.apache.org/docs/latest/submitting-applications.html

    # Run on a Spark standalone cluster in client deploy mode
    ./bin/spark-submit \
      --class org.apache.spark.example
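The same cap can also be set programmatically in the app; a minimal sketch (the value 2 is illustrative):

    // In standalone mode, spark.cores.max bounds the total cores one app may
    // take, leaving room on the workers for the other streaming jobs.
    val conf = new org.apache.spark.SparkConf()
      .setAppName("streaming-job-1")
      .set("spark.cores.max", "2")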

Re: Spark 1.5 Streaming and Kinesis

2015-10-19 Thread Jean-Baptiste Onofré
Hi Phil, thanks for the Jira, I will try to take a look asap. Regards JB On 10/19/2015 11:07 PM, Phil Kallos wrote: I am currently trying a few code changes to see if I can squash this error. I have created https://issues.apache.org/jira/browse/SPARK-11193 to track progress, hope that is okay!

Re: Spark SQL Exception: Conf non-local session path expected to be non-null

2015-10-19 Thread Deenar Toraskar
This seems to be set using hive.exec.scratchdir, is that set?

    hdfsSessionPath = new Path(hdfsScratchDirURIString, sessionId);
    createPath(conf, hdfsSessionPath, scratchDirPermission, false, true);
    conf.set(HDFS_SESSION_PATH_KEY, hdfsSessionPath.toUri().toString());

On 20 October 2015

Re: Spark SQL Exception: Conf non-local session path expected to be non-null

2015-10-19 Thread Davies Liu
The thread-local things do not work well with PySpark, because the thread used by PySpark in the JVM could change over time, so the SessionState could be lost. This should be fixed in master by https://github.com/apache/spark/pull/8909 On Mon, Oct 19, 2015 at 1:08 PM, YaoPau wrote: > I've connected Spar