Re: What's the advantage features of Spark SQL(JDBC)

2015-05-14 Thread Yi Zhang
@Hao, As you said, there is no special advantage for JDBC; it just provides a unified API to support different data sources. Is that right? On Friday, May 15, 2015 2:46 PM, "Cheng, Hao" wrote:

RE: What's the advantage features of Spark SQL(JDBC)

2015-05-14 Thread Cheng, Hao
Spark SQL just takes JDBC as a new data source, the same way we support loading data from a .csv or .json file. From: Yi Zhang [mailto:zhangy...@yahoo.com.INVALID] Sent: Friday, May 15, 2015 2:30 PM To: User Subject: What's the advantage features of Spark SQL(JDBC) Hi All, Comparing direct
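Hao's point, that JDBC is just another data source, can be illustrated with a small Scala sketch against the Spark 1.3 DataFrame API; the connection URL, table name, file path and join column below are all made up for the example.

    import org.apache.spark.sql.SQLContext

    val sqlContext: SQLContext = ???   // your SQLContext

    // A table loaded over JDBC and a JSON file both arrive as DataFrames,
    // so they can be joined, filtered and cached in exactly the same way.
    val jdbcDF = sqlContext.load("jdbc", Map(
      "url"     -> "jdbc:mysql://db-host:3306/sales",            // hypothetical connection URL
      "dbtable" -> "orders"))                                    // hypothetical table
    val jsonDF = sqlContext.jsonFile("hdfs:///data/events.json") // hypothetical path

    jdbcDF.join(jsonDF, jdbcDF("order_id") === jsonDF("order_id")).count()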

Spark on Mesos vs Yarn

2015-05-14 Thread Ankur Chauhan
Hi, This is both a survey-type as well as a roadmap query question. Of the cluster options to run Spark (i.e. via YARN and Mesos), YARN seems to be getting a lot more attention and patches compared to Mesos. Would it be correct to

What's the advantage features of Spark SQL(JDBC)

2015-05-14 Thread Yi Zhang
Hi All, Compared to direct access via JDBC, what are the advantage features of Spark SQL (JDBC) for accessing an external data source? Any tips are welcome! Thanks. Regards, Yi

Re: Spark performance in cluster mode using yarn

2015-05-14 Thread Sachin Singh
Hi Ayan, I am asking about general scenarios given that info/configuration, from experts, not specifics. The Java code is nothing more than getting a Hive context and a select query; there is no serialization or any other complex thing. I kept it straightforward, 10 lines of code. Group, please suggest if you have any idea. Regards, Sac

RE: question about sparksql caching

2015-05-14 Thread Cheng, Hao
You probably can try something like: val df = sqlContext.sql("select c1, sum(c2) from T1, T2 where T1.key=T2.key group by c1") df.cache() // Cache the result, but it's a lazy execution. df.registerAsTempTable("my_result") sqlContext.sql("select * from my_result where c1=1").collect // the cache
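Hao's fragment, expanded into a slightly fuller sketch of the same pattern (table and column names are placeholders): cache the aggregated DataFrame and register it as a temp table, so later SQL runs against the cached aggregate rather than the raw T1/T2 data.

    val df = sqlContext.sql(
      "select c1, sum(c2) as s from T1, T2 where T1.key = T2.key group by c1")

    df.cache()                         // lazy: nothing is materialized yet
    df.registerTempTable("my_result")  // registerAsTempTable is the older, deprecated name

    // The first action materializes and caches the aggregate...
    sqlContext.sql("select * from my_result where c1 = 1").collect()
    // ...and later queries on my_result hit the cache, not the raw T1/T2 data.
    sqlContext.sql("select * from my_result where c1 = 2").collect()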

Spark 1.3.0 -> 1.3.1 produces java.lang.NoSuchFieldError: NO_FILTER

2015-05-14 Thread Exie
Hello Bright Sparks, I was using Spark 1.3.0 to push data out to Parquet files. It has been working great: super fast, an easy way to persist data frames, etc. However, I just swapped out Spark 1.3.0 and picked up the tarball for 1.3.1. I unzipped it, copied my config over and then went to read one

Re: Multiple Kinesis Streams in a single Streaming job

2015-05-14 Thread Chris Fregly
Another option (not really recommended, but worth mentioning) would be to change the region of DynamoDB to be separate from the other stream, and even separate from the stream itself. This isn't available right now, but will be in Spark 1.4. > On May 14, 2015, at 6:47 PM, Erich Ess wrote: >

question about sparksql caching

2015-05-14 Thread sequoiadb
Hi all, We are planning to use Spark SQL in a DW system. There's a question about the caching mechanism of Spark SQL. For example, if I have a SQL query like sqlContext.sql("select c1, sum(c2) from T1, T2 where T1.key=T2.key group by c1").cache() Is it going to cache the final result or the raw data of

RE: Spark's Guava pieces cause exceptions in non-trivial deployments

2015-05-14 Thread Anton Brazhnyk
The problem is with 1.3.1. It has the Function class (mentioned in the exception) in spark-network-common_2.10-1.3.1.jar. Our current resolution is actually a backport to 1.2.2, which is working fine. From: Marcelo Vanzin [mailto:van...@cloudera.com] Sent: Thursday, May 14, 2015 6:27 PM To: Anton Brazhnyk

Re: textFileStream Question

2015-05-14 Thread 董帅阳
file timestamp -- Original Message -- From: "Vadim Bichutskiy"; Sent: Friday, May 15, 2015, 4:55 AM; To: "user@spark.apache.org"; Subject: textFileStream Question How does textFileStream work behind the scenes? How does Spark Streaming know what files are new and need to be processed

Word2Vec with billion-word corpora

2015-05-14 Thread shilad
Hi all, I'm experimenting with Spark's Word2Vec implementation on a relatively large corpus (5B words, vocabulary size 4M, 400-dimensional vectors). Has anybody had success running it at this scale? Thanks in advance for your guidance! -Shilad -- View this message in context: http://apache
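For readers who want to try the same thing, a minimal MLlib Word2Vec sketch with the 400-dimensional setting mentioned above; the input path and partitioning are assumptions, and whether this holds up at 5B words and a 4M-word vocabulary is exactly the open question of this thread.

    import org.apache.spark.mllib.feature.Word2Vec

    // sc is an existing SparkContext; each element is one tokenized sentence/document.
    val corpus = sc.textFile("hdfs:///corpus/*.txt")          // hypothetical path
      .map(_.toLowerCase.split("\\s+").toSeq)

    val model = new Word2Vec()
      .setVectorSize(400)     // as in the original post
      .setNumPartitions(64)   // assumed; more partitions spread the large vocabulary
      .setNumIterations(1)    // assumed
      .fit(corpus)

    model.findSynonyms("spark", 10).foreach(println)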

Re: Multiple Kinesis Streams in a single Streaming job

2015-05-14 Thread Erich Ess
Hi Tathagata, I think that's exactly what's happening. The error message is: "com.amazonaws.services.kinesis.model.InvalidArgumentException: StartingSequenceNumber 49550673839151225431779125105915140284622031848663416866 used in GetShardIterator on shard shardId-0002 in stream erich-test

Re: Multiple Kinesis Streams in a single Streaming job

2015-05-14 Thread Tathagata Das
A possible problem may be that the Kinesis stream in 1.3 uses the SparkContext app name as the Kinesis application name, which is used by the Kinesis Client Library to save checkpoints in DynamoDB. Since both Kinesis DStreams are using the same Kinesis application name (as they are in the same Streaming

Re: Spark's Guava pieces cause exceptions in non-trivial deployments

2015-05-14 Thread Marcelo Vanzin
What version of Spark are you using? The bug you mention is only about the Optional class (and a handful of others, but none of the classes you're having problems with). All other Guava classes should be shaded since Spark 1.2, so you should be able to use your own version of Guava with no problem

Re: Spark performance in cluster mode using yarn

2015-05-14 Thread ayan guha
With this information it is hard to predict. What's the performance you are getting? What's your desired performance? Maybe you can post your code so experts can suggest improvements? On 14 May 2015 15:02, "sachin Singh" wrote: > Hi Friends, > please someone can give the idea, Ideally what shoul

Spark Summit 2015 - June 15-17 - Dev list invite

2015-05-14 Thread Scott walent
*Join the Apache Spark community at the fourth Spark Summit in San Francisco on June 15, 2015. At Spark Summit 2015 you will hear keynotes from NASA, the CIA, Toyota, Databricks, AWS, Intel, MapR, IBM, Cloudera, Hortonworks, Timeful, O'Reilly, and Andreessen Horowitz. 260 talk proposals were submitted

Spark's Guava pieces cause exceptions in non-trivial deployments

2015-05-14 Thread Anton Brazhnyk
Greetings, I have a relatively complex application in which Spark, Jetty and Guava (16) do not fit together. An exception happens when some components try to use a "mix" of Guava classes (including Spark's pieces) that are loaded by different classloaders: java.lang.LinkageError: loader constraint viola

RE: Is it feasible to keep millions of keys in state of Spark Streaming job for two months?

2015-05-14 Thread Haopu Wang
Hi TD, regarding the performance of updateStateByKey, do you have a JIRA for that so we can watch it? Thank you! From: Tathagata Das [mailto:t...@databricks.com] Sent: Wednesday, April 15, 2015 8:09 AM To: Krzysztof Zarzycki Cc: user Subject: Re: Is it feas

RE: [SparkSQL 1.4.0] groupBy columns are always nullable?

2015-05-14 Thread Haopu Wang
Thank you, should I open a JIRA for this issue? From: Olivier Girardot [mailto:ssab...@gmail.com] Sent: Tuesday, May 12, 2015 5:12 AM To: Reynold Xin Cc: Haopu Wang; user Subject: Re: [SparkSQL 1.4.0] groupBy columns are always nullable? I'll look into it

[SparkStreaming] Is it possible to delay the start of some DStream in the application?

2015-05-14 Thread Haopu Wang
In my application, I want to start a DStream computation only after a special event has happened (for example, I want to start the receiver only after the reference data has been properly initialized). My question is: it looks like the DStream will be started right after the StreamingContext has b

Re: Multiple Kinesis Streams in a single Streaming job

2015-05-14 Thread Chris Fregly
Have you tried to union the 2 streams per the KinesisWordCountASL example, where 2 streams (against the same Kinesis stream in this case) are created
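The union approach Chris mentions looks roughly like the sketch below; stream creation is elided since the exact KinesisUtils.createStream arguments vary by Spark version, so the stream values here are placeholders.

    import org.apache.spark.streaming.StreamingContext
    import org.apache.spark.streaming.dstream.DStream

    val ssc: StreamingContext = ???                  // your streaming context
    // Two receiver streams, e.g. created with KinesisUtils.createStream
    // against two different Kinesis streams (creation elided here).
    val streamA: DStream[Array[Byte]] = ???
    val streamB: DStream[Array[Byte]] = ???

    // A single job can then process both: union them before the transformations.
    val unioned = ssc.union(Seq(streamA, streamB))   // or streamA.union(streamB)
    unioned.map(bytes => new String(bytes, "UTF-8")).print()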

Re: Multiple Kinesis Streams in a single Streaming job

2015-05-14 Thread Tathagata Das
What is the error you are seeing? TD On Thu, May 14, 2015 at 9:00 AM, Erich Ess wrote: > Hi, > > Is it possible to setup streams from multiple Kinesis streams and process > them in a single job? From what I have read, this should be possible, > however, the Kinesis layer errors out whenever I

LogisticRegressionWithLBFGS with large feature set

2015-05-14 Thread Pala M Muthaia
Hi, I am trying to validate our modeling data pipeline by running LogisticRegressionWithLBFGS on a dataset with ~3.7 million features, basically to compute AUC. This is on Spark 1.3.0. I am using 128 executors with 4 GB each + a driver with 8 GB. The number of data partitions is 3072. The execution
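For context, the driver code for such an AUC run typically looks like the sketch below (the input path, split ratio and label handling are assumptions); with ~3.7M features, the model vector and the LBFGS history are what put pressure on driver memory.

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
    import org.apache.spark.mllib.util.MLUtils

    // sc is an existing SparkContext; LIBSVM-formatted data, path is hypothetical.
    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///model/train.libsvm").repartition(3072)
    val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

    val model = new LogisticRegressionWithLBFGS().setNumClasses(2).run(train)

    model.clearThreshold()   // emit raw scores so AUC can be computed
    val scoreAndLabel = test.map(p => (model.predict(p.features), p.label))
    println("AUC = " + new BinaryClassificationMetrics(scoreAndLabel).areaUnderROC())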

Data Load - Newbie

2015-05-14 Thread Ricardo Goncalves da Silva
Hi, I have a text dataset to which I want to apply a clustering algorithm from MLlib. The data is an (n,m) matrix with readers. I would like to know if the team can help me with how to load this data in Spark Scala and separate the variables I want to cluster. Thanks Rick.
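Without knowing the exact file layout, a common pattern for this is sketched below (delimiter, path and column positions are assumptions): parse each line into an MLlib Vector and hand the resulting RDD to the clustering algorithm.

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Assumed: comma-separated numeric columns, no header; sc is an existing SparkContext.
    val raw = sc.textFile("hdfs:///data/matrix.csv")   // hypothetical path
    val points = raw.map { line =>
      val cols = line.split(",").map(_.trim.toDouble)
      // Assumed: cluster on all columns except the first (say, an id column).
      Vectors.dense(cols.drop(1))
    }.cache()

    val model = KMeans.train(points, /* k = */ 5, /* maxIterations = */ 20)
    model.clusterCenters.foreach(println)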

Re: store hive metastore on persistent store

2015-05-14 Thread Tamas Jambor
I have tried to put the hive-site.xml file in the conf/ directory, but it seems it is not picking it up from there. On Thu, May 14, 2015 at 6:50 PM, Michael Armbrust wrote: > You can configure Spark SQLs hive interaction by placing a hive-site.xml > file in the conf/ directory. > > On Thu, May 14, 2

textFileStream Question

2015-05-14 Thread Vadim Bichutskiy
How does textFileStream work behind the scenes? How does Spark Streaming know what files are new and need to be processed? Is it based on timestamp, file name? Thanks, Vadim
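For reference, a minimal textFileStream sketch (the directory is hypothetical); per the reply earlier in this digest, new files are picked up by file timestamp within the monitored directory, so files should be moved or renamed into it atomically.

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // sc is an existing SparkContext.
    val ssc = new StreamingContext(sc, Seconds(30))

    // Each batch becomes an RDD built from files whose timestamp falls in the new window.
    val lines = ssc.textFileStream("hdfs:///incoming/")   // hypothetical directory

    lines.count().print()
    ssc.start()
    ssc.awaitTermination()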

Hive partition table + read using hiveContext + spark 1.3.1

2015-05-14 Thread SamyaMaiti
Hi Team, I have a Hive partitioned table with a partition column having spaces. When I try to run any query, say a simple "Select * from table_name", it fails. *Please note the same was working in Spark 1.2.0; now I have upgraded to 1.3.1. Also there is no change in my application code base.* If I

Custom Aggregate Function for DataFrame

2015-05-14 Thread Justin Yip
Hello, May I know if there is a way to implement an aggregate function for grouped data in a DataFrame? I dug into the docs but didn't find any, apart from the UDF functions which apply to a Row. Maybe I have missed something. Thanks. Justin -- View this message in context: http://apache-spark-user
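In the 1.3 DataFrame API there is no public hook for user-defined aggregates, so one common workaround (sketched below with made-up column names) is to drop to the underlying RDD, aggregate with aggregateByKey, and come back to a DataFrame.

    import org.apache.spark.sql.{DataFrame, Row, SQLContext}

    val sqlContext: SQLContext = ???   // your SQLContext
    val df: DataFrame = ???            // columns "key" (String) and "value" (Double); names assumed

    val grouped = df.rdd
      .map { case Row(key: String, value: Double) => (key, value) }
      .aggregateByKey((0.0, 0L))(
        (acc, v) => (acc._1 + v, acc._2 + 1),      // fold one value into the accumulator
        (a, b) => (a._1 + b._1, a._2 + b._2))      // merge two partial accumulators
      .mapValues { case (sum, n) => sum / n }      // custom aggregate: the mean

    val resultDF = sqlContext.createDataFrame(grouped).toDF("key", "mean_value")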

Per-machine configuration?

2015-05-14 Thread mj1200
Is it possible to configure each machine that Spark is using as a worker individually? For instance, setting the maximum number of cores to use for each machine individually, or the maximum memory, or other settings related to workers? Or is there any other way to specify a per-machine "capacity

spark log field clarification

2015-05-14 Thread yanwei
I am trying to extract the *output data size* information for *each task*. What *field(s)* should I look for, given the json-format log? Also, what does "Result Size" stand for? Thanks a lot in advance! -Yanwei -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble

Re: DStream Union vs. StreamingContext Union

2015-05-14 Thread Vadim Bichutskiy
@TD How do I file a JIRA? On Tue, May 12, 2015 at 2:06 PM, Tathagata Das wrote: > I wonder if that may be a bug in the Python API. Please file it as a JIRA > along with sample code to reproduce it and sample output you get. > > On Tue, May 12, 2015 at 10:00 AM, Vadim Bichutskiy < > vadim.bichuts.

Restricting the number of iterations in Mllib Kmeans

2015-05-14 Thread Suman Somasundar
Hi, I want to run a definite number of iterations in KMeans. There is a command line argument to set maxIterations, but even if I set it to a number, KMeans runs until the centroids converge. Is there a specific way to specify it on the command line? Also, I wanted to know if we can supply the
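For what it's worth, the iteration cap can also be set programmatically through the builder API; a sketch with arbitrary values follows. maxIterations is an upper bound, so a run may still stop earlier once the centroids converge within the tolerance.

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    val points: RDD[Vector] = ???     // feature vectors, prepared elsewhere

    val model = new KMeans()
      .setK(10)                       // arbitrary example value
      .setMaxIterations(25)           // upper bound on the number of iterations
      .run(points)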

Re: [Spark SQL 1.3.1] data frame saveAsTable returns exception

2015-05-14 Thread Michael Armbrust
End of the month is the target: https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage On Thu, May 14, 2015 at 3:45 AM, Ishwardeep Singh < ishwardeep.si...@impetus.co.in> wrote: > Hi Michael & Ayan, > > > > Thank you for your response to my problem. > > > > Michael do we have a tentativ

Re: store hive metastore on persistent store

2015-05-14 Thread Michael Armbrust
You can configure Spark SQL's Hive interaction by placing a hive-site.xml file in the conf/ directory. On Thu, May 14, 2015 at 10:24 AM, jamborta wrote: > Hi all, > > is it possible to set hive.metastore.warehouse.dir, that is internally > create by spark, to be stored externally (e.g. s3 on aws

Re: swap tuple

2015-05-14 Thread Yasemin Kaya
I solved my problem right this way. JavaPairRDD swappedPair = pair.mapToPair( new PairFunction, String, String>() { @Override public Tuple2 call( Tuple2 item) throws Exception { return item.swap(); } }); 2015-05-14 20:42 GMT+03:00 Stephen Carman : > Yea, I wouldn't try and modify the current

Re: how to delete data from table in sparksql

2015-05-14 Thread Michael Armbrust
The list of unsupported hive features should mention that it implicitly includes features added after Hive 13. You cannot yet compile with Hive > 13, though we are investigating this for 1.5 On Thu, May 14, 2015 at 6:40 AM, Denny Lee wrote: > Delete from table is available as part of Hive 0.14

RE: swap tuple

2015-05-14 Thread Stephen Carman
Yea, I wouldn't try to modify the current one since RDDs are supposed to be immutable; just create a new one... val newRdd = oldRdd.map(r => (r._2, r._1)) or something of that nature... Steve From: Evo Eftimov [evo.efti...@isecc.com] Sent: Thursday, May 14, 2015

RE: swap tuple

2015-05-14 Thread Evo Eftimov
Where is the “Tuple” supposed to be in - you can refer to a “Tuple” if it was e.g. > From: holden.ka...@gmail.com [mailto:holden.ka...@gmail.com] On Behalf Of Holden Karau Sent: Thursday, May 14, 2015 5:56 PM To: Yasemin Kaya Cc: user@spark.apache.org Subject: Re: swap tuple Can you pas

store hive metastore on persistent store

2015-05-14 Thread jamborta
Hi all, is it possible to set hive.metastore.warehouse.dir, which is internally created by Spark, to be stored externally (e.g. S3 on AWS or WASB on Azure)? thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/store-hive-metastore-on-persistent-store-tp22

Re: swap tuple

2015-05-14 Thread Holden Karau
Can you paste your code? transformations return a new RDD rather than modifying an existing one, so if you were to swap the values of the tuple using a map you would get back a new RDD and then you would want to try and print this new RDD instead of the original one. On Thursday, May 14, 2015, Yas

Storing a lot of state with updateStateByKey

2015-05-14 Thread krot.vyacheslav
Hi all, I'm a complete newbie to Spark and Spark Streaming, so the question may seem obvious, sorry for that. Is it okay to store Seq[Data] in state when using 'updateStateByKey'? I have a function with the signature def saveState(values: Seq[Msg], value: Option[Iterable[Msg]]): Option[Iterable[Msg]]
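A minimal sketch of the pattern being asked about, with Msg as a placeholder type: the state value for each key is simply the accumulated collection of messages. Note that updateStateByKey carries the entire per-key state from batch to batch, which is what makes very large state expensive.

    import org.apache.spark.streaming.dstream.DStream

    case class Msg(payload: String)                    // placeholder message type

    // For each key: messages arriving in this batch plus the previously stored ones.
    def saveState(values: Seq[Msg], state: Option[Iterable[Msg]]): Option[Iterable[Msg]] = {
      val updated = state.getOrElse(Nil) ++ values
      if (updated.isEmpty) None else Some(updated)     // returning None drops the key entirely
    }

    val keyedMsgs: DStream[(String, Msg)] = ???        // built from the input stream elsewhere
    // Requires ssc.checkpoint(...) to be set, since the state is checkpointed.
    val stateStream = keyedMsgs.updateStateByKey(saveState _)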

Multiple Kinesis Streams in a single Streaming job

2015-05-14 Thread Erich Ess
Hi, Is it possible to setup streams from multiple Kinesis streams and process them in a single job? From what I have read, this should be possible, however, the Kinesis layer errors out whenever I try to receive from more than a single Kinesis Stream. Here is the code. Currently, I am focused o

Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread David Morales
We put a lot of work in sparkta and it is awesome to hear from both the community and relevant people. Just as easy as that. I hope you have time to consider the project, which is our main concern at this moment, and hear from you too. 2015-05-14 17:46 GMT+02:00 Evo Eftimov : > I do not intend

RE: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread Evo Eftimov
I do not intend to provide comments on the actual “product” since my time is engaged elsewhere. My comments were on the “process” for commenting, which looked like self-indulgent, self-patting-on-the-back communication (between members of the party and its party leader) – that bs used to be inh

Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread David Morales
Thank you Paolo. Don't hesitate to contact us. Evo, we will be glad to hear from you and we are happy to see some kind of fast feedback from the main thought leader of spark, for sure. 2015-05-14 17:24 GMT+02:00 Paolo Platter : > Nice Job! > > we are developing something very similar… I will

Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread Paolo Platter
Nice Job! We are developing something very similar... I will contact you to see if we can contribute some pieces! Best Paolo From: Evo Eftimov Sent: Thursday, May 14, 2015 17:21 To: 'David Morales', Mate

RE: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread Evo Eftimov
That has been a really rapid “evaluation” of the “work” and its “direction” From: David Morales [mailto:dmora...@stratio.com] Sent: Thursday, May 14, 2015 4:12 PM To: Matei Zaharia Cc: user@spark.apache.org Subject: Re: SPARKTA: a real-time aggregation engine based on Spark Streaming Than

Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread Matei Zaharia
(Sorry, for non-English people: that means it's a good thing.) Matei > On May 14, 2015, at 10:53 AM, Matei Zaharia wrote: > > ...This is madness! > >> On May 14, 2015, at 9:31 AM, dmoralesdf wrote: >> >> Hi there, >> >> We have released our real-time aggregation engine based on Spark Stream

Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread David Morales
Thanks for your kind words Matei, happy to see that our work is in the right way. 2015-05-14 17:10 GMT+02:00 Matei Zaharia : > (Sorry, for non-English people: that means it's a good thing.) > > Matei > > > On May 14, 2015, at 10:53 AM, Matei Zaharia > wrote: > > > > ...This is madness! > > >

Re: Using sc.HadoopConfiguration in Python

2015-05-14 Thread ayan guha
Super, it worked. Thanks On Fri, May 15, 2015 at 12:26 AM, Ram Sriharsha wrote: > Here is an example of how I would pass in the S3 parameters to hadoop > configuration in pyspark. > You can do something similar for other parameters you want to pass to the > hadoop configuration > > hadoopConf=sc

Re: SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread Matei Zaharia
...This is madness! > On May 14, 2015, at 9:31 AM, dmoralesdf wrote: > > Hi there, > > We have released our real-time aggregation engine based on Spark Streaming. > > SPARKTA is fully open source (Apache2) > > > You can checkout the slides showed up at the Strata past week: > > http://www.s

Re: reduceByKey

2015-05-14 Thread ayan guha
Here is some Python code; I am sure you'd get the drift. Basically you need to implement 2 functions, seq and comb, for the partial and final operations. def addtup(t1,t2): j=() for k,v in enumerate(t1): j=j+(t1[k]+t2[k],) return j def seq(tIntrm,tNext): return addtup(tInt

Re: reduceByKey

2015-05-14 Thread Gaspar Muñoz
What have you tried so far? Maybe the easiest way is to use a collection and reduce them, adding their values. JavaPairRDD pairRDD = sc.parallelizePairs(data); JavaPairRDD> result = pairRDD.mapToPair(new Functions.createList()) .mapToPair(new Functions.ListStringToInt()) .reduceByKe

Re: --jars works in "yarn-client" but not "yarn-cluster" mode, why?

2015-05-14 Thread Fengyun RAO
Thanks, Wilfred. In our program, the "htrace-core-3.1.0-incubating.jar" dependency is only required in the executor, not in the driver, while in both "yarn-client" and "yarn-cluster" mode the executor runs in the cluster. And clearly in "yarn-cluster" mode, the jar IS in "spark.yarn.secondary.jars",

Re: Using sc.HadoopConfiguration in Python

2015-05-14 Thread Ram Sriharsha
Here is an example of how I would pass in the S3 parameters to hadoop configuration in pyspark. You can do something similar for other parameters you want to pass to the hadoop configuration hadoopConf=sc._jsc.hadoopConfiguration() hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.Native

Re: [Unit Test Failure] Test org.apache.spark.streaming.JavaAPISuite.testCount failed

2015-05-14 Thread Wangfei (X)
Yes, it fails repeatedly on my local Jenkins. Sent from my iPhone. On May 14, 2015, at 18:30, "Tathagata Das" wrote: Do you get this failure repeatedly? On Thu, May 14, 2015 at 12:55 AM, kf wrote: Hi, all, i got following error when i run unit test of spa

reduceByKey

2015-05-14 Thread Yasemin Kaya
Hi, I have a JavaPairRDD and I want to apply the reduceByKey method. My pairRDD: *2553: 0,0,0,1,0,0,0,0* 46551: 0,1,0,0,0,0,0,0 266: 0,1,0,0,0,0,0,0 *2553: 0,0,0,0,0,1,0,0* *225546: 0,0,0,0,0,1,0,0* *225546: 0,0,0,0,0,1,0,0* I want to get: *2553: 0,0,0,1,0,1,0,0* 46551: 0,1,0,0,0,0,0,0 266: 0,1
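One way to get that result is to parse each value into a numeric array and reduceByKey with an element-wise sum; here is a Scala sketch of the idea using the sample data above (see the earlier replies in this thread for Python and Java variants).

    // pairs of (id, comma-separated 0/1 flags), e.g. ("2553", "0,0,0,1,0,0,0,0")
    val pairs = sc.parallelize(Seq(
      ("2553", "0,0,0,1,0,0,0,0"),
      ("46551", "0,1,0,0,0,0,0,0"),
      ("266", "0,1,0,0,0,0,0,0"),
      ("2553", "0,0,0,0,0,1,0,0")))

    val summed = pairs
      .mapValues(_.split(",").map(_.toInt))                         // parse to Array[Int]
      .reduceByKey((a, b) => a.zip(b).map { case (x, y) => x + y }) // element-wise sum
      .mapValues(_.mkString(","))

    summed.collect().foreach { case (k, v) => println(s"$k: $v") }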

Re: how to delete data from table in sparksql

2015-05-14 Thread Denny Lee
Delete from table is available as part of Hive 0.14 (reference: Apache Hive Language Manual DML - Delete), while Spark 1.3 defaults to Hive 0.13. Perhaps rebuild Spark with Hive 0.14 or generate a new

SPARKTA: a real-time aggregation engine based on Spark Streaming

2015-05-14 Thread dmoralesdf
Hi there, We have released our real-time aggregation engine based on Spark Streaming. SPARKTA is fully open source (Apache2). You can check out the slides shown at Strata last week: http://www.slideshare.net/Stratio/strata-sparkta Source code: https://github.com/Stratio/sparkta And do

Re: how to set random seed

2015-05-14 Thread Charles Hayden
Thanks for the reply. I have not tried it out (I will today and report on my results) but I think what I need to do is to call mapPartitions and pass it a function that sets the seed. I was planning to pass the seed value in the closure. Something like: my_seed = 42 def f(iterator): ran

Re: spark sql hive-shims

2015-05-14 Thread Lior Chaga
I see that the pre-built distributions include hive-shims-0.23 shaded in the spark-assembly jar (unlike when I make the distribution myself). Does anyone know what I should do to include the shims in my distribution? On Thu, May 14, 2015 at 9:52 AM, Lior Chaga wrote: > Ultimately it was PermGen o

Re: how to read lz4 compressed data using fileStream of spark streaming?

2015-05-14 Thread lisendong
It still doesn't work… the StreamingContext can detect the new file, but it shows: ERROR dstream.FileInputDStream: File hdfs://nameservice1/sandbox/hdfs/list_join_action/2015_05_14_20_stream_1431605640.lz4 has no data in it. Spark Streaming can only ingest files that have been "moved" to the di

Re: Using sc.HadoopConfiguration in Python

2015-05-14 Thread ayan guha
Jo, thanks for the reply, but _jsc does not have anything to pass hadoop configs. Can you illustrate your answer a bit more? TIA... On Wed, May 13, 2015 at 12:08 AM, Ram Sriharsha wrote: > yes, the SparkContext in the Python API has a reference to the > JavaSparkContext (jsc) > > https://spark.a

RE: [Spark SQL 1.3.1] data frame saveAsTable returns exception

2015-05-14 Thread Ishwardeep Singh
Hi Michael & Ayan, Thank you for your response to my problem. Michael do we have a tentative release date for Spark version 1.4? Regards, Ishwardeep From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Wednesday, May 13, 2015 10:54 PM To: ayan guha Cc: Ishwardeep Singh; user Subject: R

Re: [Unit Test Failure] Test org.apache.spark.streaming.JavaAPISuite.testCount failed

2015-05-14 Thread Tathagata Das
Do you get this failure repeatedly? On Thu, May 14, 2015 at 12:55 AM, kf wrote: > Hi, all, i got following error when i run unit test of spark by > dev/run-tests > on the latest "branch-1.4" branch. > > the latest commit id: > commit d518c0369fa412567855980c3f0f426cde5c190d > Author: zsxwing

how to delete data from table in sparksql

2015-05-14 Thread luohui20001
Hi guys, I need to delete some data from a table with "delete from table where name = xxx"; however, "delete" is not functioning like the DML operation in Hive. I got info like the below: Usage: delete [FILE|JAR|ARCHIVE] []* 15/05/14 18:18:24 ERROR processors.DeleteResourceProcessor: Usage: dele

Re: how to set random seed

2015-05-14 Thread ayan guha
Sorry for the late reply. Here is what I was thinking: import random as r def main(): get SparkContext #Just for fun, lets assume seed is an id filename="bin.dat" seed = id(filename) #broadcast it br = sc.broadcast(seed) #set up dummy list lst = [] for i in range(4): x

Build change PSA: Hadoop 2.2 default; -Phadoop-x.y profile recommended for builds

2015-05-14 Thread Sean Owen
This change will be merged shortly for Spark 1.4, and has a minor implication for those creating their own Spark builds: https://issues.apache.org/jira/browse/SPARK-7249 https://github.com/apache/spark/pull/5786 The default Hadoop dependency has actually been Hadoop 2.2 for some time, but the def

Re: kafka + Spark Streaming with checkPointing fails to restart

2015-05-14 Thread Ankur Chauhan
Thanks everyone, that was the problem. The "create new streaming context" function was supposed to set up the stream processing as well as the checkpoint directory. I had missed the whole process of checkpoint setup. With that done, everything works as

Re: Kafka stream fails: java.lang.NoClassDefFound com/yammer/metrics/core/Gauge

2015-05-14 Thread Tathagata Das
It would be good if you can tell what I should add to the documentation to make it easier to understand. I can update the docs for 1.4.0 release. On Tue, May 12, 2015 at 9:52 AM, Lee McFadden wrote: > Thanks for explaining Sean and Cody, this makes sense now. I'd like to > help improve this doc

Re: The explanation of input text format using LDA in Spark

2015-05-14 Thread Cui xp
Hi Keegan, Thanks a lot. Now I know the columns represent all the words, without repetition, across all documents. I don't know what determines the order of the words; is there any difference when the columns' words are in a different order? Thanks.
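To make the column/ordering point concrete, here is a sketch of how an MLlib LDA corpus is usually assembled (the vocabulary and documents are made up): every document becomes a count vector indexed by one shared vocabulary, so the word order only matters in that the same word-to-index mapping must be used for every document.

    import org.apache.spark.mllib.clustering.LDA
    import org.apache.spark.mllib.linalg.Vectors

    // One shared vocabulary decides which word owns which column index.
    val vocab = Array("spark", "stream", "cluster", "data")   // made-up vocabulary
    val vocabIndex = vocab.zipWithIndex.toMap

    val docs = Seq(
      Seq("spark", "data", "data"),                           // made-up documents
      Seq("cluster", "stream", "spark"))

    // (docId, term-count vector); the same index mapping is used for every document.
    val corpus = sc.parallelize(docs.zipWithIndex.map { case (words, id) =>
      val counts = words.groupBy(identity).map { case (w, ws) => (vocabIndex(w), ws.size.toDouble) }
      (id.toLong, Vectors.sparse(vocab.length, counts.toSeq))
    })

    val ldaModel = new LDA().setK(2).run(corpus)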

swap tuple

2015-05-14 Thread Yasemin Kaya
Hi, I have a *JavaPairRDD* and I want to *swap tuple._1() with tuple._2()*. I use *tuple.swap()* but the JavaPairRDD is not actually changed. When I print the JavaPairRDD, the values are the same. Can anyone help me with that? Thank you. Have a nice day. yasemin -- hiç ender hiç

Re: how to read lz4 compressed data using fileStream of spark streaming?

2015-05-14 Thread Akhil Das
Here's the class. You can read more here https://github.com/twitter/hadoop-lzo#maven-repository Thanks Best Regards On Thu, May 14, 2015 at 1:22 PM, lisendong wrote: > LzoTextInputForm

[Unit Test Failure] Test org.apache.spark.streaming.JavaAPISuite.testCount failed

2015-05-14 Thread kf
Hi all, I got the following error when I ran the unit tests of Spark via dev/run-tests on the latest "branch-1.4" branch. The latest commit id: commit d518c0369fa412567855980c3f0f426cde5c190d Author: zsxwing Date: Wed May 13 17:58:29 2015 -0700 error [info] Test org.apache.spark.streaming.JavaAPISui

Re: Unsubscribe

2015-05-14 Thread Ted Yu
Please see http://spark.apache.org/community.html Cheers > On May 14, 2015, at 12:38 AM, Saurabh Agrawal > wrote: > > > How do I unsubscribe from this mailing list please? > > Thanks!! > > Regards, > Saurabh Agrawal > Vice President > > Markit > > Green Boulevard > B-9A, Tower C >

Re: how to read lz4 compressed data using fileStream of spark streaming?

2015-05-14 Thread Akhil Das
That's because you are using TextInputFormat, I think; try LzoTextInputFormat like: val list_join_action_stream = ssc.fileStream[LongWritable, Text, com.hadoop.mapreduce.LzoTextInputFormat](gc.input_dir, (t: Path) => true, false).map(_._2.toString) Thanks Best Regards On Thu, May 14, 2015 at

Re: Unsubscribe

2015-05-14 Thread Akhil Das
Have a look https://spark.apache.org/community.html Send an email to user-unsubscr...@spark.apache.org Thanks Best Regards On Thu, May 14, 2015 at 1:08 PM, Saurabh Agrawal wrote: > > > How do I unsubscribe from this mailing list please? > > > > Thanks!! > > > > Regards, > > Saurabh Agrawal > >

Unsubscribe

2015-05-14 Thread Saurabh Agrawal
How do I unsubscribe from this mailing list please? Thanks!! Regards, Saurabh Agrawal Vice President Markit Green Boulevard B-9A, Tower C 3rd Floor, Sector - 62, Noida 201301, India +91 120 611 8274 Office This e-mail, including accompanying communications an

Re: how to read lz4 compressed data using fileStream of spark streaming?

2015-05-14 Thread Akhil Das
Have a look https://spark.apache.org/community.html Send an email to user-unsubscr...@spark.apache.org Thanks Best Regards On Thu, May 14, 2015 at 1:06 PM, Saurabh Agrawal wrote: > How do I unsubscribe from this mailing list please? > > > > Thanks!! > > > > Regards, > > Saurabh Agrawal > > Vi

Re: Worker & Core in Spark

2015-05-14 Thread Akhil Das
That totally depends on the use-case. The more data and the more complex the pipeline you have, the more cores you will require. Thanks Best Regards On Wed, May 13, 2015 at 8:13 AM, guoqing0...@yahoo.com.hk < guoqing0...@yahoo.com.hk> wrote: > Assume that i had several machines with 8 cores , 1 core

RE: how to read lz4 compressed data using fileStream of spark streaming?

2015-05-14 Thread Saurabh Agrawal
How do I unsubscribe from this mailing list please? Thanks!! Regards, Saurabh Agrawal Vice President Markit Green Boulevard B-9A, Tower C 3rd Floor, Sector - 62, Noida 201301, India +91 120 611 8274 Office This e-mail, including accompanying communications and

Re: how to read lz4 compressed data using fileStream of spark streaming?

2015-05-14 Thread Akhil Das
What do you mean by "not detected"? Maybe you forgot to trigger some action on the stream to get it executed. Like: val list_join_action_stream = ssc.fileStream[LongWritable, Text, TextInputFormat](gc.input_dir, (t: Path) => true, false).map(_._2.toString) *list_join_action_stream.count().print()*

Re: spark-streaming whit flume error

2015-05-14 Thread Akhil Das
Can you share the client code that you used to send the data? Maybe this discussion would give you some insights: http://apache-avro.679487.n3.nabble.com/Avro-RPC-Python-to-Java-isn-t-working-for-me-td4027454.html Thanks Best Regards On Thu, May 14, 2015 at 8:44 AM, 鹰 <980548...@qq.com> wrote: >

Re: Spark SQL: preferred syntax for column reference?

2015-05-14 Thread Michael Armbrust
Depends on which compile time you are talking about. *scala compile time*: No, the information about which columns are available is usually coming from a file or an external database which may or may not be available to scalac. *query compile time*: While your program is running, but before any s
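For readers following the thread, the column-reference styles being compared look like this against the 1.3 DataFrame API (the DataFrame and column name are made up); per the explanation above, none of them is checked at scalac time.

    import sqlContext.implicits._            // enables the $"..." syntax
    import org.apache.spark.sql.functions.col

    // df is an existing DataFrame assumed to have an "age" column.
    df.filter(df("age") > 21)                // reference through the DataFrame itself
    df.filter($"age" > 21)                   // ColumnName via the $ interpolator
    df.filter(col("age") > 21)               // free-standing column by name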

Re: How to run multiple jobs in one sparkcontext from separate threads in pyspark?

2015-05-14 Thread Akhil Das
Did you happen to have a look at the Spark Job Server? Someone wrote a Python wrapper around it; give it a try. Thanks Best Regards On Thu, May 14, 2015 at 11:10 AM, MEETHU MATHEW wrote: > Hi all,