Re: Spark SVD benchmark for dense matrices

2017-08-09 Thread Anastasios Zouzias
Hi Jose, Just to note that in the Databricks blog they state that they compute the top-5 singular vectors, not all singular values/vectors. Computing all of them is much more computationally intensive. Cheers, Anastasios On 09.08.2017 15:19, "Jose Francisco Saray Villamizar" < jsa...@gmail.com> wrote:
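For reference, a minimal sketch of computing only the top-k singular factors with MLlib's RowMatrix (the matrix contents and k below are placeholders; the Databricks benchmark computed the top-5):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Hypothetical small dense matrix; each RDD element is one row.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0),
  Vectors.dense(7.0, 8.0, 10.0)
))
val mat = new RowMatrix(rows)

// Ask only for the top-k factors instead of the full decomposition.
val k = 2
val svd = mat.computeSVD(k, computeU = true)
val U = svd.U  // distributed RowMatrix, m x k
val s = svd.s  // k singular values
val V = svd.V  // local Matrix, n x k (columns are right singular vectors)
```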

Re: --jars from spark-submit on master on YARN don't get added properly to the executors - ClassNotFoundException

2017-08-09 Thread Mikhailau, Alex
Thanks, Marcelo. Will give it a shot tomorrow. -Alex On 8/9/17, 5:59 PM, "Marcelo Vanzin" wrote: Jars distributed using --jars are not added to the system classpath, so log4j cannot see them. To work around that, you need to manually add the *name* jar to the driver executo

Re: --jars from spark-submit on master on YARN don't get added properly to the executors - ClassNotFoundException

2017-08-09 Thread Marcelo Vanzin
Jars distributed using --jars are not added to the system classpath, so log4j cannot see them. To work around that, you need to manually add the *name* jar to the driver and executor classpaths: spark.driver.extraClassPath=some.jar spark.executor.extraClassPath=some.jar In client mode you should use
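A hedged sketch of how the submit command from the original message (below) might look with this suggestion applied. The jar names are taken from that command; in YARN cluster mode the --jars files are assumed to land in each container's working directory, so bare file names on the extra classpath should resolve:

```
/usr/lib/spark/bin/spark-submit \
  --deploy-mode cluster --master yarn \
  --jars /home/hadoop/lib/jsonevent-layout-1.7.jar,/home/hadoop/lib/json-smart-1.1.1.jar \
  --conf spark.driver.extraClassPath=jsonevent-layout-1.7.jar:json-smart-1.1.1.jar \
  --conf spark.executor.extraClassPath=jsonevent-layout-1.7.jar:json-smart-1.1.1.jar \
  ...
```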

--jars from spark-submit on master on YARN don't get added properly to the executors - ClassNotFoundException

2017-08-09 Thread Mikhailau, Alex
I have log4j json layout jars added via spark-submit on EMR /usr/lib/spark/bin/spark-submit --deploy-mode cluster --master yarn --jars /home/hadoop/lib/jsonevent-layout-1.7.jar,/home/hadoop/lib/json-smart-1.1.1.jar --driver-java-options "-XX:+AlwaysPreTouch -XX:MaxPermSize=6G" --class com.mlbam

Has anyone come across Incorta, which relies on Spark, Parquet and open source libraries?

2017-08-09 Thread Mich Talebzadeh
Hi, There is a tool called Incorta that uses Spark, Parquet and open source big data analytics libraries. Its aim is to accelerate analytics. It claims to incorporate Direct Data Mapping to deliver near real-time analytics on top of original, intricate, transactional data such as ERP systems

Re: KafkaUtils.createRDD: How do I read all the data from Kafka in a batch program for a given topic?

2017-08-09 Thread Cody Koeninger
org.apache.spark.streaming.kafka.KafkaCluster has methods getLatestLeaderOffsets and getEarliestLeaderOffsets On Mon, Aug 7, 2017 at 11:37 PM, shyla deshpande wrote: > Thanks TD. > > On Mon, Aug 7, 2017 at 8:59 PM, Tathagata Das > wrote: >> >> I don't think there is any easier way. >> >> On Mon,
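A minimal sketch of the batch read Cody is pointing at, assuming the spark-streaming-kafka 0.8 artifact. The broker, topic, and offset values are placeholders; in practice the earliest/latest offsets would come from KafkaCluster.getEarliestLeaderOffsets / getLatestLeaderOffsets:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

// One OffsetRange per topic partition: (topic, partition, fromOffset, untilOffset).
val offsetRanges = Array(
  OffsetRange("my-topic", 0, 0L, 1000L),
  OffsetRange("my-topic", 1, 0L, 1000L)
)

// Reads exactly the requested offsets as a batch RDD of (key, value) pairs.
val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, offsetRanges)
val values = rdd.map(_._2)
```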

Spark SVD benchmark for dense matrices

2017-08-09 Thread Jose Francisco Saray Villamizar
Hi everyone, I am trying to invert a 5000 x 5000 dense matrix (99% non-zeros) by using SVD, with an approach similar to: https://stackoverflow.com/questions/29969521/how-to-compute-the-inverse-of-a-rowmatrix-in-apache-spark The time I'm getting with SVD is close to 10 minutes, which is very long
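For context, the approach in the linked Stack Overflow answer boils down to a full SVD followed by inv(X) = V * inv(S) * U^T. A sketch along those lines (collecting U to the driver, which is feasible for 5000 x 5000 but memory-heavy):

```scala
import org.apache.spark.mllib.linalg.{DenseMatrix, DenseVector}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Pseudo-inverse of a square RowMatrix via full SVD, following the linked answer.
def computeInverse(x: RowMatrix): DenseMatrix = {
  val n = x.numCols.toInt
  val svd = x.computeSVD(n, computeU = true)
  require(svd.s.size >= n, "matrix is (numerically) singular")

  // inv(S): diagonal matrix of reciprocal singular values.
  val invS = DenseMatrix.diag(new DenseVector(svd.s.toArray.map(1.0 / _)))

  // Collect U locally. Feeding row-major data to the column-major constructor
  // of a square matrix effectively yields U transposed, which the formula needs.
  val uT = new DenseMatrix(n, n, svd.U.rows.collect.flatMap(_.toArray))

  // inv(X) = V * inv(S) * U^T
  svd.V.multiply(invS).multiply(uT)
}
```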

StreamingQueryListener in Spark Structured Streaming

2017-08-09 Thread purna pradeep
I'm working on a structured streaming application wherein I'm reading from Kafka as a stream, and for each batch of streams I need to perform a lookup against an S3 file (which is nearly 200 GB) to fetch some attributes. So I'm using df.persist() (basically caching the lookup), but I need to refresh the dataframe as the
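One hedged sketch of the refresh part (the path and names are hypothetical): keep the lookup in a mutable reference, build and materialize a fresh cached copy, then unpersist the old one.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical lookup source on S3; loaded once and cached for reuse across batches.
def loadLookup(spark: SparkSession): DataFrame =
  spark.read.parquet("s3://my-bucket/lookup/").cache()

var lookupDF: DataFrame = loadLookup(spark)

// Called whenever the underlying S3 data is known to have changed.
def refreshLookup(spark: SparkSession): Unit = {
  val fresh = loadLookup(spark)
  fresh.count()          // force materialization of the new cache
  val old = lookupDF
  lookupDF = fresh
  old.unpersist()
}
```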

Re[4]: Trying to connect Spark 1.6 to Hive

2017-08-09 Thread toletum
Yes... I know... but the cluster is not administered by me. On Wed., Aug. 9, 2017 at 13:46, Gourav Sengupta wrote: Hi, Just out of sheer curiosity - why are you using Spark 1.6? Since then Spark has made significant advancements and improvements, why not take advantage of that? Regards, Goura

Re: Re[2]: Trying to connect Spark 1.6 to Hive

2017-08-09 Thread Gourav Sengupta
Hi, Just out of sheer curiosity - why are you using Spark 1.6? Since then Spark has made significant advancements and improvements, why not take advantage of that? Regards, Gourav On Wed, Aug 9, 2017 at 10:41 AM, toletum wrote: > Thanks Matteo > > I fixed it > > Regards, > JCS > > On Wed., Aug

Re[2]: Trying to connect Spark 1.6 to Hive

2017-08-09 Thread toletum
Thanks Matteo, I fixed it. Regards, JCS On Wed., Aug. 9, 2017 at 11:22, Matteo Cossu wrote: Hello, try to use these options when starting Spark: --conf "spark.driver.userClassPathFirst=true" --conf "spark.executor.userClassPathFirst=true" In this way you will be sure that the executor and the dr

Re: Re: Re: Is there an operation to create multiple records for every element in an RDD?

2017-08-09 Thread luohui20001
Thank you guys, I got my code working like below: val record75df = sc.parallelize(listForRule75, numPartitions).map(x => x.replace("|", ",")).map(_.split(",")).map(x => Mycaseclass4(x(0).toInt, x(1).toInt, x(2).toFloat, x(3).toInt)).toDF() val userids = 1 to 1 val uiddf = sc.paralleliz

Re: Reusing dataframes for streaming (spark 1.6)

2017-08-09 Thread Tathagata Das
There is a DStream.transform() that does exactly this. On Tue, Aug 8, 2017 at 7:55 PM, Ashwin Raju wrote: > Hi, > > We've built a batch application on Spark 1.6.1. I'm looking into how to > run the same code as a streaming (DStream based) application. This is using > pyspark. > > In the batch ap
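A minimal sketch of that hook (the batch function and element type are placeholders; the original poster is on pyspark, where the same transform method exists, but Scala is used here to match the other snippets):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Existing batch logic, written against plain RDDs.
def batchLogic(rdd: RDD[String]): RDD[String] =
  rdd.filter(_.nonEmpty).map(_.toUpperCase)

// Reuse it unchanged on each micro-batch of a DStream.
def asStreaming(lines: DStream[String]): DStream[String] =
  lines.transform(batchLogic _)
```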

Re: Multiple queries on same stream

2017-08-09 Thread Tathagata Das
It's important to note that running multiple streaming queries, as of today, would read the input data that many times. So there is a trade-off between the two approaches. So even though scenario 1 won't get great Catalyst optimization, it may be more efficient overall in terms of resource u

Re: Re: Is there an operation to create multiple records for every element in an RDD?

2017-08-09 Thread Ryan
RDD has a cartesian method On Wed, Aug 9, 2017 at 5:12 PM, ayan guha wrote: > If you use join without any condition it becomes a cross join. In SQL, it > looks like > > Select a.*, b.* from a join b > > On Wed, 9 Aug 2017 at 7:08 pm, wrote: > >> Riccardo and Ryan >> Thank you for your ideas. It
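As a hedged illustration of the cartesian route (the id range and rule strings are stand-ins based on the original question):

```scala
// Pair every user id with every rule string: one output record per combination.
val userIds = sc.parallelize(1 to 100, 16)
val rules = sc.parallelize(Seq(
  "1,1,100.00|1483286400",
  "1,1,100.00|1483372800"
))

val expanded = userIds.cartesian(rules)  // RDD[(Int, String)]
expanded.take(5).foreach(println)
```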

Re: Trying to connect Spark 1.6 to Hive

2017-08-09 Thread Matteo Cossu
Hello, try to use these options when starting Spark: --conf "spark.driver.userClassPathFirst=true" --conf "spark.executor.userClassPathFirst=true" In this way you will be sure that the executor and the driver of Spark will use the classpath you define. Best Regards, Matteo Cossu On 5 August
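In spark-submit form, that suggestion would look roughly like the following (the application jar and main class are hypothetical placeholders):

```
spark-submit \
  --conf "spark.driver.userClassPathFirst=true" \
  --conf "spark.executor.userClassPathFirst=true" \
  --class com.example.MyHiveApp \
  my-hive-app.jar
```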

Re: [Structured Streaming] Recommended way of joining streams

2017-08-09 Thread Tathagata Das
Writing streams into some sink (preferably a fault-tolerant, exactly-once sink; see the docs) and then joining is definitely a possible way. But you will likely incur higher latency. If you want lower latency, then stream-stream joins are the best approach, which we are working on right now. Spark 2.3 is
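A hedged sketch of that sink-then-join pattern (sink paths, checkpoint locations, and the join key are hypothetical): each streaming DataFrame is persisted to its own fault-tolerant sink, and a downstream job joins the materialized outputs.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// df is a streaming DataFrame (e.g. read from Kafka elsewhere); paths are hypothetical.
def persistStream(df: DataFrame, name: String): Unit = {
  df.writeStream
    .format("parquet")
    .option("path", s"s3://my-bucket/$name")
    .option("checkpointLocation", s"s3://my-bucket/checkpoints/$name")
    .start()
}

// A separate job (batch here, for simplicity) joins the materialized outputs on a common id.
def joinMaterialized(spark: SparkSession): DataFrame = {
  val a = spark.read.parquet("s3://my-bucket/streamA")
  val b = spark.read.parquet("s3://my-bucket/streamB")
  a.join(b, "id")
}
```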

Re: Re: Is there an operation to create multiple records for every element in an RDD?

2017-08-09 Thread ayan guha
If you use join without any condition it becomes a cross join. In SQL, it looks like Select a.*, b.* from a join b On Wed, 9 Aug 2017 at 7:08 pm, wrote: > Riccardo and Ryan > Thank you for your ideas. It seems that crossJoin is a new Dataset API > after Spark 2.x. > My Spark version is 1.6.3.
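On Spark 1.6, where Dataset.crossJoin is not available, the same effect can be sketched with an unconditioned DataFrame join. The column names follow Riccardo's example below; rdd1 and listForRule77 are stand-ins for the values in the original question:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Stand-ins for rdd1 / listForRule77 from the original question.
val rdd1 = sc.parallelize(1 to 100, 16)
val listForRule77 = List("1,1,100.00|1483286400", "1,1,100.00|1483372800")

val df1 = rdd1.toDF("userid")
val listDF = sc.parallelize(listForRule77).toDF("data")

// On 1.6, a join with no condition is a cartesian product (the cross join described above).
val crossed = df1.join(listDF)
crossed.orderBy("userid").show(60, false)
```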

Re: Re: Is there an operation to create multiple records for every element in an RDD?

2017-08-09 Thread luohui20001
Riccardo and Ryan, Thank you for your ideas. It seems that crossJoin is a new Dataset API after Spark 2.x. My Spark version is 1.6.3. Is there an equivalent API to do a cross join? Thank you. Thanks & Best regards! San.Luo - Original Message - From: Riccardo Ferrar

Re: Is there an operation to create multiple records for every element in an RDD?

2017-08-09 Thread Riccardo Ferrari
Depending on your Spark version, have you considered the Dataset API? You can do something like: val df1 = rdd1.toDF("userid") val listRDD = sc.parallelize(listForRule77) val listDF = listRDD.toDF("data") df1.crossJoin(listDF).orderBy("userid").show(60, truncate=false) +--+-

Re: Is there an operation to create multiple records for every element in an RDD?

2017-08-09 Thread Ryan
It's just a sort of inner join operation... If the second dataset isn't very large it's OK (btw, you can use flatMap directly instead of map followed by flatMap/flatten); otherwise you can register the second one as an RDD/Dataset and join them on user id. On Wed, Aug 9, 2017 at 4:29 PM, wrote: >
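Ryan's first suggestion, using flatMap directly, might be sketched like this (the list and id range are stand-ins from the original question):

```scala
// For every user id, emit one record per rule string.
val listForRule77 = List("1,1,100.00|1483286400", "1,1,100.00|1483372800")
val userIDs = sc.parallelize(1 to 100, 16)

val broadcastRules = sc.broadcast(listForRule77)   // small list, so broadcast it
val expanded = userIDs.flatMap { uid =>
  broadcastRules.value.map(rule => s"$uid,$rule")
}
expanded.take(5).foreach(println)
```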

Is there an operation to create multiple records for every element in an RDD?

2017-08-09 Thread luohui20001
Hello guys: I have a simple RDD like: val userIDs = 1 to 1 val rdd1 = sc.parallelize(userIDs, 16) // this rdd has 1 user id And I have a List[String] like below: scala> listForRule77 res76: List[String] = List(1,1,100.00|1483286400, 1,1,100.00|1483372800, 1,1,100.00|1483459200,

[Structured Streaming] Recommended way of joining streams

2017-08-09 Thread Priyank Shrivastava
I have streams of data coming in from various applications through Kafka. These streams are converted into dataframes in Spark. I would like to join these dataframes on a common ID they all contain. Since joining streaming dataframes is currently not supported, what is the current recommended wa