Handle data skew when calculating word count and word dependency

2016-11-13 Thread ruan.answer
I am planning to calculate word count and two-word dependency via Spark, but the data is skewed; how can I solve this problem? And do you have any suggestions about double-level data slicing? I have some topics, and each topic corresponds to lots of text, so I have an RDD structure like this: JavaPair
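
For reference, a minimal sketch of the usual two-phase (salted) aggregation for a skewed word count; the input path and the salt factor of 16 are assumptions, not from the thread:

    import scala.util.Random
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("salted-wordcount").getOrCreate()
    val words = spark.sparkContext.textFile("input.txt").flatMap(_.split("\\s+"))
    // Phase 1: prefix each word with a random salt so hot keys spread across partitions.
    val partial = words.map(w => ((Random.nextInt(16), w), 1L)).reduceByKey(_ + _)
    // Phase 2: strip the salt and merge the partial counts per word.
    val counts = partial.map { case ((_, w), c) => (w, c) }.reduceByKey(_ + _)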

Re: Convert SparseVector column to DenseVector column

2016-11-13 Thread Takeshi Yamamuro
Hi, how about this? import org.apache.spark.ml.linalg._ import org.apache.spark.sql.functions.udf val toSV = udf((v: Vector) => v.toDense) val df = Seq((0.1, Vectors.sparse(16, Array(0, 3), Array(0.1, 0.3))), (0.2, Vectors.sparse(16, Array(0, 3), Array(0.1, 0.3)))).toDF("a", "b") df.select(toSV($"b")) // maropu On Mon, Nov 14, 2016 at

Convert SparseVector column to DenseVector column

2016-11-13 Thread janardhan shetty
Hi, is there an easy way of converting a DataFrame column from SparseVector to DenseVector using the org.apache.spark.ml.linalg.DenseVector API? Spark ML 2.0

Re: Spark SQL shell hangs

2016-11-13 Thread Hyukjin Kwon
Hi Rakesh, Could you please open an issue at https://github.com/databricks/spark-xml with some code so that reviewers can reproduce the issue you hit? Thanks! 2016-11-14 0:20 GMT+09:00 rakesh sharma : > Hi > > I'm trying to convert an XML file to a data frame using Databricks spark-xml. But

Re: Spark ML : One hot Encoding for multiple columns

2016-11-13 Thread Nicholas Sharkey
Amen > On Nov 13, 2016, at 7:55 PM, janardhan shetty wrote: > > These JIRAs are still unresolved: > https://issues.apache.org/jira/browse/SPARK-11215 > > Also there is https://issues.apache.org/jira/browse/SPARK-8418 > >> On Wed, Aug 17, 2016 at 11:15 AM, Nisha Muktewar wrote: >> >> The O

Re: Spark ML : One hot Encoding for multiple columns

2016-11-13 Thread janardhan shetty
These JIRAs are still unresolved: https://issues.apache.org/jira/browse/SPARK-11215 Also there is https://issues.apache.org/jira/browse/SPARK-8418 On Wed, Aug 17, 2016 at 11:15 AM, Nisha Muktewar wrote: > > The OneHotEncoder does *not* accept multiple columns. > > You can use Michal's suggest
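
Until those JIRAs are resolved, the usual workaround is to generate one StringIndexer/OneHotEncoder pair per column inside a Pipeline; a sketch with hypothetical column names:

    import org.apache.spark.ml.{Pipeline, PipelineStage}
    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

    val cols = Seq("colA", "colB") // hypothetical categorical columns
    val stages: Array[PipelineStage] = cols.flatMap { c =>
      Seq(
        new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx"),
        new OneHotEncoder().setInputCol(s"${c}_idx").setOutputCol(s"${c}_vec")
      )
    }.toArray
    val pipeline = new Pipeline().setStages(stages)
    // val encoded = pipeline.fit(df).transform(df) // df is the input DataFrame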

Re: sbt shenanigans for a Spark-based project

2016-11-13 Thread Don Drake
I would upgrade your Scala version to 2.11.8, as Spark 2.0 uses Scala 2.11 by default. On Sun, Nov 13, 2016 at 3:01 PM, Marco Mistroni wrote: > Hi all > I have a small Spark-based project which at the moment depends on jars > from Spark 1.6.0 > The project has a few Spark examples plus one which de
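
A minimal build.sbt sketch along those lines (the Spark version and the Flume module are assumptions based on the thread):

    scalaVersion := "2.11.8"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"            % "2.0.1" % "provided",
      "org.apache.spark" %% "spark-sql"             % "2.0.1" % "provided",
      "org.apache.spark" %% "spark-streaming"       % "2.0.1" % "provided",
      "org.apache.spark" %% "spark-streaming-flume" % "2.0.1"
    )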

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-13 Thread Cody Koeninger
Preferred locations are only advisory; you can still get tasks scheduled on other executors. You can try bumping up the size of the cache to see if that is what's causing the issue you're seeing. On Nov 13, 2016 12:47, "Ivan von Nagy" wrote: > As the code iterates through the parallel list, it is proc
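
For the 0.10 integration, the consumer cache is capped by a Spark conf; a sketch of bumping it (128 is just an illustrative value; the default is 64):

    import org.apache.spark.SparkConf

    // Raise the cap on cached Kafka consumers per executor (default 64).
    val conf = new SparkConf()
      .setAppName("kafka-cache-demo")
      .set("spark.streaming.kafka.consumer.cache.maxCapacity", "128")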

sbt shenanigans for a Spark-based project

2016-11-13 Thread Marco Mistroni
Hi all, I have a small Spark-based project which at the moment depends on jars from Spark 1.6.0. The project has a few Spark examples plus one which depends on Flume libraries. I am attempting to move to Spark 2.0, but I am having issues with my dependencies. The setup below works fine when compiled a

Re: Nearest neighbour search

2016-11-13 Thread Meeraj Kunnumpurath
That was a bit of a brute-force search, so I changed the code to use a UDF to compute the dot product between the two IDF vectors and sort on the new column. package com.ss.ml.clustering import org.apache.spark.sql.{DataFrame, SparkSession} import org.apache.spark.sql.functions._ import org.
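
A sketch of that UDF approach (the query vector and column names are hypothetical); dividing the dot product by both L2 norms gives the cosine similarity directly:

    import org.apache.spark.ml.linalg.{Vector, Vectors}
    import org.apache.spark.sql.functions._

    val queryVec = Vectors.sparse(16, Array(0, 3), Array(0.1, 0.3)) // hypothetical query
    val cosine = udf { (v: Vector) =>
      // Dot product against the query vector, normalised by both L2 norms.
      val dot = v.toArray.zip(queryVec.toArray).map { case (x, y) => x * y }.sum
      dot / (Vectors.norm(v, 2.0) * Vectors.norm(queryVec, 2.0))
    }
    // df.withColumn("score", cosine(col("idf"))).orderBy(desc("score")).show(10)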

receiver based spark streaming doubts

2016-11-13 Thread Shushant Arora
Hi, in receiver-based Spark Streaming - when the receiver gets data and stores it in blocks for workers to process, how many blocks does the receiver give to a worker? Say I have a streaming app with a 30 sec batch interval; what will happen? 1. For the first batch (first 30 sec) there will not be any data for wo
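
For reference, the receiver cuts the stream into blocks every spark.streaming.blockInterval (200 ms by default), so a 30 s batch holds roughly 30000 / 200 = 150 blocks, and each block becomes one partition (one task). A sketch:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("receiver-blocks-demo")
      .set("spark.streaming.blockInterval", "200ms") // the default, shown explicitly
    // 30 s batches at 200 ms per block => ~150 blocks (partitions) per batch.
    val ssc = new StreamingContext(conf, Seconds(30))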

Re: Nearest neighbour search

2016-11-13 Thread Meeraj Kunnumpurath
This is what I have done, is there a better way of doing this? val df = spark.read.option("header", "false").csv("data") val tk = new Tokenizer().setInputCol("_c2").setOutputCol("words") val tf = new HashingTF().setInputCol("words").setOutputCol("tf") val idf = new IDF().setInputCol("t

Re: Instability issues with Spark 2.0.1 and Kafka 0.10

2016-11-13 Thread Ivan von Nagy
As the code iterates through the parallel list, it is processing up to 8 KafkaRDDs at a time. Each has its own unique topic and consumer group now. Every topic has 4 partitions, so technically there should never be more than 32 CachedKafkaConsumers. However, this seems not to be the case, as we are

[ANNOUNCE] Apache SystemML 0.11.0-incubating released

2016-11-13 Thread Luciano Resende
The Apache SystemML team is pleased to announce the release of Apache SystemML version 0.11.0-incubating. Apache SystemML provides declarative large-scale machine learning (ML) that aims at a flexible specification of ML algorithms and automatic generation of hybrid runtime plans ranging from sing

Re: Strongly Connected Components

2016-11-13 Thread Nicholas Chammas
FYI: There is a new connected components implementation coming in GraphFrames 0.3. See: https://github.com/graphframes/graphframes/pull/119 Implementation is based on: https://mmds-data.org/presentations/2014/vassilvitskii_mmds14.pdf Nick On Sat, Nov 12, 2016 at 3:01 PM Koert Kuipers wrote: >
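
A sketch against the GraphFrames API (assumes the graphframes package is on the classpath and `spark` is the active SparkSession; the toy graph is made up). The new implementation checkpoints, so a checkpoint directory is set first:

    import org.graphframes.GraphFrame

    spark.sparkContext.setCheckpointDir("/tmp/cc-checkpoints") // used by the new algorithm
    val vertices = spark.createDataFrame(Seq((1L, "a"), (2L, "b"), (3L, "c"))).toDF("id", "name")
    val edges = spark.createDataFrame(Seq((1L, 2L))).toDF("src", "dst")
    val components = GraphFrame(vertices, edges).connectedComponents.run()
    // The result carries a "component" column labelling each vertex.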

Re: Joining to a large, pre-sorted file

2016-11-13 Thread Silvio Fiorito
Hi Stuart, Yes, that's the query plan, but if you take a look at my screenshot it skips the first stage since the datasets are co-partitioned. Thanks, Silvio From: Stuart White Sent: Saturday, November 12, 2016 11:20:28 AM To: Silvio Fiorito Cc: user@spark.apache
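
A sketch of getting that co-partitioned layout with Spark 2.0's bucketing (bucket count, join key, table names, and the largeDf/smallDf inputs are assumptions); both sides share the same spec so the sort-merge join can skip the exchange:

    // Write both sides bucketed and sorted on the join key.
    largeDf.write.bucketBy(200, "key").sortBy("key").saveAsTable("large_bucketed")
    smallDf.write.bucketBy(200, "key").sortBy("key").saveAsTable("small_bucketed")
    // Joining the bucketed tables reuses the layout instead of reshuffling.
    val joined = spark.table("large_bucketed").join(spark.table("small_bucketed"), "key")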

Spark SQL shell hangs

2016-11-13 Thread rakesh sharma
Hi, I'm trying to convert an XML file to a data frame using Databricks spark-xml, but the shell hangs when I do a select operation on the table. I believe it's a memory issue. How can I debug this? The cm file is 86 MB. Thanks in advance, Rakesh
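
For reference, a sketch of the spark-xml read path (the rowTag value and file name are assumptions; the package itself goes on the classpath via --packages). Limiting the rows is a cheap first check before selecting over the whole file:

    val df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "record") // hypothetical row element
      .load("cm.xml")
    df.limit(10).show()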

Nearest neighbour search

2016-11-13 Thread Meeraj Kunnumpurath
Hello, I have a dataset containing TF-IDF vectors for a corpus of documents. How do I perform a nearest neighbour search on the dataset, using cosine similarity? val df = spark.read.option("header", "false").csv("data") val tk = new Tokenizer().setInputCol("_c2").setOutputCol("words") val

Re: toDebugString is clipped

2016-11-13 Thread Sean Owen
I believe it's the shell (the Scala REPL) that's cropping the output. See http://blog.ssanj.net/posts/2016-10-16-output-in-scala-repl-is-truncated.html On Sun, Nov 13, 2016 at 1:56 AM Anirudh Perugu <anirudh.per...@stonybrook.edu> wrote: > Hello all, > > I am trying to understand how graphx work
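
One way around the REPL's result cropping is to print the string yourself instead of letting the shell render the value:

    // println writes the full lineage; only the REPL's echo of a result is cropped.
    println(rdd.toDebugString) // `rdd` stands in for whatever RDD is being inspected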

Re: spark-shell not starting ( in a Kali linux 2 OS)

2016-11-13 Thread Kelum Perera
Thanks Marco, Sean, & Oshadha. I changed the permissions on the files in the Spark directory using "chmod" and now it works. Thank you very much for the help. Kelum On Sun, Nov 13, 2016 at 5:31 PM, Marco Mistroni wrote: > Hi > not a Linux expert, but how did you install Spark? As a root user?

Re: spark-shell not starting ( in a Kali linux 2 OS)

2016-11-13 Thread Marco Mistroni
Hi, not a Linux expert, but how did you install Spark? As a root user? The error above seems to indicate you don't have permission to access that directory. If you have full control of the host, you can try a chmod 777 on the directory where you installed Spark and its subdirs. Anyway,

Re: Spark joins using row id

2016-11-13 Thread Yan Facai
A pair RDD can use (hash) partitioner information to do some optimizations when joined; I am not sure whether a Dataset can. On Sat, Nov 12, 2016 at 7:11 PM, Rohit Verma wrote: > For datasets structured as > > ds1 > rowN col1 > 1 A > 2 B > 3 C > 4 C > … > > and > > ds2 > rowN
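
A sketch of that pair-RDD optimisation (keys and values are illustrative; `sc` is the active SparkContext): partitioning both sides with the same partitioner up front lets the join run without re-shuffling either side:

    import org.apache.spark.HashPartitioner

    val p = new HashPartitioner(100)
    val left  = sc.parallelize(Seq((1, "A"), (2, "B"))).partitionBy(p).cache()
    val right = sc.parallelize(Seq((1, "X"), (2, "Y"))).partitionBy(p).cache()
    // Both sides share `p`, so the join co-locates matching keys without a new shuffle.
    val joined = left.join(right)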

Re: spark-shell not starting ( in a Kali linux 2 OS)

2016-11-13 Thread Kelum Perera
Thanks Oshadha & Sean. Now, when I enter "spark-shell", this error pops up: bash: /root/spark/bin/pyspark: Permission denied. The same error comes for "pyspark" too. Any help on this? Thanks for your help. Kelum On Sun, Nov 13, 2016 at 2:14 PM, Oshadha Gunawardena <oshadha.ro...@gmail.com> wrot

Re: Spark stalling during shuffle (maybe a memory issue)

2016-11-13 Thread bogdanbaraila
The issue was fixed for me by allocating just one core per executor. If I have executors with more than one core, the issue appears again. I haven't yet understood why this happens, but those hitting a similar issue can try this.
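
A sketch of applying that workaround (normally passed via spark-submit or spark-defaults; the value mirrors the report above):

    import org.apache.spark.SparkConf

    // Cap each executor at a single core, per the workaround described above.
    val conf = new SparkConf().set("spark.executor.cores", "1")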

Re: spark-shell not starting ( in a Kali linux 2 OS)

2016-11-13 Thread Sean Owen
You set SCALA_HOME twice and didn't set SPARK_HOME. On Sun, Nov 13, 2016, 04:50 Kelum Perera wrote: > Dear Users, > > I'm a newbie trying to get spark-shell working on Kali Linux, but I get the > error "spark-shell: command not found" > > I'm running Kali Linux 2 (64-bit) > > I followed several