SQLstream for real-time analytics

2017-06-28 Thread Mich Talebzadeh
Hi, has anyone had experience of using SQLstream (the whole Blaze package) for real-time analytics, by any chance? Thanks, Dr Mich Talebzadeh. LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

SparkSQL to read XML Blob data to create multiple rows

2017-06-28 Thread Talap, Amol
Hi: We are trying to parse XML data to get the output below from the given input sample. Can someone suggest a way to pass one DataFrame's output into the load() function, or any other alternative to get this output? Input data from Oracle table XMLBlob: SequenceID Name City XMLComment 1 Amol Kolhapur Tit
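A minimal sketch of one possible approach (not from the thread; oracleDf, the XMLComment column, and the <Comment> tag are assumed names): parse the XML column with a UDF that returns a sequence of values, then explode it into one row per value.

    import org.apache.spark.sql.functions.{explode, udf}
    import scala.xml.XML

    // Hypothetical: oracleDf is the DataFrame read from the Oracle table via JDBC.
    val extractComments = udf { (xml: String) =>
      if (xml == null) Seq.empty[String]
      else (XML.loadString(xml) \\ "Comment").map(_.text)   // one entry per <Comment> element
    }

    val exploded = oracleDf
      .withColumn("Comment", explode(extractComments(oracleDf("XMLComment"))))
      .select("SequenceID", "Name", "City", "Comment")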

Re: Spark Project build Issues.(Intellij)

2017-06-28 Thread satyajit vegesna
Hi, I was able to successfully build the project (source code) from IntelliJ. But when I try to run any of the examples present in the $SPARK_HOME/examples folder, I am getting different errors for different example jobs. For example, for the StructuredKafkaWordCount example: Exception in thread "main" ja

about broadcast join of base table in spark sql

2017-06-28 Thread paleyl
Hi All, Recently I met a problem with a broadcast join: I want to left join tables A and B, where A is the smaller one and the left table, so I wrote A = A.join(B,A("key1") === B("key2"),"left") but I found that A is not broadcast out, as the shuffle size is still very large. I guess this is a designed mech
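For reference, a sketch (not from the thread) of the explicit broadcast hint. Note that for a left outer join Spark only considers broadcasting the right-hand (non-preserved) side, which is likely why the small, preserved table A is not broadcast here:

    import org.apache.spark.sql.functions.broadcast

    // Sketch only: explicit broadcast hint. In a left outer join, only the right-hand
    // table is eligible for broadcasting, so the preserved table A cannot be the
    // broadcast side of this join.
    val joined = A.join(broadcast(B), A("key1") === B("key2"), "left")

    // If the join can be expressed as an inner join, the small table A can be broadcast:
    val innerJoined = B.join(broadcast(A), A("key1") === B("key2"))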

Re: Spark Project build Issues.(Intellij)

2017-06-28 Thread Dongjoon Hyun
Did you follow the guide in `IDE Setup` -> `IntelliJ` section of http://spark.apache.org/developer-tools.html ? Bests, Dongjoon. On Wed, Jun 28, 2017 at 5:13 PM, satyajit vegesna < satyajit.apas...@gmail.com> wrote: > Hi All, > > When i try to build source code of apache spark code from > https:

Spark Project build Issues.(Intellij)

2017-06-28 Thread satyajit vegesna
Hi All, When I try to build the source code of Apache Spark from https://github.com/apache/spark.git, I am getting the errors below: Error:(9, 14) EventBatch is already defined as object EventBatch public class EventBatch extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro

Re: Structured Streaming Questions

2017-06-28 Thread Tathagata Das
Answers inline. On Wed, Jun 28, 2017 at 10:27 AM, Revin Chalil wrote: > I am using Structured Streaming with Spark 2.1 and have some basic > questions. > > * Is there a way to automatically refresh the Hive Partitions > when using Parquet Sink with Partition? My query looks like be

Re: Building Kafka 0.10 Source for Structured Streaming Error.

2017-06-28 Thread ayan guha
--jars does not do wildcard expansion. List the jars out as a comma-separated list. On Thu, 29 Jun 2017 at 5:17 am, satyajit vegesna wrote: > Have updated the pom.xml in external/kafka-0-10-sql folder, in yellow, as > below, and have run the command > build/mvn package -DskipTests -pl external/kafka-0

Re: Building Kafka 0.10 Source for Structured Streaming Error.

2017-06-28 Thread satyajit vegesna
Have updated the pom.xml in the external/kafka-0-10-sql folder (changes marked in yellow), as below, and have run the command build/mvn package -DskipTests -pl external/kafka-0-10-sql, which generated spark-sql-kafka-0-10_2.11-2.3.0-SNAPSHOT-jar-with-dependencies.jar. <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi

Re: Building Kafka 0.10 Source for Structured Streaming Error.

2017-06-28 Thread Shixiong(Ryan) Zhu
"--package" will add transitive dependencies that are not "$SPARK_HOME/external/kafka-0-10-sql/target/*.jar". > i have tried building the jar with dependencies, but still face the same error. What's the command you used? On Wed, Jun 28, 2017 at 12:00 PM, satyajit vegesna < satyajit.apas...@gmail

Building Kafka 0.10 Source for Structured Streaming Error.

2017-06-28 Thread satyajit vegesna
Hi All, I am trying to build the kafka-0-10-sql module under the external folder in the Apache Spark source code. Once I generate the jar file using build/mvn package -DskipTests -pl external/kafka-0-10-sql I get the jar file created under external/kafka-0-10-sql/target. And when I try to run spark-shell with the jars create
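For reference, a sketch of the workflow being discussed (the build command and the jar-with-dependencies name come from this thread; other paths and versions are assumptions). Since --jars does not expand wildcards, the jars have to be listed explicitly, comma-separated:

    # Build just the Kafka source module (command from this thread)
    build/mvn package -DskipTests -pl external/kafka-0-10-sql

    # Attach the resulting jar to spark-shell; --jars takes an explicit comma-separated list
    spark-shell --jars external/kafka-0-10-sql/target/spark-sql-kafka-0-10_2.11-2.3.0-SNAPSHOT-jar-with-dependencies.jar

    # Alternatively, for a released Spark version, pull in the published connector and its
    # transitive dependencies with --packages
    spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.1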

Re: Spark job profiler results showing high TCP cpu time

2017-06-28 Thread Reth RM
I am using VisualVM: https://github.com/krasa/VisualVMLauncher @Marcelo, thank you for the reply, that was helpful. On Fri, Jun 23, 2017 at 12:48 PM, Eduardo Mello wrote: > what program do u use to profile Spark? > > On Fri, Jun 23, 2017 at 3:07 PM, Marcelo Vanzin > wrote: > >> That thread

Structured Streaming Questions

2017-06-28 Thread Revin Chalil
I am using Structured Streaming with Spark 2.1 and have some basic questions. * Is there a way to automatically refresh the Hive Partitions when using Parquet Sink with Partition? My query looks like the one below: val queryCount = windowedCount.withColumn("hive_partition_per
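For context, a hedged sketch (paths, table and column names are assumptions, not from the thread) of a partitioned Parquet sink in Spark 2.1. The Hive metastore does not learn about new partition directories automatically, so if the output backs a metastore table they typically have to be registered separately, e.g. with MSCK REPAIR TABLE or ALTER TABLE ... ADD PARTITION:

    // Sketch only: partitioned Parquet sink (names and paths are hypothetical).
    val query = windowedCount
      .writeStream
      .format("parquet")
      .option("path", "/warehouse/events")            // assumed external table location
      .option("checkpointLocation", "/checkpoints/events")
      .partitionBy("hive_partition")                  // assumed partition column from the query above
      .outputMode("append")
      .start()

    // New partition directories still need to be registered with the Hive metastore,
    // e.g. periodically from a separate job (assuming the table is defined there):
    spark.sql("MSCK REPAIR TABLE events")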

Re: IDE for python

2017-06-28 Thread Xiaomeng Wan
Thanks to all of you. I will give PyCharm a try. Regards, Shawn On 28 June 2017 at 06:07, Sotola, Radim wrote: > I know. But I pay around 20 Euro per month for all products from JetBrains > and I think this is not so much – in Czechia it is one evening in the pub. > > > > *From:* Md. Rezaul Karim [mai

Re: What is the equivalent of mapPartitions in SparkSQL?

2017-06-28 Thread jeff saremi
I have to read up on the writer. But would the writer get records back from somewhere? I want to do a bulk operation and continue with the results in the form of a DataFrame. Currently the UDF does this: 1 scalar -> 1 scalar; the UDAF does this: M records -> 1 scalar; I want this: M records -> M
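One way to get an M-records-in, M-records-out transformation while staying in the DataFrame world is mapPartitions on a typed Dataset. A sketch only; the case class, the input df, and the doubling "bulk operation" are placeholders for illustration:

    import spark.implicits._

    // Hypothetical record type matching the DataFrame's schema.
    case class Rec(key: String, value: Double)

    val ds = df.as[Rec]

    // mapPartitions hands over the whole partition as an iterator, so a bulk/batched
    // operation can run once per partition and stream its results back out.
    val transformed = ds.mapPartitions { rows =>
      val batch = rows.toSeq                                      // gather the partition (watch memory on huge partitions)
      val results = batch.map(r => r.copy(value = r.value * 2))   // placeholder for the real bulk call
      results.iterator
    }

    val resultDf = transformed.toDF()   // back to a DataFrame for further SQL operations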

using Apache Spark standalone on a server for a class/multiple users, db.lck does not get removed

2017-06-28 Thread Robert Kudyba
We have a Big Data class planned and we’d like students to be able to start spark-shell or pyspark as their own user. However the Derby database locks the process from starting as another user: -rw-r--r-- 1 myuser staff 38 Jun 28 10:40 db.lck And these errors appear: ERROR PoolWatchThread: E

Re: PySpark 2.1.1 Can't Save Model - Permission Denied

2017-06-28 Thread Yanbo Liang
It looks like your Spark job was running under user root, but your file system operation was running under user jomernik. Since Spark will call the corresponding file system (such as HDFS, S3) to commit the job (rename the temporary file to a persistent one), it should have the correct authorization for both Spark and

How to propagate Non-Empty Value in Spark Dataset

2017-06-28 Thread carloallocca
Dear All, I am trying to propagate the last valid observation (i.e., not null) to the null values in a dataset. Below I report a partial solution: Dataset tmp800=tmp700.select("uuid", "eventTime", "Washer_rinseCycles"); WindowSpec wspec= Window.partitionBy(tmp800.col("uuid")).or
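For what it's worth, a common way to finish this kind of last-observation-carried-forward fill (a sketch in Scala rather than the Java API used in the snippet; column names follow the snippet) is last() with ignoreNulls over a window bounded at the current row:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.last

    // Sketch: carry the last non-null Washer_rinseCycles forward within each uuid,
    // ordered by eventTime.
    val wspec = Window
      .partitionBy("uuid")
      .orderBy("eventTime")
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)

    val filled = tmp800.withColumn(
      "Washer_rinseCycles_filled",
      last("Washer_rinseCycles", ignoreNulls = true).over(wspec)
    )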

How to Fill Sparse Data With the Previous Non-Empty Value in a Spark Dataset

2017-06-28 Thread Carlo Allocca
Dear All, I am trying to propagate the last valid observation (i.e., not null) to the null values in a dataset. Below I report a partial solution: Dataset tmp800=tmp700.select("uuid", "eventTime", "Washer_rinseCycles"); WindowSpec wspec= Window.partitionBy(tmp800.col("uuid")).o

Re: [PySpark]: How to store NumPy array into single DataFrame cell efficiently

2017-06-28 Thread Judit Planas
Dear Nick, Thanks for your quick reply. I quickly implemented your proposal, but I do not see any improvement. In fact, the test data set of around 3 GB occupies a total of 10 GB in worker memory, and the execution time of queries is about 4 times slower than

RE: IDE for python

2017-06-28 Thread Sotola, Radim
I know. But I pay around 20 Euro per month for all products from JetBrains and I think this is not so much – in Czechia it is one evening in the pub. From: Md. Rezaul Karim [mailto:rezaul.ka...@insight-centre.org] Sent: Wednesday, June 28, 2017 12:55 PM To: Sotola, Radim Cc: spark users ; ayan guha ; A

Re: [ML] Stop conditions for RandomForest

2017-06-28 Thread OBones
To me, they are. Y is used to control whether a split is a valid candidate when deciding which one to follow. X is used to make a node a leaf if it has too few elements to even consider candidate splits. 颜发才(Yan Facai) wrote: It seems that splitting will always stop when the count of nodes is less than ma
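If X and Y here are the usual tree-stopping parameters (an assumption; their definitions are in an earlier message not shown), in spark.ml they correspond to independent knobs such as minInstancesPerNode and minInfoGain, set alongside maxDepth. A sketch with example values:

    import org.apache.spark.ml.regression.RandomForestRegressor

    // Sketch: the separate stopping conditions exposed by spark.ml trees (values are examples).
    val rf = new RandomForestRegressor()
      .setMaxDepth(10)              // hard depth limit
      .setMinInstancesPerNode(5)    // a split is invalid if a child would receive fewer rows than this
      .setMinInfoGain(0.001)        // a split is invalid if it improves impurity by less than this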

(Spark-ml) java.util.NoSuchElementException: key not found exception on doing prediction and computing test error.

2017-06-28 Thread neha nihal
Thanks. It's working now. My test data had some labels which were not there in the training set. On Wednesday, June 28, 2017, Pralabh Kumar wrote: > Hi Neha > > This generally occurs when your training data set has some value of a > categorical variable which is not there in your testing data. Fo

RE: IDE for python

2017-06-28 Thread Md. Rezaul Karim
By the way, PyCharm from JetBrains also has a community edition which is free and open source. Moreover, if you are a student, you can use the professional edition for students as well. For more, see https://www.jetbrains.com/student/ On Jun 28, 2017 11:18 AM, "Sotola, Radim" wrote: > Py

Re: [PySpark]: How to store NumPy array into single DataFrame cell efficiently

2017-06-28 Thread Nick Pentreath
You will need to use PySpark vectors to store in a DataFrame. They can be created from NumPy arrays as follows: from pyspark.ml.linalg import Vectors; df = spark.createDataFrame([("src1", "pkey1", 1, Vectors.dense(np.array([0, 1, 2])))]) On Wed, 28 Jun 2017 at 12:23 Judit Planas wrote: > Dear a

[PySpark]: How to store NumPy array into single DataFrame cell efficiently

2017-06-28 Thread Judit Planas
Dear all, I am trying to store a NumPy array (loaded from an HDF5 dataset) into one cell of a DataFrame, but I am having problems. In short, my data layout is similar to a database, where I have a few columns with metadata (source of information, primary key, et

RE: IDE for python

2017-06-28 Thread Sotola, Radim
PyCharm is a good choice. I buy a monthly subscription and can see that PyCharm development continues (I mean that this is not a tool which somebody develops and then leaves without any upgrades). From: Abhinay Mehta [mailto:abhinay.me...@gmail.com] Sent: Wednesday, June 28, 2017 11:06 AM To: ayan guh

Re: HDP 2.5 - Python - Spark-On-Hbase

2017-06-28 Thread ayan guha
Hi, thanks to all of you, I could get the HBase connector working. There are still some details around namespaces pending, but overall it is working well. Now, as usual, I would like to apply the same concept to Structured Streaming. Is there any similar way I can use writeStream.format and use HBa
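In Spark 2.x, if the HBase connector does not ship a streaming sink usable via writeStream.format, the generic escape hatch in Structured Streaming is a ForeachWriter. A sketch only; streamingDf is assumed and the HBase connection/put logic is deliberately left as placeholders:

    import org.apache.spark.sql.{ForeachWriter, Row}

    // Sketch: per-partition writer; the HBase specifics are omitted on purpose.
    val hbaseWriter = new ForeachWriter[Row] {
      override def open(partitionId: Long, version: Long): Boolean = {
        // open an HBase connection/table here
        true
      }
      override def process(row: Row): Unit = {
        // convert the row to a Put and write it
      }
      override def close(errorOrNull: Throwable): Unit = {
        // close the connection
      }
    }

    val query = streamingDf.writeStream
      .foreach(hbaseWriter)
      .start()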

Re: [ML] Stop conditions for RandomForest

2017-06-28 Thread Yan Facai
It seems that splitting will always stop when the count of nodes is less than max(X, Y). Hence, are they different? On Tue, Jun 27, 2017 at 11:07 PM, OBones wrote: > Hello, > > Reading around on the theory behind tree-based regression, I concluded > that there are various reasons to stop exploring the

Re: IDE for python

2017-06-28 Thread Abhinay Mehta
I use PyCharm and it works a treat. The big advantage I find is that I can use the same command shortcuts that I do when developing with IntelliJ IDEA in Scala or Java. On 27 June 2017 at 23:29, ayan guha wrote: > Depends on the need. For data exploration, i use notebooks whenever I can

Re: How do I find the time taken by each step in a stage in a Spark Job

2017-06-28 Thread ??????????
You can find the information in the Spark UI. ---Original--- From: "SRK" Date: 2017/6/28 02:36:37 To: "user"; Subject: How do I find the time taken by each step in a stage in a Spark Job Hi, How do I find the time taken by each step in a stage in a Spark job? Also, how do I find the bottlene

Re: (Spark-ml) java.util.NoSuchElementException: key not found exception on doing prediction and computing test error.

2017-06-28 Thread Pralabh Kumar
Hi Neha, this generally occurs when your training data set has some value of a categorical variable which is not there in your testing data. For example, you have a column DAYS with values M, T, W in the training data. But when your test data contains F, then it throws a "key not found" exception. Please look int
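A hedged side note: if the categorical column goes through a StringIndexer in the pipeline, its handleInvalid parameter controls what happens to labels unseen during fitting ("skip" drops such rows instead of failing; later Spark versions also accept "keep"). A sketch with assumed column and DataFrame names:

    import org.apache.spark.ml.feature.StringIndexer

    // Sketch: drop rows whose DAYS value was never seen during fitting instead of failing.
    val indexer = new StringIndexer()
      .setInputCol("DAYS")
      .setOutputCol("DAYS_indexed")
      .setHandleInvalid("skip")     // "error" is the default; newer versions also support "keep"

    val indexerModel = indexer.fit(trainingDf)
    val indexedTest = indexerModel.transform(testDf)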