Spark JOIN Not working

2016-05-23 Thread Aakash Basu
Hi experts, I'm extremely new to the Spark Ecosystem, hence need help from you guys. While trying to fetch data from CSV files and join/query them as needed, when I'm caching the data by registering temp tables and then using a select query to fetch what I need as per the gi
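The question above is cut off; for context, a minimal PySpark sketch of the pattern being described (file paths and column names are hypothetical) looks like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-join").getOrCreate()

orders = spark.read.option("header", "true").csv("/data/orders.csv")
customers = spark.read.option("header", "true").csv("/data/customers.csv")

# In Spark 1.x this was registerTempTable(); in 2.x it is createOrReplaceTempView()
orders.createOrReplaceTempView("orders")
customers.createOrReplaceTempView("customers")

joined = spark.sql("""
    SELECT o.order_id, c.customer_name, o.amount
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
""")
joined.show()
```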

Pros and Cons

2016-05-25 Thread Aakash Basu
Hi, I’m new to the Spark Ecosystem, need to understand the *Pros and Cons *of fetching data using *SparkSQL vs Hive in Spark vs Spark API.* *PLEASE HELP!* Thanks, Aakash Basu.

Best Practices for Spark Join

2016-06-01 Thread Aakash Basu
Hi, Can you please write in order of importance, one by one, the Best Practices (necessary/better to follow) for doing a Spark Join. Thanks, Aakash.

Python to Scala

2016-06-17 Thread Aakash Basu
Hi all, I've a python code, which I want to convert to Scala for using it in a Spark program. I'm not so well acquainted with python and learning scala now. Any Python+Scala expert here? Can someone help me out in this please? Thanks & Regards, Aakash.

Re: Python to Scala

2016-06-17 Thread Aakash Basu
8-Jun-2016 10:07 AM, "Yash Sharma" wrote: > You could use pyspark to run the python code on spark directly. That will > cut the effort of learning scala. > > https://spark.apache.org/docs/0.9.0/python-programming-guide.html > > - Thanks, via mobile, excuse brevity. > O

Re: Python to Scala

2016-06-17 Thread Aakash Basu
k - or find someone else who feels > comfortable to do it. That kind of inquiry would likely be appropriate on a > job board. > > > > 2016-06-17 21:47 GMT-07:00 Aakash Basu : > >> Hey, >> >> Our complete project is in Spark on Scala, I code in Scala for Spark, >> th

Re: spark job automatically killed without rhyme or reason

2016-06-23 Thread Aakash Basu
Hey, I've come across this. There's a command called "yarn application -kill ", which kills the application with a one-liner 'Killed'. If it is a memory issue, the error shows up in the form of 'GC Overhead' or forming up tree or something of the sort. So, I think someone killed your job by that comma

Little idea needed

2016-07-19 Thread Aakash Basu
Hi all, I'm trying to pull a full table from oracle, which is huge with some 10 million records which will be the initial load to HDFS. Then I will do delta loads every day in the same folder in HDFS. Now, my query here is, DAY 0 - I did the initial load (full dump). DAY 1 - I'll load only that

Re: Little idea needed

2016-07-20 Thread Aakash Basu
l the deltas. On 19 Jul 2016, at 21:27, Aakash Basu wrote: Hi all, I'm trying to pull a full table from oracle, which is huge with some 10 million records which will be the initial load to HDFS. Then I will do delta loads everyday in the same folder in HDFS. Now, my query here is, DAY 0

Re: Little idea needed

2016-07-20 Thread Aakash Basu
; > > http://talebzadehmich.wordpress.com > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. &

Unsubscribe

2016-08-09 Thread Aakash Basu

Fwd: Need some help

2016-09-01 Thread Aakash Basu
-- Forwarded message -- From: Aakash Basu Date: Thu, Aug 25, 2016 at 10:06 PM Subject: Need some help To: user@spark.apache.org Hi all, Aakash here, need a little help in KMeans clustering. This is needed to be done: "Implement Kmeans Clustering Algorithm without usin

Re: Fwd: Need some help

2016-09-01 Thread Aakash Basu
u want to do it with a UI based software, > try Weka or Orange. > > Regards, > > Sivakumaran S > > On 1 Sep 2016 8:42 p.m., Aakash Basu wrote: > > > -- Forwarded message -- > From: *Aakash Basu* > Date: Thu, Aug 25, 2016 at 10:06 PM > Subjec

Re: Fwd: Need some help

2016-09-02 Thread Aakash Basu
stering algorithm is and then looking into how you can use the DataFrame > API to implement the KMeansClustering. > > Thanks, > Shashank > > On Thu, Sep 1, 2016 at 1:05 PM, Aakash Basu > wrote: > >> Hey Siva, >> >> It needs to be done with Spark, without the
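The thread never shows the final code. A hedged from-scratch sketch of Lloyd's KMeans on plain RDDs (no MLlib), with toy data and squared Euclidean distances, under the assumptions discussed above:

```python
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kmeans-from-scratch").getOrCreate()
sc = spark.sparkContext

def closest(point, centroids):
    # index of the centroid nearest to this point (squared Euclidean distance)
    return int(np.argmin([np.sum((point - c) ** 2) for c in centroids]))

# toy 2-D data; in practice this would come from a file
points = sc.parallelize([np.array(p) for p in
                         [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0),
                          (5.0, 7.0), (3.5, 5.0), (4.5, 5.0)]]).cache()

k, iterations = 2, 10
centroids = points.takeSample(False, k, seed=42)   # initial centroids

for _ in range(iterations):
    # assign each point to its nearest centroid, then average per cluster
    assigned = points.map(lambda p: (closest(p, centroids), (p, 1)))
    sums = assigned.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    centroids = [s / n for _, (s, n) in sorted(sums.collect())]

print(centroids)
```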

Join Query

2016-11-17 Thread Aakash Basu
Hi, Conceptually I can understand below spark joins, when it comes to implementation I don’t find much information in Google. Please help me with code/pseudo code for below joins using java-spark or scala-spark. *Replication Join:* Given two datasets, where one is small enou
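The "replication join" described (a small dataset shipped to every executor) maps to Spark's broadcast join. A hedged PySpark sketch with hypothetical data:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join").getOrCreate()

# hypothetical datasets: 'small' is assumed to fit in each executor's memory
large = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "payload"])
small = spark.createDataFrame([(1, "US"), (2, "UK")], ["id", "country"])

# broadcast() hints Spark to replicate 'small' to every executor,
# avoiding a shuffle of the large side
joined = large.join(broadcast(small), on="id", how="inner")
joined.show()
```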

HDPCD SPARK Certification Queries

2016-11-17 Thread Aakash Basu
Hi all, I want to know more about this examination - http://hortonworks.com/training/certification/exam-objectives/#hdpcdspark If anyone's there who appeared for the examination, can you kindly help? 1) What are the kind of questions that come, 2) Samples, 3) All the other details. Thanks,

Hortonworks Spark Certification Query

2016-12-14 Thread Aakash Basu
Hi all, Is there anyone who wrote the HDPCD examination as in the below link? http://hortonworks.com/training/certification/exam-objectives/#hdpcdspark I'm going to sit for this with a very little time to prepare, can I please be helped with the questions to expect and their probable solutions?

How to convert RDD to DF for this case -

2017-02-17 Thread Aakash Basu
Hi all, Without using case class I tried making a DF to work on the join and other filtration later. But I'm getting an ArrayIndexOutOfBoundsException error while doing a show of the DF. 1) Importing SQLContext= import org.apache.spark.sql.SQLContext._ import org.apache.spark.sql.SQLConte

Re: How to convert RDD to DF for this case -

2017-02-17 Thread Aakash Basu
> +------+------+----+----+
> |  col1|  col2|col3|col4|
> +------+------+----+----+
> | uihgf| Paris|  56|   5|
> |asfsds|   ***|  43|   1|
> |fkwsdf|London|  45|   6|
> |  gddg|  ABCD|  32|   2|
> | grgzg|  *CSD|  35|   3|
> | gsrsn|  ADR*|  22|   4|
> +------+------+----+----+
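The original code is Scala, but the usual no-case-class route is an explicit schema. A hedged PySpark sketch of the same idea (the ArrayIndexOutOfBoundsException typically means a split line has fewer fields than expected, hence the guard):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("rdd-to-df").getOrCreate()
sc = spark.sparkContext

schema = StructType([
    StructField("col1", StringType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", IntegerType(), True),
    StructField("col4", IntegerType(), True),
])

lines = sc.textFile("/data/input.csv")           # hypothetical path
rows = (lines.map(lambda l: l.split(","))
             .filter(lambda f: len(f) >= 4)      # guard against short rows
             .map(lambda f: (f[0], f[1], int(f[2]), int(f[3]))))

df = spark.createDataFrame(rows, schema)
df.show()
```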

Re: Get S3 Parquet File

2017-02-23 Thread Aakash Basu
Hey, Please recheck your access key and secret key being used to fetch the parquet file. It seems to be a credential error. Either mismatch/load. If load, then first use it directly in code and see if the issue resolves, then it can be hidden and read from Input Params. Thanks, Aakash. On 23-Fe
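A hedged sketch of one common way to supply the credentials being talked about (key values are placeholders; the s3a prefix depends on the Hadoop build in use):

```python
from pyspark.sql import SparkSession

# Credentials passed through as Hadoop configuration
spark = (SparkSession.builder
         .appName("s3-parquet")
         .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
         .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
         .getOrCreate())

df = spark.read.parquet("s3a://your-bucket/path/to/file.parquet")
df.printSchema()
```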

Re: community feedback on RedShift with Spark

2017-04-24 Thread Aakash Basu
Hey afshin, Your point 1 is innumerably faster than the latter. It further shoots up the speed if you know how to properly use distKey and sortKey on the tables being loaded. Thanks, Aakash. https://www.linkedin.com/in/aakash-basu-5278b363 On 24-Apr-2017 10:37 PM, "Afshin, Bardia"

Any solution for this?

2017-05-15 Thread Aakash Basu
Hi all, Any solution for this issue - http://stackoverflow.com/q/43921392/7998705 Thanks, Aakash.

Re: Documentation on "Automatic file coalescing for native data sources"?

2017-05-19 Thread Aakash Basu
Hey all, A reply on this would be great! Thanks, A.B. On 17-May-2017 1:43 AM, "Daniel Siegmann" wrote: > When using spark.read on a large number of small files, these are > automatically coalesced into fewer partitions. The only documentation I can > find on this is in the Spark 2.0.0 release

Re: Spark-SQL collect function

2017-05-19 Thread Aakash Basu
Well described, thanks! On 04-May-2017 4:07 AM, "JayeshLalwani" wrote: > In any distributed application, you scale up by splitting execution up on > multiple machines. The way Spark does this is by slicing the data into > partitions and spreading them on multiple machines. Logically, an RDD is

Re: Use SQL Script to Write Spark SQL Jobs

2017-06-12 Thread Aakash Basu
Hey, I work on Spark SQL and would pretty much be able to help you in this. Let me know your requirement. Thanks, Aakash. On 12-Jun-2017 11:00 AM, "bo yang" wrote: > Hi Guys, > > I am writing a small open source project > to use SQL Script to write > S

Repartition vs PartitionBy Help/Understanding needed

2017-06-15 Thread Aakash Basu
Hi all, Everybody is giving a difference between coalesce and repartition, but nowhere have I found a difference between partitionBy and repartition. My question is, is it better to write a data set in parquet partitioning by a column and then reading the respective directories to work on that column i
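For reference, a hedged sketch contrasting the two being asked about: repartition only reshuffles the in-memory DataFrame for the current job, while write.partitionBy controls the on-disk directory layout so later reads can prune partitions (paths and column name are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-vs-partitionby").getOrCreate()
df = spark.read.parquet("/data/events")

# repartition: reshuffles the DataFrame by 'country' for this job only
by_country = df.repartition("country")

# partitionBy: lays files out as /out/events/country=XX/... on disk,
# so a later read filtered on 'country' touches only matching directories
by_country.write.partitionBy("country").parquet("/out/events")

# subsequent reads benefit from partition pruning
us_only = spark.read.parquet("/out/events").where("country = 'US'")
```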

Fwd: Repartition vs PartitionBy Help/Understanding needed

2017-06-16 Thread Aakash Basu
Hi all, Can somebody put some light on this pls? Thanks, Aakash. -- Forwarded message -- From: "Aakash Basu" Date: 15-Jun-2017 2:57 PM Subject: Repartition vs PartitionBy Help/Understanding needed To: "user" Cc: Hi all, > > Everybody is giving a diffe

Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type

2017-08-16 Thread Aakash Basu
Hi all, I am working on PySpark (*Python 3.6 and Spark 2.1.1*) and trying to fetch data from an excel file using *spark.read.format("com.crealytics.spark.excel")*, but it is inferring double for a date type column. The detailed description is given here (the question I posted) - https://stackove

Re: Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type

2017-08-16 Thread Aakash Basu
Hey all, Forgot to attach the link to the overriding Schema through external package's discussion. https://github.com/crealytics/spark-excel/pull/13 You can see my comment there too. Thanks, Aakash. On Wed, Aug 16, 2017 at 11:11 PM, Aakash Basu wrote: > Hi all, > > I am work

Re: Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type

2017-08-16 Thread Aakash Basu
the actual value in the cell and what Excel formats that cell. You probably want to import that field as a string or not have it as a date format in Excel. Just a thought Thank You, Irving Duran On Wed, Aug 16, 2017 at 12:47 PM, Aakash Basu wrote: > Hey all, > > Forgot to atta

Re: Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type

2017-08-17 Thread Aakash Basu
01 AM, Jörn Franke wrote: > You can use Apache POI DateUtil to convert double to Date ( > https://poi.apache.org/apidocs/org/apache/poi/ss/usermodel/DateUtil.html). > Alternatively you can try HadoopOffice (https://github.com/ZuInnoTe/ > hadoopoffice/wiki), it supports Spark 1.x or Spark 2

Re: Reading Excel (.xlsm) file through PySpark 2.1.1 with external JAR is causing fatal conversion of data type

2017-08-17 Thread Aakash Basu
ithColumn('shipdate',f.to_date(lineitem_ > df.shipdate)) > > —— > > You should have first ingested the column as a string; and then leveraged > the DF api to make the conversion to dateType. > > That should work. > > Kind Regards >
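Following the advice quoted above, a hedged sketch: keep schema inference off so the column arrives as a string, then build the date with the DataFrame API (option names and behaviour vary between spark-excel versions; the path is hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.appName("excel-dates").getOrCreate()

# With schema inference off the column comes in as a plain string; an explicit
# .schema(...) declaring shipdate as StringType would achieve the same.
lineitem_df = (spark.read
               .format("com.crealytics.spark.excel")
               .option("useHeader", "true")
               .option("inferSchema", "false")
               .load("/data/lineitem.xlsm"))

# then build the real date with the DataFrame API, as suggested above
lineitem_df = lineitem_df.withColumn("shipdate", f.to_date(lineitem_df.shipdate))
```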

Problem with CSV line break data in PySpark 2.1.0

2017-09-03 Thread Aakash Basu
Hi, I've a dataset where a few rows of the column F as shown below have line breaks in CSV file. [image: Inline image 1] When Spark is reading it, it is coming as below, which is a complete new line. [image: Inline image 2] I want my PySpark 2.1.0 to read it by forcefully avoiding the line bre
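Spark 2.1's CSV reader has no multiline support, but from Spark 2.2 onwards the multiLine option handles quoted fields containing line breaks. A hedged sketch under that assumption (path is hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("multiline-csv").getOrCreate()

# Requires Spark 2.2+; the embedded newlines must sit inside quoted fields
df = (spark.read
      .option("header", "true")
      .option("multiLine", "true")
      .option("quote", '"')
      .option("escape", '"')
      .csv("/data/with_linebreaks.csv"))
df.show(truncate=False)
```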

Efficient Spark-Submit planning

2017-09-11 Thread Aakash Basu
Hi, Can someone please clarify a little on how we should effectively calculate the parameters to be passed over using spark-submit. Parameters as in - Cores, NumExecutors, DriverMemory, etc. Is there any generic calculation which can be done over most kind of clusters with different sizes from

Help needed in Dividing open close dates column into multiple columns in dataframe

2017-09-19 Thread Aakash Basu
Hi, I've a csv dataset which has a column with all the details of store open and close timings as per dates, but the data is highly variant, as follows - Mon-Fri 10am-9pm, Sat 10am-8pm, Sun 12pm-6pm Mon-Sat 10am-8pm, Sun Closed Mon-Sat 10am-8pm, Sun 10am-6pm Mon-Friday 9-8 / Saturday 10-7 / Sund

Fwd: Help needed in Dividing open close dates column into multiple columns in dataframe

2017-09-20 Thread Aakash Basu
Hi folks, Any solution to it? PFB. Thanks, Aakash. -- Forwarded message -- From: Aakash Basu Date: Tue, Sep 19, 2017 at 2:32 PM Subject: Help needed in Dividing open close dates column into multiple columns in dataframe To: user Hi, I've a csv dataset which has a column

PySpark 2.1 Not instantiating properly

2017-10-20 Thread Aakash Basu
Hi all, I have Spark 2.1 installed in my laptop where I used to run all my programs. PySpark wasn't used for around 1 month, and after starting it now, I'm getting this exception (I've tried the solutions I could find on Google, but to no avail). Specs: Spark 2.1.1, Python 3.6, HADOOP 2.7, Window

Fwd: PySpark 2.1 Not instantiating properly

2017-10-20 Thread Aakash Basu
Hi, Any help please? What can be the issue? Thanks, Aakash. -- Forwarded message -- From: Aakash Basu Date: Fri, Oct 20, 2017 at 1:00 PM Subject: PySpark 2.1 Not instantiating properly To: user Hi all, I have Spark 2.1 installed in my laptop where I used to run all my

Re: PySpark 2.1 Not instantiating properly

2017-10-20 Thread Aakash Basu
ither you have to give write perm to >> your /tmp directory or there's a spark config you need to override >> This error is not 2.1 specific...let me get home and check my configs >> I think I amended my /tmp permissions via xterm instead of control panel >> >> Ht

Split column with dynamic data

2017-10-30 Thread Aakash Basu
Hi all, I've a requirement to split a column and fetch only the description, where some rows have numbers prepended before the description whereas other rows have only the description - Eg - (Description is the column header) *Description* Inventory Tree Products 1. AT&T Services 2. Accessories 4. Misc

RE: Split column with dynamic data

2017-10-30 Thread Aakash Basu
stop, and catering for the > possibility of a capital letter). > > > > This is untested, but it should do the trick based on your examples so far: > > > > df.withColumn(“new_column”, regexp_replace($”Description”, “^\d+[A-Z]?\.”, > “”)) > > > > > > *From:
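The same suggestion as a hedged PySpark sketch, stripping a leading "digits + optional capital letter + dot" prefix from the Description column (sample rows taken from the original post):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.appName("strip-prefix").getOrCreate()

df = spark.createDataFrame(
    [("Inventory Tree Products",), ("1. AT&T Services",), ("4. Misc",)],
    ["Description"])

# remove a leading number (optionally followed by a capital letter) and a dot
cleaned = df.withColumn(
    "Description_clean",
    f.regexp_replace("Description", r"^\d+[A-Z]?\.\s*", ""))
cleaned.show(truncate=False)
```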

Regarding column partitioning IDs and names as per hierarchical level SparkSQL

2017-10-31 Thread Aakash Basu
Hi all, I have to generate a table with Spark-SQL with the following columns - Level One Id: VARCHAR(20) NULL Level One Name: VARCHAR(50) NOT NULL Level Two Id: VARCHAR(20) NULL Level Two Name: VARCHAR(50) NULL Level Three Id: VARCHAR(20) NULL Level Three Name: VARCHAR(50) NULL Level Four Id

Fwd: Regarding column partitioning IDs and names as per hierarchical level SparkSQL

2017-10-31 Thread Aakash Basu
Hey all, Any help in the below please? Thanks, Aakash. -- Forwarded message -- From: Aakash Basu Date: Tue, Oct 31, 2017 at 9:17 PM Subject: Regarding column partitioning IDs and names as per hierarchical level SparkSQL To: user Hi all, I have to generate a table with

Dynamic data ingestion into SparkSQL - Interesting question

2017-11-20 Thread Aakash Basu
Hi all, I have a table which will have 4 columns - | Expression|filter_condition| from_clause| group_by_columns| This file may have variable number of rows depending on the no. of KPIs I need to calculate. I need to write a SparkSQL program which will have to read this fil
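A hedged sketch of one way to drive queries from such a table: collect the small KPI-definition file to the driver and build one SQL string per row (column names follow the table described above; the referenced tables are assumed to be registered as temp views):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dynamic-kpis").getOrCreate()

kpis = spark.read.option("header", "true").csv("/config/kpis.csv").collect()

results = {}
for row in kpis:
    sql = "SELECT {expr} FROM {frm}".format(expr=row["Expression"],
                                            frm=row["from_clause"])
    if row["filter_condition"]:
        sql += " WHERE {}".format(row["filter_condition"])
    if row["group_by_columns"]:
        sql += " GROUP BY {}".format(row["group_by_columns"])
    # each generated statement runs against already-registered temp views
    results[row["Expression"]] = spark.sql(sql)
```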

Re: Dynamic data ingestion into SparkSQL - Interesting question

2017-11-20 Thread Aakash Basu
Hi all, Any help? PFB. Thanks, Aakash. On 20-Nov-2017 6:58 PM, "Aakash Basu" wrote: > Hi all, > > I have a table which will have 4 columns - > > | Expression|filter_condition| from_clause| > group_by_columns| > > > This file may have

Re: Dynamic data ingestion into SparkSQL - Interesting question

2017-11-21 Thread Aakash Basu
programming-guide.html#hive-tables > > Cheers > > On 21 November 2017 at 03:27, Aakash Basu > wrote: > >> Hi all, >> >> Any help? PFB. >> >> Thanks, >> Aakash. >> >> On 20-Nov-2017 6:58 PM, "Aakash Basu"

Passing an array of more than 22 elements in a UDF

2017-12-22 Thread Aakash Basu
Hi, I am using Spark 2.2 with Java, can anyone please suggest how to take more than 22 parameters in a UDF? I mean, if I want to pass all the parameters as an array of integers? Thanks, Aakash.

Re: Passing an array of more than 22 elements in a UDF

2017-12-25 Thread Aakash Basu
r udf. > > On Fri, 22 Dec 2017 at 2:55 pm, Aakash Basu > wrote: > >> Hi, >> >> I am using Spark 2.2 using Java, can anyone please suggest me how to take >> more than 22 parameters in an UDF? I mean, if I want to pass all the >> parameters as an array of integers? >> >> Thanks, >> Aakash. >> > -- > Best Regards, > Ayan Guha >
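The workaround discussed (pass everything as a single array column) as a hedged sketch; shown in PySpark for brevity, the Java version follows the same shape:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, array, col
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("array-udf").getOrCreate()

# 25 integer columns c0..c24 -- more than the 22-argument limit
data = [tuple(range(25))]
cols = ["c{}".format(i) for i in range(25)]
df = spark.createDataFrame(data, cols)

# a single array-typed argument sidesteps the 22-parameter restriction
sum_all = udf(lambda xs: int(sum(xs)), IntegerType())
df = df.withColumn("total", sum_all(array(*[col(c) for c in cols])))
df.select("total").show()
```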

Spark MLLib vs. SciKitLearn

2018-01-19 Thread Aakash Basu
Hi all, I am totally new to ML APIs. Trying to get the *ROC_Curve* for Model Evaluation on both *ScikitLearn* and *PySpark MLLib*. I do not find any API for ROC_Curve calculation for BinaryClassification in SparkMLLib. The codes below have a wrapper function which is creating the respective dataf
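For reference, a hedged sketch of what the ML API does expose: AUC via BinaryClassificationEvaluator, and (for logistic regression) the training ROC points via the model summary; the evaluator itself does not return the curve, which is the gap the thread is asking about:

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("roc-auc").getOrCreate()

# toy binary-classification data
train = spark.createDataFrame(
    [(0.0, Vectors.dense(0.0, 1.1)), (1.0, Vectors.dense(2.0, 1.0)),
     (0.0, Vectors.dense(0.2, 1.3)), (1.0, Vectors.dense(1.8, 0.7))],
    ["label", "features"])

model = LogisticRegression(maxIter=10).fit(train)
predictions = model.transform(train)

# area under the ROC curve
evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")
print(evaluator.evaluate(predictions))

# the logistic-regression training summary exposes the ROC as a DataFrame (FPR, TPR)
model.summary.roc.show()
```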

Re: Spark MLLib vs. SciKitLearn

2018-01-20 Thread Aakash Basu
Any help on the below? On 19-Jan-2018 7:12 PM, "Aakash Basu" wrote: > Hi all, > > I am totally new to ML APIs. Trying to get the *ROC_Curve* for Model > Evaluation on both *ScikitLearn* and *PySpark MLLib*. I do not find any > API for ROC_Curve calculation for BinaryCla

Is there any Spark ML or MLLib API for GINI for Model Evaluation? Please help! [EOM]

2018-01-21 Thread Aakash Basu

[Help] Converting a Python Numpy code into Spark using RDD

2018-01-21 Thread Aakash Basu
Hi, How can I convert this Python Numpy code into Spark RDD so that the operations leverage the Spark distributed architecture for Big Data. Code is as follows - def gini(array): """Calculate the Gini coefficient of a numpy array.""" array = array.flatten() #all values are treated equall
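A hedged translation of that numpy function into RDD operations, assuming non-negative values treated equally (as in the original docstring) and the usual rank-based formula G = 2·Σ i·x(i) / (n·Σx) − (n+1)/n:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gini-rdd").getOrCreate()
sc = spark.sparkContext

def gini_rdd(values):
    """Gini coefficient of an RDD of non-negative numbers."""
    n = values.count()
    total = values.sum()
    # sort ascending and attach 1-based ranks (zipWithIndex is 0-based)
    ranked = values.sortBy(lambda x: x).zipWithIndex()
    weighted = ranked.map(lambda xi: xi[0] * (xi[1] + 1)).sum()
    return 2.0 * weighted / (n * total) - (n + 1.0) / n

print(gini_rdd(sc.parallelize([1.0, 1.0, 2.0, 5.0, 20.0])))
```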

[Doubt] GridSearch for Hyperparameter Tuning in Spark

2018-01-30 Thread Aakash Basu
Hi, Is there any available pyspark ML or MLLib API for Grid Search similar to GridSearchCV from model_selection of sklearn? I found this - https://spark.apache.org/docs/2.2.0/ml-tuning.html, but it has cross-validation and train-validation for hp-tuning and not pure grid search. Any help? Thank
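The ParamGridBuilder + CrossValidator combination from the linked ml-tuning page is the closest analogue to GridSearchCV: the grid is tried exhaustively, with k-fold CV supplying the scoring. A hedged sketch over a logistic regression:

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

lr = LogisticRegression(featuresCol="features", labelCol="label")

grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1, 1.0])
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])
        .build())                      # 9 combinations, tried exhaustively

cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3)

# cv_model = cv.fit(train_df)          # train_df: your labelled DataFrame
# best = cv_model.bestModel            # analogous to GridSearchCV.best_estimator_
```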

EDI (Electronic Data Interchange) parser on Spark

2018-03-13 Thread Aakash Basu
Hi, Did anyone built parallel and large scale X12 EDI parser to XML or JSON using Spark? Thanks, Aakash.

Re: EDI (Electronic Data Interchange) parser on Spark

2018-03-13 Thread Aakash Basu
lot of tricks to make > it run in parallel. This also depends on type of edit message etc. > sophisticated unit testing and performance testing is key. > > Nevertheless it is also not as difficult as I made it sound now. > > > On 13. Mar 2018, at 10:36, Aakash Basu > wrote: >

Re: EDI (Electronic Data Interchange) parser on Spark

2018-03-13 Thread Aakash Basu
ntent-type} >{$srctitle} >{$document-type} >{$document-subtype} >{$publication-date} >{$article-title} >{$issn} >{$isbn} >{$lang} > {$tables} > > > return xml-to-json($retval) > > > Darin. >

How to start practicing Python Spark Streaming in Linux?

2018-03-14 Thread Aakash Basu
Hi all, Any guide on how to kick-start learning PySpark Streaming on an Ubuntu standalone system? Step wise, practical hands-on, would be great. Also, connecting Kafka with Spark and getting real time data and processing it in micro-batches... Any help? Thanks, Aakash.

Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Aakash Basu
Hi, *Info (Using):Spark Streaming Kafka 0.8 package* *Spark 2.2.1* *Kafka 1.0.1* As of now, I am feeding paragraphs into the Kafka console producer and my Spark job, which is acting as a receiver, is printing the flattened words, which is a complete RDD operation. *My motive is to read two tables contin

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Aakash Basu
you try spark 2.3 with structured streaming? There watermarking and > plain sql might be really interesting for you. > Aakash Basu schrieb am Mi. 14. März 2018 um > 14:57: > >> Hi, >> >> >> >> *Info (Using):Spark Streaming Kafka 0.8 package* >&g

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Aakash Basu
Hey Dylan, Great! Can you revert back to my initial and also the latest mail? Thanks, Aakash. On 15-Mar-2018 12:27 AM, "Dylan Guedes" wrote: > Hi, > > I've been using the Kafka with pyspark since 2.1. > > On Wed, Mar 14, 2018 at 3:49 PM, Aakash Basu >

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Aakash Basu
layed > data and appropriately join stuff with SQL join semantics. Please check it > out :) > > TD > > > > On Wed, Mar 14, 2018 at 12:07 PM, Dylan Guedes > wrote: > >> I misread it, and thought that you question was if pyspark supports kafka >> lol. Sor

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-15 Thread Aakash Basu
scribe[test1]]", "startOffset" : { "test1" : { "0" : 5226 } }, "endOffset" : { "test1" : { "0" : 5232 } }, "numInputRows" : 6, "inputRowsPerSeco

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-15 Thread Aakash Basu
Any help on the above? On Thu, Mar 15, 2018 at 3:53 PM, Aakash Basu wrote: > Hi, > > I progressed a bit in the above mentioned topic - > > 1) I am feeding a CSV file into the Kafka topic. > 2) Feeding the Kafka topic as readStream as TD's article suggests. > 3) Then,

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-15 Thread Aakash Basu
n-hadoop2.7/python/lib/pyspark.zip/pyspark/sql/utils.py", line 69, in deco *pyspark.sql.utils.AnalysisException: u'USING column `Id` cannot be resolved on the left side of the join. The left-side columns: [key, value, topic, partition, offset, timestamp, timestampType];'* Seems, as

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-15 Thread Aakash Basu
k provides from_json out of the box. For a very simple POC you can happily cast the value to a string, etc. if you are prototyping and pushing messages by hand with a console producer on the kafka side. ________ From: Aakash Basu Sent: Thursday, March 15, 2018 7:52:
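The fix described here (cast or from_json the Kafka value before joining on Id) as a hedged sketch with a hypothetical schema and topic:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("kafka-join").getOrCreate()

schema = StructType([StructField("Id", StringType()),
                     StructField("first_name", StringType()),
                     StructField("last_name", StringType())])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "test1")
       .load())

# Kafka delivers key/value as binary; the Id column only exists after parsing
people = (raw.select(from_json(col("value").cast("string"), schema).alias("v"))
             .select("v.*"))

# a join on Id against another parsed stream (or a static DataFrame) now resolves
```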

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-15 Thread Aakash Basu
t; experience vs. explicitly using split and array indices, etc. In this > simple example, casting the binary to a string just works because there is > a common understanding of string's encoded as bytes between Spark and Kafka > by default. > > > -Chris > --

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-16 Thread Aakash Basu
| id|first_name|last_name|
|  1|  Kellyann|    Moyne|
|  2|     Morty|  Blacker|
|  3|     Tobit|Robardley|
|  4|    Wilona|    Kells|
|  5|     Reggy|Comizzoli|
| id|first_name|last_name|
+---+----------+---------+
only showing top 20 rows

Any help? Thanks, Aakash. On Fri, Mar 16, 2018 at 12

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-16 Thread Aakash Basu
ccounting| 8/3/2006| | 3| Tobit|Robardley| Accounting| 8/3/2006| | 5| Reggy|Comizzoli|Human Resources| 8/15/2012| | 5| Reggy|Comizzoli|Human Resources| 8/15/2012| +---+--+-+---+---+ only showing top 20 rows *Queries:* *1) Why even a

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-16 Thread Aakash Basu
es still persists, would anyone like to reply? :) Thanks, Aakash. On 16-Mar-2018 3:57 PM, "Aakash Basu" wrote: > Hi all, > > The code was perfectly alright, just the package I was submitting had to > be the updated one (marked green below). The join happened but the output >

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-21 Thread Aakash Basu
doubt you need the sleep > > 3b. Kafka consumer tuning is simply a matter of passing appropriate config > keys to the source's options if desired > > 3c. I would argue the most obvious improvement would be a more structured > and compact data format if CSV isn't requir

Wait for 30 seconds before terminating Spark Streaming

2018-03-21 Thread Aakash Basu
Hi, Using: *Spark 2.3 + Kafka 0.10* How do I wait for 30 seconds after the latest stream and, if there's no more streaming data, exit gracefully? Is it - query.awaitTermination(30) Or is it something else? I tried with this, keeping - option("startingOffsets", "latest") for both my i
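For reference: in PySpark, awaitTermination(30) only blocks the driver for up to 30 seconds and then returns; it does not stop the query. A hedged sketch of an explicit idle-timeout, polling the query status (the idle-detection via isDataAvailable is approximate):

```python
import time

def stop_when_idle(query, idle_seconds=30, poll_seconds=5):
    """Stop a running StreamingQuery once no new data has been seen
    for idle_seconds, using StreamingQuery.status['isDataAvailable']."""
    waited = 0
    while waited < idle_seconds:
        time.sleep(poll_seconds)
        if query.status["isDataAvailable"]:
            waited = 0                 # data arrived, reset the idle timer
        else:
            waited += poll_seconds
    query.stop()
```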

Re: Is there a mutable dataframe spark structured streaming 2.3.0?

2018-03-22 Thread Aakash Basu
Hey, I faced the same issue a couple of days back, kindly go through the mail chain with "*Multiple Kafka Spark Streaming Dataframe Join query*" as subject, TD and Chris has cleared my doubts, it would help you too. Thanks, Aakash. On Thu, Mar 22, 2018 at 7:50 AM, kant kodali wrote: > Hi All,

Structured Streaming Spark 2.3 Query

2018-03-22 Thread Aakash Basu
Hi, What is the way to stop a Spark Streaming job if there is no data inflow for an arbitrary amount of time (eg: 2 mins)? Thanks, Aakash.

[Query] Columnar transformation without Structured Streaming

2018-03-29 Thread Aakash Basu
Hi, I started my Spark Streaming journey from Structured Streaming using Spark 2.3, where I can easily do Spark SQL transformations on streaming data. But, I want to know, how can I do columnar transformation (like, running aggregation or casting, et al) using the prior utility of DStreams? Is th

[Structured Streaming Query] Calculate Running Avg from Kafka feed using SQL query

2018-04-02 Thread Aakash Basu
Hi, This is a very interesting requirement, where I am getting stuck at a few places. *Requirement* -

Col1  Col2
1     10
2     11
3     12
4     13
5     14

*I have to calculate avg of col1 and then divide each row of col2 by that avg. And,

Re: [Structured Streaming Query] Calculate Running Avg from Kafka feed using SQL query

2018-04-02 Thread Aakash Basu
Any help, guys? On Mon, Apr 2, 2018 at 1:01 PM, Aakash Basu wrote: > Hi, > > This is a very interesting requirement, where I am getting stuck at a few > places. > > *Requirement* - > > Col1Col2 > 1 10 > 2 11 > 3

Re: [Structured Streaming Query] Calculate Running Avg from Kafka feed using SQL query

2018-04-02 Thread Aakash Basu
Spark.py If I uncomment the collect from the above code and use it, I get the following error - *pyspark.sql.utils.AnalysisException: u'Queries with streaming sources must be executed with writeStream.start();;\nkafka'* Any alternative (better) solution to get this job done,

[Structured Streaming] How to save entire column aggregation to a file

2018-04-05 Thread Aakash Basu
Hi, I want to save an aggregate to a file without using any window, watermark or groupBy. So, my aggregation is at entire column level. df = spark.sql("select avg(col1) as aver from ds") Now, the challenge is as follows - 1) If I use outputMode = Append, but "*Append output mode not supported

Spark Structured Streaming Inner Queries fails

2018-04-05 Thread Aakash Basu
Hi, Why are inner queries not allowed in Spark Streaming? Spark assumes the inner query to be a separate stream altogether and expects it to be triggered with a separate writeStream.start(). Why so? Error: pyspark.sql.utils.StreamingQueryException: 'Queries with streaming sources must be execute

[Structured Streaming] More than 1 streaming in a code

2018-04-05 Thread Aakash Basu
Hi, If I have more than one writeStream in a code, which operates on the same readStream data, why does it produce only the first writeStream? I want the second one to be also printed on the console. How to do that? from pyspark.sql import SparkSession from pyspark.sql.functions import split, co

Fwd: Spark Structured Streaming Inner Queries fails

2018-04-06 Thread Aakash Basu
Any help? Need urgent help. Someone please clarify the doubt? -- Forwarded message -- From: Aakash Basu Date: Thu, Apr 5, 2018 at 2:50 PM Subject: Spark Structured Streaming Inner Queries fails To: user Hi, Why are inner queries not allowed in Spark Streaming? Spark assumes

Fwd: [Structured Streaming] How to save entire column aggregation to a file

2018-04-06 Thread Aakash Basu
Any help? Need urgent help. Someone please clarify the doubt? -- Forwarded message -- From: Aakash Basu Date: Thu, Apr 5, 2018 at 2:28 PM Subject: [Structured Streaming] How to save entire column aggregation to a file To: user Hi, I want to save an aggregate to a file

Fwd: [Structured Streaming Query] Calculate Running Avg from Kafka feed using SQL query

2018-04-06 Thread Aakash Basu
Any help? Need urgent help. Someone please clarify the doubt? -- Forwarded message -- From: Aakash Basu Date: Mon, Apr 2, 2018 at 1:01 PM Subject: [Structured Streaming Query] Calculate Running Avg from Kafka feed using SQL query To: user , "Bowden, Chris" &

Fwd: [Structured Streaming] More than 1 streaming in a code

2018-04-06 Thread Aakash Basu
Any help? Need urgent help. Someone please clarify the doubt? -- Forwarded message -- From: Aakash Basu Date: Thu, Apr 5, 2018 at 3:18 PM Subject: [Structured Streaming] More than 1 streaming in a code To: user Hi, If I have more than one writeStream in a code, which

Re: [Structured Streaming] More than 1 streaming in a code

2018-04-06 Thread Aakash Basu
lo Aakash, > > When you use query.awaitTermination you are pretty much blocking there > waiting for the current query to stop or throw an exception. In your case > the second query will not even start. > What you could do instead is remove all the blocking calls and use > spark.s
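The fix described here (start both sinks without blocking in between, then block once on the stream manager) as a hedged sketch using a socket source:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("two-sinks").getOrCreate()

lines = (spark.readStream
         .format("socket").option("host", "localhost").option("port", 9999)
         .load())
words = lines.select(explode(split(lines.value, " ")).alias("word"))

# start both sinks first, without a blocking awaitTermination in between
q1 = words.writeStream.outputMode("append").format("console").start()
q2 = (words.groupBy("word").count()
           .writeStream.outputMode("complete").format("console").start())

# block once, for whichever query terminates first
spark.streams.awaitAnyTermination()
```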

Re: [Structured Streaming Query] Calculate Running Avg from Kafka feed using SQL query

2018-04-09 Thread Aakash Basu
hat do you think the problem may be? Thanks in adv, Aakash. On Fri, Apr 6, 2018 at 9:55 PM, Felix Cheung wrote: > Instead of write to console you need to write to memory for it to be > queryable > > > .format("memory") >.queryName("tableName") > h
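Felix's suggestion (write to the memory sink and query the result by name) as a hedged sketch; 'spark' is the active SparkSession and 'ds' a temp view over the streaming source, as in the original post:

```python
# the whole-column average, written to an in-memory table named col1_avg
avg_query = (spark.sql("select avg(col1) as aver from ds")
             .writeStream
             .outputMode("complete")
             .format("memory")           # in-memory sink, queryable by name
             .queryName("col1_avg")
             .start())

# elsewhere (e.g. after a trigger), read back the current average
spark.sql("select aver from col1_avg").show()
```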

Is DLib available for Spark?

2018-04-10 Thread Aakash Basu
Hi team, Is DLib package available for use through Spark? Thanks, Aakash.

Re: [Structured Streaming] More than 1 streaming in a code

2018-04-16 Thread Aakash Basu
n aggregated streaming data frame. As per the documentation, joining > an aggregated streaming data frame with another streaming data frame is not > supported > > > > > > *From: *spark receiver > *Date: *Friday, April 13, 2018 at 11:49 PM > *To: *Aakash Basu >

PySpark ML: Get best set of parameters from TrainValidationSplit

2018-04-16 Thread Aakash Basu
Hi, I am running a Random Forest model on a dataset using hyper parameter tuning with Spark's paramGrid and Train Validation Split. Can anyone tell me how to get the best set for all the four parameters? I used: model.bestModel() model.metrics() But none of them seem to work. Below is the c
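For reference, a hedged sketch: bestModel is a property (not a method) on the fitted TrainValidationSplitModel, and the tuned values live on that fitted model (or its pipeline stages); tvs_model and the pipeline layout here are assumptions:

```python
# tvs_model = TrainValidationSplit(estimator=pipeline, ...).fit(train_df)

best = tvs_model.bestModel             # a property, not bestModel()
print(tvs_model.validationMetrics)     # one metric per ParamGrid combination

# if the estimator was a Pipeline, the tuned forest is one of its stages
rf_model = best.stages[-1]
print(rf_model.extractParamMap())      # every parameter of the winning model

# a single tuned value can be pulled out of that map, e.g.
print({p.name: v for p, v in rf_model.extractParamMap().items()}.get("maxDepth"))
```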

Re: [Structured Streaming] More than 1 streaming in a code

2018-04-16 Thread Aakash Basu
tured streaming, you can only window by timestamp columns. You cannot > do windows aggregations on integers. > > > > *From: *Aakash Basu > *Date: *Monday, April 16, 2018 at 4:52 AM > *To: *"Lalwani, Jayesh" > *Cc: *spark receiver , Panagiotis Garefalakis

Re: [Structured Streaming] More than 1 streaming in a code

2018-04-16 Thread Aakash Basu
the first query getting the data. > If you would talk to a TPC server supporting multiple connected clients, > you would see data in both queries. > > If your actual source is Kafka, the original solution of using > `spark.streams.awaitAnyTermination` should solve the problem. > &g

[How To] Using Spark Session in internal called classes

2018-04-23 Thread Aakash Basu
Hi, I have created my own Model Tuner class which I want to use to tune models and return a Model object if the user expects. This Model Tuner is in a file which I would ideally import into another file and call the class and use it. Outer file (from where I'd be calling the Model Tuner): I am us
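A hedged sketch of the usual pattern for this: rather than passing the session around, call SparkSession.builder.getOrCreate() inside the helper, which returns the already-running session (class and method names here are hypothetical):

```python
from pyspark.sql import SparkSession

class ModelTuner(object):
    def __init__(self):
        # returns the session created by the calling script, if one exists;
        # only builds a new one when running standalone
        self.spark = SparkSession.builder.getOrCreate()

    def tune(self, path):
        df = self.spark.read.parquet(path)
        # ... hyper-parameter search on df ...
        return df

# outer script
spark = SparkSession.builder.appName("outer-job").getOrCreate()
tuner = ModelTuner()                   # picks up the same session
```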

XGBoost on PySpark

2018-05-18 Thread Aakash Basu
Hi guys, I need help in implementing XG-Boost in PySpark. As per the conversation in a popular thread regarding XGB goes, it is available in Scala and Java versions but not Python. But, we've to implement a pythonic distributed solution (on Spark) maybe using DMLC or similar, to go ahead with XGB

Fwd: XGBoost on PySpark

2018-05-22 Thread Aakash Basu
Guys any insight on the below? -- Forwarded message -- From: Aakash Basu Date: Sat, May 19, 2018 at 12:21 PM Subject: XGBoost on PySpark To: user Hi guys, I need help in implementing XG-Boost in PySpark. As per the conversation in a popular thread regarding XGB goes, it is

[Query] Weight of evidence on Spark

2018-05-25 Thread Aakash Basu
Hi guys, What's the best way to create a feature column with Weight of Evidence calculated for categorical columns against the target column (both Binary and Multi-Class)? Any insight? Thanks, Aakash.
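For the binary-target case, WOE per category is the log of the event share over the non-event share (sign conventions vary). A hedged groupBy sketch with hypothetical column names and no smoothing for empty cells:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

spark = SparkSession.builder.appName("woe").getOrCreate()

df = spark.createDataFrame(
    [("A", 1), ("A", 0), ("A", 1), ("B", 0), ("B", 0), ("B", 1)],
    ["category", "target"])

tot_event = df.filter("target = 1").count()
tot_nonevent = df.filter("target = 0").count()

woe = (df.groupBy("category")
         .agg(f.sum("target").alias("events"),
              f.sum(1 - f.col("target")).alias("non_events"))
         .withColumn("woe",
                     f.log((f.col("events") / tot_event) /
                           (f.col("non_events") / tot_nonevent))))

# join back to attach the encoded feature column
df = df.join(woe.select("category", "woe"), on="category", how="left")
```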

Spark 2.3 Memory Leak on Executor

2018-05-26 Thread Aakash Basu
Hi, I am getting a memory leak warning which was a known Spark bug up to version 1.6 and was supposedly resolved. Mode: Standalone IDE: PyCharm Spark version: 2.3 Python version: 3.6 Below is the stack trace - 2018-05-25 15:00:05 WARN Executor:66 - Managed memory leak detected; size = 262144 bytes, T

Spark 2.3 Tree Error

2018-05-26 Thread Aakash Basu
Hi, This query goes one step further than the query in this link. In this scenario, when I add 1 or 2 more columns to be processed, Spark throws an ERROR, printing the physical plan of the queries. It says,

Silly question on Dropping Temp Table

2018-05-26 Thread Aakash Basu
Hi all, I'm trying to use dropTempTable() after the respective Temporary Table's use is over (to free up the memory for next calculations). The newer SparkSession doesn't need a sqlContext, so it is confusing me as to how to use the function. 1) Tried, same DF which I used to register a temp table to d

Re: Silly question on Dropping Temp Table

2018-05-26 Thread Aakash Basu
TempView > > On Sat, May 26, 2018, 8:56 PM Aakash Basu > wrote: > >> Hi all, >> >> I'm trying to use dropTempTable() after the respective Temporary Table's >> use is over (to free up the memory for next calculations). >> >> Newer Spark Sessio

Re: Silly question on Dropping Temp Table

2018-05-26 Thread Aakash Basu
Well, it did, meaning, internally a TempTable and a TempView are the same. Thanks buddy! On Sat, May 26, 2018 at 9:23 PM, Aakash Basu wrote: > Question is, while registering, using registerTempTable() and while > dropping, using a dropTempView(), would it go and hit the same Tem
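For the archive, the thread's resolution in one hedged snippet: createOrReplaceTempView to register, spark.catalog.dropTempView to drop (registerTempTable/dropTempTable are the older aliases for the same catalog entry):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tempview").getOrCreate()
df = spark.range(10)

df.createOrReplaceTempView("numbers")     # 2.x replacement for registerTempTable
spark.sql("select count(*) from numbers").show()

# frees the catalog entry once the intermediate result is no longer needed
spark.catalog.dropTempView("numbers")
```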
