Re: spark session jdbc performance

2017-10-24 Thread Gourav Sengupta
Hi Naveen, I do not think that it is prudent to use the PK as the partitionColumn. That is too many partitions for any system to handle. numPartitions behaves quite differently in the JDBC case. Please keep me updated on how things go. Regards, Gourav Sengupta On Tue, Oct 24, 2017 at 1
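Gourav's point can be sketched as follows. This is a hedged illustration, not the poster's actual code: the table, column names, and bounds are hypothetical, and it assumes a derived low-cardinality numeric column is used for partitioning instead of a raw primary key.

```scala
// Hypothetical sketch: partition a JDBC read on a bounded, low-cardinality
// column (e.g. a day-of-year bucket) rather than on the primary key.
val df = spark.read.format("jdbc")
  .option("url", jdbcUrl)                      // assumed connection string
  .option("dbtable", "orders")                 // illustrative table name
  .option("partitionColumn", "order_day")      // hypothetical numeric bucket column
  .option("lowerBound", "1")
  .option("upperBound", "365")
  .option("numPartitions", "12")               // 12 range queries, not millions
  .load()
```

Spark splits the [lowerBound, upperBound] range into numPartitions stride queries, so numPartitions caps concurrent JDBC connections; it is not meant to match the number of distinct key values.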

Using Spark 2.2.0 SparkSession extensions to optimize file filtering

2017-10-24 Thread Chris Luby
I have an external catalog that has additional information on my Parquet files that I want to match up with the parsed filters from the plan to prune the list of files included in the scan. I’m looking at doing this using the Spark 2.2.0 SparkSession extensions similar to the built in partition
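The extension mechanism being referred to can be sketched as below. This is a hedged outline only: `MyFilePruningRule` is a hypothetical `Rule[LogicalPlan]` that would consult the external catalog, and its body is not shown.

```scala
// Sketch: registering a custom optimizer rule via SparkSessionExtensions
// (available since Spark 2.2.0). MyFilePruningRule is hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .withExtensions { extensions =>
    extensions.injectOptimizerRule { session =>
      new MyFilePruningRule(session) // would match filters and prune file listings
    }
  }
  .getOrCreate()
```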

Re: Is Spark suited for this use case?

2017-10-24 Thread Gourav Sengupta
Hi Saravanan, SPARK may be free, but making it run with the same level of performance, consistency, and reliability will show you that SPARK or HADOOP or anything else is essentially not free. With Informatica you pay for the licensing and have almost no headaches as far as stability, upgrades, a

Re: spark session jdbc performance

2017-10-24 Thread Srinivasa Reddy Tatiredidgari
Hi, is the subquery a user-defined SQL statement or a table name in the DB? If it is user-defined SQL, make sure your partition column is in the main select clause. Sent from Yahoo Mail on Android On Wed, Oct 25, 2017 at 3:25, Naveen Madhire wrote: Hi,   I am trying to fetch data from Oracle DB using a subq

Re: Orc predicate pushdown with Spark Sql

2017-10-24 Thread Jörn Franke
Well the meta information is in the file so I am not surprised that it reads the file, but it should not read all the content, which is probably also not happening. > On 24. Oct 2017, at 18:16, Siva Gudavalli > wrote: > > > Hello, > > I have an update here. > > spark SQL is pushing pre

Re: spark session jdbc performance

2017-10-24 Thread lucas.g...@gmail.com
Sorry, I meant to say: "That code looks SANE to me" Assuming that you're seeing the query running partitioned as expected then you're likely configured with one executor. Very easy to check in the UI. Gary Lucas On 24 October 2017 at 16:09, lucas.g...@gmail.com wrote: > Did you check the quer

Re: Spark streaming for CEP

2017-10-24 Thread lucas.g...@gmail.com
This looks really interesting, thanks for linking! Gary Lucas On 24 October 2017 at 15:06, Mich Talebzadeh wrote: > Great thanks Steve > > Dr Mich Talebzadeh > > > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >

Null array of cols

2017-10-24 Thread Mohit Anchlia
I am trying to understand the best way to handle the scenario where a null array "[]" is passed. Can somebody suggest if there is a way to filter out such records? I've tried numerous things including using dataframe.head().isEmpty but pyspark doesn't recognize isEmpty even though I see it in the API
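One common approach to the question above can be sketched like this. It is a hedged example with a hypothetical column name (`tags`); the original poster's schema is not shown.

```scala
// Sketch: drop rows whose array column is null or empty.
// "tags" is a hypothetical array-typed column.
import org.apache.spark.sql.functions.{col, size}

val nonEmpty = df.filter(col("tags").isNotNull && size(col("tags")) > 0)

// Note: df.head() returns a Row, which has no isEmpty method; to test
// whether a DataFrame is empty, use df.rdd.isEmpty or df.count() == 0.
```

In PySpark the same filter is available via `pyspark.sql.functions.size`, and the empty-DataFrame check is `df.rdd.isEmpty()`.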

Re: spark session jdbc performance

2017-10-24 Thread lucas.g...@gmail.com
Did you check the query plan / check the UI? That code looks same to me. Maybe you've only configured for one executor? Gary On Oct 24, 2017 2:55 PM, "Naveen Madhire" wrote: > > Hi, > > > > I am trying to fetch data from Oracle DB using a subquery and experiencing > lot of performance issues.
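The two checks suggested here can be sketched as follows; this is illustrative only, and the submit flags shown are examples, not the poster's actual configuration.

```scala
// Sketch: verify JDBC partitioning in the plan, then check parallelism.
// In the physical plan, look for JDBCRelation(...) [numPartitions=N].
df.explain(true)

// Even with N JDBC partitions, parallelism is capped by executors/cores,
// e.g. (illustrative flags):
//   spark-submit --num-executors 4 --executor-cores 2 ...
// The Executors tab in the Spark UI shows how many are actually running.
```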

Re: Spark streaming for CEP

2017-10-24 Thread Mich Talebzadeh
Great thanks Steve Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com *Disclaimer:* Use it at your own risk.

Re: Spark streaming for CEP

2017-10-24 Thread Stephen Boesch
Hi Mich, the github link has a brief intro - including a link to the formal docs http://logisland.readthedocs.io/en/latest/index.html . They have an architectural overview, developer guide, tutorial, and pretty comprehensive api docs. 2017-10-24 13:31 GMT-07:00 Mich Talebzadeh : > thanks Thomas

spark session jdbc performance

2017-10-24 Thread Naveen Madhire
Hi, I am trying to fetch data from Oracle DB using a subquery and experiencing lot of performance issues. Below is the query I am using, Using Spark 2.0.2 val df = spark_session.read.format("jdbc") .option("driver","oracle.jdbc.OracleDriver") .option("url", jdbc_url) .o
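A complete version of this kind of read can be sketched as below. It is hedged: the subquery, column names, and bounds are illustrative, not the poster's actual query. The key details are that Oracle-style derived tables need an alias and that the partition column must appear in the subquery's select list.

```scala
// Sketch: a partitioned JDBC read over a subquery (illustrative names).
val query = "(select order_id, status, amount from orders where status = 'OPEN') t"

val df = spark_session.read.format("jdbc")
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("url", jdbc_url)
  .option("dbtable", query)               // the aliased subquery stands in for a table
  .option("partitionColumn", "order_id")  // must be selected by the subquery
  .option("lowerBound", "1")
  .option("upperBound", "1000000")
  .option("numPartitions", "8")
  .load()
```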

spark session jdbc performance

2017-10-24 Thread Madhire, Naveen
Hi, I am trying to fetch data from Oracle DB using a subquery and experiencing lot of performance issues. Below is the query I am using, Using Spark 2.0.2 val df = spark_session.read.format("jdbc") .option("driver","oracle.jdbc.OracleDriver") .option("url", jdbc_url) .option("user", user)

Re: Spark streaming for CEP

2017-10-24 Thread Mich Talebzadeh
Thanks Thomas. Do you have a summary write-up for this tool please? Regards, Thomas Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http:

Re: Orc predicate pushdown with Spark Sql

2017-10-24 Thread Siva Gudavalli
Hello, I have an update here. Spark SQL is pushing predicates down when I load the ORC files in the Spark context, but not when I try to read the Hive table directly. Please let me know if I am missing something here. Is this supported in Spark? When I load the files in the Spark context: scal
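The pushdown behaviour being tested can be sketched as below. This is a hedged example: the path and filter are hypothetical, and it assumes a Spark version (like the 2.x line discussed here) where ORC filter pushdown is off by default and must be enabled explicitly.

```scala
// Sketch: enable ORC predicate pushdown, then inspect the plan.
// Path and column names are illustrative.
spark.conf.set("spark.sql.orc.filterPushdown", "true")

val orcDf = spark.read.orc("/data/events_orc")
orcDf.filter("event_type = 'click'")
  .explain(true)  // look for PushedFilters: [...] in the physical plan
```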

Re: Spark streaming for CEP

2017-10-24 Thread Thomas Bailet
Hi, we (@ hurence) have released an open source middleware based on Spark Streaming over Kafka to do CEP and log mining, called *logisland* (https://github.com/Hurence/logisland/). It has been deployed in production for 2 years now and does a great job. You should have a look. bye Thomas Ba

Databricks Certification Registration

2017-10-24 Thread sanat kumar Patnaik
Hello All, Can anybody here please provide me a link to register for the Databricks Spark developer certification (US based)? I have been googling but always end up with this page: http://www.oreilly.com/data/sparkcert.html?cmp=ex-data-confreg-lp-na_databricks&__hssc=249029528.5.1508846982378&_

Re: Zero Coefficient in logistic regression

2017-10-24 Thread Alexis Peña
Thanks, 8/10 coefficients have zero estimates in CRUZADAS; the parameters for alpha and lambda are set to the default (I think zero). The model in R and SAS was fitted using glm binary logistic. Cheers From: Simon Dirmeier Date: Tuesday, 24 October 2017, 08:30 To: Alexis Peña, Subject: Re: Zer

Re: Zero Coefficient in logistic regression

2017-10-24 Thread Simon Dirmeier
So, all the coefficients are the same except for CRUZADAS? How are you fitting the model in R (glm)? Can you try setting zero penalty for alpha and lambda: .setRegParam(0) .setElasticNetParam(0) Cheers, S On 24.10.17 at 13:19, Alexis Peña wrote: Thanks for your Answer, the features “Cr
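Simon's suggestion can be sketched as below. It is hedged: the training DataFrame and its `features`/`label` columns are assumed, and the point is only that an unpenalized fit is what makes Spark's estimates comparable to R's glm or SAS, which apply no shrinkage.

```scala
// Sketch: fit an unpenalized logistic regression (no L1/L2 shrinkage).
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setRegParam(0.0)         // lambda = 0: no penalty at all
  .setElasticNetParam(0.0)  // alpha = 0: L2 mix (irrelevant when regParam is 0)

val model = lr.fit(trainingDf)  // trainingDf with features/label columns assumed
println(model.coefficients)     // compare against glm / SAS estimates
```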

Re: Zero Coefficient in logistic regression

2017-10-24 Thread Alexis Peña
Thanks for your answer. The features “Cruzadas” are binary (0/1). The chi-squared statistic should work with 2x2 tables. I fit the model in SAS and R and in both the coefficients have estimates (not significant). Two of these features have estimates: CRUZADAS4907 0,247624087 CRUZADAS5304 -0,16142

Re: Zero Coefficient in logistic regression

2017-10-24 Thread Weichen Xu
Yes, the chi-squared statistic is only used for categorical features. It does not look appropriate here. Thanks! On Tue, Oct 24, 2017 at 5:13 PM, Simon Dirmeier wrote: > Hey, > as far as I know feature selection using the a chi-squared statistic, can > only be done on categorical features and not on possibly cont

Re: Zero Coefficient in logistic regression

2017-10-24 Thread Simon Dirmeier
Hey, as far as I know, feature selection using a chi-squared statistic can only be done on categorical features, not on possibly continuous ones. Furthermore, since your logistic model doesn't use any regularization, you should be fine there. So I'd check the ChiSqSelector and possibly r
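The selector under discussion can be sketched as below. This is a hedged illustration: column names and the number of features to keep are hypothetical, and the caveat from the thread applies, namely that the chi-squared test assumes the feature vector holds categorical values.

```scala
// Sketch: ChiSqSelector keeps the top-k features by chi-squared score.
// Valid only when the "features" column contains categorical values.
import org.apache.spark.ml.feature.ChiSqSelector

val selector = new ChiSqSelector()
  .setNumTopFeatures(10)        // illustrative k
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")

val selected = selector.fit(df).transform(df)
```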

Fwd: Spark 1.x - End of life

2017-10-24 Thread Ismaël Mejía
Thanks for your answer Matei. I agree that a more explicit maintenance policy is needed (even for the 2.x releases). I did not immediately find anything about this in the website, so I ended up assuming the information of the wikipedia article that says that the 1.6.x line is still maintained. I s

Accessing UI for Spark running as Kubernetes container on standby name node

2017-10-24 Thread Mohit Gupta
Hi, We are launching all Spark jobs as Kubernetes (k8s) containers inside a k8s cluster. We also create a service for each job and do port forwarding for the Spark UI (container's 4040 is mapped to SvcPort 31123). The same set of nodes is also hosting a Yarn cluster. Inside the container, we do spark