Connecting to Oracle Autonomous Data warehouse (ADW) from Spark via JDBC

2020-08-26 Thread Mich Talebzadeh
Hi, The connection from Spark to Oracle 12c etc. is well established using ojdbc6.jar. I am attempting to connect to Oracle Autonomous Data Warehouse (ADW) version *Oracle Database 19c Enterprise Edition Release 19.0.0.0.0*. The Oracle documentation suggests using ojdbc8.jar to connect to the database with
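
A minimal PySpark sketch of the kind of JDBC read being attempted, assuming ojdbc8.jar is on the classpath and the ADW wallet has been unzipped locally; the TNS alias, wallet path, schema and credentials below are placeholders, not values from the thread:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("adw-jdbc-read")
             .config("spark.jars", "/home/hduser/jars/ojdbc8.jar")   # driver jar, illustrative path
             .getOrCreate())

    # tns alias from the wallet's tnsnames.ora; TNS_ADMIN points at the unzipped wallet directory
    adw_url = "jdbc:oracle:thin:@myadw_high?TNS_ADMIN=/home/hduser/wallet_myadw"

    df = (spark.read.format("jdbc")
          .option("url", adw_url)
          .option("driver", "oracle.jdbc.OracleDriver")
          .option("dbtable", "scratchpad.dummy")
          .option("user", "scratchpad")
          .option("password", "********")
          .load())
    df.printSchema()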

Re: Connecting to Oracle Autonomous Data warehouse (ADW) from Spark via JDBC

2020-08-26 Thread Mich Talebzadeh
/keystore.jks* > *#javax.net.ssl.keyStorePassword=* > > Alternatively, if you want to use JKS, then you need to comment out the > first line and un-comment the other lines and set the values. > > Kuassi > On 8/26/20 11:58 AM, Mich Talebzadeh wrote: > > Hi, > > T

Re: Connecting to Oracle Autonomous Data warehouse (ADW) from Spark via JDBC

2020-08-26 Thread Mich Talebzadeh
chars in username or password? > > it is recommended not to use characters such as '@' or '.' in your > password. > > Best, Kuassi > > On 8/26/20 12:52 PM, Mich Talebzadeh wrote: > > Thanks Kuassi. > > This is the version of the jar file that works OK

Re: Connecting to Oracle Autonomous Data warehouse (ADW) from Spark via JDBC

2020-08-26 Thread Mich Talebzadeh
On Wed, 26 Aug 2020 at 21:09, wrote: > Mich, > > All looks fine. > Perhaps some special chars in username or password? > > it is recommended not to use such characters like '@', '.' in y

Re: Connecting to Oracle Autonomous Data warehouse (ADW) from Spark via JDBC

2020-08-26 Thread Mich Talebzadeh
On Wed, 26 Aug 2020 at 21:58, Mich Talebzadeh wrote: > Hi Kuassi, > > This is the error. Only test running on local mode

Re: Connecting to Oracle Autonomous Data warehouse (ADW) from Spark via JDBC

2020-08-27 Thread Mich Talebzadeh
ad of > 18.3 jar. > You can ask them to use either full URL or tns alias format URL with > tns_admin path set as either connection property or system property. > > Regards, Kuassi > > On 8/26/20 2:11 PM, Mich Talebzadeh wrote: > > And this is a test using Oracle supplie

Re: Connecting to Oracle Autonomous Data warehouse (ADW) from Spark via JDBC

2020-08-28 Thread Mich Talebzadeh
On Thu, 27 Aug 2020 at 17:34, wrote: > Mich, > > That's right, referring to you

Exception handling in Spark throws recursive value for DF needs type error

2020-10-01 Thread Mich Talebzadeh
Hi, Spark version 2.3.3 on Google Dataproc I am trying to use databricks to other databases https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html to read from Hive table on Prem using Spark in Cloud This works OK without a Try enclosure. import spark.implicits._ import scala

Re: Exception handling in Spark throws recursive value for DF needs type error

2020-10-01 Thread Mich Talebzadeh
two vars and it ends up ambiguous. Just rename > one. > > On Thu, Oct 1, 2020, 5:02 PM Mich Talebzadeh > wrote: > >> Hi, >> >> >> Spark version 2.3.3 on Google Dataproc >> >> >> I am trying to use databricks to other databases >> >&

Re: Exception handling in Spark throws recursive value for DF needs type error

2020-10-01 Thread Mich Talebzadeh
me). > > option("password", HybridServerPassword). > > load()) match { > > * case Success(validDf) => validDf* > >case Failure(e) => throw new Exception("Error > Encountered reading Hive table") >

Re: Exception handling in Spark throws recursive value for DF needs type error

2020-10-02 Thread Mich Talebzadeh
On Fri, 2 Oct 2020 at 05:33, Mich Talebzadeh wrote: > Many thanks Russell. That worked > > val *HiveDF* = Try(spark.read

Re: Exception handling in Spark throws recursive value for DF needs type error

2020-10-02 Thread Mich Talebzadeh
ark execution. > It doesn't seem like it helps though - you are just swallowing the cause. > Just let it fly? > > On Fri, Oct 2, 2020 at 9:34 AM Mich Talebzadeh > wrote: > >> As a side question consider the following read JDBC read >> >> >> val lowerBo

Reading BigQuery data from Spark in Google Dataproc

2020-10-05 Thread Mich Talebzadeh
Hi, I have tested a few JDBC BigQuery providers like Progress DataDirect and Simba but none of them seem to work properly through Spark. The only way I can read and write to BigQuery is through the Spark BigQuery API using the following scenario: spark-shell --jars=gs://spark-lib/bigquery/spark-bigquery-l
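
For reference, a minimal PySpark sketch of the connector read path that does work, assuming the spark-bigquery connector jar has been passed via --jars or --packages; the project, dataset and table names are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bq-read").getOrCreate()

    df = (spark.read.format("bigquery")
          .option("table", "my_project.my_dataset.my_table")
          .load())
    df.show(5, truncate=False)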

Scala vs Python for ETL with Spark

2020-10-09 Thread Mich Talebzadeh
I have come across occasions when the teams use Python with Spark for ETL, for example processing data from S3 buckets into Snowflake with Spark. The only reason I think they are choosing Python as opposed to Scala is because they are more familiar with Python. Since Spark is written in Scala, its

Re: Scala vs Python for ETL with Spark

2020-10-09 Thread Mich Talebzadeh
d java code > so there won't be a big difference between python and scala. > > On Fri, Oct 9, 2020 at 3:57 PM Mich Talebzadeh > wrote: > >> I have come across occasions when the teams use Python with Spark for >> ETL, for example processing data from S3 buckets into

Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Mich Talebzadeh
Python just for the sake of it. Disclaimer: These are opinions and not facts so to speak :) Cheers, Mich On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh wrote: > I have come across occasions when the teams use Python with Spark for ETL, > for example processing data from S3 bucket

Re: Scala vs Python for ETL with Spark

2020-10-11 Thread Mich Talebzadeh
20, 21:24 Stephen Boesch, wrote: >> >>> I agree with Wim's assessment of data engineering / ETL vs Data >>> Science.I wrote pipelines/frameworks for large companies and scala was >>> a much better choice. But for ad-hoc work interfacing directly with data

Re: Scala vs Python for ETL with Spark

2020-10-11 Thread Mich Talebzadeh
de. Later he became head of machine learning >>>>>> somewhere else and he loved C and Python. So Python was a gift in >>>>>> disguise. >>>>>> I think Python appeals to those who are very familiar with CLI and shell >>>>>> prog

Re: Scala vs Python for ETL with Spark

2020-10-11 Thread Mich Talebzadeh
cases according to you? This is > interesting, really interesting. Perhaps I stand corrected. > > Regards, > Gourav > > On Sun, Oct 11, 2020 at 5:00 PM Mich Talebzadeh > wrote: > >> if we take Spark and its massive parallel processing and in-memory >> cache away, then one

Re: Spark as computing engine vs spark cluster

2020-10-12 Thread Mich Talebzadeh
Hi Santosh, Generally speaking, there are two ways of making a process faster: 1. Do more intelligent work by creating indexes, cubes etc thus reducing the processing time 2. Throw hardware and memory at it using something like Spark multi-cluster with fully managed cloud service lik

The simplest syntax for Spark/Scala collect.foreach(println) in PySpark

2020-10-12 Thread Mich Talebzadeh
Hi, In Spark/Scala one can do scala> println ("\nStarted at"); spark.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss') ").collect.foreach(println) Started at [12/10/2020 22:29:19.19] I believe foreach(println) is a special syntax in this case. I can also do a verbose one sca
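
A hedged PySpark equivalent, assuming an existing SparkSession named spark: collect() returns a Python list of Rows, so an ordinary loop takes the place of foreach(println):

    rows = spark.sql(
        "SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy HH:mm:ss.ss')").collect()
    for row in rows:
        print(row[0])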

The equivalent of Scala mapping in Pyspark

2020-10-13 Thread Mich Talebzadeh
Hi, I generate an array of random data and create a DF in Spark Scala as follows: val end = start + numRows - 1 println (" starting at ID = " + start + " , ending on = " + end ) val usedFunctions = new UsedFunctions val text = ( start to end ).map(i => ( i.toSt
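
A rough PySpark analogue of the Scala (start to end).map(...) pattern, assuming an existing SparkSession named spark; the helper below stands in for the original UsedFunctions and the column list is illustrative only:

    import random
    import string

    def random_string(length):
        return ''.join(random.choice(string.ascii_letters) for _ in range(length))

    start, num_rows = 1, 10
    end = start + num_rows - 1

    rows = [(i, random_string(50)) for i in range(start, end + 1)]
    df = spark.createDataFrame(rows, ["ID", "RANDOM_STRING"])
    df.show(truncate=False)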

Re: The equivalent of Scala mapping in Pyspark

2020-10-15 Thread Mich Talebzadeh
quet tables and store them in Parquet format. If table exists, new rows are appended. Any feedback will be much appreciated (negative or positive so to speak). Thanks, Mich

Re: How to Scale Streaming Application to Multiple Workers

2020-10-15 Thread Mich Talebzadeh
Hi, This in general depends on how many topics you want to process at the same time and whether this is done on-premise running Spark in cluster mode. Have you looked at Spark GUI to see if one worker (one JVM) is adequate for the task? Also how these small files are read and processed. Is it th

Re: Scala vs Python for ETL with Spark

2020-10-15 Thread Mich Talebzadeh
On Sun, 11 Oct 2020 at 20:46, Mich Talebzadeh wrote: > Hi, > > With regard to your statement below > > ".technology choices are agnostic to use cases according to you" > > If I may say, I do not think that was the message implied

Re: Count distinct and driver memory

2020-10-19 Thread Mich Talebzadeh
Best to check this in Spark GUI under storage and see what is causing the issue. HTH

Why spark-submit works with package not with jar

2020-10-20 Thread Mich Talebzadeh
Hi, I have a scenario that I use in Spark submit as follows: spark-submit --driver-class-path /home/hduser/jars/ddhybrid.jar --jars /home/hduser/jars/spark-bigquery-latest.jar,/home/hduser/jars/ddhybrid.jar, */home/hduser/jars/spark-bigquery_2.11-0.2.6.jar* As you can see the jar files needed ar

Re: Why spark-submit works with package not with jar

2020-10-20 Thread Mich Talebzadeh
wrote: > --jars adds only that jar > --packages adds the jar and its dependencies listed in Maven > > On Tue, Oct 20, 2020 at 10:50 AM Mich Talebzadeh < > mich.talebza...@gmail.com> wrote: > >> Hi, >> >> I have a scenario that I use in Spark submit a
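
The same distinction can be expressed from code rather than on the spark-submit line: spark.jars takes local jar paths with no dependency resolution, while spark.jars.packages takes Maven coordinates that Ivy resolves transitively. A hedged sketch, with an illustrative jar path and coordinate:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("jars-vs-packages")
             # local jar only; nothing else is pulled in
             .config("spark.jars", "/home/hduser/jars/ddhybrid.jar")
             # Maven coordinate, resolved together with its transitive dependencies via Ivy
             .config("spark.jars.packages",
                     "com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.17.1")
             .getOrCreate())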

Re: Why spark-submit works with package not with jar

2020-10-20 Thread Mich Talebzadeh
> > One way to think of this is --packages is better when you have third > party > > dependency and --jars is better when you have custom in-house built jars. > > > > On Wed, 21 Oct 2020 at 3:44 am, Mich Talebzadeh < > mich.talebza...@gmail.com> > > wrote: >

Re: Why spark-submit works with package not with jar

2020-10-20 Thread Mich Talebzadeh
et Maven > / Ivy resolution figure it out. It is not true that everything in .ivy2 is > on the classpath. > > On Tue, Oct 20, 2020 at 3:48 PM Mich Talebzadeh > wrote: > >> Hi Nicolas, >> >> I removed ~/.ivy2 and reran the spark job with the package included (

Re: Why spark-submit works with package not with jar

2020-10-20 Thread Mich Talebzadeh
2020 at 22:43, Mich Talebzadeh wrote: > Thanks again all. > > Hi Sean, > > As I understood from your statement, you are suggesting just use > --packages without worrying about individual jar dependencies? > > > > LinkedIn * > https://www.linkedin.com/profile/vie

Re: Why spark-submit works with package not with jar

2020-10-20 Thread Mich Talebzadeh
's no 100% guarantee that conflicting dependencies are resolved in a > way that works in every single case, which you run into sometimes when > using incompatible libraries, but yes this is the point of --packages and > Ivy. > > On Tue, Oct 20, 2020 at 4:43 PM Mich Talebzadeh >

Re: Why spark-submit works with package not with jar

2020-10-21 Thread Mich Talebzadeh
he internet or even the internal > proxying artefact repository. > > Also, wasn't uberjars an antipattern? For some reason I don't like them... > > Kind regards > -wim > > > > On Wed, 21 Oct 2020 at 01:06, Mich Talebzadeh > wrote: > >> Thanks again a

Re: Why spark-submit works with package not with jar

2020-10-21 Thread Mich Talebzadeh
How about PySpark? What process can that go through to not depend on external repo access in production

Re: Scala vs Python for ETL with Spark

2020-10-22 Thread Mich Talebzadeh
On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh wrote: > I have come across occasions when the teams use Python with Spark for ETL, > for example processing data from S3 buckets into Snowflake with Spark. > > The only reason I think they are choosing Python as opposed to Scala

Re: Spark hive build and connectivity

2020-10-22 Thread Mich Talebzadeh
Hi Ravi, What exactly are you trying to do? Do you want to enhance Spark SQL or do you want to run Hive on the Spark engine? HTH

Re: Spark hive build and connectivity

2020-10-22 Thread Mich Talebzadeh
; - I see that whenever i build spark with hive support (-Phive > -Phive-thriftserver) , it gets built with hive 2.3.7 jars. So , will it be > ok if i access tables created using my hive 3.2.1 cluster ? > - Do i have to add hive 3.2.1 jars to spark's (SPARK_DIST_CLASSPATH) ? >

Re: Scala vs Python for ETL with Spark

2020-10-22 Thread Mich Talebzadeh
Thanks for the feedback Sean. Kind regards, Mich

Re: Scala vs Python for ETL with Spark

2020-10-23 Thread Mich Talebzadeh
you but agree with Sean. That is mostly not true. > > In your previous posts you also mentioned this . The only reason we > sometimes have to bail out to Scala is for performance with certain udfs > > On Thu, 22 Oct 2020 at 23:11, Mich Talebzadeh > wrote: > >>

Re: Custom JdbcConnectionProvider

2020-10-28 Thread Mich Talebzadeh
I think you can pick up your custom-built driver from the command line itself. Here I am using a custom-built third-party driver to access an Oracle table on-premises from cloud: val jdbUrl = "jdbc:datadirect:ddhybrid://"+HybridServer+":"+HybridPort+";hybridDataPipelineDataSource="+ hybridDataPipeline

Re: repartition in Spark

2020-11-09 Thread Mich Talebzadeh
As a generic answer, in a distributed environment like Spark, making sure that data is distributed evenly among all nodes (assuming every node is the same or similar) can help performance. repartition thus controls the data distribution among all nodes. However, it is not that straightforward. Your

Creating hive table through df.write.mode("overwrite").saveAsTable("DB.TABLE")

2020-11-10 Thread Mich Talebzadeh
Hi, In Spark I specifically specify the format of the table to be created sqltext = """ CREATE TABLE test.randomDataPy( ID INT , CLUSTERED INT , SCATTERED INT , RANDOMISED INT , RANDOM_STRING VARCHAR(50) , SMALL_VC VARCHAR(50) , PADDING VARCHAR(4000)
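
A minimal PySpark sketch of the pattern being described: create the Hive table with an explicit DDL and storage format first, then load it, rather than letting saveAsTable pick the format. It assumes Hive support is enabled and that a DataFrame df already holds the generated data:

    sqltext = """
    CREATE TABLE IF NOT EXISTS test.randomDataPy(
        ID INT
      , CLUSTERED INT
      , SCATTERED INT
      , RANDOMISED INT
      , RANDOM_STRING VARCHAR(50)
      , SMALL_VC VARCHAR(50)
      , PADDING VARCHAR(4000)
    )
    STORED AS PARQUET
    """
    spark.sql(sqltext)
    df.write.mode("append").insertInto("test.randomDataPy")   # column order must match the DDL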

Re: Path of jars added to a Spark Job - spark-submit // // Override jars in spark submit

2020-11-12 Thread Mich Talebzadeh
As I understand Spark expects the jar files to be available on all nodes or if applicable on HDFS directory Putting Spark Jar files on HDFS In Yarn mode, *it is important that Spark jar files are available throughout the Spark cluster*. I have spent a fair bit of time on this and I recommend that
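
A hedged illustration of one way to do that: stage the Spark jars once on HDFS and point YARN at them, so every node resolves the same set. The HDFS path is a placeholder, and the property can equally be set in spark-defaults.conf:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("yarn-staged-jars")
             .config("spark.yarn.jars", "hdfs://namenode:9000/spark-jars/*.jar")
             .getOrCreate())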

PyCharm IDE throws spark error

2020-11-13 Thread Mich Talebzadeh
Hi, This is basically a simple module from pyspark import SparkContext from pyspark.sql import SQLContext from pyspark.sql import HiveContext from pyspark.sql import SparkSession from pyspark.sql import Row from pyspark.sql.types import StringType, ArrayType from pyspark.sql.functions import udf,

Re: PyCharm IDE throws spark error

2020-11-15 Thread Mich Talebzadeh
On Fri, 13 Nov 2020 at 23:25, Wim Van Leuven wrote: > No Java installed? Or the process can't find it? JAVA_HOME not set? > > On Fri, 13 Nov 2020 at 23:24, Mich Talebzadeh > wrote: > >> Hi, >> >> This is basically a simple module >> >> from pyspark i

spark-sql on windows throws Exception in thread "main" java.lang.UnsatisfiedLinkError:

2020-11-16 Thread Mich Talebzadeh
Need to create some hive test tables for pyCharm SPARK_HOME is set up as D:\temp\spark-3.0.1-bin-hadoop2.7 HADOOP_HOME is c:\hadoop\ spark-shell works. Trying to run spark-sql, I get the following errors PS C:\tmp\hive> spark-sql log4j:WARN No appenders could be found for logger (org.apache.h

Error in PyCharm with PySpark

2020-11-26 Thread Mich Talebzadeh
Hi, I do not know why I am getting this error in Pycharm! if __name__ == "__main__" : contract_json_path = os.path. \ join("../", "../", "conf/contractterms_app.json") default_json_path = os.path.join( "../", "../", "tests/data/contractterms_data/input_data/default_app

Separating storage from compute layer with Spark and data warehouses offering ML capabilities

2020-11-29 Thread Mich Talebzadeh
This is a generic question with regard to an optimum design. Many Cloud Data Warehouses like Google BigQuery (BQ) or Oracle Autonomous Data Warehouse (ADW), nowadays off

In windows 10, accessing Hive from PySpark with PyCharm throws error

2020-12-02 Thread Mich Talebzadeh
Hi, I have a simple code that tries to create Hive derby database as follows: from pyspark import SparkContext from pyspark.sql import SQLContext from pyspark.sql import HiveContext from pyspark.sql import SparkSession from pyspark.sql import Row from pyspark.sql.types import StringType, ArrayTyp

Re: In windows 10, accessing Hive from PySpark with PyCharm throws error

2020-12-03 Thread Mich Talebzadeh
On Wed, 2 Dec 2020 at 23:11, Artemis User wrote: > Apparently this is an OS dynamic lib link error. Make sure you have the > LD_LIBRARY_PATH (in Linux) or PATH (Windows) set up properly for the right > .so or .dll file... > On 12/2/20 5:31 PM, Mich Talebzadeh wrote

Re: In windows 10, accessing Hive from PySpark with PyCharm throws error

2020-12-04 Thread Mich Talebzadeh
dows, this lib path may be different). So add this path to > your PATH environmental variable in your command shell before running > spark-submit again. > > -- ND > On 12/3/20 6:28 PM, Mich Talebzadeh wrote: > > This is becoming serious pain. > > using powershell I

substitution invocator for a variable in PyCharm sql

2020-12-07 Thread Mich Talebzadeh
In Spark/Scala you can use the 's' string interpolator to substitute a variable in a SQL call, for example var sqltext = s""" INSERT INTO TABLE ${broadcastStagingConfig.broadcastTable} PARTITION (broadcastId = ${broadcastStagingConfig.broadcastValue},brand) SELECT ocis_mrg_p
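
The usual Python counterpart (3.6+) is an f-string; a hedged sketch mirroring the Scala s-string above, with placeholder table and partition values:

    broadcast_table = "staging.broadcast"          # placeholder
    broadcast_value = 20201207                     # placeholder

    sqltext = f"""
    INSERT INTO TABLE {broadcast_table} PARTITION (broadcastId = {broadcast_value}, brand)
    SELECT ...
    """
    print(sqltext)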

Re: substitution invocator for a variable in PyCharm sql

2020-12-07 Thread Mich Talebzadeh
ilable in python 3.6. It uses a different syntax than scala's > https://www.programiz.com/python-programming/string-interpolation > > On Mon, Dec 7, 2020 at 7:05 AM Mich Talebzadeh > wrote: > >> In Spark/Scala you can use 's' substitution invocator

Using Lambda function to generate random data in PySpark throws not defined error

2020-12-11 Thread Mich Talebzadeh
Hi, This used to work but not anymore. I have UsedFunctions.py file that has these functions import random import string import math def randomString(length): letters = string.ascii_letters result_str = ''.join(random.choice(letters) for i in range(length)) return result_str def cl

Re: Using Lambda function to generate random data in PySpark throws not defined error

2020-12-11 Thread Mich Talebzadeh
uf.clustered(x, numRows), \ NameError: name 'numRows' is not defined Regards, Mich

Re: Using Lambda function to generate random data in PySpark throws not defined error

2020-12-11 Thread Mich Talebzadeh
On Fri, 11 Dec 2020 at 16:56, Mich Talebzadeh wrote: > Thanks Sean, > > This is the code > > numRows = 10

Re: Using Lambda function to generate random data in PySpark throws not defined error

2020-12-11 Thread Mich Talebzadeh
("""SELECT * FROM pycharm.randomDataPy ORDER BY id""").show(n=20,truncate=False,vertical=False) lst = (spark.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/ HH:mm:ss.ss') ")).collect() print("\nFinish

Re: Using Lambda function to generate random data in PySpark throws not defined error

2020-12-12 Thread Mich Talebzadeh
On Fri, 11 Dec 2020 at 18:52, Mich Talebzadeh wrote: > many thanks KR. > > If i call

Re: Using Lambda function to generate random data in PySpark throws not defined error

2020-12-13 Thread Mich Talebzadeh
3 Dec 2020 at 15:10, Sean Owen wrote: > I don't believe you'll be able to use globals in a Spark task, as they > won't exist on the remote executor machines. > > On Sun, Dec 13, 2020 at 3:46 AM Mich Talebzadeh > wrote: > >> thanks Marco. >> >>
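
A hedged sketch of the point being made: anything a lambda needs on the executors has to be captured in the closure or passed as an argument, not looked up as a module-level global defined elsewhere. Assumes an existing SparkSession named spark; numRows and the helper are illustrative:

    import math

    def clustered(x, num_rows):
        return math.floor(x * 100 / num_rows) / 100

    num_rows = 10
    rdd = spark.sparkContext.parallelize(range(1, num_rows + 1))
    # num_rows is captured in the closure and shipped with the task
    print(rdd.map(lambda x: clustered(x, num_rows)).collect())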

Re: Is Spark SQL able to auto update partition stats like hive by setting hive.stats.autogather=true

2020-12-18 Thread Mich Talebzadeh
I am afraid this is not supported for Spark SQL; see Automatic Statistics Collection For Better Query Performance | Qubole. I tried it as below: spark = SparkSession.builder \ .appName("app1") \ .enableH
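
Since the statistics are not gathered automatically on the Spark side, they can be refreshed explicitly with Spark SQL instead; a hedged sketch with placeholder table and column names:

    spark.sql("ANALYZE TABLE test.mytable COMPUTE STATISTICS")
    spark.sql("ANALYZE TABLE test.mytable COMPUTE STATISTICS FOR COLUMNS id, amount")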

Re: Is Spark SQL able to auto update partition stats like hive by setting hive.stats.autogather=true

2020-12-18 Thread Mich Talebzadeh

Re: Re: Is Spark SQL able to auto update partition stats like hive by setting hive.stats.autogather=true

2020-12-19 Thread Mich Talebzadeh
On Sat, 19 Dec 2020 at 07:51, 疯狂的哈丘 wrote: > thx,but `hive.stats.autogather` is n

Spark 3.0.1 fails to insert into Hive Parquet table but Spark 2.11.12 used to work

2020-12-19 Thread Mich Talebzadeh
Hi, Upgraded Spark from 2.11.12 to Spark 3.0.1 Hive version 3.1.1 and Hadoop version 3.1.1 The following used to work with Spark 2.11.12 scala> sqltext = s""" | INSERT INTO TABLE ${fullyQualifiedTableName} | SELECT | ID | , CLUSTERED | ,

Re: Spark 3.0.1 fails to insert into Hive Parquet table but Spark 2.11.12 used to work

2020-12-21 Thread Mich Talebzadeh
On Sat, 19 Dec 2020 at 20:27, Mich Talebzadeh wrote: > Hi, > > Upgraded Spark from 2.11.12 to Spark 3.0.1 > > Hive version 3.1.1 and Hadoop version 3.1.1 > > The following used to work with Spark 2.11.12 > > scala> sqltext = s"""

Using UDF based on Numpy functions in Spark SQL

2020-12-23 Thread Mich Talebzadeh
Hi, This is a shot in the dark so to speak. I would like to use the standard deviation std offered by numpy in PySpark. I am using SQL for now. The code is as below: sqltext = f""" SELECT rs.Customer_ID , rs.Number_of_orders , rs.Total_customer_amount ,

Re: Using UDF based on Numpy functions in Spark SQL

2020-12-23 Thread Mich Talebzadeh
//stackoverflow.com/questions/43484269/how-to-register-udf-to-use-in-sql-and-dataframe > > On Wed, Dec 23, 2020 at 12:52 PM Mich Talebzadeh < > mich.talebza...@gmail.com> wrote: > >> Hi, >> >> >> This is a shot in the dark so to speak. >> >> >>

Re: Using UDF based on Numpy functions in Spark SQL

2020-12-24 Thread Mich Talebzadeh
On Wed, 23 Dec 2020 at 23:50, Sean Owen wrote: > Why do you want to use this function instead of the built-in stddev > function? > > On Wed, Dec 23, 2020 at 2:52 PM Mich Talebzadeh > wrote: > >> Hi, >> >> >> This is

Re: Using UDF based on Numpy functions in Spark SQL

2020-12-24 Thread Mich Talebzadeh
te: > I don't know which one is 'correct' (it's not standard SQL?) or whether > it's the sample stdev for a good reason or just historical now. But you can > always call STDDEV_SAMP (in any DB) if needed. It's equivalent to numpy.std > with ddof=1, the Besse

Re: Jdbc Hook in Spark Batch Application

2020-12-25 Thread Mich Talebzadeh
If I understand correctly you can store JDBC connection properties in a configuration file and refer to them in the code in your Scala/python module. Example: # oracle variables driverName = "oracle.jdbc.OracleDriver" _username = "user" _password = ".." _dbschema = "schema" _dbtable = "table
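
A minimal sketch of that idea using the standard-library configparser so credentials stay out of the code, assuming an existing SparkSession named spark; the file name, section and key names are illustrative:

    import configparser

    cfg = configparser.ConfigParser()
    cfg.read("connections.ini")
    ora = cfg["oracle"]          # [oracle] section holding driverName, url, dbschema, dbtable, ...

    df = (spark.read.format("jdbc")
          .option("url", ora["url"])
          .option("driver", ora["driverName"])
          .option("dbtable", f'{ora["dbschema"]}.{ora["dbtable"]}')
          .option("user", ora["username"])
          .option("password", ora["password"])
          .load())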

Re: Using UDF based on Numpy functions in Spark SQL

2020-12-26 Thread Mich Talebzadeh
n write an aggregate UDF that calls numpy and register it for SQL, > but, it is already a built-in. > > On Thu, Dec 24, 2020 at 8:12 AM Mich Talebzadeh > wrote: > >> Thanks for the feedback. >> >> I have a question here. I want to use numpy STD as well but jus
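
For completeness, a hedged sketch of both routes discussed in the thread: the built-in STDDEV_SAMP and a registered numpy UDF applied to a collected list. The sales table and columns are placeholders; as noted in the thread, ddof=1 matches the sample standard deviation:

    import numpy as np
    from pyspark.sql.types import DoubleType

    spark.udf.register("numpy_std",
                       lambda xs: float(np.std(xs, ddof=1)),
                       DoubleType())

    spark.sql("""
    SELECT Customer_ID
         , STDDEV_SAMP(amount)             AS builtin_std
         , numpy_std(collect_list(amount)) AS numpy_std
    FROM sales
    GROUP BY Customer_ID
    """).show()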

Spark DF does not rename the column

2021-01-04 Thread Mich Talebzadeh
Hi, version 2.4.3. I don't know the cause of this. This renaming of DF columns used to work fine. I did a couple of changes to the Spark/Scala code, not relevant to this table, and it refuses to rename the columns for a table! val summaryACC = HiveContext.table("summaryACC") summaryACC.printSchema()

Re: Spark DF does not rename the column

2021-01-04 Thread Mich Talebzadeh
2021 at 18:09, Lalwani, Jayesh wrote: > You don’t have a column named “created”. The column name is “ceated”, > without the “r” > > > > *From: *Mich Talebzadeh > *Date: *Monday, January 4, 2021 at 1:06 PM > *To: *"user @spark" > *Subject: *[EXTERNAL] Spar

Re: Spark DF does not rename the column

2021-01-05 Thread Mich Talebzadeh
> and then: > > withColumnRenamed("created","Date Calculated"). > > > On Mon, 4 Jan 2021 at 19:12, Lalwani, Jayesh > wrote: > >> You don’t have a column named “created”. The column name is “ceated”, >> without the “r” >> &
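
The upshot of the thread as a short sketch (the call is the same in Scala and PySpark): check the schema for the column name that actually exists, then rename that one; here the existing name is "ceated", per the diagnosis above:

    summaryACC.printSchema()   # confirms the column is actually "ceated"
    summaryACC = summaryACC.withColumnRenamed("ceated", "Date Calculated")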

A question on extrapolation of a nonlinear curve fit beyond x value

2021-01-05 Thread Mich Talebzadeh
Hi, I am not sure the Spark forum is the correct avenue for this question. I am using PySpark with matplotlib to get the best fit for data using the Lorentzian Model. This curve uses 2010-2020 data points (11 on the x-axis). I need to predict the prices for years 2021-2025 based on this fit. So

Re: A question on extrapolation of a nonlinear curve fit beyond x value

2021-01-05 Thread Mich Talebzadeh
any way I can use some plt functions to provide extrapolated values for 2021 and beyond? Thanks On Tue, 5 Jan 2021 at 14:43, Sean Owen wrote: > If your data set is 11 points, surely this is not a distributed problem? > or are you asking how to build tens of thousands of those projecti

Re: A question on extrapolation of a nonlinear curve fit beyond x value

2021-01-05 Thread Mich Talebzadeh
is a single extrapolation, over 11 data points, you can just use Spark > to do the aggregation, call .toPandas, and do whatever you want in the > Python ecosystem to fit and plot that result. > > On Tue, Jan 5, 2021 at 9:18 AM Mich Talebzadeh > wrote: > >> thanks Sean.

Re: A question on extrapolation of a nonlinear curve fit beyond x value

2021-01-05 Thread Mich Talebzadeh
to do with Spark here. Spark gives > you the data as pandas so you can use all these tools as you like. > > On Tue, Jan 5, 2021 at 9:38 AM Mich Talebzadeh > wrote: > >> Thanks again >> >> Just to clarify, I want to see the average price for year 2021, 2022 etc >&
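
A hedged sketch of the suggested flow: aggregate in Spark, hand the small result to pandas, then extrapolate with ordinary Python tooling. The table, columns and the choice of a simple polynomial fit are all illustrative, and an existing SparkSession named spark is assumed:

    import numpy as np

    pdf = (spark.table("house_prices")
           .groupBy("year")
           .avg("price")
           .withColumnRenamed("avg(price)", "avg_price")
           .toPandas()
           .sort_values("year"))

    coeffs = np.polyfit(pdf["year"], pdf["avg_price"], deg=2)
    for year in range(2021, 2026):
        print(year, float(np.polyval(coeffs, year)))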

PyCharm, Running spark-submit calling jars and a package at run time

2021-01-08 Thread Mich Talebzadeh
Hi, I have a module in PyCharm which reads data stored in a BigQuery table and does plotting. At the command line on the terminal I need to add the jar file and the package to make it work. (venv) C:\Users\admin\PycharmProjects\pythonProject2\DS\src>spark-submit --jars ..\lib\spark-bigquery-with-

Re: PyCharm, Running spark-submit calling jars and a package at run time

2021-01-08 Thread Mich Talebzadeh
> http://spark.apache.org/docs/latest/submitting-applications.html, look > for --py-files > > HTH > > On Fri, Jan 8, 2021 at 3:13 PM Mich Talebzadeh > wrote: > >> Hi, >> >> I have a module in Pycharm which reads data stored in a Bigquery table >> and does pl

Re: PyCharm, Running spark-submit calling jars and a package at run time

2021-01-08 Thread Mich Talebzadeh
'sparkstuff'? how would the Spark > app have this code otherwise? > > On Fri, Jan 8, 2021 at 10:20 AM Mich Talebzadeh > wrote: > >> Thanks Riccardo. >> >> I am well aware of the submission form >> >> However, my question relates to doing submission

Re: PyCharm, Running spark-submit calling jars and a package at run time

2021-01-08 Thread Mich Talebzadeh
I don't see anywhere that you provide 'sparkstuff'? how would the Spark >> app have this code otherwise? >> >> On Fri, Jan 8, 2021 at 10:20 AM Mich Talebzadeh < >> mich.talebza...@gmail.com> wrote: >> >>> Thanks Riccardo. >>&g

Re: PyCharm, Running spark-submit calling jars and a package at run time

2021-01-08 Thread Mich Talebzadeh
is isn't going to help submitting to a remote cluster though. You need > to explicitly include dependencies in your submit. > > On Fri, Jan 8, 2021 at 11:15 AM Mich Talebzadeh > wrote: > >> Hi Riccardo >> >> This is the env variables at runtime >
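
A hedged sketch of one way to ship the helper package to executors from inside the job, as an alternative to --py-files on the command line; the zip name and path are placeholders for whatever local package the app imports:

    # zip the package once, e.g. sparkstuff/ -> sparkstuff.zip, then:
    spark.sparkContext.addPyFile("/path/to/sparkstuff.zip")
    # the zip is shipped to every executor and added to the import path there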

Adding third party specific jars to Spark

2021-01-14 Thread Mich Talebzadeh
The primer for this was the process of developing code for accessing BigQuery data from PyCharm on premises so that advanced analytics and graphics can be done locally. Writes are an issue as BigQuery buffers data in temporary storage in a GCS bucket before pushing it into the BigQuery database. One o

PySpark, setting spark conf values in a function and catching for errors

2021-01-15 Thread Mich Talebzadeh
Hi, I have multiple routines that are using Spark for Google BigQuery that set these configuration values. I have decided to put them in a PySpark function as below with spark as an input. def setSparkConfSet(spark): try: spark.conf.set("GcpJsonKeyFile", config['GCPVariables']['js
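
A minimal sketch of such a wrapper with the error handling folded in; the GcpJsonKeyFile key mirrors the snippet, while the second key and the argument names are illustrative assumptions:

    import sys

    def set_spark_conf(spark, json_key_file, tmp_bucket):
        try:
            spark.conf.set("GcpJsonKeyFile", json_key_file)
            spark.conf.set("temporaryGcsBucket", tmp_bucket)
            return spark
        except Exception as e:
            print(f"Failed to set Spark conf: {e}", file=sys.stderr)
            sys.exit(1)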

Re: Running pyspark job from virtual environment

2021-01-17 Thread Mich Talebzadeh
Hi Rajat, Are you running this through an IDE like PyCharm or on the CLI? If you already have a Python virtual environment, then just activate it. The only env variable you need to set is PYTHONPATH, which you can export in your startup shell script (.bashrc etc.). Once you are in the virtual environme

Re: Running pyspark job from virtual environment

2021-01-17 Thread Mich Talebzadeh
n spark-env.sh and bashrc > > Thanks > Rajat > > > > On Sun, Jan 17, 2021 at 10:32 PM Mich Talebzadeh < > mich.talebza...@gmail.com> wrote: > >> Hi Rajat, >> >> Are you running this through an IDE like PyCharm or on CLI? >> >> If you alre

Re: Correctness bug on Shuffle+Repartition scenario

2021-01-17 Thread Mich Talebzadeh
Hi Shiao-An, With regard to your set-up below and I quote: "The input/output files are parquet on GCS. The Spark version is 2.4.4 with standalone deployment. Workers running on GCP preemptible instances and they being preempted very frequently." Am I correct that you have foregone deploying Data

Re: Issue with executer

2021-01-20 Thread Mich Talebzadeh
Hi Vikas, Are you running this on your local laptop etc or using some IDE etc? What is your available memory for Spark? Start with minimum set like below def spark_session_local(appName): return SparkSession.builder \ .master('local[1]') \ .appName(appName) \ .enable
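
A hedged completion of that builder (the snippet cuts off at .enable…), assuming Hive support was the intent:

    from pyspark.sql import SparkSession

    def spark_session_local(appName):
        return (SparkSession.builder
                .master("local[1]")        # one local core to start with
                .appName(appName)
                .enableHiveSupport()
                .getOrCreate())

    spark = spark_session_local("testApp")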

Re: Column-level encryption in Spark SQL

2021-01-21 Thread Mich Talebzadeh
Most enterprise databases provide Data Encryption of some form. For example Introduction to Transparent Data Encryption (oracle.com) As far as I know Hive supports text and sequence file column

Re: Pyspark How to groupBy -> fit

2021-01-21 Thread Mich Talebzadeh
I guess one drawback would be that the data cannot be processed and stored in Pandas DataFrames as these DataFrames store data in RAM. If you are going to run multiple parallel jobs then a single machine may not be viable? On Thu, 21 Jan 2021 at 16:29, Sean Owen wrote: > If you mean you want

Connecting to Hive on -premise from Spark in Cloud using JDBC driver for Hive

2021-01-27 Thread Mich Talebzadeh
Hi, This is as a matter of information. I have seen some threads on stackoverflow about issues accessing Hive remotely without using locality (Spark and Hive on the same Hadoop cluster) or using hive-site.xml under $SPARK/conf. That process works fine. However, challenges come when accessing
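
For orientation, a minimal sketch of the JDBC route being described, assuming an existing SparkSession named spark and the Hive JDBC (standalone) driver jar on the classpath; host, port, table and credentials are placeholders:

    df = (spark.read.format("jdbc")
          .option("url", "jdbc:hive2://onprem-host:10000/default")
          .option("driver", "org.apache.hive.jdbc.HiveDriver")
          .option("dbtable", "default.mytable")
          .option("user", "hduser")
          .option("password", "********")
          .load())
    df.show(5)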

Re: Connecting to Hive on -premise from Spark in Cloud using JDBC driver for Hive

2021-01-28 Thread Mich Talebzadeh
Hi Badrinath, This is a very valid question. The option of getting a ticket before being authorised is clearly not going to work here as any authentication of that nature applies to the environment where both Hive and Spark co-exist. So the question has to move to how we can authenticate connect

Re: Spark SQL query

2021-01-31 Thread Mich Talebzadeh
Hi Arpan, I presume you are interested in what the client was doing. If you have access to the edge node (where the spark code is submitted), look for the following file: ${HOME}/.spark_history example -rw-r--r--. 1 hduser hadoop 111997 Jun 2 2018 .spark_history just use shell tools (cat, grep etc) t

Re: Spark SQL query

2021-02-01 Thread Mich Talebzadeh
Hi Arpan, log in as any user that has execution rights for spark. Type spark-shell, do some simple commands then exit. Go to the home directory of that user and look for that hidden file ${HOME}/.spark_history - it will be there. HTH,

Re: Spark SQL query

2021-02-02 Thread Mich Talebzadeh
Hi Arpan. I believe all applications including spark and scala create a hidden history file You can go to home directory cd # see list of all hidden files ls -a | egrep '^\.' If you are using scala do you see .scala_history file? .scala_history HTH LinkedIn * https://www.linkedin.com/pr

Re: Spark SQL query

2021-02-02 Thread Mich Talebzadeh
create a directory in hdfs hdfs dfs -mkdir /spark_event_logs modify file $SPARK_HOME/conf/spark-defaults.conf and add these two lines spark.eventLog.enabled=true # do not use quotes below spark.eventLog.dir=hdfs://rhes75:9000/spark_event_logs Then run a job and check it hdfs dfs -ls /spark_eve

Re: Spark SQL query

2021-02-03 Thread Mich Talebzadeh
I gather what you are after is a code sniffer for Spark that provides a form of GUI to get the code that applications run against spark. I don't think Spark has this type of plug-in although it would be potentially useful. Some RDBMS provide this. Usually stored on some form of persistent storage

Re: Spark SQL query

2021-02-03 Thread Mich Talebzadeh
I suggest one thing you can do is to open another thread for this feature request "Having functionality in Spark to allow queries to be gathered and analyzed" and see what forum responds to it. HTH LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8P

Assertion of return value of dataframe in pytest

2021-02-03 Thread Mich Talebzadeh
Hi, In pytest you want to assert that the composed DF returns the correct result. Example df2 = house_df. \ select( \ F.date_format('datetaken', 'yyyy').cast("Integer").alias('YEAR') \ , 'REGIONNAME' \ , round(F.avg('averageprice').over(wSpecY)).alias('AVGPRICEPER
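
A hedged pytest sketch of that kind of assertion: build a tiny input DataFrame, run the same transformation, and compare the collected rows with the expected result. The fixture name and columns are illustrative, not the app's actual ones:

    def test_avg_price_per_year(spark_session):
        input_rows = [(2019, "London", 100.0), (2019, "London", 200.0)]
        df = spark_session.createDataFrame(
            input_rows, ["YEAR", "REGIONNAME", "averageprice"])

        result = (df.groupBy("YEAR", "REGIONNAME")
                    .avg("averageprice")
                    .withColumnRenamed("avg(averageprice)", "AVGPRICEPERYEAR"))

        expected = [(2019, "London", 150.0)]
        assert [tuple(r) for r in result.collect()] == expected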

Re: Assertion of return value of dataframe in pytest

2021-02-03 Thread Mich Talebzadeh
ve() except Exception as e: print(f"""{e}, quitting""") sys.exit(1) and call it in the program from sparkutils import sparkstuff as s s.writeTableToOracle(df2,"overwrite",config['OracleVariables']['dbschema']
