external table with parquet files: problem querying in sparksql since data is stored as integer while hive schema expects a timestamp

2022-07-20 Thread Joris Billen
Hi, below sounds like something that someone will have experienced... I have external tables of Parquet files with a Hive table defined on top of the data. I don't manage or know the details of how the data lands. For some tables there are no issues when querying through Spark, but for others there is an issue:

Fwd: Pyspark and multiprocessing

2022-07-20 Thread Bjørn Jørgensen
So now I have tried to run this function in a ThreadPool. But it doesn't seem to work. -- Forwarded message - From: Sean Owen Date: Wed, 20 Jul 2022 at 22:43 Subject: Re: Pyspark and multiprocessing To: Bjørn Jørgensen I don't think you ever say what doesn

Pyspark and multiprocessing

2022-07-20 Thread Bjørn Jørgensen
I have 400k JSON files, each between 10 kB and 500 kB in size. They don't have the same schema, so I have to loop over them one at a time. This works, but it's very slow. This process takes 5 days! So now I have tried to run this function in a ThreadPool, but it doesn't seem to work. *St

Re: Dependencies issue in spark

2022-07-20 Thread rajat kumar
Hi Users, Pasting the full stack trace. Could anyone please advise? The main error toward the end is: NoClassDefFoundError. Attaching the full trace. Lost task 31.0 in stage 6.0 (TID 235, 10.139.64.16, executor 9): java.lang.ExceptionInInitializerError at sun.reflect.NativeMethodAccessorImpl.invoke0(Native

Re: [MLlib] Differences after version upgrade

2022-07-20 Thread Sean Owen
How different? I think quite small variations are to be expected. On Wed, Jul 20, 2022 at 9:13 AM Roger Wechsler wrote: > Hi! > > We've been using Spark 3.0.1 to train logistic regression models > with MLlib. > We've recently upgraded to Spark 3.3.0 without making any other code > changes and no

[MLlib] Differences after version upgrade

2022-07-20 Thread Roger Wechsler
Hi! We've been using Spark 3.0.1 to train logistic regression models with MLlib. We've recently upgraded to Spark 3.3.0 without making any other code changes and noticed that the trained models are different from the ones trained with 3.0.1 and therefore behave differently when used for

Re: Building a ML pipeline with no training

2022-07-20 Thread Sean Owen
The data transformation is all the same. Sure, linear regression is easy: https://spark.apache.org/docs/latest/ml-classification-regression.html#linear-regression These are components that operate on DataFrames. You'll want to look at VectorAssembler to prepare data into an array column. There are

Dependencies issue in spark

2022-07-20 Thread rajat kumar
Hello, I am using Maven with Spark. After upgrading Scala from 2.11 to 2.12 I am getting the error below, and I have observed it occurring while reading Avro. Appreciate help. ShuffleMapStage 6 (save at Calling.scala:81) failed in 0.633 s due to Job aborted due to stage failure: Task 83 in stage 6.0 fail

Building a ML pipeline with no training

2022-07-20 Thread Edgar H
Morning everyone, The question may seem too broad, but I will try to summarize as much as possible: I'm used to working with Spark SQL, DataFrames and such on a daily basis, easily grouping, getting extra counters and using functions or UDFs. However, I've come to a scenario where I need to make some predictions