[SparkSQL] Full Join Return Null Value For Function-Based Column

2021-01-17 Thread 刘 欢
Hi All: Here I have two tables. Table A (name, num): (tom, 2), (jerry, 3), (jerry, 4), (null, null). Table B (name, score): (tom, 12), (jerry, 10), (jerry, 8), (null, null). When I use spark.sql() to get a result from A and B with the SQL: select a.name as aName, a.date, b.name as bName from ( selec

Re: Correctness bug on Shuffle+Repartition scenario

2021-01-17 Thread Shiao-An Yuan
Hi, I am using Spark 2.4.4 standalone mode. On Mon, Jan 18, 2021 at 4:26 AM Sean Owen wrote: > Hm, FWIW I can't reproduce that on Spark 3.0.1. What version are you using? > > On Sun, Jan 17, 2021 at 6:22 AM Shiao-An Yuan > wrote: > >> Hi folks, >> >> I finally found the root cause of this issue

Re: Correctness bug on Shuffle+Repartition scenario

2021-01-17 Thread Mich Talebzadeh
Hi Shiao-An, With regard to your set-up below and I quote: "The input/output files are parquet on GCS. The Spark version is 2.4.4 with standalone deployment. Workers running on GCP preemptible instances and they being preempted very frequently." Am I correct that you have foregone deploying Data

Re: Correctness bug on Shuffle+Repartition scenario

2021-01-17 Thread Gourav Sengupta
Hi, I may be wrong, but this looks like a massively complicated solution for what could have been a simple SQL query. It always seems better to me to first reduce the complexity and then solve it, rather than solve a problem which should not even exist in the first instance. Regards, Gourav On Sun, Jan

Re: Correctness bug on Shuffle+Repartition scenario

2021-01-17 Thread Sean Owen
Hm, FWIW I can't reproduce that on Spark 3.0.1. What version are you using? On Sun, Jan 17, 2021 at 6:22 AM Shiao-An Yuan wrote: > Hi folks, > > I finally found the root cause of this issue. > It can be easily reproduced by the following code. > We ran it on a standalone mode 4 cores * 4 instanc

Re: Running pyspark job from virtual environment

2021-01-17 Thread Mich Talebzadeh
Well, when you or an application log in to a Linux host (whether a physical tin box or a virtual node), a script called .bashrc in the home directory is executed. If it is a scheduled job, then it will execute the same script as well. In my Google Dataproc cluster of three nodes (one master and two workers), in

Re: Running pyspark job from virtual environment

2021-01-17 Thread rajat kumar
Hi Mich, Thanks for the response. I am running it through the CLI (on the cluster). Since this will be a scheduled job, I do not want to activate the environment manually. It should automatically pick up the path of the virtual environment to run the job. For that I saw the 3 properties which I mentioned. I think se

Re: Running pyspark job from virtual environment

2021-01-17 Thread Mich Talebzadeh
Hi Rajat, Are you running this through an IDE like PyCharm or on the CLI? If you already have a Python virtual environment, then just activate it. The only env variable you need to set is PYTHONPATH, which you can export in your startup shell script (.bashrc etc.). Once you are in virtual environme

Re: Dynamic Spark metrics creation

2021-01-17 Thread Ivan Petrov
Would a custom accumulator work for you? It should be doable for Map[String,Long] too: https://stackoverflow.com/questions/42293798/how-to-create-custom-set-accumulator-i-e-setstring On Sun, 17 Jan 2021 at 15:16, "Yuri Oleynikov (יורי אולייניקוב)" < yur...@gmail.com> wrote: > Hey Jacek, I’ll clar

Re: Dynamic Spark metrics creation

2021-01-17 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
Hey Jacek, I’ll clarify a bit: the bottom line is that I need the following metrics to be reported by Structured Streaming: Country-USA: 7, Country-Poland: 23, Country-Brazil: 56. The country names are included in the incoming events and are unknown at application startup. Thus registering accumula

Re: Spark Event Log Forwarding and Offset Tracking

2021-01-17 Thread Jacek Laskowski
Hi, > Forwarding Spark Event Logs to identify critical events like job start, executor failures, job failures etc to ElasticSearch via log4j. However I could not find any way to forward event log via log4j configurations. Is there any other recommended approach to track these application events? I

Re: understanding spark shuffle file re-use better

2021-01-17 Thread Jacek Laskowski
Hi, An interesting question that I must admit I'm not sure how to answer myself actually :) Off the top of my head, I'd **guess** unless you cache the first query these two queries would share nothing. With caching, there's a phase in query execution when a canonicalized version of a query is use

Re: Dynamic Spark metrics creation

2021-01-17 Thread Jacek Laskowski
Hey Yurii, > which is unavailable from executors. Register it on the driver and use accumulators on executors to update the values (on the driver)? Regards, Jacek Laskowski https://about.me/JacekLaskowski "The Internals Of" Online Books Follow me on https://twi

Re: Running pyspark job from virtual environment

2021-01-17 Thread rajat kumar
Hello, Can anyone confirm here please? Regards Rajat On Sat, Jan 16, 2021 at 11:46 PM rajat kumar wrote: > Hey Users, > > I want to run spark job from virtual environment using Python. > > Please note I am creating virtual env (using python3 -m venv env) > > I see that there are 3 variables fo

Re: Correctness bug on Shuffle+Repartition scenario

2021-01-17 Thread Shiao-An Yuan
Hi folks, I finally found the root cause of this issue. It can be easily reproduced by the following code. We ran it on a standalone mode 4 cores * 4 instances (total 16 cores) environment. ``` import org.apache.spark.TaskContext import scala.sys.process._ import org.apache.spark.sql.functions._