Hi there!
Let me explain my problem to see if you have a good solution to help me :)
Let's imagine that I have all my data in a DB or a file, which I load into a
DataFrame DF with the following columns:
*id | timestamp(ms) | value*
A | 100 | 100
A | 110 | 50
B | 100 | 100
B | 11
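(The table is cut off above.) To make the layout concrete, here is a minimal
PySpark sketch that builds this sample DataFrame; the column names come from
the header above, and the types are my assumption:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("example").getOrCreate()
    # Sample rows matching the layout above: id | timestamp(ms) | value
    df = spark.createDataFrame(
        [("A", 100, 100), ("A", 110, 50), ("B", 100, 100)],
        ["id", "timestamp", "value"],
    )
    df.show()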
It seems to be because of this issue:
https://issues.apache.org/jira/browse/SPARK-10925
I added a checkpoint, as suggested, to break the lineage and it worked.
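For anyone hitting the same issue, a minimal sketch of the workaround,
assuming Spark 2.1+ where Dataset.checkpoint is available (the checkpoint
directory is just an example path):

    # A reliable checkpoint directory (e.g. on HDFS) must be set first.
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")
    # checkpoint() materializes df and truncates its lineage, which avoids
    # the self-join analysis error described in SPARK-10925.
    df = df.checkpoint()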
Best regards,
Thanks Didac,
My bad, actually this code is incomplete; it should have been:

    dfAgg = df.groupBy("S_ID").agg(...)

I want to access the aggregated values (of dfAgg) for each row of 'df',
which is why I do a left outer join.
Also, regarding the second parameter, I am using this signature of join
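The message is cut off before the exact signature, but the pattern described
reads roughly like this minimal PySpark sketch (the sum aggregation and its
alias are placeholders for the real agg(...)):

    from pyspark.sql import functions as F

    # Aggregate per S_ID, then attach the aggregates to every row of df.
    dfAgg = df.groupBy("S_ID").agg(F.sum("value").alias("total_value"))
    # Joining on the column name keeps a single S_ID column in the result.
    result = df.join(dfAgg, on="S_ID", how="left_outer")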
Hi,
I'm working on integrating some pyspark code with Kafka. We'd like to use
SSL/TLS, and so want to use Kafka 0.10. Because structured streaming is
still marked alpha, we'd like to use Spark streaming. On this page,
however, it indicates that the Kafka 0.10 integration in Spark does not
support Python.
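For the record, the Structured Streaming Kafka source does expose the 0.10
client from Python, so one possible route to SSL (alpha status
notwithstanding) looks roughly like the sketch below; the broker, topic, and
truststore details are placeholders:

    # Requires the spark-sql-kafka-0-10 package on the classpath.
    stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9093")
              .option("subscribe", "mytopic")
              # "kafka."-prefixed options go to the underlying 0.10 consumer.
              .option("kafka.security.protocol", "SSL")
              .option("kafka.ssl.truststore.location", "/path/to/truststore.jks")
              .option("kafka.ssl.truststore.password", "changeit")
              .load())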
To initialize it per executor, I used a class with only class attributes and
class methods (like an `object` in Scala), but because I was using PySpark, it
was actually being instantiated once per process ☹
What I went for was the broadcast variable, but there is still something
suspicious with my application.
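To compare notes, this is the broadcast pattern I mean, with a hypothetical
lookup table standing in for the real per-executor state:

    # Hypothetical expensive state, built once on the driver.
    lookup = {"A": 1, "B": 2}
    bv = spark.sparkContext.broadcast(lookup)

    def process(record):
        # bv.value is read back from the broadcast once per worker process,
        # rather than being rebuilt for every task.
        return bv.value.get(record, 0)

    rdd = spark.sparkContext.parallelize(["A", "B", "C"])
    print(rdd.map(process).collect())  # [1, 2, 0]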
This is the code: I created a Java class by extending
org.apache.hadoop.hive.ql.udf.generic.GenericUDTF, and I create a SparkSession
as:

    SparkSession spark = SparkSession.builder()
        .enableHiveSupport()
        .master("yarn-client")
        .appName("SampleSparkUDTF_yarnV1")
        .getOrCreate();

and try to read
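The message is cut off here, but for completeness, a Hive GenericUDTF is
typically exposed to Spark SQL along these lines (shown from PySpark for
brevity; the function, class, and table names are made up for illustration):

    # Register the UDTF class from a jar on the classpath as a Hive function.
    spark.sql("CREATE TEMPORARY FUNCTION sample_udtf AS 'com.example.SampleUDTF'")
    # A UDTF expands each input row into zero or more output rows.
    spark.sql("SELECT sample_udtf(some_column) FROM some_table").show()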