Hello,
I have 2 text file in the following form and my goal is to calculate the
Pearson correlation between them using sliding window in pyspark:
123.00
-12.00
334.00
.
.
.
First I read these 2 text file and store them in RDD format and then I apply
the window operation on each RDD but I keep getting this error:
*
AttributeError: 'PipelinedRDD' object has no attribute window*
Here is my code:
if __name__ == "__main__":
spark = SparkSession.builder.appName("CrossCorrelation").getOrCreate()
# DEFINE your input path
input_path1 = sys.argv[1]
input_path2 = sys.argv[2]
num_of_partitions = 4
rdd1 = spark.sparkContext.textFile(input_path1,
num_of_partitions).flatMap(lambda line1:
line1.split("\n").strip()).map(lambda strelem1: float(strelem1))
rdd2 = spark.sparkContext.textFile(input_path2,
num_of_partitions).flatMap(lambda line2:
line2.split("\n").strip()).map(lambda strelem2: float(strelem2))
#Windowing
windowedrdd1= rdd1.window(3,2)
windowedrdd2= rdd2.window(3,2)
#Correlation between sliding windows
CrossCorr = Statistics.corr(windowedrdd1, windowedrdd2,
method="pearson")
if CrossCorr >= 0.7:
print("rdd1 & rdd2 are correlated")
I know from the error that window operation is only for DStream but since I
have RDD here how I can do window operation on RDDs?
Thank you,
Zeinab
--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
---------------------------------------------------------------------
To unsubscribe e-mail: [email protected]