How to do sliding window operation on RDDs in Pyspark?

zakhavan Tue, 02 Oct 2018 09:30:41 -0700

Hello,

I have 2 text file in the following form and my goal is to calculate the
Pearson correlation between them using sliding window in pyspark:


123.00
-12.00
334.00
.
.
.

First I read these 2 text file and store them in RDD format and then I apply
the window operation on each RDD but I keep getting this error:
*
AttributeError: 'PipelinedRDD' object has no attribute window*

Here is my code:

if __name__ == "__main__":
    spark = SparkSession.builder.appName("CrossCorrelation").getOrCreate()
    #   DEFINE your input path
    input_path1 = sys.argv[1]
    input_path2 = sys.argv[2]



    num_of_partitions = 4
    rdd1 = spark.sparkContext.textFile(input_path1,
num_of_partitions).flatMap(lambda line1:
line1.split("\n").strip()).map(lambda strelem1: float(strelem1))
    rdd2 = spark.sparkContext.textFile(input_path2,
num_of_partitions).flatMap(lambda line2:
line2.split("\n").strip()).map(lambda strelem2: float(strelem2))

    #Windowing
    windowedrdd1= rdd1.window(3,2)
    windowedrdd2= rdd2.window(3,2)

    #Correlation between sliding windows

    CrossCorr = Statistics.corr(windowedrdd1, windowedrdd2,
method="pearson")


    if CrossCorr >= 0.7:
        print("rdd1 & rdd2 are correlated")

I know from the error that window operation is only for DStream but since I
have RDD here how I can do window operation on RDDs?

Thank you,

Zeinab





--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [email protected]

How to do sliding window operation on RDDs in Pyspark?

Reply via email to