Hi all, I have a requirement to process a huge file (75 GB).

          Here is the sample data:
          <InodeSection>
           <inode>
             <id>100</id>
             <name>spark.conf</name>
              .
              .
              .
           </inode>
          </InodeSection>

          <INodeDirectorySection>
           <directory><parent>99</parent><inode>98</inode><inode>97</inode><inode>96</inode></directory>
          </INodeDirectorySection>
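
          One way to read both sections in a single streaming pass, without
          holding the whole 75 GB document in memory, might look like the
          sketch below. It assumes the two sections sit under a single root
          element (called <fsimage> here purely for illustration), and the
          names load_sections, names and parent_of are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def load_sections(stream):
    """Stream the XML once and return (names, parent_of).

    names:     inode id -> file or directory name
    parent_of: child inode id -> parent inode id
    """
    names = {}
    parent_of = {}
    open_elems = []  # stack of elements that are still open
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            open_elems.append(elem)
            continue
        open_elems.pop()
        if elem.tag == "inode" and elem.find("id") is not None:
            # Full inode record from the inode section.
            names[int(elem.findtext("id"))] = elem.findtext("name")
        elif elem.tag == "directory":
            # One directory entry: a parent id followed by its child ids.
            parent = int(elem.findtext("parent"))
            for child in elem.findall("inode"):
                parent_of[int(child.text)] = parent
        else:
            continue
        # Detach the finished record so memory use stays roughly flat.
        if open_elems:
            open_elems[-1].remove(elem)
    return names, parent_of


# Tiny stand-in for the real file; the <fsimage> wrapper is an assumption.
sample = b"""<fsimage>
<InodeSection>
 <inode><id>100</id><name>spark.conf</name></inode>
</InodeSection>
<INodeDirectorySection>
 <directory><parent>99</parent><inode>98</inode><inode>97</inode><inode>96</inode></directory>
</INodeDirectorySection>
</fsimage>"""

names, parent_of = load_sections(io.BytesIO(sample))
```

          For the real file, an open file object can be passed in place of
          the io.BytesIO wrapper.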


          Steps:
            1) Load the complete <InodeSection>
            2) Load the <INodeDirectorySection>
            3) Iterate over each inode and query both the InodeSection and the
               INodeDirectorySection to find its parents (up to the ROOT directory)
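
          Step 3 can be sketched as a walk over a child-to-parent map. This
          is only an illustration: it assumes the directory section has
          already been flattened into a dict (parent_of, a hypothetical
          name), and that the ROOT directory is the inode with no parent
          entry.

```python
def ancestors(inode_id, parent_of):
    """Return the parent ids of inode_id, nearest first, ending at the root.

    parent_of maps child inode id -> parent inode id. The root is assumed
    to be the inode that has no entry in parent_of.
    """
    chain = []
    seen = {inode_id}
    current = inode_id
    while current in parent_of:
        current = parent_of[current]
        if current in seen:
            # Guard against corrupt input that would otherwise loop forever.
            raise ValueError(f"cycle detected at inode {current}")
        seen.add(current)
        chain.append(current)
    return chain


# Made-up ids: 1 is the root, 99 a directory under it, 98/97/96 its children.
parent_of = {98: 99, 97: 99, 96: 99, 99: 1}
chain_for_98 = ancestors(98, parent_of)
```

          Each lookup is a local dict access, so the cost per inode is
          proportional to the depth of its path rather than to the number of
          network round trips.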


          Currently I do this as follows:
          1) Load the inodes into Redis
          2) Load the INodeDirectorySection into Redis
          3) For each inode, query Redis and compute its parents

          There are close to 200 million inodes, so the job is not
completing within the SLA. My maximum SLA for this operation is 2 to 2.5 hours.

          How do I use Spark here and expose an RDD as a service for my
requirement? Can this be done with other approaches?

--Senthil
