Hi all,

I have a requirement to process a huge file (75 GB). Here is the sample data:

<InodeSection>
  <inode>
    <id>100</id>
    <name>spark.conf</name>
    . . .
  </inode>
</InodeSection>
<INodeDirectorySection>
  <directory><parent>99</parent><inode>98</inode><inode>97</inode><inode>96</inode></directory>
</INodeDirectorySection>

Steps:
1) Load the complete <InodeSection>.
2) Load the INodeDirectorySection.
3) Iterate over each inode and query the InodeSection as well as the INodeDirectorySection to find its parents (up to the ROOT directory).

Currently I do this as follows:
1) Load the inodes into Redis.
2) Load the InodeDirectorySection into Redis.
3) For each inode, query Redis and compute the parents.

The number of inodes is close to 200 million, so the job does not complete within the SLA. I have a maximum SLA of 2 to 2.5 hours for this operation.

How do I use Spark here and expose an RDD as a service for my requirement? Can this be done with other methodologies?

--Senthil
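
For illustration, the parent lookup in step 3 could be sketched roughly as below. This is only a minimal single-process sketch in plain Python, not the Redis or Spark job itself. The <root> wrapper element, the extra sample inodes, the empty name for the root inode, and the helper names (load_names, load_parents, ancestors, full_path) are all assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for the real 75 GB file. The <root> wrapper, the extra
# inodes, and the empty name for the root inode are assumptions.
SAMPLE = """
<root>
<InodeSection>
  <inode><id>99</id><name></name></inode>
  <inode><id>98</id><name>conf</name></inode>
  <inode><id>100</id><name>spark.conf</name></inode>
</InodeSection>
<INodeDirectorySection>
  <directory><parent>99</parent><inode>98</inode></directory>
  <directory><parent>98</parent><inode>100</inode></directory>
</INodeDirectorySection>
</root>
"""


def load_names(root):
    """Map inode id -> name from the InodeSection."""
    return {
        int(node.findtext("id")): node.findtext("name") or ""
        for node in root.findall("./InodeSection/inode")
    }


def load_parents(root):
    """Invert the directory listing into a child id -> parent id map."""
    parent_of = {}
    for directory in root.findall("./INodeDirectorySection/directory"):
        parent = int(directory.findtext("parent"))
        for child in directory.findall("inode"):
            parent_of[int(child.text)] = parent
    return parent_of


def ancestors(inode_id, parent_of):
    """Walk upward until an inode with no recorded parent (the root)."""
    chain = []
    current = inode_id
    while current in parent_of:
        current = parent_of[current]
        chain.append(current)
    return chain


def full_path(inode_id, names, parent_of):
    """Join the names from the root down to the given inode."""
    ids = [inode_id] + ancestors(inode_id, parent_of)
    return "/".join(names[i] for i in reversed(ids)) or "/"


root = ET.fromstring(SAMPLE)
names = load_names(root)
parent_of = load_parents(root)

print(ancestors(100, parent_of))
print(full_path(100, names, parent_of))
```

The point of the sketch is that once the directory section is inverted into a child-to-parent map, each inode needs only a chain of local lookups rather than one remote query per level. Whether such a map fits in memory for roughly 200 million inodes has not been checked here.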