Hi,

We are evaluating standalone HDFS for one of our projects. The file system would store audio, video, image, and text files for various batch-processing applications hosted across multiple machines and platforms.

I would like some feedback on the best HDFS-based options (fuse-dfs, HBase, or others) given the requirements below:

1. The data to be stored is video, audio, images, XML, and text files.
2. These files need to be created, accessed, and deleted from both Linux and Windows machines.
3. The data is transient: we store it for a configurable amount of time (say 2 days) for processing across multiple machines, then delete it once processing is complete.
4. The data needs to be available as close as possible to the processing machines (Linux or Windows) to reduce network I/O.
5. The number of files that need to be stored per day is on the order of millions. The number of folders that need to be created for storing images for a single video will also be on the order of millions.
6. The number of files that need to be deleted per day will be on the order of millions, as we would be cleaning up the files for which processing has been completed.
7. The file size for audio/video files can range from a few KB to a few GB.
8. The file permissions needed would at most restrict some hosts to read-only vs. read-write access. This is good to have, not a must-have.
9. The setup can have 200-600 machines (a mix of Windows (30%) and Linux (70%)), each having 250-500 GB hard disk drives.
10. The file system should be mountable from Linux and Windows machines (on Windows, by mapping a network drive).

Please let me know if you need more details.

Thanks in advance,
Amit
