An Approach for Effectively Handling Small-Size Image Files in Hadoop

The small file handling problem is a major challenge in the Hadoop framework. Many approaches have been proposed and evaluated to deal with it, and the file merging strategy is one of the most popular in the literature. To address the small file problem, this paper evaluates a merging strategy in the domain of Content-Based Image Retrieval Systems (CBIRS). CBIRS form an ideal application domain for evaluating solutions to the small file problem in Hadoop because they involve a huge number of image files, which are small files. The approach used in this paper is shown to be more efficient than the solution provided by HIPI (Hadoop Image Processing Interface).


INTRODUCTION
Hadoop is among the most popular high-performance, Java-based, open-source distributed computing platforms, designed to store and process big data. Hadoop gives its best performance when handling large files and consists of two components, HDFS and MapReduce. HDFS is the primary storage component of Hadoop, with a default data block size of 128 MB, meant for managing and storing large files. When the size of a file is much smaller than the default HDFS block size, efficiency degrades. HDFS has a master-slave architecture and follows a write-once, read-many access pattern. One of the most important advantages of HDFS is data replication. MapReduce is regarded as the heart of Hadoop. It is a software framework, a programming model and the processing part of Hadoop that makes use of the cluster's computing resources to process and generate large datasets in a reliable and fault-tolerant manner. In MapReduce, a Hadoop program performs two separate and distinct tasks, the map task and the reduce task, and sequence files can be used as input/output formats. Both job trackers and task trackers are incorporated in MapReduce. MapReduce jobs need not be written in Java, although Hadoop itself is implemented in Java.
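To illustrate the two distinct map and reduce tasks mentioned above, the following is a minimal, generic MapReduce sketch in Java (the canonical word count); the class names are illustrative and are not taken from this paper.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map task: runs on each input split and emits a (word, 1) pair per token.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce task: receives all counts emitted for one word and sums them.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```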

PROBLEM IDENTIFICATION
The small file handling problem is a major challenge in the Hadoop framework. This paper pursues two major objectives for handling small files:
1) Explore a merging strategy for the small file handling problem.
2) Use the merging strategy for small files to evaluate the performance of Content-Based Image Retrieval Systems.
Different techniques and solutions have been proposed to deal with the small file handling problem in Hadoop. This paper demonstrates the application of Hadoop to image processing by evaluating CBIRS in order to explore a solution to the small file handling problem in the Hadoop framework.
This paper is divided into six sections. Section 3 discusses the related works, Section 4 presents the proposed approach, Section 5 shows the experimental results and Section 6 concludes the work. [20] proposed three opponent color axes that are used as the color axes of the histograms.
This paper makes use of a merging strategy for evaluating Content-Based Image Retrieval Systems together with a technique known as Histogram Intersection.
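For reference, a commonly used normalized form of the histogram intersection measure between a query histogram H_Q and a stored image histogram H_I with n bins is given below; this exact formulation is an assumption, since the paper does not reproduce the formula.

```latex
% Normalized histogram intersection; the score lies in [0, 1], and stored
% images whose score meets a chosen threshold are returned as matches.
d(H_Q, H_I) = \frac{\sum_{j=1}^{n} \min\bigl(H_Q(j),\, H_I(j)\bigr)}{\sum_{j=1}^{n} H_I(j)}
```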

Proposed Approach
Since the above-stated approaches require considerable pre-processing overhead, there is a need for an approach with less processing overhead and lower communication cost. To achieve this, we evaluate a content-based image retrieval system by implementing a merging strategy. Our approach also overcomes the overhead incurred by the Hadoop Image Processing Interface (HIPI).
Our proposed approach is divided into two stages:

Algorithm for Stage 1:
1) Take as input an image dataset in which each image is a small file.
2) Instead of processing each image file separately, extract the paths to these small files and store them in a single path file.
3) This path file is rendered on the Hadoop platform.
4) Images are read through this path file and content-based features are extracted from them, making up a histogram for each image file.
5) These histograms are stored as objects using SequenceFileOutputFormat (a mapper sketch for this stage is given after this list).
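To make Stage 1 concrete, the following is a minimal sketch of such a mapper; it is an illustrative assumption, not the paper's actual code. Each call receives one line of the path file, loads the corresponding image from HDFS, builds a quantized RGB color histogram (16 bins per channel is an assumed choice) and emits it as a text-encoded record, so that a job configured with SequenceFileOutputFormat stores each (image path, histogram) pair in the sequence file.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.stream.Collectors;
import javax.imageio.ImageIO;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Stage 1 mapper sketch: one input record = one image path from the path file.
public class HistogramMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable offset, Text pathLine, Context context)
            throws IOException, InterruptedException {
        Path imagePath = new Path(pathLine.toString().trim());
        FileSystem fs = imagePath.getFileSystem(context.getConfiguration());

        try (InputStream in = fs.open(imagePath)) {
            BufferedImage img = ImageIO.read(in);
            if (img == null) {
                return; // not a decodable image; skipped in this sketch
            }
            // Quantized RGB histogram: bins 0-15 red, 16-31 green, 32-47 blue.
            int[] hist = new int[48];
            for (int y = 0; y < img.getHeight(); y++) {
                for (int x = 0; x < img.getWidth(); x++) {
                    int rgb = img.getRGB(x, y);
                    hist[((rgb >> 16) & 0xFF) / 16]++;       // red channel
                    hist[16 + ((rgb >> 8) & 0xFF) / 16]++;   // green channel
                    hist[32 + (rgb & 0xFF) / 16]++;          // blue channel
                }
            }
            // Encode the histogram as comma-separated counts; each emitted
            // (path, histogram) pair becomes one record of the sequence file.
            String encoded = Arrays.stream(hist)
                    .mapToObj(Integer::toString)
                    .collect(Collectors.joining(","));
            context.write(new Text(imagePath.toString()), new Text(encoded));
        }
    }
}
```

Encoding the histogram as a Text value is a simplification; a custom Writable type would serve the same purpose.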

Algorithm for Stage 2:
1) Extract features from the input query image in the same manner, which likewise makes up a histogram.
2) This query histogram is compared with the stored histograms through the histogram intersection process.
3) Histograms whose comparison score meets a given threshold form the output results.

Experimental Results
The dataset used to perform this experiment is BSDS300, the Berkeley Segmentation Dataset, consisting of 300 images. The images are divided into a test set of 100 images and a training set of 200 images. Half of the segmentations were obtained by presenting the grayscale image and the other half by presenting the color image. The experiment was run on Hadoop 2.0 with CentOS 7, an Intel Core i7 processor at 2.00 GHz, 8 GB RAM, a 1 TB hard disk and Java JDK 1.8.0.
For performing image processing tasks in a distributed environment, the HIPI library for the Hadoop framework provides application programming interfaces (APIs); however, after the culling stage during processing one mapper reads one image file, so the problem is still there. Instead of processing each image file separately, our approach extracts the paths to these small files, stores them in a single path file and renders that file on the Hadoop platform. Images are read through this path file by a single mapper, content-based features are extracted to build a histogram for each image file, and the histograms are stored as objects using SequenceFileOutputFormat, which increases efficiency and resolves the limitation of classical HIPI. Since multiple image files are read by a single mapper, the MapReduce paradigm can be used efficiently; a driver sketch illustrating such a configuration is given below.
Audio and video files also fall under the category of small files. As future work, these file types can be explored as well, since they suffer from the same performance issues faced by small files stored in HDFS.
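The following driver sketch shows one way such a job could be configured; the class names and the lines-per-mapper value are assumptions for illustration, not details taken from the paper. NLineInputFormat hands a block of path-file lines to each mapper, so one mapper reads many small image files, and SequenceFileOutputFormat persists the emitted histograms as sequence-file records.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class HistogramJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cbir histogram extraction");
        job.setJarByClass(HistogramJobDriver.class);
        job.setMapperClass(HistogramMapper.class); // mapper from the Stage 1 sketch
        job.setNumReduceTasks(0);                  // map-only job: no reduce phase needed

        // Each mapper receives a block of image paths instead of a single file.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1000); // assumed value

        // Histograms are stored as records of a sequence file.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // the path file
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // histogram store
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```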

Conclusion and Future Scope
Generally, a Hadoop deployment has three major types of machine roles: the NameNode, the DataNodes and the client machines.

Figure 1: Hadoop Distributed File System architecture

1.1 Image Files: Image files are a standardized means of storing and organizing digital images that can be rasterized for use on a printer or computer display. There are hundreds of image file types, each created for a different purpose with its own pros and cons. The PNG, GIF and JPEG formats are most often used to display images on the internet. Image files also fall under the category of small files, whose size is less than the HDFS default block size, and therefore we use Hadoop to tackle such files.

Figure 2: Structure of CBIR

Chethan R. et al. [1] proposed separate algorithms for the mapper and reducer in MapReduce, introduced the concept of a merging strategy and also described a mathematical model. Sushmitha et al. [2] proposed an approach similar to the merging solution that avoids merging files whose size is greater than a threshold, but it was very time-consuming; their paper therefore made use of the MapReduce model to reduce time consumption, provide a minimum response time for batch analysis, handle sequence files and text files efficiently and reduce the time of executing and merging files. Kashmira P. Jayakar and Y. B. Gurav [3] proposed a solution called EHDFS, which consists of four operations: file mapping, file merging, file extraction and prefetching. In this approach, the NameNode maintains file and block metadata for a combined file, and an indexing mechanism is proposed for accessing individual files; furthermore, index prefetching is incorporated to improve I/O performance. Gupta Bharti et al. [4] describe five phases: a merging strategy, a local file strategy, fragmentation, caching and uploading of files to HDFS. The first phase, file merging, is similar to the solutions proposed by the above authors. In the second phase, an index file containing four parameters is created for each original file. In the third phase, merged files are partitioned in such a way that no internal fragmentation occurs. In the fourth phase, the NameNode stores the information of the index file and the merged file to avoid overhead. The last phase is used for correlated and index files.
Dong, Bo, et al. [5]

CBIRS pose a challenge because the Hadoop platform is best suited for handling large files. Image files are small files, and in order to analyze the contents of the various images, histograms are built that store the color information of the images and are written as sequence file output, i.e. the histograms are stored as objects. The input query image is processed in the same manner, making up a histogram. The query histogram is compared with the stored histograms through the histogram intersection process. In the small file handling problem and in HIPI, one mapper reads one image file, which is a limitation. This paper therefore makes use of SequenceFileOutputFormat, in which multiple files are read by a single mapper, so the time spent reading from the hard disks is lower and the processing time is reduced to a great extent.
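As an illustration of the histogram intersection comparison described above, the following sketch (an assumed implementation, not code from the paper) scores a query histogram against stored histograms already loaded into memory and returns the images whose score meets the threshold; reading the stored histograms back from the sequence file is omitted here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class HistogramRetrieval {

    // Normalized histogram intersection: sum of bin-wise minima divided by the
    // total mass of the stored histogram, giving a score in [0, 1].
    static double intersect(int[] query, int[] stored) {
        long minSum = 0;
        long storedSum = 0;
        for (int j = 0; j < stored.length; j++) {
            minSum += Math.min(query[j], stored[j]);
            storedSum += stored[j];
        }
        return storedSum == 0 ? 0.0 : (double) minSum / storedSum;
    }

    // Returns the paths of all stored images whose score meets the threshold.
    static List<String> retrieve(int[] query, Map<String, int[]> stored, double threshold) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, int[]> entry : stored.entrySet()) {
            if (intersect(query, entry.getValue()) >= threshold) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }
}
```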

Shashank et al. [11]
In [8], Shubham Bhandari et al. explained the architecture of NHAR to reduce the stress on the NameNode and discussed configuring the hash tables of the NameNode, in which each NameNode is confined to three hash tables. Chatuporn Vorapongkitipun and Natawut Nupairoj [9] proposed an approach based on HAR known as NHAR; NHAR enhances HAR so that additional files can be added, provides the ability to access smaller files in HDFS and improves the memory utilization of metadata. Guru Prasad M S et al. [10] proposed two techniques, Map Combine Reduce and File Manager, for handling small files in Hadoop. File Manager relieves the memory stress on the NameNode, manages the metadata, provides a mutable property to HDFS files, distributes files to computing nodes and performs four functions, with a separate algorithm proposed for each function. Another work describes how images can be retrieved using sophisticated, query-based semantics and precise measures of similarity, together with information related to the low-level properties of the image and the outputs produced. S. Murali et al. [13] proposed a method which gives the relationship between two pixels.