Data Intensive Computing (DIC)

Abstract
Data Intensive Computing (DIC) is a form of parallel computing aimed at processing massive, distributed, heterogeneous, and changing datasets. The architecture of a DIC platform is a set of abstract models that describe the functional composition, characteristics, coupling relationships, interaction patterns, and application scope of each layer of the platform.

This article studies the architecture of DIC platforms. First, the architectures of related parallel computing platforms are reviewed. Second, the design requirements of a DIC platform are analyzed, an integrated research method for DIC is discussed, and an architecture for a DIC platform with seven layers is proposed.

Finally, in order to verify its feasibility and effectiveness, a simple prototype system is implemented that supports parallel processing of mass image data. Compared with serial processing, the prototype system achieves a higher speed-up.

"Data-intensive computing is a class of parallel computing applications which use a data parallel approach to process large volumes of data typically terabytes or petabytes in size and typically referred to as big data." — Wikipedia

What is Data Intensive Computing?

Data Intensive Computing is a class of parallel computing that uses data parallelism to process large volumes of data, typically terabytes or petabytes in size. This large amount of data is generated every day and is referred to as Big Data.

In 2007, exactly a decade ago, an IDC white paper sponsored by EMC Corporation estimated the amount of information then stored in digital form at 281 exabytes. One can only imagine how massive it is today.

The figure reported by IDC shows that data is being generated faster than our capacity to analyze it. The methods generally used to solve traditional problems in computational science cannot be applied in this case.

Different architectures

Different architectures have been developed to allow the processing of such amounts of data. These architectures can be used to perform data intensive computing in the cloud, and they include:

Stream Processing: this architecture processes data using the single program, multiple data (SPMD) technique. Multiple computational resources allow each element of the input data to be processed independently. Sphere is an example of a system where stream processing is applied.
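To make the SPMD idea concrete, here is a minimal sketch in Python: every worker runs the same `transform` function, each on a different element of the input. The function name and the use of local threads are illustrative assumptions; a system such as Sphere distributes this work across cluster nodes rather than threads on one machine.

```python
from concurrent.futures import ThreadPoolExecutor

def transform(record):
    # The "single program" that every worker runs; records are
    # independent, so they can be processed in any order, in parallel.
    return record * record

records = list(range(8))
with ThreadPoolExecutor(max_workers=4) as pool:
    # "Multiple data": each input element is handed to whichever
    # worker is free; pool.map preserves the input order of results.
    results = list(pool.map(transform, records))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Because no element depends on any other, adding workers (or machines) scales the throughput, which is exactly the property stream-processing architectures exploit.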

MapReduce: this is a common programming model used to process large data sets with a parallel, distributed algorithm on a cluster.
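A single-machine sketch of the MapReduce model, using the classic word-count example (the documents and function names are illustrative, not part of any particular framework's API): the map phase emits key-value pairs, a shuffle groups them by key, and the reduce phase aggregates each group.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one document.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group intermediate pairs by key, as the framework
    # would do between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate all counts recorded for one word.
    return key, sum(values)

documents = ["big data big compute", "data flows"]
intermediate = chain.from_iterable(map_phase(d) for d in documents)
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts)  # {'big': 2, 'data': 2, 'compute': 1, 'flows': 1}
```

In a real cluster, map tasks run near the data on many nodes and the shuffle moves intermediate pairs over the network, but the programmer only writes the map and reduce functions.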

Hybrid DBMS: this architecture is designed to combine the benefits of a traditional DBMS with those of MapReduce, pairing a shared-nothing parallel DBMS with the MapReduce execution model. This architecture offers superior data-computing performance and a high level of fault tolerance at the same time.

Data flow: this architecture represents a computation as a two-dimensional directed graph, in which nodes are operations and directed edges express the data dependencies between them.
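A toy sketch of the dataflow idea in Python (the three-stage pipeline, node names, and dictionary-based graph encoding are all illustrative assumptions): each node lists the nodes it depends on, and a node is executed as soon as all of its inputs are available.

```python
# Nodes are operations; the "deps" lists are the directed edges
# (data dependencies) of the graph: load -> clean -> total.
graph = {
    "load":  {"deps": [],        "fn": lambda: [3, 1, 4, 1, 5]},
    "clean": {"deps": ["load"],  "fn": lambda xs: [x for x in xs if x > 1]},
    "total": {"deps": ["clean"], "fn": lambda xs: sum(xs)},
}

def run(graph):
    # Execute each node once all of its dependencies have produced
    # output, i.e. walk the graph in topological order.
    results = {}
    pending = dict(graph)
    while pending:
        for name, node in list(pending.items()):
            if all(dep in results for dep in node["deps"]):
                inputs = (results[dep] for dep in node["deps"])
                results[name] = node["fn"](*inputs)
                del pending[name]
    return results

print(run(graph)["total"])  # 12
```

Because the edges make every dependency explicit, a dataflow engine can run independent nodes in parallel and re-execute only the affected subgraph when an input changes.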
