Recent advancements in data-intensive computing for scientific discovery are fueling a dramatic growth in the use of data-intensive iterative computations. The utility computing model introduced by cloud computing, combined with the rich set of cloud infrastructure services, offers a very attractive environment for scientists to perform such data-intensive computations. The challenges of large-scale distributed computations on clouds demand new computation frameworks that are specifically tailored to cloud characteristics in order to easily and effectively harness the power of clouds. The use of portable parallel programming in cloud computing, i.e. High Performance Computing (HPC), can increase the efficiency of cloud computing.
The current scientific computing landscape is vastly populated by a growing set of data-intensive computations that require enormous amounts of computational as well as storage resources and novel distributed computing frameworks. The pay-as-you-go cloud computing model provides an option for the computational and storage needs of such computations. The new generation of distributed computing frameworks such as MapReduce focuses on catering to the needs of such data-intensive computations.
Iterative computations are at the core of the vast majority of scientific computations. Many important data-intensive iterative scientific computations can be implemented as iterative computation and communication steps, in which the computations inside an iteration are independent and are synchronized at the end of each iteration through reduce and communication steps. This makes it possible to parallelize individual iterations using technologies such as MapReduce. Examples of such applications include dimensional scaling, many clustering algorithms, many machine learning algorithms, and expectation maximization applications, among others. The growth of such data-intensive iterative computations in number as well as importance is driven partly by the need to process massive amounts of data and partly by the emergence of data-intensive computational fields, such as bioinformatics, chemical informatics and web mining.
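The iteration structure described above can be illustrated with a minimal, single-process sketch. It is not Twister4Azure code: the function names (`map_assign`, `reduce_mean`, `iterative_kmeans`) and the choice of one-dimensional k-means as the example are assumptions made purely for illustration. Each map call is independent within an iteration, and the reduce step is the synchronization point that produces the input for the next iteration.

```python
from collections import defaultdict


def map_assign(point, centroids):
    """Map step (hypothetical): emit (index of nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))
    return nearest, (point, 1)


def reduce_mean(pairs, centroids):
    """Reduce step (hypothetical): sum the points per centroid and recompute means."""
    sums = defaultdict(lambda: [0.0, 0])
    for idx, (value, count) in pairs:
        sums[idx][0] += value
        sums[idx][1] += count
    # Keep the old centroid when no point was assigned to it.
    return [sums[i][0] / sums[i][1] if sums[i][1] else c
            for i, c in enumerate(centroids)]


def iterative_kmeans(points, centroids, max_iter=100, tol=1e-9):
    """Repeat map and reduce until the centroids stop moving."""
    for _ in range(max_iter):
        # Map tasks within one iteration do not depend on each other,
        # so a real runtime could execute them in parallel.
        pairs = [map_assign(p, centroids) for p in points]
        # The reduce step synchronizes the iteration and yields the
        # state that is broadcast to the next iteration's map tasks.
        new_centroids = reduce_mean(pairs, centroids)
        if max(abs(a - b) for a, b in zip(new_centroids, centroids)) < tol:
            return new_centroids
        centroids = new_centroids
    return centroids


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
    print(iterative_kmeans(data, [0.0, 5.0]))
```

In a distributed runtime the list comprehension over `points` would be replaced by map tasks scheduled across workers, while the loop and the convergence check would remain the driver logic.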
Twister4Azure is a distributed, decentralized iterative MapReduce runtime for the Windows Azure Cloud that was developed utilizing Azure cloud infrastructure services. Twister4Azure extends the familiar, easy-to-use MapReduce programming model with iterative extensions, enabling a wide array of large-scale iterative data analysis and scientific applications to utilize the Azure platform easily and efficiently in a fault-tolerant manner. Twister4Azure effectively utilizes the eventually consistent, high-latency Azure cloud services to deliver performance that is comparable to traditional MapReduce runtimes for non-iterative MapReduce, and it outperforms traditional MapReduce runtimes for iterative MapReduce computations. Twister4Azure has minimal management and maintenance overheads and provides users with the capability to dynamically scale the amount of computing resources up or down. Twister4Azure takes care of almost all the Azure infrastructure challenges (service failures, load balancing, etc.) and coordination challenges, and frees users from having to deal with cloud services directly. Windows Azure claims to allow users to “Focus on your applications, not the infrastructure.” Twister4Azure takes it one step further and lets users focus only on the application logic without worrying about the application architecture.
Applications of Twister4Azure can be categorized into three classes of application patterns. The first class consists of Map-only applications, which are also called pleasingly (or embarrassingly) parallel applications. Examples of this type of application include Monte Carlo simulations, BLAST+ sequence searches, parametric studies and most data cleansing and pre-processing applications. Section VI analyzes the BLAST+ Twister4Azure application.
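As a rough illustration of the Map-only pattern, the sketch below estimates pi with a Monte Carlo simulation. It is a plain Python example rather than Twister4Azure code, and the names `monte_carlo_task` and `estimate_pi` as well as the task and sample counts are assumptions chosen for illustration. Each task depends only on its own seed, so the tasks need no communication and no reduce phase beyond collecting the outputs.

```python
import random


def monte_carlo_task(seed, samples=10_000):
    """One independent map task (hypothetical): count random points that
    fall inside the unit quarter circle."""
    rng = random.Random(seed)  # a private generator keeps tasks independent
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(num_tasks=8, samples=10_000):
    """Run the map tasks and combine their outputs into an estimate of pi."""
    # The tasks share no state, so a runtime could run them on separate workers.
    hits = list(map(lambda seed: monte_carlo_task(seed, samples), range(num_tasks)))
    return 4.0 * sum(hits) / (num_tasks * samples)


if __name__ == "__main__":
    print(estimate_pi())
```

Seeding each task separately is a design choice that makes every task reproducible on its own, which is convenient when a failed task has to be re-executed.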
The second type of application is the traditional MapReduce application, which utilizes the reduction phase and other features of MapReduce. Twister4Azure contains sample implementations of Smith-Waterman-GOTOH (SWG) pairwise sequence alignment and WordCount as traditional MapReduce type applications.
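The traditional MapReduce pattern can be sketched with WordCount in a few lines of plain Python. This is only an in-memory illustration under assumed names (`map_words`, `shuffle`, `reduce_counts`, `word_count`); it does not reproduce the Twister4Azure sample implementation or its API.

```python
from collections import defaultdict


def map_words(line):
    """Map step (hypothetical): emit (word, 1) for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group the intermediate values by key, as a runtime does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step (hypothetical): sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce over an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["a b a", "B c"]))
```

Unlike the Map-only pattern, the result for each key here depends on the outputs of many map tasks, which is why the grouping and reduction phase is required.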
The third and most important type of application that Twister4Azure supports is the iterative MapReduce application. Many data-intensive scientific computation algorithms rely on iterative computations in which each iterative step can be easily specified as a MapReduce computation. Examples of such applications include dimension reduction, clustering, most machine learning algorithms, classification and regression analysis.
Developing Twister4Azure was an incremental process, which began with the development of pleasingly parallel cloud programming frameworks for bioinformatics applications utilizing cloud infrastructure services. The MRRoles4Azure MapReduce framework for the Azure cloud was developed based on the success of the pleasingly parallel cloud frameworks and was released in December 2010. The first public beta release of Twister4Azure occurred in May 2011. We started working on Twister4Azure to fill the void of distributed programming frameworks in the Azure environment (as of June 2010).
Cloud computing is a term used to describe both a platform and a type of application. A cloud computing platform dynamically provisions, configures, reconfigures, and deprovisions servers as needed. Servers in the cloud can be physical machines or virtual machines. Advanced clouds typically include other computing resources such as storage area networks (SANs), network equipment, firewalls and other security devices.
Cloud computing also describes applications that are extended to be accessible through the Internet. These cloud applications use large data centers and powerful servers that host Web applications and Web services. Anyone with a suitable Internet connection and a standard browser can access a cloud application.
Cloud computing infrastructures can allow enterprises to achieve more efficient use of their IT hardware and software investments. They do this by breaking down the physical barriers inherent in isolated systems, and automating the management of the group of systems as a single entity. Cloud computing is an example of an ultimately virtualized system, and a natural evolution for data centers that employ automated systems management, workload balancing, and virtualization technologies.
Figure 1 illustrates the high-level architecture of the cloud computing platform. It comprises a data center, IBM® Tivoli® Provisioning Manager, IBM® Tivoli® Monitoring, IBM® WebSphere® Application Server, IBM® DB2®, and virtualization components. This architecture diagram focuses on the core back end of the cloud computing platform; it does not address the user interface.
Tivoli Provisioning Manager automates imaging, deployment, installation, and configuration of the Microsoft Windows and Linux operating systems, along with the installation and configuration of any software stack that the user requests. Tivoli Provisioning Manager uses WebSphere Application Server to communicate the provisioning status and availability of resources in the data center, to schedule the provisioning and deprovisioning of resources, and to reserve resources for future use. As a result of the provisioning, virtual machines are created using the Xen hypervisor, or physical machines are created using Network Installation Manager, Remote Deployment Manager, or Cluster Systems Manager, depending upon the operating system and platform.
IBM Tivoli Monitoring Server monitors the health (CPU, disk, and memory) of the servers provisioned by Tivoli Provisioning Manager. DB2 is the database server that Tivoli Provisioning Manager uses to store the resource data. IBM Tivoli Monitoring agents installed on the virtual and physical machines communicate with the Tivoli Monitoring server, which collects the health data of those machines and makes it available to the user.