Accelerating data-driven scientific discoveries through the exploitation of advanced computing technologies.
Photo by Manuel Geissinger: https://www.pexels.com/photo/black-server-racks-on-a-room-325229/
A paradigm shift in scientific discovery
A few weeks before Jim Gray was lost during a solo sailing trip to the Farallon Islands near San Francisco in January 2007, the great American computer scientist and winner of the 1998 Turing Award envisioned a paradigm shift in the practice of science, driven by the increasing volume of data generated by scientific experiments and instruments, which he coined “the fourth paradigm”1.
More than a decade later, we have witnessed how data-intensive methods have been adopted to harness the vast amount of information and knowledge hiding behind this deluge of data, with various tools and technologies developed and adopted by scientific communities to collect and process it. Particle physicists, for example, have utilised the petabytes of readings collected by the detectors of the Large Hadron Collider at CERN to answer frontier questions about the fundamental forces of the universe. Meanwhile, social scientists have studied the millions of texts created on social media every second to perform sentiment analysis for behavioural studies.
1 The Fourth Paradigm: Data-Intensive Scientific Discovery: https://www.microsoft.com/en-us/research/publication/fourth-paradigm-data-intensive-scientific-discovery/
With the increasingly widespread use of sensor and Internet of Things (IoT) technologies, the Fourth Industrial Revolution (IR4.0) has brought in another data tsunami. Traditional socio-economic activities are becoming more knowledge-intensive and innovation-driven, as highlighted in the 10-10 Malaysian Science, Technology, Innovation and Economy (MySTIE) Framework.
For example, the agricultural sector has adopted drone technology, together with advanced intelligence systems, to improve modern rice production. Images taken by the drones are streamed back to the Cloud via a 4G network, to be processed by machine-learning models for real-time analysis of crop health.
This information is then used to determine the interventions needed to increase crop yields. We have seen the importance of data across all socio-economic sectors, and we know that success relies heavily on having the right tools and the know-how to facilitate the data exploration process.
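The crop-health pipeline described above can be sketched in a few lines. This is a hypothetical illustration only: the “model” below is a simple greenness-index threshold standing in for a trained ML classifier, and the tile identifiers and threshold value are invented for the example.

```python
# Hypothetical sketch of the drone-to-analysis pipeline: each image tile
# arriving from the cloud is reduced to a mean RGB reading, then classified.
# A real deployment would use a trained ML model instead of this threshold.

def greenness_index(r: float, g: float, b: float) -> float:
    """Excess-green index (ExG = 2g - r - b on chromaticity coordinates),
    a common baseline for detecting vegetation in RGB imagery."""
    total = r + g + b
    if total == 0:
        return 0.0
    r, g, b = r / total, g / total, b / total
    return 2 * g - r - b

def classify_tile(mean_rgb, threshold: float = 0.1) -> str:
    """Label a field tile 'healthy' if its ExG exceeds an assumed threshold."""
    return "healthy" if greenness_index(*mean_rgb) > threshold else "stressed"

def analyse_stream(tiles):
    """Process a stream of (tile_id, mean_rgb) readings as they arrive."""
    return {tile_id: classify_tile(rgb) for tile_id, rgb in tiles}

if __name__ == "__main__":
    readings = [("A1", (60, 180, 50)), ("A2", (120, 110, 90))]
    print(analyse_stream(readings))  # tile A1 reads green, A2 does not
```

The per-tile results from such a loop are what would feed the intervention decisions mentioned above.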
Datascope – a new instrument for exploring data
Four hundred years ago, mankind invented the telescope to explore the celestial objects floating in the infinite universe. With the current deluge of data, it is clear that we need a “datascope” to accelerate the analysis process in exploring and exploiting this wealth of data.
The datascope should provide the necessary functionalities in each phase of the data lifecycle: acquisition, processing, analysis, curation, sharing and reuse. Data acquired from various sources must be organised in a data storage platform before it can be further processed. Processing large volumes of data also requires a high-performance computing platform, especially when machine-learning (ML) methods are used. A typical ML model training process requires graphics processing units (GPUs) and computational power beyond what a normal laptop can provide.
Once the data has been analysed and results have been obtained, it is now time to enter the post-research phases. To extend the longevity and ensure the reusability of the data, we need to curate the data – adding useful metadata to describe it.
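As a minimal sketch of what curation looks like in practice, the snippet below attaches a descriptive metadata record to a dataset before deposit. The field names loosely follow Dublin Core / DataCite conventions, but the required-field list, the checker, and the example values are all illustrative assumptions, not the actual UM deposit schema.

```python
# Illustrative only: a toy metadata record for dataset curation.
# The required fields and example values are assumptions for this sketch.
import json

REQUIRED_FIELDS = {"title", "creator", "description", "date", "licence", "format"}

def make_record(**fields) -> dict:
    """Build a metadata record, refusing incomplete descriptions."""
    missing = REQUIRED_FIELDS - fields.keys()
    if missing:
        raise ValueError(f"incomplete metadata, missing: {sorted(missing)}")
    return fields

record = make_record(
    title="Paddy-field drone imagery, 2022 season",
    creator="Hypothetical Research Group, Universiti Malaya",
    description="Aerial RGB images used for crop-health analysis",
    date="2022-06-30",
    licence="CC-BY-4.0",
    format="GeoTIFF",
)

print(json.dumps(record, indent=2))  # deposited alongside the dataset
```

Even a lightweight record like this is what lets a repository index the dataset and lets a future researcher judge whether it can be reused.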
The curated data is then deposited into organisational or community-driven repositories for long-term preservation and storage. These repositories are in turn integrated with national or domain data discovery services, such as the Malaysia Open Science Platform2, to increase the visibility and sharing of the data, which can then fuel future research, completing the entire data lifecycle.
2 Malaysia Open Science Platform (MOSP): https://www.akademisains.gov.my/mosp/
Fig. 1: Research data lifecycle
Seeing the importance of supporting data-driven research in Universiti Malaya (UM), Prof Noorsaadah Abd Rahman, the former Deputy Vice-Chancellor (Research & Innovation), formed the Data-Intensive Computing Centre (DICC), also known as the UM HPC unit, in December 2015.
DICC3 has been given two important tasks: to conduct data-intensive research and to provide High Performance Computing (HPC) services to the campus community.
Currently, DICC operates an HPC cluster consisting of 11 AMD Opteron nodes (a total of 352 cores and 2.8TB of RAM) and 4 AMD EPYC nodes (a total of 192 cores and 1TB of RAM), along with a small GPU farm comprising Nvidia Tesla K10, K20, K40, V100S and Titan GPUs. These computing resources are connected via a 10Gbps network to two distributed storage systems: a persistent storage system running on NFS (60TB) and a high-performance parallel file system running on Lustre (230TB). A 400TB storage system performs daily backups of the production system. The entire system is, in turn, connected to the rest of the world via a 1Gbps MyREN network.
3 Data-Intensive Computing Centre: https://www.dicc.um.edu.my/
Fig. 2: HPC Cluster
A data-intensive research odyssey
To perform advanced science, we need advanced tools. A simple analogy is Formula One racing: to win the race, you need a competent team, the best car technology and, above all else, a good driver.
Ever since UM embarked on its High Impact Research program a decade ago, we have recruited many world-class researchers as our “F1 drivers”. The question is, are we supporting them well enough to allow them to soar?
Technical Expertise – a competent team
Many research projects face a common problem: a lack of expertise in IT strategy, planning and implementation. Data-intensive research usually involves advanced-computing aspects beyond what a domain researcher can be expected to master. Thus, an ideal project team should include an IT expert from day one.
As a real-world example, an astrophysics project approached the DICC a year ago. The project aimed to set up an international observatory platform by streaming live data from radio dishes installed at several locations to a central station abroad. The success of this project depended heavily on the network bandwidth connecting the central station to the data storage service. Unfortunately, the researchers only approached the DICC when the project was entering its implementation stage, by which time it was too late to take any corrective measures.
Funding the Technology – the best car
“People are only interested to fund the rocket science, but not the rocket launcher”, said an attendee at the 2019 Beijing CODATA high-level workshop on Implementing Open Research Data Policy and Practice.
This is a non-trivial problem faced by advanced-computing service providers worldwide, and unfortunately, it is happening in Malaysia as well.
Our ministries currently only fund high-impact research, or research that is commercialisation-ready. We do not have a national advanced-computing facility, or a grant to establish one, and thus lack the means to provide essential support to data-intensive research. It is like giving our elite drivers a commodity car and hoping that they can still win the race.
“We must carefully formulate our problem to ensure the computation can be completed using the modest DICC resources; otherwise, our students cannot graduate,” said a computational physicist during a user engagement workshop we conducted two years ago.
Fig. 3: Close up of an HPC 10Gbps switchboard
Researchers’ Competency – a good driver
The last challenge relates to the researchers themselves. You need to be well-trained and skilful to manoeuvre an F1 car around the circuit; the same is true for researchers conducting their work on advanced-computing facilities.
To adopt data-intensive methods in our research, we must spend time learning the IT skills necessary for formulating solutions and utilising the machines effectively.
Many researchers make the mistake of assuming that using an HPC cluster is no different from using their personal computers. This is why most of the tickets raised at the DICC service desk relate to gaps in users’ own capabilities and understanding.
Building ramps for the researchers
“Intellectual ramps proved invaluable in accelerating the passage across the ‘chasm’ and facilitate the broad adoption of data that enables researchers to move incrementally from their current practice into the adoption of new methods”, said Malcolm Atkinson, the former UK e-Science Envoy, in the 2010 e-Science All Hands Meeting report4. “Ramps grow in different parts of the ecosystem. Resource providers may provide documentation, software and training to facilitate use.”
4 Shaping Ramps for Data-Intensive Research: http://eprints.soton.ac.uk/id/eprint/271235
DICC’s mission is to enable scientific discoveries through the exploitation of advanced computing technologies, by providing research communities with excellent IT service and support in addressing research computing challenges.
Our principle is to make advanced computing easier for researchers by setting up and operating these necessary ramps. Using them, our researchers can follow detailed online documentation to run their experiments on the HPC cluster, without needing to worry about the complexity of the system itself.
We also conduct training sessions for new users, and all of the materials, including recordings of previous sessions, are made available on our website5. Furthermore, users can reach us easily through a service desk6 to report problems with the service or to request new services.
Recently, DICC added a new tool to our HPC services: the Open OnDemand (OOD)7 portal, developed by the Ohio Supercomputer Center, the University at Buffalo Center for Computational Research and Virginia Tech. This portal allows our users to submit and manage their computational experiments in a user-friendly way. Through a common web browser, users can not only browse, upload and download their files stored in the HPC storage system, but also send new calculations to the HPC cluster and monitor their progress. For beginners, the OOD portal provides a seamless and painless way to access advanced computing infrastructure without a steep learning curve. Experienced users can monitor their computations and perform the tuning needed to get the most out of our computing resources.
5 DICC HPC Document: https://confluence.dicc.um.edu.my
6 DICC Service Desk: https://jira.dicc.um.edu.my/servicedesk/
7 Open OnDemand (OOD): https://openondemand.org/
Fig. 4: Staff member setting up a cluster
UM Open Science (UMOS)
UM has also launched the UMOS initiative as another means of promoting and improving the research data management process, with a focus on the 4Ps under UM’s three pillars: Policy & Process, People, and Platform.
UMOS is currently led by Prof. Shaliza Ibrahim, the Deputy Vice-Chancellor (Research & Innovation), supported by a steering committee involving the library, the research management centre, and the UM Centre of Information Technology.
The first pillar is to establish a research data management (RDM) policy that sets out the scope of UMOS and the relevant processes for managing research data. UM’s RDM policy adopts the globally recognised FAIR (Findability, Accessibility, Interoperability and Reusability) principles, which ensure that all the research data produced by UM communities is findable by the public, accessible via our UM research data repository, interoperable with national and international platforms, and reusable under well-controlled conditions.
One of the important processes is the data management plan (DMP), which describes the data that will be acquired or generated and how it will be managed throughout the project duration, as well as the long-term data preservation plan, as mentioned earlier in the research data lifecycle.
The second pillar focuses on capacity-building activities to prepare UM communities to advance the UMOS agenda. It encompasses both stakeholder engagement, to increase awareness amongst data originators (in this case, our researchers), and a series of training programmes created by the Malaysia Open Science Platform (MOSP), aimed at developing our librarians into data stewards capable of managing research data and assisting researchers in RDM processes. Data stewards are the main driving force for open science, ensuring that research data complies with the FAIR principles.
The third pillar aims to establish a platform for facilitating open science activities. It consists of a computing platform for researchers to manage and deposit their data, operated by DICC and the UM Centre of Information Technology, and a human platform, operated by the data stewards, providing helpdesk and data curation services to researchers.
Surfing the data wave
We are now in an era of data-driven progress. The massive scale of data being used and generated in IR4.0 research is inevitable. Thus, we must rethink the way we utilise data, making it a first-class citizen in our research. We need to craft a data-intensive research strategy and subsequent implementation plans, placing UM in a prime position to lead the nation, and impact the world.
The wave is coming. Are you going to grab a surfboard or a buoy?
Author and researcher featured:
Assoc. Prof. Dr. Liew Chee Sun,
Department of Computer System & Technology, Faculty of Computer Science and Information Technology
Email: firstname.lastname@example.org, https://umexpert.um.edu.my/csliew.html
Copyedit: Michael Hoe Guang Jian (email@example.com)