Efficient Strategies for Handling Large Data Sets

In 2022, the Malaysian government began integrating its services into the government hybrid cloud, also known as MyGovCloud. This initiative is in line with the fifth initiative under MyDigital, Malaysia’s Digital Economy Blueprint, to drive digital transformation of the public sector in tandem with digitalisation in other verticals such as agriculture and finance.

This year, Malaysia saw greater growth in cloud adoption as large players invested in the infrastructure needed to support digital transformation efforts. Over the last six months, Amazon Web Services (AWS) and Google committed to expanding their data centres in the country. This is just the beginning: as Malaysia ramps up its cloud adoption efforts, we will continue to see an influx of big data technologies. To handle large data volumes reliably, it is imperative to have a robust enterprise data management system in place.

Managing large volumes of high-velocity data brings a unique set of challenges, from the database's ability to ingest the data to the total cost of ownership (TCO) of the solution as data volumes expand. A complete approach to enterprise data management therefore requires a comprehensive strategy that accounts for all aspects of data management, including data quality, data integration, data governance, and data security.

In collaboration with Couchbase, Infosys has been supporting a large client in the tourism industry. The client uses Couchbase as the core technology behind industry-leading use cases that provide a better experience to its guests, including IoT devices that stream information from multiple sites run by the client. The data is varied, with over 20 different kinds of data points, and high velocity, generated at a rate of thousands of data points per second. It is streamed into local Couchbase clusters and processed before being sent to a cloud-based analytics cluster.
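To make the ingest path described above concrete, here is a minimal sketch using the Couchbase Python SDK (4.x-style imports are an assumption). The connection string, credentials, bucket name, and document shape are illustrative assumptions, not details of the actual deployment.

```python
# Illustrative sketch only: a local ingest loop for high-velocity IoT readings.
# Cluster address, credentials, bucket name and document shape are assumptions
# for the example, not details of the client's deployment.
from datetime import datetime, timezone

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

cluster = Cluster(
    "couchbase://localhost",
    ClusterOptions(PasswordAuthenticator("ingest_user", "ingest_password")),
)
collection = cluster.bucket("iot-telemetry").default_collection()

def ingest(readings):
    """Write a batch of sensor readings to the local cluster, one document each."""
    for reading in readings:
        key = f"{reading['site']}::{reading['sensor']}::{reading['ts']}"
        doc = {
            "type": "sensor_reading",
            "site": reading["site"],
            "sensor": reading["sensor"],
            "value": reading["value"],
            "ingested_at": datetime.now(timezone.utc).isoformat(),
        }
        collection.upsert(key, doc)

# Example batch from one site (hypothetical values).
ingest([
    {"site": "site-01", "sensor": "temp-lobby", "ts": 1700000000, "value": 21.4},
    {"site": "site-01", "sensor": "door-east", "ts": 1700000001, "value": 1},
])
```

In the deployment described above, documents written this way would then be processed locally before replication onward to the cloud-based analytics cluster.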

This solution has stood the test of time and provided valuable insights to the business on guest preferences and behaviors as well as serving important regulatory needs. Due to the success of the program, the number of locations and guests increased by 4x, resulting in a huge increase in the volume of data. The Couchbase clusters absorbed this flood of data with ease, and the Analytics cluster efficiently served the increased needs of business for insights into the data.

As data volumes grow, the size of the cluster needs to increase, and with it the cost of running the cluster. Given the prospect of further deployments, clients increasingly want to optimize the TCO of running their analytics cluster. In this case, the client introduced a Couchbase archive cluster built on the Magma storage engine and applied a six-week time-to-live (TTL) on the main cluster. TTL refers to the amount of time data is set to live on the main cluster before it expires; beyond that window, the data is retained only in the archive cluster. The main cluster had 14 data nodes, but the client was able to run the archive cluster with just four data nodes. This reduced the required hardware and storage capacity by three and a half times, resulting in annual savings of around $800,000. More importantly, it left enough capacity to accommodate growth for the next 12 to 14 months.
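As a rough sketch of how such an expiry window can be set, the fragment below attaches a six-week TTL to each document written to the main cluster, again using the Couchbase Python SDK; the connection details, bucket name, and key are assumptions for illustration, not the client's configuration.

```python
# Illustrative sketch: writing documents with a six-week TTL on the main cluster,
# so data past the window expires there and is kept only in the archive cluster.
# Connection details, bucket name and key are assumptions for the example.
from datetime import timedelta

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions, UpsertOptions

cluster = Cluster(
    "couchbase://main-cluster.example.com",
    ClusterOptions(PasswordAuthenticator("app_user", "app_password")),
)
collection = cluster.bucket("iot-telemetry").default_collection()

SIX_WEEKS = timedelta(weeks=6)

# Each document carries a six-week expiry; Couchbase removes it from the main
# cluster automatically once the TTL elapses.
collection.upsert(
    "site-01::temp-lobby::1700000000",
    {"type": "sensor_reading", "value": 21.4},
    UpsertOptions(expiry=SIX_WEEKS),
)
```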

There were two key features of the Magma solution that enabled this dramatic reduction (a back-of-envelope sizing sketch follows the list):

  1. Disk size was increased to 10 terabytes (TB) per node, compared to the original limit of 1.5 TB per node.
  2. The residency ratio for the buckets was reduced to 5% in the new solution (the lowest recommended being 1%), compared to 40% in the original solution.
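The back-of-envelope calculation below shows how those two levers translate into node counts: a cluster must have enough disk for the full data set and enough RAM to keep the residency-ratio fraction of it in memory. The total data size and RAM per node used here are hypothetical figures chosen only to make the arithmetic concrete; they are not the client's numbers.

```python
# Back-of-envelope sizing sketch. All figures are hypothetical and illustrate
# only how bigger disks per node and a lower residency ratio reduce the number
# of data nodes required; they are not taken from the actual deployment.
import math

def min_nodes(data_tb, disk_per_node_tb, residency_ratio, ram_per_node_tb):
    """Smallest node count satisfying both the on-disk footprint and the RAM
    needed to keep `residency_ratio` of the data set resident in memory."""
    disk_bound = math.ceil(data_tb / disk_per_node_tb)
    ram_bound = math.ceil(data_tb * residency_ratio / ram_per_node_tb)
    return max(disk_bound, ram_bound)

DATA_TB = 21            # hypothetical total data set size
RAM_PER_NODE_TB = 0.75  # hypothetical usable RAM per node

before = min_nodes(DATA_TB, disk_per_node_tb=1.5, residency_ratio=0.40,
                   ram_per_node_tb=RAM_PER_NODE_TB)
after = min_nodes(DATA_TB, disk_per_node_tb=10.0, residency_ratio=0.05,
                  ram_per_node_tb=RAM_PER_NODE_TB)

print(f"nodes needed before: {before}, after: {after}")
```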

Another interesting benefit we observed was the savings in disk usage due to block compression on the Magma engine, with disk usage per document falling from 2.28 kilobytes (KB) to 1.47 KB on disk.
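For a sense of scale, the short calculation below works out the per-document reduction from those two figures and projects it over a hypothetical corpus of one billion documents; the corpus size is an assumption for illustration only.

```python
# Quick arithmetic on the reported per-document figures; the document count is
# a hypothetical assumption used only to show the scale of the savings.
BEFORE_KB = 2.28
AFTER_KB = 1.47
DOCS = 1_000_000_000  # hypothetical corpus size

reduction_pct = (1 - AFTER_KB / BEFORE_KB) * 100
saved_tb = (BEFORE_KB - AFTER_KB) * DOCS / 1024 ** 3  # KB -> TB (binary units)

print(f"per-document reduction: {reduction_pct:.1f}%")
print(f"disk saved across {DOCS:,} documents: {saved_tb:.2f} TB")
```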

Ultimately, when large data sets are managed efficiently, the analytics cluster in the cloud can boost business intelligence through insights into operations. In this case, the data enabled the client's sales and marketing team to track performance, identify trends, and develop targeted campaigns. The same data set also surfaced patterns in customer behaviour that fed data-driven decisions, supporting better customer satisfaction and a superior experience.

By Genie Yuan, Head of Solutions Engineering APAC
