This is how Hadoop has cut processing times compared to the mainframe.
Have you ever wondered how to efficiently perform complex processing on large volumes of data? In providing service to a large financial group, CaixaBank Tech has come across this type of problem, and we’d like to explain how we solved this Big Data processing challenge.
Why use Big Data in a MIS Reporting system?
Let’s take one of the most used apps at our branch network and central services as an example: the MIS Reporting app, which provides analytical accounting information to all employees. The system generates consolidated financial data on a daily and monthly basis. It is used to monitor many different financial indicators, such as average balance, margin or ratios.
The app requires a large dataset to be processed since it offers several thousand indicators seen from different analysis axes or scales (organisational hierarchy, product hierarchy, time periods, etc.).
The importance of pre-calculating Big Data
Complexity arises because all these indicators have to be pre-calculated for the different analysis scales mentioned earlier. The aim of all this pre-calculation is to provide a fast response when employees check and analyse the information through an app that offers many filter options. And if this weren’t already complicated enough, activity analysis often requires new indicators or new analysis scales that need to be added quickly.
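To give an idea of what pre-calculating an indicator for every analysis scale involves, here is a minimal Python sketch. The axes (`branch`, `product`, `month`), the field names and the records are all hypothetical toy data, not the real MIS Reporting model; the idea is that aggregating one measure over every subset of axes (as a Hive `GROUPING SETS` query would) yields a lookup table that answers any filter combination instantly.

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical toy records: (branch, product, month, balance)
records = [
    ("B1", "loans",    "2023-01", 100.0),
    ("B1", "deposits", "2023-01", 200.0),
    ("B2", "loans",    "2023-01", 300.0),
]

AXES = ("branch", "product", "month")  # hypothetical analysis axes

def precompute(records):
    """Aggregate the balance for every subset of axes (grouping-sets style)."""
    cube = defaultdict(float)
    for branch, product, month, balance in records:
        dims = {"branch": branch, "product": product, "month": month}
        # One aggregate per subset of axes, including the empty (grand-total) set.
        for r in range(len(AXES) + 1):
            for axes in combinations(AXES, r):
                key = tuple((a, dims[a]) for a in axes)
                cube[key] += balance
    return dict(cube)

cube = precompute(records)
# Pre-aggregated lookups answer any filter instantly:
print(cube[(("product", "loans"),)])  # → 400.0 (loans across all branches/months)
print(cube[()])                       # → 600.0 (grand total)
```

With 15 real axes the number of grouping combinations explodes, which is exactly why this pre-calculation is heavy Big Data work rather than something done at query time.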
In terms of volume, we process an input of 8 billion records, which are used to calculate 4,600 complex indicators (each derived from others in a more or less complex way, such as the aggregate financial margin across the roughly 15 analysis axes/scales currently in use), and the result is an output of around 12 billion records.
The beginnings in the mainframe
This process was initially developed on the most common technology in finance – the mainframe. To make it efficient, the information was sliced and run in parallel batch processes. This reduced run times but generated consumption spikes of over 20% on the mainframe, leading to high costs. Because of this consumption, the Big Data process could not be run during office hours.
Hadoop, an alternative to mainframe
Seeing that this solution was not scalable and could not support the new developments the business required, a technology discovery process was started to find the best alternative for migrating these mainframe processes. Finally, a parallel processing environment based on Hadoop – specifically HIVE – was selected.
HIVE is one of the components of the bank’s Big Data solution, known as the information and analysis environment, which comprises an ecosystem of technologies. In a Hadoop environment, the solution is based on the concepts of MAP and REDUCE, following the classic “divide and conquer” approach.
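The “divide and conquer” idea can be sketched in a few lines of Python. This is a conceptual illustration, not the bank’s actual code: the chunks, branch names and amounts are invented, and in a real cluster each MAP runs on a different node against an HDFS split rather than in a local loop.

```python
from functools import reduce
from collections import Counter

# Hypothetical dataset of (branch, amount) records, pre-split into chunks,
# much as HDFS splits a large file across the cluster.
chunks = [
    [("B1", 10), ("B2", 20)],
    [("B1", 5),  ("B3", 7)],
]

def map_phase(chunk):
    """MAP: summarise each chunk independently (parallel across nodes)."""
    partial = Counter()
    for branch, amount in chunk:
        partial[branch] += amount
    return partial

def reduce_phase(a, b):
    """REDUCE: merge partial results into one consolidated result."""
    return a + b  # Counter addition sums values per key

partials = [map_phase(c) for c in chunks]        # one mapper per split
totals = reduce(reduce_phase, partials, Counter())
print(dict(totals))  # → {'B1': 15, 'B2': 20, 'B3': 7}
```

HIVE hides this machinery: a single SQL-like query is compiled down into MAP and REDUCE stages equivalent to the two functions above.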
How do we process Big Data with HIVE?
The processes were developed with the ODI tool, a market ETL/ELT solution that offers a graphic development environment which ultimately generates the HIVE code. The graphic environment helped developers in their work, since the learning curve was shorter than coding directly with MapReduce, HIVE and similar technologies. The new process is orchestrated with YARN, which distributes the parallel tasks across the cluster’s CPUs; this parallel phase is the MAP, and the merging of all the partial results into a single output is the REDUCE.
Hadoop improvement over the mainframe
With Hadoop, we have managed to migrate most of the bundling performed on the mainframe, significantly improving Big Data processing times. For example, business bundling went from taking 30 hours on the mainframe to 10 in the new environment. Moreover, since the HIVE information environment does not serve branches directly, Hadoop can also run during office hours, letting us further improve data generation times and respond to new business requests much more efficiently.