So you’ve built a world-class data-driven marketing system for a client, one that enabled them to improve the effectiveness of their campaigns and sales. Data is coming in thick and fast, and so are the insights. There’s just one problem: data obesity is causing even your world-class infrastructure to huff and puff. What do you do next? The solution to ensuring the health of your system is no different from what we advocate for human health: shed some flab and get leaner.
Mindtree built a best-in-class data-driven marketing engine, from the ground up, for a global Consumer Packaged Goods (CPG) client. The engine consolidated data from nearly 30 different sources onto the platform, capturing over 200 attributes for each consumer. The platform mastered the data and created a unified view of individual consumers, enabling real-time segmentation and analytics of the customer base to run targeted marketing campaigns. The system integrated leading-edge technologies for the first time, including an email marketing engine for sending campaigns. The result: improved campaign outcomes for half of the client’s North American brands, which used the platform on a daily basis.
As is the case with any complex data-driven system, the initiative went through a rigorous process, from initial concept development to partner selection, design and development, and the first brand going live on the platform. The platform was unique and best-in-class, leveraging technologies such as Cassandra (an operational data store) and the Cloudera Hadoop stack for reporting and analytics, all on the AWS cloud. However, some of the technologies underpinning the platform could not keep up with the exponential data growth and enhanced scope. In addition, the wealth of information on the platform generated massive interest among the client’s business teams. As a result, more analysts and data scientists started mining the platform, leading to latency issues. Marketing analysts running ad-hoc queries to segment the data hit resource availability ceilings. Data aggregation jobs delayed the generation of reports, and users hit infrastructure utilization ceilings more often than before.
Cloudera administrators were also constantly fine-tuning the system to keep it running. Usage patterns showed that the system was used in peaks and troughs (at an average utilization of about 40%), and yet the system could not serve the analysts at peak times. The graph below shows the load distribution for a 24-hour period.
Identifying Solution Options
The conventional approach to solving this challenge would have been to throw more hardware at the problem. This would have increased hosting, licensing and maintenance/administration costs, and only partially solved the problem. It would have also caused average system consumption across the day to go down further, making the system more inefficient.
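The utilization argument above can be checked with simple arithmetic. The sketch below uses a made-up 24-hour load profile to compare two fixed cluster sizes. The hourly figures and the capacities of 100 and 150 units are illustrative assumptions, not measurements from the client platform.

```python
# Hypothetical hourly demand (arbitrary "capacity units") for one day.
# These numbers are illustrative only; they are not the client's data.
HOURLY_DEMAND = [5, 5, 5, 5, 10, 20, 40, 80, 130, 140, 120, 60,
                 40, 70, 130, 120, 60, 30, 20, 10, 5, 5, 5, 5]


def average_utilization(demand, capacity):
    """Mean fraction of a fixed-size cluster that is busy over the day.

    Demand above capacity cannot be served, so it is capped at capacity.
    """
    served = [min(d, capacity) for d in demand]
    return sum(served) / (capacity * len(demand))


def hours_over_capacity(demand, capacity):
    """Number of hours in which demand exceeds what the cluster can serve."""
    return sum(1 for d in demand if d > capacity)


if __name__ == "__main__":
    for capacity in (100, 150):
        util = average_utilization(HOURLY_DEMAND, capacity)
        starved = hours_over_capacity(HOURLY_DEMAND, capacity)
        print(f"capacity={capacity}: average utilization {util:.0%}, "
              f"{starved} hour(s) over capacity")
```

With these assumed numbers, the 100-unit cluster averages about 41% utilization yet falls short of demand for five hours of the day. The 150-unit cluster covers every peak but averages only about 31%, which is the inefficiency described above.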
The Mindtree team identified two key principles that could solve this problem: separation of ‘Storage and Compute’ and ‘Auto Scaling’. The team then came up with eight goals that the potential solution needed to address:
- Increasing average system usage in a day with minimal peaks and troughs
- Reducing the hosting cost
- Enabling dynamic scaling
- Simplifying administration, in turn, reducing maintenance costs
- Providing resources and tools for analysts and data scientists to mine the data
- Customizing each Extract, Transform, Load (ETL) and reporting job for maximum efficiency
- Supporting Create, Read, Update and Delete (CRUD) operations on the data to simplify data clean-ups
- Leveraging open source systems that can be easily ported to any cloud
In essence, the platform had to be moved to a Lean and Nimble setup that would scale seamlessly while ensuring transparency and cost-efficiency, and future-proofing the system.
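To illustrate why auto scaling addresses the first three goals, here is a minimal sketch that compares the node-hours consumed by a cluster resized every hour against a fixed cluster sized for the busiest hour. The demand profile, `NODE_CAPACITY` and `MIN_NODES` are hypothetical values chosen for the example, and the model ignores node start-up delays. Resizing like this is only practical when storage is separated from compute, so that nodes can be released without moving data.

```python
import math

# Hypothetical hourly demand in arbitrary work units; illustrative only.
HOURLY_DEMAND = [5, 5, 5, 5, 10, 20, 40, 80, 130, 140, 120, 60,
                 40, 70, 130, 120, 60, 30, 20, 10, 5, 5, 5, 5]
NODE_CAPACITY = 10  # assumed work units one node can handle per hour
MIN_NODES = 1       # assumed floor so the cluster never scales to zero


def nodes_needed(demand, node_capacity=NODE_CAPACITY, min_nodes=MIN_NODES):
    """Smallest node count that covers the demand, never below min_nodes."""
    return max(min_nodes, math.ceil(demand / node_capacity))


def autoscaled_node_hours(hourly_demand):
    """Total node-hours when the cluster is resized every hour."""
    return sum(nodes_needed(d) for d in hourly_demand)


def fixed_node_hours(hourly_demand):
    """Total node-hours for a fixed cluster sized for the busiest hour."""
    return nodes_needed(max(hourly_demand)) * len(hourly_demand)


if __name__ == "__main__":
    scaled = autoscaled_node_hours(HOURLY_DEMAND)
    fixed = fixed_node_hours(HOURLY_DEMAND)
    print(f"fixed cluster: {fixed} node-hours")
    print(f"auto-scaled cluster: {scaled} node-hours "
          f"({scaled / fixed:.0%} of fixed)")
```

Under these assumptions, the auto-scaled cluster serves every hour of demand while consuming roughly a third of the node-hours of the peak-sized fixed cluster. Actual savings depend on the real load curve, pricing and scaling latency.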
The Mindtree Approach
Our team researched various solutions available in the market, including Amazon EMR, Snowflake and Databricks Delta. After in-depth analysis, the team found Databricks Delta to be the best-fit solution, meeting all of the criteria.
Databricks Delta Lake provides:
- Atomicity, Consistency, Isolation, Durability (ACID) transactions.
- Scalable metadata handling, with unified streaming and batch data processing.
- Ability to run on top of the existing data lake, with full compatibility with Apache Spark APIs.
- Ability to configure based on workload patterns, providing optimized layouts and indexes for fast interactive queries.
- Ability to time travel to improve developer and data scientist productivity.
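The ideas behind atomic commits and time travel in the list above can be illustrated with a toy model. The `VersionedTable` class below is a hypothetical, in-memory sketch written for this article. It is not the Delta Lake API, and it stores a full snapshot per version for simplicity, whereas Delta Lake keeps a transaction log over data files.

```python
import copy


class VersionedTable:
    """Toy in-memory table that keeps every committed version.

    Conceptual model only; this is not the Delta Lake API.
    """

    def __init__(self):
        # Version 0 is the empty table; each commit appends a new snapshot.
        self._versions = [{}]

    @property
    def latest_version(self):
        return len(self._versions) - 1

    def commit(self, changes):
        """Apply a batch of upserts and deletes atomically as one version.

        `changes` maps a key to a new row dict, or to None to delete the key.
        If any change is invalid, no new version is recorded.
        """
        snapshot = copy.deepcopy(self._versions[-1])
        for key, row in changes.items():
            if row is None:
                if key not in snapshot:
                    raise KeyError(f"cannot delete missing key: {key!r}")
                del snapshot[key]
            else:
                snapshot[key] = dict(row)
        # Only reached when every change succeeded: publish the new version.
        self._versions.append(snapshot)
        return self.latest_version

    def read(self, version=None):
        """Return a copy of the table at `version` (default: latest)."""
        if version is None:
            version = self.latest_version
        return copy.deepcopy(self._versions[version])


if __name__ == "__main__":
    table = VersionedTable()
    v1 = table.commit({"c1": {"segment": "new"}})
    v2 = table.commit({"c1": {"segment": "loyal"}})
    print("latest:", table.read())
    print("as of version", v1, ":", table.read(version=v1))
```

Because a failed commit never publishes a partial snapshot, readers always see a consistent version. Because old versions are retained, a data scientist can re-read the table as it was when a model was trained.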
Solution Validation and Findings
In order to validate the identified solution, the team ran tests on 24TB of data, focusing on performance, scalability and ease of use. Here are the key findings.
- The reading performance was 51.2% better with lower resource consumption compared to a similarly sized Cloudera cluster.
- The writing performance was 70% better for inserts and almost the same for updates when compared to HBase, with an overall performance improvement of about 30%.
- Based on the cost-benefit analysis, hosting costs were 40% lower than in the current Hadoop-based environment.
- Analysts and data scientists can run their queries on a cluster customized for their needs without resource constraints.
- Increased utilization efficiency.
- Reduced cluster administration and maintenance costs.
- Development teams will have the flexibility to tune the hardware resources based on the needs of each Extract, Transform, Load (ETL) job, reporting job or analytics job.
- Dynamic scaling to enable jobs to run even faster.
- Teams can easily update data without expensive data movements and corrections that are needed in a Hadoop environment.
- Ability to write and run the code in multiple languages like Scala, Python, SQL, etc. using the user-friendly Databricks notebooks environment.
- Increased parallelism for running the analysis, thus increasing the productivity of analysts and data scientists.
- Time travel capabilities to help data scientists evaluate the models better.
Data-driven marketing systems, even those with terabytes and petabytes of data, can be made lean, nimble and cost-effective by using the Databricks platform to build the data lake. This approach offers several benefits across the reporting, analytics and campaign management layers, significantly speeding up the customer targeting and marketing campaign cycle.