AWS Big Data Automation with DevOps
This post shows how easily a small, sample big data flow can be built on AWS, with DevOps automation included.
Tech Stack Used:
- AWS CloudFormation templates
- Spark jobs written in Scala
- Apache Kafka
- Jenkins
Disclaimer: This architecture is minimal and intended only to show the ease of working with AWS, not the efficiency of the architecture itself.
- With the help of a CFT (CloudFormation Template), it is super easy to set up the whole infrastructure in a few minutes, and the setup can be replicated as many times as we need.
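As a minimal sketch of what such a template could look like, the snippet below builds a CloudFormation template as a Python dict and prints it as JSON; the resource name and bucket name are hypothetical examples, not part of the original setup.

```python
import json

# Minimal CloudFormation template sketch: a single S3 bucket for job
# artifacts. "ArtifactBucket" and the bucket name are illustrative only.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Minimal big data stack: an S3 bucket for job artifacts",
    "Resources": {
        "ArtifactBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {"BucketName": "my-bigdata-artifacts-example"},
        }
    },
    "Outputs": {
        "ArtifactBucketName": {"Value": {"Ref": "ArtifactBucket"}}
    },
}

print(json.dumps(template, indent=2))
```

The same JSON could be fed to `aws cloudformation create-stack` as many times as needed, which is what makes replication cheap.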
- Once the infra is set up, EMR jobs can be triggered from Jenkins.
- This is where AWS IAM roles come in handy and make the whole process secure.
- A fine-grained IAM role given to Jenkins can be tuned to access EMR and S3 exactly as needed.
- By configuring the right profiles for the AWS CLI using these IAM roles, we can ensure that Jenkins is able to put the JAR and other files in S3 and also trigger the EMR command.
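The steps above can be sketched as the two AWS CLI calls Jenkins would run under a named profile. This builds the commands as argument lists without executing them; the profile name, bucket, cluster id, and JAR path are hypothetical placeholders.

```python
# Assumed AWS CLI profile configured with the fine-grained Jenkins IAM role.
PROFILE = "jenkins-deployer"

def s3_upload_cmd(jar_path, bucket, key):
    """aws s3 cp <jar> s3://<bucket>/<key> --profile <profile>"""
    return ["aws", "s3", "cp", jar_path, f"s3://{bucket}/{key}",
            "--profile", PROFILE]

def emr_add_step_cmd(cluster_id, jar_s3_uri, main_class):
    """Add a Spark step (our Scala job) to a running EMR cluster."""
    step = (f"Type=Spark,Name=SampleJob,ActionOnFailure=CONTINUE,"
            f"Args=[--class,{main_class},{jar_s3_uri}]")
    return ["aws", "emr", "add-steps", "--cluster-id", cluster_id,
            "--steps", step, "--profile", PROFILE]

upload = s3_upload_cmd("target/job.jar", "my-bigdata-artifacts", "jars/job.jar")
run = emr_add_step_cmd("j-EXAMPLE123", "s3://my-bigdata-artifacts/jars/job.jar",
                       "com.example.SampleJob")
print(" ".join(upload))
print(" ".join(run))
```

In a Jenkins pipeline these lists would be passed to a shell step; because the profile pins the IAM role, the job can do only the S3 and EMR actions it was granted.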
Cherry on top:
AWS provides a wonderful tool: AWS Data Pipeline. Using it, we can eliminate the cost of idle EMR clusters, which makes it well suited to batch processing. As before, we can use Jenkins, with the correct profile and IAM role set, to send commands to the pipeline; it then ensures that the required EMR cluster is created, the job is processed, and the cluster is decommissioned afterwards to save cost.
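From the Jenkins side, kicking off such a run can be as small as one CLI call to activate an already-defined pipeline; a minimal sketch, where the pipeline id and profile name are hypothetical.

```python
# Build (without executing) the command Jenkins would run to start one
# run of an existing Data Pipeline definition.
def activate_pipeline_cmd(pipeline_id, profile="jenkins-deployer"):
    """aws datapipeline activate-pipeline --pipeline-id <id> --profile <p>"""
    return ["aws", "datapipeline", "activate-pipeline",
            "--pipeline-id", pipeline_id, "--profile", profile]

cmd = activate_pipeline_cmd("df-0123456EXAMPLE")
print(" ".join(cmd))
```

The pipeline definition itself (cluster size, job step, termination rule) lives in AWS, so Jenkins only needs permission for this one activation call.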
Tooling: With sufficient tooling, it is very easy to create and manage AWS environments using CFTs. There is plenty of ecosystem support around AWS, and plugins are available for widely used platforms such as Jenkins, Eclipse, etc.
Security: The fine-grained access control of AWS IAM makes remote deployment possible and yet secure.
Cost: If our applications are written to be fault tolerant, we can combine Data Pipeline with pre-emptible (Spot) resources and get the work done at around 80% lower cost.
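A back-of-the-envelope check of that figure, under illustrative assumptions (the hourly rate, spot discount, and busy hours below are examples, not AWS quotes):

```python
# Compare an always-on on-demand cluster with a transient cluster on
# spot capacity that only runs while the batch job needs it.
ON_DEMAND_RATE = 1.00     # $/hour per node (illustrative)
SPOT_DISCOUNT = 0.60      # spot often runs well below on-demand (assumed 60%)
HOURS_PER_DAY = 24
BUSY_HOURS_PER_DAY = 12   # hours/day the batch job actually needs the cluster

always_on_cost = ON_DEMAND_RATE * HOURS_PER_DAY
# Data Pipeline tears the cluster down when idle, so we pay only busy hours.
transient_spot_cost = ON_DEMAND_RATE * (1 - SPOT_DISCOUNT) * BUSY_HOURS_PER_DAY

savings = 1 - transient_spot_cost / always_on_cost
print(f"daily cost: ${always_on_cost:.2f} vs ${transient_spot_cost:.2f} "
      f"({savings:.0%} cheaper)")  # → 80% cheaper under these assumptions
```

The two effects multiply: shorter cluster lifetime from Data Pipeline, and a lower hourly rate from spot capacity, which is how savings in the 80% range become plausible.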