Running Scalable Data Science on the Cloud
Data science is in high demand today. It uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured. Let's take a brief look at the complexity of data science, the advantages of running it on the cloud, the different cloud service providers, and the challenges of running data science on the cloud.
The Complexity of Data Science
Three main factors drive the complexity of data science:
- Enormous Data Generation: Look around and you will see data-emitting devices everywhere. Electronic gadgets such as notebooks, smartphones (along with the apps on them) and fitness bands constantly produce data. In the near future, we'll be dealing with far more data-generating equipment: everyday items such as fridges, the garments you wear, pens and water bottles will carry nanochips that collect data for analysis.
- Inexpensive Data Storage: Imagine what it would cost to store all the music ever created. It would be significantly less than USD 1000!
- Economical & Exceptional Computational Power: A recently launched, powerfully configured laptop weighing 2.5 kg, with a Xeon processor, a Quadro GPU and 64 GB of RAM, costs less than USD 2000. No need to say more, right?
Overall, it’s easier than ever to store and process data without spending much money.
Benefits of Running Data Science on the Cloud
People ask a volley of questions about the need to run data science on the cloud. Why is the cloud better for data science, even when laptops with 64 GB of RAM are easily available? Read the answers below:
- Hassle-free Scaling Up: Let’s look back a few years. In 2010, a multinational BFS company wanted to set up a data science unit. They bought a server with 16 GB of RAM, sized for what they expected to need over the next 3-5 years. Everything was fine at the start, and no scaling up was needed when the first few people joined the team. Eventually, though, the team grew and data started to accumulate exponentially. The server slowed down and the company struggled, as it couldn’t buy a new server or upgrade the current one. A machine running on the cloud could have been scaled up with just a click.
- Cost: Assume that you need to work on a problem, such as mining the project data of the past 5 years, that needs more computational infrastructure than you have. Buying a new machine for this temporary task is not a sensible option, but if you are using the cloud, you can rent a higher configuration for a few hours or days and solve the task in the most economical way.
- Collaboration: If you want to work simultaneously with several data scientists and don’t want every team member to copy the data and code to their local machines, the cloud is the best option.
- Sharing: What if you want to share a piece of Python/R code with your colleagues? The libraries it uses might not be available on their machines, or might be installed in an older version. The cloud helps you track the code and move it to a different machine.
- Have a Larger Ecosystem: Cloud services like AWS, Azure and GCP provide a complete ecosystem to collate data, run your models and then deploy them. With an on-premises setup, we have to build the complete stack ourselves.
- Quickly Build Prototypes: An idea might strike you while you are on the move or debating a topic with your friends. In these scenarios, it is easier to use the out-of-the-box services on the cloud. You can quickly build prototypes without worrying about versions or scalability. Once the concept is proven, you always have the option to build a production stack at a later stage (for example: BigData plugin in-house MT tool).
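The cost argument in the list above can be made concrete with a little arithmetic. The Python sketch below compares buying a machine with renting a comparable cloud machine for a temporary task. The function names and the hourly rate are hypothetical and chosen only for illustration (they are not real vendor prices), and the comparison ignores electricity, maintenance, storage and data transfer charges.

```python
def break_even_hours(purchase_price: float, hourly_rate: float) -> float:
    """Rental hours at which cumulative cloud cost equals the purchase price."""
    if hourly_rate <= 0:
        raise ValueError("hourly_rate must be positive")
    return purchase_price / hourly_rate


def cheaper_option(purchase_price: float, hourly_rate: float, hours_needed: float) -> str:
    """Return "rent" if renting for hours_needed costs less than buying, else "buy"."""
    rental_cost = hourly_rate * hours_needed
    return "rent" if rental_cost < purchase_price else "buy"


# Assumed figures, for illustration only.
PURCHASE_PRICE_USD = 2000.0  # roughly the laptop price mentioned earlier
HOURLY_RATE_USD = 1.50       # hypothetical rate for a comparable cloud machine
HOURS_NEEDED = 72            # e.g. three days of continuous processing

print("Break-even after", round(break_even_hours(PURCHASE_PRICE_USD, HOURLY_RATE_USD)), "hours")
print("Cheaper option:", cheaper_option(PURCHASE_PRICE_USD, HOURLY_RATE_USD, HOURS_NEEDED))
```

With numbers like these, a short-lived task stays far below the break-even point, which is why renting tends to win for temporary workloads. A machine that is busy around the clock for years may tell a different story.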
Hopefully it’s now clear how beneficial cloud computing is for data science. Shall we explore the options for running R and Python on the cloud?
Various Platforms to Run Data Science on the Cloud
- Amazon Web Services (AWS): AWS is the leader in the cloud computing space. It has a huge market share, great documentation and a trouble-free environment with the flexibility to scale up rapidly. You can have your own machine with R or RStudio. If you launch a Linux server, you get Python pre-installed. Additionally, you can install libraries and modules as required.
You can do machine learning on AWS either by setting up your own machine or by using the Data Science Tool Box, which provides all the software out of the box. Along with the cloud services platform, AWS also provides large datasets in case you want to play around with Big Data. AWS is definitely the most popular choice for cloud computing, with an ecosystem in which resources are easy to find, but it’s relatively expensive. Also, at the time of writing, AWS doesn’t provide machine learning services for Asia Pacific, so if you are in that region, select a server based in North America.
- Azure Machine Learning: If Amazon is the champ, Azure is the contender. It provides an interface for end-to-end data science and machine learning workflows. You can set up a machine learning workflow with Azure’s studio, use Jupyter notebooks on the cloud, or call their ML APIs directly.
- IBM BlueMix: While AWS and Azure have grown their cloud presence progressively, IBM launched BlueMix and has recently started to market it aggressively, but the offering isn’t as straightforward as those of the other two competitors. However, BlueMix is worth considering for setting up notebooks on the cloud.
- Sense.io: You can deploy Sense projects with a click. It offers services based on R, Python, Spark, Impala and Julia, and makes it easy to share analyses with groups.
- Domino DataLabs: Domino DataLabs is based in San Francisco and provides a secure environment that supports languages like R, Python, Julia and Matlab. It also provides version control and options to collaborate seamlessly and share work with your team members.
- DataJoy: DataJoy looks like a stripped-down version of Sense and Domino DataLabs at this stage, but for running R or Python on the cloud, this platform works well.
- PythonAnywhere: If you plan to build websites and web-based applications along with a data science stack, PythonAnywhere is a nice option. It provides a single window for hosting, website building and running data science.
Challenges of Running Data Science on the Cloud
Along with the huge benefits of cloud computing, there are a few challenges as well. These problems are unlikely to stop the growing use of the cloud, but at times they act as hurdles.
- Data Sharing with a Third Party: Having data on the cloud means your service provider always has access to it. If you are reluctant to share data outside your company for any reason, cloud computing might not suit you. For example, banks are not comfortable uploading their data to the cloud for analysis, for security reasons.
- Massive Data Uploads and Downloads: A one-time upload of a huge volume of data from your data center might be challenging if the internet infrastructure is not robust.
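A back-of-the-envelope estimate helps judge whether a one-time upload is realistic. The Python sketch below converts a data volume and a link speed into hours of transfer time. The function name, the sample figures and the 80% efficiency factor are assumptions made for illustration, not measurements.

```python
def upload_time_hours(data_size_gb: float, bandwidth_mbps: float, efficiency: float = 0.8) -> float:
    """Estimate the hours needed to upload data_size_gb over a link of bandwidth_mbps.

    efficiency is the assumed fraction of the nominal bandwidth actually achieved.
    """
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth_mbps must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    data_megabits = data_size_gb * 1000 * 8  # decimal gigabytes to megabits
    seconds = data_megabits / (bandwidth_mbps * efficiency)
    return seconds / 3600


# Hypothetical scenario: 1 TB of data over a 100 Mbps uplink.
hours = upload_time_hours(data_size_gb=1000, bandwidth_mbps=100)
print(f"Estimated upload time: {hours:.1f} hours")
```

Under these assumptions the upload takes a little over a day, and a slower or shared link stretches that further, so it is worth running the numbers before committing to a bulk transfer.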
To conclude, cloud computing will penetrate further into data science services and become standard over time. We hope cloud services prove useful and handy whenever you need them. Do you agree that cloud computing will greatly benefit the data science boom in the future? Feel free to reach out to Mindtree experts here.