Steve's Machine Learning Blog: A Look at the Architecture Diagram for IBM's Data Science Experience Local

One of the recent developments in the world of data science is the emergence of enterprise-grade data science platforms.

Who Needs this?

For an individual researcher working alone, a simple Jupyter notebook with a few open source packages may be sufficient to begin extracting value from a data source. Large enterprises, however, have very different needs. There are numerous people involved, with different skill sets, plus hundreds of data sources, existing access control schemes, security requirements, compliance regulations, high availability needs, source code control, and so on. It doesn't make sense for a large organization to try to build all that from scratch before getting to the real work of data science. Hence the need for the enterprise-grade data science platform.

On the other hand, what happens if an enterprise ignores these issues and allows data scientists to build whatever and wherever they want? It gets really messy! Also, data scientists change jobs quickly, and much knowledge is lost when an employee leaves, especially if their work lived only on a laptop somewhere. So, to answer the question "Who needs this?": EVERY large organization doing data science needs this. Suppose you had to manage 120 data scientists in your organization. What would you choose?

What Is It?

IBM's Data Science Experience Local (DSXL) is a relatively new offering (initial general availability in 2017). DSXL is built on a modern, scalable, containerized microservice cloud architecture. I know that sounds like a mouthful, but it is accurate: all code is built into Docker containers and is managed by Kubernetes. The low-level details of Docker and Kubernetes are beyond the scope of this post, so if you are not familiar with these technologies, I encourage you to do further reading here:

1. Docker Containers: https://www.docker.com/what-docker
2. Kubernetes: https://kubernetes.io/docs/concepts/overview/what-is-kubernetes/
3. Microservices: https://en.wikipedia.org/wiki/Microservices

All these pieces come together to provide an overall computing environment that is self-monitoring and easily scalable for enterprise-grade workloads. At the same time, it enables agile development and rapid delivery of new features, plus rollbacks in case of problems.
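To make the scaling and rollback idea a bit more concrete, here is a minimal toy sketch in Python. It is not the Kubernetes API and not DSXL code; the class, the service name and the image tags are all hypothetical. It only illustrates the concept of a service that keeps a history of revisions, so that a bad release can be undone and the replica count can be changed independently of the code version. In a real cluster, Kubernetes handles this for you (for example through `kubectl scale` and `kubectl rollout undo`).

```python
class Deployment:
    """Toy model of a service with a revision history (not the Kubernetes API)."""

    def __init__(self, name, image, replicas=1):
        self.name = name
        self.replicas = replicas
        self.history = [image]  # oldest first; the last entry is the live revision

    @property
    def image(self):
        """The image of the revision that is currently live."""
        return self.history[-1]

    def scale(self, replicas):
        """Change how many copies of the service run; the code version is untouched."""
        if replicas < 1:
            raise ValueError("replicas must be at least 1")
        self.replicas = replicas

    def roll_out(self, image):
        """Deliver a new revision and remember the previous one."""
        self.history.append(image)

    def roll_back(self):
        """Drop the live revision and return to the one before it."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier revision to roll back to")
        return self.history.pop()


# Hypothetical service name and image tags, for illustration only.
svc = Deployment("notebook-service", "notebook:1.0")
svc.scale(3)                  # more load: run three replicas
svc.roll_out("notebook:1.1")  # ship a new feature
svc.roll_back()               # the release misbehaves: undo it
print(svc.image, svc.replicas)
```

The point of the sketch is that scaling and versioning are separate knobs: rolling back the release leaves the replica count at three.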

This is truly a remarkable improvement over traditional monolithic software delivery methods, and this architecture gives DSXL power and flexibility that were hard to achieve before.

Where Does it Run?

Data Science Experience Local can be run in your on-premises data center, behind your firewall, or on your organization's favorite private cloud, including IBM SoftLayer, Microsoft Azure and Amazon AWS.

Let's See it Already!

Although the base configurations are described as 5-node and 8-node, I am depicting an 11-node configuration to highlight the various services, their placement within the system and to emphasize the idea that these can easily be scaled and customized based on your needs.

Note: The yellow boxes in the diagram are optional components.

As you can see, the light blue box represents the DSXL platform. The cluster of nodes is broken down into:

Control Nodes (or Master Nodes) - Used for cluster management, monitoring of resource usage (RAM, CPU and network), plus administrative dashboards.
Compute Nodes - Used for running the actual data science workloads.
Storage Nodes - Used for internal information about users, projects, access control, logging, code control, etc. Although you are able to save small data sets to your local workspace, it is generally not recommended for enterprise work. Instead, connections to external databases are the way to go. In this case the connection information and credentials are stored locally and can be used to access data warehouses and/or Hadoop clusters in the enterprise.
Deployment Nodes - Used for deploying machine learning models into production (the subject of an entirely different blog posting). Depicted here are two deployment services, which are needed in high-availability environments.
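The four node roles above can be sketched as a small data model. This is only an illustration under stated assumptions: the total of 11 nodes and the two deployment nodes come from this post, but the 3/3/3 split across control, compute and storage is my assumption, and the names `Role`, `Node`, `build_cluster` and `summarize` are hypothetical helpers, not anything shipped with DSXL.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    CONTROL = "control"        # cluster management, monitoring, admin dashboards
    COMPUTE = "compute"        # actual data science workloads
    STORAGE = "storage"        # users, projects, access control, logs, code
    DEPLOYMENT = "deployment"  # machine learning models in production


@dataclass(frozen=True)
class Node:
    name: str
    role: Role


def build_cluster(counts):
    """Create nodes named '<role>-<n>' from a {Role: count} mapping."""
    return [
        Node(f"{role.value}-{i}", role)
        for role, count in counts.items()
        for i in range(1, count + 1)
    ]


def summarize(cluster):
    """Return the number of nodes per role name."""
    return dict(Counter(node.role.value for node in cluster))


# Assumed split: only the total (11) and the two deployment nodes are from the post.
LAYOUT = {Role.CONTROL: 3, Role.COMPUTE: 3, Role.STORAGE: 3, Role.DEPLOYMENT: 2}

cluster = build_cluster(LAYOUT)
print(len(cluster))
print(summarize(cluster))
```

Scaling up is then just a different mapping. For example, `build_cluster({**LAYOUT, Role.COMPUTE: 5})` would describe the same cluster with two extra compute nodes.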

So, DSXL has a fairly large footprint and is designed to provide a common working platform for large teams of data scientists within organizations. The code base consists of open source packages and systems along with a good deal of proprietary code. In general, the open source code that you are familiar with is already available in DSXL. Plus, the installed open source packages are all warranted and tested to work together. So if you like Python and scikit-learn, it is in there. If you like working in R with Spark, that too is in there.

Summary:

The field of data science is progressing rapidly. The industry has moved beyond "What is data science and why do we need it?" to "How can we be faster and more effective at delivering data science value?" This is an amazing time and I am glad to be a part of it. I hope you are enjoying the journey as well.

Comments or questions? You can contact me @anlytcs on Twitter, through LinkedIn or by phone.

Thursday, May 24, 2018
