Cloud-based development environments aren't just for software developers! In this demo, learn how Coder can simplify and streamline data science workflows by allowing data scientists and data engineers to create fully configured workspaces with the click of a button.
We'll cover some of the reasons data science teams love Coder.
I'm pretty excited to show you again how Coder is helping data scientists simplify their environments and run their experiments and training in a way that's more efficient, easier to use, and involves a less complicated setup process.
We're just gonna do a quick overview of what Coder is, just to give everyone a baseline of what our product is and how we see it fitting into this data science world. I'm going to do a live demo -- that's the point of this, right -- showing a Jupyter Notebook and some data analysis around IMDB movie ratings, something kind of fun that we can relate to. We'll have a little bit of wrap-up discussion and the Q&A section like I said. And then there's a resource slide at the end that has some information, like a blog post that we did and things like that.
At a high level, what is Coder? Again, I want to level-set everyone here: what Coder is doing is taking the developer experience -- or the data science experimentation experience -- off of your workstation and moving it into the cloud. These are cloud-powered development environments built from a standardized container image that gives you all the tools you need, and it doesn't matter what device you connect from because everything is running in the cloud on your Kubernetes infrastructure.
What this means is that all of my development is happening on the same network, very close to the data warehouse where my models are, for example, or my data is being passed through my neural networks and things like that. All that compute power is available too, so I don't need to worry about a really beefy machine locally and having a whole GPU farm that I'm running locally here. But instead I can have like, let's say, a Chromebook and connect to my Kubernetes cluster that has 128 cores and 256 gigs of RAM and a whole bunch of hard drive space and maybe five or six GPUs attached to it or something like that. I can actually do all my computational modelling that way and do my training and stuff like that. When it comes to speed and efficiency it's great for that.
And then because everything is built from a containerized model, I'll also show you a template feature that we have to build out these workspaces. All of your teammates are going to be building from the exact same base as well, so you eliminate variance in your experiments. One common problem in data science is tracking experiments and collaborating across teams, and having this repeatable workspace that we can all build from, plus the power of cloud compute to speed the whole process up, helps normalize our experiments and helps out with that piece of it.
Yeah so basically other things to call out: we support VS Code of course, but there's also Jupyter Notebook, RStudio, IntelliJ, PyCharm, any JetBrains IDE, including an early access preview for DataSpell, the new data science IDE that JetBrains put out there. Lots of cool stuff.
We support all those in the browser like you'll see here, and you can also connect local editors and things like that. I did something using Spyder IDE, as an example. There's a lot of things you can do within Coder and a lot of cool ways you can be creative with this.
Basically I wanted to show the idea of getting started on a data science project, and like I said, this one is IMDB movie ratings. I actually took this notebook from a data science article that I found on a blog. Basically what it is: it's a CSV of IMDB movies, and it's taking a look at the rating that each movie had based on its budget. We're going to go through and run some Matplotlib stuff and things like that to see how this works.
Typically if I was getting started on this type of project, I might need to install Jupyter Notebook locally. I'd have to maybe clone down this repository and get all of my dataset here. In this case the CSV would have to be local, and I'd make any changes and stuff like that there.
But with Coder, it's really easy. I can go down to this getting started section and click Open in Coder. This is going to redirect me over to Coder itself, and it's going to prompt me for just a couple of things to create my workspace. If I go in here, I'll call this IMDB dataset; I can call it anything I'd like. It's gonna ask me which organization I want to go into. This is Coder's way to logically separate teams, so you can see I have quite a few available here. We have a data science team, for example, an Enterprise team, DevSecOps, etcetera. I'll just leave it in our default Coder organization here. Which provider? We do have this idea of being able to have multiple Kubernetes clusters available, and I can talk about that a little bit more later. But one of the key things is maybe I have some specific hardware that this model set needs to run on, so I want to make sure I pick the right provider that has that underlying hardware. For now I'll use our default built-in provider and click create workspace.
What this is doing is reading from a template file that was in the repository. It is building out--here we go--it is building out my workspace right now by pulling a container image that I wrote that has all the editors that I'm going to need. It's going to have all the tooling I need, the Python modules, things like that. And because it's a container, everyone else that builds from this template is going to have the exact same thing. The template will tell you how many cores to add to this workspace, how much RAM to allocate, and how much hard drive space to allocate. And I could even allocate GPUs as well, like I mentioned. Really cool. Ben actually helped me out with part of the Dockerfile too.
So I'll go over that and kind of what that looks like behind the scenes in a moment here.
But the last step of this is it's going to be assigning a persistent volume claim which will basically preserve all of my in-flight work. If that image updates, I can rebuild this and not lose anything that I've been working on. And then I can also clone down that repository that's going to have all my data here.
Again, all of this is happening behind the scenes. I haven't installed anything, I'm connecting over a browser.
And because of this, I can do this from an iPad, I can do it from, again, a Chromebook or a thin client. I don't need to have a $2000 MacBook Pro locally, and we skip the whole routine of "let's issue you a laptop or some really expensive hardware and get you all set up so we can get your data science lab going." All of this can happen in the cloud instead, which saves us a lot of setup time.
While I was speaking there, just about a minute or two, I guess at most, we were able to get this full workspace set up and we're actually ready to start working on our project.
We've got a few applications that I put into this image here: VS Code, PyCharm, Jupyter Notebook, and then we actually have a terminal as well. So Coder does have a command line, and you can connect all of that locally to your remote workspace and issue commands that way. But you can also just click on this terminal here and get all the same stuff you would be used to.
One of the things with data scientists is that you probably aren't always using the terminal; you're used to doing things like notebooks and Excel sheets and stuff like that. One of the other cool things again, without having to install or set up anything extra, I can just click on Jupyter here and we'll actually see the Jupyter Notebook, which will look very familiar to anyone that's used it before.
And you can see I have a few files in here, including that repository that got cloned down. You can see my CSV files in here, some other information, but more importantly, that specific Jupyter Notebook file itself. Let me make that just a tad bit bigger. There we go. Here's my Jupyter Notebook. This is the file that's been source controlled and that my team's been using on this repository here. You can see that this dataset came from Kaggle, which, if you're into data science, you've probably seen before. But we're basically looking at top-ranked movies from IMDB specifically. And again, we're gonna be doing a comparison of budget versus rating and that type of work here.
Basically you can click through, and I wanted to show that you can run the same stuff that you would normally do, just like you're used to. So I can run that import statement, run this head command, and check out the top values of my dataset. I can do some normalization here as well: check out my types, drop values as needed, and replace things to normalize the data. I can do some extra stuff here with string replacements and, if I keep running through here, even do a scatter matrix.
Matplotlib is installed on this container inside of this workspace, so I can go ahead and do my scatter plot here as well: say, okay cool, let me do this baseline modeling and check that out, and it generates it for me. Any tweaks that I make to this notebook would obviously also show up in here, which is great. These outputs are here because the Jupyter Notebook has been run before, so obviously we have the same version here, but I can make my tweaks to this dataset and rerun this stuff if I needed to. If I need to update my CSV file, I could do that. Basically, I can do the whole analysis here all the way through, and you can see as we keep scrolling down I'm doing trend analysis through another scatter plot, some secondary modeling in there as well, and that type of thing. It's a practical use case of asking: is there a relationship between a movie's budget and the rating it receives on IMDB? Which is kind of neat.
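For anyone following along at home, here is a minimal sketch of the kind of notebook cells being run in the demo. The filename and the budget/rating column names are assumptions for illustration; the actual Kaggle dataset has its own schema.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the IMDB dataset -- the filename and column names here are
# placeholders, not the exact schema of the Kaggle CSV used in the demo.
df = pd.read_csv("imdb_movies.csv")

# Inspect the first few rows and the column types.
print(df.head())
print(df.dtypes)

# Basic normalization: drop rows with missing values and strip
# non-numeric characters (e.g. "$" and ",") from the budget column.
df = df.dropna(subset=["budget", "rating"])
df["budget"] = (
    df["budget"].astype(str).str.replace(r"[$,]", "", regex=True).astype(float)
)

# Scatter plot of budget versus IMDB rating.
df.plot.scatter(x="budget", y="rating", alpha=0.5)
plt.title("IMDB rating vs. budget")
plt.xlabel("Budget")
plt.ylabel("Rating")
plt.show()
```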
With that said, I talked about being able to edit your CSV, and you can see that we're just reading that file directly here. But this could be pulling from a data warehouse or something like that instead; this could be a huge dataset. If I actually go back to my workspace, I'll open VS Code, just for quickness and ease, and if I edit that CSV file we'll see it's not too large. It's a pretty small dataset, 119 lines in here. It's not too much in this case, but in a general sense, maybe I'm working with thousands and thousands of rows, right? Most data science isn't going to be just 100 lines; that's not necessarily the best data to use as an input. When you have that huge dataset, you want it located close to your workspace. The time it takes to do all this computation is a lot quicker: you're not doing something locally, and you're not bringing data over a VPN and then back over the VPN as you send it back and forth.
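To illustrate the "pull from a data warehouse instead of a local CSV" point, a sketch like this shows how little the notebook would change when the data lives next to the workspace. The connection string, database, and table name are hypothetical placeholders, not anything from the demo.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string -- host, database, and table names are
# placeholders. In practice this would point at a warehouse reachable from
# inside the cluster, so no data has to cross a VPN to a laptop.
engine = create_engine("postgresql://analyst:secret@warehouse.internal:5432/movies")

# Same downstream analysis as before, just a different source than pd.read_csv().
df = pd.read_sql("SELECT title, budget, rating FROM imdb_movies", engine)
```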
If we take a look at the architecture diagram for Coder, just to give an idea of how this is all working, the users' workspaces are represented in this top block here, and, again, you can connect your command line or just use the browser connection like I'm doing. Everything happens over HTTPS; it's all secured over SSL within the Coder daemon that's running on your cluster. That's actually going to connect to all of your data: your PostgreSQL database for Coder, of course, but also any other database that your workspace needs.
You can have your own container registry. I happen to be using Docker Hub, but it could be a private container registry as well that has all my images on it. And then of course I can have those additional providers like I talked about, that say: this is the cluster I want to deploy into, and maybe that cluster has, again, specific hardware configurations or something like that. And then finally we have our data plane where all our workspaces are. This is all in a secure, behind-your-firewall environment, where the interconnections between those databases, or the data warehouse in general, and your workspace are going to be a lot quicker. This should help improve model training time and things like that, as well as giving you that cloud compute.
My workspace specifically here has four cores and 8 gigs of RAM. But I could have very easily built this out to have 128 cores and 256 gigs of RAM, basically whatever the highest allotment is, so I can do all of this and train my models and really have that beefy system. And then the great thing is I can rebuild these as many times as I want. It's obviously expensive to have a GPU farm in the cloud and to have all those VMs spun up with so much compute power, so I can just hit the stop workspace button and it basically puts the workspace on standby, where I can rebuild it and jump right back in where I was afterwards, to save on infrastructure costs and everything there.
To wrap this up, I do want to show how this template worked specifically. It does say "built from a template." Now I'm going to jump over to my GitHub repository again. Within my repository we had this coder.yaml file that specifies our workspace as code. It's a workspace template; it's going to give us information here on how we want these developer or data science workspaces to look. In this case we have the image specified; again, this one's coming from Docker Hub and it's a data science sample that I did specifically. I have my CPU, memory, and disk, and then, again, GPUs I could add in there as well. If I said I wanted those 128 cores instead, I would basically just change that value in source control, and when I build from this template everyone will get 128 cores instead.
The neat thing about this is that it's taking the DevSecOps practice, or just the DevOps practice, of infrastructure-as-code and basically normalizing it.
Again, every data scientist that builds a workspace here is getting the exact same base image and then we're getting the exact same base hardware to run that image on. It's a really neat way to be able to kind of keep track and normalize that piece and eliminate some of the variables in our experiments.
We can do some labelling, Kubernetes allows that. We can do things like chargeback groups or label specific Python versions or things like that. Maybe I need to add a specific label for this workspace that has to do with a special project or something like that as well.
And then of course you can run a bunch of different commands. My Dockerfile has a bunch of packages and stuff in it, which I'll show you in just a second, but just to show some of the use cases here, I'm also doing a lot of pip installs so we can get, again, that Matplotlib, and then I added some other stuff: Flask, Django, things like that. It's pretty neat, and the nice thing is that this is fully auditable. Our ops team can handle updating this; the data scientist doesn't really have to worry about any of this piece, and as long as this image is up to date and has everything we need for a project, we're constantly able to build from it, like you saw, just by clicking that "Open in Coder" button.
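Pulling together the pieces described above (image, CPU, memory, disk, optional labels, and post-build commands), a workspaces-as-code template looks roughly like the following. This is an illustrative approximation, not the file from the demo repository; the exact field names come from Coder's documentation and may differ, and the image tag and package list are placeholders.

```yaml
# Illustrative approximation of a coder.yaml workspace template.
# Exact schema may differ -- see Coder's workspaces-as-code docs.
version: 0.2
workspace:
  type: kubernetes
  spec:
    image: docker.io/example/datascience-sample:latest  # base container image (placeholder)
    cpu: 4        # cores -- bump to e.g. 128 in source control to resize for everyone
    memory: 8     # GB of RAM
    disk: 30      # GB of persistent storage
    labels:       # optional Kubernetes labels, e.g. for chargeback groups
      team: data-science
    configure:
      start:
        - name: install python packages
          command: pip3 install --user matplotlib pandas flask django
```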
Let's jump over to that Dockerfile real quick just to show what's in there. Here we can see a very standard Dockerfile, if you've ever worked with one before. Again, your ops team might handle this versus the data scientists themselves, but as Ben was saying at the beginning, sometimes it's kind of the other way around: software developers are being told they now need to do some data science or something like that, so I want to make sure we covered this too. This Dockerfile specifies all the tooling that we're going to use. Things like Python are in there, of course, and I do some user management for Coder's user, install Jupyter, install PyCharm, and then you can see I actually have DataSpell in here as well, commented out right now. It is an early access preview, so I figured for a live demo maybe let's comment it out and not risk a preview, I guess. But in a general sense, it's that easy.
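For reference, a stripped-down sketch of that kind of Dockerfile might look like this. The base image, package list, and user setup are illustrative assumptions, not the exact contents of the demo's image.

```dockerfile
# Illustrative sketch only -- base image and packages are placeholders,
# not the exact Dockerfile used in the demo.
FROM python:3.9

# System packages commonly needed alongside data science tooling.
RUN apt-get update && apt-get install -y --no-install-recommends \
    git curl && rm -rf /var/lib/apt/lists/*

# Python tooling for the workspace: Jupyter plus the analysis libraries.
RUN pip install --no-cache-dir jupyter pandas matplotlib

# A non-root "coder" user that the workspace runs as.
RUN useradd -m -s /bin/bash coder
USER coder

# PyCharm (and the commented-out DataSpell EAP) installation steps
# would go here; omitted for brevity.
```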
We have a Dockerfile that specifies everything I need for my workspace, and we have a workspace template that specifies everything I need to run that container. Now everything that is built from this template is going to be using the exact same base. Onboarding a new data scientist is really just as easy as going, like you saw, and clicking "Open in Coder", and they get the exact same base as everyone else. Pretty cool.
[Ben] I just wanted to mention that for a data scientist or a developer, it really is just that "Open in Coder" button, and then you have all the dependencies you need, such as Matplotlib, which is a requirement for your data science example. That's something that normally each person would have to install, and maybe the instructions differ across Mac, Windows, and Linux.
I always run into weirdness trying to install stuff like that, so having that same environment and hardware is great. The same goes for when the project scales and maybe you're working with a larger dataset or something like that. I know data science teams, when they're working locally, often work with a small subset of the data while the large dataset is in a data farm or something; with Coder you could quite literally just connect to that in the cloud. Or maybe your dataset even lives in the same cloud as your Coder workspaces, so you can interact with the full dataset with very low latency just because of the resources that you have in the cloud. It's pretty exciting to see, and your presentation really helped me get those ideas together.
[Thomas] Yeah. I mean, again, there are literally data scientists still using Excel and everything too, and dealing with those kinds of trouble. Like you said, anywhere we can remove some of the puzzle pieces of local configuration and make things a little bit easier and more standardized is great. It's not uncommon in the data science world for someone to have written an experiment and landed on some model that they decided to go with.
Let's just use the Uber example. I don't want to say that this is Uber, but let's use the Uber example. Maybe they have some model that worked really well for them for getting the next available driver to a person's location. Maybe that data scientist left or is on a different team now, so then someone else has to go in there and say: okay, this model was built a year ago, and I need to figure out how they even set up their local environment to run this model so I can get like-for-like data, and then once I do, how can I improve this experiment and make my tweaks. As you said, there's basically a lot of setup process locally to even get to that point. Hopefully there's some documentation, if you're lucky, on how to set up the experiments again so you can start doing your own new experiments and making those tweaks. There's just a lot more involved with it. With the standardized template that builds out this base image, in theory, a year from now someone could build out this project, and if I haven't made any tweaks to it, it would be the exact same thing we just saw today, so pretty cool.
I know oftentimes source control might be something new to data science, so what would that experience be like in Coder?
The good thing is that Coder integrates directly with the three major Git providers: GitHub, GitLab, and Bitbucket. Whether it's on-premises or in the cloud, we integrate tightly with all of those. Your Coder administrator would set up the initial bit, you link your account, and then you're able to do all of your cloning and pushing and things like that, so you don't really have to worry about too much there.
The nice thing about some of the editors like PyCharm and VS Code is that they have an upload-to-VCS button that you can usually just click, and it will push to your repository. You don't necessarily need to hop into the terminal and learn all the Git commands, which is pretty cool. And then if you're using a tool like Comet ML, or MLOps tooling, or MLflow for tracking experiments, you can still use the exact same tooling that you're using now. Those all integrate with the providers as well, so we would basically have that same integration and it would just work.
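For experiment tracking specifically, nothing about the workflow changes inside a workspace. A minimal MLflow example along these lines would run exactly as it does locally; the tracking URI and experiment name here are hypothetical, not from the demo.

```python
import mlflow

# Hypothetical tracking server reachable from the workspace; point this
# at whatever MLflow deployment your team already uses.
mlflow.set_tracking_uri("http://mlflow.internal:5000")
mlflow.set_experiment("imdb-budget-vs-rating")

with mlflow.start_run():
    # Log the knobs and results of an experiment as usual.
    mlflow.log_param("model", "baseline-linear")
    mlflow.log_metric("r2", 0.12)
```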
I do see a question that came in here from Fernando. Does it work well for web development, for example, building a complex Django app?
Yeah, actually that was probably the original use case for Coder: taking our developer environment in general and moving it into the cloud and having this repeatable environment. In last month's demo we talked about securing remote development, and that's on YouTube.
I happened to be doing a React app in that one, but again, it could have been Django, Flask, or whatever. It totally works for that. You just specify in your Docker image what tools you need, do whatever pip installs you would want for Django and its dependencies, and then you launch your workspace. We can actually expose a port internally to listen to the web server and see the front-end changes happen, have the whole back end set up as well, spin up a database, and all that fun stuff.
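As a tiny example of what "expose a port internally to listen to the web server" looks like in practice, the dev server just needs to bind to all interfaces inside the workspace. Flask is shown here for brevity (a Django runserver on 0.0.0.0:8000 works the same way); the app itself is a placeholder.

```python
from flask import Flask

app = Flask(__name__)

@app.route("/")
def index():
    return "Hello from a Coder workspace!"

if __name__ == "__main__":
    # Bind to 0.0.0.0 so the workspace's exposed/forwarded port can reach
    # the dev server from the browser.
    app.run(host="0.0.0.0", port=5000)
```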
One final thing I wanted to call out is that we have a blog post on using Coder for data science.