How to Set Up VSCode for Data Science

  • Post category:Tooling

Introduction

Visual Studio Code (VSCode) is the new star on the software development horizon. It can be used as a lightweight editor for short scripts or config files, but with the right extensions installed it turns into a comprehensive development environment for almost any need. In this post we will discuss how to set up and use VSCode for data science and data engineering workflows to simplify and speed up your development process. My examples are written in Python, but the ideas can be adapted to any kind of software development. First, we will cover the basics: useful extensions, Jupyter Notebooks and git version control. Later we will look at some advanced topics such as working on remote machines (for example cloud instances) and developing in Docker containers to get easily reproducible environments.

Now let’s start by installing VSCode if you haven’t already: Download

To follow this post, I assume you already have Python installed on your system. A guide on how to set up a proper Python environment on your development machine can be found here.

Basic Setup and Extensions (Python and Pylance)

After the installation VSCode is already a proper editor, but the real power comes from its extensions. With extensions you can add functionality for nearly every use case you may be confronted with as a developer. To get an idea, click the Extensions button in the Activity Bar at the side of the window and give it a try.

Open Extensions Tab in VSCode

For those of you who like to work with shortcuts take a look at File > Preferences > Keyboard Shortcuts (Code > Preferences > Keyboard Shortcuts on macOS). On my Mac I can type Shift+Command+X to open the extensions tab.

Tip: Write yourself a cheat sheet with the shortcuts you need in your daily workflow (if an action takes more than one click, it is probably worth using a keyboard shortcut) and force yourself to use them until they feel natural. This will speed up your workflow considerably. Maybe print it out and place it visibly on your desk so you are always reminded.

For now, we start by installing the Python extension. Simply search for python and select the first element in the list. This extension provides a broad toolset for developing in Python and also supports Jupyter Notebooks out of the box (more on that later). But first let's write a small sample program:

def foo(): 
    print("Hi") 

foo()

You can execute the script in a new terminal window inside VSCode by clicking the green arrow in the top right corner of the window. As you can see, your code is already properly highlighted, which comes from the Python extension. To make your work more convenient I also strongly recommend another extension called Pylance. Search for it in the extensions tab and click install. Pylance provides features like auto imports, code completion and type checking. After installation, accept the notification which asks whether you want Pylance as your default language server and reload the window.

Now try the following: place your cursor on foo in the line where the function is called and press F12. Your cursor should have moved to the definition of foo at the top. This also works across different files within the project and helps you to navigate large code bases much faster.
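To get a feeling for what type checking and jump-to-definition offer, a slightly larger sample can help. The following is only a minimal sketch: the function names are made up for illustration, and it assumes that type checking has been enabled in the Pylance settings.

```python
from typing import List


def mean(values: List[float]) -> float:
    """Return the arithmetic mean of a non-empty list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(values) / len(values)


def describe(values: List[float]) -> str:
    """Format the mean of values; pressing F12 on mean() jumps to its definition."""
    return f"mean={mean(values):.2f}"


print(describe([1.0, 2.0, 3.0, 4.0]))

# With type checking enabled, a call such as mean("1234") would be expected
# to be flagged in the editor, because a str is not a List[float].
```

The type hints have no effect at runtime; they only give the language server enough information to warn you before you run the code.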

Jupyter Notebooks

If you work as a data scientist you most likely know about the power of Jupyter Notebooks already. A notebook is basically a tool to organize, execute and document your code in logically connected chunks with integrated outputs and visualizations. These chunks, or code cells, can for example bundle the code that creates a histogram of your data, and the resulting graph is displayed below the cell immediately after it is executed. This makes analyses easier to follow, which is why notebooks are also frequently used in tutorials and teaching. To open a fresh notebook, use the Command Palette. On a Mac, press Command+Shift+P (Ctrl+Shift+P on Windows), search for Jupy… and execute “Jupyter: Create New Blank Jupyter Notebook”.
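As a rough idea of what a single code cell might contain, here is a sketch of the histogram example. It uses made-up random data and only the standard library, so the "plot" is a text histogram; in a real notebook you would more likely use a plotting library and see a graph below the cell.

```python
import random
from collections import Counter

# Hypothetical sample data: 200 draws from a normal distribution.
random.seed(42)
samples = [random.gauss(0.0, 1.0) for _ in range(200)]


def histogram(values, bin_width=0.5):
    """Count how many values fall into each bin of the given width."""
    counts = Counter()
    for v in values:
        # Left edge of the bin that contains v.
        left_edge = (v // bin_width) * bin_width
        counts[left_edge] += 1
    return dict(sorted(counts.items()))


# In a notebook, this output appears directly below the cell.
for left_edge, count in histogram(samples).items():
    print(f"{left_edge:5.1f} | {'#' * count}")
```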

Command Palette in VSCode to open a new Jupyter Notebook

For this to work properly you have to make sure of two things. First, check your Python interpreter at the bottom left of the VSCode window. You should see something like this.

Select a python interpreter

Otherwise, or if the displayed version does not match the interpreter you want to use for your project, use the Command Palette again and search for “Python: Select Interpreter”. Again, have a look here to set up Python correctly on your machine, or create a virtual environment inside your project folder, which VSCode should detect automatically. Second, make sure that the Jupyter package is installed within the selected Python environment. Now your notebook editor should show up.
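If you are unsure whether both conditions hold, you can check them from a script or the first notebook cell. This is a small sketch; the module names jupyter and ipykernel are assumptions about what your setup needs and may have to be adjusted.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the names that cannot be imported by the current interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The interpreter path should match the one selected in VSCode.
print("Interpreter:", sys.executable)
print("Python version:", sys.version.split()[0])

# Assumed module names; adjust them to the packages your setup relies on.
missing = missing_packages(["jupyter", "ipykernel"])
print("Missing packages:", missing or "none")
```

If the printed interpreter path points somewhere unexpected, the wrong environment is selected and installing packages elsewhere will not help.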

Jupyter Notebook in VSCode

Git

Using a version control system can be painful without the right interface, and, guess what, there are extensions for that in VSCode. The standard functionality for working with git is already built into the core of VSCode, but I recommend one additional extension: “GitLens”. GitLens allows you, for example, to see who changed the line of code you are currently looking at, which commit it belongs to and what the changes were.

GitLens for version control features in VSCode

I recommend reading the documentation in the marketplace before you click install to get an overview of all the features, which can make your teamwork much more efficient and secure.

Remote Development

Now, let’s tackle some advanced features. In today's world of “Big Data” you will probably face workloads that your working machine cannot handle. For example, training neural networks on multiple powerful GPUs or loading several GB of data into memory is not practical on a typical working machine. So in some data science use cases you use your own machine only to connect to a much more powerful server, for example one with several GPUs, and run your code remotely. VSCode provides excellent support for these workflows. In our example we will connect to a remote server over an SSH connection and use our local VSCode instance to execute some example scripts.

In the Extensions tab search for “Remote Development” and install it. This will also install some useful extras like working in a Docker container (we will cover this later in this post) or connecting to a WSL (Windows Subsystem for Linux) instance if you use a native Linux as a development environment on your Windows machine. But first we will focus on the connection to a remote server.

Preparation

I assume that you have a machine in your LAN or a cloud computing instance available and are able to connect to it with an SSH client. Otherwise, prepare your remote machine by installing an SSH server and your working machine by installing a compatible SSH client. I also strongly recommend key-based authentication, although password authentication works as well.
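With key-based authentication it is convenient to keep the connection details in your SSH client configuration (typically ~/.ssh/config) so you only have to remember a short alias. The helper below is just a sketch that renders such an entry; the alias, address, user name and key path are placeholders, not values from this setup.

```python
def ssh_config_entry(alias, host, user, identity_file, port=22):
    """Render a Host block in OpenSSH client config syntax."""
    lines = [
        f"Host {alias}",
        f"    HostName {host}",
        f"    User {user}",
        f"    Port {port}",
        f"    IdentityFile {identity_file}",
    ]
    return "\n".join(lines)


# All values below are placeholders, not real credentials.
entry = ssh_config_entry("gpu-server", "192.0.2.10", "username", "~/.ssh/id_ed25519")
print(entry)
```

The function only prints the text; whether and where you paste it into your configuration file is up to you.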

Configuration

After installing the extensions you should have a new green area in the bottom left of your VSCode instance.

Remote Development button at the bottom left of VSCode

Click on it and select “Remote-SSH: Connect to Host” and type in the connection string:

username@ipaddress

A new VSCode window opens and you should see your server name or IP address at the bottom left. You are now connected to the remote machine and can execute anything as if you were running VSCode directly on the server. For example, open a new terminal (in the menu bar click Terminal > New Terminal); if your remote server is a Linux-based machine you should see something like this.

Terminal in VSCode

You can now open a new file or notebook as described above. But remember that you are not working on your current machine. If you need any files, packages, or python versions you have to manage them on the remote server.
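A quick way to convince yourself where your code is actually running is to print a few facts about the executing machine. This is a minimal sketch using only the standard library; run it once locally and once in the remote window and compare the output.

```python
import platform
import sys


def runtime_info():
    """Collect basic facts about the machine executing this code."""
    return {
        "hostname": platform.node(),
        "system": platform.system(),
        "python": platform.python_version(),
        "interpreter": sys.executable,
    }


for key, value in runtime_info().items():
    print(f"{key}: {value}")
```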

Developing in a Remote Docker Container

First I assume you have installed Docker Desktop on your machine. Otherwise, install it from here.

As described in this post you can manage different versions of Python and packages very easily, even between projects and working machines. But there are use cases in data science where this is not enough. For example, you may want to work in exactly the same environment your model will later run in on the production system, including system packages, Linux environment variables and other dependencies. To achieve this, you can work directly inside a container which fulfills all these requirements. If you are not familiar with Docker, have a look at their website. You can of course simulate a Linux server locally on your working machine with Docker Desktop, but in this example we will go a step further and connect to a Docker host on a remote server. This has several advantages.

First, you can use the computing power and general hardware infrastructure of the remote server as described above. Second, you can define your system requirements exactly the way you need them without changing anything about the configuration of the server itself. Let’s see how it works. Again, I assume you have a working SSH connection to a remote machine. This time, however, you also need a Docker host running on the server. For example, you can install Docker on an Ubuntu server as described here. Now let’s start the setup.

First create a folder with the name of a project and open it in VSCode (File > Open …). Then click on the green area at the bottom left of the screen like you did to connect to the remote machine. This time select “Remote-Containers: Add Development Container Configuration Files”.

Select a Development Container

Select for example “Python 3” from the resulting list, choose the Python version you want to work with and uncheck the nodejs option. A new hidden folder named .devcontainer should appear in your project folder. For now, ignore the notification that asks you to reload the window and have a look at the files inside the new folder.

The devcontainer.json file specifies everything related to the connection between Docker and your VSCode instance. For example, you can specify the user inside the container or some specific environment variables. The second file is a Dockerfile, which allows you to specify your working environment. Have a look here how to write Dockerfiles. For now, don’t change anything in these files and keep the defaults.

The easiest way to connect to a remote Docker host is to configure a new Docker context. This simply points your local Docker installation to a remote one, so you can manage the remote Docker instance from your local working machine. If you have Docker Desktop installed and running, type the following in your terminal (adjust username and host-ip):

docker context create remote-docker --docker "host=ssh://username@host-ip:22"
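If you prefer to script this step, for example to set up several machines the same way, the command can be assembled in Python. This is only a sketch: the helper name is made up, the user name and host are placeholders, and the command is printed rather than executed.

```python
import shlex


def docker_context_command(name, user, host, port=22):
    """Build the argument list for `docker context create` over SSH."""
    endpoint = f"host=ssh://{user}@{host}:{port}"
    return ["docker", "context", "create", name, "--docker", endpoint]


# Placeholder values; replace them with your own user name and host.
command = docker_context_command("remote-docker", "username", "host-ip")
print(shlex.join(command))

# To actually run it, pass the list to subprocess.run(command, check=True).
```

Passing the arguments as a list avoids shell quoting problems if a name ever contains unusual characters.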

To be able to connect to the Docker host via SSH tunneling you need to add an identity (your SSH key) to your SSH agent. See here how to do it for all major operating systems. Now go back to VSCode and use the Command Palette to search for “Docker Contexts: Use” and select the context you have just created. One small adjustment to the devcontainer.json file is needed before we can start developing inside the remote container: we have to specify which folder on the host is our working directory inside the container. To do so, add two lines at the end of the file, inside the curly braces (adjust the paths as you like, and make sure the preceding entry ends with a comma).

"workspaceFolder": "/workspace", 
"workspaceMount": "source=/home/jedsadmin/,target=/workspace,type=bind,consistency=cached"

This will mount your home directory on the host machine inside a folder /workspace of your new development container.

To spin up the development container click again at the bottom left on the green area and select “Remote-Containers: Open Folder in Container”. Now select the folder where you have created the .devcontainer folder. The container in the remote docker host starts building and after a few seconds you see the content of the home directory of the server host in your VSCode instance Explorer Tab.

Tips:

  • Use git to sync your projects between your local machine and your server so you can develop either on the server, inside a container or locally.
  • If you have an existing project, clone the repository to a directory on the server and point the source of the workspaceMount to this folder. This mounts only the project folder into your container.
  • To get access to GPUs on the remote server add: "runArgs": ["--gpus=all"]
  • Some extensions need to be specified separately for the remote instance. You can do this by adding: "extensions": ["ms-python.python", …]
  • Find a list of all possible configurations here.
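To see how the settings from this section fit together, the following sketch builds them as a Python dictionary and prints the result as JSON. The helper name and the project path are hypothetical, and the "extensions" key is placed at the top level as in the tip above; the printed entries are meant to be copied into devcontainer.json by hand, since that file may contain comments that a plain JSON parser cannot read.

```python
import json


def devcontainer_fragment(source, target="/workspace", use_gpus=False, extensions=()):
    """Build the settings discussed in this section as a dictionary."""
    fragment = {
        "workspaceFolder": target,
        "workspaceMount": (
            f"source={source},target={target},type=bind,consistency=cached"
        ),
    }
    if use_gpus:
        # Extra arguments passed to `docker run` when the container starts.
        fragment["runArgs"] = ["--gpus=all"]
    if extensions:
        fragment["extensions"] = list(extensions)
    return fragment


# Hypothetical project path on the server; adjust to your own setup.
fragment = devcontainer_fragment(
    "/home/username/my-project",
    use_gpus=True,
    extensions=["ms-python.python"],
)
print(json.dumps(fragment, indent=4))
```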

Conclusions

In this post we’ve looked at some VSCode basics like Jupyter Notebooks and git as well as some more advanced use cases like development inside a remote Docker container. This should get you started with VSCode as a data scientist, data engineer or software developer. One general remark: get to know your tools and make a habit of using the features they provide. It is usually better to adjust your workflow slightly to the conventions of VSCode or other tools than to stick to your own habits and try to bend the tools to them.
