My personal reference sheet for data engineering on GCP

This guide is intended to be a handy reference for myself when I’m looking for a specific link or command, or setting up a new workstation. I’m expecting this guide to grow with more commands over time.

I’ll be covering the following areas in this guide:

Homebrew, Pyenv & Virtual env

This holy trinity should be taught in all tutorials. If you want control and understanding of your local environment, the combination of Homebrew, pyenv & venv is the only way to go.

Homebrew

Homebrew is the de facto package manager for macOS, and it also runs on Linux. Find more information about how to install it here.

Commands to know:

brew doctor                 #Performs a health-check on your install
brew install *name*         #For installing cli-based applications
brew install --cask *name*  #For installing gui-based applications

Pyenv

Forget about ‘installing python’. Install pyenv (with Homebrew) and use it to install and select Python versions. Pyenv basically intercepts the command ‘python’ in your terminal and makes it point to the specific Python version you want. More information on pyenv can be found here.

brew install pyenv

To make sure every terminal session has pyenv initialized you need to put an init script in your .bash_profile or .zshrc. Information on how to do this can be found at point 3 of ‘Basic GitHub Checkout’ here.

When you’ve got pyenv working you can install your preferred python version like so:

pyenv install 3.7.8   #Installs python 3.7.8
pyenv global 3.7.8    #Sets python 3.7.8 to be your global version

Type ‘python’ in your terminal and watch the magic of pyenv.

python --version
> Python 3.7.8
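
To double-check from inside Python which interpreter pyenv resolved, a small sketch like the one below can help. The helper name `describe_interpreter` is made up for illustration, and the `.pyenv` check assumes pyenv lives in its default location (`~/.pyenv`).

```python
import sys


def describe_interpreter():
    """Return the running Python version and the path of its executable."""
    version = ".".join(str(part) for part in sys.version_info[:3])
    return {
        "version": version,
        "executable": sys.executable,
        # Assumption: pyenv is installed in its default root, ~/.pyenv.
        "managed_by_pyenv": ".pyenv" in (sys.executable or ""),
    }


print(describe_interpreter())
```

If `managed_by_pyenv` comes back as `False` even though pyenv is installed, the init script is probably not being loaded by your shell.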

Venv

Packages in Python are by default installed to a global packages folder. If you want to ensure your code performs the same in the cloud as on your local computer, this is not ideal.

Venv solves this problem by creating virtual environments that are project specific. Packages can be installed to this virtual environment instead of the global scope. By utilizing a requirements.txt file to keep track of packages you want installed for a specific project you can ensure consistency between the cloud and your local development environment.

The venv module has been bundled with Python since 3.3, and has been the recommended way to create virtual environments since 3.5.

To create a venv-config folder run the following in your terminal:

python -m venv venv-config      #Creates the venv-config folder
source venv-config/bin/activate #Takes you inside the virtual env

You are now inside the virtual environment. Feel free to install any packages you want. Preferably these are listed in a requirements.txt file and can be installed with this command:

pip install -r requirements.txt
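
To see whether the active environment actually matches requirements.txt, something along these lines could be used. It is a rough sketch: `parse_pins` and `find_mismatches` are hypothetical helpers, only plain `name==version` lines are understood (extras, markers and version ranges are skipped), and `importlib.metadata` needs Python 3.8 or newer.

```python
from importlib import metadata


def parse_pins(text):
    """Extract {package: version} from simple 'name==version' lines."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line or ";" in line or "[" in line:
            continue  # not a plain pin; out of scope for this sketch
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins, installed_version=metadata.version):
    """Return {package: (wanted, found)} for pins that are not satisfied."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            found = installed_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

Calling `find_mismatches(parse_pins(open("requirements.txt").read()))` from inside the virtual environment returns an empty dict when every pinned package is installed at the pinned version.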

When you want to leave the virtual environment, type:

deactivate
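
If you lose track of whether a virtual environment is active, Python can tell you. A minimal sketch, where `in_virtual_env` is just an illustrative name:

```python
import sys


def in_virtual_env():
    """True when the interpreter runs inside a venv-created environment."""
    # Inside a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the Python it was created from.
    return sys.prefix != sys.base_prefix


print("inside venv:", in_virtual_env())
```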

GCP client libraries

To access Google’s services you need a client library that can be imported and used as a module in your Python code. These come in two forms: one for all discovery-based APIs and one for interacting with services on GCP. There’s a slight overlap between the two; my advice is to use the second one when available.

Google Cloud Python Client

Supports all GCP services. Please note that this is just a container repo; each client has its own library. For example, the Cloud Storage API Client can be found here.

All client libraries can be pip installed (or put in requirements.txt) in the format:

google-cloud-*service*

And imported in your code (main.py) like so:

from google.cloud import *service*
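
As an illustration of the pattern, here is a minimal sketch using Cloud Storage. It assumes `google-cloud-storage` is installed and that credentials are available; `list_bucket_names` and `main` are hypothetical names, and the import sits inside `main` so the helper can be tried out without the library.

```python
def list_bucket_names(client):
    """Return the sorted names of all buckets the client can see."""
    return sorted(bucket.name for bucket in client.list_buckets())


def main():
    # Assumes `pip install google-cloud-storage` and working credentials.
    from google.cloud import storage

    for name in list_bucket_names(storage.Client()):
        print(name)
```

Keeping the client as a parameter also makes it easy to pass in a stub when testing code that should not touch real buckets.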

Please note that there’s also a Google API Python Client. This library is intended to be used for the discovery-based APIs, that is to say Google’s products outside of GCP, for example the Google Analytics API. I found this rather confusing at first.

Google Auth Library

This library contains the authorization layer that all the google-cloud-*service* libraries depend on. For clarity I like putting this in the requirements.txt when I’m specifically using it:

google-auth

And this is how you go about creating a client with specific credentials:

from google.oauth2 import service_account
from google.cloud import storage

credentials = service_account.Credentials.from_service_account_file(
    'path_to_service_account_key.json',
    scopes=['https://www.googleapis.com/auth/devstorage.read_only']
)
storage_client = storage.Client(credentials=credentials)

A full list of available scopes can be found here.
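
Before handing a key file to a client, it can be handy to check which service account it actually belongs to. The sketch below only reads the JSON file with the standard library; `describe_key_file` is a hypothetical helper, and it deliberately never touches the private key field.

```python
import json


def describe_key_file(path):
    """Return the service account e-mail and project stored in a key file."""
    with open(path, encoding="utf-8") as handle:
        key = json.load(handle)
    if key.get("type") != "service_account":
        raise ValueError(f"{path} is not a service account key")
    return {"client_email": key["client_email"], "project_id": key["project_id"]}
```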

Authentication from environment variables

~/.bashrc or ~/.zshrc

Depending on your shell, one of these files is executed when you start a new shell session. By placing initialization calls and the definition of handy environment variables here you…

GOOGLE_APPLICATION_CREDENTIALS

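When deploying functions or apps to GCP, the clients created in your code automatically use the service account attached to the runtime. The client libraries find it through Application Default Credentials (ADC): they first check whether the environment variable GOOGLE_APPLICATION_CREDENTIALS points to a service account JSON key, and otherwise fall back to the service account provided by the environment.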

In order to get the same feature to work when developing and testing code locally you need to set up the environment variable GOOGLE_APPLICATION_CREDENTIALS to have an absolute path to the service account json key that you want to use for local development.

export GOOGLE_APPLICATION_CREDENTIALS="/path_to_key.json"

With this you can initialize clients in this way:

from google.cloud import storage

#Client with credentials from environment variable.
storage_client = storage.Client()
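
When a client fails to authenticate locally, the environment variable is the first thing worth checking. A small sketch using only the standard library; `credentials_status` and its return strings are made up for this example:

```python
import os


def credentials_status(env=None):
    """Describe how GOOGLE_APPLICATION_CREDENTIALS is configured."""
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return "unset"
    if not os.path.isabs(path):
        return "relative path"  # an absolute path is the safer choice
    if not os.path.isfile(path):
        return "file missing"
    return "ok"


print(credentials_status())
```

The optional `env` argument only exists so the function can be tried against a plain dict instead of the real environment.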

Handy ~/.bashrc or ~/.zshrc lines

Instead of memorizing venv-related commands, give them easy-to-remember aliases:

alias venv="python -m venv venv"
alias venva="source venv/bin/activate"
alias pipi="pip install -r requirements.txt"

pyenv:

export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"

Data Engineer @ Valtech — Stockholm, Sweden
