We recently configured our own Kubernetes (K8s) cluster using MicroK8s and learned how to use an operator to deploy a PostgreSQL server within it. Now we will learn how to create k8s jobs that populate a database within our cluster.
Creating our Database
This exercise assumes that we have the PostgreSQL operator installed in our cluster; if not, we may need to follow the steps from the previous post.
First we will delete the database that we created previously.
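Assuming the database from the previous post was created from a manifest file, deleting that manifest removes the cluster and its pods; adjust the file name to whatever was actually used there:

```sh
kubectl delete -f movies-db.yml
```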
We will put everything that we are going to use in a directory that we can later push to a repository, so we will start with a yaml similar to the one we originally wrote, movies-db.yml:
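A sketch of what movies-db.yml could look like, assuming the Zalando postgres-operator from the previous post; the teamId and the PostgreSQL version are assumptions:

```yaml
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: movies-db
spec:
  teamId: "movies"          # cluster names are prefixed with the team id
  volume:
    size: 2Gi               # 2 gibibytes for the data volume
  numberOfInstances: 2      # one master and one replica
  users:
    moviesdba:              # admin user
      - superuser
      - createdb
    moviesuser: []          # normal user
  databases:
    movies: moviesdba       # database name: owner
  postgresql:
    version: "11"
```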
We are requesting 2 gibibytes for the volume of our database named movies, an admin user named moviesdba and a normal user named moviesuser, and we would like to have two instances: one will be our master and the other our replica.
For those not familiar with gibibytes (Gi) as opposed to gigabytes (G), you may check IEEE 1541-2002, the binary prefixes defined by the International Electrotechnical Commission (IEC), or this article on Wikipedia.
Now we can create the database by executing:
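```sh
kubectl create -f movies-db.yml
```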
Now we need to wait until our database is running; we can check its status until we get this output:
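One way to check, assuming the labels that the Zalando operator puts on its pods; the cluster should eventually report Running and both pods should be up:

```sh
kubectl get postgresql movies-db
kubectl get pods -l cluster-name=movies-db
```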
And then inspect the nodes for our database with:
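A sketch assuming the Zalando operator labels; the spilo-role column shows which pod is the master and which is the replica, and -o wide shows the node each pod runs on:

```sh
kubectl get pods -l cluster-name=movies-db -L spilo-role -o wide
```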
Creating a simple Job
Now that we have our database we can start creating our Job, but first we need to learn how Jobs work in k8s. A k8s job is a task that runs until completion, which means that if the job never ends it will run forever.
We can tell k8s to deploy a container, based on an existing image, and run it until it ends. For this we will use a yaml file to describe it, which in this example will be database-load.yml:
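A minimal sketch of database-load.yml; the job name, the busybox image and the echo command are assumptions for this first test:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: movies-load
spec:
  template:
    spec:
      containers:
        - name: movies-load
          image: busybox
          command: ["echo", "hello from our movies load job"]
      restartPolicy: Never
  backoffLimit: 0
```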
To run this job we can just do:
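```sh
kubectl create -f database-load.yml
```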
To check the job that we have created we can do:
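```sh
kubectl get jobs
```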
We see that our job ran for 3s; COMPLETIONS shows 1/1 because it has already finished and is not running anymore. Let's check the pods for our job:
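k8s adds a job-name label to the pods created for a job, so we can filter on it (assuming the job name used above):

```sh
kubectl get pods -l job-name=movies-load
```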
We can see that we have a pod for our job that is not running and has completed; we can check the log for our run with:
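Assuming the job is named movies-load, kubectl can resolve the pod for us:

```sh
kubectl logs job/movies-load
```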
Rerunning our Job
Our job is complete, but we may want to run it again; we could try doing:
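```sh
kubectl create -f database-load.yml   # rejected: a job with this name already exists
```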
This is because that job already exists, so we cannot create it again; however, we can delete it first:
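Again assuming the job name movies-load:

```sh
kubectl delete job movies-load
kubectl create -f database-load.yml
```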
And we can get the log as before:
But this is not great; there is a different way to do this. First we will delete our job:
We will modify our database-load.yml to use generateName instead of name :
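A sketch of the modified database-load.yml; besides generateName we also add a custom job-group label (used further below), both on the job and on its pod template so it ends up on the pods as well:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  generateName: movies-load-run-   # k8s appends a random suffix on each create
  labels:
    job-group: movie-load
spec:
  template:
    metadata:
      labels:
        job-group: movie-load
    spec:
      containers:
        - name: movies-load
          image: busybox
          command: ["echo", "hello from our movies load job"]
      restartPolicy: Never
  backoffLimit: 0
```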
And now let’s do a couple of runs :
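Since generateName only works with create (not apply), each create launches a brand new job:

```sh
kubectl create -f database-load.yml
kubectl create -f database-load.yml
```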
Now we can get our jobs with:
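```sh
kubectl get jobs -l job-group=movie-load
```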
And the pods that they ran with:
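```sh
kubectl get pods -l job-group=movie-load
```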
We are doing this using the custom label that we added, named job-group, and we can even use it to get the logs:
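```sh
kubectl logs -l job-group=movie-load
```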
We can delete our jobs using the group as well:
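```sh
kubectl delete jobs -l job-group=movie-load
```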
One interesting fact is that when we run a job a pod is created and remains in our cluster until it is restarted; we may not want that, in order to free cluster resources, so we may want to delete the jobs.
There is an alpha feature of k8s, the TTL Controller for Finished Resources, that allows this, but since it is alpha it may change in the future, so we are not going to use it for now.
Visualizing our Jobs in Kibana
Since we want to delete our jobs, once we do so we can no longer use kubectl to view their past logs; however, we can use Kibana, as it was set up in our MicroK8s cluster, to visualize our past jobs.
For example, we can just filter using the simple expression: kubernetes.labels.job-group: "movie-load"
Or we can explore a particular run with: kubernetes.labels.job-name: "movies-load-run-mrn7f"
Creating our own image
So far we have used an existing image for our job; however, we will need to create our own in order to add the dependencies we need, such as the PostgreSQL client. For that we need a tool for creating images, and we are going to use docker.
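A sketch of a typical docker-ce installation on an Ubuntu-based distribution; if your distribution packages docker differently, follow its own instructions instead:

```sh
sudo apt-get update
sudo apt-get install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository \
  "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
sudo apt-get update
sudo apt-get install -y docker-ce
```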
Note: there is a bug in Linux Mint 19.1 where add-apt-repository does not work; you could add the repository manually as described here.
We will also need to add our user to the docker group, so we do not need to use sudo:
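```sh
sudo usermod -aG docker $USER
```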
After this we should restart our system.
Then we need to configure in docker the registry that we have in our MicroK8s cluster, modifying the file /etc/docker/daemon.json as superuser:
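The MicroK8s built-in registry listens on port 32000 and is not secured with TLS, so we declare it as an insecure registry; daemon.json could look like this:

```json
{
  "insecure-registries": ["localhost:32000"]
}
```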
After that we will need to restart the docker engine :
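```sh
sudo systemctl restart docker
```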
Now we will start to create our Dockerfile :
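A minimal sketch of the first Dockerfile; the ubuntu base image and the job.sh entry script (a simple shell script next to the Dockerfile that, for now, can just echo a message) are assumptions:

```dockerfile
FROM ubuntu:18.04

# the script that the job will execute
COPY job.sh /usr/local/bin/job.sh
RUN chmod +x /usr/local/bin/job.sh

CMD ["job.sh"]
```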
Now we can build our image with:
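We tag the image with the registry address so we can push it later; the image name movies-load is an assumption:

```sh
docker build . -t localhost:32000/movies-load:latest
```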
If we would like to test our image locally we can do:
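```sh
docker run --rm localhost:32000/movies-load:latest
```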
Now we will push our image into the registry:
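```sh
docker push localhost:32000/movies-load:latest
```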
We need to modify the database-load.yml for our job to refer to the new image:
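Only the container image changes; the rest of database-load.yml stays as before:

```yaml
      containers:
        - name: movies-load
          image: localhost:32000/movies-load:latest
```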
Now we could run our job as before :
And get our logs as we did previously :
To make things simpler, let's create a script named build.sh to build and push our image:
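A sketch of build.sh, using the same image name as above:

```sh
#!/bin/sh
docker build . -t localhost:32000/movies-load:latest
docker push localhost:32000/movies-load:latest
```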
Let's make it executable with:
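```sh
chmod +x build.sh
```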
Now, to build and push our image, we can just do:
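```sh
./build.sh
```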
Adding a PostgreSQL Client
Since our job will load data into our database, we will add the PostgreSQL client to our Dockerfile:
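The Dockerfile with the client added; the postgresql-client package name assumes the ubuntu base image:

```dockerfile
FROM ubuntu:18.04

RUN apt-get update && \
    apt-get install -y postgresql-client && \
    rm -rf /var/lib/apt/lists/*

COPY job.sh /usr/local/bin/job.sh
RUN chmod +x /usr/local/bin/job.sh

CMD ["job.sh"]
```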
And we will change our job to test the client :
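For now job.sh can simply check that the client is available; what exactly the test does is an assumption:

```sh
#!/bin/sh
psql --version
```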
To make things simpler, we will create a new script named run.sh:
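A sketch of run.sh; the assumption here is that it rebuilds the image and then launches a new job:

```sh
#!/bin/sh
./build.sh
kubectl create -f database-load.yml
```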
Lets make it executable as well :
Now we can easily build and run our changes with:
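```sh
./run.sh
```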
Connecting to our database
In order to connect to our database we require credentials, so first we will inject them into our job by changing our database-load.yml:
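A sketch of the env section added to the container. The secret name follows the Zalando operator convention for the moviesdba user of the movies-db cluster and is an assumption, as is exposing the credentials through the libpq PG* environment variables:

```yaml
      containers:
        - name: movies-load
          image: localhost:32000/movies-load:latest
          env:
            - name: PGHOST
              value: movies-db        # the master service created by the operator
            - name: PGDATABASE
              value: movies
            - name: PGUSER
              valueFrom:
                secretKeyRef:
                  name: moviesdba.movies-db.credentials.postgresql.acid.zalan.do
                  key: username
            - name: PGPASSWORD
              valueFrom:
                secretKeyRef:
                  name: moviesdba.movies-db.credentials.postgresql.acid.zalan.do
                  key: password
```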
Now we will modify our script job.sh to use them:
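Assuming the PG* variables above, psql picks up the connection details from the environment:

```sh
#!/bin/sh
set -e

# run the initialization script against our database
psql -f /init.sql
```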
This script will run a file named init.sql against our database; let's create it:
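As a placeholder, init.sql can start with something trivial just to prove the connection works; the actual content is an assumption:

```sql
-- placeholder: just show which server we are connected to
SELECT version();
```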
But we need to add this file to our Dockerfile:
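One extra line, assuming we keep the script at the image root:

```dockerfile
COPY init.sql /init.sql
```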
Finally we could build and run our job :
With this running, we can add to our init.sql the creation of the tables for our database:
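A sketch of the table creation, assuming the MovieLens csv layout and keeping the rating timestamp as epoch seconds; the scripts in the repository also add foreign keys and indexes:

```sql
DROP TABLE IF EXISTS ratings;
DROP TABLE IF EXISTS movies;

CREATE TABLE movies (
    movie_id INTEGER PRIMARY KEY,
    title    TEXT NOT NULL,
    genres   TEXT
);

CREATE TABLE ratings (
    user_id   INTEGER       NOT NULL,
    movie_id  INTEGER       NOT NULL,
    rating    NUMERIC(2, 1) NOT NULL,
    rated_at  BIGINT        NOT NULL   -- epoch seconds, as in the csv
);
```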
We will create some tables from our job using these SQL statements; let's build and run our job:
We got some messages because our tables didn't exist yet, but if we run our job again:
Downloading a file with movies data
Now that we are able to create our tables we will download movies data from MovieLens, but first let's modify our database-load.yml to add a volume where we can temporarily store this data, using emptyDir:
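A sketch of the relevant part of the pod template; the /data mount path is an assumption, and the env section from before is omitted here for brevity:

```yaml
    spec:
      volumes:
        - name: movies-data
          emptyDir: {}
      containers:
        - name: movies-load
          image: localhost:32000/movies-load:latest
          volumeMounts:
            - name: movies-data
              mountPath: /data
```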
An emptyDir is able to store information for our pod for as long as it runs, and disappears when our pod/job ends. We will also need more tools in our container, so we will first modify our Dockerfile, adding wget and unzip:
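Only the RUN line of the Dockerfile changes:

```dockerfile
RUN apt-get update && \
    apt-get install -y postgresql-client wget unzip && \
    rm -rf /var/lib/apt/lists/*
```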
Now we will modify our job to download our movies file, updating our job.sh:
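A sketch of the updated job.sh; the MovieLens dataset URL and the paths under /data are assumptions:

```sh
#!/bin/sh
set -e

cd /data

# download and unpack the MovieLens dataset into the emptyDir volume
wget -q http://files.grouplens.org/datasets/movielens/ml-latest.zip
unzip -o ml-latest.zip

# create the tables
psql -f /init.sql
```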
Now we can run our job, but this time we will follow the log, since it will take longer:
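Using the job-group label again to follow whatever pod the new run created (assuming older runs have already been cleaned up):

```sh
./run.sh
kubectl logs -f -l job-group=movie-load
```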
Uploading csv data into our database
Now that we have the data downloaded, we will create a script that inserts the csv data into our database; we will name it load.sql:
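A sketch of load.sql, matching the tables above; \copy runs on the client side, so the csv files only need to exist inside the job's container:

```sql
-- start from a clean state so the job can be re-run
TRUNCATE TABLE ratings;
TRUNCATE TABLE movies;

\copy movies (movie_id, title, genres) from '/data/ml-latest/movies.csv' with csv header
\copy ratings (user_id, movie_id, rating, rated_at) from '/data/ml-latest/ratings.csv' with csv header
```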
We need to add this new SQL script into our Dockerfile :
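```dockerfile
COPY load.sql /load.sql
```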
After this we need to modify our job.sh to run our new sql :
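Just one more psql invocation at the end of job.sh:

```sh
# create the tables and then load the csv data
psql -f /init.sql
psql -f /load.sql
```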
Now we could run our job to load our database:
This job has loaded 58k movies and 2.7M ratings, from a 2GB file, into our database in about 4 minutes, 2 minutes of which were just to import the data; not bad at all.
Exploring the loaded data
Let's explore the data using psql. We will log in to our server with our moviesdba user, so we need to get its password; we can get it with:
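Assuming the Zalando operator's secret naming convention for the moviesdba user of the movies-db cluster:

```sh
kubectl get secret moviesdba.movies-db.credentials.postgresql.acid.zalan.do \
  -o 'jsonpath={.data.password}' | base64 -d
```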
We will forward the postgres port 5432 on our master to our localhost port 6432; this will keep running until we press Ctrl+C:
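One way to do it, assuming the Zalando operator labels to find the current master pod:

```sh
kubectl port-forward "$(kubectl get pods -l cluster-name=movies-db,spilo-role=master -o name)" 6432:5432
```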
Finally, we can connect to our database with the provided user and password, using psql in another shell:
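```sh
psql -h localhost -p 6432 -U moviesdba movies
```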
Conclusions
With this we have completed this article. We have learned tons of new concepts, tools and information to create our own jobs that can run in our cluster, and now we have a database populated with data that will be used in further examples.
If you would like to download the scripts for this example, you can grab them from this GitHub repository.
Note: 22 Dec 2019: I've updated the SQL scripts to add some foreign keys and indexes to speed up queries on the database; they are updated in the repository.