
Domain Specific Language for CI Pipelines

Reducing complexity and increasing maintainability of CI definition files with a DSL

☕ 5 min read · ✍️ Lucas Sas Brunschier

The current state of CI pipeline definition files

Currently, most generic CI systems (for example GitHub Actions or GitLab CI) let their users design pipelines in one or more YAML files. This is fine for project-tailored CI pipelines, because most steps are probably very project specific. But these definition files can get really long without necessarily describing complex pipelines. Many pipelines execute the same step with different parameters, which leads to a lot of duplicated code. Let's look at the example below.

stages:
  - build
  - test

build_arm:
  stage: build
  script:
    - echo "build for arm" # placeholder for the real build command

build_x86:
  stage: build
  script:
    - echo "build for x86" # placeholder for the real build command

test_arm:
  stage: test
  script:
    - echo "test for arm" # placeholder for the real test command

test_x86:
  stage: test
  script:
    - echo "test for x86" # placeholder for the real test command

This pipeline just builds a project for two different target architectures and tests both of them. Unfortunately, describing this use case already requires ~20 lines of code. Of course you could use templates to reduce structural complexity, especially for more sophisticated use cases. But even templates aren't able to reduce the code size to an optimal degree, in my opinion.
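For comparison, here is a sketch of what templating buys you in GitLab CI, using the extends keyword (the job and variable names are my own):

.build:
  stage: build
  script:
    - echo "build for $ARCH" # placeholder for the real build command

build_arm:
  extends: .build
  variables:
    ARCH: arm

build_x86:
  extends: .build
  variables:
    ARCH: x86

The duplication shrinks, but every architecture still costs a dedicated job stanza, and the test stage needs the same treatment all over again.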

Is there a better way?

A project that piqued my interest is Kubeflow. Kubeflow is a framework tailored towards data science and ML1 projects and work groups. It also lets users define a pipeline out of stitched-together Docker containers, which then gets executed remotely. The main difference between GitLab CI and Kubeflow in describing a pipeline is the language used to do so. Kubeflow adds a layer of abstraction in the form of a DSL built on top of Python that lets users import pre-existing components2. These components can be provided by the community through Google's own online platform.
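To give a feel for this, here is a minimal sketch using the Kubeflow Pipelines Python SDK (the v1 kfp package); the component URL, names, and parameters are hypothetical:

from kfp import dsl
from kfp.components import load_component_from_url

# Load a pre-existing component from its definition file (hypothetical URL)
build_op = load_component_from_url(
    'https://raw.githubusercontent.com/.../build/component.yaml')

@dsl.pipeline(name='build-pipeline', description='Builds the project for one architecture')
def build_pipeline(arch: str = 'arm'):
    # The Docker image behind build_op is only pulled on the remote cluster at run time
    build_op(arch=arch)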

What is a Kubeflow component?

You can get a good explanation of Kubeflow components through the official Kubeflow documentation.

Essentially a component consists of two different elements:

  1. A component definition file that contains the following information:
    • Metadata (for example component name or description)
    • Interfaces
    • A link to the Docker image that contains the implementation
  2. The implementation in the form of a Docker image

This makes working with Kubeflow components really comfortable, because you only need to import the component definition file if you want to use a component. At no point during development do you need the implementation (the Docker image) on your local system. Instead, as soon as you schedule a pipeline run on your remote system, it pulls the implementation from a registry. Definition and implementation are thus strictly separated.
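As a sketch, a minimal component definition in Kubeflow's YAML format could look like this (the name, input, and image URI are made up):

name: Build
description: Builds the project for a given target architecture
inputs:
  - {name: arch, type: String}
implementation:
  container:
    image: registry.example.com/build:latest # the implementation lives in a registry
    command: [sh, -c, 'make ARCH="$0"', {inputValue: arch}]

The {inputValue: arch} placeholder is substituted with the concrete argument when the pipeline runs, so the definition file alone is enough to wire the component into a pipeline.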

Then why not use Kubeflow?

Unfortunately, Kubeflow is a very heavyweight framework that is only suited for large-scale organizations or projects. For most use cases it is not feasible to set up and maintain a Kubeflow Kubernetes cluster just so you can run your CI/CD pipelines there.

Proposed Solution

Most companies or work groups already use some sort of CI/CD system (GitLab CI or GitHub Actions come to mind). So a nice solution with minimal setup time would be to take the design principles of Kubeflow components and apply them in a framework that combines components into a pipeline runnable on GitLab CI or GitHub Actions.

Such a framework should support the following features:

  • Component definition files similar to Kubeflow's. A definition should contain the following information:
    • inputs & outputs (potentially even typed)
    • implementation as a Docker image (URI)
    • parameters
  • DSL to describe the pipeline
  • Translator that is responsible for generating a pipeline definition file for a specific target (for example the .gitlab-ci.yml file)

Example

import build from "https://gitlab.com/.../build.yml"
import test from "https://gitlab.com/.../test.yml"

# build
build_1 = build(arch="arm")
build_2 = build(arch="x86")

# test
test(build_1)
test(build_2)

With the build component defined as follows:

metadata:
  name: 'build'
  description: 'multi-arch build component'
image: docker.gitlab.com/.../...:latest
parameters:
  - {name: 'arch', type: 'str'}
entrypoint:
  - 'make ARCH=$arch'
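From this definition and the DSL example above, the translator could then emit a .gitlab-ci.yml roughly like the following; the artifact paths and the test job's script are assumptions about how a generator might behave, not a fixed design:

stages:
  - build
  - test

build_1:
  stage: build
  image: docker.gitlab.com/.../...:latest
  script:
    - make ARCH=arm # entrypoint taken from the component definition
  artifacts:
    paths:
      - build/ # assumed location of the build output

test_1:
  stage: test
  needs: [build_1] # dependency inferred from test(build_1) in the DSL
  script:
    - make test # assumed entrypoint of the test component

# build_2 and test_2 for x86 would be generated the same way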

Component compatibility with Kubeflow

An additional benefit of adopting Kubeflow's design principles is potential support for components that were originally designed for Kubeflow. This would allow users to choose from a huge pool of existing components for their CI/CD pipelines.

Easier migrations to other CI pipeline environments

By having an abstract representation of the CI pipeline and generating all files required by the targeted CI environment automatically, migration to other systems can become trivial. In theory it could be as easy as switching the --target CLI option to another implemented backend (pipeline-gen --definition=pipeline.pi --target=github). Of course implementing the specific targets is not an easy task, but it can probably be done for most CI systems.

Software like the one proposed in this article could grant teams the flexibility to choose and experiment with different CI systems without having to invest a lot of time and effort.

Challenges

The most complicated part of the project would be implementing the source code generator for each desired target platform. Some of the challenges could be:

  • Handling artifacts of pipeline steps or the pipeline as a whole
  • Propagation of artifacts to other steps (see the sketch after this list)
  • Potentially inferring dependencies between steps
  • Support for target specific parameters
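To illustrate the first two points: artifact propagation maps to a completely different mechanism on each target. GitLab CI uses the artifacts and needs keywords shown earlier, while for GitHub Actions the generator would have to emit something like the following sketch (job names and paths are made up):

on: push
jobs:
  build_1:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make ARCH=arm
      - uses: actions/upload-artifact@v4 # hand the build output to later jobs
        with:
          name: build_1
          path: build/
  test_1:
    runs-on: ubuntu-latest
    needs: build_1
    steps:
      - uses: actions/download-artifact@v4 # fetch the artifact produced by build_1
        with:
          name: build_1
          path: build/
      - run: make test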

Summary

Current CI systems like GitLab CI or GitHub Actions can require a lot of boilerplate code. To reduce the amount of code to maintain, I propose a system in which the user defines the pipeline behaviour by stitching together premade pipeline steps using a DSL. An interpreter can then generate all required files from the pipeline definition (in the DSL) for a targeted CI environment.

Additionally, this opens up the possibility of statically analysing the pipeline's behaviour. For example, type checking each step's inputs and outputs would become possible.
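As a hypothetical example, assuming the build component declares a typed output and the test component a matching typed input, the interpreter could reject ill-typed pipelines before anything runs:

build_1 = build(arch="arm") # build declares an output of type 'binary'
test(build_1)               # ok: test expects an input of type 'binary'
test("arm")                 # rejected at generation time: 'str' is not 'binary'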


  1. Machine Learning

  2. Each Kubeflow component consists of a YAML file describing the component interface and the implementation as a Docker container.
