Projects Workflows

These workflows explain how to work with projects for machine learning.

Leaf Segmentation

The leaf segmentation problem consists in finding the areas of an image that contain leaves. The base dataset consists of images together with black/white masks that identify those areas. Here is an example.

[Example: a real leaf image and its corresponding black/white mask]

Obtaining the dataset

The dataset can be found in the dataset list. From there, you can go to the dataset details and create a new project based on it.

Now you can clone the project and follow the steps on dataset usage to download the images. This can be summarized as running dvc pull.

Building a pipeline

Preparing the dataset

The images must be preprocessed before they can be used by a training algorithm. In this case, they need to be resized to a fixed resolution of 224x224. This is achieved with a Python script that finds all the images in the dataset and generates a copy of each one with the desired resolution. The script can be invoked with the following command.

python data_preprocessing.py --dataset dataset --out-dataset preprocessed-dataset
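
The contents of data_preprocessing.py are not shown in this guide. The sketch below illustrates what such a script might look like, assuming Pillow is available and the images are PNG files; treat it as an outline rather than the project's actual script.

# data_preprocessing.py -- hypothetical sketch, not the project's actual script
import argparse
from pathlib import Path

from PIL import Image  # assumes Pillow is listed in requirements.txt

TARGET_SIZE = (224, 224)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", required=True, help="folder with the original images")
    parser.add_argument("--out-dataset", required=True, help="folder for the resized copies")
    args = parser.parse_args()

    out_root = Path(args.out_dataset)
    for path in Path(args.dataset).rglob("*.png"):
        # Keep the same relative layout in the output dataset
        out_path = out_root / path.relative_to(args.dataset)
        out_path.parent.mkdir(parents=True, exist_ok=True)
        Image.open(path).resize(TARGET_SIZE).save(out_path)

if __name__ == "__main__":
    main()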

To automate the execution, we will create a dvc pipeline. Pipelines list the dependencies and outputs of each stage so dvc can track changes and execute only the steps that need to run. The pipeline can be created with the following command.

dvc run \
    -n data_preprocessing \
    -d data_preprocessing.py \
    -d dataset \
    -o preprocessed-dataset \
    --no-exec \
    python data_preprocessing.py \
    --dataset dataset \
    --out-dataset preprocessed-dataset  

This creates the dvc.yaml file, which uses an easier-to-read format.

stages:
  data_preprocessing:
    cmd: python data_preprocessing.py --dataset dataset --out-dataset preprocessed-dataset
    deps:
    - data_preprocessing.py
    - dataset
    outs:
    - preprocessed-dataset

Now we can run dvc repro to execute the pipeline. This generates a dvc.lock file that lists the md5 hashes of the dependencies and outputs, which dvc uses to detect changes.

The output folder will be tracked by dvc and stored in remote storage. This enables other team members to check out the desired version and download the result without having to compute it again.

Split into training, validation & test

The next step is to split the dataset. We can use the run command again to generate the next stage.

dvc run \
    -n split \
    -d split_dataset.py \
    -d preprocessed-dataset \
    -p random_seed \
    -p split \
    -o split \
    --no-exec \
    python split_dataset.py \
    --dataset preprocessed-dataset

This adds the following stage to dvc.yaml:

  split:
    cmd: python split_dataset.py --dataset preprocessed-dataset
    deps:
    - split_dataset.py
    - preprocessed-dataset
    params:
    - random_seed
    - split
    outs:
    - split

This stage generates a split folder containing three txt files, each listing file paths relative to the working directory, dividing the dataset into training, validation and test groups.

The difference with the previous stage is that now there are parameters that can be modified to alter its result. These parameters are defined in the params.yaml file at the root of the repository. As described in the dvc documentation, they are tracked by dvc, but it is the responsibility of the cmd command to read them properly.

random_seed: 398503

split:
    train: 80
    validation: 10
    test: 10
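
For illustration, the snippet below sketches one way split_dataset.py could read these values and produce the three lists, assuming PyYAML is installed. The output file names (train.txt, validation.txt, test.txt), the hard-coded input folder and the image extension are assumptions made for the example, not taken from the project.

# Hypothetical sketch of how split_dataset.py could read params.yaml and write the lists
import random
from pathlib import Path

import yaml  # assumes PyYAML is available

with open("params.yaml") as f:
    params = yaml.safe_load(f)

random.seed(params["random_seed"])

# Collect and shuffle all preprocessed images deterministically
files = sorted(str(p) for p in Path("preprocessed-dataset").rglob("*.png"))
random.shuffle(files)

n = len(files)
n_train = n * params["split"]["train"] // 100
n_val = n * params["split"]["validation"] // 100

out = Path("split")
out.mkdir(exist_ok=True)
out.joinpath("train.txt").write_text("\n".join(files[:n_train]))
out.joinpath("validation.txt").write_text("\n".join(files[n_train:n_train + n_val]))
out.joinpath("test.txt").write_text("\n".join(files[n_train + n_val:]))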

Training

Training follows the same pattern: we can create the stage directly by editing the dvc.yaml file or by using the terminal command. Just remember to add the outputs to .gitignore if you edit the file manually.

  train:
    cmd: python train.py --dataset preprocessed-dataset --split split --out train --metrics metrics.json
    deps:
    - train.py
    - preprocessed-dataset
    - split
    params:
    - train
    outs:
    - train
    metrics:
    - metrics.json:
        cache: false
    plots:
    - training.csv

In this case we have metrics and plots as outputs. If you set cache: false, as done for metrics.json, dvc won't track the file and it will be your responsibility to track it with git. Since metrics.json is a small file, it is better to track it with git.
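
As a rough illustration of how train.py could produce these two outputs, the sketch below writes one CSV row per epoch and a final metrics dictionary. The metric name and the placeholder values are invented for the example; only the file names come from the stage definition above.

# Hypothetical sketch of the output side of train.py
import csv
import json

# Placeholder values standing in for the real training loop, which is omitted here
history = [(0.9, 0.95), (0.6, 0.7), (0.4, 0.55)]  # (loss, val_loss) per epoch
best_iou = 0.71

# One row per epoch so dvc can render the training curve from training.csv
with open("training.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["epoch", "loss", "val_loss"])
    writer.writeheader()
    for epoch, (loss, val_loss) in enumerate(history):
        writer.writerow({"epoch": epoch, "loss": loss, "val_loss": val_loss})

# Final scores consumed by dvc metrics diff
with open("metrics.json", "w") as f:
    json.dump({"val_iou": best_iou}, f)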

Reproduce the experiment

Thanks to dvc, we can now run dvc repro and all the stages will be executed and automatically tracked. After the execution you will notice that the dvc.lock file has been modified and populated with the md5 hashes of the results. You have to commit this file to ensure reproducibility, and run dvc push to upload your results to the remote storage. If you forget to commit dvc.lock, other people won't be able to find your results, and if you forget to run dvc push, they won't be able to download them.

Test different inputs

Now you can modify any of the input parameters and run dvc repro to automatically execute only the steps needed to reproduce the experiment. For example, if you modify the learning_rate parameter, only the final training stage will be executed, since preprocessing and split would yield the same output.

Run experiments in execution endpoints

Soon, you will need a more powerful machine to try real-world parameter configurations or to use a bigger input dataset. Once you get the pipeline working locally with a small sample of data, you can create a commit that adds more data or modifies the parameters to include the whole dataset, and let an execution endpoint reproduce the experiment for you. This can easily be achieved using GitLab Runners.

Prepare the environment

Nowadays, the best way to create environments is using Docker. For this use case, we need an image with the dependencies specified in requirements.txt plus the tooling around dvc. The following multi-stage Dockerfile builds an image with the base dependencies, including jisap-cli, vega-cli and dvc; the final stage just copies over and installs the requirements. It is a good idea to build the base image and publish it to a registry so you don't have to rebuild it every time. For tutorial purposes, all the code is provided here.

## A prebuilt image with jisap-cli, used to copy the binary. The Dockerfile for this image can be found in the jisap-cli source code.
FROM tecnalia-docker-dev.artifact.tecnalia.com/jisap/cli:v0.5.2-buster as cli

FROM node:14.15-buster as base

ENV JISAP_CLI_VERSION v0.5.2
ARG repo_token

RUN apt-get update -y && apt-get install -y \
    python3 \
    python3-pip \
    git \
    libcairo2-dev \
    libpango1.0-dev \
    libjpeg-dev \
    libgif-dev \
    librsvg2-dev \
    libfontconfig-dev

RUN npm config set user 0 && \
    npm install -g vega canvas vega-cli vega-lite

RUN pip3 install dvc

COPY --from=cli /bin/jisap-cli /bin/jisap-cli

CMD ["bash"]

FROM base as production

COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt

You can build the image with the following command.

docker build -t your.registry.com/segmentation:latest .

Configure credentials

You need to define a pipeline variable repo_token containing an access token that can be used to interact with GitLab. This variable is then used to give the different tools access during the pipeline.

GitLab pipeline

We can easily run the pipeline in GitLab with the following code.

reproduce:
  image:
    name: your.registry.com/segmentation:latest
  script:
    - pip install -r requirements.txt
    - dvc remote modify --local jisap-basf password $repo_token
    - dvc pull || true
    - dvc repro

This is not very useful on its own, because the outputs are lost and you won't be able to download them on your local machine. To fix this, we have to commit the changes back to the repository.

In order to create a commit we need two prerequisites:

  • configure user email and user name
  • create a branch

The user and the branch should be the ones that triggered the pipeline. For the user, this can be achieved with environment variables; for the branch, it is required to fetch and hard-reset to origin, because GitLab Runner checks out a detached commit rather than the branch itself. This is performed in the before_script section.

In the commit message, it is a good idea to place the tag [ci skip]. This will prevent pipeline loops.

To push the changes back, we need to update the remote URL to include the repo_token so that the runner has permission to push.

Finally, you should run dvc push to upload artifacts and git push to upload the tracking files.

reproduce:
  image:
    name: your.registry.com/segmentation:latest

  before_script:
    - git config user.email "$GITLAB_USER_EMAIL"
    - git config user.name "$GITLAB_USER_NAME"
    - git fetch
    - git checkout $CI_COMMIT_BRANCH
    - git reset --hard origin/$CI_COMMIT_BRANCH
  script:
    - pip install -r requirements.txt
    - dvc remote modify --local jisap-basf password $repo_token
    - dvc pull || true
    - dvc repro
    - git add .
    - git commit -m "CI repro commit from pipeline [ci skip]" || true
    - git remote set-url origin "https://token:$repo_token@$CI_SERVER_HOST/$CI_PROJECT_PATH.git"
    - dvc push
    - git push
  tags:
    - docker

From your local computer, you can now pull the changes from the branch, run dvc pull and you will automatically download the results from the pipeline reproduction.

Report

Finally, you can generate a report that compares the current branch against the default branch. It is recommended to do this as a separate stage in your GitLab pipeline.

report:
  stage: report
  image:
    name: your.registry.com/segmentation:latest
  ## Generate the report only on non default branches
  rules:
    - if: '$CI_COMMIT_BRANCH != $CI_DEFAULT_BRANCH'
      when: always

  script:
    - git fetch

    ## Update default branch
    - git checkout $CI_DEFAULT_BRANCH
    - git reset --hard origin/$CI_DEFAULT_BRANCH
    - dvc pull training.csv

    ## Update the current branch
    - git checkout $CI_COMMIT_BRANCH
    - git reset --hard origin/$CI_COMMIT_BRANCH
    - dvc pull training.csv

    - touch report.md
    - echo "# $CI_DEFAULT_BRANCH vs $CI_COMMIT_BRANCH" >> report.md
    - dvc params diff --all --show-md $CI_DEFAULT_BRANCH $CI_COMMIT_BRANCH >> report.md
    - dvc metrics diff --all --show-md $CI_DEFAULT_BRANCH $CI_COMMIT_BRANCH >> report.md
    - dvc -v plots diff $CI_DEFAULT_BRANCH $CI_COMMIT_BRANCH --target training.csv --show-vega  > vega.json
    - vl2png vega.json | jisap-cli publish >> report.md

  after_script:
    - jisap-cli comment report.md
  tags:
    - docker

This pipeline is just an example that generates a report against the default branch. You will probably want to design your own reports depending on your needs.