This workflow explains how to work with machine learning projects.
The leaf segmentation problem consists of finding the areas of an image that contain leaves. The base dataset consists of images with black/white masks that identify those areas. Here is an example.
The dataset can be found in the dataset list. From here, you can go to the details and create a new project based on the dataset.
Now you can clone the project and follow the steps in dataset usage to download the images. This boils down to running dvc pull.
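In short, the local setup looks like this (the repository URL is a placeholder for your own project):
git clone https://your.gitlab.example.com/your-group/leaf-segmentation.git
cd leaf-segmentation
dvc pull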
The images must be preprocessed before they can be used by a training algorithm. In this case, they need to be resized to a fixed resolution of 224x224. This is done by a Python script that finds all the images in the dataset and generates a copy at the desired resolution. The script can be invoked with the following command.
python data_preprocessing.py --dataset dataset --out-dataset preprocessed-dataset
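The exact script is project specific, but a minimal sketch of what data_preprocessing.py could look like, assuming Pillow is available and the images and masks are stored as PNG files, is the following.
# data_preprocessing.py -- illustrative sketch, not the exact project script
import argparse
from pathlib import Path

from PIL import Image  # Pillow, assumed to be listed in requirements.txt


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", required=True)
    parser.add_argument("--out-dataset", required=True)
    parser.add_argument("--size", type=int, default=224)
    args = parser.parse_args()

    out_root = Path(args.out_dataset)
    for path in Path(args.dataset).rglob("*.png"):
        # Keep the original folder structure in the output dataset
        target = out_root / path.relative_to(args.dataset)
        target.parent.mkdir(parents=True, exist_ok=True)
        with Image.open(path) as img:
            img.resize((args.size, args.size)).save(target)


if __name__ == "__main__":
    main()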
To automate the execution, we will create a DVC pipeline. The pipeline lists the dependencies and outputs of each stage, so DVC can track changes and execute only the steps that need to run. The pipeline can be created with the following command.
dvc run \
-n data_preprocessing \
-d data_preprocessing.py \
-d dataset \
-o preprocessed-dataset \
--no-exec \
python data_preprocessing.py \
--dataset dataset \
--out-dataset preprocessed-dataset
This creates the dvc.yaml file, which uses an easier-to-read format.
stages:
  data_preprocessing:
    cmd: python data_preprocessing.py --dataset dataset --out-dataset preprocessed-dataset
    deps:
      - data_preprocessing.py
      - dataset
    outs:
      - preprocessed-dataset
Now we can run dvc repro to execute the pipeline. This generates a dvc.lock file that lists the MD5 hashes of the dependencies and outputs, which DVC uses to detect changes.
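After the run, dvc.lock contains entries like the following (the exact layout varies slightly between DVC versions, and the hashes below are placeholders):
data_preprocessing:
  cmd: python data_preprocessing.py --dataset dataset --out-dataset preprocessed-dataset
  deps:
  - path: data_preprocessing.py
    md5: 1f2e3d4c5b6a79880917263545362718
  - path: dataset
    md5: 0a1b2c3d4e5f60718293a4b5c6d7e8f9.dir
  outs:
  - path: preprocessed-dataset
    md5: f0e1d2c3b4a5968778695a4b3c2d1e0f.dir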
The output folder is tracked by DVC and stored in remote storage. This enables other team members to check out the desired version and download the result without having to compute it again.
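For example, a teammate who checks out the commit can fetch the preprocessed images without recomputing them:
git pull
dvc pull preprocessed-dataset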
The next step is to split the dataset. We can use dvc run again to generate the next stage.
dvc run \
-n split \
-d split_dataset.py \
-d preprocessed-dataset \
-p random_seed \
-p split \
-o split \
--no-exec \
python split_dataset.py \
--dataset preprocessed-dataset
split:
  cmd: python split_dataset.py --dataset preprocessed-dataset
  deps:
    - split_dataset.py
    - preprocessed-dataset
  params:
    - random_seed
    - split
  outs:
    - split
This stage generates a split folder containing three .txt files, each listing file paths relative to the working directory, dividing the dataset into training, validation and test groups.
The difference from the previous step is that this stage has parameters that can be modified to alter its result. These parameters are defined in the params.yaml file in the root of the repository. As described in the DVC documentation, the parameters are tracked by DVC, but it is the responsibility of the cmd command to read them.
random_seed: 398503
split:
  train: 80
  validation: 10
  test: 10
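A minimal sketch of how split_dataset.py could read these parameters and produce the three lists follows (PyYAML is assumed to be available; the file names follow the description above).
# split_dataset.py -- illustrative sketch of how the parameters could be consumed
import argparse
import random
from pathlib import Path

import yaml  # PyYAML, assumed to be in requirements.txt


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", required=True)
    args = parser.parse_args()

    # DVC tracks params.yaml, but reading it is the script's responsibility
    with open("params.yaml") as f:
        params = yaml.safe_load(f)
    random.seed(params["random_seed"])

    files = sorted(str(p) for p in Path(args.dataset).rglob("*.png"))
    random.shuffle(files)

    n = len(files)
    n_train = n * params["split"]["train"] // 100
    n_val = n * params["split"]["validation"] // 100

    out = Path("split")
    out.mkdir(exist_ok=True)
    (out / "train.txt").write_text("\n".join(files[:n_train]))
    (out / "validation.txt").write_text("\n".join(files[n_train:n_train + n_val]))
    (out / "test.txt").write_text("\n".join(files[n_train + n_val:]))


if __name__ == "__main__":
    main()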
Training follows the same pattern: we can create the stage directly by editing the dvc.yaml file or use the terminal command. Just remember to add the outputs to .gitignore if you edit the file manually.
train:
  cmd: python train.py --dataset preprocessed-dataset --split split --out train --metrics metrics.json
  deps:
    - train.py
    - preprocessed-dataset
    - split
  params:
    - train
  outs:
    - train
  metrics:
    - metrics.json:
        cache: false
  plots:
    - training.csv
In this case we also have metrics and plots as outputs. If you set cache: false, as in metrics.json, DVC won't track the file and it will be your responsibility to track it with Git. Since metrics.json is a small file, it is better to let Git track it.
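The training code itself is out of scope here, but as an illustration of how these outputs could be produced, a hypothetical train.py might end with something like the following.
# Illustrative sketch of how train.py could emit the metrics and plot files
import csv
import json


def save_outputs(history, final_metrics):
    # training.csv is consumed by `dvc plots` (one row per epoch)
    with open("training.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["epoch", "loss", "val_loss"])
        writer.writeheader()
        writer.writerows(history)

    # metrics.json is small, kept out of the DVC cache and tracked by Git
    with open("metrics.json", "w") as f:
        json.dump(final_metrics, f, indent=2)


# Hypothetical usage with placeholder values:
# save_outputs(history=[{"epoch": 0, "loss": 0.9, "val_loss": 0.95}],
#              final_metrics={"val_iou": 0.81})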
Thanks to DVC, we can now run dvc repro and all the stages will be executed and automatically tracked. After the execution you will notice that the dvc.lock file has been modified and populated with the MD5 hashes of the results. You have to commit this file to ensure reproducibility, and run dvc push to upload your results to the remote storage. If you forget to commit dvc.lock, other people won't be able to find your results, and if you forget to run dvc push, they won't be able to download them.
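In practice, after a local run this amounts to:
git add dvc.yaml dvc.lock .gitignore
git commit -m "Reproduce experiment"
dvc push   # upload the outputs to the remote storage
git push   # publish dvc.lock so others can locate the results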
Now you can modify any of the input parameters and run dvc repro to automatically execute only the steps needed to reproduce the experiment. For example, if you modify the learning_rate parameter, only the final training step will be executed, since preprocessing and splitting would yield the same output.
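For that to work, learning_rate would live under the train section of params.yaml, for example (these keys and values are illustrative, not the project's actual configuration):
train:
  learning_rate: 0.001
  epochs: 50
  batch_size: 16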
Soon, you will need a more powerful machine to try real-world parameter configurations or to use a bigger input dataset. Once everything works locally with a small sample of data, you can create a commit that adds more data or modifies the parameters to include the whole dataset, and let an execution endpoint reproduce the experiment for you. This can easily be achieved using GitLab Runners.
Nowadays, the best way to create reproducible environments is Docker. For this use case, we need an image with the dependencies specified in requirements.txt plus the tooling around DVC. This multi-stage Dockerfile builds an image with the base dependencies, including jisap-cli, vega-cli and dvc; the final stage just copies over and installs the requirements. It is a good idea to build the base image and publish it to a registry so you don't have to rebuild it every time. For tutorial purposes, all the code is provided here.
## A prebuilt image with jisap-cli, used to copy the binary. You can find the Dockerfile for this image in the jisap-cli source code.
FROM tecnalia-docker-dev.artifact.tecnalia.com/jisap/cli:v0.5.2-buster as cli
FROM node:14.15-buster as base
ENV JISAP_CLI_VERSION v0.5.2
ARG repo_token
RUN apt-get update -y && apt-get install -y \
    python3 \
    python3-pip \
    git \
    libcairo2-dev \
    libpango1.0-dev \
    libjpeg-dev \
    libgif-dev \
    librsvg2-dev \
    libfontconfig-dev
RUN npm config set user 0 && \
npm install -g vega canvas vega-cli vega-lite
RUN pip3 install dvc
COPY --from=cli /bin/jisap-cli /bin/jisap-cli
CMD ["bash"]
FROM base as production
COPY requirements.txt requirements.txt
RUN pip3 install -r requirements.txt
You can build the image with the following command.
docker build -t your.registry.com/segmentation:latest .
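The Dockerfile declares ARG repo_token; if the build needs it (for example to install private dependencies), you can pass it explicitly with Docker's standard --build-arg flag, shown here with a shell variable:
docker build --build-arg repo_token=$repo_token -t your.registry.com/segmentation:latest .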
You need to define a pipeline variable repo_token containing an access token that can be used to interact with GitLab. We will use this variable to give the different tools access during the pipeline.
We can easily run the pipeline in GitLab with the following code.
reproduce:
  image:
    name: your.registry.com/segmentation:latest
  script:
    - pip install -r requirements.txt
    - dvc remote modify --local jisap-basf password $repo_token
    - dvc pull || true
    - dvc repro
This is not very useful on its own, because the outputs are lost and you won't be able to download them to your local machine. To fix this, we have to commit the changes back to the repository.
In order to create a commit from the pipeline, we need two prerequisites.
First, the user and the branch should be the ones that triggered the pipeline. The user can be set through environment variables, and for the branch it is necessary to fetch and hard reset to origin, due to GitLab Runner checkout optimizations. This is done in the before_script section.
Second, it is a good idea to place the tag [ci skip] in the commit message. This prevents pipeline loops.
To push the changes back, we need to update the remote URL to include the repo_token so we can actually push.
Finally, run dvc push to upload the artifacts and git push to upload the tracking files.
reproduce:
  image:
    name: your.registry.com/segmentation:latest
  before_script:
    - git config user.email "$GITLAB_USER_EMAIL"
    - git config user.name "$GITLAB_USER_NAME"
    - git fetch
    - git checkout $CI_COMMIT_BRANCH
    - git reset --hard origin/$CI_COMMIT_BRANCH
  script:
    - pip install -r requirements.txt
    - dvc remote modify --local jisap-basf password $repo_token
    - dvc pull || true
    - dvc repro
    - git add .
    - git commit -m "CI repro commit from pipeline [ci skip]" || true
    - git remote set-url origin "https://token:$repo_token@$CI_SERVER_HOST/$CI_PROJECT_PATH.git"
    - dvc push
    - git push
  tags:
    - docker
From your local computer, you can now pull the changes from the branch and run dvc pull to automatically download the results produced by the pipeline reproduction.
It is also useful to generate a report comparing the experiment against the default branch; it is recommended to do this as a separate stage in your GitLab pipeline.
report:
  stage: report
  image:
    name: your.registry.com/segmentation:latest
  ## Generate the report only on non-default branches
  rules:
    - if: '$CI_COMMIT_BRANCH != $CI_DEFAULT_BRANCH'
      when: always
  script:
    - git fetch
    ## Update the default branch
    - git checkout $CI_DEFAULT_BRANCH
    - git reset --hard origin/$CI_DEFAULT_BRANCH
    - dvc pull training.csv
    ## Return to the current branch
    - git checkout $CI_COMMIT_BRANCH
    - git reset --hard origin/$CI_COMMIT_BRANCH
    - dvc pull training.csv
    - touch report.md
    - echo "# $CI_DEFAULT_BRANCH vs $CI_COMMIT_BRANCH" >> report.md
    - dvc params diff --all --show-md $CI_DEFAULT_BRANCH $CI_COMMIT_BRANCH >> report.md
    - dvc metrics diff --all --show-md $CI_DEFAULT_BRANCH $CI_COMMIT_BRANCH >> report.md
    - dvc -v plots diff $CI_DEFAULT_BRANCH $CI_COMMIT_BRANCH --target training.csv --show-vega > vega.json
    - vl2png vega.json | jisap-cli publish >> report.md
  after_script:
    - jisap-cli comment report.md
  tags:
    - docker
This pipeline is just an example that generates a report against the default branch. You will probably want to design your own reports depending on your needs.