Usage

A dataset is commonly defined as a collection of input data from the same field, aggregated for a specific purpose. In the Imaging Platform, a dataset is a collection of image files, optionally with labels, annotations and/or masks, assembled to work on a specific computer vision problem and to derive an automated solution as the outcome of an AI project.

Imaging Platform datasets are designed to version any file that is too large for a regular Git repository. These are a few examples of files that can be uploaded to a dataset repository:

  • High resolution pictures
  • Image masks
  • Annotations
  • Trained models (large binary files, the outcome of a project)

Dataset creation

A dataset can be created from the web UI by going to the Dataset view, clicking the + New Dataset… button and providing the necessary data. These fields are the same ones required to create a GitLab project.

| Field | Description |
| --- | --- |
| Name | Name of the dataset. It is also used as the GitLab project name. |
| Namespace | Equivalent to a GitLab namespace. Used mainly to control access to the dataset. The namespace must already exist. |
| Description | Short description used for dataset indexing. Synchronized with the GitLab project description. |
| Visibility | Public, Internal or Private, the same as GitLab visibility. It cannot be more permissive than the namespace's visibility. |
| Tags | Identifiers used for later search and classification. |

This action is equivalent to creating a project on GitLab and ensuring that the project is indexed by the platform so it can be searched.

After creation, you will be redirected to the details view.

Storage site selection

On the details view, you will see a section displaying the available storage sites. If you don’t see any, you must register at least one. Use the search controls to find the right storage site, then click it to apply the configuration. The storage site can be selected only once and cannot be changed afterwards.

Read more in the storage sites documentation.

File upload

The platform provides a web application where you can easily upload new images (files) to a previously created dataset. Select the folder you want to upload files to and click the Upload Files button.

This creates an individual DVC tracking file for each uploaded file. To upload large directories, it is better to use the DVC command-line tool.
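
For large directories, the DVC command-line workflow could look like the sketch below. It assumes that the dataset repository has been cloned locally and that a DVC remote pointing at its storage site is already configured. The function name upload_directory and the example path dataset/images are hypothetical; the steps your platform expects may differ.

```shell
#!/usr/bin/env bash
# Sketch: track a whole directory with one DVC tracking file and publish it.
# Assumption: the current directory is a Git + DVC repository with a default remote.
upload_directory() {
  local dir="${1%/}"   # hypothetical argument, e.g. dataset/images

  # Create <dir>.dvc and add <dir> to the .gitignore next to it.
  dvc add "$dir" || return 1

  # Version the small pointer files with Git, not the data itself.
  git add "${dir}.dvc" "$(dirname "$dir")/.gitignore" || return 1
  git commit -m "Track ${dir} with DVC" || return 1

  # Send the data to the DVC remote, then the pointers to the Git remote.
  dvc push || return 1
  git push
}
```

Calling upload_directory dataset/images from the repository root would then produce a single tracking file for the whole directory instead of one per image.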

Versions

Versions can be deleted or updated by uploading, deleting or replacing files directly from the dataset view. Each operation performed is tracked by Git and can be reverted if needed.

Download single files

From the dataset view, you can browse the list of files in the dataset. If a file is in a valid image format, you can see a preview by enabling show preview images. To download a file, click the button to the right of its description.

You can also right-click a file link and copy it to the clipboard. This link can be shared with anyone who has access to the platform as a direct download link. If you need a permanent URL, add the query parameter ref with the Git reference you want to download. Typically this will be a Git tag or a commit hash.

Example download with curl

curl -k -b "tokenGitLab=token" "https://jisap.tecnalia.com/api/v1/datasets/4473/files/dataset/VITVI/1-410fad37671c209905a07d7b6867a9d1.jpg?ref=v0.1.0" --output "1-410fad37671c209905a07d7b6867a9d1.jpg"
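
The same request can be parameterized with a small helper, sketched below. The function name build_file_url is hypothetical, and the URL layout is taken only from the curl example above, so treat it as an assumption rather than a documented API contract. The ref value is appended as-is, without URL encoding.

```shell
#!/usr/bin/env bash
# Sketch: build a dataset file URL, optionally pinned to a Git reference.
# Assumption: the path layout matches the curl example in this section.
build_file_url() {
  local base_url="$1"
  local project_id="$2"
  local file_path="$3"
  local ref="${4:-}"   # optional Git tag or commit hash

  # Drop a trailing slash on the base and a leading slash on the file path.
  local url="${base_url%/}/api/v1/datasets/${project_id}/files/${file_path#/}"
  if [ -n "$ref" ]; then
    url="${url}?ref=${ref}"
  fi
  printf '%s\n' "$url"
}
```

The result can be passed straight to curl, for example: curl -b "tokenGitLab=token" "$(build_file_url "https://jisap.tecnalia.com" 4473 "dataset/VITVI/1-410fad37671c209905a07d7b6867a9d1.jpg" "v0.1.0")" --output "1-410fad37671c209905a07d7b6867a9d1.jpg"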

Example download with jisap-cli

jisap-cli download --project-id 4473 --ref v0.1.0 --output jisap-cli2 "1-410fad37671c209905a07d7b6867a9d1.jpg" "/dataset/VITVI/1-410fad37671c209905a07d7b6867a9d1.jpg"

Download full dataset

If you want a local copy of the complete dataset, you can download it with the download files button. The downloaded files are those in the dataset's Git repository, which hold the pointers to the versions of the dataset. Keep in mind that, for performance reasons, the actual data files are not downloaded directly. To get them, you must have DVC installed locally and verified access to the external storage site where the dataset is actually stored.
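
Fetching the actual data with DVC could look like the following sketch. It assumes you work from a Git clone of the dataset repository rather than the downloaded copy, and that your credentials for the storage site are already configured. The function name fetch_dataset and its arguments are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: clone the dataset repository (pointers only), then pull the data with DVC.
# Assumption: access to both the Git repository and the storage site is set up.
fetch_dataset() {
  local repo_url="$1"     # hypothetical: Git clone URL of the dataset project
  local target_dir="$2"   # local directory to clone into
  local ref="${3:-}"      # optional Git tag or commit hash

  git clone "$repo_url" "$target_dir" || return 1
  (
    # Subshell so the caller's working directory is left unchanged.
    cd "$target_dir" || exit 1
    if [ -n "$ref" ]; then
      git checkout "$ref" || exit 1
    fi
    # Download the files referenced by the DVC pointers from the storage site.
    dvc pull
  )
}
```

Passing a tag such as v0.1.0 as the third argument pins the local copy to that version of the dataset.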

Dataset visibility and permissions

From the dataset view, you can browse the datasets for which you have at least read privileges. These privileges are derived directly from GitLab access, so you can change access in GitLab to control access on the platform. Keep in mind that a dataset consists of a repository that stores versions (the dataset) and a repository that stores data (the storage site). You need access to both of them for full functionality.

Start a new project

From the dataset details view, you can click the “start a new project” button to create a new project based on this dataset. You can find more information in project usage.