Technical Details

Dataset management relies on DVC to version large files. The platform is tightly integrated with GitLab, Git, and DVC, so that any operation performed locally is reflected on, and synchronized with, the platform.

Git Integration

The versions of a dataset are represented as Git branches in the repository. Whenever a modification is made from the frontend, a new commit is created to store the changes.

Although the dataset view does not support merging branches, a local developer can use standard Git tools to perform any operation, and the platform will still display the branches as different versions.
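As a rough illustration of the branch-to-version mapping described above, the sketch below turns the output of `git for-each-ref` into a list of version names. The function names `branches_to_versions` and `list_local_versions` are hypothetical and are not part of the platform; how the platform itself enumerates branches is not documented here.

```python
import subprocess


def branches_to_versions(ref_output: str) -> list[str]:
    """Turn one-branch-per-line git output into a sorted list of version names."""
    names = [line.strip() for line in ref_output.splitlines()]
    # Drop empty lines so trailing newlines do not produce blank versions.
    return sorted(name for name in names if name)


def list_local_versions(repo_dir: str) -> list[str]:
    """Run git inside repo_dir and return its local branches as version names."""
    result = subprocess.run(
        ["git", "for-each-ref", "--format=%(refname:short)", "refs/heads"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return branches_to_versions(result.stdout)
```

Calling `list_local_versions(".")` inside a cloned dataset repository should list every local branch, including branches created or merged locally with standard Git tools.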

GitLab Integration

A dataset's name, namespace, description, tags, and logo fields are bidirectionally synchronized with the respective fields on GitLab. A change made in the dataset view is reflected on GitLab, and a direct modification on GitLab is reflected on the platform.

When a dataset/project is created through the platform and the user is an Owner or Maintainer of the group, the Imaging Platform functional user is added as a member. If the user is a Developer, the functional user creates the dataset/project and adds the user as Maintainer. This functional user represents the platform's direct access to GitLab. Periodically, the platform scans the projects in which its functional user is a member and adds them to the search list. As a result, changes to the fields listed above that are made directly on GitLab appear in the platform search engine only after a delay. If the fields are modified through the platform API, the update is automatic.
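The role-dependent creation rule can be summarized in a few lines. This is only a sketch of the decision described above, not the platform's code: the function `plan_creation` and the dictionary keys are hypothetical, and the role given to the functional user when it is added as a member is not stated in this document, so it is left as `None`.

```python
def plan_creation(user_role: str) -> dict:
    """Describe who creates the dataset/project and who is added as a member."""
    role = user_role.strip().lower()
    if role in ("owner", "maintainer"):
        # The user creates the project; the functional user joins as a member.
        # The functional user's role is not specified in the text, hence None.
        return {"creator": "user", "added_member": "functional_user", "member_role": None}
    if role == "developer":
        # The functional user creates the project and adds the user as Maintainer.
        return {"creator": "functional_user", "added_member": "user", "member_role": "Maintainer"}
    # Other roles are not covered by the text, so this sketch refuses them.
    raise ValueError(f"role not covered by the documented rule: {user_role!r}")
```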

DVC vs git file versioning

When you upload small text files to a dataset, versioning them with DVC is not recommended. Tracking these versions with Git is more efficient and makes it easier to view diffs and apply incremental changes. The platform can handle files stored in either DVC or Git, but it is up to the developer to choose where to store new files.
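One way a developer might encode this choice is a small heuristic such as the sketch below. The suffix list and the 1 MiB limit are assumptions made for illustration; the platform does not prescribe a threshold, and the name `choose_tracker` is hypothetical.

```python
from pathlib import Path

# Assumed values for illustration only; tune them to your own datasets.
TEXT_SUFFIXES = {".txt", ".md", ".csv", ".json", ".yaml", ".yml"}
SMALL_FILE_LIMIT = 1024 * 1024  # 1 MiB


def choose_tracker(filename: str, size_bytes: int) -> str:
    """Return "git" for small text files and "dvc" for everything else."""
    is_text = Path(filename).suffix.lower() in TEXT_SUFFIXES
    if is_text and size_bytes <= SMALL_FILE_LIMIT:
        return "git"
    return "dvc"
```

With these assumed values, a 2 KiB `labels.csv` would go to Git, while an image of any size or a 10 MiB CSV would go to DVC.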

At the time of writing, every file uploaded through the API is versioned with DVC, because the API is designed mainly for uploading large image files. For this reason, it is recommended to handle these special cases directly with the CLI tools.

File upload/download

sequenceDiagram
    participant Client
    participant Datalake
    participant Storage Site
    participant GitLab
    Client->>Datalake: Upload files
    Datalake->>Datalake: dvc add/git add/git commit
    Datalake->>Storage Site: dvc push
    Datalake->>GitLab: git push
    Datalake->>Client: ok

sequenceDiagram
    participant Client
    participant Datalake
    participant Storage Site
    participant GitLab
    Client->>Datalake: Request file
    Datalake->>GitLab: git clone
    Datalake->>Storage Site: dvc pull file
    Datalake->>Client: Return file
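For developers who reproduce the upload flow locally with the CLI tools, the sketch below builds the command sequence in the same order as the upload diagram. It is an illustration under stated assumptions, not the Datalake implementation: the function `upload_commands` is hypothetical, and it relies on the standard DVC behaviour that `dvc add <file>` writes a `<file>.dvc` pointer and updates a `.gitignore` in the file's directory.

```python
import posixpath


def upload_commands(path: str, tracker: str, message: str) -> list[list[str]]:
    """Return the CLI commands, in order, to version and publish one file."""
    if tracker == "dvc":
        # dvc add writes <path>.dvc and a .gitignore next to the file;
        # both small files are then committed to Git.
        gitignore = posixpath.join(posixpath.dirname(path), ".gitignore")
        steps = [["dvc", "add", path], ["git", "add", path + ".dvc", gitignore]]
    elif tracker == "git":
        steps = [["git", "add", path]]
    else:
        raise ValueError(f"unknown tracker: {tracker!r}")
    steps.append(["git", "commit", "-m", message])
    if tracker == "dvc":
        # Send the file content to the DVC remote (the storage site).
        steps.append(["dvc", "push"])
    steps.append(["git", "push"])
    return steps
```

Each inner list can be passed to `subprocess.run(step, check=True)` from the root of the cloned repository. Returning the commands instead of running them keeps the sketch easy to inspect before anything is pushed.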