Data Vault Storage Root

Introduction

The Data Vault is subdivided into Storage Roots, each containing the long-term preservation copies for either a Data Station or a "Vault as a Service" (VaaS) customer. The Data Vault Storage Root (DVSR) can be viewed as a type of interface, or exchange format, albeit an atypical one, as it is aimed at future users rather than current ones.

dd-data-vault interface

Do not confuse the DVSR with the service interface of dd-data-vault, which is an internal microservice interface that is used by the transfer service to store data in the Data Vault.

OCFL repositories

The DANS Data Vault is implemented as an array of OCFL repositories. OCFL stands for Oxford Common File Layout. It is a community specification for the layout of a repository that stores versioned digital objects. Each repository, or "storage root," is one Data Vault Storage Root (DVSR). Each Data Station has its own DVSR, as does each Vault as a Service customer.
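As a minimal sketch of what "storage root" means on disk: the OCFL specification marks a storage root with a conformance file whose name has the form `0=ocfl_1.1`. The check below looks only for that marker in an already restored directory. The function name is hypothetical and is not part of any DANS tool.

```python
import re
from pathlib import Path

# OCFL storage root conformance file, e.g. "0=ocfl_1.0" or "0=ocfl_1.1".
# Object conformance files ("0=ocfl_object_1.1") deliberately do not match.
_ROOT_MARKER = re.compile(r"^0=ocfl_\d+\.\d+$")


def is_ocfl_storage_root(path: Path) -> bool:
    """Return True if `path` is a directory containing an OCFL root marker file."""
    return path.is_dir() and any(
        p.is_file() and _ROOT_MARKER.match(p.name) for p in path.iterdir()
    )
```

This is only a quick plausibility check; it does not validate the repository against the OCFL specification.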

Serialization in layers

OCFL repositories can be serialized in different ways, for example as a directory structure on a file system, or as objects in an object store. The DANS Data Vault uses the SURF Data Archive tape storage. The tape storage system used by Data Archive organizes files in a file-folder structure, so in principle serialization should be the same as to a disk-based file system, from OCFL's perspective. However, the tape storage system requires a minimum file size of 1 GB, which is much larger than the typical data file stored in the DANS Data Vault. To meet this requirement, the OCFL repositories are stored as a series of DMF TAR archives (see the note below on how this differs from a regular TAR archive), each of which is larger than 1 GB. Each archive forms a layer. To restore the OCFL repository, the layers must be extracted in the correct order. For a more detailed description of the layers, see the documentation of dans-layer-store-lib.
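The ordered restore can be sketched as follows. This is an illustration under stated assumptions, not the procedure used by dans-layer-store-lib. It assumes that each layer is available as a plain TAR file (a stand-in for a DMF TAR archive), that archive file names sort in the order the layers were created, and that a file in a later layer replaces the same path from an earlier layer. The function name `restore_layers` is hypothetical.

```python
import tarfile
from pathlib import Path


def restore_layers(layer_archives: list[Path], target: Path) -> None:
    """Extract plain TAR layer archives into `target`, oldest layer first.

    Assumption: sorting by file name yields the layer creation order.
    """
    target.mkdir(parents=True, exist_ok=True)
    # Use the safer "data" extraction filter where the Python version has it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    for archive in sorted(layer_archives, key=lambda p: p.name):
        with tarfile.open(archive) as tar:
            # Later layers overwrite files that earlier layers extracted.
            tar.extractall(target, **kwargs)
```

Sorting by name only works when the names have equal length or are otherwise lexically ordered; if the real layer identifiers do not have that property, the sort key would need to parse them.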

DMF TAR

The tape storage system used by Data Archive is managed by DMF, which stands for Data Migration Facility. SURF has developed a utility called dmftar: "dmftar is a wrapper for the Linux tool gnutar and automatically creates multi-volume archive files (...) and can incorporate the transfer of the files to the archive file system if necessary." dmftar stores the TAR volumes in a directory with the extension .dmftar, which also contains an index and a checksum file.

Dataset model mapping

OCFL is a generic storage model. It does not define the concept of a dataset. The DANS archival systems (Data Stations and Vault as a Service), on the other hand, are built around the dataset concept. The mapping between the two models is as follows:

DANS dataset model    OCFL model
Dataset               OCFL Object
Dataset Version       OCFL Object Version
Datafile              OCFL Content File

Versions

Each Dataset Version Export (DVE) is stored in a separate OCFL Object Version. This means that there is a 1-to-1 mapping between a DVE and an OCFL Object Version. Note, however, that one dataset version may be exported multiple times. The mapping of a dataset version to OCFL Object Versions is therefore a 1-to-n relationship.

A multiple exports scenario

A scenario where a dataset version is exported multiple times is when the dataset was updated in the Data Station without creating a new version. This can be done by a superuser and is known as "updatecurrent". A new Dataset Version Export will be created and therefore a new OCFL Object Version will be created as well. The Data Station version history, however, will not display an additional version.
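The resulting 1-to-n relationship can be illustrated with a small sketch. The input pairs are invented for the example: OCFL names object versions `v1`, `v2`, and so on, while the dataset version labels (`"1.0"`, `"2.0"`) and the function name are hypothetical and do not reflect how the Data Vault records this information.

```python
from collections import defaultdict


def group_by_dataset_version(exports):
    """Map each dataset version to the OCFL Object Versions holding its exports.

    `exports` is an iterable of (ocfl_object_version, dataset_version) pairs.
    """
    grouped = defaultdict(list)
    for ocfl_version, dataset_version in exports:
        grouped[dataset_version].append(ocfl_version)
    return dict(grouped)


# Dataset version 1.0 was exported twice (for example after "updatecurrent"),
# so it maps to two OCFL Object Versions.
exports = [("v1", "1.0"), ("v2", "1.0"), ("v3", "2.0")]
print(group_by_dataset_version(exports))
# {'1.0': ['v1', 'v2'], '2.0': ['v3']}
```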

Identifying metadata

To identify datasets, versions and data files in the OCFL repository, the following metadata is used:

Vault metadata

The full metadata of each dataset version is stored, but the way it is stored depends on the export format used. The current export format is based on the Dataverse implementation of the RDA Research Data Repository Interoperability WG recommendations.