DANS BagIt Profile v1.0.0

Introduction

Version

  • Document version: 1.0.0
  • Publication date: 2024-02-06

Status

  • ARCHIVED

THIS VERSION OF THE PROFILE IS NO LONGER IN USE.

Scope

This document specifies what constitutes an acceptable DANS bag. This includes all the requirements for a bag to be successfully processed by the ingest workflow.

Version 1.0.0 covers the requirements used by the DANS Data Stations and the Vault as a Service.

(DANS internal note: DANS BagIt Profile is also being used in the migration from EASY to the Data Stations. Some rules are modified for migration deposits. See Migration Supplement.)

Overview and conventions

Keywords

The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and " OPTIONAL" in this document are to be interpreted as described in RFC 2119.

The keyword "SHOULD" is also used to specify requirements that are impossible or impractical to check by the archival organization (i.e. DANS). The client should do its best to meet these requirements, but not rely on their being validated by the archival organization.

Subdivisions

The requirements are subdivided into the following sections:

  • BagIt-related requirements - requirements that refer back to the BagIt specifications
  • Structural requirements - requirements regarding the directories and files in the bag
  • Metadata requirements - requirements regarding the metadata files included in the bag
  • Data Station context requirements - requirements regarding the Data Station where the dataset is to be created or modified; these do not apply to the Vault as a Service
  • Vault as a Service context requirements - requirements that are only applicable to the Vault as a Service and not to the Data Stations

The sections are numbered and may have numbered subsections. The requirements themselves are stated as numbered rules. Rules may have parts that are labeled with letters: (a), (b), (c), etc. To uniquely identify a specific rule, use the notation

<section-nr>[.<subsection-nr>].<rule-nr> [(<letter>)]

Example: 2.3.4 (e) means part e of the fourth rule in subsection 3 of section 2.

XML namespaces

When referring to XML element or attribute names or attribute values that have a prefix (such as dcterms:identifier) an element in a certain namespace is intended. The table below lists the mapping from prefix to namespace. In the actual document, the namespace may be bound to a different prefix, or be the default namespace.

Prefix Namespace URI Namespace documentation
dc http://purl.org/dc/elements/1.1/ DC
dcterms http://purl.org/dc/terms/ DCTERMS
dcx-dai http://easy.dans.knaw.nl/schemas/dcx/dai/ DANS DCTERMS extension: Digital Author Identifier
dcx-gml http://easy.dans.knaw.nl/schemas/dcx/gml/ DANS DCTERMS extension: Geography Markup Language
ddm http://schemas.dans.knaw.nl/dataset/ddm-v2/ DANS dataset metadata schema
files http://easy.dans.knaw.nl/schemas/bag/metadata/files/ DANS bag file metadata schema
gml http://www.opengis.net/gml GML 3.1.1 Simplified Features profile Levels 0 and 1
id-type http://easy.dans.knaw.nl/schemas/vocab/identifier-type/ DANS controlled list of dcterms:identifier types
xsi http://www.w3.org/2001/XMLSchema-instance XML Schema

Names

Exact names are specified in code style. The capitalization of those names must be exactly as specified, unless it is explicitly stated that the name is case-insensitive. If the name is a file path it is relative to the bag base directory.

Glossary

This section defines a number of terms that may be useful when discussing this specification. Several of these terms are used in the Requirements section of this document. Where terms are used in the definitions of other terms, they are printed in italics.

  • bag: a data package conforming to the BagIt specifications.

  • bag-ID: a urn:uuid given to every deposit upon delivery to DANS. The bag-ID of the first version of a dataset is also known as the sword token.

  • (version N) DANS-bag: a bag that also conforms to (version N) of the DANS BagIt Profile specifications.

  • DANS Data Station: the publishing and archiving service of DANS for datasets

  • Vault as a Service: the service that allows customers to store datasets in the DANS Data Vault, without publishing them in a Data Station

  • DANS SWORD Service: the DANS implementation of the SWORDv2 protocol

  • dataset: an information package containing data and metadata published in a DANS Data Station. A dataset may have multiple versions.

  • deposit: a package containing data and metadata sent to DANS by the depositor to create a new dataset or dataset version.

  • depositor: the agent sending data to DANS for publishing in a DANS Data Station. This term corresponds to the Producer in the OAIS Reference Model.

  • deaccessioned version: a dataset that is present in a DANS Data Station, but no longer disseminated. To the external user the dataset version appears to be deleted, only a tombstone remains.

  • update-deposit: a deposit that is intended to create a new version of an existing dataset.

  • version-bag-ID: a bag-ID that references a version of a dataset. If the dataset consists of several versions, the DOI resolves to the most recent of those versions. All versions of one dataset have the same sword token.

  • sword token: a urn:uuid assigned to first version bags upon submission through the DANS SWORD Service. This UUID is used when depositing an update-deposit and to identify all versions of one dataset.

Requirements

1.1 Validity

  1. The bag MUST be valid according to the BagIt specifications v1.0 (RFC 8493).

1.2 bag-info.txt

  1. The bag MUST contain a bag-info.txt file.

  2. (a) The bag-info.txt file MUST contain exactly one element called Created. (b) It MUST have a timestamp value in ISO 8601 format, including the time zone and a millisecond precision time part. (c) The values in the Created timestamps SHOULD reflect the correct order of the versions.

  3. (a) The bag-info.txt file MAY contain at most one element called Is-Version-Of (b) with a urn:uuid-value. See rules 4.1 and 5.1 for the context requirements if Is-Version-Of is provided.

  4. (a) The bag-info.txt file MAY contain at most one element called Has-Organizational-Identifier. (b) If this Has-Organizational-Identifier is given, at most one Has-Organizational-Identifier-Version MAY be present, containing a version number. (c) If Has-Organizational-Identifier is present then its value MUST start with one of the approved prefixes. Each client will be assigned a unique prefix to use for this purpose.

1.3 Manifests

  1. If the bag has only one payload manifest it MUST NOT use the MD5 algorithm. However, it MAY have an MD5 payload manifest in addition to other payload manifests.

2 Structural requirements

  1. The bag MUST have a tag-directory called metadata directly under the bag base directory.

  2. The metadata directory MUST contain the files: (a) metadata/dataset.xml and (b) metadata/files.xml.

  3. The metadata directory MUST NOT contain any other files or directories.

  4. A DANS bag MAY contain a file original-filepaths.txt in the root of the bag in UTF-8 encoding. For the content requirements and purpose of this file, if it exists, see Section 3.3.

3 Metadata requirements

3.1 metadata/dataset.xml

  1. The file metadata/dataset.xml MUST adhere to DANS dataset metadata schema.

  2. The file metadata/dataset.xml MUST have at least one dcterms:license element as a child of the dcmiMetadata element. Exactly one of these elements MUST have the attribute xsi:type="dcterms:URI" and have a URI as element text. (See also rule 4.2.)

  3. Any dcx-dai:<scheme> elements (i.e. (a) DAI, (b) ISNI or (c) ORCID) MUST contain an identifier that complies with the syntaxis for the selected identifier scheme URI.

  4. Any gml:posList elements nested in gml:Polygon elements:

    • MUST have an even number of values.
    • MUST have at least three different pairs of values, each of which describes one point.
    • MUST start and end with the same pair of values.
  5. If a gml:MultiSurface is used, all nested gml:Polygon MUST have the same srsName attribute.

  6. Any gml:Point/gml:pos (pos nested in a Point), gml:lowerCorner and gml:upperCorner elements MUST have at least two values. These values MUST be numeric. In case the srsName specifies the RD scheme the values MUST be within the valid range.

  7. Any dc:identifier or dcterms:identifier element with attribute xsi:type="id-type:ARCHIS-ZAAK-IDENTIFICATIE MUST have a value of 10 or fewer characters.

  8. All URLs used in metadata/dataset.xml MUST be valid URLs with protocol http or https.

  9. The file metadata/dataset.xml MUST have one or more rights holders, each in a <dcterms:rightsHolder> element as a child of the dcmiMetadata element.

  10. dcx-dai:author and dcx-dai:organization elements MUST NOT have the role RightsHolder.

3.2 metadata/files.xml

  1. The file metadata/files.xml MUST adhere to DANS bag file metadata schema.

  2. Each file element's filepath attribute MUST contain the bag local path to the payload file described. When an original-filepaths.txt exists, rule 3.3.1 applies. Directories and non-payload files MUST NOT be described by a file element.

  3. There MUST NOT be more than one file element corresponding to a payload file and every payload file MUST be described by a file element.

3.3 original-filepaths.txt

  1. The purpose of original-filepaths.txt is to provide a complete mapping from renamed files back to their original full path (including filename). It MUST be a text file encoded with UTF-8.
  2. The lines of original-filepaths.txt MUST be formatted as <physical-bag-relative-path><whitespace><original-bag-relative-path>, where

    • <physical-bag-relative-path> MUST NOT contain whitespace.
    • <physical-bag-relative-path> MUST correspond one-to-one to an existing payload file.
    • <original-bag-relative-path> MUST correspond one-to-one to a filepath attribute in the file elements in the files.xml.

4 Data Station context requirements

  1. If bag-info.txt contains the element Is-Version-Of there MUST be a dataset in the target Data Station with the following properties: (a) it has a dansSwordToken with the same value as Is-Version-Of (b) it has a dansOtherId with the same value as bag-info.txt's Has-Organizational-Identifier (or both are absent).

  2. The metadata/dataset.xml element dcmiMetadata/dcterms:license with attribute xsi:type="dcterms:URI" (see 3.1.2) must be one of the licenses supported by the target Data Station.

  3. The date in metadata/dataset.xml element profile/available MUST NOT be further in the future than the limit set on embargoes in the target Data Station.

  4. The bag MUST NOT contain a payload file data/original-metadata.zip.

5 Vault as a Service context requirements

  1. If bag-info.txt contains the element Is-Version-Of there MUST be a dataset in the Vault that has a dansSwordToken with the same value as Is-Version-Of.

  2. (a) The file metadata/dataset.xml MAY contain at most one dcterms:identifier element with an attribute xsi:type="id-type:DOI". (b) If present, the text of this element MUST be a syntactically valid DOI identifier which SHOULD be resolvable.

References