DANS BagIt Profile v0.0.0

Introduction

Version

  • Document version: 0.0.0
  • Publication date: 2019-05-02

Status

Status of this document is PUBLISHED.

THIS VERSION OF THE PROFILE IS DEPRECATED AND ONLY APPLIES TO THE LEGACY EASY SWORD2 SERVICE.

Scope

This document specifies what constitutes an acceptable DANS bag. This includes all the requirements for a bag to be successfully processed by the ingest workflow. What is acceptable as a SIP is sometimes not exactly the same as what is acceptable as an AIP. Where these are different, this is documented.

Version 0.0.0 covers the de facto requirements that are used by the DANS EASY archive. In this context a SIP is the bag as it is submitted by the client and an AIP is the bag as it is stored in the DANS EASY Vault, which is implemented as a bag-store.

Overview and conventions

Key words

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and " OPTIONAL" in this document are to be interpreted as described in RFC 2119.

The key word "SHOULD" is also used to specify requirements that are impossible or impractical to check by the archival organization (i.e. DANS). The client should do its best to meet these requirements, but not rely on their being validated by the archival organization.

Subdivisions

The requirements are subdivided into the following sections:

  • BagIt-related requirements - requirements that refer back to the BagIt specifications
  • Structural requirements - requirements regarding the directories and files in the bag
  • Metadata requirements - requirements regarding the metadata files included in the bag
  • Bag sequence requirements - requirements regarding the metadata that records how a sequence of bags can represent the history of a dataset

The sections are numbered and may have numbered subsections. The requirements themselves are stated as numbered rules. To uniquely identify a specific rule, use the notation <section-nr>[.<subsection-nr>].<rule-nr>, e.g., 2.3.4 for the fourth rule in subsection 3 of section 2.

SIP vs AIP

Rules that only apply to SIPs or AIPs are annotated with the comments "(SIP)" and "(AIP)" respectively. All other rules apply to SIP and AIP alike.

Stand-alone vs in-sequence-context validation

There are two levels of validation:

  • stand-alone: this includes the validation steps that can be performed without taking into account any data outside the bag, in particular the bag store that the bag is (to be) stored in. This means that only the rules in sections 1-3 are checked in this level.
  • in-sequence-context: this includes all the checks done in stand-alone validation plus checks that involve the bag-sequence that the bag is (to be) part of. This means all the rules in this document are checked.

Stand-alone validation can be useful when you do not have access to the target bag store. Another reason to do only stand-alone validation may be that the overhead of doing calls to the bag store over the network is too high.

XML namespaces

When referring to XML element or attribute names or attribute values that have a prefix (such as dcterms:identifier) an element in a certain namespace is intended. The table below lists the mapping from prefix to namespace. In the actual document, the namespace may be bound to a different prefix, or be the default namespace.

Prefix Namespace URI Namespace documentation
am http://easy.dans.knaw.nl/schemas/bag/metadata/agreements/ DANS deposit agreement metadata
amd http://easy.dans.knaw.nl/easy/dataset-administrative-metadata/ EASY administrative metadata schema
dc http://purl.org/dc/elements/1.1/ DC
dcterms http://purl.org/dc/terms/ DCTERMS
dcx-dai http://easy.dans.knaw.nl/schemas/dcx/dai/ DANS DCTERMS extension: Digital Author Identifier
dcx-gml http://easy.dans.knaw.nl/schemas/dcx/gml/ DANS DCTERMS extension: Geography Markup Language
ddm http://easy.dans.knaw.nl/schemas/md/ddm/ DANS dataset metadata schema
emd http://easy.dans.knaw.nl/easy/easymetadata/ EASY metadata schema
files http://easy.dans.knaw.nl/schemas/bag/metadata/files/ DANS bag file metadata schema
gml http://www.opengis.net/gml GML 3.1.1 Simplified Features profile Levels 0 and 1
id-type http://easy.dans.knaw.nl/schemas/vocab/identifier-type/ DANS controlled list of dcterms:identifier types
wfs http://easy.dans.knaw.nl/easy/workflow/ EASY administrative workflow metadata schema
xsi http://www.w3.org/2001/XMLSchema-instance XML Schema

Name spelling

Exact names are specified in code style. The capitalization of those names must be exactly as specified, unless it is explicitly stated that the name is case insensitive. If the name is a file path it is relative to the bag base directory.

Glossary

This section defines a number of terms that may be useful when discussing this specification. Several of these terms are used in the Requirements section of this document. Where terms are used in the definitions of other terms, they are printed in italics.

  • AIP: Archival Information Package, as defined in the OAIS Reference Model. See also AIP.

  • bag: a data package conforming to the BagIt specifications.

  • bag-sequence: a sequence of DANS-bags, that together represent the history of an archived dataset in the DANS Data Vault.

  • bag-store: storage for immutable bags. All DANS bag-stores combined are called the DANS Data Vault.

  • (version N) DANS-bag: a bag that also conforms to (version N) of the DANS BagIt Profile specifications.

  • base-revision: the revision that is used to reference a complete bag-sequence. Usually, this is the oldest revision.

  • corresponding bag-files: files in different bags of the same bag-sequence, that are identical or have the exact same path (including the file name) within the bag.

  • DANS Data Vault: the bag-stores of DANS combined.

  • dataset: the DANS implementation of the Information Package concept from the OAIS Reference Model. An archived dataset corresponds to an AIP. A dataset is represented in the * DANS Data Vault as a bag-sequence*.

  • deposit: a dataset as it is delivered to DANS by the depositor, i.e. the SIP, as it is being transformed into an AIP.

  • depositor: the agent sending data to DANS for archiving in the DANS Data Vault. This term corresponds to the Producer in the OAIS Reference Model.

  • inactive bag: a bag that is present in the DANS Data Vault, but no longer disseminated. To the external user the bag appears to be deleted. A bag that is not inactive is active.

  • revision: a DANS-bag that has been successfully archived in the DANS Data Vault. It has a UUID identifier and a timestamp that records its place in the bag-sequence.

  • SIP: Submission Information Package, as defined in the OAIS Reference Model. See also SIP.

  • update-deposit: a deposit that is intended to be added as a revision of an existing, archived dataset.

  • version: a revision that has its own version-DOI. A version of an archived dataset may be updated by new revisions that keep the same version-DOI.

  • version-DOI: a DOI that references a version of an archived dataset. If the version consists of several revisions, the version-DOI resolves to the most recent of those revisions.

Requirements

1.1 Validity

  1. (SIP) The bag MUST be VALID according to the BagIt specifications.

1.2 bag-info.txt

  1. The bag MUST contain a bag-info.txt file.

  2. (a) The bag-info.txt file MAY contain at most one element called BagIt-Profile-Version. (b) If present, its value MUST be 0.

  3. (a) The bag-info.txt file MAY contain at most one element called BagIt-Profile-URI. (b) If present, its value MUST be doi:10.17026/dans-z52-ybfe.

  4. (a) The bag-info.txt file MUST contain exactly one element called Created. (b) It MUST have a timestamp value in ISO 8601 format, including the time zone and a millisecond precision time part. (c) The values in the Created timestamps SHOULD reflect the correct order of the versions.

  5. The bag-info.txt file MAY contain at most one element called Is-Version-Of with a urn:uuid-value.

  6. The bag-info.txt file (a) MUST (AIP) or (b) MAY (SIP) contain an entry called EASY-User-Account. If present, its value SHOULD be the user name of an existing EASY user account.

1.3 Manifests

  1. (AIP) (a) The bag MUST have a SHA-1 payload manifest. (b) It MUST have entries for all the payload files.

  2. The bag MAY have other payload manifests and tag manifests.

2 Structural requirements

  1. The bag MUST have a tag-directory called metadata (one word, lowercase letters) directly under the bag base directory.

  2. The metadata directory MUST contain the files: (a) metadata/dataset.xml and (b) metadata/files.xml. It MAY contain the files metadata/amd.xml, metadata/emd.xml or metadata/license.txt.

  3. (a) The metadata directory MAY contain a directory depositor-info in which the following files MAY be present metadata/depositor-info/agreements.xml with information about agreements between the depositor and DANS. Among other things it specifies the existence of personal data within the files in data. Other files that MAY be present are:

    • either metadata/depositor-info/depositor-agreement.pdf or metadata/depositor-info/depositor-agreement.txt;
    • metadata/depositor-info/message-from-depositor.txt.
    • metadata/provenance.xml
    • (b) The metadata directory MAY contain a directory original containing a dataset.xml and a files.xml, representing the original deposit
  4. (SIP) The metadata directory MAY contain a file called metadata/depositor-info/message-from-depositor.txt.

  5. The metadata directory MUST NOT contain any other files or directories.

  6. Files in the data directory MUST NOT have filepaths that include the following characters: /, :, *, ?, " , <, >, |, ;, #.

2.7 original-filepaths.txt

  1. A DANS bag MAY contain a file original-filepaths.txt in the root of the bag in UTF-8 encoding.

  2. It MUST be formatted as <physical-bag-relative-path><whitespace><original-bag-relative-path>.

    • <physical-bag-relative-path> MUST NOT contain whitespace.
    • <physical-bag-relative-path> MUST correspond one-to-one to an existing payload file.
    • <original-bag-relative-path> MUST correspond one-to-one to a filepath attribute in the file elements in the files.xml.

3 Metadata requirements

3.1 metadata/dataset.xml

  1. The file metadata/dataset.xml MUST adhere to DANS dataset metadata schema.

  2. The file metadata/dataset.xml MAY have one dcterms:license element as a child of the dcmiMetadata element. It MUST NOT have more than one such element. If present, and if it contains an attribute xsi:type="dcterms:URI", then the element text MUST be one of the URIs of approved licenses.

  3. (a) (AIP) The file metadata/dataset.xml MUST contain at least one dcterms:identifier element with an attribute xsi:type="id-type:URN". The text of this element MUST be a syntactically valid urn:NBN identifier which SHOULD be resolvable. (b) The file metadata/dataset.xml MAY contain one or more dcterms:identifier elements with an attribute xsi:type="id-type:DOI". The text of this element MUST be a syntactically valid DOI identifier which SHOULD be resolvable.

  4. Any dcx-dai:DAI elements MUST contain a syntactically valid Digital Author Identifier (DAI) with a valid check digit. This DAI SHOULD refer to an existing author. The DAI MAY be provided as a info:eu-repo/dai/nl/ URI.

  5. Any gml:postList elements nested in gml:Polygon elements:

    • MUST have an even number of values.
    • MUST have at least three different pairs of values, each of which describes one point.
    • MUST start and end with the same pair of values.
  6. If a gml:MultiSurface is used, all nested gml:Polygon MUST have the same srsName attribute.

  7. Any gml:Point, gml:lowerCorner and gml:upperCorner elements MUST have at least two values. These values MUST be numeric. In case of RD (http://www.opengis.net/def/crs/EPSG/0/28992) the values MUST be within the valid range.

  8. Any dc:identifier or dcterms:identifier element with xsi:type="id-type:ARCHIS-ZAAK-IDENTIFICATIE MUST have a value of 10 or fewer characters.

  9. All URLs used in metadata/dataset.xml MUST be valid URLs with protocol http or https.

  10. The file metadata/dataset.xml MUST have a rights holder, either in <dcterms:rightsHolder> or as an author with role RightsHolder.

3.2 metadata/files.xml

  1. If the file metadata/files.xml declares the DANS bag file metadata schema namespace then it MUST adhere to that schema.

  2. The document element MUST be files.

  3. The document element MUST contain zero or more file elements, and MUST NOT contain other elements.

  4. Each file element MUST have a filepath attribute which contains the bag local path to the payload file described. When an original-filepaths.txt exists, rule 2.7.2 applies. Directories and non-payload files MUST NOT be described by a file element.

  5. There MUST NOT be more than one file element corresponding to a payload file and every payload file MUST be described by a file element.

  6. Each file element MUST contain at least one dcterms:format element containing the Media Type of the file described.

  7. Each file element MAY contain any number of other DC or DCTERMS elements describing the file. It MUST NOT contain elements from other namespaces, unless permitted by the DANS bag file metadata schema schema.

  8. Each file element MAY contain either one dcterms:accessRights (DEPRECATED) or files:accessibleToRights and/or files:visibleToRights (RECOMMENDED). The values of these elements are limited to ANONYMOUS, RESTRICTED_REQUEST and NONE. It is recommended to make files visible to all, by setting files:visibleToRights to ANONYMOUS.

3.3 metadata/depositor-info/agreements.xml

  1. (AIP) The file metadata/depositor-info/agreements.xml MUST adhere to DANS deposit agreement metadata.

3.4 metadata/depositor-info/message-from-depositor.txt

  1. The file metadata/depositor-info/message-from-depositor.txt MUST be a plain text file using the UTF-8 character encoding. It SHOULD contain any remarks that the depositor wants to convey to the data manager who is to curate this deposit.

4 Bag sequence requirements

  1. (AIP) If bag-info.txt contains the element Is-Version-Of the UUID in its value MUST be the bag-id of an existing, archived bag. This bag SHOULD be the base revision for the dataset that is to be updated by the deposit.

2 (AIP) The bags of a bag-sequence MUST be stored in the same bag-store.

3 (AIP) The EASY-User-Account in bag-info.txt MUST be identical in all the bags of a bag-sequence.

References