DANS BagIt Profile v1.2.0
Introduction
Version
- Document version: 1.2.0
- Publication date: 2024-08-28
Status
- PUBLISHED
Changes
Changed from version 1.1.0 to 1.2.0
Added rules 11 and 12 to section 3.1.
- Rule 11 introduces the use of the `valueCode` attribute as an alternative for `valueURI` for the ABR controlled vocabularies. While using `valueURI` is preferred, `valueCode` is also allowed to facilitate clients that cannot easily convert the short codes to term URIs. Because this is an added (and backwards compatible) feature, this release is a minor one.
- Rule 12 explicitly adds the requirement that term URIs (or corresponding short codes) used as `valueURI` values must be valid terms in the controlled vocabulary. While technically this adds a backwards incompatible requirement, it is considered to be a clarification of implied requirements and is therefore treated as a patch.
Scope
This document specifies what constitutes an acceptable DANS bag. This includes all the requirements for a bag to be successfully processed by the ingest workflow.
Versions 1.x.x cover the requirements used by the DANS Data Stations and the Vault as a Service.
(DANS internal note: DANS BagIt Profile is also being used in the migration from EASY to the Data Stations. Some rules are modified for migration deposits. See Migration Supplement.)
Overview and conventions
Keywords
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
The keyword "SHOULD" is also used to specify requirements that are impossible or impractical to check by the archival organization (i.e. DANS). The client should do its best to meet these requirements, but not rely on their being validated by the archival organization.
Subdivisions
The requirements are subdivided into the following sections:
- BagIt-related requirements - requirements that refer back to the BagIt specifications
- Structural requirements - requirements regarding the directories and files in the bag
- Metadata requirements - requirements regarding the metadata files included in the bag
- Data Station context requirements - requirements regarding the Data Station where the dataset is to be created or modified; these do not apply to the Vault as a Service
- Vault as a Service context requirements - requirements that are only applicable to the Vault as a Service and not to the Data Stations
The sections are numbered and may have numbered subsections. The requirements themselves are stated as numbered rules. Rules may have parts that are labeled with letters: (a), (b), (c), etc. To uniquely identify a specific rule, use the notation
<section-nr>[.<subsection-nr>].<rule-nr> [(<letter>)]
Example: 2.3.4 (e) means part e of the fourth rule in subsection 3 of section 2.
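
The rule notation above is regular enough to parse mechanically, which can be handy when a validator reports which rule was violated. The following is a minimal sketch; the names `RuleId` and `parse_rule_id` are hypothetical and not part of the profile.

```python
import re
from typing import NamedTuple, Optional

# Matches <section-nr>[.<subsection-nr>].<rule-nr> [(<letter>)]
RULE_ID = re.compile(r"^(\d+)(?:\.(\d+))?\.(\d+)(?:\s*\(([a-z])\))?$")


class RuleId(NamedTuple):
    section: int
    subsection: Optional[int]  # None when the section has no subsections
    rule: int
    part: Optional[str]        # None when no lettered part is referenced


def parse_rule_id(text: str) -> RuleId:
    """Split a rule reference such as '2.3.4 (e)' into its components."""
    match = RULE_ID.match(text.strip())
    if match is None:
        raise ValueError(f"not a rule identifier: {text!r}")
    section, subsection, rule, part = match.groups()
    return RuleId(
        int(section),
        int(subsection) if subsection is not None else None,
        int(rule),
        part,
    )


print(parse_rule_id("2.3.4 (e)"))
print(parse_rule_id("4.1"))
```

Note that a two-number reference such as `4.1` is read as section 4, rule 1, which matches how this document refers to rules in sections without subsections.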
XML namespaces
When referring to XML element or attribute names or attribute values that have a prefix (such as `dcterms:identifier`),
a name in a certain namespace is intended. The table below lists the mapping from prefix to namespace. In the actual
document, the namespace may be bound to a different prefix, or be the default namespace.
| Prefix | Namespace URI | Namespace documentation |
|---|---|---|
| `dc` | `http://purl.org/dc/elements/1.1/` | DC |
| `dcterms` | `http://purl.org/dc/terms/` | DCTERMS |
| `dcx-dai` | `http://easy.dans.knaw.nl/schemas/dcx/dai/` | DANS DCTERMS extension: Digital Author Identifier |
| `dcx-gml` | `http://easy.dans.knaw.nl/schemas/dcx/gml/` | DANS DCTERMS extension: Geography Markup Language |
| `ddm` | `http://schemas.dans.knaw.nl/dataset/ddm-v2/` | DANS dataset metadata schema |
| `files` | `http://easy.dans.knaw.nl/schemas/bag/metadata/files/` | DANS bag file metadata schema |
| `gml` | `http://www.opengis.net/gml` | GML 3.1.1 Simplified Features profile Levels 0 and 1 |
| `id-type` | `http://easy.dans.knaw.nl/schemas/vocab/identifier-type/` | DANS controlled list of dcterms:identifier types |
| `xsi` | `http://www.w3.org/2001/XMLSchema-instance` | XML Schema |
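
Because a document may bind these namespaces to other prefixes, a client that inspects the metadata should match on namespace URIs rather than on the literal prefix. The sketch below uses Python's standard `xml.etree.ElementTree` with the prefix table above; the helper name `find_all` and the sample document (including its root element name and the `dct` prefix) are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace mapping copied from the table above.
NAMESPACES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "dcx-dai": "http://easy.dans.knaw.nl/schemas/dcx/dai/",
    "dcx-gml": "http://easy.dans.knaw.nl/schemas/dcx/gml/",
    "ddm": "http://schemas.dans.knaw.nl/dataset/ddm-v2/",
    "files": "http://easy.dans.knaw.nl/schemas/bag/metadata/files/",
    "gml": "http://www.opengis.net/gml",
    "id-type": "http://easy.dans.knaw.nl/schemas/vocab/identifier-type/",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
}


def find_all(root: ET.Element, path: str) -> list:
    """Resolve the prefixes in `path` with the table, not with the document's own prefixes."""
    return root.findall(path, NAMESPACES)


# Hypothetical sample: the document binds DCTERMS to the prefix 'dct'.
SAMPLE = """
<ddm:DDM xmlns:ddm="http://schemas.dans.knaw.nl/dataset/ddm-v2/"
         xmlns:dct="http://purl.org/dc/terms/">
  <ddm:dcmiMetadata>
    <dct:identifier>abc</dct:identifier>
  </ddm:dcmiMetadata>
</ddm:DDM>
"""

root = ET.fromstring(SAMPLE)
for element in find_all(root, ".//dcterms:identifier"):
    print(element.text)
```

The lookup succeeds even though the sample uses `dct` instead of `dcterms`, since only the namespace URI is compared.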
Names
Exact names are specified in code style. The capitalization of those names must be exactly as specified, unless it is
explicitly stated that the name is case-insensitive. If the name is a file path it is relative to the bag base
directory.
Glossary
This section defines a number of terms that may be useful when discussing this specification. Several of these terms are used in the Requirements section of this document. Where terms are used in the definitions of other terms, they are printed in italics.
- bag: a data package conforming to the BagIt specifications.
- bag-ID: a urn:uuid given to every deposit upon delivery to DANS. The bag-ID of the first version of a dataset is also known as the sword token.
- (version N) DANS-bag: a bag that also conforms to (version N) of the DANS BagIt Profile specifications.
- DANS Data Station: the publishing and archiving service of DANS for datasets.
- Vault as a Service: the service that allows customers to store datasets in the DANS Data Vault, without publishing them in a Data Station.
- DANS SWORD Service: the DANS implementation of the SWORDv2 protocol.
- dataset: an information package containing data and metadata published in a DANS Data Station. A dataset may have multiple versions.
- deposit: a package containing data and metadata sent to DANS by the depositor to create a new dataset or dataset version.
- depositor: the agent sending data to DANS for publishing in a DANS Data Station. This term corresponds to the Producer in the OAIS Reference Model.
- deaccessioned version: a dataset version that is present in a DANS Data Station, but no longer disseminated. To the external user the dataset version appears to be deleted; only a tombstone remains.
- update-deposit: a deposit that is intended to create a new version of an existing dataset.
- version-bag-ID: a bag-ID that references a version of a dataset. If the dataset consists of several versions, the DOI resolves to the most recent of those versions. All versions of one dataset have the same sword token.
- sword token: a urn:uuid assigned to first-version bags upon submission through the DANS SWORD Service. This UUID is used when depositing an update-deposit and to identify all versions of one dataset.
Requirements
1 BagIt related
1.1 Validity
1. The bag MUST be valid according to the BagIt specifications v1.0 (RFC 8493).
1.2 bag-info.txt
1. The bag MUST contain a `bag-info.txt` file.
2. The `bag-info.txt` file MAY contain one or more elements called `Created`. This element, when present, is ignored. The only reason to allow it is for backwards compatibility with v1.0.0 of the DANS BagIt Profile.
3. (a) The `bag-info.txt` file MAY contain at most one element called `Is-Version-Of` (b) with a urn:uuid value. See rules 4.1 and 5.1 for the context requirements if `Is-Version-Of` is provided.
4. (a) The `bag-info.txt` file MAY contain at most one element called `Has-Organizational-Identifier`. (b) If this `Has-Organizational-Identifier` is given, at most one `Has-Organizational-Identifier-Version` MAY be present, containing a version number. (c) If `Has-Organizational-Identifier` is present then its value MUST start with one of the approved prefixes. Each client will be assigned a unique prefix to use for this purpose.
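
As an illustration of rules 1.2.3 and 1.2.4, a client could pre-check its `bag-info.txt` before submitting. The sketch below is not the DANS validator: the parser is deliberately simplified (`Label: value` lines, with continuation lines starting with whitespace), labels are matched exactly as written, and the function names and the `approved_prefixes` parameter are hypothetical.

```python
import re
from collections import defaultdict

URN_UUID = re.compile(
    r"^urn:uuid:[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def parse_bag_info(text: str) -> dict:
    """Simplified parser: returns a mapping from label to the list of its values."""
    elements = defaultdict(list)
    label = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and label is not None:
            # Continuation line: append to the value that was started last.
            elements[label][-1] += " " + line.strip()
        else:
            label, _, value = line.partition(":")
            label = label.strip()
            elements[label].append(value.strip())
    return dict(elements)


def check_bag_info(elements: dict, approved_prefixes: tuple = ()) -> list:
    """Return human-readable violations of rules 1.2.3 and 1.2.4."""
    errors = []
    is_version_of = elements.get("Is-Version-Of", [])
    if len(is_version_of) > 1:
        errors.append("1.2.3 (a): more than one Is-Version-Of")
    for value in is_version_of:
        if not URN_UUID.match(value):
            errors.append(f"1.2.3 (b): not a urn:uuid value: {value}")
    org_ids = elements.get("Has-Organizational-Identifier", [])
    if len(org_ids) > 1:
        errors.append("1.2.4 (a): more than one Has-Organizational-Identifier")
    if len(elements.get("Has-Organizational-Identifier-Version", [])) > 1:
        errors.append("1.2.4 (b): more than one Has-Organizational-Identifier-Version")
    # The prefix check only runs when the caller supplies the assigned prefixes.
    if approved_prefixes:
        for value in org_ids:
            if not value.startswith(approved_prefixes):
                errors.append(f"1.2.4 (c): no approved prefix: {value}")
    return errors
```

A typical call would be `check_bag_info(parse_bag_info(text), approved_prefixes=("myorg:",))`, where `myorg:` stands in for whatever prefix has been assigned to the client.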
1.3 Manifests
1. If the bag has only one payload manifest it MUST NOT use the MD5 algorithm. However, it MAY have an MD5 payload manifest in addition to other payload manifests.
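
The manifest rule can be checked from the file names in the bag base directory alone, since BagIt names payload manifests `manifest-<algorithm>.txt`. A minimal sketch, with hypothetical function names:

```python
import re

# Payload manifests only; 'tagmanifest-<algorithm>.txt' deliberately does not match.
PAYLOAD_MANIFEST = re.compile(r"^manifest-([A-Za-z0-9]+)\.txt$")


def payload_manifest_algorithms(names) -> set:
    """Collect the algorithms of all payload manifests among the given file names."""
    algorithms = set()
    for name in names:
        match = PAYLOAD_MANIFEST.match(name)
        if match:
            algorithms.add(match.group(1).lower())
    return algorithms


def satisfies_rule_1_3_1(names) -> bool:
    """False only when MD5 is the one and only payload manifest algorithm."""
    return payload_manifest_algorithms(names) != {"md5"}


print(satisfies_rule_1_3_1(["bagit.txt", "manifest-md5.txt", "manifest-sha256.txt"]))
```

A bag without any payload manifest passes this particular check, but it would already fail rule 1.1 because it is not a valid bag.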
2 Structural requirements
1. The bag MUST have a tag-directory called `metadata` directly under the bag base directory.
2. The `metadata` directory MUST contain the files: (a) `metadata/dataset.xml` and (b) `metadata/files.xml`.
3. The `metadata` directory MUST NOT contain any other files or directories.
4. A DANS bag MAY contain a file `original-filepaths.txt` in the root of the bag in UTF-8 encoding. For the content requirements and purpose of this file, if it exists, see Section 3.3.
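
Rules 2.1 to 2.3 amount to a small directory check that a client can run locally. The following sketch uses `pathlib`; the function name and the wording of the messages are illustrative only.

```python
from pathlib import Path

# File names required inside the metadata directory (rule 2.2).
REQUIRED_FILES = ("dataset.xml", "files.xml")


def check_structure(bag_dir: Path) -> list:
    """Return violations of rules 2.1, 2.2 and 2.3 for the bag at `bag_dir`."""
    metadata = bag_dir / "metadata"
    if not metadata.is_dir():
        return ["2.1: no metadata directory"]
    errors = []
    for name in REQUIRED_FILES:
        if not (metadata / name).is_file():
            errors.append(f"2.2: missing metadata/{name}")
    for entry in sorted(p.name for p in metadata.iterdir()):
        if entry not in REQUIRED_FILES:
            errors.append(f"2.3: unexpected entry metadata/{entry}")
    return errors
```

Calling `check_structure(Path("path/to/bag"))` returns an empty list when the `metadata` directory contains exactly the two required files.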
3 Metadata requirements
3.1 metadata/dataset.xml
1. The file `metadata/dataset.xml` MUST adhere to the DANS dataset metadata schema.
2. The file `metadata/dataset.xml` MUST have at least one `dcterms:license` element as a child of the `dcmiMetadata` element. Exactly one of these elements MUST have the attribute `xsi:type="dcterms:URI"` and have a URI as element text. (See also rule 4.2.)
3. Any `dcx-dai:<scheme>` elements (i.e. (a) DAI, (b) ISNI or (c) ORCID) MUST contain an identifier that complies with the syntax for the selected identifier scheme URI.
4. Any `gml:posList` elements nested in `gml:Polygon` elements:
    - MUST have an even number of values.
    - MUST have at least three different pairs of values, each of which describes one point.
    - MUST start and end with the same pair of values.
5. If a `gml:MultiSurface` is used, all nested `gml:Polygon` elements MUST have the same `srsName` attribute.
6. Any `gml:Point/gml:pos` (pos nested in a Point), `gml:lowerCorner` and `gml:upperCorner` elements MUST have at least two values. These values MUST be numeric. In case the `srsName` specifies the RD scheme the values MUST be within the valid range.
7. Any `dc:identifier` or `dcterms:identifier` element with attribute `xsi:type="id-type:ARCHIS-ZAAK-IDENTIFICATIE"` MUST have a value of 10 or fewer characters.
8. All URLs used in `metadata/dataset.xml` MUST be valid URLs with protocol `http` or `https`.
9. The file `metadata/dataset.xml` MUST have one or more rights holders, each in a `<dcterms:rightsHolder>` element as a child of the `dcmiMetadata` element.
10. `dcx-dai:author` and `dcx-dai:organization` elements MUST NOT have the role `RightsHolder`.
11. The following elements, if present, MUST have exactly one of the attributes `valueURI` or `valueCode` (and MUST NOT have both).
    - `ddm:reportNumber`
    - `ddm:acquisitionMethod`
    - `ddm:subject`
    - `ddm:temporal`
12. (a) For the `schemeURI` values listed below, the value of a `valueURI` attribute MUST be a term URI that is part of the controlled vocabulary specified in the `schemeURI` attribute of the same element. This is validated against the version currently hosted by the DANS controlled vocabulary service. (b) If a `valueCode` attribute is used, it MUST correspond to a valid term in the controlled vocabulary.

    This rule applies to the following schemeURIs:

    - `https://data.cultureelerfgoed.nl/term/id/rn/a4a7933c-e096-4bcf-a921-4f70a78749fe` (ABR)
    - `https://data.cultureelerfgoed.nl/term/id/abr/b6df7840-67bf-48bd-aa56-7ee39435d2ed` (ABR)
    - `https://data.cultureelerfgoed.nl/term/id/abr/e9546020-4b28-4819-b0c2-29e7c864c5c0` (ABR Complextypen)
    - `https://data.cultureelerfgoed.nl/term/id/abr/22cbb070-6542-48f0-8afe-7d98d398cc0b` (ABR Artefacten)
    - `https://data.cultureelerfgoed.nl/term/id/abr/9b688754-1315-484b-9c89-8817e87c1e84` (ABR Periodes)
    - `https://data.cultureelerfgoed.nl/term/id/abr/7a99aaba-c1e7-49a4-9dd8-d295dbcc870e` (ABR Rapporttypes)
    - `https://data.cultureelerfgoed.nl/term/id/abr/554ca1ec-3ed8-42d3-ae4b-47bcb848b238` (ABR Verwervingswijzen)
    - `https://vocabularies.dans.knaw.nl/collections/` (DANS Collections)
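
Of the rules in this subsection, the `gml:posList` rule (3.1.4) is the most algorithmic. The sketch below checks the text content of a single `gml:posList`. It assumes two-dimensional coordinates and compares points numerically, so `1` and `1.0` count as the same value; both are assumptions of this example rather than statements of the profile, and the function name is hypothetical.

```python
def check_pos_list(text: str) -> list:
    """Return violations of rule 3.1.4 for the text of one gml:posList."""
    try:
        values = [float(v) for v in text.split()]
    except ValueError:
        # Not required by rule 3.1.4 itself, but needed for numeric comparison.
        return ["posList contains a non-numeric value"]
    if len(values) % 2 != 0:
        return ["posList has an odd number of values"]
    # Consecutive values form (x, y) pairs, each describing one point.
    points = list(zip(values[0::2], values[1::2]))
    errors = []
    if len(set(points)) < 3:
        errors.append("posList has fewer than three different points")
    if points and points[0] != points[-1]:
        errors.append("posList does not start and end with the same point")
    return errors


print(check_pos_list("0 0 0 1 1 1 0 0"))  # a closed triangle
print(check_pos_list("0 0 1 1 0 0"))      # closed, but only two different points
```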
3.2 metadata/files.xml
1. The file `metadata/files.xml` MUST adhere to the DANS bag file metadata schema.
2. Each `file` element's `filepath` attribute MUST contain the bag local path to the payload file described. When an `original-filepaths.txt` exists, rule 3.3.1 applies. Directories and non-payload files MUST NOT be described by a `file` element.
3. There MUST NOT be more than one `file` element corresponding to a payload file and every payload file MUST be described by a `file` element.
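
Rules 3.2.2 and 3.2.3 together require a one-to-one correspondence between `file` elements and payload files. The sketch below compares the two, given the content of `files.xml` and the set of payload paths. It matches `file` elements by local name so that it does not depend on the prefix used, and it ignores the `original-filepaths.txt` mapping; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def check_files_xml(files_xml: str, payload_files: set) -> list:
    """Return violations of rules 3.2.2 and 3.2.3."""
    root = ET.fromstring(files_xml)
    # Count how often each filepath is described by a <file> element.
    described = Counter(
        element.get("filepath", "")
        for element in root.iter()
        if element.tag.rsplit("}", 1)[-1] == "file"
    )
    errors = []
    for path, count in sorted(described.items()):
        if path not in payload_files:
            errors.append(f"3.2.2: {path} is not a payload file")
        elif count > 1:
            errors.append(f"3.2.3: {path} is described {count} times")
    for path in sorted(payload_files - set(described)):
        errors.append(f"3.2.3: {path} has no file element")
    return errors
```

The `payload_files` argument would typically be built by walking the `data` directory of the bag and collecting bag-relative paths such as `data/a.txt`.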
3.3 original-filepaths.txt
1. The purpose of `original-filepaths.txt` is to provide a complete mapping from renamed files back to their original full path (including filename). It MUST be a text file encoded with UTF-8.
2. The lines of `original-filepaths.txt` MUST be formatted as `<physical-bag-relative-path><whitespace><original-bag-relative-path>`, where `<physical-bag-relative-path>` MUST NOT contain whitespace. `<physical-bag-relative-path>` MUST correspond one-to-one to an existing payload file. `<original-bag-relative-path>` MUST correspond one-to-one to a `filepath` attribute in the `file` elements in the `files.xml`.
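
Since the physical path cannot contain whitespace, each line can be split at the first run of whitespace, and everything after it is the original path (which may itself contain spaces). A minimal sketch of such a parser follows; the function name is hypothetical, skipping blank lines is an assumption of this example, and the cross-checks against the payload files and `files.xml` are left out.

```python
def parse_original_filepaths(text: str) -> dict:
    """Map each physical bag-relative path to its original bag-relative path."""
    mapping = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        parts = line.split(None, 1)  # split at the first run of whitespace only
        if len(parts) != 2:
            raise ValueError(f"line {number}: expected two paths")
        physical, original = parts
        if physical in mapping:
            raise ValueError(f"line {number}: duplicate physical path {physical}")
        mapping[physical] = original
    # One-to-one: no two physical paths may map to the same original path.
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("duplicate original path")
    return mapping


print(parse_original_filepaths("data/a1b2 data/my dir/report final.txt\n"))
```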
4 Data Station context requirements
1. If `bag-info.txt` contains the element `Is-Version-Of` there MUST be a dataset in the target Data Station with the following properties: (a) it has a `dansSwordToken` with the same value as `Is-Version-Of` (b) it has a `dansOtherId` with the same value as `bag-info.txt`'s `Has-Organizational-Identifier` (or both are absent).
2. The `metadata/dataset.xml` element `dcmiMetadata/dcterms:license` with attribute `xsi:type="dcterms:URI"` (see 3.1.2) must be one of the licenses supported by the target Data Station.
3. The date in `metadata/dataset.xml` element `profile/available` MUST NOT be further in the future than the limit set on embargoes in the target Data Station.
4. The bag MUST NOT contain a payload file `data/original-metadata.zip`.
5 Vault as a Service context requirements
1. If `bag-info.txt` contains the element `Is-Version-Of` there MUST be a dataset in the Vault that has a `dansSwordToken` with the same value as `Is-Version-Of`.
2. (a) The file `metadata/dataset.xml` MAY contain at most one `dcterms:identifier` element with an attribute `xsi:type="id-type:DOI"`. (b) If present, the text of this element MUST be a syntactically valid DOI identifier which SHOULD be resolvable.
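
For rule 5.2 (b), a client may want a rough syntax check before submitting. The pattern below is only an approximation of DOI syntax (the `10.` directory indicator, a numeric registrant code that may be subdivided with dots, a slash, and a non-empty suffix without whitespace); it is an assumption of this example and not necessarily the check that DANS applies. Resolvability is not tested.

```python
import re

# Approximate DOI syntax: '10.' + registrant code (optionally subdivided) + '/' + suffix.
DOI_PATTERN = re.compile(r"^10\.\d+(?:\.\d+)*/\S+$")


def looks_like_doi(text: str) -> bool:
    """Rough syntactic check; does not verify that the DOI resolves."""
    return DOI_PATTERN.match(text.strip()) is not None


print(looks_like_doi("10.1234/example-5678"))
print(looks_like_doi("doi:10.1234/example-5678"))
```

In this sketch a value with a `doi:` prefix or a resolver URL is rejected; whether such forms are acceptable in `dcterms:identifier` is not stated in this document.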
References
- BagIt - packaging format on which the DANS Data Station ingest process is based
- DANS bag file metadata schema - schema for content file specific metadata
- DANS dataset metadata schema - schema for dataset level metadata
- DANS controlled list of dcterms:identifier types - list of valid values for `xsi:id-type`
- DANS DCTERMS extension: Digital Author Identifier - extensions to accommodate Digital Author Identifiers
- DANS DCTERMS extension: Geography Markup Language - extensions to accommodate Geography Markup Language
- DC - Dublin Core metadata elements
- DCTERMS - metadata schema from which elements are used in several DANS metadata schemas
- DOI - Digital Object Identifier
- GML 3.1.1 Simplified Features profile Levels 0 and 1 - Profile of Geography Mark-up Language
- ISO 8601 - standard for formatting dates and times
- Media Type - identifiers for file formats
- OAIS Reference Model - reference model for long term preservation archives
- RFC 2119 - specification for use of requirement level keywords
- RD - Stelsel van de Rijksdriehoeksmeting, a scheme for geographical locations
- urn:uuid - URN scheme for UUIDs
- urn:NBN - URN scheme for National Bibliographic Numbers
- UUID - universally unique identifiers
- XML Schema - constraint language for specifying types of XML documents