DANS BagIt Profile v1.0.0
Introduction
Version
- Document version: 1.0.0
- Publication date: 2024-02-06
Status
- ARCHIVED
THIS VERSION OF THE PROFILE IS NO LONGER IN USE.
Scope
This document specifies what constitutes an acceptable DANS bag. This includes all the requirements for a bag to be successfully processed by the ingest workflow.
Version 1.0.0 covers the requirements used by the DANS Data Stations and the Vault as a Service.
(DANS internal note: DANS BagIt Profile is also being used in the migration from EASY to the Data Stations. Some rules are modified for migration deposits. See Migration Supplement.)
Overview and conventions
Keywords
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
The keyword "SHOULD" is also used to specify requirements that are impossible or impractical to check by the archival organization (i.e. DANS). The client should do its best to meet these requirements, but not rely on their being validated by the archival organization.
Subdivisions
The requirements are subdivided into the following sections:
- BagIt-related requirements - requirements that refer back to the BagIt specifications
- Structural requirements - requirements regarding the directories and files in the bag
- Metadata requirements - requirements regarding the metadata files included in the bag
- Data Station context requirements - requirements regarding the Data Station where the dataset is to be created or modified; these do not apply to the Vault as a Service
- Vault as a Service context requirements - requirements that are only applicable to the Vault as a Service and not to the Data Stations
The sections are numbered and may have numbered subsections. The requirements themselves are stated as numbered rules. Rules may have parts that are labeled with letters: (a), (b), (c), etc. To uniquely identify a specific rule, use the notation
<section-nr>[.<subsection-nr>].<rule-nr> [(<letter>)]
Example: 2.3.4 (e)
means part e of the fourth rule in subsection 3 of section 2.
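
The rule notation above is regular enough to be parsed mechanically, which can be useful when validation messages need to point back to a rule. The following is a minimal sketch, not part of the profile; the names `RuleRef` and `parse_rule_ref` are hypothetical, and it assumes that part letters are lower case.

```python
import re
from typing import NamedTuple, Optional

# <section-nr>[.<subsection-nr>].<rule-nr> [(<letter>)]
RULE_REF = re.compile(r"^(\d+)(?:\.(\d+))?\.(\d+)(?:\s*\(([a-z])\))?$")


class RuleRef(NamedTuple):
    section: int
    subsection: Optional[int]  # None when the section has no subsections
    rule: int
    part: Optional[str]        # None when no part letter is given


def parse_rule_ref(text: str) -> RuleRef:
    """Parse a rule reference such as '2.3.4 (e)' or '4.1'."""
    match = RULE_REF.match(text.strip())
    if match is None:
        raise ValueError(f"not a rule reference: {text!r}")
    section, subsection, rule, part = match.groups()
    return RuleRef(
        int(section),
        int(subsection) if subsection is not None else None,
        int(rule),
        part,
    )


print(parse_rule_ref("2.3.4 (e)"))
print(parse_rule_ref("4.1"))
```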
XML namespaces
When this document refers to XML element names, attribute names or attribute values that have a prefix (such as `dcterms:identifier`), a name in a specific namespace is intended. The table below lists the mapping from prefix to namespace. In the actual document, the namespace may be bound to a different prefix, or be the default namespace.
Prefix | Namespace URI | Namespace documentation |
---|---|---|
`dc` | `http://purl.org/dc/elements/1.1/` | DC |
`dcterms` | `http://purl.org/dc/terms/` | DCTERMS |
`dcx-dai` | `http://easy.dans.knaw.nl/schemas/dcx/dai/` | DANS DCTERMS extension: Digital Author Identifier |
`dcx-gml` | `http://easy.dans.knaw.nl/schemas/dcx/gml/` | DANS DCTERMS extension: Geography Markup Language |
`ddm` | `http://schemas.dans.knaw.nl/dataset/ddm-v2/` | DANS dataset metadata schema |
`files` | `http://easy.dans.knaw.nl/schemas/bag/metadata/files/` | DANS bag file metadata schema |
`gml` | `http://www.opengis.net/gml` | GML 3.1.1 Simplified Features profile Levels 0 and 1 |
`id-type` | `http://easy.dans.knaw.nl/schemas/vocab/identifier-type/` | DANS controlled list of dcterms:identifier types |
`xsi` | `http://www.w3.org/2001/XMLSchema-instance` | XML Schema |
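
Because a namespace may be bound to any prefix in the actual document, a client should match on the namespace URI rather than on the prefix text. The sketch below illustrates this with Python's standard `xml.etree.ElementTree`. The sample document is hypothetical: its root element name and the prefix `t` are assumptions made for illustration, and only the namespace URIs are taken from the table.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI mapping taken from the table above (subset).
NAMESPACES = {
    "dcterms": "http://purl.org/dc/terms/",
    "ddm": "http://schemas.dans.knaw.nl/dataset/ddm-v2/",
}

# Hypothetical sample: the DCTERMS namespace is bound to "t" here,
# not to "dcterms". The root element name is an assumption.
SAMPLE = """\
<ddm:DDM xmlns:ddm="http://schemas.dans.knaw.nl/dataset/ddm-v2/"
         xmlns:t="http://purl.org/dc/terms/">
  <ddm:dcmiMetadata>
    <t:identifier>example-id</t:identifier>
  </ddm:dcmiMetadata>
</ddm:DDM>
"""

root = ET.fromstring(SAMPLE)

# The query uses the prefix from the table; ElementTree resolves it to
# the namespace URI, so the element is found despite the "t" prefix.
identifiers = root.findall(".//dcterms:identifier", NAMESPACES)
identifier_texts = [element.text for element in identifiers]
print(identifier_texts)
```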
Names
Exact names are specified in `code style`. The capitalization of those names must be exactly as specified, unless it is explicitly stated that the name is case-insensitive. If the name is a file path, it is relative to the bag base directory.
Glossary
This section defines a number of terms that may be useful when discussing this specification. Several of these terms are used in the Requirements section of this document. Where terms are used in the definitions of other terms, they are printed in italics.
- bag: a data package conforming to the BagIt specifications.
- bag-ID: a urn:uuid given to every deposit upon delivery to DANS. The bag-ID of the first version of a dataset is also known as the sword token.
- (version N) DANS-bag: a bag that also conforms to (version N) of the DANS BagIt Profile specifications.
- DANS Data Station: the publishing and archiving service of DANS for datasets.
- Vault as a Service: the service that allows customers to store datasets in the DANS Data Vault, without publishing them in a Data Station.
- DANS SWORD Service: the DANS implementation of the SWORDv2 protocol.
- dataset: an information package containing data and metadata published in a DANS Data Station. A dataset may have multiple versions.
- deposit: a package containing data and metadata sent to DANS by the depositor to create a new dataset or dataset version.
- depositor: the agent sending data to DANS for publishing in a DANS Data Station. This term corresponds to the Producer in the OAIS Reference Model.
- deaccessioned version: a dataset version that is present in a DANS Data Station, but no longer disseminated. To the external user the dataset version appears to be deleted; only a tombstone remains.
- update-deposit: a deposit that is intended to create a new version of an existing dataset.
- version-bag-ID: a bag-ID that references a version of a dataset. If the dataset consists of several versions, the DOI resolves to the most recent of those versions. All versions of one dataset have the same sword token.
- sword token: a urn:uuid assigned to first-version bags upon submission through the DANS SWORD Service. This UUID is used when depositing an update-deposit and to identify all versions of one dataset.
Requirements
1 BagIt related
1.1 Validity
- The bag MUST be valid according to the BagIt specifications v1.0 (RFC 8493).
1.2 `bag-info.txt`
- The bag MUST contain a `bag-info.txt` file.
- (a) The `bag-info.txt` file MUST contain exactly one element called `Created`. (b) It MUST have a timestamp value in ISO 8601 format, including the time zone and a millisecond precision time part. (c) The values in the `Created` timestamps SHOULD reflect the correct order of the versions.
- (a) The `bag-info.txt` file MAY contain at most one element called `Is-Version-Of` (b) with a urn:uuid value. See rules 4.1 and 5.1 for the context requirements if `Is-Version-Of` is provided.
- (a) The `bag-info.txt` file MAY contain at most one element called `Has-Organizational-Identifier`. (b) If `Has-Organizational-Identifier` is given, at most one `Has-Organizational-Identifier-Version` MAY be present, containing a version number. (c) If `Has-Organizational-Identifier` is present, then its value MUST start with one of the approved prefixes. Each client will be assigned a unique prefix to use for this purpose.
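
A client may want to check the `Created` and `Is-Version-Of` rules before submitting a bag. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the regular expression encodes one reading of rule 1.2.2 (b) (extended ISO 8601 with exactly three fractional digits and an explicit offset or `Z`), and the parser handles only the basic `Label: value` layout with folded continuation lines. Note that `uuid.UUID` is lenient about the textual form it accepts, so this is not a strict syntax check.

```python
import re
import uuid
from datetime import datetime

# Assumed shape: date, time with milliseconds, and a time zone designator.
CREATED_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}(Z|[+-]\d{2}:\d{2})$"
)


def parse_bag_info(text):
    """Return (label, value) pairs; indented lines continue the previous value."""
    elements = []
    for line in text.splitlines():
        if line[:1] in (" ", "\t") and elements:
            label, value = elements[-1]
            elements[-1] = (label, value + " " + line.strip())
        elif ":" in line:
            label, value = line.split(":", 1)
            elements.append((label.strip(), value.strip()))
    return elements


def is_urn_uuid(value):
    prefix = "urn:uuid:"
    if not value.startswith(prefix):
        return False
    try:
        uuid.UUID(value[len(prefix):])
    except ValueError:
        return False
    return True


def check_bag_info(text):
    """Return a list of violations of rules 1.2.2 and 1.2.3."""
    errors = []
    elements = parse_bag_info(text)
    created = [value for label, value in elements if label == "Created"]
    if len(created) != 1:
        errors.append("1.2.2 (a): expected exactly one Created element")
    else:
        try:
            if not CREATED_RE.match(created[0]):
                raise ValueError("unexpected timestamp layout")
            # Rejects impossible dates such as month 13.
            datetime.fromisoformat(created[0].replace("Z", "+00:00"))
        except ValueError:
            errors.append("1.2.2 (b): Created is not a valid timestamp")
    versions = [value for label, value in elements if label == "Is-Version-Of"]
    if len(versions) > 1:
        errors.append("1.2.3 (a): more than one Is-Version-Of element")
    for value in versions:
        if not is_urn_uuid(value):
            errors.append("1.2.3 (b): Is-Version-Of is not a urn:uuid")
    return errors


sample = (
    "Created: 2024-02-06T12:34:56.789+01:00\n"
    "Is-Version-Of: urn:uuid:123e4567-e89b-12d3-a456-426614174000\n"
)
print(check_bag_info(sample))
```

Rule 1.2.2 (c) and the prefix check in rule 1.2.4 (c) depend on information held by the archival organization and are therefore left out of this sketch.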
1.3 Manifests
- If the bag has only one payload manifest it MUST NOT use the MD5 algorithm. However, it MAY have an MD5 payload manifest in addition to other payload manifests.
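
The manifest rule can be checked from the names of the payload manifest files alone. The sketch below assumes the BagIt naming convention `manifest-<algorithm>.txt` in the bag base directory; the function names are hypothetical.

```python
from pathlib import Path


def payload_manifest_algorithms(bag_dir):
    """List the algorithms of the payload manifests found in the bag base directory."""
    # The pattern does not match tag manifests (tagmanifest-<algorithm>.txt).
    return sorted(
        path.name[len("manifest-"):-len(".txt")].lower()
        for path in Path(bag_dir).glob("manifest-*.txt")
    )


def violates_md5_rule(algorithms):
    """True if MD5 is the only payload manifest algorithm (rule 1.3.1)."""
    return {name.lower() for name in algorithms} == {"md5"}


print(violates_md5_rule(["md5"]))
print(violates_md5_rule(["md5", "sha256"]))
```

A bag without any payload manifest is not flagged here, because that case is already covered by BagIt validity (rule 1.1.1).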
2 Structural requirements
- The bag MUST have a tag-directory called `metadata` directly under the bag base directory.
- The `metadata` directory MUST contain the files: (a) `metadata/dataset.xml` and (b) `metadata/files.xml`.
- The `metadata` directory MUST NOT contain any other files or directories.
- A DANS bag MAY contain a file `original-filepaths.txt` in the root of the bag in UTF-8 encoding. For the content requirements and purpose of this file, if it exists, see Section 3.3.
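
The first three structural rules translate directly into a directory check. The following is a minimal sketch; `check_metadata_dir` is a hypothetical name, and the demo builds a throwaway directory that deliberately violates two rules.

```python
import tempfile
from pathlib import Path

# Required files and the rule part that names each of them.
REQUIRED_FILES = {"dataset.xml": "2.2 (a)", "files.xml": "2.2 (b)"}


def check_metadata_dir(bag_dir):
    """Return a list of violations of structural rules 2.1 to 2.3."""
    metadata = Path(bag_dir) / "metadata"
    if not metadata.is_dir():
        return ["2.1: no metadata directory directly under the bag base directory"]
    errors = []
    for name, rule in REQUIRED_FILES.items():
        if not (metadata / name).is_file():
            errors.append(f"{rule}: metadata/{name} is missing")
    for entry in sorted(metadata.iterdir()):
        if entry.name not in REQUIRED_FILES:
            errors.append(f"2.3: unexpected entry metadata/{entry.name}")
    return errors


# Demo: files.xml is missing and notes.txt is not allowed.
with tempfile.TemporaryDirectory() as tmp:
    metadata_dir = Path(tmp) / "metadata"
    metadata_dir.mkdir()
    (metadata_dir / "dataset.xml").write_text("<placeholder/>", encoding="utf-8")
    (metadata_dir / "notes.txt").write_text("not allowed here", encoding="utf-8")
    demo_errors = check_metadata_dir(tmp)

print(demo_errors)
```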
3 Metadata requirements
3.1 `metadata/dataset.xml`
- The file `metadata/dataset.xml` MUST adhere to DANS dataset metadata schema.
- The file `metadata/dataset.xml` MUST have at least one `dcterms:license` element as a child of the `dcmiMetadata` element. Exactly one of these elements MUST have the attribute `xsi:type="dcterms:URI"` and have a URI as element text. (See also rule 4.2.)
- Any `dcx-dai:<scheme>` elements (i.e. (a) DAI, (b) ISNI or (c) ORCID) MUST contain an identifier that complies with the syntax for the selected identifier scheme URI.
- Any `gml:posList` elements nested in `gml:Polygon` elements:
  - MUST have an even number of values.
  - MUST have at least three different pairs of values, each of which describes one point.
  - MUST start and end with the same pair of values.
- If a `gml:MultiSurface` is used, all nested `gml:Polygon` elements MUST have the same `srsName` attribute.
- Any `gml:Point/gml:pos` (`pos` nested in a `Point`), `gml:lowerCorner` and `gml:upperCorner` elements MUST have at least two values. These values MUST be numeric. If the `srsName` specifies the RD scheme, the values MUST be within the valid range.
- Any `dc:identifier` or `dcterms:identifier` element with attribute `xsi:type="id-type:ARCHIS-ZAAK-IDENTIFICATIE"` MUST have a value of 10 or fewer characters.
- All URLs used in `metadata/dataset.xml` MUST be valid URLs with protocol `http` or `https`.
- The file `metadata/dataset.xml` MUST have one or more rights holders, each in a `<dcterms:rightsHolder>` element as a child of the `dcmiMetadata` element.
- `dcx-dai:author` and `dcx-dai:organization` elements MUST NOT have the role `RightsHolder`.
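
The three conditions on `gml:posList` in rule 3.1.4 can be checked on the element text alone. The sketch below is an illustration under assumptions: `check_pos_list` is a hypothetical name, the values are taken to be whitespace-separated, and points are compared as pairs of floats (the rule itself only speaks of pairs of values).

```python
def check_pos_list(text):
    """Return a list of violations of rule 3.1.4 for one gml:posList text."""
    # Raises ValueError if a value is not numeric.
    values = [float(value) for value in text.split()]
    if len(values) % 2 != 0:
        # Without an even count the values cannot be grouped into pairs.
        return ["odd number of values"]
    points = list(zip(values[0::2], values[1::2]))
    errors = []
    if len(set(points)) < 3:
        errors.append("fewer than three different points")
    if not points or points[0] != points[-1]:
        errors.append("first and last point differ")
    return errors


print(check_pos_list("0 0 0 1 1 1 0 0"))  # closed triangle
print(check_pos_list("0 0 1 1 0 0"))      # closed, but only two different points
print(check_pos_list("0 0 0 1 1"))        # odd number of values
```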
3.2 `metadata/files.xml`
- The file `metadata/files.xml` MUST adhere to DANS bag file metadata schema.
- Each `file` element's `filepath` attribute MUST contain the bag local path to the payload file described. When an `original-filepaths.txt` exists, rule 3.3.1 applies. Directories and non-payload files MUST NOT be described by a `file` element.
- There MUST NOT be more than one `file` element corresponding to a payload file and every payload file MUST be described by a `file` element.
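
Rules 3.2.2 and 3.2.3 together require a one-to-one correspondence between `file` elements and payload files. The following is a minimal sketch of that comparison. The function name and the sample document (including its root element name) are assumptions, and the sketch ignores the case where `original-filepaths.txt` exists.

```python
import xml.etree.ElementTree as ET
from collections import Counter

FILES_NS = "http://easy.dans.knaw.nl/schemas/bag/metadata/files/"


def check_files_xml(files_xml_text, payload_paths):
    """Compare the filepath attributes in files.xml with the payload file paths."""
    root = ET.fromstring(files_xml_text)
    paths = [element.get("filepath") for element in root.iter(f"{{{FILES_NS}}}file")]
    described = Counter(path for path in paths if path is not None)
    payload = set(payload_paths)
    errors = []
    for path, count in sorted(described.items()):
        if count > 1:
            errors.append(f"3.2.3: {path} is described {count} times")
    for path in sorted(set(described) - payload):
        errors.append(f"3.2.2: {path} is not a payload file")
    for path in sorted(payload - set(described)):
        errors.append(f"3.2.3: {path} is not described")
    return errors


# Hypothetical sample with one duplicate, one directory and one omission.
SAMPLE = f"""\
<files xmlns="{FILES_NS}">
  <file filepath="data/a.txt"/>
  <file filepath="data/a.txt"/>
  <file filepath="data/dir"/>
</files>
"""

sample_errors = check_files_xml(SAMPLE, ["data/a.txt", "data/b.txt"])
for error in sample_errors:
    print(error)
```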
3.3 `original-filepaths.txt`
- The purpose of `original-filepaths.txt` is to provide a complete mapping from renamed files back to their original full path (including filename). It MUST be a text file encoded with UTF-8.
- The lines of `original-filepaths.txt` MUST be formatted as `<physical-bag-relative-path><whitespace><original-bag-relative-path>`, where `<physical-bag-relative-path>` MUST NOT contain whitespace. `<physical-bag-relative-path>` MUST correspond one-to-one to an existing payload file. `<original-bag-relative-path>` MUST correspond one-to-one to a `filepath` attribute in the `file` elements in the `files.xml`.
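
Since the physical path contains no whitespace, each line can be split at its first run of whitespace; the remainder, which may itself contain spaces, is the original path. The sketch below illustrates this. The function names are hypothetical, blank lines are skipped as an assumption, and the check only verifies that listed paths exist and are unique, not that every payload file is listed.

```python
def parse_original_filepaths(text):
    """Map each physical bag-relative path to its original bag-relative path."""
    mapping = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # assumption: blank lines are ignored
        parts = line.split(None, 1)  # split at the first run of whitespace
        if len(parts) != 2:
            raise ValueError(f"line {number}: expected two paths")
        physical, original = parts
        if physical in mapping:
            raise ValueError(f"line {number}: {physical} is listed twice")
        mapping[physical] = original
    return mapping


def check_mapping(mapping, payload_paths, files_xml_paths):
    """Return violations of rule 3.3.2 for a parsed mapping."""
    errors = []
    for physical in sorted(set(mapping) - set(payload_paths)):
        errors.append(f"{physical} is not an existing payload file")
    originals = list(mapping.values())
    if len(set(originals)) != len(originals):
        errors.append("an original path is mapped more than once")
    for original in sorted(set(originals) - set(files_xml_paths)):
        errors.append(f"{original} has no file element in files.xml")
    return errors


sample = "data/1.txt data/my file.txt\ndata/2.txt data/other.txt\n"
sample_mapping = parse_original_filepaths(sample)
print(sample_mapping)
print(check_mapping(sample_mapping,
                    ["data/1.txt", "data/2.txt"],
                    ["data/my file.txt", "data/other.txt"]))
```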
4 Data Station context requirements
- If `bag-info.txt` contains the element `Is-Version-Of`, there MUST be a dataset in the target Data Station with the following properties: (a) it has a `dansSwordToken` with the same value as `Is-Version-Of`; (b) it has a `dansOtherId` with the same value as `bag-info.txt`'s `Has-Organizational-Identifier` (or both are absent).
- The `metadata/dataset.xml` element `dcmiMetadata/dcterms:license` with attribute `xsi:type="dcterms:URI"` (see 3.1.2) MUST be one of the licenses supported by the target Data Station.
- The date in `metadata/dataset.xml` element `profile/available` MUST NOT be further in the future than the limit set on embargoes in the target Data Station.
- The bag MUST NOT contain a payload file `data/original-metadata.zip`.
5 Vault as a Service context requirements
- If `bag-info.txt` contains the element `Is-Version-Of`, there MUST be a dataset in the Vault that has a `dansSwordToken` with the same value as `Is-Version-Of`.
- (a) The file `metadata/dataset.xml` MAY contain at most one `dcterms:identifier` element with an attribute `xsi:type="id-type:DOI"`. (b) If present, the text of this element MUST be a syntactically valid DOI identifier which SHOULD be resolvable.
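
This profile does not spell out what "syntactically valid" means for a DOI. As a rough pre-check, a client could test for the common shape of a bare DOI: the directory indicator `10.`, a numeric registrant code, a slash and a non-empty suffix. The pattern below is a heuristic assumption, not the check performed by DANS, and the sample values are hypothetical.

```python
import re

# Heuristic: "10." + 4 to 9 digits + "/" + a suffix without whitespace.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")


def is_plausible_doi(text):
    """True if the text has the common shape of a bare DOI (no resolver URL, no 'doi:' prefix)."""
    return DOI_RE.match(text.strip()) is not None


print(is_plausible_doi("10.1234/example-suffix"))
print(is_plausible_doi("doi:10.1234/example-suffix"))
```

Whether the DOI resolves is a SHOULD-level requirement and would need a network request, which is outside the scope of this sketch.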
References
- BagIt - packaging format on which the DANS Data Station ingest process is based
- DANS bag file metadata schema - schema for content file specific metadata
- DANS dataset metadata schema - schema for dataset level metadata
- DANS controlled list of dcterms:identifier types - list of valid values for `xsi:id-type`
- DANS DCTERMS extension: Digital Author Identifier - extensions to accommodate Digital Author Identifiers
- DANS DCTERMS extension: Geography Markup Language - extensions to accommodate Geography Markup Language
- DC - Dublin Core metadata elements
- DCTERMS - metadata schema from which elements are used in several DANS metadata schemas
- DOI - Digital Object Identifier
- GML 3.1.1 Simplified Features profile Levels 0 and 1 - Profile of Geography Mark-up Language
- ISO 8601 - standard for formatting dates and times
- Media Type - identifiers for file formats
- OAIS Reference Model - reference model for long term preservation archives
- RFC 2119 - specification for use of requirement level keywords
- RD - Stelsel van de Rijksdriehoeksmeting, a scheme for geographical locations
- urn:uuid - URN scheme for UUIDs
- urn:NBN - URN scheme for National Bibliographic Numbers
- UUID - universally unique identifiers
- XML Schema - constraint language for specifying types of XML documents