DANS BagIt Profile v0.0.0
Introduction
Version
- Document version: 0.0.0
- Publication date: 2019-05-02
Status
Status of this document is PUBLISHED.
THIS VERSION OF THE PROFILE IS DEPRECATED AND ONLY APPLIES TO THE LEGACY EASY SWORD2 SERVICE.
Scope
This document specifies what constitutes an acceptable DANS bag. This includes all the requirements for a bag to be successfully processed by the ingest workflow. What is acceptable as a SIP is sometimes not exactly the same as what is acceptable as an AIP. Where these are different, this is documented.
Version 0.0.0 covers the de facto requirements that are used by the DANS EASY archive. In this context a SIP is the bag as it is submitted by the client and an AIP is the bag as it is stored in the DANS EASY Vault, which is implemented as a bag-store.
Overview and conventions
Key words
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
The key word "SHOULD" is also used to specify requirements that are impossible or impractical to check by the archival organization (i.e. DANS). The client should do its best to meet these requirements, but not rely on their being validated by the archival organization.
Subdivisions
The requirements are subdivided into the following sections:
- BagIt-related requirements - requirements that refer back to the BagIt specifications
- Structural requirements - requirements regarding the directories and files in the bag
- Metadata requirements - requirements regarding the metadata files included in the bag
- Bag sequence requirements - requirements regarding the metadata that records how a sequence of bags can represent the history of a dataset
The sections are numbered and may have numbered subsections. The requirements themselves are stated as numbered rules.
To uniquely identify a specific rule, use the notation `<section-nr>[.<subsection-nr>].<rule-nr>`, e.g., 2.3.4 for the fourth rule in subsection 3 of section 2.
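As a minimal sketch, a rule identifier in this notation could be split into its parts with a regular expression. The function name `parse_rule_id` and the shape of the returned tuple are illustrative assumptions and not part of this profile.

```python
import re

# Matches <section-nr>[.<subsection-nr>].<rule-nr>, e.g. "2.3.4" or "4.1".
RULE_ID = re.compile(r"^(\d+)(?:\.(\d+))?\.(\d+)$")


def parse_rule_id(rule_id):
    """Return (section, subsection or None, rule) for a rule identifier."""
    match = RULE_ID.match(rule_id)
    if match is None:
        raise ValueError(f"not a rule identifier: {rule_id!r}")
    section, subsection, rule = match.groups()
    return int(section), (int(subsection) if subsection else None), int(rule)
```

For a section without subsections, such as section 4, the subsection part of the result is `None`.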
SIP vs AIP
Rules that only apply to SIPs or AIPs are annotated with the comments "(SIP)" and "(AIP)" respectively. All other rules apply to SIP and AIP alike.
Stand-alone vs in-sequence-context validation
There are two levels of validation:
- stand-alone: this includes the validation steps that can be performed without taking into account any data outside the bag, in particular the bag store that the bag is (to be) stored in. This means that only the rules in sections 1-3 are checked in this level.
- in-sequence-context: this includes all the checks done in stand-alone validation plus checks that involve the bag-sequence that the bag is (to be) part of. This means all the rules in this document are checked.
Stand-alone validation can be useful when you do not have access to the target bag store. Another reason to do only stand-alone validation may be that the overhead of doing calls to the bag store over the network is too high.
XML namespaces
When referring to XML element or attribute names or attribute values that have a prefix (such as `dcterms:identifier`), an element in a certain namespace is intended. The table below lists the mapping from prefix to namespace. In the actual document, the namespace may be bound to a different prefix, or be the default namespace.
Prefix | Namespace URI | Namespace documentation |
---|---|---|
am | http://easy.dans.knaw.nl/schemas/bag/metadata/agreements/ | DANS deposit agreement metadata |
amd | http://easy.dans.knaw.nl/easy/dataset-administrative-metadata/ | EASY administrative metadata schema |
dc | http://purl.org/dc/elements/1.1/ | DC |
dcterms | http://purl.org/dc/terms/ | DCTERMS |
dcx-dai | http://easy.dans.knaw.nl/schemas/dcx/dai/ | DANS DCTERMS extension: Digital Author Identifier |
dcx-gml | http://easy.dans.knaw.nl/schemas/dcx/gml/ | DANS DCTERMS extension: Geography Markup Language |
ddm | http://easy.dans.knaw.nl/schemas/md/ddm/ | DANS dataset metadata schema |
emd | http://easy.dans.knaw.nl/easy/easymetadata/ | EASY metadata schema |
files | http://easy.dans.knaw.nl/schemas/bag/metadata/files/ | DANS bag file metadata schema |
gml | http://www.opengis.net/gml | GML 3.1.1 Simplified Features profile Levels 0 and 1 |
id-type | http://easy.dans.knaw.nl/schemas/vocab/identifier-type/ | DANS controlled list of dcterms:identifier types |
wfs | http://easy.dans.knaw.nl/easy/workflow/ | EASY administrative workflow metadata schema |
xsi | http://www.w3.org/2001/XMLSchema-instance | XML Schema |
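Because only the namespace URI matters and not the prefix, a lookup should be keyed on the URI. The sketch below illustrates this with Python's standard `xml.etree.ElementTree`; the sample fragment, its `dct` prefix, its root element name and the function name `find_identifiers` are hypothetical and serve only as illustration.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace mapping taken from the table above (subset).
NS = {
    "dcterms": "http://purl.org/dc/terms/",
    "ddm": "http://easy.dans.knaw.nl/schemas/md/ddm/",
}

# Hypothetical fragment that binds the DCTERMS namespace to the prefix "dct".
SAMPLE = """<ddm:DDM xmlns:ddm="http://easy.dans.knaw.nl/schemas/md/ddm/"
                     xmlns:dct="http://purl.org/dc/terms/">
  <ddm:dcmiMetadata>
    <dct:identifier>example-id</dct:identifier>
  </ddm:dcmiMetadata>
</ddm:DDM>"""


def find_identifiers(xml_text):
    """Return the text of every dcterms:identifier, whatever prefix the document uses."""
    root = ET.fromstring(xml_text)
    # The "dcterms" prefix here is resolved through NS, not through the document.
    return [el.text for el in root.findall(".//dcterms:identifier", NS)]
```

The query uses the prefix `dcterms` from the table, yet it still finds the element that the sample document writes as `dct:identifier`.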
Name spelling
Exact names are specified in `code` style. The capitalization of those names must be exactly as specified, unless it is explicitly stated that the name is case insensitive. If the name is a file path it is relative to the bag base directory.
Glossary
This section defines a number of terms that may be useful when discussing this specification. Several of these terms are used in the Requirements section of this document. Where terms are used in the definitions of other terms, they are printed in italics.
- AIP: Archival Information Package, as defined in the OAIS Reference Model. See also AIP.
- bag: a data package conforming to the BagIt specifications.
- bag-sequence: a sequence of *DANS-bag*s that together represent the history of an archived *dataset* in the *DANS Data Vault*.
- bag-store: storage for immutable *bag*s. All DANS bag-stores combined are called the *DANS Data Vault*.
- (version N) DANS-bag: a *bag* that also conforms to (version N) of the DANS BagIt Profile specifications.
- base-revision: the *revision* that is used to reference a complete *bag-sequence*. Usually, this is the oldest *revision*.
- corresponding bag-files: files in different *bag*s of the same *bag-sequence* that are identical or have the exact same path (including the file name) within the *bag*.
- DANS Data Vault: the *bag-store*s of DANS combined.
- dataset: the DANS implementation of the Information Package concept from the OAIS Reference Model. An archived *dataset* corresponds to an *AIP*. A *dataset* is represented in the *DANS Data Vault* as a *bag-sequence*.
- deposit: a *dataset* as it is delivered to DANS by the *depositor*, i.e. the *SIP*, as it is being transformed into an *AIP*.
- depositor: the agent sending data to DANS for archiving in the *DANS Data Vault*. This term corresponds to the Producer in the OAIS Reference Model.
- inactive bag: a *bag* that is present in the *DANS Data Vault*, but no longer disseminated. To the external user the *bag* appears to be deleted. A *bag* that is not inactive is active.
- revision: a *DANS-bag* that has been successfully archived in the *DANS Data Vault*. It has a UUID identifier and a timestamp that records its place in the *bag-sequence*.
- SIP: Submission Information Package, as defined in the OAIS Reference Model. See also SIP.
- update-deposit: a *deposit* that is intended to be added as a *revision* of an existing, archived *dataset*.
- version: a *revision* that has its own *version-DOI*. A *version* of an archived *dataset* may be updated by new *revision*s that keep the same *version-DOI*.
- version-DOI: a DOI that references a *version* of an archived *dataset*. If the *version* consists of several *revision*s, the *version-DOI* resolves to the most recent of those *revision*s.
Requirements
1 BagIt related
1.1 Validity
1. (SIP) The bag MUST be `VALID` according to the BagIt specifications.
1.2 `bag-info.txt`
1. The bag MUST contain a `bag-info.txt` file.
2. (a) The `bag-info.txt` file MAY contain at most one element called `BagIt-Profile-Version`. (b) If present, its value MUST be `0`.
3. (a) The `bag-info.txt` file MAY contain at most one element called `BagIt-Profile-URI`. (b) If present, its value MUST be `doi:10.17026/dans-z52-ybfe`.
4. (a) The `bag-info.txt` file MUST contain exactly one element called `Created`. (b) It MUST have a timestamp value in ISO 8601 format, including the time zone and a millisecond precision time part. (c) The values in the `Created` timestamps SHOULD reflect the correct order of the versions.
5. The `bag-info.txt` file MAY contain at most one element called `Is-Version-Of` with a urn:uuid-value.
6. The `bag-info.txt` file (a) MUST (AIP) or (b) MAY (SIP) contain an entry called `EASY-User-Account`. If present, its value SHOULD be the user name of an existing EASY user account.
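The cardinality and format rules for `bag-info.txt` lend themselves to a simple stand-alone check. The following is a sketch only: the function names are hypothetical, the parser handles just plain `Label: value` lines with whitespace-indented continuation lines, and the regular expression assumes that the time zone is written as `Z` or `±hh:mm`, which is narrower than everything ISO 8601 allows.

```python
import re
from collections import Counter

# ISO 8601 timestamp with millisecond precision and a time zone.
# Assumption of this sketch: the zone is written as "Z" or "+hh:mm" / "-hh:mm".
CREATED = re.compile(
    r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}(Z|[+-]\d{2}:\d{2})$"
)


def parse_bag_info(text):
    """Parse 'Label: value' lines into a list of (label, value) pairs."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and entries:
            # Continuation line: append to the value of the previous entry.
            label, value = entries[-1]
            entries[-1] = (label, value + " " + line.strip())
        else:
            label, _, value = line.partition(":")
            entries.append((label.strip(), value.strip()))
    return entries


def check_bag_info(text):
    """Return violations for a subset of rules 1.2.2 to 1.2.5."""
    entries = parse_bag_info(text)
    counts = Counter(label for label, _ in entries)
    values = dict(entries)
    errors = []
    for label in ("BagIt-Profile-Version", "BagIt-Profile-URI", "Is-Version-Of"):
        if counts[label] > 1:
            errors.append(f"{label} occurs more than once")
    if counts["Created"] != 1:
        errors.append("Created must occur exactly once")
    elif not CREATED.match(values["Created"]):
        errors.append("Created is not an ISO 8601 timestamp with milliseconds and time zone")
    if counts["Is-Version-Of"] == 1 and not values["Is-Version-Of"].startswith("urn:uuid:"):
        errors.append("Is-Version-Of must be a urn:uuid value")
    return errors
```

The sketch does not check the values of `BagIt-Profile-Version` and `BagIt-Profile-URI`, nor the `EASY-User-Account` entry, whose requirement level differs between SIP and AIP.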
1.3 Manifests
1. (AIP) (a) The bag MUST have a SHA-1 payload manifest. (b) It MUST have entries for all the payload files.
2. The bag MAY have other payload manifests and tag manifests.
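A sketch of how rule 1.3.1 might be checked on a bag directory follows. It assumes the BagIt naming convention `manifest-sha1.txt` for the SHA-1 payload manifest and plain, unencoded paths in the manifest; the function name `check_sha1_manifest` is hypothetical. Checksums are not verified here, since that is part of bag validity (rule 1.1.1).

```python
from pathlib import Path


def check_sha1_manifest(bag_dir):
    """Return violations of rule 1.3.1 for the bag at bag_dir."""
    bag = Path(bag_dir)
    manifest = bag / "manifest-sha1.txt"
    if not manifest.is_file():
        return ["no SHA-1 payload manifest (manifest-sha1.txt)"]
    listed = set()
    for line in manifest.read_text(encoding="utf-8").splitlines():
        parts = line.split(None, 1)  # "<checksum> <path>"
        if len(parts) == 2:
            listed.add(parts[1].strip())
    errors = []
    # Every file below data/ is a payload file and needs a manifest entry.
    for file in sorted(p for p in (bag / "data").rglob("*") if p.is_file()):
        rel = file.relative_to(bag).as_posix()
        if rel not in listed:
            errors.append(f"{rel} has no entry in manifest-sha1.txt")
    return errors
```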
2 Structural requirements
1. The bag MUST have a tag-directory called `metadata` (one word, lowercase letters) directly under the bag base directory.
2. The `metadata` directory MUST contain the files: (a) `metadata/dataset.xml` and (b) `metadata/files.xml`. It MAY contain the files `metadata/amd.xml`, `metadata/emd.xml` or `metadata/license.txt`.
3. (a) The `metadata` directory MAY contain a directory `depositor-info` in which the following file MAY be present: `metadata/depositor-info/agreements.xml`, with information about agreements between the depositor and DANS. Among other things it specifies the existence of personal data within the files in `data`. Other files that MAY be present are:
   - either `metadata/depositor-info/depositor-agreement.pdf` or `metadata/depositor-info/depositor-agreement.txt`;
   - `metadata/depositor-info/message-from-depositor.txt`;
   - `metadata/provenance.xml`.

   (b) The `metadata` directory MAY contain a directory `original` containing a `dataset.xml` and a `files.xml`, representing the original deposit.
4. (SIP) The `metadata` directory MAY contain a file called `metadata/depositor-info/message-from-depositor.txt`.
5. The `metadata` directory MUST NOT contain any other files or directories.
6. Files in the `data` directory MUST NOT have filepaths that include the following characters: `/`, `:`, `*`, `?`, `"`, `<`, `>`, `|`, `;`, `#`.
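The structural rules above can be approximated by comparing the paths found in a bag with an allow-list. This is a sketch under assumptions: paths are given relative to the bag base directory with `/` as separator, the names `check_structure`, `REQUIRED`, `OPTIONAL` and `FORBIDDEN` are hypothetical, and `/` is treated purely as the path separator, so it is not reported as a forbidden character.

```python
# Tag files this profile allows under metadata/, relative to the bag base directory.
REQUIRED = {"metadata/dataset.xml", "metadata/files.xml"}
OPTIONAL = {
    "metadata/amd.xml",
    "metadata/emd.xml",
    "metadata/license.txt",
    "metadata/provenance.xml",
    "metadata/depositor-info/agreements.xml",
    "metadata/depositor-info/depositor-agreement.pdf",
    "metadata/depositor-info/depositor-agreement.txt",
    "metadata/depositor-info/message-from-depositor.txt",
    "metadata/original/dataset.xml",
    "metadata/original/files.xml",
}
# "/" is left out: this sketch assumes it only occurs as the path separator.
FORBIDDEN = set(':*?"<>|;#')


def check_structure(tag_files, payload_files):
    """Return violations of the structural rules for the given relative paths."""
    errors = []
    present = {p for p in tag_files if p.startswith("metadata/")}
    errors += [f"missing {p}" for p in sorted(REQUIRED - present)]
    errors += [f"unexpected {p}" for p in sorted(present - REQUIRED - OPTIONAL)]
    agreement = "metadata/depositor-info/depositor-agreement"
    if {agreement + ".pdf", agreement + ".txt"} <= present:
        errors.append("both depositor-agreement.pdf and depositor-agreement.txt present")
    for path in payload_files:
        bad = sorted(FORBIDDEN & set(path))
        if bad:
            errors.append(f"{path} contains forbidden character(s): {' '.join(bad)}")
    return errors
```

The sketch only looks at files; an unexpected empty directory under `metadata` would need a separate check.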
2.7 `original-filepaths.txt`
1. A DANS bag MAY contain a file `original-filepaths.txt` in the root of the bag in UTF-8 encoding.
2. It MUST be formatted as `<physical-bag-relative-path><whitespace><original-bag-relative-path>`. `<physical-bag-relative-path>` MUST NOT contain whitespace. `<physical-bag-relative-path>` MUST correspond one-to-one to an existing payload file. `<original-bag-relative-path>` MUST correspond one-to-one to a `filepath` attribute in the `file` elements in the `files.xml`.
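Since the physical path cannot contain whitespace, each line of `original-filepaths.txt` can be split at the first run of whitespace, leaving any further whitespace to the original path. A minimal sketch, with hypothetical function names and with one mapping per line assumed:

```python
def parse_original_filepaths(text):
    """Return a dict mapping each physical path to its original path."""
    mapping = {}
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        # Split at the first whitespace run only; the original path may contain spaces.
        parts = line.split(None, 1)
        if len(parts) != 2:
            raise ValueError(f"line {number}: expected two paths")
        physical, original = parts
        if physical in mapping:
            raise ValueError(f"line {number}: duplicate physical path {physical}")
        mapping[physical] = original
    return mapping


def check_mapping(mapping, payload_files, files_xml_paths):
    """Return violations of the one-to-one requirements in rule 2.7.2."""
    errors = []
    if set(mapping) != set(payload_files):
        errors.append("physical paths do not match the payload files one-to-one")
    originals = list(mapping.values())
    if len(set(originals)) != len(originals) or set(originals) != set(files_xml_paths):
        errors.append("original paths do not match the filepath attributes one-to-one")
    return errors
```

Here `payload_files` is the list of payload paths found in the bag and `files_xml_paths` the list of `filepath` attribute values read from `metadata/files.xml`.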
3 Metadata requirements
3.1 `metadata/dataset.xml`
1. The file `metadata/dataset.xml` MUST adhere to DANS dataset metadata schema.
2. The file `metadata/dataset.xml` MAY have one `dcterms:license` element as a child of the `dcmiMetadata` element. It MUST NOT have more than one such element. If present, and if it contains an attribute `xsi:type="dcterms:URI"`, then the element text MUST be one of the URIs of approved licenses.
3. (a) (AIP) The file `metadata/dataset.xml` MUST contain at least one `dcterms:identifier` element with an attribute `xsi:type="id-type:URN"`. The text of this element MUST be a syntactically valid urn:NBN identifier which SHOULD be resolvable. (b) The file `metadata/dataset.xml` MAY contain one or more `dcterms:identifier` elements with an attribute `xsi:type="id-type:DOI"`. The text of this element MUST be a syntactically valid DOI identifier which SHOULD be resolvable.
4. Any `dcx-dai:DAI` elements MUST contain a syntactically valid Digital Author Identifier (DAI) with a valid check digit. This DAI SHOULD refer to an existing author. The DAI MAY be provided as an `info:eu-repo/dai/nl/` URI.
5. Any `gml:posList` elements nested in `gml:Polygon` elements:
   - MUST have an even number of values.
   - MUST have at least three different pairs of values, each of which describes one point.
   - MUST start and end with the same pair of values.
6. If a `gml:MultiSurface` is used, all nested `gml:Polygon` elements MUST have the same `srsName` attribute.
7. Any `gml:Point`, `gml:lowerCorner` and `gml:upperCorner` elements MUST have at least two values. These values MUST be numeric. In case of RD (http://www.opengis.net/def/crs/EPSG/0/28992) the values MUST be within the valid range.
8. Any `dc:identifier` or `dcterms:identifier` element with `xsi:type="id-type:ARCHIS-ZAAK-IDENTIFICATIE"` MUST have a value of 10 or fewer characters.
9. All URLs used in `metadata/dataset.xml` MUST be valid URLs with protocol http or https.
10. The file `metadata/dataset.xml` MUST have a rights holder, either in `<dcterms:rightsHolder>` or as an author with role `RightsHolder`.
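The three conditions on polygon coordinate lists in rule 3.1.5 can be checked on the element text alone. The sketch below uses a hypothetical function name, splits the text on whitespace, and compares points textually (so `1` and `1.0` count as different values), which is a simplifying assumption.

```python
def check_pos_list(text):
    """Return violations of rule 3.1.5 for the text of a polygon coordinate list."""
    values = text.split()
    if len(values) % 2 != 0:
        return ["odd number of values"]
    # Pair up consecutive values: each pair describes one point.
    points = list(zip(values[0::2], values[1::2]))
    errors = []
    if len(set(points)) < 3:
        errors.append("fewer than three different points")
    if points and points[0] != points[-1]:
        errors.append("first and last point differ")
    return errors
```

For example, `check_pos_list("0 0 1 0 1 1 0 0")` returns an empty list: the ring has three different points and ends where it starts.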
3.2 `metadata/files.xml`
1. If the file `metadata/files.xml` declares the DANS bag file metadata schema namespace then it MUST adhere to that schema.
2. The document element MUST be `files`.
3. The document element MUST contain zero or more `file` elements, and MUST NOT contain other elements.
4. Each `file` element MUST have a `filepath` attribute which contains the bag local path to the payload file described. When an `original-filepaths.txt` exists, rule 2.7.2 applies. Directories and non-payload files MUST NOT be described by a `file` element.
5. There MUST NOT be more than one `file` element corresponding to a payload file and every payload file MUST be described by a `file` element.
6. Each `file` element MUST contain at least one `dcterms:format` element containing the Media Type of the file described.
7. Each `file` element MAY contain any number of other DC or DCTERMS elements describing the file. It MUST NOT contain elements from other namespaces, unless permitted by the DANS bag file metadata schema.
8. Each `file` element MAY contain either one `dcterms:accessRights` (DEPRECATED) or `files:accessibleToRights` and/or `files:visibleToRights` (RECOMMENDED). The values of these elements are limited to ANONYMOUS, RESTRICTED_REQUEST and NONE. It is recommended to make files visible to all by setting `files:visibleToRights` to ANONYMOUS.
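Rules 3.2.2 to 3.2.6 amount to a cross-check between the `file` elements and the payload files. The following sketch uses Python's standard `xml.etree.ElementTree`; the function names are hypothetical, no schema validation is done, and the case where `original-filepaths.txt` exists (rule 2.7.2) is not handled.

```python
import xml.etree.ElementTree as ET

DCTERMS = "http://purl.org/dc/terms/"


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def check_files_xml(xml_text, payload_files):
    """Return violations of rules 3.2.2 to 3.2.6 for the given files.xml text."""
    root = ET.fromstring(xml_text)
    errors = []
    if local_name(root.tag) != "files":
        errors.append("document element is not 'files'")
    described = []
    for child in root:
        if local_name(child.tag) != "file":
            errors.append(f"unexpected element {local_name(child.tag)}")
            continue
        path = child.get("filepath")
        if path is None:
            errors.append("file element without filepath attribute")
            continue
        described.append(path)
        if child.find(f"{{{DCTERMS}}}format") is None:
            errors.append(f"{path} has no dcterms:format")
    payload = set(payload_files)
    for path in sorted(set(described)):
        if described.count(path) > 1:
            errors.append(f"{path} is described more than once")
        if path not in payload:
            errors.append(f"{path} is not a payload file")
    for path in sorted(payload - set(described)):
        errors.append(f"{path} is not described by a file element")
    return errors
```

Element names are compared by local name so that the sketch accepts a `files.xml` both with and without the schema namespace declared, in line with rule 3.2.1.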
3.3 `metadata/depositor-info/agreements.xml`
1. (AIP) The file `metadata/depositor-info/agreements.xml` MUST adhere to DANS deposit agreement metadata.
3.4 `metadata/depositor-info/message-from-depositor.txt`
1. The file `metadata/depositor-info/message-from-depositor.txt` MUST be a plain text file using the UTF-8 character encoding. It SHOULD contain any remarks that the depositor wants to convey to the data manager who is to curate this deposit.
4 Bag sequence requirements
1. (AIP) If `bag-info.txt` contains the element `Is-Version-Of`, the UUID in its value MUST be the bag-id of an existing, archived bag. This bag SHOULD be the base revision for the dataset that is to be updated by the deposit.
2. (AIP) The bags of a bag-sequence MUST be stored in the same bag-store.
3. (AIP) The `EASY-User-Account` in `bag-info.txt` MUST be identical in all the bags of a bag-sequence.
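In-sequence-context validation needs access to the bag-store. The sketch below stands in for that access with a plain dictionary, which is purely an assumption for illustration: `store` maps a bag-id (UUID string) to the parsed `bag-info.txt` entries of an archived bag, and the function name `check_sequence` is hypothetical. It compares the new bag only with the bag it references, on the assumption that earlier bags in the sequence were checked when they were archived.

```python
def check_sequence(new_bag_info, store):
    """Return violations of rules 4.1 and 4.3 for a bag about to be archived.

    new_bag_info: dict with the bag-info.txt entries of the new bag.
    store: dict mapping bag-id to the bag-info.txt entries of archived bags.
    """
    errors = []
    reference = new_bag_info.get("Is-Version-Of")
    if reference is None:
        # Not an update: the bag starts a new bag-sequence.
        return errors
    prefix = "urn:uuid:"
    bag_id = reference[len(prefix):] if reference.startswith(prefix) else reference
    base = store.get(bag_id)
    if base is None:
        errors.append(f"Is-Version-Of refers to unknown bag {bag_id}")
    elif base.get("EASY-User-Account") != new_bag_info.get("EASY-User-Account"):
        errors.append("EASY-User-Account differs from the referenced bag")
    return errors
```

Looking the referenced bag up in the single store that will receive the new bag is also what keeps a bag-sequence within one bag-store (rule 4.2) in this sketch.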
References
- AIP - Archival Information Package
- bag-store - storage for immutable, virtually-valid bags
- BagIt - packaging format on which the DANS Data Vault is based
- DANS administrative metadata - schema for administrative metadata
- DANS bag file metadata schema - schema for content file specific metadata
- DANS dataset metadata schema - schema for dataset level metadata
- DANS DCTERMS extension: Digital Author Identifier - extensions to accommodate Digital Author Identifiers
- DANS DCTERMS extension: Geography Markup Language - extensions to accommodate Geography Markup Language
- DANS deposit agreement metadata - schema for metadata about the agreement between DANS and the depositor
- DC - Dublin Core metadata elements
- DCTERMS - metadata schema from which elements are used in several DANS metadata schemas
- DOI - Digital Object Identifier
- EASY administrative metadata schema - legacy schema for administrative metadata
- EASY administrative workflow metadata schema - legacy helper schema for EASY administrative metadata schema
- EASY metadata schema - legacy schema for dataset level metadata
- GML 3.1.1 Simplified Features profile Levels 0 and 1 - Profile of Geography Mark-up Language
- ISO 8601 - standard for formatting dates and times
- Media Type - identifiers for file formats
- OAIS Reference Model - reference model for long term preservation archives
- RFC 2119 - specification for use of requirement level key words
- SIP - Submission Information Package
- URIs of approved licenses - licenses that may be used for datasets deposited in the DANS archive
- urn:uuid - URN scheme for UUIDs
- urn:NBN - URN scheme for National Bibliographic Numbers
- UUID - universally unique identifiers
- XML Schema - constraint language for specifying types of XML documents