dd-ingest-flow
Ingests DANS deposit directories into Dataverse.
SYNOPSIS
# Start server
dd-ingest-flow { server | check }
# Client
ingest-flow-*
For the ingest-flow-* commands, see dans-datastation-tools.
DESCRIPTION
Summary
The dd-ingest-flow service imports deposit directories into Dataverse. If successful, this results in a new dataset in Dataverse or in a new version of an existing dataset. The input deposit directories must be located in a directory on local disk storage known as an ingest area.
Ingest areas
An ingest area is a directory on local disk storage that is used by the service to receive deposits. It contains the following subdirectories:
- inbox - the directory under which all input deposits must be located;
- outbox - the directory to which processed deposits are moved (if successful, to the subdirectory processed, otherwise to rejected or failed).
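For illustration, the layout of a single ingest area could look like this (the base directory name is hypothetical; the subdirectories are the ones described above):

ingest-area/
├── inbox/             # deposits waiting to be processed
└── outbox/
    ├── processed/     # successfully ingested deposits
    ├── rejected/      # deposits rejected as invalid
    └── failed/        # deposits that caused a processing error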
The service supports three ingest areas:
- import - for bulk import of deposits, triggered by a data manager;
- migration - for bulk import of datasets migrated from EASY;
- auto-ingest - for continuous import of deposits offered through a deposit service, such as dd-sword2.
Processing of a deposit
Order of deposit processing
A deposit directory represents one dataset version. The version history of a dataset is represented by a sequence of deposit directories. When enqueuing deposits, the service first orders them by the timestamp in the Created element of the contained bag's bag-info.txt file.
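As an illustration, the Created element in bag-info.txt could look like this (the exact timestamp format and the other fields shown are only examples of common bag metadata):

Created: 2022-03-01T10:15:30.000+01:00
Bag-Size: 1.5 MB
Payload-Oxum: 1572864.3

Ordering by this timestamp ensures that earlier versions of a dataset are enqueued before later ones.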
Processing steps
The processing of a deposit consists of the following steps:
- Check that the deposit is a valid deposit directory (see the example layout after this list).
- Check that the bag in the deposit is a valid v1 DANS bag.
- Map the dataset level metadata to the metadata fields expected in the target Dataverse.
- If the deposit:
  - represents the first version of a dataset: create a new dataset draft;
  - represents an update to an existing dataset: create a draft for a new version.
- Publish the new dataset version.
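To illustrate the first step, a sketch of what a deposit directory could look like on disk (the exact required files are defined by the DANS deposit directory and bag specifications; the names shown here are examples):

<deposit-uuid>/               # deposit directory, typically named with a UUID
├── deposit.properties        # administrative properties of the deposit
└── example-bag/              # the bag containing the dataset version
    ├── bagit.txt
    ├── bag-info.txt          # contains the Created timestamp used for ordering
    ├── manifest-sha1.txt
    ├── data/                 # the payload files
    └── metadata/             # the dataset and file metadata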
Update-deposit
When receiving a deposit that specifies a new version for an existing dataset (an update-deposit), the assumption is that the bag contains the metadata and file data that must be in the new version. This means:
- The metadata specified completely overwrites the metadata in the latest version. So, even if the client needs to change only one word, it must send all the existing metadata with only that particular word changed. Any metadata left out will be deleted in the new version.
- The files will replace the files in the latest version. So the files that are in the deposit are the ones that will be in the new version. If a file is to be deleted from the new version, it should simply be left out of the deposit. If a file is to remain unchanged in the new version, an exact copy of the current file must be sent.
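For illustration (hypothetical file names), assume the latest published version contains a.txt, b.txt and c.txt, and the update-deposit's data/ folder contains:

a.txt   (exact copy of the published file: remains unchanged)
b.txt   (modified copy: replaces the published file)
d.txt   (new file: added)

The new version will then contain a.txt, the modified b.txt and d.txt; c.txt is deleted because it was left out of the deposit.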
File path is key
The local file path (in Dataverse terms: directoryLabel + name) is used as the key to determine which file in the latest published version, if any, is being targeted.
For example, to replace a published file with label foo.txt and directoryLabel my/special/folder, the bag must contain the new version at data/my/special/folder/foo.txt. (Note that the directoryLabel is the path relative to the bag's data/ folder.)
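Shown as a bag layout (the bag name is hypothetical; the path is taken from the example above):

example-bag/
└── data/
    └── my/
        └── special/
            └── folder/
                └── foo.txt   # directoryLabel: my/special/folder, name: foo.txt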
Mapping to Dataverse dataset
The mapping rules are documented in the spreadsheet DD Ingest Flow Mapping Rules. Access to the Google spreadsheet is granted on request to customers of DANS.
The spreadsheet includes rules for:
- dataset level metadata
- dataset terms
- file level metadata and attributes (including setting an embargo)
ARGUMENTS
Server
positional arguments:
{server,check} available commands
named arguments:
-h, --help show this help message and exit
-v, --version show the application version and exit
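For example, to start the service in the foreground with the default configuration file (assuming the usual Dropwizard convention of passing the configuration file after the server command; whether this is needed on an installed system depends on how the start script or service unit is set up):

dd-ingest-flow server /etc/opt/dans.knaw.nl/dd-ingest-flow/config.yml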
Client
The service has a RESTful API. Commands to manage the service are available in dans-datastation-tools. These commands have names starting with ingest-flow-.
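A hypothetical invocation could look as follows; the actual command names and options are defined by dans-datastation-tools and should be looked up there:

ingest-flow-start-import /path/to/import/inbox/my-batch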
INSTALLATION AND CONFIGURATION
Currently this project is built as an RPM package for RHEL7/CentOS7 and later. The RPM will install the binaries to /opt/dans.knaw.nl/dd-ingest-flow and the configuration files to /etc/opt/dans.knaw.nl/dd-ingest-flow. The configuration options are documented by comments in the default configuration file config.yml.
To install the module on systems that do not support RPM, you can copy and unarchive the tarball to the target host. You will have to take care of placing the files in the correct locations for your system yourself. For instructions on building the tarball, see next section.
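A minimal sketch of such a manual installation (the tarball name and target directory are only examples; choose locations appropriate for your system):

tar -xzf dd-ingest-flow-<version>.tar.gz
cp -r dd-ingest-flow-<version>/* /opt/dans.knaw.nl/dd-ingest-flow/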
BUILDING FROM SOURCE
Prerequisites:
- Java 11 or higher
- Maven 3.3.3 or higher
- RPM
Steps:
git clone https://github.com/DANS-KNAW/dd-ingest-flow.git
cd dd-ingest-flow
mvn clean install
If the rpm executable is found at /usr/local/bin/rpm, the build profile that includes the RPM packaging will be activated. If rpm is available, but at a different path, then activate the profile by using Maven's -P switch: mvn -Prpm install.
Alternatively, to build the tarball execute:
mvn clean install assembly:single