Sharing data pipelines: Why sharing data may not be enough, and what to do about it

Käthner, David

Conference Object

Abstract / Description

New research challenges and low-cost technological solutions motivate researchers to record behavior multivariately and at high temporal resolution. Making such data accessible and usable is more complicated than it may seem. Methods such as ECG, EEG, and eye tracking can produce very large amounts of data in a short time. In addition, context data that help explain the observed behavior must be recorded as well. For example, in a field study using an instrumented research vehicle, the position of the vehicle and the distance to the vehicle in front could serve as context data. To make this multitude of data analyzable, the data must be cleaned and fused in data pipelines. Cleaning happens in multiple stages and requires decisions that directly affect patterns in the data. Time series data are often up- or downsampled, which can alter characteristics of the signals of interest. Sharing the data pipeline alongside an uncleaned version of the data should therefore be the default when publishing research results. Data science has developed a number of solutions for storing and documenting data and data pipelines; their benefits and costs will be discussed in this talk. These approaches can be structured along three interdependent dimensions: data storage, data processing, and the competencies required of developers and users of data pipelines. Data from empirical studies can be very challenging to store, process, and document. Solutions to these issues exist, but they require training that has yet to be implemented in the typical psychology curriculum.
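
The abstract's point that resampling can alter signal characteristics can be illustrated with a minimal sketch. The synthetic signal, the function names (`make_signal`, `decimate_naive`, `downsample_mean`), and the downsampling factor of 10 are assumptions chosen for illustration; none of them is taken from the talk. The sketch downsamples a 1000 Hz trace containing a brief spike in two different ways and compares the resulting peak amplitudes.

```python
import math


def make_signal(n_samples=1000, rate_hz=1000.0, spike_index=503):
    """Synthetic 1-second trace: a slow 2 Hz sine plus a 3-sample spike.

    All values are illustrative assumptions, not data from the talk.
    """
    signal = [0.1 * math.sin(2 * math.pi * 2.0 * i / rate_hz)
              for i in range(n_samples)]
    # Add a short triangular spike centred on spike_index.
    for offset, height in zip((-1, 0, 1), (0.5, 1.0, 0.5)):
        signal[spike_index + offset] += height
    return signal


def decimate_naive(signal, factor):
    """Keep every `factor`-th sample, with no filtering beforehand."""
    return signal[::factor]


def downsample_mean(signal, factor):
    """Average non-overlapping windows of `factor` samples."""
    return [sum(signal[i:i + factor]) / factor
            for i in range(0, len(signal) - factor + 1, factor)]


raw = make_signal()
naive = decimate_naive(raw, 10)      # 1000 Hz -> 100 Hz
binned = downsample_mean(raw, 10)    # 1000 Hz -> 100 Hz

print(f"peak in raw signal:        {max(raw):.3f}")
print(f"peak after naive decimate: {max(naive):.3f}")
print(f"peak after window mean:    {max(binned):.3f}")
```

Under these assumptions, naive decimation drops the spike entirely, while window averaging keeps it at a fraction of its original height. Which method was used, and with which factor, is the kind of pipeline decision that a reader cannot reconstruct from the cleaned data alone.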

Keyword(s)

data pipeline; data science; data processing; time series data; multivariate data; data fusion

Persistent Identifier

https://doi.org/10.23668/psycharchives.4479

Date of first publication

2020-12-07

Is part of

CSPD 2020, online

Publisher

ZPID (Leibniz Institute for Psychology)

Citation

Käthner, D. (2020). Sharing data pipelines: Why sharing data may not be enough, and what to do about it. ZPID (Leibniz Institute for Psychology). https://doi.org/10.23668/PSYCHARCHIVES.4479

CSPD2020_SharingDataPipelines_InfrastructurePerspective_DavidKaethner.pdf

Adobe PDF - 773.61KB

MD5 : da4bcf34a5ca7d942fd0430b14c3156f

Sharing Level 0 (Public Use) CC-BY-SA 4.0

There are no other versions of this object.
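
The MD5 value listed in the file record above can be used to check that a downloaded copy of the PDF is intact. The following is a minimal sketch using only the Python standard library; the helper names (`md5_of_file`, `verify_download`) and the assumption that the file sits in the current working directory are illustrative and not part of the repository's tooling.

```python
import hashlib
from pathlib import Path

# Checksum and file name as shown in the file record.
EXPECTED_MD5 = "da4bcf34a5ca7d942fd0430b14c3156f"
PDF_NAME = "CSPD2020_SharingDataPipelines_InfrastructurePerspective_DavidKaethner.pdf"


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_MD5):
    """True if the file's MD5 digest matches the expected value."""
    return md5_of_file(path) == expected.lower()


if __name__ == "__main__":
    local_copy = Path(PDF_NAME)  # assumed download location
    if local_copy.exists():
        print("checksum OK" if verify_download(local_copy) else "checksum mismatch")
    else:
        print(f"{PDF_NAME} not found in the current directory")
```

MD5 is adequate for detecting accidental corruption during transfer, which is the purpose assumed here; it is not a safeguard against deliberate tampering.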

PsychArchives acquisition timestamp

2021-01-18T09:33:03Z
Made available on

2021-01-18T09:33:03Z
Review status

unknown
Persistent Identifier

https://hdl.handle.net/20.500.12034/4058
Language of content

eng
Is related to

https://www.conference-service.com/CSPD2020/xpage.html?xpage=244&lang=en
Dewey Decimal Classification number(s)

150
DRO type

conferenceObject
Visible tag(s)

ZPID Conferences and Workshops