Concatenating Machine Learning Applications for Analyzing Digital Behavioral Data (Poster)

Sust, Larissa; Stachl, Clemens; Schoedel, Ramona

Conference Object

Author(s) / Creator(s)

Sust, Larissa

Stachl, Clemens

Schoedel, Ramona

Abstract / Description

Machine learning (ML) has become a popular tool for modeling diverse psychological phenomena from digital behavioral data, such as footprints left online (e.g., likes on social media platforms) or sensor data (e.g., phone logs from smartphones). ML is most commonly used to accommodate the high dimensionality of digital records (i.e., data in which the number of variables is close to or larger than the number of observations) in supervised predictions of psychological outcomes. However, ML also plays a role in the preceding preprocessing steps, when behavioral variables are extracted from raw digital data that often contain highly abstract information that is not meaningful per se. In this talk, I will demonstrate how ML techniques can be integrated at different steps of the modeling process, from preprocessing to outcome prediction to interpretation. As an example, I will consider inferring personality traits from a natural music-listening dataset (N = 330) collected via the PhoneStudy smartphone sensing app. First, in the processing pipeline, we enriched the raw time-stamped song records in our smartphone-sensing data with song-level data, such as audio features and song lyrics, from external online sources. Second, for the lyrics in particular, we combined three text-mining approaches (a sentiment lexicon and the unsupervised ML methods of topic modeling and word embeddings) to represent lyrical content as numeric variables. The extracted variables ranged from habitual aspects of music consumption (e.g., average daily duration) to song preferences, represented by melodic attributes (e.g., melodic key) and song lyrics (e.g., love themes). Third, these music-listening features served to predict self-reported Big Five scores using supervised ML algorithms, and fourth, interpretable ML approaches helped us gain insights into the resulting associations.
For each step of this processing pipeline, I will present exemplary methodological details and findings to evaluate the advantages and disadvantages of using ML techniques at different analysis steps. In doing so, I will put special emphasis on how the concatenation of different ML techniques aligns with Open Science practices, because the numerous specifications necessary at each step may appear difficult to preregister or to report transparently. Hence, I will discuss whether the level of granularity can be balanced between preregistration and (supplemental) method descriptions and why open data and materials (in particular, open code) are especially crucial.
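
The preprocessing idea described in the abstract, turning raw time-stamped song records plus song-level lyrics into numeric per-person variables, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the record layout, the field names, the helper functions, and the four-word toy lexicon are all hypothetical assumptions chosen for illustration, and only the lexicon-based part of the lyric analysis is shown (topic modeling and word embeddings are omitted).

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical toy sentiment lexicon (word -> polarity); NOT the lexicon used in the study.
LEXICON = {"love": 1.0, "happy": 1.0, "sad": -1.0, "lonely": -1.0}


def lyric_sentiment(lyrics: str) -> float:
    """Mean polarity of the lexicon words found in the lyrics (0.0 if none match)."""
    hits = [LEXICON[w] for w in lyrics.lower().split() if w in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0


def user_features(records, songs):
    """Aggregate time-stamped listening records into one feature row per user.

    records: iterable of (user_id, ISO timestamp, song_id, seconds_played)
    songs:   dict song_id -> {"lyrics": str}; stands in for song-level data
             that would be retrieved from external sources (assumed layout).
    """
    seconds_per_day = defaultdict(lambda: defaultdict(float))
    sentiments = defaultdict(list)
    for user, ts, song_id, seconds in records:
        day = datetime.fromisoformat(ts).date()
        seconds_per_day[user][day] += seconds
        song = songs.get(song_id)
        if song is not None:  # songs without external data are skipped
            sentiments[user].append(lyric_sentiment(song["lyrics"]))

    features = {}
    for user, days in seconds_per_day.items():
        s = sentiments[user]
        features[user] = {
            # Habitual aspect: average listening time per active day, in minutes.
            "avg_daily_minutes": sum(days.values()) / len(days) / 60,
            # Preference aspect: mean lexicon sentiment over all played songs.
            "mean_lyric_sentiment": sum(s) / len(s) if s else 0.0,
        }
    return features


# Made-up example data for a single user.
records = [
    ("u1", "2023-01-01T08:00:00", "s1", 180),
    ("u1", "2023-01-01T09:00:00", "s2", 240),
    ("u1", "2023-01-02T08:00:00", "s1", 180),
]
songs = {
    "s1": {"lyrics": "love love happy"},
    "s2": {"lyrics": "sad and lonely"},
}
features = user_features(records, songs)
```

Each resulting row could then serve as input to a supervised model. Even in this small sketch, several specifications (how a day is defined, how unmatched songs are handled, which lexicon is used) have to be fixed, which is the kind of detail the abstract argues is hard to preregister and is best documented through open code.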

Persistent Identifier

https://doi.org/10.23668/psycharchives.13021

Date of first publication

2023-07-24

Is part of

Big Data & Research Syntheses 2023, Frankfurt, Germany

Publisher

ZPID (Leibniz Institute for Psychology)

Sust_Poster.pdf

Adobe PDF - 334.7KB

MD5: a07c3896c2e3056a4b9a8731a3b7ca4c

Sharing Level 0 (Public Use) CC-BY-SA 4.0

Is related to

Conference Object
Concatenating Machine Learning Applications for Analyzing Digital Behavioral Data (Slides)

Sust, Larissa & Stachl, Clemens & Schoedel, Ramona, 2023-07-18, ZPID (Leibniz Institute for Psychology)

There are no other versions of this object.

PsychArchives acquisition timestamp

2023-07-24T10:49:44Z
Made available on

2023-07-24T10:49:44Z
Publication status

unknown
Review status

unknown
External description on another website

http://www.ressyn-bigdata.org
Persistent Identifier

https://hdl.handle.net/20.500.12034/8520
Language of content

eng
Is related to

https://hdl.handle.net/20.500.12034/8507
Dewey Decimal Classification number(s)

150
DRO type

conferenceObject
Visible tag(s)

Smartphone Sensing Panel Study

ZPID Conferences and Workshops