Conference Object

Concatenating Machine Learning Applications for Analyzing Digital Behavioral Data (Poster)

Author(s) / Creator(s)

Sust, Larissa
Stachl, Clemens
Schoedel, Ramona

Abstract / Description

Machine learning (ML) has become a popular tool for modeling diverse psychological phenomena from digital behavioral data, such as footprints left online (e.g., likes on social media platforms) or sensor data (e.g., phone logs from smartphones). ML is most commonly used to accommodate the high dimensionality of digital records (i.e., data in which the number of variables is close to or larger than the number of observations) in supervised predictions of psychological outcomes. However, ML also plays a role in the preceding preprocessing steps, when behavioral variables are extracted from raw digital data that often contain highly abstract information that is not meaningful per se. In this talk, I will demonstrate the integration of ML techniques at different steps of the modeling process, from preprocessing to outcome prediction to interpretation, using the example of inferring personality traits from a natural music-listening dataset (N = 330) collected via the PhoneStudy smartphone sensing app. First, in the processing pipeline, we enriched the raw time-stamped song records in our smartphone-sensing data with song-level data, such as audio features and song lyrics, from external online sources. Second, for the lyrics in particular, we combined three text-mining approaches (a sentiment lexicon and the unsupervised ML methods of topic modeling and word embeddings) to represent lyrical content as numeric variables. The extracted variables ranged from habitual aspects of music consumption (e.g., average daily duration) to song preferences, represented by melodic attributes (e.g., melodic key) and song lyrics (e.g., love themes). Third, these music-listening features served to predict self-reported Big Five scores using supervised ML algorithms. Fourth, interpretable ML approaches helped us gain insights into the resulting associations.
For each step of this processing pipeline, I will present exemplary methodological details and findings to evaluate the advantages and disadvantages of using ML techniques at different analysis steps. I will put special emphasis on how the concatenation of different ML techniques aligns with Open Science practices, because the numerous specifications necessary at each step may appear difficult to preregister or to report transparently. Hence, I will discuss whether the level of granularity can be balanced between preregistration and (supplemental) method descriptions and why open data and materials (in particular, open code) are especially crucial.
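
The feature-extraction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the records, the four-word lexicon, and the names `records`, `LEXICON`, `lyric_sentiment`, and `extract_features` are all hypothetical and chosen only for illustration. The sketch aggregates time-stamped song records into one habitual variable (average daily listening duration) and one lexicon-based lyric variable (mean sentiment) per person.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical time-stamped records: (user, start time, seconds played, lyrics).
records = [
    ("u1", "2023-01-01T08:00:00", 200, "love love happy day"),
    ("u1", "2023-01-01T21:30:00", 100, "sad lonely night"),
    ("u1", "2023-01-02T09:15:00", 300, "happy love song"),
    ("u2", "2023-01-01T10:00:00", 600, "lonely sad sad road"),
]

# Toy sentiment lexicon (an assumption, not a published resource): word -> valence.
LEXICON = {"love": 1.0, "happy": 1.0, "sad": -1.0, "lonely": -1.0}


def lyric_sentiment(lyrics):
    """Mean valence of the lexicon words found in the lyrics; 0.0 if none match."""
    scores = [LEXICON[w] for w in lyrics.lower().split() if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


def extract_features(records):
    """Aggregate song-level records into person-level numeric variables."""
    seconds_per_day = defaultdict(lambda: defaultdict(float))
    sentiments = defaultdict(list)
    for user, start, seconds, lyrics in records:
        day = datetime.fromisoformat(start).date()
        seconds_per_day[user][day] += seconds
        sentiments[user].append(lyric_sentiment(lyrics))

    features = {}
    for user, days in seconds_per_day.items():
        features[user] = {
            # Habitual aspect: total listening time averaged over days with data.
            "avg_daily_seconds": sum(days.values()) / len(days),
            # Lyric preference: unweighted mean sentiment across played songs.
            "mean_lyric_sentiment": sum(sentiments[user]) / len(sentiments[user]),
        }
    return features


features = extract_features(records)
```

Even this toy version involves several analytic choices (averaging only over days with data, weighting songs equally rather than by play time, scoring unmatched lyrics as 0.0), which hints at why the specifications at each step are hard to preregister at full granularity and why open code matters.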

Persistent Identifier

https://hdl.handle.net/20.500.12034/8520
https://doi.org/10.23668/psycharchives.13021

Date of first publication

2023-07-24

Is part of

Big Data & Research Syntheses 2023, Frankfurt, Germany

Publisher

ZPID (Leibniz Institute for Psychology)

Citation

  • PsychArchives acquisition timestamp
    2023-07-24T10:49:44Z
  • Made available on
    2023-07-24T10:49:44Z
  • Publication status
    unknown
  • Review status
    unknown
  • External description on another website
    http://www.ressyn-bigdata.org
  • Persistent Identifier
    https://hdl.handle.net/20.500.12034/8520
  • Persistent Identifier
    https://doi.org/10.23668/psycharchives.13021
  • Language of content
    eng
  • Is related to
    https://hdl.handle.net/20.500.12034/8507
  • Dewey Decimal Classification number(s)
    150
  • DRO type
    conferenceObject
  • Visible tag(s)
    Smartphone Sensing Panel Study
  • Visible tag(s)
    ZPID Conferences and Workshops