Concatenating Machine Learning Applications for Analyzing Digital Behavioral Data (Poster)
Author(s) / Creator(s)
Sust, Larissa
Stachl, Clemens
Schoedel, Ramona
Abstract / Description
Machine learning (ML) has become a popular tool for modeling diverse psychological phenomena from digital behavioral data, such as footprints left online (e.g., likes on social media platforms) or sensor data (e.g., phone logs from smartphones). In this context, ML is most commonly used to accommodate the high dimensionality of digital records (i.e., data in which the number of variables is close to or larger than the number of observations) in supervised predictions of psychological outcomes. However, ML also plays a role in the preceding preprocessing steps, when behavioral variables are extracted from raw digital data that often contain highly abstract information that is not per se meaningful. In this talk, I will demonstrate the integration of ML techniques at different steps of the modeling process, from preprocessing to outcome prediction to interpretation. I will consider the example of inferring personality traits from a natural music-listening dataset (N = 330) collected via the PhoneStudy smartphone sensing app. First, we enriched the raw time-stamped song records in our smartphone-sensing data with song-level data, such as audio features and song lyrics, from external online sources. Second, for the lyrics in particular, we combined three text-mining approaches (a sentiment lexicon and the unsupervised ML methods of topic modeling and word embeddings) to represent lyrical content as numeric variables. The extracted variables ranged from habitual aspects of music consumption (e.g., average daily duration) to song preferences, represented by melodic attributes (e.g., melodic key) and song lyrics (e.g., love themes). Third, these music-listening features served to predict self-reported Big Five scores using supervised ML algorithms. Fourth, interpretable ML approaches helped us gain insights into the resulting associations.
For each step of this processing pipeline, I will present exemplary methodological details and findings to evaluate the advantages and disadvantages of using ML techniques at different analysis steps. In doing so, I will place special emphasis on how the concatenation of different ML techniques aligns with Open Science practices, because the numerous specifications required at each step may appear difficult to preregister or to report transparently. I will therefore discuss whether the level of granularity can be balanced between preregistration and (supplemental) method descriptions, and why open data and materials (in particular, open code) are especially important.
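As a rough illustration of the feature-extraction steps described in the abstract, the following minimal Python sketch scores song lyrics with a sentiment lexicon and aggregates the scores into person-level listening features. It is not the authors' pipeline: the lexicon entries, the function names (`lyric_sentiment`, `user_features`), the record layout `(user, song_id, seconds)`, and the duration weighting are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy lexicon (word -> valence); NOT the lexicon used in the study.
LEXICON = {"love": 1.0, "happy": 0.8, "sad": -0.7, "lonely": -0.6}


def lyric_sentiment(lyrics: str) -> float:
    """Mean valence of the lexicon words found in the lyrics (0.0 if none match).

    Tokenization is a plain whitespace split, so punctuation is not stripped.
    """
    scores = [LEXICON[w] for w in lyrics.lower().split() if w in LEXICON]
    return mean(scores) if scores else 0.0


def user_features(listening_log, song_lyrics):
    """Aggregate per-user features from (user, song_id, seconds) records.

    Returns, for each user, the total listening time and the lyric sentiment
    averaged over songs, weighted by seconds listened (an assumed weighting).
    """
    per_user = defaultdict(list)
    for user, song_id, seconds in listening_log:
        per_user[user].append((song_id, seconds))

    features = {}
    for user, plays in per_user.items():
        total = sum(seconds for _, seconds in plays)
        weighted = sum(
            lyric_sentiment(song_lyrics[song_id]) * seconds
            for song_id, seconds in plays
        )
        features[user] = {
            "total_seconds": total,
            "mean_lyric_sentiment": weighted / total if total else 0.0,
        }
    return features
```

Calling `user_features(log, lyrics)` with a list of listening records and a dictionary mapping song IDs to lyric strings yields one feature dictionary per user. Rows of this kind could then be passed to a supervised model, with topic-model or embedding variables added as further columns.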
Date of first publication
2023-07-24
Is part of
Big Data & Research Syntheses 2023, Frankfurt, Germany
Publisher
ZPID (Leibniz Institute for Psychology)
File
Sust_Poster.pdf (Adobe PDF, 334.7 KB; MD5: a07c3896c2e3056a4b9a8731a3b7ca4c)
PsychArchives acquisition timestamp
2023-07-24T10:49:44Z
Made available on
2023-07-24T10:49:44Z
Publication status
unknown
Review status
unknown
External description on another website
http://www.ressyn-bigdata.org
Persistent Identifier
https://hdl.handle.net/20.500.12034/8520
https://doi.org/10.23668/psycharchives.13021
Language of content
eng
Is related to
https://hdl.handle.net/20.500.12034/8507
Dewey Decimal Classification number(s)
150
DRO type
conferenceObject
Visible tag(s)
Smartphone Sensing Panel Study
ZPID Conferences and Workshops