Conference Object

Assessing the Correspondence of Results in Replication Studies

Author(s) / Creator(s)

Steiner, Peter M.
Wong, Vivian C.

Abstract / Description

Background: Reproducibility is a hallmark of science. Instead of relying on causal conclusions from a single experimental or non-experimental study, scientific knowledge is best achieved through careful replication of studies or meta-analysis of results from multiple studies. In replication studies, researchers assess whether the original study results are reproduced by looking at the direction and magnitude of effects, as well as the statistical significance patterns of results. Researchers may also conduct direct tests of the statistical difference between the effect estimates of the original and replication study. Steiner and Wong (2018) distinguish between two classes of measures for assessing correspondence in results. The first consists of distance-based measures, which estimate the difference between the original and replicated effect. The second class of metrics is what Steiner and Wong (2018) call “conclusion-based measures.” These approaches assess correspondence in study results by looking at the size, direction, and statistical significance patterns of results.

Objective/Research Question: This paper focuses on distance-based significance tests for assessing whether the effect estimates of the original and replication study actually replicate within the boundaries of statistical uncertainty. All distance-based tests use the difference in the effect estimates as the starting point. However, they differ with respect to the null and alternative hypotheses under investigation. In a comparative simulation study, we assess the properties of three different significance tests for probing the difference or equivalence of two effect estimates. The first test is the standard null hypothesis significance test, which we refer to as the difference test because the alternative hypothesis claims a difference in effects. Thus, the difference test rejects the null hypothesis of no difference if the p-value is sufficiently low. The second test is the less well-known equivalence test, in which the alternative hypothesis claims the equivalence of the two effects while the null hypothesis states a difference in effects (Tryon, 2001). This test protects against the common type II error of difference tests, that is, the failure to reject a false null hypothesis of no difference. However, an underpowered equivalence test might fail to reject the false null hypothesis of a difference in effects. We suggest the use of a combined difference and equivalence test, the correspondence test (Steiner & Wong, 2018; Tryon & Lewis, 2008). The correspondence test has four possible outcomes: equivalence, difference, trivial difference, and indeterminacy. Its advantage is that it indicates indeterminacy whenever the two studies lack sufficient power to demonstrate either a significant difference or equivalence. Many of the replication efforts undertaken so far would presumably obtain an indeterminate outcome from the correspondence test. Thus, the objective of the paper is to compare the three tests for assessing correspondence in replication results under different scenarios.

Research Questions: (1) In which scenarios do difference tests (fail to) perform well? (2) In which scenarios do equivalence tests show excellent or poor performance? (3) Does the combined correspondence test outperform the difference and equivalence tests? (4) What are the power requirements for the original and replication study to guarantee conclusive correspondence test results?

Method/Approach: First, we use statistical arguments to highlight the strengths and weaknesses of each test from a theoretical point of view. Second, we use a simulation study to compare the performance of the three distance-based tests under different scenarios. The scenarios are defined with respect to variations in (a) each study’s minimum detectable effect size (i.e., sample size and magnitude of the underlying true effect, which directly relates to the studies’ power) and (b) the difference in the true effects (including the equivalence of effects). True effect differences might result from effect heterogeneities across sites, populations, or settings, or from biases due to imperfect implementation of at least one of the studies.

Results/Findings: Theory and our simulations suggest that the difference test regularly fails to indicate a true difference in effects whenever one of the two studies is insufficiently powered; that is, the probability of a type II error is high. A similar issue occurs with the equivalence test, which fails to indicate equivalence (i.e., to reject the null hypothesis of a difference) if one of the studies is insufficiently powered. Importantly, failing to reject the null hypothesis in an equivalence test does not imply that the effects actually differ. Here, the correspondence test has a clear advantage: if equivalence cannot be established, the correspondence test is able to distinguish between insufficient power (indeterminacy) and a significant difference. However, if researchers want to avoid an indeterminate outcome from the correspondence test, both studies need to be sufficiently powered. The simulation results also indicate that both studies’ power must be considerably higher than what researchers usually plan for in a single study. This is because the comparison of two effect estimates from independent studies requires that both effects are estimated with high precision.

Conclusion & Implications: Theoretical considerations and the simulation results suggest that researchers interested in replication should use the correspondence test for assessing whether the effects of an original and replication study successfully reproduce or fail to reproduce. A major advantage of the correspondence test is that it indicates an indeterminate test outcome if the studies lack sufficient power to establish either a significant difference or equivalence. It also becomes clear that testing the correspondence of replication efforts requires highly powered studies.
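To make the logic of the three distance-based tests concrete, the sketch below classifies the difference between two independent effect estimates into the four correspondence outcomes. It is a minimal illustration, not the authors' implementation: it assumes approximately normally distributed effect estimates, and the equivalence margin, the alpha level, and the numbers in the example call are purely illustrative choices.

    # A minimal sketch of a distance-based correspondence test for two
    # independent effect estimates (not the authors' implementation).
    # Assumes approximately normal estimates; the equivalence margin
    # `delta` and the alpha level are illustrative, user-chosen values.
    import math
    from scipy.stats import norm

    def correspondence_test(b1, se1, b2, se2, delta, alpha=0.05):
        """Classify two effect estimates as 'difference', 'equivalence',
        'trivial difference', or 'indeterminacy'."""
        diff = b1 - b2
        se_diff = math.sqrt(se1**2 + se2**2)

        # Difference test: H0 of no difference vs. H1 of some difference
        # (two-sided z test on the difference in estimates).
        z = diff / se_diff
        p_diff = 2 * (1 - norm.cdf(abs(z)))
        sig_difference = p_diff < alpha

        # Equivalence test (two one-sided tests): H0 that the true
        # difference is at least delta in absolute value vs. H1 that it
        # lies inside (-delta, +delta).
        p_lower = 1 - norm.cdf((diff + delta) / se_diff)  # test against -delta
        p_upper = norm.cdf((diff - delta) / se_diff)      # test against +delta
        sig_equivalence = max(p_lower, p_upper) < alpha

        # Combine both tests into the four correspondence outcomes.
        if sig_difference and sig_equivalence:
            return "trivial difference"   # reliably nonzero, but within the margin
        if sig_equivalence:
            return "equivalence"
        if sig_difference:
            return "difference"
        return "indeterminacy"            # neither test is conclusive

    # Illustrative call: original effect 0.30 (SE 0.08), replication effect
    # 0.25 (SE 0.10), equivalence margin 0.20 (all numbers hypothetical).
    print(correspondence_test(0.30, 0.08, 0.25, 0.10, delta=0.20))

With these hypothetical numbers neither the difference test nor the equivalence test reaches significance, so the sketch returns "indeterminacy", which mirrors the abstract's point that comparing two effect estimates demands more power than is usually planned for a single study.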

Persistent Identifier

https://doi.org/10.23668/psycharchives.2394

Date of first publication

2019-03-13

Is part of

Open Science 2019, Trier, Germany

Publisher

ZPID (Leibniz Institute for Psychology Information)

Citation

Steiner, P. M., & Wong, V. C. (2019, March 13). Assessing the Correspondence of Results in Replication Studies. ZPID (Leibniz Institute for Psychology Information). https://doi.org/10.23668/psycharchives.2394
  • Author(s) / Creator(s)
    Steiner, Peter M.
  • Author(s) / Creator(s)
    Wong, Vivian C.
  • PsychArchives acquisition timestamp
    2019-04-01T15:38:54Z
  • Made available on
    2019-04-01T15:38:54Z
  • Date of first publication
    2019-03-13
  • Sponsorship
    Supported by NSF grant #2015‐0285‐00
    en_US
  • Citation
    Steiner, P. M., & Wong, V. C. (2019, March 13). Assessing the Correspondence of Results in Replication Studies. ZPID (Leibniz Institute for Psychology Information). https://doi.org/10.23668/psycharchives.2394
    en
  • Persistent Identifier
    https://hdl.handle.net/20.500.12034/2026
  • Persistent Identifier
    https://doi.org/10.23668/psycharchives.2394
  • Language of content
    eng
    en_US
  • Publisher
    ZPID (Leibniz Institute for Psychology Information)
    en_US
  • Is part of
    Open Science 2019, Trier, Germany
    en_US
  • Dewey Decimal Classification number(s)
    150
  • Title
    Assessing the Correspondence of Results in Replication Studies
    en_US
  • DRO type
    conferenceObject
    en_US
  • Visible tag(s)
    ZPID Conferences and Workshops