Accuracy in Parameter Estimation and Simulation Approaches for Sample Size Planning

Erin M. Buchanan

Harrisburg University

Power and Sample Size Planning

  • Sample Size Planning: New Tools and Innovations
    • Accuracy in Parameter Estimation and Simulation Approaches for Sample Size Planning, Erin M. Buchanan
    • Power Analyses for Interaction Effects in Observational Studies, David A. Baranger
    • Empowering Sample Size Justification with the Superpower R Package, Aaron Caldwell

A Blender Mix

  • Accuracy in Parameter Estimation and Simulation Approaches for Sample Size Planning
  • How we took a bunch of interesting ideas and mixed them together

Sample Size Planning

  • Sample size planning is often thought of as “point and click”
    • G*Power: https://www.psychologie.hhu.de/arbeitsgruppen/allgemeine-psychologie-und-arbeitspsychologie/gpower
    • https://jakewestfall.shinyapps.io/pangea/
    • https://pwrss.shinyapps.io/index/
    • https://designingexperiments.com/
  • For many analysis plans, sample size planning is technically a closed-form solution
  • There are many excellent R packages for sample size planning, such as pwr (see the sketch after this list)
  • So, why do we need new innovations for power?
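As an example of those closed-form tools, a one-call calculation with the pwr package (the effect size and targets here are arbitrary illustrations):

library(pwr)
pwr.t.test(d = 0.5, # expected standardized mean difference
  power = 0.80, # desired power
  sig.level = 0.05, # alpha
  type = "two.sample") # independent samples t-test
# n = 63.77 -> plan on 64 participants per group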

The Need

  • TOP (Transparency and Openness Promotion) movement + pre-registration + grants + registered reports = need for power analyses
  • Power analyses are just our best guesses and are likely wrong
  • Many-analysts studies show that one design does not map onto one correct analysis
  • The smallest effect size of interest may be unknown
  • Some research papers do not have one specific hypothesis (e.g., dataset-creation projects)
  • Once you leave the t-test behind, power becomes more complicated and is often simulation-based (a sketch follows this list)
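As a flavor of the simulation-based approach, a minimal power simulation for the same two-sample t-test (all values are arbitrary illustrations):

set.seed(1)
mean(replicate(2000, { # 2000 simulated experiments
  x <- rnorm(64) # control group, n = 64
  y <- rnorm(64, mean = 0.5) # treatment group, shifted by d = 0.5
  t.test(x, y)$p.value < .05 # was this experiment significant?
})) # proportion significant = estimated power, ~.80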

Our Use Case

  • Research studies that use many items to assess the parameter of interest
  • Research studies designed to collect data on many items and share the data
  • We should be careful not to assume all items are equal …
  • And move away from using item-level averages as parameters of interest

Combining Toolkits

  • Accuracy in Parameter Estimation (AIPE): finding the sample size that allows for “accurately measured” parameters (see the sketch after this list)
    • Determine a “sufficiently narrow” confidence interval around your parameter
    • Determine the sample size that should provide that CI
  • Bootstrapping (sort of) and Simulation
    • Take pilot data and simulate various sample sizes by bootstrapping your sample
    • Use this technique to find the sample size that yields a “sufficiently narrow” CI for items
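A minimal sketch of the AIPE idea for a single mean, assuming a normal approximation and a target full CI width (an illustration of the logic, not the package's method):

aipe_n_mean <- function(sd, target_width, conf = .95) {
  z <- qnorm(1 - (1 - conf) / 2) # critical value, e.g., 1.96
  ceiling((2 * z * sd / target_width)^2) # full CI width = 2 * z * sd / sqrt(n)
}
aipe_n_mean(sd = 50, target_width = 20) # e.g., RTs with SD = 50 ms, +/- 10 ms
# [1] 97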

Sequential Testing

  • Sequential testing: check whether the parameter of interest has reached the intended CI precision
    • After each participant
    • At regular intervals during data collection
  • Benefits:
    • Maximizes the usefulness of data collection
  • Cons:
    • Usually requires coding skills (a sketch follows below)
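A hedged sketch of such a check in R (the function and object names here are assumptions for illustration, not part of semanticprimeR):

check_stop <- function(data, score, items, cutoff, prop_needed = .90) {
  ses <- tapply(data[[score]], data[[items]], # SE of each item
    function(x) sd(x) / sqrt(length(x)))
  mean(ses <= cutoff) >= prop_needed # TRUE when enough items are precise
}
# e.g., after every batch of participants:
# if (check_stop(current_data, "RT", "Stimulus", cutoff$cutoff)) stop_collection()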

Proposed Method

Proposed Procedure for Powering Studies with Multiple Items

  1. Use representative pilot data.
  2. Calculate the standard error (SE) of each item in the pilot data. Use the 40th percentile of these item SEs as the cutoff and stopping rule.
  3. Create bootstrapped samples of your pilot data, starting with at least 20 participants and increasing to a maximum number of participants.
  4. Calculate the SE of each item in the bootstrapped data. From these values, calculate the percent of items below the cutoff from Step 2.
  5. Determine the sample sizes at which 80%, 85%, 90%, and 95% of items fall below the cutoff. Use the correction formula to adjust the proposed sample size for pilot sample size, power, and proportion of variability.
  6. Report all values. Designate one as the minimum sample size, the cutoff as the stopping rule for adaptive designs, and the largest as the maximum sample size.

Package

  • Upcoming package semanticprimeR as part of a larger project
  • devtools::install_github("SemanticPriming/semanticprimeR")
  • Functions for each step of the proposed process
  • Functionality for when you have pilot data and when you do not (i.e., you can simulate example multiple-item data)
  • As part of the manuscript and semanticprimeR package, we provide 12+ examples online
  • Psycholinguistics, social psychology, COVID-related research, and traditional cognitive psychology

Example: Step 1 (Pilot Sample)

  • You want to run a lexical decision project measuring response latencies for concrete and abstract words
  • You can use the English Lexicon Project as pilot data + previous publications of concreteness ratings
  • In these studies, we also have to factor in data loss!
    • Combined data include 27,031 real words, filtered down to 40 selected stimuli
    • Average sample size per word: 32.67 (SD = 0.53)
    • Pilot sample size: n = 33 (a sketch of preparing these objects follows below)
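A hypothetical sketch of building the pilot objects used in the next steps (elp_use and pilot_size_e are the names the later code expects; elp_raw and selected_stimuli are placeholders, not real package objects):

library(dplyr)
elp_use <- elp_raw %>% # hypothetical full ELP trial data
  filter(Stimulus %in% selected_stimuli) # keep the 40 selected words

pilot_size_e <- elp_use %>%
  count(Stimulus) %>% # trials per word, M = 32.67
  summarize(n = round(mean(n))) %>% # rounds to 33
  pull(n)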

Example: Step 2 (Calculate Cutoff)

library(semanticprimeR)
cutoff <- calculate_cutoff(population = elp_use, # pilot data or simulated data
  grouping_items = "Stimulus", # name of the item indicator column
  score = "RT", # name of the dependent variable column
  minimum = min(elp_use$RT), # minimum possible/found score
  maximum = max(elp_use$RT)) # maximum possible/found score

cutoff$se_items # all standard errors of items
 [1]  56.83131  58.59754  38.40305  69.22966  76.96831  51.80277  89.16515
 [8]  55.81059  36.93046  80.47134  42.17122  17.72957  39.32024  46.65783
[15]  72.07065 248.68735  93.89229  89.69502  46.02416  87.82424 140.39440
[22]  24.65804  45.83884  51.05279  36.09320  56.19962  79.21760  41.87754
[29]  59.16929  32.45934  62.30085  21.44458  30.91690  37.13134  55.69565
[36]  39.11986  66.73485  77.64671  34.97541 208.74359
cutoff$sd_items # standard deviation of the standard errors
[1] 45.19364
cutoff$cutoff # cutoff score (40th percentile of the item SEs)
     40% 
46.40436 
cutoff$prop_var # proportion of possible variance 
[1] 0.02466902

Example: Step 3 (Bootstrapped Samples)

samples <- bootstrap_samples(start = 20, # starting sample size
  stop = 100, # stopping sample size
  increase = 5, # step between bootstrapped sample sizes
  population = elp_use, # population or pilot data
  replace = TRUE, # bootstrap with replacement? 
  nsim = 500, # number of simulations to run
  grouping_items = "Stimulus") # item column label  

head(samples[[1]])
# A tibble: 6 × 6
# Groups:   Stimulus [1]
  Trial  Type Accuracy    RT Stimulus  Participant   
  <int> <int>    <int> <int> <chr>     <chr>         
1  1521     1        1   563 admirable participant629
2  2512     1        1   692 admirable participant63 
3  3078     1        1   781 admirable participant102
4  2354     1        1   635 admirable participant39 
5   634     1        1   463 admirable participant344
6  2274     1        1   729 admirable participant404

Example: Steps 4-5 (Calculate Proportion)

proportion_summary <- calculate_proportion(samples = samples, # samples list
  cutoff = cutoff$cutoff, # cutoff score from Step 2
  grouping_items = "Stimulus", # item column name
  score = "RT") # dependent variable column name 

head(proportion_summary)
# A tibble: 6 × 2
  sample_size percent_below
        <dbl>         <dbl>
1          20         0.35 
2          25         0.375
3          30         0.425
4          35         0.5  
5          40         0.575
6          45         0.7  
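Conceptually, each percent_below value is the share of item SEs under the cutoff at that simulated sample size; a sketch for a single bootstrapped dataset (assuming each list element holds one simulated sample):

one_sim <- samples[[1]] # one bootstrapped n = 20 dataset
ses <- tapply(one_sim$RT, one_sim$Stimulus, # SE per item
  function(x) sd(x) / sqrt(length(x)))
mean(ses <= cutoff$cutoff) # proportion of items below the cutoff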

Example: Step 6 (Apply Correction)

corrected_summary <- calculate_correction(
  proportion_summary = proportion_summary, # prop from above
  pilot_sample_size = pilot_size_e, # number of participants in the pilot data 
  proportion_variability = cutoff$prop_var, # proportion variance from cutoff scores
  power_levels = c(80, 85, 90, 95)) # what levels of power to calculate 

corrected_summary
# A tibble: 3 × 3
  percent_below sample_size corrected_sample_size
          <dbl>       <dbl>                 <dbl>
1          82.5          80                  74.1
2          90            90                  82.3
3          90            90                  82.3

Last Thoughts

  • Use case: studies with multiple items that intend to run item-level analyses
  • Simulate only what a participant is expected to do in the study
    • Large numbers of items may bias estimates
  • Could be combined with “traditional” power analysis (a sketch follows this list)
  • Provides “well-measured” data -> not a specific decision for a specific sample
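For instance, one way to combine the two is to take the larger of the two recommendations (a sketch under that assumption; the effect size is an arbitrary illustration):

library(pwr)
n_power <- ceiling(pwr.t.test(d = 0.4, power = .90)$n) # per-group n for the test
n_aipe <- 83 # corrected n from Step 6 above (82.3, rounded up)
max(n_power, n_aipe) # plan for whichever is larger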

Thanks

  • Thanks for listening!
  • Reproducible manuscript: https://github.com/SemanticPriming/stimuli-power
  • Package: https://github.com/SemanticPriming/semanticprimeR
  • A QR code on the slide links to a copy of this talk

Simulation Method

  • To evaluate our approach, we ran a simulation study:
    • Scale size: popular cognitive scale types (1-7 Likert ratings, 0-100 percent ratings, and 0-3000 ms response latencies)
    • Item heterogeneity: small, medium, large
    • Skew: normal distributions versus skewed (ceiling) distributions
    • Pilot sample size: 20 to 100, increasing in units of 10
  • 1,620,000 simulations across the 3 × 3 × 2 × 9 design (10,000 per cell); parameter values appear below, and a sketch of one cell follows the table
Parameter Values for Data Simulation

Information                  Likert   Percent   Milliseconds
Minimum                        1.00         0              0
Maximum                        7.00       100           3000
Mu                             4.00        50           1000
Skewed Mu                      6.00        85           2500
Sigma Mu                       0.25        10            150
Sigma                          2.00        25            400
Small Sigma Sigma              0.20         4             50
Medium Sigma Sigma             0.40         8            100
Large Sigma Sigma              0.80        16            200
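As an illustration, one cell of the design might be generated as below (a sketch based on the table's labels; the generative model, item count, and truncation are assumptions):

set.seed(42)
n_items <- 30 # hypothetical number of items
n_subs <- 50 # hypothetical cell sample size
item_mu <- rnorm(n_items, mean = 4.00, sd = 0.25) # Likert Mu, Sigma Mu
item_sigma <- abs(rnorm(n_items, mean = 2.00, sd = 0.40)) # Sigma, Medium Sigma Sigma
scores <- mapply(function(m, s)
  pmin(pmax(rnorm(n_subs, m, s), 1), 7), # truncate to the 1-7 range
  item_mu, item_sigma) # returns a participants x items matrix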

Simulation Results: Scale Size (figure)

Simulation Results: Skew (figure)

Simulation Results: Item Heterogeneity (figure)

Dealing with Pilot Sample Size

  • At some point, power usually asymptotes with increasing sample size
  • So, we need a correction:

\[ 1 - \left( \sqrt{\frac{N_{Pilot} - \min(N_{Simulation})}{N_{Pilot}}} \right)^{\log_2(N_{Pilot})} \]
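In R, the correction term translates directly (a sketch; the packaged calculate_correction() applies this for you and may differ in details):

pilot_correction <- function(n_pilot, n_sim_min) {
  1 - sqrt((n_pilot - n_sim_min) / n_pilot)^log2(n_pilot)
}
pilot_correction(n_pilot = 33, n_sim_min = 20) # pilot n and smallest simulated n
# [1] 0.9046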

Researchers Have One Sample

  • Long story short: the package provides a function researchers can use to correct for pilot sample size
  • We also determined which cutoff level for “sufficiently small” likely works best: the 40% decile (model below, applied in the sketch after the table)
Parameters for 40% Decile Cutoff Scores

Term                            Estimate        SE         t        p
Intercept                        206.589   128.861     1.603     .109
Projected Sample Size              0.368     0.005    71.269   < .001
Pilot Sample Size                 -0.770     0.013   -59.393   < .001
Log2 Projected Sample Size        27.541     0.552    49.883   < .001
Log2 Pilot Sample Size             2.583     0.547     4.725   < .001
Log2 Power                       -66.151    25.760    -2.568     .010
Proportion Variability            16.405     6.005     2.732     .006
Log2 Proportion Variability       -1.367     0.382    -3.577   < .001
Power                              1.088     0.426     2.552     .011
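Applying the table's coefficients directly shows where the corrected values come from (a sketch; the coefficients are copied from the table above, and calculate_correction() wraps this model for you):

predict_corrected <- function(projected_n, pilot_n, power, prop_var) {
  206.589 +
    0.368 * projected_n - 0.770 * pilot_n +
    27.541 * log2(projected_n) + 2.583 * log2(pilot_n) -
    66.151 * log2(power) + 16.405 * prop_var -
    1.367 * log2(prop_var) + 1.088 * power
}
predict_corrected(projected_n = 80, pilot_n = 33,
  power = 80, prop_var = 0.0247) # ~74, matching the Step 6 output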