
Digital Well-being Field Experiment: iOS 12 Grayscale and Screen Time

 

The rise of smartphones and continuous connectivity has given way to what some describe as “compulsive” use. I designed a field experiment to empirically test the causal effect of the grayscale screen filter on the duration of smartphone use, measured with the iOS 12 native ‘Screen Time’ feature, and analyzed and interpreted the resulting data.

 

The Problem

 

Past literature has found that smartphone use can detract from the enjoyment of face-to-face interactions and the mental health benefits associated with social connection; popular media coverage has often focused on the damaging effects of smartphones among young users.

 

With the new availability of usage data, I decided to examine the effect of a grayscale screen, which – before the emergence of the more recent digital well-being movement pursued by tech giants like Apple, Google, and Facebook – had gained traction as a way to limit smartphone “addiction” after being endorsed by Tristan Harris, co-founder of the Center for Humane Technology. My research question: does switching to a grayscale screen cause less time to be spent on mobile devices? I hypothesized that using devices in grayscale would indeed decrease usage, as Harris has suggested.

More recent peer-reviewed literature, such as Andrews et al. (2015), “Beyond Self-Report: Tools to Compare Estimated and Real-World Smartphone Use,” has demonstrated how human factors researchers can use data collected from smartphones for a variety of applications.


The Process

Experiment design


Treatment: Subjects in the treatment condition received a morning text message asking them to switch their phones into grayscale mode for a “couple” of days at a time.

In a repeated-measures design over a two-week period, each subject experienced both treatment and control days. Half of the subjects received treatment at both the beginning and the end of the study (R1), while the other half received treatment on consecutive days in the middle (R2). The design was intended to allow for both within-subject and across-subject comparisons, as sketched below.
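A minimal sketch of how such a counterbalanced assignment could be generated. The study window, the specific treated days, and the subject IDs are hypothetical placeholders, not the actual schedule used:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical 14-day study window; the actual dates and treated days differed.
days = pd.date_range("2018-11-26", periods=14, freq="D")

# R1: treatment blocks at the beginning and end of the study;
# R2: a consecutive treatment block in the middle.
r1_treated = set(days[:3]) | set(days[-3:])
r2_treated = set(days[5:11])

subjects = [f"s{i:02d}" for i in range(20)]                # n = 20
r1 = set(rng.choice(subjects, size=10, replace=False))     # random split into R1/R2

rows = []
for s in subjects:
    treated_days = r1_treated if s in r1 else r2_treated
    for d in days:
        rows.append({"participant_id": s,
                     "date": d,
                     "group": "R1" if s in r1 else "R2",
                     "treatment": int(d in treated_days)})
schedule = pd.DataFrame(rows)
```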

Average screen-time, notification, and pickup measures of participants in R1 (treatment = 0) and R2 (treatment = 1).


To engineer the randomization, I analyzed pre-experiment measurements of the outcome variable, ensuring an even balance of pre-experiment screen time and other covariates between the group that started the experiment in the treatment condition (R1) and the group that started in the control condition (R2).
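A minimal sketch of this kind of pre-treatment balance check, on synthetic stand-in data (column names and values are placeholders):

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-in for the pre-experiment measurements.
pretest = pd.DataFrame({
    "group": ["R1"] * 10 + ["R2"] * 10,
    "screen_time_min": rng.normal(180, 60, 20),
    "notifications": rng.poisson(80, 20),
    "pickups": rng.poisson(55, 20),
})

# Welch's t-test per covariate; large p-values suggest the split is balanced.
for cov in ["screen_time_min", "notifications", "pickups"]:
    r1_vals = pretest.loc[pretest["group"] == "R1", cov]
    r2_vals = pretest.loc[pretest["group"] == "R2", cov]
    t, p = stats.ttest_ind(r1_vals, r2_vals, equal_var=False)
    print(f"{cov:>16}: t = {t:5.2f}, p = {p:.3f}")
```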


Insights & Findings

Distribution of the outcome measure: screen time.


Model Selection: This experimental design allowed me to analyze the causal effect of the treatment both within and across subjects. I ran a panel OLS regression as well as clustered and non-clustered OLS regressions to measure the treatment effect, using the ‘statsmodels’ package in Python.
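A minimal sketch of the clustered and non-clustered OLS specifications on synthetic stand-in data (all column names and values are placeholders):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)

# Synthetic person-day panel: 20 subjects x 14 days (placeholder data).
df = pd.DataFrame({
    "participant_id": np.repeat(np.arange(20), 14),
    "date": np.tile(pd.date_range("2018-11-26", periods=14), 20),
    "treatment": rng.integers(0, 2, 20 * 14),
    "screen_time_min": rng.normal(180, 60, 20 * 14),
})

# Plain OLS treats every person-day as independent.
ols = smf.ols("screen_time_min ~ treatment", data=df).fit()

# Clustered OLS keeps the same point estimate but corrects the standard
# errors for repeated observations within each participant.
clustered = smf.ols("screen_time_min ~ treatment", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["participant_id"]}
)

print(ols.params["treatment"], ols.bse["treatment"])
print(clustered.params["treatment"], clustered.bse["treatment"])
```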

Box plot of the outcome measure, screen time, across days.


After paneling the data – multi-indexing by participant_id and date of observation – I applied an F-test for poolability across cross-sections; the pooled panel specification produced lower standard errors for the Average Treatment Effect (ATE) of the grayscale screen filter than the clustered OLS regression. The estimated ATE was negative but statistically insignificant at the 95% confidence level, with a standard error of 9.62 minutes.
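Continuing the sketch above with the same hypothetical `df`, the paneling step and the poolability test might look like this; here the F-test is implemented as a joint test that participant-specific intercepts add nothing beyond the pooled model:

```python
from statsmodels.stats.anova import anova_lm

# Panel the data: multi-index on participant and observation date.
panel = df.set_index(["participant_id", "date"]).sort_index()

# Pooled specification vs. one with participant fixed effects.
pooled = smf.ols("screen_time_min ~ treatment", data=df).fit()
fixed = smf.ols("screen_time_min ~ treatment + C(participant_id)", data=df).fit()

# Poolability F-test: are participant-specific intercepts jointly necessary?
# A large p-value supports pooling observations across subjects.
print(anova_lm(pooled, fixed))
```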

Compliance rates did not differ significantly between subject groups. Attrition rates did differ significantly; however, a weighted least squares specification with inverse probability weighting produced estimates similar to those of the original model specification.
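A minimal sketch of this inverse-probability-weighting correction, again continuing with the hypothetical `df` from the earlier sketch (the subject-level table and its columns are placeholders):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)

# Synthetic subject-level table (all names and values are placeholders).
subjects = pd.DataFrame({
    "participant_id": np.arange(20),
    "baseline_screen_time": rng.normal(180, 60, 20),
    "retained": rng.integers(0, 2, 20),   # 1 = completed the study
})

# Step 1: model the probability of remaining in the study from
# pre-treatment covariates.
p_hat = smf.logit("retained ~ baseline_screen_time",
                  data=subjects).fit(disp=0).predict(subjects)
subjects["ipw"] = 1.0 / p_hat

# Step 2: merge weights onto the person-day outcomes of completers and
# re-estimate the ATE by weighted least squares; subjects who resemble
# dropouts count for more, offsetting attrition bias.
completers = df.merge(
    subjects.loc[subjects["retained"] == 1, ["participant_id", "ipw"]],
    on="participant_id",
)
wls = smf.wls("screen_time_min ~ treatment", data=completers,
              weights=completers["ipw"]).fit()
print(wls.params["treatment"], wls.bse["treatment"])
```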


Limitations and Future Work

The sample size (n = 20) yields a statistical power of roughly 20%: only a 20% likelihood that the study would detect an effect when there is a true effect to be detected.


Many of the subjects in this study were university students, so their behavioral patterns may not generalize to non-student populations. However, even within this sample (n = 20), we find low individual heterogeneity, suggesting that a within-subject pooled regression could estimate any causal effects of the grayscale treatment more accurately. An experiment of longer duration with more subjects would be preferable for estimating effects reliably: employing a noncentral F-distribution with an estimated effect of 30 minutes gives the sample size (n = 20) an 80% likelihood of committing a Type II error.
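A minimal sketch of such a power calculation using statsmodels’ TTestPower, which works with the noncentral t – the single-contrast equivalent of the noncentral F (F = t²). The 30-minute effect comes from the text; the 120-minute standard deviation of the within-subject differences is a hypothetical placeholder, since the write-up does not report it:

```python
from statsmodels.stats.power import TTestPower

# Power for a paired (within-subject) comparison via the noncentral t.
# Effect of 30 minutes is from the text; the 120-minute SD is assumed.
effect_size = 30 / 120          # Cohen's d for the paired differences
power = TTestPower().power(effect_size=effect_size, nobs=20, alpha=0.05)
print(f"power = {power:.2f}; Type II error rate = {1 - power:.2f}")

# Sample size needed to reach 80% power under the same assumptions.
n_needed = TTestPower().solve_power(effect_size=effect_size,
                                    power=0.8, alpha=0.05)
print(f"n for 80% power = {n_needed:.0f}")
```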

Future work in this area may benefit from supplemental qualitative data obtained through structured interview questions or surveys. Such data could feed into a multi-factor design with a fully saturated specification. In addition, attitudinal information could be gathered to understand how participants feel about the treatment condition, even if it does not affect their behavior.