Here at TCG, we're never afraid of a challenge. Whether we're crushing conversion goals, tackling traffic targets, or taking on the steep task of collectively taking 10 million steps in one month in our first-ever step challenge, we're always game to push ourselves to achieve the impossible. In just one month, TCG team members walked, ran, skipped, climbed and jumped our way to millions of steps, crushing our goal of taking 10 million steps in only four weeks.

And then we totally geeked out analyzing our step data!

Here, our very own in-house data guru, Charles Showley, dives into the hard numbers behind our step challenge, analyzing all that delicious data we produced.

Take it away, Charles!

## Using Predictive Data Modeling To Motivate The Team To Take More Steps

A challenge to accrue 10 million steps in a month sounds like a very “at-face-value” kind of proposition, but it creates a promising landscape for all kinds of numerical analysis. Namely: given 73 people, each with up to 29 daily step counts we can treat as normally distributed, what total number of steps could you realistically expect if everyone were diligent about recording their data every day? In other words, we probably aren’t going to have everyone participate in the month-long challenge, so rather than filling in a constant “2,500 steps per blank cell,” how can we simulate data to get a more realistic assessment of what our performance would be?

At stake: a company-sponsored tour of San Diego’s greatest breweries. So why not build a mathematical, predictive model to motivate everyone to walk more? (N.B. I can hardly take a holier-than-thou stance, as my final ranking hovered around 50%.)

By the end of the challenge, 66 people had participated, or 90.4% of TCG’s employees. I plotted our daily progress over time as the step total approached the 10 million mark. For each day of recorded data, I compared the day’s progress against an ideal number of steps (*smin*) to take that day: 10 million steps divided into 29 equal parts (*ndays*), and divided further by the number of participants, 73 (*nTCG*), shown below:

*smin* = 10,000,000 / (*ndays* × *nTCG*) = 10,000,000 / (29 × 73)

This comes out to 4,724 steps per day, per person. After our first week, we were not off to a good start:

[caption id="attachment_9046" align="aligncenter" width="800"] Fig. 1[/caption]

This is a semi-logarithmic plot, compressing the y-axis for clarity. With such a large range of data to display, the individual contributions to the total progress would be dwarfed by the plot’s scale. This is also why the daily goal line is curved even though it increases by the constant value (*smin*) every day: it’s been transformed by the semi-logarithmic nature of the space it occupies.
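As a quick sanity check, the per-person daily target can be reproduced in a few lines of Python (the variable names here are mine, not from the original spreadsheet):

```python
# Daily per-person step target for the challenge.
TOTAL_GOAL = 10_000_000  # steps to hit collectively
N_DAYS = 29              # length of the challenge (ndays)
N_PEOPLE = 73            # TCG headcount at kickoff (nTCG)

s_min = TOTAL_GOAL / (N_DAYS * N_PEOPLE)
print(round(s_min))  # 4724
```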

The next step was to take historical data from what had been recorded so far and plot a linear regression through to the end of the event. This would visualize past behavior of the system and predict how it would behave an arbitrary amount of time into the future. The predictions would become more accurate over time with more historical data input. I used Python’s *statsmodels.formula.api.ols* function to plot the regression:

[caption id="attachment_9036" align="aligncenter" width="800"] Fig. 2[/caption]

The regression is plotted in solid purple, and where it terminates is the predicted final step count if there's no change in behavior of the system.
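A minimal sketch of that regression step, with made-up cumulative totals standing in for the real spreadsheet data (the day counts, noise level, and slope here are illustrative assumptions, not our actual numbers):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical cumulative totals for the first 13 days, standing in for the
# real spreadsheet data.
rng = np.random.default_rng(0)
days = np.arange(1, 14)
totals = 260_000 * days + rng.normal(0, 20_000, size=days.size)
df = pd.DataFrame({"day": days, "steps": totals})

# Ordinary least squares: cumulative steps as a linear function of day.
model = smf.ols("steps ~ day", data=df).fit()

# Extrapolate the fitted trend to the final day of the challenge.
prediction = model.predict(pd.DataFrame({"day": [29]})).iloc[0]
print(f"Projected Day-29 total: {prediction:,.0f} steps")
```

The further into the challenge you refit, the more historical data anchors the line, which is why the projection sharpens over time.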

Almost two weeks in, and we seemed doomed to barely crack the 7.5 million step mark. Either more people were going to have to log their steps, or those actively contributing would have to start running a marathon a day (N.B. Given average heights for American men and women, that would be 57,016 and 62,215 steps, respectively). So how much would having more people log steps help? Is there a way to calculate this number, based on the performance of active participants? How much can we reasonably expect to gain from maximum participation?

I present our best friend: the normal distribution.

[caption id="" align="aligncenter" width="576"] Fig. 3[/caption]

It describes what you may recognize as a bell curve, which assumes that a majority of the measurements in an experiment (steps per day, in our case) will be close to the average (*μ*) of all measurements. The width of the curve depends on the standard deviation (*σ*) of the measurements, or how much the measurements vary relative to each other: a large standard deviation, with lots of people logging very few steps and others logging a ton, produces a very wide curve. Based on data from active participants, I calculated the average (6,500 steps) and standard deviation (5,800 steps) from all nonzero recorded steps and used Python’s *numpy.random.normal* function to simulate *n* randomly generated data points for each day that hadn’t been filled in on our progress spreadsheet. The beauty of this function is that it relies on data that already exists in the spreadsheet instead of pulling numbers out of thin air, so even the numbers it generates will still conform to the pre-existing distribution of steps.

But of course, any randomly generated number, even one drawn from a framework like a normal distribution, can come out extreme. I corrected for this by running the simulation 100 times and averaging the results, for an overall average of 6,709 steps and a standard deviation of 5,640 steps. The function can return a number anywhere between negative and positive infinity, though with a high probability of landing close to the average of the distribution. Given the shape and location of our distribution along the number line, the probability of drawing a negative number, found by integrating the normal density from negative infinity to zero, is determined: P(X < 0) = Φ((0 − μ) / σ) ≈ 13.2%.
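That probability can be checked with nothing but the standard library, using the error-function form of the normal CDF (a sketch, plugging in the rounded mean and standard deviation quoted above):

```python
from math import erf, sqrt

mu, sigma = 6500, 5800  # rounded mean / std dev of recorded nonzero daily steps

# Standard normal CDF: Phi(z) = (1 + erf(z / sqrt(2))) / 2
z = (0 - mu) / sigma
p_negative = (1 + erf(z / sqrt(2))) / 2
print(f"P(steps < 0) = {p_negative:.1%}")  # ≈ 13.1% with these rounded inputs
```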

I could either roll the dice again to try and get a non-negative number, or I could manually set the negative number to zero and accept that about 13.2% of my randomly generated data will be zero. I take the latter option as the former would have artificially inflated my results.
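Putting the pieces together, the fill-in simulation might look like the following sketch. Here `n_missing` is a hypothetical count of blank spreadsheet cells, not the actual number from our challenge:

```python
import numpy as np

mu, sigma = 6500, 5800  # from recorded nonzero steps
n_missing = 500         # hypothetical number of blank cells to fill
n_runs = 100            # average over many runs to tame outliers

np.random.seed(7)
run_totals = []
for _ in range(n_runs):
    # Draw plausible daily step counts from the observed distribution.
    simulated = np.random.normal(mu, sigma, n_missing)
    # A negative draw becomes a zero-step day rather than a re-roll.
    simulated[simulated < 0] = 0
    run_totals.append(simulated.sum())

print(f"Mean simulated contribution: {np.mean(run_totals):,.0f} steps")
```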

Plotting the simulated data along with TCG’s recorded data by the end of the challenge returns:

[caption id="attachment_9050" align="alignnone" width="1600"] Fig. 4[/caption]

The triangles marking **ideal performance** describe the random data points generated with the method described above.

We cleared our 10 million mark! As the deadline approached, more people began contributing their steps than in the earlier days of the challenge, raising the projection from what was shown in Fig. 2. Our combined efforts led us to surpass the 10-million-step goal by Day 27, but with 100% participation we would have crossed the finish line as early as Day 23.

Our final totals came out to: 10,815,636 steps, averaging 373,089 steps per day, and 5,653 steps per day, per person.

The final count we would have had with theoretical maximum participation came to: 14,192,439 steps.

Our top three participants were front-end developer Jake Howard and application developers Tony Lea and Eric Greene. Their steps accounted for almost 16% of the total number of steps taken! Honorable mention to front-end developer Leica Zetisky for snagging the most steps taken in one day: 66,727.