Unlocking the Winning Formula: Sports Analytics with Python and ArcticDB

Aug 19, 2025


How many wins should a team expect given the number of points (or runs or goals) it scores and allows/concedes over the course of a season? That number provides a natural benchmark for the team’s actual record. Did it outperform or underperform its capabilities? Did the team’s manager do a good or bad job of in-game decision-making?

Bill James' groundbreaking work in Major League Baseball during the 1980s was one of the first to link points scored and allowed to a team's expected wins and losses:

\text{Expected Winning Percentage} = E\left[\frac{W}{W+L}\right]\approx\frac{PS^\lambda}{PS^\lambda+PA^\lambda},\quad\lambda>0 \tag{1}

where W is the number of team wins over the course of a season, L is the number of losses, PS is points scored by the offense and PA is points allowed by the defense.

Ties can be incorporated by fractionally allocating them to equivalent wins and losses. In the English Premier League, for example, a tie is worth 1/3 of a win and 2/3 of a loss in the league standings. Since the league average winning percentage then depends on the number of league ties, one would need to incorporate an additional factor on the right-hand side of (1) such that if PS = PA, the expected winning percentage equals the league average winning percentage. The factor is not necessary if ties are split 50/50 between wins and losses.
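A concrete sketch of the fractional allocation (the record below is hypothetical; the 1/3 and 2/3 weights are the EPL convention just described):

```python
def equivalent_record(wins, losses, ties, tie_win_share=1/3):
    """Fold ties into equivalent wins and losses.

    Each tie counts as `tie_win_share` of a win and
    (1 - tie_win_share) of a loss (1/3 and 2/3 in the EPL example).
    """
    eq_wins = wins + tie_win_share * ties
    eq_losses = losses + (1 - tie_win_share) * ties
    return eq_wins, eq_losses

# A hypothetical 20W-10D-8L season over 38 matches:
eq_w, eq_l = equivalent_record(20, 8, 10)
win_pct = eq_w / (eq_w + eq_l)
```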

Because James found λ = 2 reasonable in describing baseball, the formula has been nicknamed the ‘Pythagorean’ Won-Loss Formula. Note that if PS = PA, the team would be expected to win half of its games for a ‘0.500’ record. For example, the 2020–2021 NBA Champion Milwaukee Bucks scored 8,649 points and yielded 8,225 to its opponents during the regular season. For NBA teams, we estimate λ ≈ 14.3 (more on where we get this value later).

And there can be debate about what λ should be. For example, ESPN assumes 16.5 in its Pythagorean expectation for NBA teams (https://www.espn.com/nba/stats/rpi/_/year/2024).

\frac{8,649^{14.3}}{8,649^{14.3} + 8,225^{14.3}} = 67.2\%

The Bucks’ actual winning percentage was only 63.9% in the regular season (= 46 wins / 72 games). So, they underperformed their capabilities at least by this measure. This might also explain their playoff success, where they likely played closer to their ‘true’ level of talent.
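The Bucks figure is easy to reproduce. A minimal sketch of Equation (1) (the function name is ours), using the point-ratio form to keep the exponentiation numerically tame:

```python
def pythagorean_win_pct(ps, pa, lam):
    """James' Pythagorean expectation, PS^lam / (PS^lam + PA^lam),
    computed via the equivalent ratio form (ps/pa)**lam."""
    r = (ps / pa) ** lam
    return r / (r + 1.0)

# 2020-21 Milwaukee Bucks regular season with our NBA lambda of 14.3:
bucks = pythagorean_win_pct(8649, 8225, 14.3)   # ~0.672
```

Swapping in ESPN's λ = 16.5 pushes the estimate even further above the Bucks' actual 63.9%.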

Equation (1) is central to much of the ‘Moneyball’ revolution in sports. Jonah Hill’s character in the 2011 movie Moneyball references the equation in one scene while discussing with Brad Pitt’s character how they will build their team in the upcoming season after losing most of their best players. The equation’s first derivative can be used to calculate an individual player’s marginal added value, commonly referred to as Wins Above Replacement (WAR) or Value Over Replacement Player (VORP). These metrics are now widely used for player valuation across several sports. Conceptually, a replacement player is modeled as a below-average, fringe player who is readily available to a team at the league minimum wage. WAR and VORP-like metrics serve as standardized estimates of a player’s ‘alpha,’ allowing comparisons across players and teams.

We can see λ’s importance in these marginal added value calculations. Taking the first derivative of (1) with respect to PS and using an average team to provide a common baseline (i.e., a team that scores as many points as it allows, at a rate equal to the league average scoring rate of P points per game), the marginal wins-per-point function is:

Using PA would give the same answer but with an opposite sign. Fewer points allowed implies more wins.

\text{Wins per Point}_{Avg\ Team} = \frac{\lambda}{4 \cdot P} \tag{2}

Once we have a reasonable estimate of λ, marginal wins per point (or run or goal) can be computed for any sport. To make this more concrete, let’s take the case of Aaron Judge, an outfielder for Major League Baseball’s (MLB) New York Yankees, who made $40 million in salary for the 2024 season. Did he earn it? Purposely handwaving over most or all of the details, we use data from Fangraphs.com, which publishes one of the widely referenced WAR values for baseball. According to the site, Judge’s offensive and defensive contributions added 108.5 runs above a replacement player during the season. How many wins over the full 162-game season would those runs be worth to an average team?

Using (2), plugging in our estimate of λ ≈ 1.8 for MLB (again, more on this later) and an average of 4.39 runs per game for the 2024 season, our wins-per-run estimate is about 0.103, which is consistent with Fangraphs’ calculations. Flipping the relationship, we can say that 9 to 10 runs over the season are worth about one win in baseball. Multiplying through, Judge added about 11.2 wins (= 108.5 runs x 0.103 wins per run) during the season. And finally, Fangraphs estimates that a win in baseball is worth about $8 million based on free-agency market transactions, implying Judge created $88–$89 million in value in 2024, significantly more than his $40 million salary.
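The arithmetic in the Judge example can be sketched as follows (function and variable names are ours; all inputs come from the text above):

```python
def wins_per_point(lam, avg_points_per_game):
    """Marginal wins per point for an average team, per Equation (2)."""
    return lam / (4.0 * avg_points_per_game)

wpr = wins_per_point(1.8, 4.39)   # ~0.103 wins per run for 2024 MLB
judge_wins = 108.5 * wpr          # runs above replacement -> wins added
judge_value = judge_wins * 8.0    # at ~$8M per marginal win, in $ millions
```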

Our purpose here

So why write more on this well-covered equation that is now central to professional sports (other than highlighting how cool ArcticDB is, of course)? Most studies of James’ formula we have read were written in isolation, focusing on individual sports.

See the sports listed here for examples: https://en.wikipedia.org/wiki/Pythagorean_expectation.

As a result, λ often feels somewhat arbitrary or, at best, not well understood beyond the fact that each sport has its own specific value that makes the James model ‘work’.

An important caveat: we are not experts in this field, and given the vast literature written by professionals, academics, and enthusiasts, there is a very high likelihood we’ve missed something along the way.

James’ derivation of (1) was largely empirically driven. Miller (2006/7) provides theoretical support by showing that (1) gives the closed-form probability that PS > PA (i.e., the probability that a team wins a game) assuming, among other things, that points scored and allowed are independent random variables following continuous Weibull distributions with means equal to PS and PA, respectively, and both share a shape parameter = λ.

Miller, Steven J. ‘A Derivation of the Pythagorean Won-Loss Formula in Baseball.’ Chance Magazine. 20.1 (2007): 40–48. An abridged version appeared in The Newsletter of the SABR Statistical Analysis Committee 16.1 (2006), 17–22, and an expanded version is available at http://arxiv.org/abs/math/0509698. To derive a closed form solution, Miller implicitly assumes that the IR (defined later in this paper) of a team’s offense and defense are the same, though allowing the expected values for PS and PA to be different. Equivalent IRs is a potentially stringent assumption at the team level. Relaxing this assumption opens the possibility for teams to manage not only mean scoring but also scoring volatility in attempting to win more games.

The Weibull distribution, like the lognormal distribution commonly used in finance to model stock returns, provides a reasonable representation for positive-valued, random variables which is the case for the sports analyzed here (e.g., we don’t study golf).
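Miller’s closed form is easy to sanity-check numerically: for two independent Weibull draws with a common shape λ and means PS and PA, the win probability should equal PS^λ/(PS^λ + PA^λ). A minimal Monte Carlo sketch (the parameter values are illustrative):

```python
import numpy as np
from math import gamma

rng = np.random.default_rng(42)
lam = 2.0                     # common Weibull shape parameter
mean_ps, mean_pa = 5.0, 4.0   # target means for points scored / allowed

# numpy's weibull() draws with scale 1; rescale so the means hit their
# targets (a Weibull with scale b and shape lam has mean b*Gamma(1 + 1/lam)).
scale_ps = mean_ps / gamma(1 + 1 / lam)
scale_pa = mean_pa / gamma(1 + 1 / lam)

n = 1_000_000
ps = scale_ps * rng.weibull(lam, n)
pa = scale_pa * rng.weibull(lam, n)

mc_win_prob = (ps > pa).mean()
closed_form = mean_ps**lam / (mean_ps**lam + mean_pa**lam)  # 25/41
```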

Our purpose here is to examine several sports simultaneously to provide some intuition about what the λ coefficient is capturing. We will show that the coefficient functions as a kind of exchange rate or scaling factor that translates the mean and standard deviation of the scoring distribution in each sport into the expected win probability. λ is approximately linear in each sport’s Information Ratio (IR). In finance, IR (and its cousin the Sharpe Ratio) describes the ratio of an asset’s return to its risk. In sports, we define it as the mean points (or runs or goals) scored per game divided by the standard deviation of points scored per game.

IR = \frac{\mu_{PPG}}{\sigma_{PPG}}

For example, using all the available history we collected for the English Premier League (EPL) and the NBA, we compute IRs of:

We use the EPL labeling loosely since we combine EPL history with data from the top division of the English Football League prior to the EPL’s first season in 1992–1993.

EPL 1889–2019 (n = 48,670 matches x 2):

\frac{1.49}{1.35}=1.1

NBA 1947–2019 (n = 64,043 games x 2):

\frac{103.2}{14.7}=7.0
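In code, the IR is just the ratio of the per-game scoring mean to its standard deviation (figures from the two leagues above):

```python
def information_ratio(mean_ppg, sd_ppg):
    """Mean points per game divided by the standard deviation of points per game."""
    return mean_ppg / sd_ppg

epl_ir = information_ratio(1.49, 1.35)    # ~1.1 for EPL goals per game
nba_ir = information_ratio(103.2, 14.7)   # ~7.0 for NBA points per game
```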

Which sports? What data?

After scouring the internet for several months during the Covid-19 pandemic, we collected game-level scoring data for nine sports drawn from across the globe, covering seasons as far back as the 19th century and through the last pre-Covid-19 year of 2019 (data details can be found in Appendix A):

In collecting the data, we targeted sports that are either clock- or turn-limited (e.g., basketball or baseball, respectively) and are played at a high level in a controlled league where teams compete generally among themselves in a repeated fashion, in regulated arenas, under fixed rules, and in a common season. Play-to-a-score sports like tennis and volleyball would not be a good fit for the analysis. We avoided national team contests like international test match cricket where match-over-match roster consistency is a concern, and few games are played each year.

  • Aussie Rules Football – Australian Football League (AFL)

  • Football – English Premier League (EPL)

  • Twenty20 Cricket – Indian Premier League (IPL)

  • Lacrosse – NCAA Men’s Division I Lacrosse (LAX)

  • Baseball – Major League Baseball (MLB)

  • Basketball – National Basketball Association (NBA)

  • American Football – National Football League (NFL)

  • Ice Hockey – National Hockey League (NHL)

  • Rugby – Super Rugby Pacific (SUP)

Using ArcticDB, we stored game-by-game results dating back as far as possible, recording key information for each game:

  • League

  • Date

  • Team names (both teams)

  • Points scored and points allowed

We list the mean, standard deviation, and IR for each league in the respective boxes in Figure 1, covering the 2010–2019 seasons. Scoring distributions vary across sports but are generally positively skewed. For example, sports like American football have distinct scoring clusters consistent with their respective scoring systems. The James model and our fitting procedure have proven robust across sports, so our findings here are not highly sensitive to an assumed distribution.

Figure 1. Points per game distributions by league, all games, 2010–2019 seasons


Source: Various (listed in appendix) as at January 2025 Key: Australian Football League (AFL), English Premier League (EPL), Indian Premier League (IPL), NCAA Men’s Division I Lacrosse (LAX), Major League Baseball (MLB), National Basketball Association (NBA), National Football League (NFL), National Hockey League (NHL), Super Rugby Pacific (SUP)

Fitting the model

For our purposes, it will be easier to work in units of (log) odds, which affords fitting the James model through ordinary least squares (OLS).

We follow the methodology discussed in Max Marchi et al., Analyzing Baseball Data with R, CRC Press, 2019, pp. 101–102.

With some rearranging of (1), taking the natural logarithm of both sides and setting up for a regression:

Taking logs forces us to add more assumptions, i.e.,

W_i > 0,\ L_i > 0,\ {PS}_i > 0,\ \text{and}\ {PA}_i > 0

which knock out a few team-seasons. For example, the NFL’s 0–16 Browns in 2017 and 16–0 New England Patriots in 2007 are excluded. There are teams primarily in the early history of some sports with a small number of games per season that failed to score a single point in any game. Overall, we don’t view the impact of these additional assumptions to be meaningful.

\ell_i=\log{\left(\frac{W_i}{L_i}\right)}=a+\lambda\bullet\log{\left(\frac{{PS}_i}{{PA}_i}\right)}+\varepsilon_i \tag{3}

where the index i indicates a given team-season (e.g., Liverpool 2018–2019), ℓi is the log odds ratio, a is an intercept term, and εi is the residual error for each team-season.

The intercept a can be non-zero depending on the distribution of wins and losses, since there is no assurance that the mean of
\log{\left(\frac{W_i}{L_i}\right)}
is zero.

Building a regression with ArcticDB data

At this point, we want to run a regression in Python using data from ArcticDB. To get a feel for the data, let’s look at all teams in all seasons of the English Premier League (EPL) from 1888–1889 to 2018–2019 and the National Basketball Association (NBA) from 1946–1947 to 2018–2019.

To run this in Python, we first need to install ArcticDB and import the necessary libraries. Examples here are run with version 5.1.2.

%pip install arcticdb
import arcticdb as adb
adb.__version__
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme()

Getting this data for yourself

We’ve collated data from various free sources in a .csv file and shown you how you can load this data into ArcticDB for analysis in Python. All sources are listed in Appendix D.

Loading and preparing the regression data

Next, let’s load and list out the league data we have. Each league is a separate dataframe (table) in ArcticDB. Details of how to load data from .csv into ArcticDB are in Appendix E.

arctic = adb.Arctic('lmdb://sports')
league_data = arctic.get_library('leagues')
leagues = league_data.list_symbols()
leagues
['AFL', 'NBA', 'LAX', 'SUP', 'NFL', 'NHL', 'MLB', 'IPL', 'EPL'] 

Each table for a league looks like this with both team names, PS (points scored), PA (points allowed), the game date, whether it was played at home or away, and whether the game was won (1) or lost (0). The first five rows are displayed here.

Table 1: Extract showing the top rows of the AFL data

| YEAR_ID | TEAM_ID | OPP_TEAM_ID | PS | PA | date | home | wi | LEAGUE |
|---------|---------|-------------|----|----|------------|------|----|--------|
| 1897 | Carlton | Fitzroy | 16 | 49 | 1897-05-08 | 0 | 0 | AFL |
| 1897 | Carlton | South Melbourne | 36 | 40 | 1897-05-15 | 0 | 0 | AFL |
| 1897 | Carlton | Essendon | 41 | 78 | 1897-05-24 | 0 | 0 | AFL |
| 1897 | Carlton | Geelong | 22 | 44 | 1897-05-29 | 0 | 0 | AFL |
| 1897 | Carlton | Melbourne | 26 | 107 | 1897-06-05 | 0 | 0 | AFL |

Each observation for the regression includes the total number of points scored or allowed, as well as the mean ‘win’ value for each season and team. Let’s extract the relevant columns from ArcticDB and prepare the data. Here, we use the batch-read and lazy-dataframe features to rapidly retrieve all the league tables and concatenate into a single table.

lazy_dfs = league_data.read_batch(leagues, lazy=True)
df = adb.concat(lazy_dfs).collect().data[['LEAGUE', 'YEAR_ID', 'TEAM_ID', 'PS', 'PA', 'WI']]
df_team = df.groupby(['LEAGUE', 'YEAR_ID', 'TEAM_ID']).agg(PS=('PS', 'mean'),
                                                        PA=('PA', 'mean'),
                                                        WI=('WI', 'mean'),
                                                        nobs=('WI', 'count'))

df_team.round(2).head()

Per Equation (3), we calculate the log ratios and plot both EPL and NBA lines.

# prepare data for regression
df_reg = df_team.copy()
df_reg = df_reg.loc[(df_reg['PS']>0) & (df_reg['PA']>0) & \
                    (df_reg['WI']>0) & (df_reg['WI']<1)]
df_reg['lnWI']=np.log(df_reg['WI']/(1-df_reg['WI']))
df_reg['lnPSPA']=np.log(df_reg['PS']/df_reg['PA'])
df_reg = df_reg.reset_index()

# plot
NBA_and_EPL = df_reg[df_reg['LEAGUE'].isin(['NBA', 'EPL'])]
sns.lmplot(data=NBA_and_EPL,
           x='lnPSPA',
           y='lnWI',
           hue='LEAGUE',
           scatter_kws=dict(s=3),
           line_kws=dict(color="grey"),
           facet_kws=dict(legend_out=False),
           aspect=1.6)

See Figure 2, where each dot represents one team-season (e.g., Chicago Bulls 1995–1996). The slope of the regression line through the data points gives us the implied or fitted λ for each sport. Football (smallest λ) and basketball (largest λ) sit at the opposite extremes of the nine sports in our data.

Figure 2. Team-season plot: EPL (1888–1889 to 2018–2019) and NBA (1946–1947 to 2018–2019)


Source: Various (listed in Appendix) as at January 2025. Note: Number of team-season observations: EPL = 2,463 and NBA = 1,607. For consistency, the EPL datapoints assume ties are worth 1/2 of a win for the entire period, which was the case up until the 1980–1981 season when it was lowered to 1/3 of a win to encourage fewer ties. Source: James P. Curley engsoccerdata: English Soccer Data 1871–2019 (https://github.com/jalapic/engsoccerdata); fivethirtyeight NBA ELO data 1947–2019 (https://data.fivethirtyeight.com/). We use the EPL labeling loosely, since we combine EPL history with data from the top division of the English Football League prior to the EPL’s first season in 1992–1993.

It’s useful to keep track of our results as we go, enabling us to revisit them if our methodology or input data changes for any reason. In ArcticDB we can create a library (i.e., a dataset) for this and just write the resulting dataframe, without requiring database administration.

analysis = arctic.get_library('analysis', create_if_missing=True)
analysis.write('regression_data', df_reg)

Results

Before discussing the regression results in detail, we can build some intuition about what we should find. Looking at the scatterplot in Figure 2, note the tight spread of dots along the X-axis for the NBA compared to the much wider spread for the EPL. The best and worst regular-season teams in NBA history exhibit a spread of just 25% in the log ratio of points scored to points allowed: +10% for the Golden State Warriors (2015–16) versus -15% for the Charlotte Bobcats (2011–12). In the modern era of the EPL, we observe a much larger spread of almost 300%: +137% for Manchester City (2017–18) versus -149% for Derby County (2007–08). Thus, the cross-sectional spread, or variation in the log ratio of PS to PA (i.e., the ‘X’ or independent variable in the regression), appears to be an important driver of λ.

As many of you will remember from your basic statistics classes, the slope coefficient in a single variable OLS regression can be decomposed into the ratio of the covariance between the Y and X variables and the variance of X. The ratio functions as a scaling factor that normalizes the different units of Y and X. So, the lower (or higher) the variation in X, the steeper (or flatter) the slope of the line will be. The opposite is true for the variation in Y.

\lambda=\frac{COV\left(Y,X\right)}{\sigma^2\left(X\right)} \tag{4}
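Equation (4) is quick to verify on synthetic data: the covariance-over-variance ratio reproduces the fitted OLS slope (a sketch with arbitrary numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000)
y = 3.0 * x + rng.normal(size=1_000)   # true slope of 3 plus noise

# Slope from the covariance/variance decomposition of Equation (4)...
slope_cov = np.cov(y, x)[0, 1] / np.var(x, ddof=1)

# ...agrees with the least-squares fit.
slope_ols = np.polyfit(x, y, 1)[0]
```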

For the moment, let’s assume full team parity within a sports league — each team is equally skilled and plays the same number of N games per season. All teams’ offenses and defenses draw from two independent points distributions: one for PS and one for PA, each with mean μ and variance σ2. Also assume that the values drawn for PS and PA in each game perfectly determine its outcome (PS > PA = win; otherwise = loss). Since the expected values for PS and PA are the same, we would expect every team to have a 50% chance of winning any given game.

We show in Appendix B that, under these simplifying assumptions, λ is approximately linear in each sport’s IR. Equation (5a) clearly shows that both the number of games played in a season and the IR are inverse drivers of the dispersion in X in Figure 2. EPL teams play 38 games per season, while NBA teams play 82. As outlined in the next section, the EPL has the lowest IR among the sports in our sample, while the NBA has the highest, which is also consistent with the slopes for the two sports in the scatter plot.

\sigma^2\left(X\right)=\sigma^2\left[\log{\left(\frac{{PS}_i}{{PA}_i}\right)}\right]\approx\frac{2}{N}\bullet{IR}^{-2} \tag{5a}
COV\left(Y,X\right)=COV\left[\log{\left(\frac{W_i}{L_i}\right)},\log{\left(\frac{{PS}_i}{{PA}_i}\right)}\right]\approx\frac{4\bullet C}{N}\bullet{IR}^{-1} \tag{5b}

Plugging (5a) and (5b) into (4) we get an approximation of λ:

\lambda\approx2\bullet C\bullet IR

where C is a constant that depends on the underlying distributions of PS and PA.

For Miller’s Weibull distribution case, we can solve for the shape and scale parameters which define the distribution such that the mean of the distribution equals μ and the variance equals σ². For simplicity, we assume a shift parameter of 0 vs. Miller’s -0.5. The mean and variance constraints are satisfied if we find a shape parameter λ such that Γ(1+2/λ)/Γ²(1+1/λ) = 1 + IR⁻². The scale parameter then follows as μ/Γ(1+1/λ), where Γ is the gamma function. The shape parameter solution required a numerical inversion of the equation, for which we found a first-order approximation. Given our theoretical arguments above, it’s not surprising that the approximation of the shape parameter is linear in IR, with an implied C. Hundal’s derivation for independent, log-normal distributions also has an implied C (see https://en.wikipedia.org/wiki/Pythagorean_expectation).

Without assuming an actual underlying point distribution, it’s hard to say much more than that a sport’s IR is an important factor in determining λ, and the relationship between the two is approximately linear under our assumptions.
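Under the parity assumptions, the λ ≈ 2·C·IR relationship can be checked by simulation. The sketch below assumes normally distributed per-game scores, a convenient choice that is ours rather than the source’s; for a normal score difference, our own calculation of the Appendix B constant gives C = 1/√π ≈ 0.56. We simulate team-seasons, fit the Equation (3) regression, and compare the slope with the prediction:

```python
import numpy as np

rng = np.random.default_rng(1)
n_games, mu, sigma = 82, 100.0, 14.0   # an NBA-like league under full parity
n_seasons = 4000

ir = mu / sigma                         # information ratio, ~7.1
lam_pred = 2.0 * ir / np.sqrt(np.pi)    # 2*C*IR with C = 1/sqrt(pi), ~8.1

# Draw independent per-game points scored and allowed for every team-season.
ps = rng.normal(mu, sigma, (n_seasons, n_games))
pa = rng.normal(mu, sigma, (n_seasons, n_games))
wins = (ps > pa).sum(axis=1)
losses = n_games - wins

# Keep seasons with at least one win and one loss so log odds are defined.
ok = (wins > 0) & (losses > 0)
y = np.log(wins[ok] / losses[ok])
x = np.log(ps.sum(axis=1)[ok] / pa.sum(axis=1)[ok])

lam_fit = np.polyfit(x, y, 1)[0]        # should land near lam_pred
```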

We are fitting across team-seasons in Equation (3), and some leagues may have a large dispersion or skew among their top and bottom teams. It’s unclear how each sport’s actual point distribution and competitive imbalances among teams will impact the OLS estimates of λ, so we let the regression results sort out these details.

Regression results for the James model

While we show all available data in Figure 2 for both the NBA and EPL, we operate in rolling 10-year windows, focusing most of our attention on the 2010–2019 seasons. Sports evolve over time, leading to varying estimates of λ. For example, the NBA introduced the 24-second shot clock in the 1954–1955 season to save fans from extreme boredom. In 1950, the Fort Wayne Pistons defeated the Minneapolis Lakers 19–18 in the lowest-scoring game in league history. Fitting the model over a decade of data feels about right as we balance having enough team-seasons to estimate the model and capturing the changing dynamics of each league.

Looking at the recent period first, we fit Equation (3) for each league by pooling all team-seasons from the 2010–2019 seasons.

We’ll use a utility function to perform the regression, so we can return the fit within a pandas apply method.

def ols_coef(df, xcol, ycol):
    """OLS function used for lambda calcs -- assumes univariate regression"""
    model = sm.OLS(df[ycol], sm.add_constant(df[xcol])).fit()
    outdf = pd.DataFrame({'adj.'     : model.params.iloc[0],
                          'adj. se'  : model.bse.iloc[0],
                          'λ'        : model.params.iloc[1],
                          'λ se'     : model.bse.iloc[1],
                          'nobs'     : model.nobs,
                          'adj. R^2' : model.rsquared_adj}, index=[0])

    return outdf

Using the same prepared data above (df_reg), we filter for our selected decade and perform an OLS regression for each league.

df_ols = df_reg.loc[(df_reg['YEAR_ID']>=2010) & (df_reg['YEAR_ID']<=2019)]
cols = df_ols.groupby(['LEAGUE']).apply(ols_coef, xcol=['lnPSPA'], ycol=['lnWI'])
cols = cols.droplevel(1).sort_values(['λ']).reset_index().round(2)
cols

| LEAGUE | adj. | adj. se | λ | λ se | nobs | adj. R² |
|--------|------|---------|-------|------|-------|---------|
| EPL | -0.15 | 0.01 | 1.24 | 0.02 | 200.0 | 0.94 |
| MLB | -0.00 | 0.01 | 1.78 | 0.04 | 300.0 | 0.89 |
| NHL | 0.24 | 0.01 | 2.06 | 0.04 | 302.0 | 0.91 |
| SUP | -0.03 | 0.03 | 2.60 | 0.09 | 154.0 | 0.84 |
| NFL | 0.02 | 0.02 | 2.84 | 0.08 | 319.0 | 0.81 |
| LAX | -0.02 | 0.01 | 3.17 | 0.05 | 655.0 | 0.84 |
| AFL | -0.02 | 0.02 | 3.57 | 0.09 | 177.0 | 0.90 |
| IPL | -0.02 | 0.04 | 6.79 | 0.65 | 84.0 | 0.57 |
| NBA | -0.00 | 0.01 | 14.32 | 0.21 | 300.0 | 0.94 |

Let’s record the OLS results, just as we did with the regression data, into an ArcticDB library, for later analysis or sharing with colleagues. Since the library has already been created, we simply save the table.

analysis.write('ols', cols)
VersionedItem(symbol='ols', library='analysis', data=n/a, version=2, metadata=None
             , host='LMDB(path=sports)', timestamp=1738255232379464601)

The return value shows that the data is automatically versioned in ArcticDB and includes a timestamp for each version. This is extremely useful for tracking the fit changes when we make edits.

The λ estimates and errors are shown in Figure 3a. As we’ve noted, the EPL and NBA sit at the extremes. The model fits all sports well, with statistically significant coefficients. The confidence band around our estimate for the IPL is wider (the adjusted R² is also 0.57 for the IPL, compared to a range of 0.81 to 0.94 for the other leagues). We believe the poorer fit for the IPL is due to the limited number of teams each season (8–10) and the truncated-game effect in Twenty20 cricket, discussed in Appendix A, which we can only partially adjust for.

Consistent with our back-of-the-envelope calculations, we observe a near-linear relationship between λ and IR in Figure 3b, confirming IR as a central, endemic feature of each sport and clarifying the drivers of λ differences across sports. Finally, restricting our analysis to sports with deeper data histories (sorry, no rugby, cricket, or lacrosse), we do see time-varying behavior in the fitted coefficient for all sports, supporting our use of rolling-window estimates (see Figure 3c). The balance between offense and defense evolves over time. Scoring in the NFL, for example, has been on a consistent uptrend since the league’s founding, while NHL scoring peaked in the 1990s, declined, and has recently begun to rebound. The impact of the 24-second shot clock is clearly evident in the early history of our NBA data.

Figure 3a. Estimates of λ by league, 2010–2019 seasons


Source: Various (listed in Appendix) as at January 2025. Key: Australian Football League (AFL), English Premier League (EPL), Indian Premier League (IPL), NCAA Men’s Division I Lacrosse (LAX), Major League Baseball (MLB), National Basketball Association (NBA), National Football League (NFL), National Hockey League (NHL), Super Rugby Pacific (SUP).

Note: Error bars represent ±2 standard errors around the coefficient estimate. N is the number of team-seasons used to estimate the model.

Figure 3b. Estimates of λ versus league information ratio, 2010–2019 seasons


Source: Various (listed in Appendix) as at January 2025. Key: Australian Football League (AFL), English Premier League (EPL), Indian Premier League (IPL), NCAA Men’s Division I Lacrosse (LAX), Major League Baseball (MLB), National Basketball Association (NBA), National Football League (NFL), National Hockey League (NHL), Super Rugby Pacific (SUP)

Figure 3c. Rolling 10-year estimates of λ, all available seasons through 2019


Source: Various (listed in Appendix) as at January 2025. Key: Australian Football League (AFL), English Premier League (EPL), Major League Baseball (MLB), National Basketball Association (NBA), National Football League (NFL), National Hockey League (NHL)

Note: Shaded regions represent ±2 standard errors around the coefficient estimate.

Parting thoughts

Hopefully, by looking across sports and time and with a little mathematics, we have provided some intuition behind James’ formula, which lies at the center of modern sports analysis. While the formula has empirical roots in just one sport, baseball, it has proven to be a robust model working across very different scoring distributions.

With ArcticDB we have successfully kept track of what we are doing with automatically versioned and shareable input and output data — all without any of the typical database operation overhead, such as creating tables or managing schemas. ArcticDB is also extremely fast; with the amount of data here (approx. 853,000 rows) the load times are imperceptible. Please check it out at https://github.com/man-group/ArcticDB/.

Appendix A – Data sources and notes

Australian Football League (AFL) – The league was known as the Victorian Football League (VFL) prior to 1990. Our first season of data is 1897. Source: https://afltables.com.

English Premier League (EPL) – We use the EPL labeling loosely since we combine EPL history with data from the top division of the English Football League prior to the EPL’s first season in 1992–1993. Our first season of data is 1888–1889, and we exclude the 1939–1940 partial season. Source: James P. Curley engsoccerdata: English Soccer Data 1871–2019 (https://github.com/jalapic/engsoccerdata).

Indian Premier League (IPL) – Our earliest data starts with the 2008 IPL inaugural season. Source: https://cricsheet.org/. In Twenty20 cricket, teams play just two innings, which can obfuscate the ‘true’ scoring capability of the team batting last and the ‘true’ run-prevention skill of the team bowling last: as soon as the team batting last scores one more run than its opponent, the game is immediately halted. We partially adjust for this truncation effect by grossing up the second team’s runs by the remaining ‘resources’ left when the game ends. Remaining resources are estimated using the Duckworth–Lewis method, which accounts for the number of overs left to be played and the number of wickets in hand for the batting team. See Senevirathne and Manage, ‘Predicting the winning percentage of limited-overs cricket using the Pythagorean formula,’ Journal of Sports Analytics, vol. 7, no. 3, pp. 169–183, 2021, and Bhattacharya, Gill, and Swartz, ‘Duckworth–Lewis and Twenty20 cricket,’ Journal of the Operational Research Society, vol. 62, no. 11, 2011.

NCAA Men’s Division I Lacrosse (LAX) – This is the only non-professional league included in our sample. Our first season of data is 2010. Source: https://stats.ncaa.org/.

Major League Baseball (MLB) – The sport with the deepest history of data in our sample, starting in 1871. We use the game-by-game data provided by fivethirtyeight.com, which notably excludes any games played in the Negro Leagues (1920 to late 1940s) and includes games played in the National Association (1871–1875). Source: fivethirtyeight MLB ELO data (https://data.fivethirtyeight.com/).

National Basketball Association (NBA) – Our data begins with the 1946–1947 season of the Basketball Association of America (BAA) and includes games played in the American Basketball Association (ABA) from 1967–1968 to 1975–1976. Source: fivethirtyeight NBA ELO data (https://data.fivethirtyeight.com/).

National Football League (NFL) – Our data begins with the 1922 season and includes games played by the All-America Football Conference (AAFC) from 1946–1949 and the American Football League (AFL) from 1960–1969. Source: fivethirtyeight NFL ELO data (https://data.fivethirtyeight.com/).

National Hockey League (NHL) – While our data begins with the 1917–1918 season, the league was extremely concentrated, with as few as 3 and no more than 6 teams per year through the 1966–1967 season. After that season, the league expanded initially to 12 teams and then incrementally to a total of 31 teams by the 2019 season. Source: fivethirtyeight NHL ELO data (https://data.fivethirtyeight.com/).

Super Rugby Pacific (SUP) – Our data begins with the 1996 season. Source: pick & go’s Super Rugby Database (http://www.lassen.co.nz/s14tab.php).

Appendix B – λ is approximately linear in IR

All teams are equally skilled in the league and play N games each per season.

For each team i and game n, assume points scored PSi,n and points allowed PAi,n are i.i.d. with mean μ and variance σ².

{PS}_i=\sum_{n=1}^{N}{PS}_{i,n}
{PA}_i=\sum_{n=1}^{N}{PA}_{i,n}
Pr\left({PS}_{i,n}>{PA}_{i,n}\right)=Pr\left({PS}_{i,n}<{PA}_{i,n}\right)=p=0.5
W_{i,n} = \begin{cases} 1, & PS_{i,n} > PA_{i,n} \\ 0, & PS_{i,n} \leq PA_{i,n} \end{cases}
W_i=\sum_{n=1}^{N}W_{i,n}
L_i=N-W_i

Expected points scored/allowed and wins for each team over the course of a full season of N games are straightforward:

\mathbb{E}\left[{PS}_i\right]=\mathbb{E}\left[{PA}_i\right]=\mu\bullet N
\mathbb{E}\left[W_i\right]=\mathbb{E}\left[L_i\right]=0.5\bullet N

The variance of points scored and allowed is simply the per-game variance multiplied by the number of games (recall that all sampled points are assumed independent). The variance of wins follows from the well-known result for the variance of a sum of independent Bernoulli random variables. A Bernoulli random variable equals 1 with probability p and 0 with probability 1 − p, and has variance p × (1 − p), which in our case is 0.5 × 0.5 = 0.25 per game.

VAR\left[{PS}_i\right]=VAR\left[{PA}_i\right]=\sigma^2\bullet N
VAR\left[W_i\right]=VAR\left[L_i\right]=0.25\bullet N
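Both season-level variance formulas are easy to sanity-check numerically. The sketch below is a toy simulation of many evenly matched seasons (μ = 100, σ = 10, and N = 100 are arbitrary illustrative values, and scores are assumed normal) comparing the sample variances to σ²·N and 0.25·N:

```python
import random
import statistics

random.seed(42)
N, seasons = 100, 10_000          # games per season, simulated seasons
mu, sigma = 100.0, 10.0           # per-game scoring mean and standard deviation

ps_totals, win_totals = [], []
for _ in range(seasons):
    # Each game: one (points scored, points allowed) pair for an evenly matched team
    games = [(random.gauss(mu, sigma), random.gauss(mu, sigma)) for _ in range(N)]
    ps_totals.append(sum(ps for ps, _ in games))
    win_totals.append(sum(1 for ps, pa in games if ps > pa))

print(statistics.variance(ps_totals))   # ~ sigma**2 * N = 10_000
print(statistics.variance(win_totals))  # ~ 0.25 * N = 25
```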

The covariance between the difference in points scored and allowed and wins for any team-game is:

COV\left[{PS}_{i,n}-{PA}_{i,n},W_{i,n}\right]=\mathbb{E}\left[\left({PS}_{i,n}-{PA}_{i,n}\right)\bullet W_{i,n}\right]-\mathbb{E}\left[{PS}_{i,n}-{PA}_{i,n}\right]\bullet\mathbb{E}\left[W_{i,n}\right]
=\mathbb{E}\left[\left({PS}_{i,n}-{PA}_{i,n}\right)\bullet W_{i,n}\right]
=\mathbb{E}\left[{PS}_{i,n}-{PA}_{i,n}|W_{i,n}=1\right]\bullet p
=\mathbb{E}\left[{PS}_{i,n}-{PA}_{i,n}|{PS}_{i,n}-{PA}_{i,n}>0\right]\bullet0.5

Now consider any two i.i.d. random variables, X and Y, with the same mean and variance. Their difference X − Y follows a symmetric distribution centered at zero, so we can write:

\mathbb{E}\left[X-Y|X>Y\right]=\mathbb{E}\left[|X-Y|\right]=2\bullet C\bullet\sigma

where C > 0 is a constant that depends on the underlying distributions of PS and PA.
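For example, if per-game scores are normal, then X − Y ~ N(0, 2σ²) and E[|X − Y|] = 2σ/√π, so C = 1/√π ≈ 0.564. A Monte Carlo sketch (normality and the parameter values are assumptions for illustration only):

```python
import math
import random

random.seed(0)
mu, sigma, trials = 100.0, 10.0, 200_000

# Monte Carlo estimate of E|X - Y| for i.i.d. normal per-game scores
mean_abs_diff = sum(
    abs(random.gauss(mu, sigma) - random.gauss(mu, sigma)) for _ in range(trials)
) / trials

C = mean_abs_diff / (2 * sigma)
print(round(C, 3))   # close to 1 / sqrt(pi), about 0.564
```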

So,

COV\left[{PS}_{i,n}-{PA}_{i,n},W_{i,n}\right]=C\bullet\sigma

Note that by the Cauchy–Schwarz inequality:

\left|COV\left[{PS}_{i,n}-{PA}_{i,n},W_{i,n}\right]\right|\le\sqrt{VAR\left({PS}_{i,n}-{PA}_{i,n}\right)\bullet VAR\left(W_{i,n}\right)}

Which implies:

0<C\le\sqrt{\frac{1}{2}}

The covariance of the point differential with wins for the entire season is simply the per-game covariance times the number of games:

COV\left[{PS}_i-{PA}_i,W_i\right]=C\bullet\sigma\bullet N

Of course, what we are really interested in are the variances and covariance of the Y and X variables in (4). We can derive reasonable estimates for both using a technique called the Delta Method, which relies on a Taylor series expansion centered at the variable's mean to approximate a function of a random variable locally.

For variance:

VAR\left[g\left(X\right)\right]\approx\left(g^\prime\left(\mu_X\right)\right)^2\bullet VAR\left[X\right]

For covariance:

COV\left[g\left(X\right),h\left(Y\right)\right]\approx g^\prime\left(\mu_X\right)\bullet h^\prime\left(\mu_Y\right)\bullet COV\left[X,Y\right]-\frac{1}{4}\bullet g^{\prime\prime}\left(\mu_X\right)\bullet h^{\prime\prime}\left(\mu_Y\right)\bullet VAR\left[X\right]\bullet VAR\left[Y\right]

Applying the variance estimate:

VAR\left[\log{\left(\frac{{PS}_i}{{PA}_i}\right)}\right]=VAR\left[\log{\left({PS}_i\right)}\right]+VAR\left[\log{\left({PA}_i\right)}\right]\approx\frac{VAR\left[{PS}_i\right]}{\left(\mathbb{E}\left[{PS}_i\right]\right)^2}+\frac{VAR\left[{PA}_i\right]}{\left(\mathbb{E}\left[{PA}_i\right]\right)^2}=\frac{2}{N}\bullet{IR}^{-2}
VAR\left[\log{\left(\frac{W_i}{L_i}\right)}\right]=VAR\left[\log{\left(\frac{W_i}{N-W_i}\right)}\right]\approx\left(\frac{N}{\mathbb{E}\left[W_i\right]\bullet\left(N-\mathbb{E}\left[W_i\right]\right)}\right)^2\bullet VAR\left[W_i\right]=\frac{4}{N}
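Both Delta Method approximations can be checked by simulation. In the sketch below, per-game scores are assumed normal with μ = 100 and σ = 10 (so IR = μ/σ = 10), and season totals are drawn directly as N(μ·N, σ²·N) since they are sums of N i.i.d. draws:

```python
import math
import random
import statistics

random.seed(7)
N, seasons = 100, 20_000          # games per season, simulated seasons
mu, sigma = 100.0, 10.0           # per-game mean and std dev, so IR = mu / sigma = 10

log_score_ratios, log_odds = [], []
for _ in range(seasons):
    # Season totals are sums of N i.i.d. draws, so simulate them as N(mu*N, sigma^2*N)
    ps = random.gauss(mu * N, sigma * math.sqrt(N))
    pa = random.gauss(mu * N, sigma * math.sqrt(N))
    log_score_ratios.append(math.log(ps / pa))
    # Wins in an evenly matched season are Binomial(N, 0.5)
    w = sum(1 for _ in range(N) if random.random() < 0.5)
    log_odds.append(math.log(w / (N - w)))

print(statistics.variance(log_score_ratios))  # ~ (2 / N) * IR**-2 = 2e-4
print(statistics.variance(log_odds))          # ~ 4 / N = 0.04
```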

Applying the covariance estimate, taking advantage of symmetry due to our parity assumptions:

COV\left[\log{\left(\frac{W_i}{L_i}\right)},\log{\left(\frac{{PS}_i}{{PA}_i}\right)}\right]=2\bullet COV\left[\log{\left(W_i\right)},\log{\left({PS}_i\right)}\right]-2\bullet COV\left[\log{\left(W_i\right)},\log{\left({PA}_i\right)}\right]
COV\left[\log{\left(\frac{W_i}{L_i}\right)},\log{\left(\frac{{PS}_i}{{PA}_i}\right)}\right]\approx2\bullet\frac{COV\left[W_i,{PS}_i\right]}{\mathbb{E}\left[W_i\right]\bullet\mathbb{E}\left[{PS}_i\right]}-2\bullet\frac{COV\left[W_i,{PA}_i\right]}{\mathbb{E}\left[W_i\right]\bullet\mathbb{E}\left[{PA}_i\right]}
COV\left[\log{\left(\frac{W_i}{L_i}\right)},\log{\left(\frac{{PS}_i}{{PA}_i}\right)}\right]\approx2\bullet\frac{COV\left[{PS}_i-{PA}_i,W_i\right]}{\mathbb{E}\left[W_i\right]\bullet\mathbb{E}\left[{PS}_i\right]}=2\bullet\frac{C\bullet\sigma\bullet N}{\mu\bullet N\bullet0.5\bullet N}=\frac{4\bullet C}{N}\bullet{IR}^{-1}

Since λ is the slope from regressing log(W_i/L_i) on log(PS_i/PA_i), dividing the covariance by the variance gives:

\lambda\approx\frac{COV\left[\log{\left(\frac{W_i}{L_i}\right)},\log{\left(\frac{{PS}_i}{{PA}_i}\right)}\right]}{VAR\left[\log{\left(\frac{{PS}_i}{{PA}_i}\right)}\right]}=\frac{\frac{4\bullet C}{N}\bullet{IR}^{-1}}{\frac{2}{N}\bullet{IR}^{-2}}=2\bullet C\bullet IR

which is linear in IR.
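Combining the covariance above with VAR[log(PS_i/PA_i)] = (2/N)·IR⁻² from earlier, the implied regression slope of log(W/L) on log(PS/PA) is λ ≈ 2·C·IR. The sketch below checks this by simulation, assuming normal per-game scores (so C = 1/√π ≈ 0.564; the team count and parameters are illustrative):

```python
import math
import random

random.seed(11)
teams, N = 4000, 200              # simulated team-seasons, games per season
mu, sigma = 100.0, 10.0           # per-game scoring parameters, so IR = mu / sigma = 10

xs, ys = [], []                   # x = log(PS/PA), y = log(W/L) per team-season
for _ in range(teams):
    ps = pa = wins = 0.0
    for _ in range(N):
        s = random.gauss(mu, sigma)
        a = random.gauss(mu, sigma)
        ps += s
        pa += a
        wins += 1.0 if s > a else 0.0
    xs.append(math.log(ps / pa))
    ys.append(math.log(wins / (N - wins)))

# OLS slope of y on x, i.e. the estimated Pythagorean exponent lambda
mx, my = sum(xs) / teams, sum(ys) / teams
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / teams
var = sum((x - mx) ** 2 for x in xs) / teams
slope = cov / var
print(round(slope, 2))            # ~ 2 * (1 / sqrt(pi)) * 10, about 11.3
```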

Authors:

Greg Bond and Maxim Morozov, for ArcticDB

Appendix D – Data Sources & Notes

Australian Football League (AFL) – The league was known as the Victorian Football League (VFL) prior to 1990. Our first season of data is 1897. Source: https://afltables.com.

English Premier League (EPL) – We use the EPL labeling loosely since we combine EPL history with data from the top division of the English Football League prior to the EPL’s first season in 1992-93. Our first season of data is 1888-89, and we exclude the 1939-40 partial season. Source: James P. Curley engsoccerdata: English Soccer Data 1871-2019 (https://github.com/jalapic/engsoccerdata).

Indian Premier League (IPL) – Our earliest data starts with the 2008 IPL inaugural season. Source: https://cricsheet.org/. In Twenty20 cricket, teams play just two innings, which can obfuscate the "true" scoring capability of the team batting last and the "true" run prevention skill of the team bowling last. As soon as the team batting last scores one more run than its opponent, the game ends immediately. We partially adjust for this truncation effect by grossing up the second team's runs by the remaining "resources" left when the game ends. Remaining resources are estimated using the Duckworth–Lewis method, which accounts for the number of overs left to be played and the number of wickets the batting team has in hand. See Senevirathne and Manage, "Predicting the winning percentage of limited-overs cricket using the Pythagorean formula," Journal of Sports Analytics, vol. 7, no. 3, pp. 169-183, 2021, and Bhattacharya, Gill, and Swartz, "Duckworth–Lewis and Twenty20 cricket," Journal of the Operational Research Society, vol. 62, no. 11, 2011.

NCAA Men’s Division I Lacrosse (LAX) – This is the only non-professional league included in our sample. Our first season of data is 2010. Source: https://stats.ncaa.org/.

Major League Baseball (MLB) – The sport with the deepest history of data in our sample, starting in 1871. We use the game-by-game data provided by fivethirtyeight.com, which notably excludes any games played in the Negro Leagues (1920 to late 1940s) and includes games played in the National Association (1871-1875). Source: fivethirtyeight MLB ELO data (https://data.fivethirtyeight.com/).

National Basketball Association (NBA) – Our data begins with the 1946-1947 season of the Basketball Association of America (BAA) and includes games played in the American Basketball Association (ABA) from 1967-68 to 1975-76. Source: fivethirtyeight NBA ELO data (https://data.fivethirtyeight.com/).

National Football League (NFL) – Our data begins with the 1922 season and includes games played by the All-America Football Conference (AAFC) from 1946-1949 and the American Football League (AFL) from 1960-1969. Source: fivethirtyeight NFL ELO data (https://data.fivethirtyeight.com/).

National Hockey League (NHL) – While our data begins with the 1917-18 season, the league was extremely concentrated, with as few as 3 and no more than 6 teams per year through the 1966-67 season. After that season, the league expanded initially to 12 teams and then incrementally to a total of 31 teams by the 2019 season. Source: fivethirtyeight NHL ELO data (https://data.fivethirtyeight.com/).

Super Rugby Pacific (SUP) – Our data begins with the 1996 season. Source: pick & go’s Super Rugby Database (http://www.lassen.co.nz/s14tab.php).

Appendix E

Preparing the data for loading into ArcticDB. ArcticDB is a DataFrame database optimised for time-series data in quantitative finance. In this example we show how to prepare the game-level data and load it into ArcticDB.

import arcticdb as adb
import pandas as pd

# dataFile holds the path to the prepared CSV of game-level results
df = pd.read_csv(dataFile)

# Drop the CSV's exported index column and standardise the scoring column names
df = df.drop(columns="Unnamed: 0")
df = df.rename(columns={'FREQ_RUNS_S': 'PS',
                        'FREQ_RUNS_A': 'PA'})

# Order rows for per-league, per-season, per-team inspection
df = df.sort_values(['LEAGUE', 'YEAR_ID', 'TEAM_ID'])

| YEAR_ID | TEAM_ID | OPP_TEAM_ID | PS | PA | date | home | wi | LEAGUE |
|---------|---------|-----------------|----|-----|------------|------|----|--------|
| 1897 | Carlton | Fitzroy | 16 | 49 | 1897-05-08 | 0 | 0 | AFL |
| 1897 | Carlton | South Melbourne | 36 | 40 | 1897-05-15 | 0 | 0 | AFL |
| 1897 | Carlton | Essendon | 41 | 78 | 1897-05-24 | 0 | 0 | AFL |
| 1897 | Carlton | Geelong | 22 | 44 | 1897-05-29 | 0 | 0 | AFL |
| 1897 | Carlton | Melbourne | 26 | 107 | 1897-06-05 | 0 | 0 | AFL |

# Connect to a local LMDB-backed ArcticDB instance and store one symbol per league
arctic = adb.Arctic('lmdb://sports')
lib = arctic.get_library('leagues', create_if_missing=True)
for league in df['LEAGUE'].unique():
    lib.write(league, df[df['LEAGUE'] == league])
lib.list_symbols()

['EPL', 'SUP', 'NHL', 'MLB', 'AFL', 'LAX', 'NBA', 'IPL', 'NFL']
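With each league stored as its own symbol, season totals of PS and PA plug straight into the Pythagorean formula from the introduction. A minimal sketch in plain Python (pythagorean_win_pct is our own helper name; λ = 14.3 is the article's NBA estimate):

```python
def pythagorean_win_pct(ps: float, pa: float, lam: float) -> float:
    """Expected winning percentage given points scored (ps) and points allowed (pa)."""
    return ps**lam / (ps**lam + pa**lam)

# 2020-21 Milwaukee Bucks: 8,649 points scored, 8,225 allowed, NBA lambda ~ 14.3
print(round(pythagorean_win_pct(8649, 8225, 14.3), 3))   # ~0.672
```

With PS = PA the formula returns exactly 0.5, matching the '0.500' benchmark in the introduction.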