Statistics
Overview
statistics
is a library that consolidates three well-known statistical libraries:
- The statistics library from numerical-utilities
- Larry Hunter’s cl-statistics
- Gary Warren King’s cl-mathstats
There are a few challenges in using these as independent systems on projects though:
- There is a good amount of overlap. Everyone implements, for example
mean
(as does alexandria, cephes, and almost every other library out there). - In the case of
mean
,variance
, etc., the functions deal only with samples, not distributions
This library brings these three systems under a single ‘umbrella’, and adds a few missing ones. To do this we use Tim Bradshaw’s conduit-packages. For the few functions that require dispatch on type (sample data vs. a distribution), we use typecase
because of its simplicity and not needing another system. There’s a slight performance hit here in the case of run-time determination of types, but until it’s a problem prefer it. Some alternatives considered for dispatch was https://github.com/pcostanza/filtered-functions.
nu-statistics
These functions cover sample moments in detail, and are accurate. They include up to forth moments, and are well suited to the work of an econometrist (and were written by one).
lh-statistics
These were written by Larry Hunter, based on the methods described in Bernard Rosner’s book, Fundamentals of Biostatistics 5th Edition, along with some from the CLASP system. They cover a wide range of statistical applications. Note that lh-statistics
uses lists and not vectors, so you’ll need to convert. To see what’s available see the statistics github repo.
gwk-statistics
These are from Gary Warren King, and also partially based on CLASP. It is well written, and the functions have excellent documentation. The major reason we don’t include it by default is because it uses an older ecosystem of libraries that duplicate more widely used system (for example, numerical utilities, alexandria). If you want to use these, you’ll need to uncomment the appropriate code in the ASDF and pkgdcl.lisp
files.
ls-statistics
These are considered the most complete, and they account for various types and dispatch properly.
Accuracy
LH and GWK statistics compute quantiles, CDF, PDF, etc. using routines from CLASP, that in turn are based on algorithms from Numerical Recipes. These are known to be accurate to only about four decimal places. This is probably accurate enough for many statistical problem, however should you need greater accuracy look at the distributions system. The computations there are based on special-functions, which has accuracy around 15 digits. Unfortunately documentation of distributions and the ‘wrapping’ of them here are incomplete, so you’ll need to know the pattern, e.g. pdf-gamma, cdf-gamma, etc., which is described in the link above.
Versions
Because this system is likely to change rapidly, we have adopted a system of versioning proposed in defpackage+. This is also the system alexandria
uses where a version number is appended to the API. So, statistics-1
is our current package name. statistics-2
will be the next and so on. If you don’t like these names, you can always change it locally using a package local nickname.
Dictionary
scale
scale
scale is generic function whose default method centers and/or scales the columns of a numeric matrix. This is neccessary when the units of measurement for your data differ. The scale
function is provided for this purpose.
(defun standard-scale (x &key center scale)
Returns
The function returns three values:
- (x - x̄) / s where X̄ is the mean and S is the standard deviation
- the
center
value used - the
scale
value used
Parameters
CENTRE
value to center on.(mean x)
by defaultSCALE
value to scale by.(sd x)
by default
If center
or scale
is nil
, do not scale or center respectively.
Example: Scale the values in a vector
(defparameter x #(11 12 13 24 25 16 17 18 19))
(scale x)
; => #(-1.2585064099313854d0
-1.0562464511924128d0
-0.85398649245344d0
1.3708730536752591d0
1.5731330124142318d0
-0.24720661623652215d0
-0.044946657497549475d0
0.15731330124142318d0
0.3595732599803958d0)
Note that the scaled vector contains negative values. This is expected scaling a vector. Let’s try the same thing but without scaling:
(scale x :scale nil)
#(-56/9 -47/9 -38/9 61/9 70/9 -11/9 -2/9 7/9 16/9)
155/9
1
Note the scaling factor was set to 1, meaning no scaling was performed, only centering (division by zero returns the original value).
Example: Scale the columns of an array
(defparameter y #2A((1 2 3 4 5 6 7 8 9)
(10 20 30 40 50 60 70 80 90)))
(margin #'scale y 1) ; 1 splits along columns, 0 splits along rows
#(#(-1.4605934866804429d0 -1.0954451150103321d0 -0.7302967433402214d0 -0.3651483716701107d0 0.0d0 0.3651483716701107d0 0.7302967433402214d0 1.0954451150103321d0 1.4605934866804429d0)
#(-1.4605934866804429d0 -1.0954451150103321d0 -0.7302967433402214d0 -0.3651483716701107d0 0.0d0 0.3651483716701107d0 0.7302967433402214d0 1.0954451150103321d0 1.4605934866804429d0))
Example: Scale the variables of a data frame
LS-USER> (remove-column! iris 'species) ;species is a categorical variable
#<DATA-FRAME (150 observations of 4 variables)
Edgar Anderson's Iris Data>
LS-USER> (head iris)
;; SEPAL-LENGTH SEPAL-WIDTH PETAL-LENGTH PETAL-WIDTH
;; 0 5.1 3.5 1.4 0.2
;; 1 4.9 3.0 1.4 0.2
;; 2 4.7 3.2 1.3 0.2
;; 3 4.6 3.1 1.5 0.2
;; 4 5.0 3.6 1.4 0.2
;; 5 5.4 3.9 1.7 0.4
NIL
LS-USER> (map-columns iris #'scale)
#<DATA-FRAME (150 observations of 4 variables)>
LS-USER> (head *)
;; SEPAL-LENGTH SEPAL-WIDTH PETAL-LENGTH PETAL-WIDTH
;; 0 -0.8977 1.0156 -1.3358 -1.3111
;; 1 -1.1392 -0.1315 -1.3358 -1.3111
;; 2 -1.3807 0.3273 -1.3924 -1.3111
;; 3 -1.5015 0.0979 -1.2791 -1.3111
;; 4 -1.0184 1.2450 -1.3358 -1.3111
;; 5 -0.5354 1.9333 -1.1658 -1.0487
Feedback
Was this page helpful?
Glad to hear it! Please tell us how we can improve.
Sorry to hear that. Please tell us how we can improve.