Documentation

This section contains user documentation for Lisp-Stat. It is designed for technical users who wish to understand how to use Lisp-Stat to perform statistical analysis.

Other content such as marketing material, case studies, and community updates are in the About and Community pages.

1 - What is Lisp-Stat?

A statistical analysis environment written in Common Lisp

Lisp-Stat is a domain specific language (DSL) for statistical analysis and machine learning. It is targeted at statistics practitioners with little or no experience in programming.

Lisp has a history of being deployed for domain experts to use, and it’s a great language for beginners; the Symbolics Graphics Division wrote the software used by graphic artists to develop scenes in several films prior to the rise of Pixar. One of the first statistical systems developed, XLisp-Stat, was a contemporary of R until its primary author joined the ‘R Core’ group.

Raisons d’être

There are several reasons to prefer Lisp-Stat over R or Python. The first is that it is fast. Lisp compilers produce native executable code that is nearly as fast as C. The Common Lisp numerical tower has support for rational numbers, which is a natural way to work with samples. For example, an experiment may produce 11598 positives out of a sample of 25000. With exact rational arithmetic, there is no need to force everything to a float; the value is just what the experiment said: 11598/25000.
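
For instance, at the REPL this is plain Common Lisp; no Lisp-Stat machinery is needed:

(/ 11598 25000)         ; => 5799/12500, the exact proportion
(float (/ 11598 25000)) ; => 0.46392, only if you explicitly ask for a float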

Probably the most important reason, though, is given in the paper Lisp as a Base for a Statistical Computing System by Ross Ihaka, one of the originators of the R language, about the deficiencies in R, including the inability to compile to machine code (among other issues). The same is true of Python. In that paper he argues for Lisp as a replacement for R.

Not only does Common Lisp provide a compiler that produces machine code, it has native threading, a rich ecosystem of code libraries, and a history of industrial deployments, including:

  • Credit card authorisation at Amex (Authorizer’s Assistant)
  • US DoD logistics (and more that we do not know of)
  • CIA and NSA are big users based on Lisp sales
  • DWave, HSL and Rigetti use lisp for programming their quantum computers
  • Apple’s Siri was originally written in Lisp
  • Amazon got started with Lisp & C; so did Y-combinator
  • Google’s flight search engine is written in Common Lisp
  • AT&T used a stripped down version of Symbolics Lisp to process CDRs in the first IP telephony switches

If Lisp is good enough for those applications, it very likely can meet the needs of an enterprise deployment today.

Relationship to XLISP-Stat

Although inspired by Tierney’s XLisp-Stat, this is a reboot in Common Lisp. XLisp-Stat code is unlikely to run except in trivial cases, but existing XLisp-Stat libraries can be ported with the assistance of the XLS-Compat system.

In developing the system, I wanted to avoid the lisp curse, so selected the best existing libraries where possible, developed what didn’t exist, and documented them all in an attempt to make the learning curve a gentle slope.

Library Consolidation

Eventually, we hope for a consolidation of lisp statistical libraries in order to achieve a critical mass in the domain. The reasons for moving in this direction were described in an article some years ago entitled Consolidating Common Lisp Libraries. Whilst historical precedent is against us, that does not mean we won’t try.

Core Systems

Lisp-Stat is composed of several systems (projects), each independently useful and brought together under the Lisp-Stat umbrella. Dependencies between systems have been minimised to the extent possible so you can use them individually without importing all of Lisp-Stat.

Data-Frame

A data frame is a data structure conceptually similar to an R data frame. It provides column-centric storage for data sets where each named column contains the values for one variable, and each row contains one set of observations. For data frames, we use the ‘tibble’ from the tidyverse as inspiration for functionality.

Data frames can contain values of any type. If desired, additional attributes, such as the element type (float, say), the unit and other information may be attached to the variable for convenience or efficiency. For example you could attach a unit, say m/s (meters per second), to ensure that mathematical operations on that variable remain dimensionally consistent (though the unit may change).

DFIO

The Data Frame I/O system provides input and output operations for data frames. A data frame may be written to and read from files, strings or streams, including network streams or relational databases.

Select

Select is a facility for selecting portions of sequences or arrays. It provides:

  • An API for making selections (elements selected by the Cartesian product of vectors of subscripts for each axis) of array-like objects. The most important function is select. Unless you want to define additional methods for select, this is pretty much all you need from this library.
  • An extensible DSL for selecting a subset of valid subscripts. This is useful if, for example, you want to resolve column names in a data frame in your implementation of select, or to implement filtering based on row values. A short sketch follows this list.
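
A minimal sketch of what a selection looks like. The range operator and the use of t to select everything along an axis are assumed here from the library’s cl-slice heritage; treat the exact forms as illustrative rather than definitive:

(select #(10 20 30 40 50) (range 1 3)) ; => #(20 30), subscripts 1 and 2
(select #2A((1 2 3)
            (4 5 6))
        t 1)                           ; => #(2 5), every row, column 1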

Array Operations

This library is a collection of functions and macros for manipulating Common Lisp arrays and performing numerical calculations with them. The library provides shorthand codes for frequently used operations, displaced array functions, indexing, transformations, generation, permutation and reduction of columns. Array operations may also be applied to data frames, and data frames may be converted to/from arrays.

Special Functions

This library implements numerical special functions in Common Lisp with a focus on high accuracy double-float calculations. These functions are the basis for the statistical distribution functions, e.g. gamma, beta, etc.

Numerical Utilities

Numerical Utilities is the base system that most others depend on. It is a collection of packages providing:

  • num= and other comparison operators for floats (see the sketch after this list)
  • simple arithmetic functions, like sum and l2norm
  • element-wise operations for arrays and vectors
  • intervals
  • special matrices and shorthand for their input
  • sample statistics
  • Chebyshev polynomials
  • quadratures
  • univariate root finding
  • Horner’s, Simpson’s and other functions for numerical analysis
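
A hedged sketch of two of these utilities. The nu package nickname and the exact return values are assumptions, not verified API; check the numerical-utilities documentation for the authoritative names:

;; Sketch only: assumes the numerical-utilities package exports NUM= and SUM
;; under the nickname NU; adjust package prefixes to your installation.
(nu:num= 1.0 1.0000001) ; => T within the default tolerance (assumed)
(nu:sum #(1 2 3 4))     ; => 10 (assumed behaviour of the SUM utility)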

Lisp-Stat

This is the top level system that uses the other packages to create a statistical computing environment. It is also the location for the ‘unified’ interface, where the holes are plugged with third party packages. For example, cl-mathstats contains functionality not yet in Lisp-Stat; however, its architecture does not lend itself well to incorporation via an ASDF depends-on, so as we consolidate the libraries, missing functionality will be placed in the Lisp-Stat system. Eventually parts of numerical-utilities, especially the statistics functions, will be relocated here.

IDEs

Emacs

Emacs, with the slime package, is the most tested IDE and the one the authors use. If you are using one of the starter lisp packages mentioned in the getting started section, this will have been installed for you. Otherwise, slime/swank is available in quicklisp.

Jupyter Lab

Jupyter Lab and common-lisp-jupyter provide an environment similar to RStudio for working with data and performing analysis. The Lisp-Stat analytics examples use Jupyter Lab to illustrate worked examples based on the book, Introduction to the Practice of Statistics.

Clozure Common Lisp

On MacOS, Clozure Common Lisp provides a graphical environment with a built-in editor and a menu-driven system for working with Lisp-Stat.

Roadmap

Generally, we are prioritising these systems for development:

  1. Data Frame
  2. Plotting
  3. Special Functions & Distributions

In terms of priority, 1 and 2 are rated equally, and special functions/distributions lower, because we have a few options for them, such as a CFFI wrapper for libRmath or less accurate Common Lisp implementations. In addition, the knowledge of numerical methods required for accurate implementations is in shorter supply.

For the most part, implementation priority is determined by the features required when working through the Lisp-Stat examples and the basic tutorial. Being able to execute all the examples in these two documents is the first MVP milestone. If you see something in one of these documents that does not work yet, it will be a good starter issue for a contribution (you’ll have to look at the source for the document, as functionality that isn’t implemented will have been commented out).

Acknowledgements

Tamas Papp was the original author of many of these libraries. Starting with relatively clean, working, code that solves real-world problems was a great start to the development of Lisp-Stat.

What next?

Get Started
Examples
R Users

2 - Getting Started

From install to plotting in five minutes

Prerequisites

  • SBCL or CCL Common Lisp
  • MacOS or Windows 10
  • Quicklisp
  • Chrome

Load & Configure

First load Lisp-Stat, plotting libraries and data and configure the environment.

Lisp-Stat

(ql:quickload :lisp-stat)
(in-package :ls-user)

Vega-Lite

(ql:quickload :plot/vglt)

Data

(define-data-frame cars
  (vglt:vl-to-df
    (dex:get
	  "https://raw.githubusercontent.com/vega/vega-datasets/master/data/cars.json"
	  :want-stream t)))

View

Print the data frame (showing the first 25 rows by default)

(pprint cars)
;; ORIGIN YEAR       ACCELERATION WEIGHT_IN_LBS HORSEPOWER DISPLACEMENT CYLINDERS MILES_PER_GALLON NAME
;; USA    1970-01-01         12.0          3504        130        307.0         8             18.0 chevrolet chevelle malibu
;; USA    1970-01-01         11.5          3693        165        350.0         8             15.0 buick skylark 320
;; USA    1970-01-01         11.0          3436        150        318.0         8             18.0 plymouth satellite
;; USA    1970-01-01         12.0          3433        150        304.0         8             16.0 amc rebel sst
;; USA    1970-01-01         10.5          3449        140        302.0         8             17.0 ford torino
;; USA    1970-01-01         10.0          4341        198        429.0         8             15.0 ford galaxie 500
;; USA    1970-01-01          9.0          4354        220        454.0         8             14.0 chevrolet impala
;; USA    1970-01-01          8.5          4312        215        440.0         8             14.0 plymouth fury iii
;; USA    1970-01-01         10.0          4425        225        455.0         8             14.0 pontiac catalina
;; USA    1970-01-01          8.5          3850        190        390.0         8             15.0 amc ambassador dpl
;; Europe 1970-01-01         17.5          3090        115        133.0         4 NIL              citroen ds-21 pallas
;; USA    1970-01-01         11.5          4142        165        350.0         8 NIL              chevrolet chevelle concours (sw)
;; USA    1970-01-01         11.0          4034        153        351.0         8 NIL              ford torino (sw)
;; USA    1970-01-01         10.5          4166        175        383.0         8 NIL              plymouth satellite (sw)
;; USA    1970-01-01         11.0          3850        175        360.0         8 NIL              amc rebel sst (sw)
;; USA    1970-01-01         10.0          3563        170        383.0         8             15.0 dodge challenger se
;; USA    1970-01-01          8.0          3609        160        340.0         8             14.0 plymouth 'cuda 340
;; USA    1970-01-01          8.0          3353        140        302.0         8 NIL              ford mustang boss 302
;; USA    1970-01-01          9.5          3761        150        400.0         8             15.0 chevrolet monte carlo
;; USA    1970-01-01         10.0          3086        225        455.0         8             14.0 buick estate wagon (sw)
;; Japan  1970-01-01         15.0          2372         95        113.0         4             24.0 toyota corona mark ii
;; USA    1970-01-01         15.5          2833         95        198.0         6             22.0 plymouth duster
;; USA    1970-01-01         15.5          2774         97        199.0         6             18.0 amc hornet
;; USA    1970-01-01         16.0          2587         85        200.0         6             21.0 ford maverick                 ..

Show the last few rows:

(tail cars)
;; ORIGIN YEAR       ACCELERATION WEIGHT_IN_LBS HORSEPOWER DISPLACEMENT CYLINDERS MILES_PER_GALLON NAME
;; USA    1982-01-01         17.3          2950         90          151         4               27 chevrolet camaro
;; USA    1982-01-01         15.6          2790         86          140         4               27 ford mustang gl
;; Europe 1982-01-01         24.6          2130         52           97         4               44 vw pickup
;; USA    1982-01-01         11.6          2295         84          135         4               32 dodge rampage
;; USA    1982-01-01         18.6          2625         79          120         4               28 ford ranger
;; USA    1982-01-01         19.4          2720         82          119         4               31 chevy s-10

Statistics

Look at a few statistics on the data set.

(mean cars:acceleration) ; => 15.5197
(summary cars)
  CARS:MILES_PER_GALLON
                        398 reals, min=9, q25=17.33333317438761d0,
                        q50=22.727271751923993d0, q75=29.14999923706055d0,
                        max=46.6d0;
                        8 (2%) x "NIL"
  CARS:CYLINDERS
                 207 (51%) x 4,
                 108 (27%) x 8,
                 84 (21%) x 6,
                 4 (1%) x 3,
                 3 (1%) x 5
  CARS:DISPLACEMENT
                    406 reals, min=68, q25=104.25, q50=147.92307,
                    q75=277.76923, max=455
  CARS:HORSEPOWER
                  400 reals, min=46, q25=75.77778, q50=94.33333, q75=129.57143,
                  max=230;
                  6 (1%) x "NIL"
  CARS:WEIGHT_IN_LBS
                     406 reals, min=1613, q25=2226, q50=2822.5, q75=3620,
                     max=5140
  CARS:ACCELERATION
                    406 reals, min=8, q25=13.674999999999999d0, q50=15.45d0,
                    q75=17.16666632692019d0, max=24.8d0
  CARS:YEAR
            61 (15%) x "1982-01-01",
            40 (10%) x "1973-01-01",
            36 (9%) x "1978-01-01",
            35 (9%) x "1970-01-01",
            34 (8%) x "1976-01-01",
            30 (7%) x "1975-01-01",
            29 (7%) x "1971-01-01",
            29 (7%) x "1979-01-01",
            29 (7%) x "1980-01-01",
            28 (7%) x "1972-01-01",
            28 (7%) x "1977-01-01",
            27 (7%) x "1974-01-01"
  CARS:ORIGIN
              254 (63%) x "USA", 79 (19%) x "Japan", 73 (18%) x "Europe">

Note: The car models, essentially the row names, have been removed from the summary.

Plot

Create a scatter plot specification with default values:

(defparameter cars-plot (vglt:scatter-plot cars "HORSEPOWER" "MILES_PER_GALLON"))

Render the plot:

(plot:plot-from-file (vglt:save-plot 'cars-plot))

2.1 - Installation

Automated and manual installation

New to Lisp

If you are a Lisp newbie and want to get started as fast as possible, then Portacle is probably your best option. Portacle is a multi-platform IDE for Common Lisp that includes Emacs, SBCL, Git and Quicklisp, all configured and ready to use.

If you are an existing emacs user, you can configure emacs for Common Lisp.

Users new to lisp should also consider going through the basic tutorial, which guides you step-by-step through the basics of working with Lisp as a statistics practitioner.

Experienced with Lisp

We assume an experienced user will have their own Emacs and lisp implementation and will want to install according to their own tastes and setup. The repo links you need are below, or you can install with quicklisp.

Prerequisites

All that is needed is an ANSI Common Lisp implementation. Development is done with CCL and SBCL. Other platforms should work, but have not been tested.

Installation

Automated install

The easiest way to install Lisp-Stat is with Quicklisp:

(ql:quickload :lisp-stat)

Manual install

If you want to modify Lisp-Stat you’ll need to retrieve the files from GitHub and place them in a directory that is known to Quicklisp. This long shell command will check out all the required systems:

cd ~/quicklisp/local-projects && \
git clone https://github.com/Lisp-Stat/data-frame.git && \
git clone https://github.com/Lisp-Stat/dfio.git && \
git clone https://github.com/Lisp-Stat/special-functions.git && \
git clone https://github.com/Lisp-Stat/numerical-utilities.git && \
git clone https://github.com/Lisp-Stat/documentation.git && \
git clone https://github.com/Lisp-Stat/plot.git && \
git clone https://github.com/Lisp-Stat/select.git && \
git clone https://github.com/Lisp-Stat/lisp-stat.git

The above assumes you have the default installation directories. Adjust accordingly if you have changed this. If Quicklisp claims it cannot find the systems, try this at the REPL:

(ql:register-local-projects)

Documentation

Lisp-Stat reference manuals are generated with the declt system. This produces high quality PDFs, markdown, HTML and Info output. The API reference manuals are available in HTML in the reference section of this website; PDF and Info files can be downloaded from each system’s docs/ directory.

You can install the Info manuals into the Emacs help system, which allows searching and browsing from within the editing environment. To do this, use the install-info command. As an example, on an MS Windows 10 machine with an MSYS2/Emacs installation:

install-info --add-once select.info /c/msys64/mingw64/share/info/dir

installs the select manual into a Lisp-Stat node at the top level of the info tree.

Try it out

Load Lisp-Stat:

(ql:quickload :lisp-stat)

Change to the Lisp-Stat user package:

(in-package :ls-user)

Load some data:

(load #P"LS:DATASETS;CAR-PRICES")

Find the sample mean and median:

(mean car-prices)
(median car-prices)

Next steps

Get Started
Examples
R Users

2.2 - Data Frame

Getting started with data frames

Load data

We will use one of the example data sets from R, mtcars, for these examples. First, load Lisp-Stat and the R data libraries, and switch into the Lisp-Stat package:

(ql:quickload :lisp-stat)
(ql:quickload :lisp-stat/rdata)
(in-package   :ls-user)

Now define the data frame, naming it mtcars:

(define-data-frame mtcars
	(read-csv (rdata:rdata 'rdata:datasets 'rdata:mtcars)))
;;WARNING: Missing column name was filled in
;;#<DATA-FRAME (32 observations of 11 variables)>

This macro defines a global variable named mtcars and sets up some convenience functions.

Examine data

Lisp-Stat’s printing system is integrated with the Common Lisp Pretty Printing facility. By default, Lisp-Stat sets *print-pretty* to nil.
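
If you want the formatted output shown in this manual for everything you print, you can turn pretty printing on for the session; this is standard Common Lisp:

(setf *print-pretty* t)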

Basic information

Type the name of the data frame at the REPL to get a simple one-line summary.

mtcars ;; => #<DATA-FRAME (32 observations of 12 variables)>

Printing data

By default, head returns the first 6 rows:

(head mtcars)
;;   X1                 MPG CYL DISP  HP DRAT    WT  QSEC VS AM GEAR CARB
;; 0 Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
;; 1 Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
;; 2 Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
;; 3 Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
;; 4 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
;; 5 Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

and tail the last 6 rows:

(tail mtcars)
;;   X1              MPG CYL  DISP  HP DRAT    WT QSEC VS AM GEAR CARB
;; 0 Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
;; 1 Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
;; 2 Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
;; 3 Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
;; 4 Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
;; 5 Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2

pprint can be used to print the whole data frame:

(pprint mtcars)

;;    X1                   MPG CYL  DISP  HP DRAT    WT  QSEC VS AM GEAR CARB
;;  0 Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
;;  1 Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
;;  2 Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
;;  3 Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
;;  4 Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
;;  5 Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
;;  6 Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
;;  7 Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
;;  8 Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
;;  9 Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
;; 10 Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
;; 11 Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
;; 12 Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
;; 13 Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
;; 14 Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
;; 15 Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
;; 16 Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
;; 17 Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
;; 18 Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
;; 19 Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
;; 20 Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
;; 21 Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
;; 22 AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
;; 23 Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4 ..

The two dots “..” at the end indicate that the output has been truncated. Lisp-Stat sets the pretty printer’s *print-lines* default to 25 rows, and output longer than this is truncated. If you’d like to print all rows, set this value to nil.
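
For example, to see every row of mtcars you could bind the variable temporarily rather than change the global default; this is ordinary Common Lisp dynamic binding:

(let ((*print-lines* nil))
  (pprint mtcars))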

Notice the column named X1. This is the name given to the column by the import function; note the warning that was issued during the import. Columns with missing names are named X1, X2, …, Xn in increasing order for the duration of the Lisp-Stat session.

This column is actually the row name, so we’ll rename it:

(replace-key mtcars row-name x1)

and view the results

(head mtcars)
;;   ROW-NAME           MPG CYL DISP  HP DRAT    WT  QSEC VS AM GEAR CARB
;; 0 Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
;; 1 Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
;; 2 Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
;; 3 Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
;; 4 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
;; 5 Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Column names

To see the names of the columns, use the column-names function:

(column-names mtcars)
;; => ("ROW-NAMES" "MPG" "CYL" "DISP" "HP" "DRAT" "WT" "QSEC" "VS" "AM" "GEAR" "CARB")

Dimensions

We saw the dimensions above in basic information. That was printed for human consumption. To get the values in a form suitable for passing to other functions, use the dims command:

(aops:dims mtcars) ;; => (32 12)

Common Lisp specifies dimensions in row-column order, so mtcars has 32 rows and 12 columns.

Basic Statistics

Minimum & Maximum

To get the minimum or maximum of a column, say mpg, you can use several Common Lisp methods. Let’s see what mpg looks like by typing the name of the column into the REPL:

 mtcars:mpg
;; => #(21 21 22.8d0 21.4d0 18.7d0 18.1d0 14.3d0 24.4d0 22.8d0 19.2d0 17.8d0 16.4d0 17.3d0 15.2d0 10.4d0 10.4d0 14.7d0 32.4d0 30.4d0 33.9d0 21.5d0 15.5d0 15.2d0 13.3d0 19.2d0 27.3d0 26 30.4d0 15.8d0 19.7d0 15 21.4d0)

You could, for example, use something like this to find the minimum:

(reduce #'min mtcars:mpg) ;; => 10.4d0

or the Lisp-Stat function sequence-maximum to find the maximum

(sequence-maximum mtcars:mpg) ;; => 33.9d0

or perhaps you’d prefer alexandria:extremum, a general-purpose tool to find the minimum in a different way:

(extremum mtcars:mpg #'<) ;; => 10.4d0

The important thing to note is that mtcars:mpg is a standard Common Lisp vector and you can manipulate it like one.
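
For instance, any of the standard sequence functions can be applied to the column directly:

(length mtcars:mpg)                         ; => 32
(count-if (lambda (x) (> x 30)) mtcars:mpg) ; => 4 cars with mpg above 30
(sort (copy-seq mtcars:mpg) #'<)            ; ascending copy; the column itself is untouched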

Mean & standard deviation

(mean mtcars:mpg) ;; => 20.090625000000003d0
(standard-deviation mtcars:mpg) ;; => 5.932029552301219d0

Summarise

You can summarise a column with the column-summary function:

(column-summary mtcars:mpg)
;; => 32 reals, min=10.4d0, q25=15.399999698003132d0, q50=19.2d0, q75=22.8d0, max=33.9d0

or the entire data frame:

(summary mtcars)
#<DATA-FRAME (12 x 32)
  MTCARS:CARB
              10 (31%) x 4,
              10 (31%) x 2,
              7 (22%) x 1,
              3 (9%) x 3,
              1 (3%) x 6,
              1 (3%) x 8
  MTCARS:GEAR
              15 (47%) x 3, 12 (38%) x 4, 5 (16%) x 5
  MTCARS:AM bits, ones: 13 (41%)
  MTCARS:VS bits, ones: 14 (44%)
  MTCARS:QSEC
              32 reals, min=14.5d0, q25=16.884999999999998d0, q50=17.71d0,
              q75=18.9d0, max=22.9d0
  MTCARS:WT
            32 reals, min=1.513d0, q25=2.5425d0, q50=3.325d0,
            q75=3.6766665957371387d0, max=5.424d0
  MTCARS:DRAT
              32 reals, min=2.76d0, q25=3.08d0, q50=3.6950000000000003d0,
              q75=3.952000046730041d0, max=4.93d0
  MTCARS:HP
            32 reals, min=52, q25=96.0, q50=123, q75=186.25, max=335
  MTCARS:DISP
              32 reals, min=71.1d0, q25=120.65d0, q50=205.86666333675385d0,
              q75=334.0, max=472
  MTCARS:CYL
             14 (44%) x 8, 11 (34%) x 4, 7 (22%) x 6
  MTCARS:MPG
             32 reals, min=10.4d0, q25=15.399999698003132d0, q50=19.2d0,
             q75=22.8d0, max=33.9d0

Recall that a column named row-name is treated specially; notice that it is not included in the summary. You can see why it’s excluded by examining the column’s summary:

(pprint (column-summary mtcars:row-name))
1 (3%) x "Mazda RX4",
1 (3%) x "Mazda RX4 Wag",
1 (3%) x "Datsun 710",
1 (3%) x "Hornet 4 Drive",
1 (3%) x "Hornet Sportabout",
1 (3%) x "Valiant",
1 (3%) x "Duster 360",
1 (3%) x "Merc 240D",
1 (3%) x "Merc 230",
1 (3%) x "Merc 280",
1 (3%) x "Merc 280C",
1 (3%) x "Merc 450SE",
1 (3%) x "Merc 450SL",
1 (3%) x "Merc 450SLC",
1 (3%) x "Cadillac Fleetwood",
1 (3%) x "Lincoln Continental",
1 (3%) x "Chrysler Imperial",
1 (3%) x "Fiat 128",
1 (3%) x "Honda Civic",
1 (3%) x "Toyota Corolla",
1 (3%) x "Toyota Corona",
1 (3%) x "Dodge Challenger",
1 (3%) x "AMC Javelin",
1 (3%) x "Camaro Z28", ..

Columns with unique values in each row aren’t very interesting.

“Use” a data frame

By use-ing a data frame package you can avoid the package qualifier symbol : and refer to the variable names directly. This is similar to R’s attach function.

(use-package 'mtcars)
(mean mpg) ;; => 20.090625000000003d0

The unuse-package function stops using the symbols from the data frame.

(unuse-package 'mtcars)

Saving data

To save a data frame to a CSV file, use the data-frame-to-csv method. Here we save mtcars into the Lisp-Stat datasets directory, including the column names:

(data-frame-to-csv mtcars
		           :stream #P"LS:DATASETS;mtcars.csv"
		           :add-first-row t)

3 - Examples

Using Lisp-Stat in the real world

One of the best ways to learn Lisp-Stat is to see examples of actual work. This section contains examples of performing statistical analysis, derived from the book Introduction to the Practice of Statistics (2017) by Moore, McCabe and Craig, and plotting from the Vega-Lite example gallery.

3.1 - Analysis

From the ninth edition of the book, Introduction to the Practice of Statistics

These notebooks describe how to undertake statistical analyses introduced as examples in the ninth edition of Introduction to the Practice of Statistics (2017) by Moore, McCabe and Craig. The notebooks are organised in the same manner as the chapters of the book. The data comes from the site IPS9 in R by Nicholas Horton.

Looking at data

Chapter 1 – Distributions: Exploratory data analysis using plots and numbers

3.2 - Plotting

Example plots

The plots here show equivalents to the Vega-Lite example gallery.

Preliminaries

Load Vega-Lite

Load Vega-Lite and network libraries:

(ql:quickload :lisp-stat)
(ql:quickload :plot/vglt)
(ql:quickload :dexador)
(ql:quickload :access)

Load example data

(in-package :lisp-stat)
(defparameter vega-cars
  (vglt:vl-to-df
    (dex:get
	  "https://raw.githubusercontent.com/vega/vega-datasets/master/data/cars.json"
	  :want-stream t)))

Strip plot

The Vega-Lite strip plot example shows the relationship between horsepower and the number of cylinders using tick marks.

In this example we will show how to build a spec from beginning to end, without using a plot template.

JSON

{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "description": "Shows the relationship between horsepower and the number of cylinders using tick marks.",
  "data": {"url": "data/cars.json"},
  "mark": "tick",
  "encoding": {
    "x": {"field": "Horsepower", "type": "quantitative"},
    "y": {"field": "Cylinders", "type": "ordinal"}
  }
}

Lisp-Stat

(defparameter cars-strip-plot
  (line-up-first
	(vglt:spec)
    (vglt:add "description" "Shows the relationship between horsepower and the number of cylinders using tick marks.")
	(vglt:add "data" `(("values" . ,(vglt:df-to-alist vega-cars))))
	(vglt:add "mark" "tick")
	(vglt:add "encoding" '(("x" ("field" . "HORSEPOWER") ("type" . "quantitative") ("title" . "Horsepower"))
	                       ("y" ("field" . "CYLINDERS")  ("type" . "ordinal") ("title" . "Cylinders"))))))
(plot:plot-from-file (vglt:save-plot 'cars-strip-plot))

Scatter plots

Basic

A basic Vega-Lite scatterplot showing horsepower and miles per gallons for various cars.

Horsepower vs. MPG scatter plot

In this example we use the Lisp-Stat template for a basic scatter plot.

JSON

{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "description": "A scatterplot showing horsepower and miles per gallons for various cars.",
  "data": {"url": "data/cars.json"},
  "mark": "point",
  "encoding": {
    "x": {"field": "Horsepower", "type": "quantitative"},
    "y": {"field": "Miles_per_Gallon", "type": "quantitative"}
  }
}

Lisp-Stat

(defparameter cars-scatter-plot
  (vglt:scatter-plot vega-cars "HORSEPOWER" "MILES_PER_GALLON"))
(plot:plot-from-file (vglt:save-plot 'cars-scatter-plot))

Colored

In this example we’ll show how to modify a plot that was based on one of the Lisp-Stat plotting templates. We’d like to add some additional information to the cars scatter plot to show each car’s origin. The Vega-Lite example shows that we have to add two new directives to the encoding of the plot:

(pushnew
 '("color" . (("field" . "ORIGIN") ("type" . "nominal")))
 (access:accesses cars-scatter-plot :encoding))
(pushnew
 '("shape" . (("field" . "ORIGIN") ("type" . "nominal")))
 (access:accesses cars-scatter-plot :encoding))
(plot:plot-from-file (vglt:save-plot 'cars-scatter-plot))

With this change we can see that the higher horsepower, lower efficiency cars are from the USA, and the higher efficiency cars are from Japan and Europe.

Text marks

The same information, but further indicated with a text marker. This Vega-Lite example is sufficiently different from the template that we’ll construct it all here. Notice the use of a data transformation.

JSON

{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "data": {"url": "data/cars.json"},
  "transform": [{
    "calculate": "datum.Origin[0]",
    "as": "OriginInitial"
  }],
  "mark": "text",
  "encoding": {
    "x": {"field": "Horsepower", "type": "quantitative"},
    "y": {"field": "Miles_per_Gallon", "type": "quantitative"},
    "color": {"field": "Origin", "type": "nominal"},
    "text": {"field": "OriginInitial", "type": "nominal"}
  }
}

Lisp-Stat

(defparameter cars-scatter-text-plot
   (line-up-first
    (vglt:spec)
	(vglt:add "data" `(("values" . ,(vglt:df-to-alist vega-cars))))
	(vglt:add "transform" #((("calculate" . "datum.ORIGIN[0]") ("as" . "OriginInitial"))))
	(vglt:add "mark" "text")
	(vglt:add "encoding" '(("x" ("field" . "HORSEPOWER") ("type" . "quantitative") ("title" . "Horsepower"))
	                       ("y" ("field" . "MILES_PER_GALLON") ("type" . "quantitative") ("title" . "Miles per Gallon"))
	                       ("color" . (("field" . "ORIGIN") ("type" . "nominal")))
					       ("text" . (("field" . "OriginInitial") ("type" . "nominal")))))))
(plot:plot-from-file (vglt:save-plot 'cars-scatter-text-plot))

Interactive scatter plot matrix

This Vega-Lite interactive scatter plot matrix includes interactive elements and demonstrates creating a SPLOM (scatter plot matrix).


JSON

{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "repeat": {
    "row": ["Horsepower", "Acceleration", "Miles_per_Gallon"],
    "column": ["Miles_per_Gallon", "Acceleration", "Horsepower"]
  },
  "spec": {
    "data": {"url": "data/cars.json"},
    "mark": "point",
    "params": [
      {
        "name": "brush",
        "select": {
          "type": "interval",
          "resolve": "union",
          "on": "[mousedown[event.shiftKey], window:mouseup] > window:mousemove!",
          "translate": "[mousedown[event.shiftKey], window:mouseup] > window:mousemove!",
          "zoom": "wheel![event.shiftKey]"
        }
      },
      {
        "name": "grid",
        "select": {
          "type": "interval",
          "resolve": "global",
          "translate": "[mousedown[!event.shiftKey], window:mouseup] > window:mousemove!",
          "zoom": "wheel![!event.shiftKey]"
        },
        "bind": "scales"
      }
    ],
    "encoding": {
      "x": {"field": {"repeat": "column"}, "type": "quantitative"},
      "y": {
        "field": {"repeat": "row"},
        "type": "quantitative",
        "axis": {"minExtent": 30}
      },
      "color": {
        "condition": {
          "param": "brush",
          "field": "Origin",
          "type": "nominal"
        },
        "value": "grey"
      }
    }
  }
}

Lisp-Stat equivalent

(defparameter cars-interactive-splom
  (line-up-first
   (vglt:spec)
   (vglt:add "repeat" '(("row" . #("HORSEPOWER" "ACCELERATION" "MILES_PER_GALLON"))
			            ("column" . #("MILES_PER_GALLON" "ACCELERATION" "HORSEPOWER"))))
   (vglt:add "spec"
             `(("data" ("values" . ,(vglt:df-to-alist vega-cars)))
		      ("mark" . "point")
		      ("params" . #(
			        (("name" . "brush")
				     ("select"
				      ("type" . "interval")
				      ("resolve" . "union")
				      ("on" . "[mousedown[event.shiftKey], window:mouseup] > window:mousemove!")
				      ("translate" . "[mousedown[event.shiftKey], window:mouseup] > window:mousemove!")
				      ("zoom" . "wheel![event.shiftKey]")))
				    (("name" . "grid")
				     ("select"
				      ("type" . "interval")
				      ("resolve" . "global")
				      ("translate" . "[mousedown[!event.shiftKey], window:mouseup] > window:mousemove!")
				      ("zoom" . "wheel![!event.shiftKey]"))
				      ("bind" . "scales"))))
		      ("encoding" . (("x" ("field" ("repeat" . "column")) ("type" . "quantitative"))
				             ("y" ("field" ("repeat" . "row")) ("type" . "quantitative") ("axis" ("minExtent" . 30)))
				             ("color" ("condition" ("param" . "brush")
							                       ("field" . "ORIGIN")
							                       ("type" . "nominal"))
					                  ("value" . "grey"))))))))
(plot:plot-from-file (vglt:save-plot 'cars-interactive-splom))

4 - Core Tasks

User guides for statistical workflow

4.1 - Array operations

Manipulating sample data as arrays

Overview

The array-operations system contains a collection of functions and macros for manipulating Common Lisp arrays and performing numerical calculations with them.

Array-operations is a ‘generic’ way of operating on array-like data structures. Several aops functions have been implemented for data-frame. For those that haven’t, you can transform arrays to data frames using the df:matrix-df function, and a data-frame to an array using df:as-array. This makes it convenient to work with data sets using either system.

Quick look

Arrays can be created with numbers from a statistical distribution:

(rand '(2 2)) ; => #2A((0.62944734 0.2709539) (0.81158376 0.6700171))

in linear ranges:

(linspace 1 10 7) ; => #(1 5/2 4 11/2 7 17/2 10)

or generated using a function, optionally given the index position:

(generate #'identity '(2 3) :position) ; => #2A((0 1 2) (3 4 5))

They can also be transformed and manipulated:

(defparameter A #2A((1 2)
                    (3 4)))
(defparameter B #2A((2 3)
                    (4 5)))

;; split along any dimension
(split A 1)  ; => #(#(1 2) #(3 4))

;; stack along any dimension
(stack 1 A B) ; => #2A((1 2 2 3)
              ;        (3 4 4 5))

;; element-wise function map
(each #'+ #(0 1 2) #(2 3 5)) ; => #(2 4 7)

;; element-wise expressions
(vectorize (A B) (* A (sqrt B))) ; => #2A((1.4142135 3.4641016)
                                 ;        (6.0       8.944272))

;; index operations e.g. matrix-matrix multiply:
(each-index (i j)
  (sum-index k
    (* (aref A i k) (aref B k j)))) ; => #2A((10 13)
	                                ;        (22 29))

Array shorthand

The library defines the following short function names that are synonyms for Common Lisp operations:

array-operations   Common Lisp
size               array-total-size
rank               array-rank
dim                array-dimension
dims               array-dimensions
nrow               number of rows in matrix
ncol               number of columns in matrix

The array-operations package has the nickname aops, so you can use, for example, (aops:size my-array) without use‘ing the package.
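
For example, on a small matrix the shorthands return what you would expect from their Common Lisp equivalents:

(defparameter *m* #2A((1 2 3)
                      (4 5 6)))
(aops:size *m*) ; => 6
(aops:rank *m*) ; => 2
(aops:dims *m*) ; => (2 3)
(aops:nrow *m*) ; => 2
(aops:ncol *m*) ; => 3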

Displaced arrays

According to the Common Lisp specification, a displaced array is:

An array which has no storage of its own, but which is instead indirected to the storage of another array, called its target, at a specified offset, in such a way that any attempt to access the displaced array implicitly references the target array.

Displaced arrays are one of the niftiest features of Common Lisp. When an array is displaced to another array, it shares structure with (part of) that array. The two arrays do not need to have the same dimensions; in fact, the dimensions need not be related at all, as long as the displaced array fits inside the original one. The row-major index of the former in the latter is called the offset of the displacement.
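
Before turning to the library’s convenience function, here is what displacement looks like with plain make-array; this is standard Common Lisp:

(defparameter *target* (make-array 6 :initial-contents '(0 1 2 3 4 5)))
(defparameter *window*
  (make-array 3 :displaced-to *target* :displaced-index-offset 2))
*window*                    ; => #(2 3 4)
(setf (aref *window* 0) 99) ; writing through the window ...
*target*                    ; => #(0 1 99 3 4 5), ... modifies the target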

displace

Displaced arrays are usually constructed using make-array, but this library also provides displace for that purpose:

(defparameter *a* #2A((1 2 3)
                      (4 5 6)))
(aops:displace *a* 2 1) ; => #(2 3)

Here’s an example of using displace to implement a sliding window over some set of values, say perhaps a time-series of stock prices:

(defparameter stocks (aops:linspace 1 100 100))
(loop for i from 0 to (- (length stocks) 20)
      do (format t "~A~%" (aops:displace stocks 20 i)))
;#(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20)
;#(2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21)
;#(3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22)

flatten

flatten displaces to a row-major array:

(aops:flatten *a*) ; => #(1 2 3 4 5 6)

The real fun starts with split, which splits off sub-arrays nested within a given axis:

(aops:split *a* 1) ; => #(#(1 2 3) #(4 5 6))
(defparameter *b* #3A(((0 1) (2 3))
                      ((4 5) (6 7))))
(aops:split *b* 0) ; => #3A(((0 1) (2 3)) ((4 5) (6 7)))
(aops:split *b* 1) ; => #(#2A((0 1) (2 3)) #2A((4 5) (6 7)))
(aops:split *b* 2) ; => #2A((#(0 1) #(2 3)) (#(4 5) #(6 7)))
(aops:split *b* 3) ; => #3A(((0 1) (2 3)) ((4 5) (6 7)))

Note how splitting at 0 and the rank of the array returns the array itself.

sub

Now consider sub, which returns a specific array, composed of the elements that would start with given subscripts:

(aops:sub *b* 0) ; => #2A((0 1)
                 ;        (2 3))
(aops:sub *b* 0 1) ; => #(2 3)
(aops:sub *b* 0 1 0) ; => 2

In the case of vectors, sub works like aref:

(aops:sub #(1 2 3 4 5) 1) ; => 2

There is also a (setf sub) function.

partition

partition returns a consecutive chunk of an array separated along its first subscript:

(aops:partition #2A((0 1)
                    (2 3)
                    (4 5)
                    (6 7)
                    (8 9))
              1 3) ; => #2A((2 3)
			       ;        (4 5))

and also has a (setf partition) pair.

combine

combine is the opposite of split:

(aops:combine #(#(0 1) #(2 3))) ; => #2A((0 1)
                                ;        (2 3))

subvec

subvec returns a displaced subvector:

(aops:subvec #(0 1 2 3 4) 2 4) ; => #(2 3)

There is also a (setf subvec) function, which is like (setf subseq) except for demanding matching lengths.

reshape

Finally, reshape can be used to displace arrays into a different shape:

(aops:reshape *a* '(3 2)) ; => #2A((1 2)
                          ;        (3 4)
						  ;        (5 6))

You can use t for one of the dimensions, to be filled in automatically:

(aops:reshape *b* '(1 t)) ; => #2A((0 1 2 3 4 5 6 7))

reshape-col and reshape-row reshape your array into a column or row matrix, respectively.

Specifying dimensions

Functions in the library accept the following in place of dimensions:

  • a list of dimensions (as for make-array),
  • a positive integer, which is used as a single-element list,
  • another array, the dimensions of which are used.

The last one allows you to specify dimensions with other arrays. For example, to reshape an array a1 to look like a2, you can use

(aops:reshape a1 a2)

instead of the longer form

(aops:reshape a1 (aops:dims a2))

Creation & transformation

When the resulting element type cannot be inferred, functions that create and transform arrays are provided in pairs; one of these will allow you to specify the array-element-type of the result, while the other assumes it is t. The former ends with a *, and the element-type is always its first argument. Examples are given for the versions without *; use the starred version when you are optimizing your code and are sure you can constrain the result to a given element-type.

Element traversal order of these functions is unspecified. The reason for this is that the library may use parallel code in the future, so it is unsafe to rely on a particular element traversal order.

The following functions all make a new array, taking the dimensions as input. The version ending in * also takes the array type as first argument. There are also versions ending in ! which do not make a new array, but take an array as first argument, which is modified and returned.

Function   Description
zeros      Filled with zeros
ones       Filled with ones
rand       Filled with uniformly distributed random numbers between 0 and 1
randn      Normally distributed with mean 0 and standard deviation 1
linspace   Evenly spaced numbers in given range

For example:

(aops:rand '(2 2))
; => #2A((0.6686077 0.59425664)
;        (0.7987722 0.6930506))

(aops:rand* 'single-float '(2 2))
; => #2A((0.39332366 0.5557821)
;        (0.48831415 0.10924244))

(let ((a (make-array '(2 2) :element-type 'double-float)))
  ;; Modify array A, filling with random numbers
  (aops:rand! a))
  ; => #2A((0.6324615478515625d0 0.4636608362197876d0)
  ;        (0.4145939350128174d0 0.5124958753585815d0))
(aops:linspace 0 4 5)   ;=> #(0 1 2 3 4)
(aops:linspace 1 3 5)   ;=> #(1 3/2 2 5/2 3)
(aops:linspace 0 4d0 3) ;=> #(0.0d0 2.0d0 4.0d0)

generate

generate (and generate*) allow you to generate arrays using functions.

(aops:generate (lambda () (random 10)) 3) ; => #(6 9 5)
(aops:generate #'identity '(2 3) :position) ; => #2A((0 1 2)
                                            ;        (3 4 5))
(aops:generate #'identity '(2 2) :subscripts)
; => #2A(((0 0) (0 1))
;        ((1 0) (1 1)))
(aops:generate #'cons '(2 2) :position-and-subscripts)
; => #2A(((0 0 0) (1 0 1))
;        ((2 1 0) (3 1 1)))

Depending on the last argument, the function will be called with the (row-major) position, the subscripts, both, or no argument.

permute

permute can permute subscripts (you can also invert, complement, and complete permutations, look at the docstring and the unit tests). Transposing is a special case of permute:

(defparameter *a* #2A((1 2 3)
                      (4 5 6)))
(aops:permute '(0 1) *a*) ; => #2A((1 2 3)
                          ;        (4 5 6))
(aops:permute '(1 0) *a*) ; => #2A((1 4)
                          ;        (2 5)
						  ;        (3 6))

each

each applies a function to its one dimensional array arguments elementwise. It essentially is an element-wise function map on each of the vectors:

(aops:each #'+ #(0 1 2)
               #(2 3 5)
               #(1 1 1))
; => #(3 5 8)

vectorize

vectorize is a macro which performs elementwise operations

(defparameter a #(1 2 3 4))
(aops:vectorize (a) (* 2 a)) ; => #(2 4 6 8)

(defparameter b #(2 3 4 5))
(aops:vectorize (a b) (* a (sin b)))
; => #(0.9092974 0.28224 -2.2704074 -3.8356972)

There is also a version vectorize* which takes a type argument for the resulting array, and a version vectorize! which sets elements in a given array.

margin

The semantics of margin are more difficult to explain, so perhaps an example will be more useful. Suppose that you want to calculate column sums in a matrix. You could permute (transpose) the matrix, split its sub-arrays at rank one (so you get a vector for each row), and apply the function that calculates the sum. margin automates that for you:

(aops:margin (lambda (column)
               (reduce #'+ column))
             #2A((0 1)
                 (2 3)
                 (5 7)) 0) ; => #(7 11)

But the function is more general than this: the arguments inner and outer allow arbitrary permutations before splitting.
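
Following the same logic, passing 1 as the axis should hand each row to the function instead; the result below is inferred from the column-sum example above rather than taken from the library’s tests:

(aops:margin (lambda (row)
               (reduce #'+ row))
             #2A((0 1)
                 (2 3)
                 (5 7)) 1) ; => #(1 5 12), presumably the row sums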

recycle

Finally, recycle allows you to reuse the elements of the first argument, object, to create new arrays by extending the dimensions. The :outer keyword repeats the original object and the :inner keyword argument repeats the elements of object. When both :inner and :outer are nil, object is returned as is. Non-array objects are interpreted as rank 0 arrays, following the usual semantics.

(aops:recycle #(2 3) :inner 2 :outer 4)
; => #3A(((2 2) (3 3))
;        ((2 2) (3 3))
;        ((2 2) (3 3))
;        ((2 2) (3 3)))

Three dimensional arrays can be tough to get your head around. In the example above, :outer asks for 4 2-element vectors, composed of repeating the elements of object twice, i.e. repeat ‘2’ twice and repeat ‘3’ twice. Compare this with :inner as 3:

(aops:recycle #(2 3) :inner 3 :outer 4)
; => #3A(((2 2 2) (3 3 3))
;        ((2 2 2) (3 3 3))
;        ((2 2 2) (3 3 3))
;        ((2 2 2) (3 3 3)))

map-array

map-array maps a function over the elements of an array.

(aops:map-array #2A((1.7 2.1 4.3 5.4)
                    (0.3 0.4 0.5 0.6))
                #'log)
; => #2A((0.53062826 0.7419373 1.4586151 1.686399)
;        (-1.2039728 -0.9162907 -0.6931472 -0.5108256))

Indexing operations

nested-loop

nested-loop is a simple macro which iterates over a set of indices with a given range

(defparameter A #2A((1 2) (3 4)))

(aops:nested-loop (i j) (array-dimensions A)
  (setf (aref A i j) (* 2 (aref A i j))))
A ; => #2A((2 4) (6 8))

(aops:nested-loop (i j) '(2 3)
  (format t "(~a ~a) " i j)) ; => (0 0) (0 1) (0 2) (1 0) (1 1) (1 2)

sum-index

sum-index is a macro which uses a code walker to determine the dimension sizes, summing over the given index or indices

(defparameter A #2A((1 2) (3 4)))

;; Trace
(aops:sum-index i (aref A i i)) ; => 5

;; Sum array
(aops:sum-index (i j) (aref A i j)) ; => 10

;; Sum array
(aops:sum-index i (row-major-aref A i)) ; => 10

The main use for sum-index is in combination with each-index.

each-index

each-index is a macro which creates an array and iterates over the elements. Like sum-index it is given one or more index symbols, and uses a code walker to find array dimensions.

(defparameter A #2A((1 2)
                    (3 4)))
(defparameter B #2A((5 6)
                    (7 8)))

;; Transpose
(aops:each-index (i j) (aref A j i)) ; => #2A((1 3)
                                     ;        (2 4))

;; Sum columns
(aops:each-index i
  (aops:sum-index j
    (aref A j i))) ; => #(4 6)

;; Matrix-matrix multiply
(aops:each-index (i j)
   (aops:sum-index k
      (* (aref A i k) (aref B k j)))) ; => #2A((19 22)
	                                  ;        (43 50))

reduce-index

reduce-index is a more general version of sum-index; it applies a reduction operation over one or more indices.

(defparameter A #2A((1 2)
                    (3 4)))

;; Sum all values in an array
(aops:reduce-index #'+ i (row-major-aref A i)) ; => 10

;; Maximum value in each row
(aops:each-index i
  (aops:reduce-index #'max j
    (aref A i j)))  ; => #(2 4)

Reducing

Some reductions over array elements can be done using the Common Lisp reduce function, together with aops:flatten, which returns a displaced vector:

(defparameter a #2A((1 2)
                    (3 4)))
(reduce #'max (aops:flatten a)) ; => 4

argmax/argmin

argmax and argmin find the row-major-aref index where an array value is maximum or minimum. They both return two values: the first value is the index; the second is the array value at that index.

(defparameter a #(1 2 5 4 2))
(aops:argmax a) ; => 2 5
(aops:argmin a) ; => 0 1

vectorize-reduce

More complicated reductions can be done with vectorize-reduce, for example the maximum absolute difference between arrays:

(defparameter a #2A((1 2)
                    (3 4)))
(defparameter b #2A((2 2)
                    (1 3)))

(aops:vectorize-reduce #'max (a b) (abs (- a b))) ; => 2

See also reduce-index above.

Scalar values

Library functions treat non-array objects as if they were equivalent to 0-dimensional arrays: for example, (aops:split array (rank array)) returns an array that is effectively equivalent (eq) to array. Another example is recycle:

(aops:recycle 4 :inner '(2 2)) ; => #2A((4 4)
                               ;        (4 4))

Stacking

You can stack compatible arrays by column or row. Metaphorically you can think of these operations as stacking blocks. For example stacking two row vectors yields a 2x2 array:

(stack-rows #(1 2) #(3 4))
;; #2A((1 2)
;;     (3 4))

Like other functions, there are two versions: generalised stacking, with rows and columns of type T and specialised versions where the element-type is specified. The versions allowing you to specialise the element type end in *.

The stack functions use object dimensions (as returned by dims) to determine how to use the object:

  • when the object has 0 dimensions, fill a column with the element
  • when the object has 1 dimension, use it as a column
  • when the object has 2 dimensions, use it as a matrix

copy-row-major-block is a utility function in the stacking package that does what its name suggests: it copies elements from one array to another. This function should be used to implement copying of contiguous row-major blocks of elements.

rows

stack-rows-copy is the method used to implement the copying of objects in stack-rows*, by copying the elements of source to destination, starting with the row index start-row in the latter. Elements are coerced to element-type.

stack-rows and stack-rows* stack objects row-wise into an array of the given element-type, coercing if necessary. Always return a simple array of rank 2. stack-rows always returns an array with elements of type T, stack-rows* coerces elements to the specified type.

columns

stack-cols-copy is the method used to implement the copying of objects in stack-cols*, by copying the elements of source to destination, starting with the column index start-col in the latter. Elements are coerced to element-type.

stack-cols and stack-cols* stack objects column-wise into an array of the given element-type, coercing if necessary. Always return a simple array of rank 2. stack-cols always returns an array with elements of type T, stack-cols* coerces elements to the specified type.
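
By analogy with the stack-rows example above, stacking the same two vectors column-wise should place each vector in its own column; the result shown is inferred from that description:

(aops:stack-cols #(1 2) #(3 4))
; => #2A((1 3)
;        (2 4))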

arbitrary

stack and stack* stack array arguments along axis. element-type determines the element-type of the result.

(defparameter *a1* #(0 1 2))
(defparameter *a2* #(3 5 7))
(aops:stack 0 *a1* *a2*) ; => #(0 1 2 3 5 7)
(aops:stack 1
          (aops:reshape-col *a1*)
          (aops:reshape-col *a2*)) ; => #2A((0 3)
	                               ;        (1 5)
								   ;        (2 7))

4.2 - Working with data

Manipulating data using a data frame

Overview

A common lisp data frame is a collection of observations of sample variables that shares many of the properties of arrays and lists. By design it can be manipulated using the same mechanisms used to manipulate lisp arrays. This allows you to, for example, transform a data frame into an array, use array-operations to manipulate it, and then turn it into a data frame again to use in modeling or plotting.

Load/install

Data-frame is part of the Lisp-Stat package. It can be used independently if desired. Since the examples in this manual use Lisp-Stat functionality, we’ll use it from there rather than load independently.

(ql:quickload :lisp-stat)

Within the Lisp-Stat system, the LS-USER package is set-up for statistics work. Type the following to enter the package:

(in-package :ls-user)

Common Lisp Implementation

Data frame is implemented as a two-dimensional common lisp data structure: a vector of vectors for data, and a hash table mapping variable names to column vectors. All columns are of equal length. This structure provides the flexibility required for column oriented manipulation, as well as speed for large data sets.

Data variables

If you’re collecting data and exploring a problem domain, you’ll sometimes have a collection of separate variables to start with. Common Lisp has two structures for holding multiple observations of variables: list and vector, collectively known as sequences. For the most part a vector is more efficient, and it is the recommended way to work with variables that are independent of a data-frame.

defparameter

Lisp-Stat provides a wrapper over Common Lisp’s defparameter to make working with data variables a little easier. You can define a variable with def. Here are some variables containing weather data for Singapore over the last 14 days:

(def max-temps '#(30.1 30.3 30.3 30.8 31.6 31.5 32.7 32.1 32.1 31.4 31.9 31.7 32.2 31.1))
(def min-temps '#(24.6 25.4 25.1 24.5 23.7 25.6 24.6 24.7 25.0 25.2 25.1 25.6 25.5 25.2))
(def precipitation '#(0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4 0.0 0.0 ))

For a quick analysis, you can see how this is easier to work with than a data-frame.
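
For instance, the statistical functions used elsewhere in this manual apply directly to these vectors; the exact floats printed may vary slightly by implementation:

(mean max-temps)                      ; => approximately 31.41
(- (mean max-temps) (mean min-temps)) ; average daily temperature range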

After you have been working for a while you may want to find out what variables you have defined (using def). The function variables will produce a listing:

(variables)
; => (max-temps min-temps precipitation)

If you are working with very large variables you may occasionally want to free up some space by getting rid of some variables you no longer need. You can do this using the undef function:

(undef 'max-temps)

To save a variable you can use the savevar function. This function allows you to save one or more variables into a file. A new file is created and any existing file by the same name is destroyed. To save the variable precipitation in a file called precipitation.lisp type

(savevar 'precipitation "precipitation")

Do not add the .lisp suffix yourself; savevar will supply it. To save the two variables precipitation and min-temps in the file sg-weather.lisp type:

(savevar '(min-temps precipitation) "sg-weather")

The files precipitation.lisp and sg-weather.lisp now contain a set of expressions that, when read in with the load command, will recreate the variables precipitation and min-temps. You can look at these files with an editor such as Emacs, and you can prepare files with your own data by following these examples.

define-data-frame

The define-data-frame macro is conceptually equivalent to the Common Lisp defparameter, but with some additional functionality that makes working with data frames easier. You use it the same way you’d use defparameter, for example:

(define-data-frame foo <any function returning a data frame>)

We’ll use both ways of defining data frames in this manual. The access methods that are defined by define-data-frame are described in the access data section.

Create data-frames

A data frame can be created from a Common Lisp array, alist, plist or individual data vectors.

Data frame columns represent sample set variables, and its rows are observations (or cases).

(defmethod print-object ((df data-frame) stream)
  "Print the first six rows of DATA-FRAME"
  (let* ((*print-lines* 6)
	     (*print-pretty* t))
    (df:pprint-data-frame df stream)))
(setf *print-pretty* t)

Let’s create a simple data frame. First we’ll setup some example variables to represent our sample domain:

(defparameter v #(1 2 3 4)) ; data vector
(defparameter b #*0110)     ; bits
(defparameter s #(a b c d)) ; symbols (variable names)
(defparameter plist `(:vector ,v :symbols ,s))

From p/a-lists

Now, suppose we want to create a data frame from a plist

(apply #'df plist)
;; VECTOR SYMBOLS
;;      1 A
;;      2 B
;;      3 C
;;      4 D

We could also have used the plist-df function:

(plist-df plist)
;; VECTOR SYMBOLS
;;      1 A
;;      2 B
;;      3 C
;;      4 D

and to demonstrate the same thing using an alist, we’ll use the alexandria:plist-alist function to convert the plist into an alist:

(alist-df (plist-alist plist))
;; VECTOR SYMBOLS
;;      1 A
;;      2 B
;;      3 C
;;      4 D

From vectors

You can use make-df to create a data frame from keys and a list of vectors. Each vector becomes a column in the data-frame.

(make-df '(:a :b) '(#(1 2 3) #(10 20 30)))
;; A  B
;; 1 10
;; 2 20
;; 3 30

This is useful if you’ve started working with variables defined with def, defparameter or defvar and want to combine them into a data frame.
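
For example, the weather variables defined with def earlier could be combined into a data frame like this (a minimal sketch using the vectors from the defparameter section above):

(make-df '(:min-temp :precipitation)
         (list min-temps precipitation))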

From arrays

matrix-df converts a matrix (array) to a data-frame with the given keys.

(matrix-df #(:a :b) #2A((1 2)
	                    (3 4)))
;#<DATA-FRAME (2 observations of 2 variables)>

This is useful if you need to do a lot of number-crunching on a data set as an array, perhaps with BLAS or array-operations, and then want to add categorical variables and continue processing as a data-frame.

Example datasets

Vincent Arel-Bundock maintains a library of nearly 1500 R datasets that is a consolidation of example data from various R packages. The lisp-stat/rdata system allows you to load these to use in Lisp-Stat. To get started, try loading the classic mtcars data set:

(ql:quickload :lisp-stat/rdata)
(define-data-frame mtcars
  (read-csv (rdata:rdata 'rdata:datasets 'rdata:mtcars)))
;"MTCARS"

You can list the packages in Rdatasets like so:

(rdata:show-packages)
;(RDATA:VCD RDATA:TIDYR RDATA:TEXMEX RDATA:SURVIVAL RDATA:STEVEDATA RDATA:STAT2DATA RDATA:SEM RDATA:SANDWICH RDATA:RPART RDATA:ROBUSTBASE RDATA:RESHAPE2 RDATA:QUANTREG RDATA:PSYCH RDATA:PSCL RDATA:PLYR RDATA:PLM RDATA:PALMERPENGUINS RDATA:NYCFLIGHTS13 RDATA:MULTGEE RDATA:MOSAICDATA RDATA:MI RDATA:MEDIATION RDATA:MASS RDATA:LMEC RDATA:LME4 RDATA:LATTICE RDATA:KMSURV RDATA:ISLR RDATA:HWDE RDATA:HSAUR RDATA:HISTDATA RDATA:GT RDATA:GGPLOT2MOVIES RDATA:GGPLOT2 RDATA:GEEPACK RDATA:GAP RDATA:FPP2 RDATA:FORECAST RDATA:EVIR RDATA:ECDAT RDATA:DRC RDATA:DRAGRACER RDATA:DPLYR RDATA:DATASETS RDATA:DAAG COUNT RDATA:CLUSTER RDATA:CARDATA RDATA:BOOT RDATA:AER)

and the individual data sets within each package with the show-package-items command. Here’s an example listing the contents of the built-in R datasets package:

(rdata:show-package-items 'rdata:datasets)

Here’s the first few rows of the table produced by the above command.

Dataset Title Vars. Obs.
ABILITY.COV Ability and Intelligence Tests 8 6
AIRMILES Passenger Miles on Commercial US Airlines, 1937-1960 2 24
AIRPASSENGERS Monthly Airline Passenger Numbers 1949-1960 2 144
AIRQUALITY New York Air Quality Measurements 6 153
ANSCOMBE Anscombe’s Quartet of ‘Identical’ Simple Linear Regressions 8 11
ATTENU The Joyner-Boore Attenuation Data 5 182
ATTITUDE The Chatterjee-Price Attitude Data 7 30
AUSTRES Quarterly Time Series of the Number of Australian Residents 2 89
BJSALES Sales Data with Leading Indicator 2 150
BOD Biochemical Oxygen Demand 2 6
CARS Speed and Stopping Distances of Cars 2 50
CHICKWEIGHT Weight versus age of chicks on different diets 4 578
CHICKWTS Chicken Weights by Feed Type 2 71
CO2 Mauna Loa Atmospheric CO2 Concentration 2 468

Export data frames

These next few functions are the inverse of the creation functions above: they convert a data frame back into standard Common Lisp data structures. They are useful when you want to use foreign libraries or Common Lisp functions to process the data.

For this section of the manual, we are going to work with a subset of the mtcars data set from above. We’ll use the select package to take the first 5 rows so that the data transformations are easier to see.

(defparameter mtcars-small (select mtcars (range 0 5) t))

The next three functions convert a data-frame into standard Common Lisp data structures. This is useful if you’ve got data in Common Lisp format and want to work with it in a data frame, or if you’ve got a data frame and want to apply Common Lisp operators to it that don’t exist in df.

as-alist

Just like it says on the tin, as-alist takes a data frame and returns an alist version of it (formatted here for clearer output – a pretty printer that outputs an alist in this format would be a welcome addition to CL/Lisp-Stat)

(as-alist mtcars-small)
;; ((MTCARS:X1 . #("Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" "Hornet Sportabout"))
;;  (MTCARS:MPG . #(21 21 22.8d0 21.4d0 18.7d0))
;;  (MTCARS:CYL . #(6 6 4 6 8))
;;  (MTCARS:DISP . #(160 160 108 258 360))
;;  (MTCARS:HP . #(110 110 93 110 175))
;;  (MTCARS:DRAT . #(3.9d0 3.9d0 3.85d0 3.08d0 3.15d0))
;;  (MTCARS:WT . #(2.62d0 2.875d0 2.32d0 3.215d0 3.44d0))
;;  (MTCARS:QSEC . #(16.46d0 17.02d0 18.61d0 19.44d0 17.02d0))
;;  (MTCARS:VS . #*00110)
;;  (MTCARS:AM . #*11100)
;;  (MTCARS:GEAR . #(4 4 4 3 3))
;;  (MTCARS:CARB . #(4 4 1 1 2)))

as-plist

Similarly, as-plist will return a plist:

(nu:as-plist mtcars-small)
;; (MTCARS:X1 #("Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" "Hornet Sportabout")
;;  MTCARS:MPG #(21 21 22.8d0 21.4d0 18.7d0)
;;	MTCARS:CYL #(6 6 4 6 8)
;;	MTCARS:DISP #(160 160 108 258 360)
;;	MTCARS:HP #(110 110 93 110 175)
;;	MTCARS:DRAT #(3.9d0 3.9d0 3.85d0 3.08d0 3.15d0)
;;	MTCARS:WT #(2.62d0 2.875d0 2.32d0 3.215d0 3.44d0)
;;	MTCARS:QSEC #(16.46d0 17.02d0 18.61d0 19.44d0 17.02d0)
;;	MTCARS:VS #*00110
;;	MTCARS:AM #*11100
;;	MTCARS:GEAR #(4 4 4 3 3)
;;	MTCARS:CARB #(4 4 1 1 2))

as-array

as-array returns the data frame as a row-major two dimensional lisp array. You’ll want to save the variable names using the keys function to make it easy to convert back (see matrix-df). One of the reasons you might want to use this function is to manipulate the data-frame using array-operations. This is particularly useful when you have data frames of all numeric values.

(defparameter mtcars-keys (keys mtcars)) ; we'll use later
(defparameter mtcars-small-array (as-array mtcars-small))
mtcars-small-array
; #2A(("Mazda RX4" 21 6 160 110 3.9d0 2.62d0 16.46d0 0 1 4 4)
;     ("Mazda RX4 Wag" 21 6 160 110 3.9d0 2.875d0 17.02d0 0 1 4 4)
;     ("Datsun 710" 22.8d0 4 108 93 3.85d0 2.32d0 18.61d0 1 1 4 1)
;     ("Hornet 4 Drive" 21.4d0 6 258 110 3.08d0 3.215d0 19.44d0 1 0 3 1)
;     ("Hornet Sportabout" 18.7d0 8 360 175 3.15d0 3.44d0 17.02d0 0 0 3 2))

Our abbreviated mtcars data frame is now a two dimensional Common Lisp array.
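
To go the other way, pass the saved keys and the array back to matrix-df (a sketch using the variables defined just above; the printed result is indicative):

(matrix-df mtcars-keys mtcars-small-array)
; #<DATA-FRAME (5 observations of 12 variables)>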

vectors

The columns function returns the variables of the data frame as a vector of vectors:

(columns mtcars-small)
; #(#("Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" "Hornet Sportabout")
;   #(21 21 22.8d0 21.4d0 18.7d0)
;	#(6 6 4 6 8)
;	#(160 160 108 258 360)
;	#(110 110 93 110 175)
;	#(3.9d0 3.9d0 3.85d0 3.08d0 3.15d0)
;	#(2.62d0 2.875d0 2.32d0 3.215d0 3.44d0)
;	#(16.46d0 17.02d0 18.61d0 19.44d0 17.02d0)
;	#*00110
;	#*11100
;	#(4 4 4 3 3)
;	#(4 4 1 1 2))

The result is a vector of column vectors, i.e. a column-major representation of the data.

You can also pass a selection to the columns function to return specific columns:

(columns mtcars-small 'mtcars:mpg)
; #(21 21 22.8d0 21.4d0 18.7d0)

The functions in array-operations are helpful in further dealing with data frames as vectors and arrays. For example you could convert this to an array by using aops:combine with columns:

(combine (columns mtcars-small))
; #2A(("Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" "Hornet Sportabout")
;     (21 21 22.8d0 21.4d0 18.7d0)
;	  (6 6 4 6 8)
;	  (160 160 108 258 360)
;	  (110 110 93 110 175)
;	  (3.9d0 3.9d0 3.85d0 3.08d0 3.15d0)
;	  (2.62d0 2.875d0 2.32d0 3.215d0 3.44d0)
;	  (16.46d0 17.02d0 18.61d0 19.44d0 17.02d0)
;	  (0 0 1 1 0)
;	  (1 1 1 0 0)
;	  (4 4 4 3 3)
;	  (4 4 1 1 2))

Load data

You can use the dfio system to load delimited text files, such as CSV, into a data frame.

From strings

Here is a short demonstration of reading from strings:

(defparameter *d* (dfio:read-csv
                     (format nil "Gender,Age,Height~@
                                  \"Male\",30,180.~@
                                  \"Male\",31,182.7~@
                                  \"Female\",32,1.65e2")))

dfio tries hard to decipher the various number formats sometimes encountered in CSV files:

(select (dfio:read-csv
                 (format nil "\"All kinds of wacky number formats\"~%.7~%19.~%.7f2"))
                t 'all-kinds-of-wacky-number-formats)
; => #(0.7d0 19.0d0 70.0)

From files

We saw above that dfio can read from strings, so one easy way to read from a file is to use the uiop system function read-file-string. We can read one of the example data files included with Lisp-Stat like this:

(read-csv
	(uiop:read-file-string #P"LS:DATASETS;absorption.csv"))
;;    IRON ALUMINUM ABSORPTION 
;;  0   61       13          4
;;  1  175       21         18
;;  2  111       24         14
;;  3  124       23         18
;;  4  130       64         26
;;  5  173       38         26 ..

For most data sets, this method will work fine. If you are working with large CSV files, you may want to consider using a stream from an open file so you don’t have uiop read the whole thing in before processing it into a data frame:

(read-csv #P"LS:DATASETS;absorption.csv")
;;    IRON ALUMINUM ABSORPTION 
;;  0   61       13          4
;;  1  175       21         18
;;  2  111       24         14
;;  3  124       23         18
;;  4  130       64         26
;;  5  173       38         26 ..
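
You can also open the file yourself and pass the resulting stream, since dfio reads from any Common Lisp stream (a sketch):

(with-open-file (s #P"LS:DATASETS;absorption.csv")
  (read-csv s))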

From URLs

dfio can also read from Common Lisp streams. Stream operations can be network or file based. Here is an example of how to read the classic Iris data set over the network using the HTTP client dexador.

(read-csv
 (dex:get
   "https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/iris.csv"
   :want-stream t))
;;     X27 SEPAL-LENGTH SEPAL-WIDTH PETAL-LENGTH PETAL-WIDTH SPECIES
;;   0   1          5.1         3.5          1.4         0.2 setosa
;;   1   2          4.9         3.0          1.4         0.2 setosa
;;   2   3          4.7         3.2          1.3         0.2 setosa
;;   3   4          4.6         3.1          1.5         0.2 setosa
;;   4   5          5.0         3.6          1.4         0.2 setosa
;;   5   6          5.4         3.9          1.7         0.4 setosa ..

From a database

Save data

Data frames can be saved into any delimited text format supported by cl-csv, or several flavors of JSON, such as Vega-Lite. Since the JSON reader/writers are specific to the plotting applications, they are described in the plotting section.

To files

To save the mtcars data frame to disk, you could use:

(write-csv mtcars
		   :stream #P"LS:DATASETS;mtcars.csv"
           :add-first-row t)         ; add column headers

to save it as CSV, or to save it to tab-separated values:

(write-csv mtcars
	       :separator #\tab
		   :stream #P"LS:DATASETS;mtcars.tsv"
		   :add-first-row t)         ; add column headers

To a database

See the section above, From a database.

Access data

This section describes various ways to access data variables.

Access a data-frame

Let’s use define-data-frame to define the iris data frame. We’ll use both iris and the mtcars data frame defined earlier in the examples below.

(define-data-frame iris
  (read-csv (rdata:rdata 'rdata:datasets 'rdata:iris)))
COMMON-LISP:WARNING: Missing column name was filled in
"IRIS"

We now have a global variable named iris that represents the data frame. Let’s look at the first part of this data:

(head iris)
;;   X29 SEPAL-LENGTH SEPAL-WIDTH PETAL-LENGTH PETAL-WIDTH SPECIES
;; 0   1          5.1         3.5          1.4         0.2 setosa
;; 1   2          4.9         3.0          1.4         0.2 setosa
;; 2   3          4.7         3.2          1.3         0.2 setosa
;; 3   4          4.6         3.1          1.5         0.2 setosa
;; 4   5          5.0         3.6          1.4         0.2 setosa
;; 5   6          5.4         3.9          1.7         0.4 setosa

Notice a couple of things. First, there is a column X29. In fact if you look back at previous data frame output in this tutorial you will notice various columns named X followed by some number. This is because the column was not given a name in the data set, so a name was generated for it. X starts at 1 and increases by 1 each time an unnamed variable is encountered during your Lisp-Stat session, so the number you see may differ. The next time you start Lisp-Stat, numbering will start over from 1 again. We will see how to clean up this data frame in the next sections.

The second thing to note is the row numbers on the far left side. When Lisp-Stat prints a data frame it automatically adds row numbers. Row and column numbering in Lisp-Stat start at 0. In R they start with 1. Row numbers make it convenient to make selections from a data frame, but they are not part of the data and cannot be selected or manipulated. They only appear when a data frame is printed.

Access a variable

The define-data-frame macro also defines symbol macros that allow you to refer to a variable by name, for example to refer to the mpg column of mtcars, you can refer to it by the Common Lisp package:symbol convention:

mtcars:mpg
;#(21 21 22.8d0 21.4d0 18.7d0 18.1d0 14.3d0 24.4d0 22.8d0 19.2d0 17.8d0 16.4d0 17.3d0 15.2d0 10.4d0 10.4d0 14.7d0 32.4d0 30.4d0 33.9d0 21.5d0 15.5d0 15.2d0 13.3d0 19.2d0 27.3d0 26 30.4d0 15.8d0 19.7d0 15 21.4d0)

There is a point of distinction to be made here: the values of mpg and the column mpg. For example to obtain the same vector using the selection/sub-setting package select we must refer to the column:

(select mtcars t 'mtcars:mpg)
;#(21 21 22.8d0 21.4d0 18.7d0 18.1d0 14.3d0 24.4d0 22.8d0 19.2d0 17.8d0 16.4d0 17.3d0 15.2d0 10.4d0 10.4d0 14.7d0 32.4d0 30.4d0 33.9d0 21.5d0 15.5d0 15.2d0 13.3d0 19.2d0 27.3d0 26 30.4d0 15.8d0 19.7d0 15 21.4d0)

Note that with select we passed the symbol ‘mtcars:mpg (you can tell it’s a symbol because of the quote in front of it).

So, the rule here is: if you want the value, refer to it directly, e.g. mtcars:mpg. If you are referring to the column, use the symbol. Data frame operations typically require the symbol, whereas Common Lisp and other packages that take vectors use the direct access form.

Package names

The define-data-frame macro creates a package with the same name as the data frame and interns symbols for each column in it. This is how you can refer to the columns by name. So far we have referred to variables (values) with a package prefix. You can also refer to them without package names by using the Common Lisp use-package command:

(use-package 'mtcars)

You can now use mpg by itself, e.g.

(mean mpg) ;; => 20.090625000000003d0

To stop using the symbols in the current package, you can unuse the data frame:

(unuse-package 'mtcars)

Data-frame operations

These functions operate on data-frames as a whole.

copy

copy returns a newly allocated data-frame with the same values as the original:

(copy mtcars-small)
;;   X1                 MPG CYL DISP  HP DRAT    WT  QSEC VS AM GEAR CARB
;; 0 Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
;; 1 Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
;; 2 Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
;; 3 Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
;; 4 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

By default only the keys are copied; the new data frame shares the original column data, i.e. it is a shallow copy. For a deep copy, use the copy-array function as the key:

(copy mtcars-small :key #'copy-array)
;;   X1                 MPG CYL DISP  HP DRAT    WT  QSEC VS AM GEAR CARB
;; 0 Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
;; 1 Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
;; 2 Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
;; 3 Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
;; 4 Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

A deep copy is useful when you intend to apply destructive operations to the data-frame.

keys

Returns a vector of the variables in the data frame. The keys are symbols. Symbol properties describe the variable, for example units.

(keys mtcars)
; #(MTCARS:X1 MTCARS:MPG MTCARS:CYL MTCARS:DISP MTCARS:HP MTCARS:DRAT MTCARS:WT MTCARS:QSEC MTCARS:VS MTCARS:AM MTCARS:GEAR MTCARS:CARB)

Recall the earlier discussion of X1 for the column name.

map-df

map-df transforms one data-frame into another, row-by-row. Its function signature is:

(map-df data-frame keys function result-keys) ...

It applies function to each row, and returns a data frame with the result-keys as the column (variable) names. keys is a list. You can also specify the type of the new variables in the result-keys list.

The goal for this example is to transform df1:

(defparameter df1 (make-df '(:a :b) '(#(2 3 5) #(7 11 13))))

into a data-frame that consists of the product of :a and :b, and a bit mask of the columns that indicates where the product is >= 30. First we’ll need a helper for the bit mask:

(defun predicate-bit (a b)
  "Return 1 if a*b >= 30, 0 otherwise"
  (if (<= 30 (* a b))
      1
      0))

Now we can transform df1 into our new data-frame, df2, with:

(defparameter df2 (map-df df1 '(:a :b)
			  (lambda (a b)
			    (vector (* a b) (predicate-bit a b)))
			  '((:p fixnum) (:m bit))))

Since it was a parameter assignment, we have to view it manually:

(pprint df2)
;;    P M
;; 0 14 0
;; 1 33 1
;; 2 65 1

Note how we specified both the new key names and their type. Here’s an example that converts the imperial units in mtcars to metric:

(map-df mtcars '(mtcars:x1 mtcars:mpg mtcars:disp mtcars:hp mtcars:wt)
	(lambda (model mpg disp hp wt)
	  (vector model ;no transformation for model (X1), return as-is
              (/ 235.214583 mpg)
		      (/ disp 61.024)
		      (* hp 1.01387)
		      (/ (* wt 1000) 2.2046)))
	'(:model (:100km/l float) (:disp float) (:hp float) (:kg float)))

View the new metric units data frame:

(head *)
;;   MODEL                        100KM/L      DISP        HP                 KG
;; 0 Mazda RX4         11.200694000000000 2.6219194 111.52570 1188.4241523222775
;; 1 Mazda RX4 Wag     11.200694000000000 2.6219194 111.52570 1304.0913885215832
;; 2 Datsun 710        10.316429138183594 1.7697955  94.28991 1052.3450509113297
;; 3 Hornet 4 Drive    10.991335717317101 4.2278447 111.52570 1458.3143701206573
;; 4 Hornet Sportabout 12.578320018747911 5.8993187 177.42725 1560.3736961788682
;; 5 Valiant           12.995280903347288 3.6870740 106.45635 1569.4456362729313

You might be wondering how we were able to refer to the columns without the ' (quote); in fact we did use one, at the beginning of the list. The Lisp reader then reads the contents of the quoted list as symbols.

rows

rows returns the rows of a data frame as a vector of vectors:

(rows mtcars-small)
;#(#("Mazda RX4" 21 6 160 110 3.9d0 2.62d0 16.46d0 0 1 4 4)
;  #("Mazda RX4 Wag" 21 6 160 110 3.9d0 2.875d0 17.02d0 0 1 4 4)
;  #("Datsun 710" 22.8d0 4 108 93 3.85d0 2.32d0 18.61d0 1 1 4 1)
;  #("Hornet 4 Drive" 21.4d0 6 258 110 3.08d0 3.215d0 19.44d0 1 0 3 1)
;  #("Hornet Sportabout" 18.7d0 8 360 175 3.15d0 3.44d0 17.02d0 0 0 3 2))

remove duplicates

The df-remove-duplicates function will remove duplicate rows. Let’s create a data-frame with duplicates:

(defparameter dup (make-df '(a b c) '(#(a1 a1 a3)
                                      #(a1 a1 b3)
									  #(a1 a1 c3))))
DUP

Confirm a duplicate row:

LS-USER> dup
;; A  B  C
;; A1 A1 A1
;; A1 A1 A1
;; A3 B3 C3

Now remove duplicate rows 0 and 1:

(df-remove-duplicates dup)
;; A  B  C
;; A1 A1 A1
;; A3 B3 C3

Column operations

You have seen some of these functions before, and for completeness we repeat them here. The rest of the section covers the other column functions.

To obtain a variable (column) from a data frame, use the column function. Using mtcars, defined in example datasets above:

(column mtcars-small 'mtcars:mpg)
;; #(21 21 22.8d0 21.4d0 18.7d0)

Careful readers will note that we used a symbol from the mtcars package, and not mtcars-small, to name the column. We can do this because mtcars-small is a subset of mtcars and shares its column keys.

To get all the columns as a vector, use the columns function:

(columns mtcars-small)
; #(#("Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" "Hornet Sportabout")
;   #(21 21 22.8d0 21.4d0 18.7d0)
;	#(6 6 4 6 8)
;	#(160 160 108 258 360)
;	#(110 110 93 110 175)
;	#(3.9d0 3.9d0 3.85d0 3.08d0 3.15d0)
;	#(2.62d0 2.875d0 2.32d0 3.215d0 3.44d0)
;	#(16.46d0 17.02d0 18.61d0 19.44d0 17.02d0)
;	#*00110
;	#*11100
;	#(4 4 4 3 3)
;	#(4 4 1 1 2))

You can also return a subset of the columns by passing in a selection:

(columns mtcars-small '(mtcars:mpg mtcars:wt))
;; #(#(21 21 22.8d0 21.4d0 18.7d0)
;;   #(2.62d0 2.875d0 2.32d0 3.215d0 3.44d0))

Add columns

There are two ‘flavors’ of add functions, destructive and non-destructive. The latter return a new data frame as the result, and the destructive versions modify the data frame passed as a parameter. The destructive versions are denoted with a ‘!’ at the end of the function name.

To add a single column to a data frame, use the add-column! function. We’ll use a data frame similar to the one used in our reading data-frames from a string example to illustrate column operations.

(defparameter *d* (read-csv
		   (format nil "Gender,Age,Height
                              \"Male\",30,180
                              \"Male\",31,182
                              \"Female\",32,165
	                          \"Male\",22,167
	                          \"Female\",45,170")))
(pprint *d*)
;;   GENDER AGE HEIGHT
;; 0 Male    30    180
;; 1 Male    31    182
;; 2 Female  32    165
;; 3 Male    22    167
;; 4 Female  45    170

and add a ‘weight’ column to it:

(add-column! *d* 'weight #(75.2 88.5 49.4 78.1 79.4))

;;   GENDER AGE HEIGHT WEIGHT
;; 0 Male    30    180   75.2
;; 1 Male    31    182   88.5
;; 2 Female  32    165   49.4
;; 3 Male    22    167   78.1
;; 4 Female  45    170   79.4

now that we have weight, let’s add a BMI column to it to demonstrate using a function to compute the new column values:

(add-column! *d* 'bmi
	     (map-rows *d* '(height weight)
		       #'(lambda (h w) (/ w (square (/ h 100))))))
;;   GENDER AGE HEIGHT WEIGHT       BMI
;; 0 Male    30    180   75.2 23.209875
;; 1 Male    31    182   88.5 26.717787
;; 2 Female  32    165   49.4 18.145086
;; 3 Male    22    167   78.1 28.003874
;; 4 Female  45    170   79.4 27.474049

Now let’s add multiple columns destructively using add-columns!

(add-columns! *d* 'a #(1 2 3 4 5) 'b #(foo bar baz qux quux))

;;   GENDER AGE HEIGHT WEIGHT A B
;; 0 Male    30    180   75.2 1 FOO
;; 1 Male    31    182   88.5 2 BAR
;; 2 Female  32    165   49.4 3 BAZ
;; 3 Male    22    167   78.1 4 QUX
;; 4 Female  45    170   79.4 5 QUUX

(I removed the BMI column before creating this data frame to improve clarity)

Remove columns

Let’s remove the columns a and b that we just added above with the remove-columns function. Since it returns a new data frame, we’ll need to assign the return value to *d*:

(setf *d* (remove-columns *d* '(a b)))
;;   GENDER AGE HEIGHT WEIGHT
;; 0 Male    30    180   75.2
;; 1 Male    31    182   88.5
;; 2 Female  32    165   49.4
;; 3 Male    22    167   78.1
;; 4 Female  45    170   79.4

Rename columns

Sometimes data sources can have variable names that we want to change. To do this, use the substitute-key! function. This example will rename the ‘gender’ variable to ‘sex’:

(substitute-key! *d* 'sex 'gender)
; => #<ORDERED-KEYS WEIGHT, HEIGHT, AGE, SEX>

If you used define-data-frame to create your data frame, and this is the recommended way, then use the replace-key! macro to rename the column and update the variable references within the data package. Let’s use this now to rename the mtcars X1 variable to model. First a quick look at the first 2 rows as they are now:

(head mtcars 2)
;;   X1                 MPG CYL DISP  HP DRAT    WT  QSEC VS AM GEAR CARB
;; 0 Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
;; 1 Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4

Replace X1 with model:

(replace-key! mtcars model x1)

check that it worked:

(head mtcars 2)
;;   MODEL         MPG CYL DISP  HP DRAT    WT  QSEC VS AM GEAR CARB
;; 0 Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1    4    4
;; 1 Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1    4    4

We can now refer to mtcars:model

mtcars:model
#("Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" "Hornet Sportabout"
  "Valiant" "Duster 360" "Merc 240D" "Merc 230" "Merc 280" "Merc 280C"
  "Merc 450SE" "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
  "Lincoln Continental" "Chrysler Imperial" "Fiat 128" "Honda Civic"
  "Toyota Corolla" "Toyota Corona" "Dodge Challenger" "AMC Javelin"
  "Camaro Z28" "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2" "Lotus Europa"
  "Ford Pantera L" "Ferrari Dino" "Maserati Bora" "Volvo 142E")

Replace columns

Columns are “setf-able” places, and the simplest way to replace a column is to set the field to a new value. We’ll complement the sex field of *d*:

(df::setf (df:column *d* 'sex) #("Female" "Female" "Male" "Female" "Male"))
;#("Female" "Female" "Male" "Female" "Male")

Note that df::setf is not exported. This is an inherited (from Tamas Papp, aka ‘tkp’) behavior and likely because it bypasses checks on column length. Use this with caution.

You can also replace a column using two functions specifically for this purpose. Here we’ll replace the ‘age’ column with new values:

(replace-column *d* 'age #(10 15 20 25 30))
;;   SEX    AGE HEIGHT WEIGHT
;; 0 Female  10    180   75.2
;; 1 Female  15    182   88.5
;; 2 Male    20    165   49.4
;; 3 Female  25    167   78.1
;; 4 Male    30    170   79.4

That was a non-destructive replacement, and since we didn’t reassign the value of *d*, it is unchanged:

LS-USER> *d*
;;   SEX    AGE HEIGHT WEIGHT
;; 0 Female  30    180   75.2
;; 1 Female  31    182   88.5
;; 2 Male    32    165   49.4
;; 3 Female  22    167   78.1
;; 4 Male    45    170   79.4

We can also use the destructive version to make a permanent change instead of setf-ing *d*:

(replace-column! *d* 'age #(10 15 20 25 30))
;;   SEX    AGE HEIGHT WEIGHT
;; 0 Female  10    180   75.2
;; 1 Female  15    182   88.5
;; 2 Male    20    165   49.4
;; 3 Female  25    167   78.1
;; 4 Male    30    170   79.4

Transform columns

There are two functions for column transformations.

replace-column

replace-column can be used to transform a column by applying a function. This example will add 20 to each value of the age column:

(replace-column *d* 'age #'(lambda (x) (+ 20 x)))
;;   SEX    AGE HEIGHT WEIGHT
;; 0 Female  30    180   75.2
;; 1 Female  35    182   88.5
;; 2 Male    40    165   49.4
;; 3 Female  45    167   78.1
;; 4 Male    50    170   79.4

replace-column! can also apply functions to a column, destructively modifying the column.

map-columns

The map-columns function can be thought of as applying a function to all the values of each variable as a vector, rather than to the individual rows as replace-column does. To see this, we’ll use functions that operate on vectors, in this case nu:e+, which is the vector addition function for Lisp-Stat. Let’s see this working first:

(nu:e+ #(1 1 1) #(2 3 4))
; => #(3 4 5)

observe how the vectors were added element-wise. We’ll demonstrate map-columns by adding one to each of the numeric columns in the example data frame:

(map-columns (select *d* t '(weight age height))
	     #'(lambda (x)
		     (nu:e+ 1 x)))
;;   WEIGHT AGE HEIGHT 
;; 0   76.2  11    181
;; 1   89.5  16    183
;; 2   50.4  21    166
;; 3   79.1  26    168
;; 4   80.4  31    171

Recall that we used the non-destructive version of replace-column above, so *d* still has the original values. Also note the use of select to get the numeric variables from the data frame; e+ can’t add categorical values like gender/sex.

Row operations

As the name suggests, row operations operate on each row, or observation, of a data set.

count-rows

This function is used to determine how many rows meet a certain condition. For example, if you want to know how many cars have an MPG (miles-per-gallon) rating greater than 20, you could use:

(count-rows mtcars 'mtcars:mpg #'(lambda (x) (< 20 x)))
; => 14

do-rows

do-rows applies a function to selected variables. The function must take the same number of arguments as variables supplied. It is analogous to dotimes, but iterating over data frame rows. No values are returned; it is purely for side effects. Let’s create a new data-frame to illustrate row operations:

LS-USER> (defparameter *d2*
                       (make-df '(a b) '(#(1 2 3) #(10 20 30))))
*D2*
LS-USER> *d2*
;;   A  B
;; 0 1 10
;; 1 2 20
;; 2 3 30

This example uses format to illustrate iterating using do-rows for side effect:

(do-rows *d2* '(a b) #'(lambda (a b) (format t "~A " (+ a b))))
11 22 33
; No value

map-rows

Where map-columns can be thought of as working through the data frame column-by-column, map-rows goes through row-by-row. Here we add the values in each row of two columns:

(map-rows *d2* '(a b) #'+)
#(11 22 33)

Since the length of this vector will always be equal to the data-frame column length, we can add the results to the data frame as a new column. Let’s see this in a real-world pattern, subtracting the mean from a column:

(add-column! *d2* 'c
           (map-rows *d2* 'b
                     #'(lambda (x) (- x (mean (select *d2* t 'b))))))
;;   A  B     C
;; 0 1 10 -10.0
;; 1 2 20   0.0
;; 2 3 30  10.0

You could also have used replace-column! in a similar manner to replace a column with normalized values, as sketched below.
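
For example, a destructive centering of column b might look like this (a sketch; we capture the mean first so the element-wise function can use it):

(let ((mu (mean (column *d2* 'b))))
  (replace-column! *d2* 'b #'(lambda (x) (- x mu))))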

Create subsets

These examples assume you have loaded the mtcars Rdataset mentioned above into a variable named mtcars.

mask-rows

mask-rows is similar to count-rows, except it returns a bit-vector for rows matching the predicate. This is useful when you want to pass the bit vector to another function, like select to retrieve only the rows matching the predicate.

(mask-rows mtcars 'mtcars:mpg #'(lambda (x) (< 20 x)))
; => #*11110001100000000111100001110001

to make this into a filter:

(defparameter efficient-cars
  (select mtcars (mask-rows mtcars 'mtcars:mpg #'(lambda (x) (< 20 x))) t)
  "Cars with MPG > 20")

To view them we’ll need to call the pprint function directly instead of using the print-object function we installed earlier. Otherwise, we’ll only see the first 6.

(pprint efficient-cars)
;;    MODEL           MPG CYL  DISP  HP DRAT    WT  QSEC VS AM GEAR CARB
;;  0 Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
;;  1 Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
;;  2 Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
;;  3 Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
;;  4 Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
;;  5 Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
;;  6 Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
;;  7 Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
;;  8 Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
;;  9 Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
;; 10 Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
;; 11 Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
;; 12 Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
;; 13 Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

You can mask on multiple variables at the same time by using a predicate function that accepts the same number of arguments as the variables you supply.
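
For example, to mask the efficient four-cylinder cars you could combine two variables in the predicate (a sketch, assuming mask-rows accepts a list of keys in the same way as count-rows and map-rows):

(mask-rows mtcars '(mtcars:mpg mtcars:cyl)
           #'(lambda (mpg cyl) (and (< 20 mpg) (= 4 cyl))))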

The select system

select is a domain specific language (DSL) for slicing & dicing two dimensional data structures, including arrays and data frames. With select you can create data subsets by range, with sequence specifiers, bit masks and predicates. The select user manual documents this DSL.

For some additional examples of selecting columns, see manipulating columns.

Summarising data

Often the first thing you’ll want to do with a data frame is get a quick summary. You can do that with these functions, and we’ve seen most of them used in this manual. For more information about these functions, see the reference section.

nrow data-frame
return the number of rows in data-frame
ncol data-frame
return the number of columns in data-frame
dims data-frame
return the dimensions of data-frame as a list in (rows columns) format
keys data-frame
return a vector of symbols representing column names
column-names data-frame
returns a list of strings of the column names in data-frame
head data-frame &optional n
displays the first n rows of data-frame. n defaults to 6.
tail data-frame &optional n
displays the last n rows of data-frame. n defaults to 6.
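
For example (a quick sketch; the values shown are what we expect for the full mtcars data frame):

(nrow mtcars) ; => 32
(ncol mtcars) ; => 12
(dims mtcars) ; => (32 12)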

summary

summary data-frame
returns a summary of the variables in data-frame
  MTCARS:MPG
             32 reals, min=10.4d0, q25=15.399999698003132d0, q50=19.2d0,
             q75=22.8d0, max=33.9d0
  MTCARS:CYL
             14 (44%) x 8, 11 (34%) x 4, 7 (22%) x 6
  MTCARS:DISP
              32 reals, min=71.1d0, q25=120.65d0, q50=205.86666333675385d0,
              q75=334.0, max=472
  MTCARS:HP
            32 reals, min=52, q25=96.0, q50=123, q75=186.25, max=335
  MTCARS:DRAT
              32 reals, min=2.76d0, q25=3.08d0, q50=3.6950000000000003d0,
              q75=3.952000046730041d0, max=4.93d0
  MTCARS:WT
            32 reals, min=1.513d0, q25=2.5425d0, q50=3.325d0,
            q75=3.6766665957371387d0, max=5.424d0
  MTCARS:QSEC
              32 reals, min=14.5d0, q25=16.884999999999998d0, q50=17.71d0,
              q75=18.9d0, max=22.9d0
  MTCARS:VS bits, ones: 14 (44%)
  MTCARS:AM bits, ones: 13 (41%)
  MTCARS:GEAR
              15 (47%) x 3, 12 (38%) x 4, 5 (16%) x 5
  MTCARS:CARB
              10 (31%) x 4,
              10 (31%) x 2,
              7 (22%) x 1,
              3 (9%) x 3,
              1 (3%) x 6,
              1 (3%) x 8>

Note that the model column, which is essentially a row name, was deleted from the output when writing this manual. If the column had been named row-name, this would have happened automatically.

Missing values

Data sets often contain missing values and we need to both understand where and how many are missing, and how to transform or remove them for downstream operations. In Lisp-Stat, missing values are represented by the keyword symbol :na. You can control this encoding during delimited text import by passing an a-list containing the mapping. By default this is a keyword parameter map-alist:

(map-alist '(("" . :na)
             ("NA" . :na)))

The default maps blank cells ("") and ones containing “NA” to the missing value keyword :na. Some systems encode missing values as numeric, e.g. 99; in this case you can pass in a map-alist that includes this mapping:

(map-alist '(("" . :na)
             ("NA" . :na)
			 (99 . :na)))
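
You would then pass the extended mapping when importing, via the map-alist keyword parameter described above (a sketch; the file name is hypothetical):

(read-csv #P"LS:DATASETS;mydata.csv"       ; hypothetical file
          :map-alist '(("" . :na)
                       ("NA" . :na)
                       (99 . :na)))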

We will use the R air-quality dataset to illustrate working with missing values. Let’s load it now:

(define-data-frame aq
  (read-csv (rdata:rdata 'rdata:datasets 'rdata:airquality)))

Examine

To see missing values we use the predicate missingp. This works on sequences, arrays and data-frames. It returns a logical sequence, array or data-frame indicating which values are missing. T indicates a missing value, NIL means the value is present. Here’s an example of using missingp on a vector:

(missingp #(1 2 3 4 5 6 :na 8 9 10))
;#(NIL NIL NIL NIL NIL NIL T NIL NIL NIL)

and on a data-frame:

 (pprint (missingp aq))

;;     X3  OZONE SOLAR-R WIND TEMP MONTH DAY
;;   0 NIL   NIL     NIL NIL  NIL  NIL   NIL
;;   1 NIL   NIL     NIL NIL  NIL  NIL   NIL
;;   2 NIL   NIL     NIL NIL  NIL  NIL   NIL
;;   3 NIL   NIL     NIL NIL  NIL  NIL   NIL
;;   4 NIL     T       T NIL  NIL  NIL   NIL
;;   5 NIL   NIL       T NIL  NIL  NIL   NIL
;;   6 NIL   NIL     NIL NIL  NIL  NIL   NIL
;;   7 NIL   NIL     NIL NIL  NIL  NIL   NIL
;;   8 NIL   NIL     NIL NIL  NIL  NIL   NIL
;;   9 NIL     T     NIL NIL  NIL  NIL   NIL
;;  10 NIL   NIL       T NIL  NIL  NIL   NIL
;;  11 NIL   NIL     NIL NIL  NIL  NIL   NIL
;;  12 NIL   NIL     NIL NIL  NIL  NIL   NIL
;;  13 NIL   NIL     NIL NIL  NIL  NIL   NIL
;;  14 NIL   NIL     NIL NIL  NIL  NIL   NIL
;;  15 NIL   NIL     NIL NIL  NIL  NIL   NIL
;;  16 NIL   NIL     NIL NIL  NIL  NIL   NIL
;;  17 NIL   NIL     NIL NIL  NIL  NIL   NIL
;;  18 NIL   NIL     NIL NIL  NIL  NIL   NIL
;;  19 NIL   NIL     NIL NIL  NIL  NIL   NIL
;;  20 NIL   NIL     NIL NIL  NIL  NIL   NIL
;;  21 NIL   NIL     NIL NIL  NIL  NIL   NIL
;;  22 NIL   NIL     NIL NIL  NIL  NIL   NIL
;;  23 NIL   NIL     NIL NIL  NIL  NIL   NIL ..

We can see that the ozone variable contains some missing values. To see which rows of ozone are missing, we can use the which function:

(which aq:ozone :predicate #'missingp)
;#(4 9 24 25 26 31 32 33 34 35 36 38 41 42 44 45 51 52 53 54 55 56 57 58 59 60 64 71 74 82 83 101 102 106 114 118 149)

and to get a count, use the length function on this vector:

(length *) ; => 37

It’s often convenient to use the summary function to get an overview of missing values. We can do this because the missingp function is a transformation of a data-frame that yields another data-frame of boolean values:

(summary (missingp aq))
;#<DATA-FRAME (7 x 153)
;  AQ:X3
;        153 (100%) x NIL
;  AQ:OZONE
;           116 (76%) x NIL, 37 (24%) x T
;  AQ:SOLAR-R
;             146 (95%) x NIL, 7 (5%) x T
;  AQ:WIND
;          153 (100%) x NIL
;  AQ:TEMP
;          153 (100%) x NIL
;  AQ:MONTH
;           153 (100%) x NIL
;  AQ:DAY
;         153 (100%) x NIL>

we can see that ozone is missing 37 values, 24% of the total, and solar-r is missing 7 values.

Exclude

To exclude missing values from a single column, use the Common Lisp remove function:

(remove :na aq:ozone)
;#(41 36 12 18 28 23 19 8 7 16 11 14 18 14 34 6 30 11 1 11 4 32 ...

To ensure that our data-frame includes only complete observations, we exclude any row with a missing value. To do this use the drop-missing function:

(head (drop-missing aq))
;;   X3 OZONE SOLAR-R WIND TEMP MONTH DAY
;; 0  1    41     190  7.4   67     5   1
;; 1  2    36     118  8.0   72     5   2
;; 2  3    12     149 12.6   74     5   3
;; 3  4    18     313 11.5   62     5   4
;; 4  7    23     299  8.6   65     5   7
;; 5  8    19      99 13.8   59     5   8

Replace

To replace missing values we can use the transformation functions. For example we can recode the missing values in ozone by the mean. Let’s look at the first six rows of the air quality data-frame:

(head aq)
;;   X3 OZONE SOLAR-R WIND TEMP MONTH DAY
;; 0  1    41     190  7.4   67     5   1
;; 1  2    36     118  8.0   72     5   2
;; 2  3    12     149 12.6   74     5   3
;; 3  4    18     313 11.5   62     5   4
;; 4  5    NA      NA 14.3   56     5   5
;; 5  6    28      NA 14.9   66     5   6

Now replace ozone with the mean using the common lisp function nsubstitute:

(nsubstitute (mean (remove :na aq:ozone)) :na aq:ozone)

and look at head again:

(head aq)
;;   X3             OZONE SOLAR-R WIND TEMP MONTH DAY
;; 0  1           41.0000     190  7.4   67     5   1
;; 1  2           36.0000     118  8.0   72     5   2
;; 2  3           12.0000     149 12.6   74     5   3
;; 3  4           18.0000     313 11.5   62     5   4
;; 4  5           42.1293      NA 14.3   56     5   5
;; 5  6           28.0000      NA 14.9   66     5   6

You could have used the non-destructive substitute if you wanted to create a new data-frame and leave the original aq untouched.
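
For example, something like this sketch would build a new data frame and leave aq untouched, using the non-destructive replace-column from the column operations section:

(replace-column aq 'aq:ozone
                (substitute (mean (remove :na aq:ozone)) :na aq:ozone))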

Normally we would round the mean so the replacement values are consistent with the rest of the column, but we did not here so you can see which values were replaced.

Dates & times

There are several libraries for working with time. Of these, local-time is probably the best designed and supported and the one we recommend for using with Lisp-Stat. It builds on the basic date & time functions included in Common Lisp and allows you to:

  • print timestamps in various standard or custom formats (e.g. RFC1123 or RFC3339),
  • parse time strings,
  • perform time arithmetic,
  • convert Unix times, timestamps, and universal times to and fro.

local-time is available in Quicklisp.

4.3 - Plotting

Visualising data in Lisp-Stat

Lisp-Stat can render plots with text or Vega-Lite. Vega-Lite (VL) is a browser based plotting system based on a grammar of graphics language.

Plotting with text

Lisp-Stat includes text based plotting functions that are useful for visualising data in the REPL. cl-spark provides this functionality. The text/histogram function provides text based histograms. See that function for documentation.

Plotting with Vega-Lite

Although Vega-Lite can render plots in any browser that supports JavaScript, we found that the easiest integration is with Chrome, and we assume here this browser is available. It would work equally well in Electron, should someone want to pick up that integration.

Configuring a browser

You can configure a default browser in the file browser.lisp in the main system directory. The default is configured for Chrome, and this is the recommended browser. Browser command-line options can also be configured here.

Vega-Lite specification

Vega-Lite plots are specified with JSON that encodes mappings from data to the properties of the plot. In Lisp-Stat, the encodings are specified as ALISTs, and then transformed to Vega-Lite format with a JSON library. An ALIST is a convenient format, since this data structure is built into Common Lisp and therefore can be manipulated with standard functions.

The easiest way to see how a Lisp-Stat plot encoding looks is to decode one of the Vega-Lite examples. For example a simple bar chart from the JSON spec files looks like this in JSON:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
  "description": "A simple bar chart with embedded data.",
  "data": {
    "values": [
      {"a": "A", "b": 28}, {"a": "B", "b": 55}, {"a": "C", "b": 43},
      {"a": "D", "b": 91}, {"a": "E", "b": 81}, {"a": "F", "b": 53},
      {"a": "G", "b": 19}, {"a": "H", "b": 87}, {"a": "I", "b": 52}
    ]
  },
  "mark": "bar",
  "encoding": {
    "x": {"field": "a", "type": "nominal", "axis": {"labelAngle": 0}},
    "y": {"field": "b", "type": "quantitative"}
  }
}

and if we decode this with yason, using:

(reverse
 (yason:parse
  (dex:get "https://raw.githubusercontent.com/vega/vega-lite/master/examples/specs/bar.vl.json" :want-stream t)
  :object-as :alist
  :json-arrays-as-vectors t))

we get:

(("$schema" . "https://vega.github.io/schema/vega-lite/v5.json")
 ("description" . "A simple bar chart with embedded data.")
 ("data"
  ("values"
   . #((("b" . 28) ("a" . "A")) (("b" . 55) ("a" . "B"))
       (("b" . 43) ("a" . "C")) (("b" . 91) ("a" . "D"))
       (("b" . 81) ("a" . "E")) (("b" . 53) ("a" . "F"))
       (("b" . 19) ("a" . "G")) (("b" . 87) ("a" . "H"))
       (("b" . 52) ("a" . "I")))))
 ("mark" . "bar")
 ("encoding" ("y" ("type" . "quantitative") ("field" . "b"))
  ("x" ("axis" ("labelAngle" . 0)) ("type" . "nominal") ("field" . "a"))))

We can encode this alist back to the original JSON with:

(let ((yason:*list-encoder* 'yason:encode-alist))
  (yason:with-output-to-string* ()
    (yason:encode *)))

This mechanism is generic, and as you will see, we can build up an alist that corresponds to any Vega-Lite spec by manipulating the values in the alist. This is what the convenience functions (like bar-chart) in the vglt package do. Most of the time you will be working with the convenience functions.

Manipulating the spec

Let’s suppose that the width of the chart is too narrow. The Vega-Lite documentation page for customizing size tells us that adding a ‘width’ property will let us control this. For this, simply push the property onto the spec. Assuming that you have saved the specification into a variable named *plot*:

(pushnew '("width" . 300) *plot*)

and you are done. Sometimes the value you wish to manipulate is a bit deeper in the specification property hierarchy. For these cases you can use the access system, which provides a convenient mechanism to access these nested values. Say, for example, you wanted to add an ordering to the bar chart. To sort by another encoding channel, you need to add a ‘sort’ property to one of the channels. If we want to sort x by the value of the y field:

(pushnew '("sort" . "-y") (accesses *plot* :encoding :x))

You can use Common Lisp functions to retrieve or set values within the alist just like you would any other list to build up the plot specification.
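
For example, since the spec is just an alist, standard functions such as assoc work as expected (a quick sketch):

(cdr (assoc "mark" *plot* :test #'string=)) ; => "bar"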

Adding data

There are two ways to plot Lisp-Stat data in Vega-Lite:

  1. embed the data into the specification
  2. write the data to a file and use a data URL

Embedding data

To embed the data into the plot specification, use the vglt:df-to-alist function. This will transform a data frame into an alist format that can be embedded into the Vega-Lite specification. For example, let’s start with an empty variable spec, with only a schema in it. Here is how you would add data to it from a data-frame:

(setf spec (acons "data" `(("values" . ,(df-to-alist data-frame))) spec))

Writing data

For larger data sets, you probably want to save the data to a file or network location and use the Vega-Lite ‘url’ property in the specification. You can write data frames to streams or strings in Vega-Lite format using the vglt:df-to-vl function. You can also use the inverse of this function: vglt:vl-to-df to read a Vega-Lite data array into a data-frame. This is useful for obtaining sample data sets from the Vega-Lite ecosystem.

Rendering the plot

There are two steps to rendering a plot:

  • saving the specification to a file in HTML and JavaScript format
  • calling the browser to render the plot

The first step uses a back-end specific function. For example the Vega-Lite function for saving a plot is vglt:save-plot, the Plotly one (when available), would be plty:save-plot. The browser functionality is common across all backends that use a browser for rendering, and these are located in the plot package.

This example demonstrates rendering data from the Lisp-Stat notebook on categorical variables. First some quick boilerplate to set up the environment:

(ql:quickload :ips)         ; data examples
(ql:quickload :plot/vglt)   ; Vega-Lite plotting
(in-package :ips)
(defparameter online (read-csv (dex:get ips::eg01-07 :want-stream t)))
(defparameter online-bar-chart (vglt:bar-chart online "SOURCE" "COUNT"))

Now we can render the spec like so:

(plot:plot-from-file			        ; Common browser plotting
 (vglt:save-plot 'online-bar-chart))	; Vega-Lite specific save

You should see a new Chrome window containing the plot.

4.4 - Select

Selecting subsets of data

Overview

Select provides:

  1. An API for taking slices (elements selected by the Cartesian product of vectors of subscripts for each axis) of array-like objects. The most important function is select. Unless you want to define additional methods for select, this is pretty much all you need from this library. See the API reference for additional details.
  2. An extensible DSL for selecting a subset of valid subscripts. This is useful if, for example, you want to resolve column names in a data frame in your implementation of select.
  3. A set of utility functions for traversing selections in array-like objects.

It combines the functionality of dplyr’s slice and select methods.

Basic Usage

The most frequently used form is:

(select object selection1 selection2 ...)

where each selection specifies a set of subscripts along the corresponding axis. The selection specifications are found below.

Selection Specifiers

Selecting Single Values

A non-negative integer selects the corresponding index, while a negative integer selects an index counting backwards from the last index. For example:

(select #(0 1 2 3) 1)                  ; => 1
(select #(0 1 2 3) -2)                 ; => 2

These are called singleton slices. Each singleton slice drops the dimension: vectors become atoms, matrices become vectors, etc.
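
For example, selecting with two singletons from a matrix drops both dimensions and returns a single element (a quick illustration):

(select #2A((0 1 2)
            (3 4 5))
        1 1)                           ; => 4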

Selecting Ranges

(range start end) selects subscripts i where start <= i < end. When end is nil, the last index is included (cf. subseq). Each boundary is resolved according to the other rules, if applicable, so you can use negative integers:

(select #(0 1 2 3) (range 1 3))         ; => #(1 2)
(select #(0 1 2 3) (range 1 -1))        ; => #(1 2)

Selecting All Subscripts

t selects all subscripts:

(select #2A((0 1 2)
	        (3 4 5))
	 t 1)                           ; => #(1 4)

Selecting w/ Sequences

Sequences can be used to make specific selections from the object. For example:

(select #(0 1 2 3 4 5 6 7 8 9)
	(vector (range 1 3) 6 (range -2 -1))) ; => #(1 2 3 6 8 9)

(select #(0 1 2) '(2 2 1 0 0))                ; => #(2 2 1 0 0)

Masks

Bit Vectors

Bit vectors can be used to select elements of arrays and sequences as well:

(select #(0 1 2 3 4) #*00110)          ; => #(2 3)

Which

which returns a vector of the positions in SEQUENCE which satisfy PREDICATE.

(defparameter data
  #(12 127 28 42 39 113 42 18 44 118 44 37 113 124 37 48 127 36 29 31 125
   139 131 115 105 132 104 123 35 113 122 42 117 119 58 109 23 105 63 27
   44 105 99 41 128 121 116 125 32 61 37 127 29 113 121 58 114 126 53 114
   96 25 109 7 31 141 46 13 27 43 117 116 27 7 68 40 31 115 124 42 128 146
   52 71 118 117 38 27 106 33 117 116 111 40 119 47 105 57 122 109 124
   115 43 120 43 27 27 18 28 48 125 107 114 34 133 45 120 30 127 31 116))
(which data :predicate #'evenp)
; #(0 2 3 6 7 8 9 10 13 15 17 25 26 30 31 34 40 44 46 48 55 56 57 59 60 66 71 74
;  75 78 79 80 81 82 84 86 88 91 93 98 100 103 107 108 109 112 113 116 117 120)
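
The resulting index vector can be passed straight to select to pull out the matching elements (a sketch; only the first few elements of the result are shown):

(select data (which data :predicate #'evenp))
; => #(12 28 42 42 18 44 118 44 ...)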

Extensions

The previous section describes the core functionality. The semantics can be extended. The extensions in this section are provided by the library and prove useful in practice. Their implementations provide good examples of extending the library.

including is convenient if you want the selection to include the end of the range:

(select #(0 1 2 3) (including 1 2))
				    ; => #(1 2), cf. (select ... (range 1 3))

nodrop is useful if you do not want to drop dimensions:

(select #(0 1 2 3) (nodrop 2))
			; => #(2), cf. (select ... (range 2 3))

All of these are trivial to implement. If there is something you are missing, you can easily extend select. Pull requests are welcome.

(ref) is a version of (select) that always returns a single element, so it can only be used with singleton slices.

Select Semantics

Arguments of select, except the first one, are meant to be resolved using canonical-representation, in the select-dev package. If you want to extend select, you should define methods for canonical-representation. See the source code for the best examples. Below is a simple example that extends the semantics with ordinal numbers.

(defmacro define-ordinal-selection (number)
  (check-type number (integer 0))
  `(defmethod select-dev:canonical-representation
       ((axis integer) (select (eql ',(intern (format nil "~:@(~:r~)" number)))))
     (assert (< ,number axis))
     (select-dev:canonical-singleton ,number)))

(define-ordinal-selection 1)
(define-ordinal-selection 2)
(define-ordinal-selection 3)

(select #(0 1 2 3 4 5) (range 'first 'third)) ; => #(1 2)

Note the following:

  • The value returned by canonical-representation needs to be constructed using canonical-singleton, canonical-range, or canonical-sequence. You should not use the internal representation directly as it is subject to change.
  • You can assume that axis is an integer; this is the default. An object may define a more complex mapping (such as, for example, named rows & columns), but unless a method specialized to that is found, canonical-representation will just query its dimension (with axis-dimension) and try to find a method that works on integers.
  • You need to make sure that the subscript is valid, hence the assertion.

5 - Tutorials

End to end demonstrations of statistical analysis

These learning tutorials demonstrate how to perform end-to-end statistical analysis of sample data using Lisp-Stat. Sample data is provided for both the examples and the optional exercises. By completing these tutorials you will understand the tasks required for a typical statistical workflow.

5.1 - Basics

An introduction to the basics of LISP-STAT

Preface

This document is intended to be a tutorial introduction to the basics of LISP-STAT and is based on the original tutorial for XLISP-STAT written by Luke Tierney, updated for Common Lisp and the 2021 implementation of LISP-STAT.

LISP-STAT is a statistical environment built on top of the Common Lisp general purpose programming language. The first three sections contain the information you will need to do elementary statistical calculations and plotting. The fourth section introduces some additional methods for generating and modifying data. The fifth section describes some features of the user interface that may be helpful. The remaining sections deal with more advanced topics, such as interactive plots, regression models, and writing your own functions. All sections are organized around examples, and most contain some suggested exercises for the reader.

This document is not intended to be a complete manual. However, documentation for many of the commands that are available is given in the appendix. Brief help messages for these and other commands are also available through the interactive help facility described in Section 5.1 below.

Common Lisp (CL) is a dialect of the Lisp programming language, published in ANSI standard document ANSI INCITS 226-1994 (S20018) (formerly X3.226-1994 (R1999)). The Common Lisp language was developed as a standardized and improved successor of Maclisp. By the early 1980s several groups were already at work on diverse successors to MacLisp: Lisp Machine Lisp (aka ZetaLisp), Spice Lisp, NIL and S-1 Lisp. Common Lisp sought to unify, standardise, and extend the features of these MacLisp dialects. Common Lisp is not an implementation, but rather a language specification. Several implementations of the Common Lisp standard are available, including free and open-source software and proprietary products. Common Lisp is a general-purpose, multi-paradigm programming language. It supports a combination of procedural, functional, and object-oriented programming paradigms. As a dynamic programming language, it facilitates evolutionary and incremental software development, with iterative compilation into efficient run-time programs. This incremental development is often done interactively without interrupting the running application.

Using this Tutorial

The best way to learn about a new computer programming language is usually to use it. You will get most out of this tutorial if you read it at your computer and work through the examples yourself. To make this easier the named data sets used in this tutorial have been stored in the file tutorial.lisp in the LS:DATASETS folder of the system. To load this file, execute:

(load #P"LS:DATASETS;TUTORIAL")

at the command prompt (REPL). The file will be loaded and some variables will be defined for you.

Why LISP-STAT Exists

There are three primary reasons behind the decision to produce the LISP-STAT environment. The first is speed. The other major languages used for statistics and numerical analysis, R, Python and Julia, are all fine languages, but with the rise of ‘big data’ they require workarounds for processing large data sets. Furthermore, as interpreted languages, they are relatively slow when compared to Common Lisp, which has a compiler that produces native machine code.

Not only does Common Lisp provide a compiler that produces machine code, it has native threading, a rich ecosystem of code libraries, and a history of industrial deployments, including:

  • Credit card authorisation at AMEX (Authorizers Assistant)
  • US DoD logistics (and more, that we don’t know of)
  • CIA and NSA are big users based on Lisp sales
  • DWave and Rigetti use lisp for programming their quantum computers
  • Apple’s Siri was originally written in Lisp
  • Amazon got started with Lisp & C; so did Y-combinator
  • Google’s flight search engine is written in Common Lisp
  • AT&T used a stripped down version of Symbolics Lisp to process CDRs in the first IP switches

Python and R are never (to my knowledge) deployed as front-line systems; rather, they are used in the back office to produce models that are executed by other applications in enterprise environments. Common Lisp eliminates that friction.

Availability

Source code for LISP-STAT is available in the Lisp-Stat github repository. The Getting Started section of the documentation contains instructions for downloading and installing the system.

Disclaimer

LISP-STAT is an experimental program. It has not been extensively tested. The corporate sponsor, Symbolics Pte Ltd, takes no responsibility for losses or damages resulting directly or indirectly from the use of this program.

LISP-STAT is an evolving system. Over time new features will be introduced, and existing features that do not work may be changed. Every effort will be made to keep LISP-STAT consistent with the information in this tutorial, but if this is not possible the reference documentation should give accurate information about the current use of a command.

Starting and Finishing

Once you have obtained the source code or pre-built image, you can load Lisp-Stat using Quicklisp. If you do not have Quicklisp, stop here and get it; it is the de-facto package manager for Common Lisp and you will need it. This is what you will see when loading Lisp-Stat using the Slime IDE:

CL-USER> (ql:quickload :lisp-stat)
To load "lisp-stat":
  Load 1 ASDF system:
    lisp-stat
; Loading "lisp-stat"
..................................................
..................................................
[package num-utils]...............................
[package num-utils]...............................
[package dfio.decimal]............................
[package dfio.string-table].......................
.....
(:LISP-STAT)
CL-USER>

You may see more or less output, depending on whether dependent packages have been compiled before. If this is your first time running anything in this implementation of Common Lisp, you will probably see output related to the compilation of every module in the system. This could take a while, but only has to be done once.

Once finished, to use the functions provided, you need to make LS-USER (the LISP-STAT user package) the current package, like this:

(in-package :ls-user)
#<PACKAGE "LS-USER">
LS-USER>

The final LS-USER> in the window is the Slime prompt. Notice how it changed when you executed (in-package). In Slime, the prompt always indicates the current package, *package*. Any characters you type while the prompt is active will be added to the line after the final prompt. When you press return, LISP-STAT will try to interpret what you have typed and will print a response. For example, if you type a 1 and press return then LISP-STAT will respond by simply printing a 1 on the following line and then give you a new prompt:

    LS-USER> 1
    1
    LS-USER>

If you type an expression like (+ 1 2), then LISP-STAT will print the result of evaluating the expression and give you a new prompt:

    LS-USER> (+ 1 2)
    3
    LS-USER>

As you have probably guessed, this expression means that the numbers 1 and 2 are to be added together. The next section will give more details on how LISP-STAT expressions work. In this tutorial I will always show interactions with the program as I have done here: The LS-USER> prompt will appear before lines you should type. LISP-STAT will supply this prompt when it is ready; you should not type it yourself. In later sections I will omit the new prompt following the result in order to save space.

Now that you have seen how to start up LISP-STAT it is a good idea to make sure you know how to get out. The exact command to exit depends on the Common Lisp implementation you use. For SBCL, you can type the expression

    LS-USER> (exit)

In other implementations, the command is quit. One of these methods should cause the program to exit and return you to the IDE. In Slime, you can use the , short-cut and then type sayoonara.

The Basics

Before we can start to use LISP-STAT for statistical work we need to learn a little about the kind of data LISP-STAT uses and about how the LISP-STAT listener and evaluator work.

Data

LISP-STAT works with two kinds of data: simple data and compound data. Simple data are numbers

1                   ; an integer
-3.14               ; a floating point number
#C(0 1)             ; a complex number (the imaginary unit)

logical values

T                   ; true
nil                 ; false

strings (always enclosed in double quotes)

"This is a string 1 2 3 4"

and symbols (used for naming things; see the following section)

x
x12
12x
this-is-a-symbol

Compound data are lists

(this is a list with 7 elements)
(+ 1 2 3)
(sqrt 2)

or vectors

#(this is a vector with 7 elements)
#(1 2 3)

Higher dimensional arrays are another form of compound data; they will be discussed below in Section 9, “Arrays”.

All the examples given above can be typed directly into the command window as they are shown here. The next subsection describes what LISP-STAT will do with these expressions.

Data Frame

A data frame is a collection of name/data pairs. If you have used R, then you’ll already be familiar with this concept. To create a data frame from a name and a value (called a plist, or property-list):

(plist-df '(name #(1 2 3)))
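A data frame can hold any number of such pairs. As a sketch, a two-column data frame might be built like this (the column names and values here are invented for illustration, and the printed representation may differ between versions):

(plist-df '(age    #(23 31 47)
            height #(1.65 1.82 1.70)))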

The Listener and the Evaluator

A session with LISP-STAT basically consists of a conversation between you and the listener. The listener is the window into which you type your commands. When it is ready to receive a command it gives you a prompt. At the prompt you can type in an expression. You can use the mouse or the backspace key to correct any mistakes you make while typing in your expression. When the expression is complete and you type a return the listener passes the expression on to the evaluator. The evaluator evaluates the expression and returns the result to the listener for printing.1 The evaluator is the heart of the system.

The basic rule to remember in trying to understand how the evaluator works is that everything is evaluated. Numbers and strings evaluate to themselves:

LS-USER> 1
1
LS-USER> "Hello"
"Hello"
LS-USER>

Lists are more complicated. Suppose you type the list (+ 1 2 3) at the listener. This list has four elements: the symbol + followed by the numbers 1, 2 and 3. Here is what happens:

> (+ 1 2 3)
6
>

This list is evaluated as a function application. The first element is a symbol representing a function, in this case the symbol + representing the addition function. The remaining elements are the arguments. Thus the list in the example above is interpreted to mean “Apply the function + to the numbers 1, 2 and 3”.

Actually, the arguments to a function are always evaluated before the function is applied. In the previous example the arguments are all numbers and thus evaluate to themselves. On the other hand, consider

LS-USER> (+ (* 2 3) 4)
10
LS-USER>

The evaluator has to evaluate the first argument to the function + before it can apply the function.

Occasionally you may want to tell the evaluator not to evaluate something. For example, suppose we wanted to get the evaluator to simply return the list (+ 1 2) back to us, instead of evaluating it. To do this we need to quote our list:

LS-USER> (quote (+ 1 2))
(+ 1 2)
LS-USER>

quote is not a function. It does not obey the rules of function evaluation described above: Its argument is not evaluated. quote is called a special form – special because it has special rules for the treatment of its arguments. There are a few other special forms that we will need; I will introduce them as they are needed. Together with the basic evaluation rules described here these special forms make up the basics of the Lisp language. The special form quote is used so often that a shorthand notation has been developed, a single quote before the expression you want to quote:

LS-USER> '(+ 1 2)      ; single quote shorthand

This is equivalent to (quote (+ 1 2)). Note that there is no matching quote following the expression.

By the way, the semicolon ; is the Lisp comment character. Anything you type after a semicolon up to the next time you press return is ignored by the evaluator.

Exercises

For each of the following expressions try to predict what the evaluator will return. Then type them in, see what happens and try to explain any differences.

  1. (+ 3 5 6)

  2. (+ (- 1 2) 3)

  3. '(+ 3 5 6)

  4. '( + (- 1 2) 3)

  5. (+ (- (* 2 3) (/ 6 2)) 7)

  6. 'x

Remember, to quit from LISP-STAT type (exit), quit or use the IDE’s exit mechanism.

Elementary Statistical Operations

This section introduces some of the basic graphical and numerical statistical operations that are available in LISP-STAT.

First Steps

Statistical data usually consists of groups of numbers. Devore and Peck [@DevorePeck Exercise 2.11] describe an experiment in which 22 consumers reported the number of times they had purchased a product during the previous 48 week period. The results are given as a table:


0   2   5   0   3   1   8   0   3   1   1
9   2   4   0   2   9   3   0   1   9   8

To examine this data in LISP-STAT we represent it as a list of numbers using the list function:

(list 0 2 5 0 3 1 8 0 3 1 1 9 2 4 0 2 9 3 0 1 9 8)

Note that the numbers are separated by white space (spaces, tabs or even returns), not commas.

The mean function can be used to compute the average of a list of numbers. We can combine it with the list function to find the average number of purchases for our sample:

(mean '(0 2 5 0 3 1 8 0 3 1 1 9 2 4 0 2 9 3 0 1 9 8)) ; => 3.227273

The median of these numbers can be computed as

(median '(0 2 5 0 3 1 8 0 3 1 1 9 2 4 0 2 9 3 0 1 9 8)) ; => 2

It is of course a nuisance to have to type in the list of 22 numbers every time we want to compute a statistic for the sample. To avoid having to do this I will give this list a name using the def special form 2:

LS-USER> (def purchases (list 0 2 5 0 3 1 8 0 3 1 1 9 2 4 0 2 9 3 0 1 9 8))
PURCHASES
LS-USER>

Now the symbol purchases has a value associated with it: Its value is our list of 22 numbers. If you give the symbol purchases to the evaluator then it will find the value of this symbol and return that value:

LS-USER> purchases
(0 2 5 0 3 1 8 0 3 1 1 9 2 4 0 2 9 3 0 1 9 8)

We can now easily compute various numerical descriptive statistics for this data set:

LS-USER> (mean purchases)
3.227273
LS-USER> (median purchases)
2
LS-USER> (standard-deviation purchases)
3.279544
LS-USER> (interquartile-range purchases)
3.5

LISP-STAT also supports elementwise arithmetic operations on vectors of numbers. Technically, redefining, or ‘shadowing’, any of the standard Common Lisp functions is undefined behaviour. This is usually a euphemism for ‘something really bad will happen’, so the vector functions are located in the package elmt and prefixed by e to distinguish them from the Common Lisp variants, e.g. e+ for addition, e* for multiplication, etc. Presently these functions work only on vectors, so we’ll define a new purchases variable as a vector:

(def purchases-2 #(0 2 5 0 3 1 8 0 3 1 1 9 2 4 0 2 9 3 0 1 9 8))

The #( prefix tells the reader to interpret the literal as a vector, much as the ' signals a quoted list.

Now we can add 1 to each of the purchases:

LS-USER> (e+ 1 purchases-2)
(1 3 6 1 4 2 9 1 4 2 2 10 3 5 1 3 10 4 1 2 10 9)

and after adding 1 we can compute the natural logarithms of the results:

LS-USER> (elog (e+ 1 purchases-2))
(0 1.098612 1.791759 0 1.386294 0.6931472 2.197225 0 1.386294 0.6931472
0.6931472 2.302585 1.098612 1.609438 0 1.098612 2.302585 1.386294 0
0.6931472 2.302585 2.197225)

Exercises

For each of the following expressions try to predict what the evaluator will return. Then type them in, see what happens and try to explain any differences.

  1. (mean (list 1 2 3))

  2. (e+ #(1 2 3) 4)

  3. (e* #(1 2 3) #(4 5 6))

  4. (e+ #(1 2 3) #(4 5 7))

Summary Statistics

Devore and Peck [@DevorePeck page 54, Table 10] give precipitation levels recorded during the month of March in the Minneapolis - St. Paul area over a 30 year period. Let’s enter these data into LISP-STAT with the name precipitation:

(def precipitation
    #(.77 1.74 .81 1.20 1.95 1.20 .47 1.43 3.37 2.20 3.30
     3.09 1.51 2.10 .52 1.62 1.31 .32 .59 .81 2.81 1.87
     1.18 1.35 4.75 2.48 .96 1.89 .90 2.05))

In typing the expression above I have inserted returns and tabs a few times in order to make the expression easier to read; the tab key indents the next line to a reasonable point.

Here are some numerical summaries:

LS-USER> (mean precipitation)
1.685
LS-USER> (median precipitation)
1.47
LS-USER> (standard-deviation precipitation)
1.0157
LS-USER> (interquartile-range precipitation)
1.145

The distribution of this data set is somewhat skewed to the right. Notice the separation between the mean and the median. You might want to try a few simple transformations to see if you can symmetrize the data. Square root and log transformations can be computed using the expressions

(esqrt precipitation)

and

(elog precipitation)

You should look at plots of the data to see if these transformations do indeed lead to a more symmetric shape. The means and medians of the transformed data are:

    LS-USER> (mean (esqrt precipitation))
    1.243006
    LS-USER> (median (esqrt precipitation))
    1.212323
    LS-USER> (mean (elog precipitation))
    0.3405517
    LS-USER> (median (elog precipitation))
    0.384892

Plots

For this section we’ll be using the Vega-Lite plotting back-end. Load it like this:

(ql:quickload :plot/vglt)

The histogram and box-plot functions can be used to obtain graphical representations of this data set:

(vglt:plot
	(vglt:histogram
		(plist-df `(x ,precipitation)) "X" :title "Histogram of precipitation levels"))

Note how we converted the precipitation data into a data-frame before passing it to the histogram function. This is because plotting functions work on data frames. Also note the way the data frame was constructed using the plist-df function. When I first showed you an example of constructing a data frame:

(plist-df '(name #(1 2 3)))

the second value of the plist was a vector. In the histogram plot, the second value is a variable:

(plist-df `(x ,precipitation))

If you entered this into the evaluator (REPL) without the back-quote and comma:

(plist-df '(x precipitation))

you would get an error. This is because within a quoted list, precipitation is just a symbol, and plist-df expects the vector that precipitation stands for, in other words its value. To get the value we use a sort of template mechanism that starts with the back-quote character. Within a list that starts with this character, a comma signals to the evaluator to put the value of the symbol there, not the symbol itself. The easiest way to see this is to type both into the evaluator:

LS-USER> '(x precipitation)
(X PRECIPITATION)
LS-USER> `(x ,precipitation)
(X
 #(0.77 1.74 0.81 1.2 1.95 1.2 0.47 1.43 3.37 2.2 3.3 3.09 1.51 2.1 0.52 1.62
   1.31 0.32 0.59 0.81 2.81 1.87 1.18 1.35 4.75 2.48 0.96 1.89 0.9 2.05))

Note each graph is saved to an HTML file in your system cache directory. This location will vary depending on your operating system. On MS Windows, it will be in %LOCALAPPDATA%/cache. You can view or edit the plots directly if you like.

Let’s try a box plot:

(vglt:plot
 (vglt::box-plot
  (plist-df `(x ,precipitation)) nil "X" :title "Boxplot of precipitation levels"))

The box-plot function can also be used to produce parallel box-plots of two or more samples; with the Vega-Lite back-end this is done by passing a data frame and naming the column that identifies the groups, as in the example below.

As an example, let’s use this function to compare the fuel consumption for various automobile types. The data comes from the R ggplot library and we load it like this:

(defparameter mpg
	(read-csv
		(rdata:rdata 'rdata:ggplot2 'rdata:mpg)))

The parallel box-plot is obtained by:

(vglt:plot
	  (vglt:box-plot mpg "CLASS" "HWY"
	                 :title "Boxplot of fuel consumption"))

Exercises

The following exercises involve examples and problems from Devore and Peck. The data sets are in files in the folder Datasets in the LISP-STAT distribution directory and can be read in using the load command. The short cut for the Datasets directory is LS:DATASETS, so to load car-prices, type:

(load #P"LS:DATASETS;car-prices")

at the REPL. The file will be loaded and some variables will be defined for you. Loading file car-prices.lisp will define the single variable car-prices. Loading file heating.lisp will define two variables, gas-heat and electric-heat.3

  1. Devore and Peck [@DevorePeck page 18, Example 2] give advertised prices for a sample of 50 used Japanese subcompact cars. Create a data-frame and obtain some plots and summary statistics for this data. Experiment with some transformations of the data as well. The data set is called car-prices in the file car-prices.lisp. The prices are given in units of $1000; thus the price 2.39 represents $2390. The data have been sorted by their leading digit.

  2. In Exercise 2.40 Devore and Peck [@DevorePeck] give heating costs for a sample of apartments heated by gas and a sample of apartments heated by electricity. Create a data-frame and obtain plots and summary statistics for these samples separately and look at a parallel box plot for the two samples. These data sets are called gas-heat and electric-heat in the file heating.lisp.

Generating and Modifying Data

This section briefly summarizes some techniques for generating random and systematic data.

Generating Random Data

The state of the internal random number generator can be “randomly” reseeded, and the current value of the generator state can be saved. The mechanism used is the standard Common Lisp mechanism. The current random state is held in the variable *random-state*. The function make-random-state can be used to set and save the state. It takes an optional argument. If the argument is NIL or omitted make-random-state returns a copy of the current value of *random-state*. If the argument is a state object, a copy of it is returned. If the argument is t a new, “randomly” initialized state object is produced and returned. 4
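As a brief sketch of how this works in practice (the variable name *saved-state* is arbitrary, and the numbers drawn will differ on your system):

(defparameter *saved-state* (make-random-state))  ; copy of the current generator state
(random 100)                                      ; draw a number, advancing *random-state*
(setf *random-state* (make-random-state *saved-state*))
(random 100)                                      ; repeats the draw made above
(setf *random-state* (make-random-state t))       ; "randomly" reseed the generator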

Forming Subsets and Deleting Cases

The select function allows you to select a single element or a group of elements from a list or vector. For example, if we define x by

(def x (list 3 7 5 9 12 3 14 2))

then (select x i) will return the ith element of x. Common Lisp, like the language C, but in contrast to FORTRAN, numbers the elements of lists and vectors starting at zero. Thus the indices for the elements of x are 0, 1, 2, 3, 4, 5, 6, 7. So

LS-USER> (select x 0)
3
LS-USER> (select x 2)
5

To get a group of elements at once we can use a list of indices instead of a single index:

LS-USER> (select x (list 0 2))
(3 5)

If you want to select all elements of x except element 2 you can use the expression

(remove 2 (iota 8))

as the second argument to the function select:

LS-USER> (remove 2 (iota 8))
(0 1 3 4 5 6 7)
LS-USER> (select x (remove 2 (iota 8)))
(3 7 9 12 3 14 2)

Combining Lists & Vectors

At times you may want to combine several short lists or vectors into a single longer one. This can be done using the append function. For example, if you have three variables x, y and z constructed by the expressions

(def x (list 1 2 3))
(def y (list 4))
(def z (list 5 6 7 8))

then the expression

(append x y z)

will return the list

(1 2 3 4 5 6 7 8).
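Vectors can be combined in the same spirit with the standard Common Lisp concatenate function (this is plain Common Lisp, not something specific to LISP-STAT):

(concatenate 'vector #(1 2 3) #(4) #(5 6 7 8)) ; => #(1 2 3 4 5 6 7 8)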

Modifying Data

So far when I have asked you to type in a list of numbers I have been assuming that you will type the list correctly. If you made an error you had to retype the entire def expression. Since you can use cut–and–paste this is really not too serious. However it would be nice to be able to replace the values in a list after you have typed it in. The setf special form is used for this. Suppose you would like to change the 12 in the list x used in Section 4.3 above to 11. The expression

(setf (select x 4) 11)

will make this replacement:

LS-USER> (setf (select x 4) 11)
11
LS-USER> x
(3 7 5 9 11 3 14 2)

The general form of setf is

(setf form value)

where form is the expression you would use to select a single element or a group of elements from x and value is the value you would like that element to have, or the list of the values for the elements in the group. Thus the expression

(setf (select x (list 0 2)) (list 15 16))

changes the values of elements 0 and 2 to 15 and 16:

LS-USER> (setf (select x (list 0 2)) (list 15 16))
(15 16)
LS-USER> x
(15 7 16 9 11 3 14 2)

One caution about setf: it changes the data itself, not merely the symbol that refers to it. If you had earlier typed (def y x), then x and y would refer to the same item. As a result, if we change an element of (the item referred to by) x with setf then we are also changing the element of (the item referred to by) y, since both x and y refer to the same item. If you want to make a copy of x and store it in y before you make changes to x then you must do so explicitly using, say, the copy-list function. The expression

(def y (copy-list x))

will make a copy of x and set the value of y to that copy. Now x and y refer to different items and changes to x will not affect y.

Useful Shortcuts

This section describes some additional features of LISP-STAT that you may find useful.

Getting Help

Online help is available for many of the functions in LISP-STAT 5. As an example, here is how you would get help for the function iota:

LS-USER> (documentation 'iota 'function)
"Return a list of n numbers, starting from START (with numeric contagion
from STEP applied), each consecutive number being the sum of the previous one
and STEP. START defaults to 0 and STEP to 1.

Examples:

  (iota 4)                      => (0 1 2 3)
  (iota 3 :start 1 :step 1.0)   => (1.0 2.0 3.0)
  (iota 3 :start -1 :step -1/2) => (-1 -3/2 -2)
"

Note the quote in front of iota. documentation is itself a function, and its argument is the symbol representing the function iota. To make sure documentation receives the symbol, not the value of the symbol, you need to quote the symbol.

Another useful function is describe that, depending on the Lisp implementation, will return documentation and additional information about the object:

LS-USER> (describe 'iota)
ALEXANDRIA:IOTA
  [symbol]

IOTA names a compiled function:
  Lambda-list: (ALEXANDRIA::N &KEY (ALEXANDRIA::START 0) (STEP 1))
  Derived type: (FUNCTION
                 (UNSIGNED-BYTE &KEY (:START NUMBER) (:STEP NUMBER))
                 (VALUES T &OPTIONAL))
  Documentation:
    Return a list of n numbers, starting from START (with numeric contagion
    from STEP applied), each consecutive number being the sum of the previous one
    and STEP. START defaults to 0 and STEP to 1.

    Examples:

      (iota 4)                      => (0 1 2 3)
      (iota 3 :start 1 :step 1.0)   => (1.0 2.0 3.0)
      (iota 3 :start -1 :step -1/2) => (-1 -3/2 -2)

  Inline proclamation: INLINE (inline expansion available)
  Source file: s:/src/third-party/alexandria/alexandria-1/numbers.lisp

If you are not sure about the name of a function you may still be able to get some help. Suppose you want to find out about functions related to the normal distribution. Most such functions will have “norm” as part of their name. The expression

(apropos 'norm)

will print all symbols whose names contain the string “norm”, along with a note of whether each symbol is bound or fbound:

ALEXANDRIA::NORMALIZE
ALEXANDRIA::NORMALIZE-AUXILARY
ALEXANDRIA::NORMALIZE-KEYWORD
ALEXANDRIA::NORMALIZE-OPTIONAL
ASDF/PARSE-DEFSYSTEM::NORMALIZE-VERSION (fbound)
ASDF/FORCING:NORMALIZE-FORCED-NOT-SYSTEMS (fbound)
ASDF/FORCING:NORMALIZE-FORCED-SYSTEMS (fbound)
ASDF/SESSION::NORMALIZED-NAMESTRING
ASDF/SESSION:NORMALIZE-NAMESTRING (fbound)
CL-INTERPOL::NORMAL-NAME-CHAR-P (fbound)
CL-PPCRE::NORMALIZE-VAR-LIST (fbound)
DISTRIBUTIONS::+NORMAL-LOG-PDF-CONSTANT+ (bound, DOUBLE-FLOAT)
DISTRIBUTIONS::CDF-NORMAL% (fbound)
DISTRIBUTIONS::COPY-LEFT-TRUNCATED-NORMAL (fbound)
DISTRIBUTIONS::COPY-R-LOG-NORMAL (fbound)
DISTRIBUTIONS::COPY-R-NORMAL (fbound)
DISTRIBUTIONS::DRAW-LEFT-TRUNCATED-STANDARD-NORMAL (fbound)
DISTRIBUTIONS::LEFT-TRUNCATED-NORMAL (fbound)
DISTRIBUTIONS::LEFT-TRUNCATED-NORMAL-ALPHA (fbound)
DISTRIBUTIONS::LEFT-TRUNCATED-NORMAL-LEFT (fbound)
DISTRIBUTIONS::LEFT-TRUNCATED-NORMAL-LEFT-STANDARDIZED (fbound)
DISTRIBUTIONS::LEFT-TRUNCATED-NORMAL-M0 (fbound)
DISTRIBUTIONS::LEFT-TRUNCATED-NORMAL-MU (fbound)
DISTRIBUTIONS::LEFT-TRUNCATED-NORMAL-P (fbound)
DISTRIBUTIONS::LEFT-TRUNCATED-NORMAL-SIGMA (fbound)
DISTRIBUTIONS::MAKE-LEFT-TRUNCATED-NORMAL (fbound)
DISTRIBUTIONS::MAKE-R-LOG-NORMAL (fbound)
DISTRIBUTIONS::MAKE-R-NORMAL (fbound)
DISTRIBUTIONS::QUANTILE-NORMAL% (fbound)
DISTRIBUTIONS::R-LOG-NORMAL-LOG-MEAN (fbound)
...

Let me briefly explain the notation used in the information printed by describe regarding the arguments a function expects 6. This is called the lambda-list. Most functions expect a fixed set of arguments, described in the help message by a line like Args: (x y z) or Lambda-list: (x y z).

Some functions can take one or more optional arguments. The arguments for such a function might be listed as

Args: (x &optional y (z t))

or

Lambda-list: (x &optional y (z t))

This means that x is required and y and z are optional. If the function is named f, it can be called as (f x-val), (f x-val y-val) or (f x-val y-val z-val). The list (z t) means that if z is not supplied its default value is T. No explicit default value is specified for y; its default value is therefore NIL. The arguments must be supplied in the order in which they are listed. Thus if you want to give the argument z you must also give a value for y.
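As a small illustration (the function scale and its default are invented for this example, not part of LISP-STAT):

(defun scale (x &optional (factor 2))
  (* x factor))

(scale 3)    ; => 6, FACTOR defaults to 2
(scale 3 10) ; => 30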

Another form of optional argument is the keyword argument. The iota function for example takes arguments

Args: (N &key (START 0) (STEP 1))

The n argument is required; START and STEP are optional keyword arguments, defaulting to 0 and 1 respectively. If you want to create a sequence of eight numbers with a step of two, use the expression

(iota 8 :step 2)

Thus to give a value for a keyword argument you give the keyword 7 for the argument, a symbol consisting of a colon followed by the argument name, and then the value for the argument. If a function can take several keyword arguments then these may be specified in any order, following the required and optional arguments.
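You can define keyword arguments in your own functions the same way. Here is a sketch (the function rescale is invented for illustration, and it assumes, as in the e+ example above, that the elementwise functions accept a scalar together with a vector):

(defun rescale (x &key (shift 0) (scale 1))
  (e+ shift (e* scale x)))

(rescale #(1 2 3))                    ; both defaults
(rescale #(1 2 3) :scale 10)          ; multiply each element by 10
(rescale #(1 2 3) :shift 5 :scale 10) ; keyword arguments may appear in any order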

Finally, some functions can take an arbitrary number of arguments. This is denoted by a line like

Args: (x &rest args)

The argument x is required, and zero or more additional arguments can be supplied.

In addition to providing information about functions describe also gives information about data types and certain variables. For example,

LS-USER> (describe 'complex)
COMMON-LISP:COMPLEX
  [symbol]

COMPLEX names a compiled function:
  Lambda-list: (REALPART &OPTIONAL (IMAGPART 0))
  Declared type: (FUNCTION (REAL &OPTIONAL REAL)
                  (VALUES NUMBER &OPTIONAL))
  Derived type: (FUNCTION (T &OPTIONAL T)
                 (VALUES
                  (OR RATIONAL (COMPLEX SINGLE-FLOAT)
                      (COMPLEX DOUBLE-FLOAT) (COMPLEX RATIONAL))
                  &OPTIONAL))
  Documentation:
    Return a complex number with the specified real and imaginary components.
  Known attributes: foldable, flushable, unsafely-flushable, movable
  Source file: SYS:SRC;CODE;NUMBERS.LISP

COMPLEX names the built-in-class #<BUILT-IN-CLASS COMMON-LISP:COMPLEX>:
  Class precedence-list: COMPLEX, NUMBER, T
  Direct superclasses: NUMBER
  Direct subclasses: SB-KERNEL:COMPLEX-SINGLE-FLOAT,
                     SB-KERNEL:COMPLEX-DOUBLE-FLOAT
  Sealed.
  No direct slots.

COMPLEX names a primitive type-specifier:
  Lambda-list: (&OPTIONAL (SB-KERNEL::TYPESPEC '*))

shows the function, type and class documentation for complex, and

LS-USER> (documentation 'pi 'variable)
PI                                                              [variable-doc]
The floating-point number that is approximately equal to the ratio of the
circumference of a circle to its diameter.

shows the variable documentation for pi 8.

Listing and Undefining Variables

After you have been working for a while you may want to find out what variables you have defined (using def). The function variables will produce a listing:

LS-USER> (variables)
CO
HC
RURAL
URBAN
PRECIPITATION
PURCHASES
NIL
LS-USER>

If you are working with very large variables you may occasionally want to free up some space by getting rid of some variables you no longer need. You can do this using the undef function:

LS-USER> (undef 'co)
CO
LS-USER> (variables)
HC
RURAL
URBAN
PRECIPITATION
PURCHASES
NIL
LS-USER>

More on the Listener

Common Lisp provides a simple command history mechanism. The symbols -, *, **, ***, +, ++, and +++ are used for this purpose. The top level reader binds these symbols as follows:


  `-` the current input expression
  `+` the last expression read
 `++` the previous value of `+`
`+++` the previous value of `++`
  `*` the result of the last evaluation
 `**` the previous value of `*`
`***` the previous value of `**`

The variables `*`, `**` and `***` are probably most useful.

For example, if you read a data-frame but forget to assign the resulting object to a variable:

(read-csv (rdata:rdata 'rdata:datasets 'rdata:mtcars))
WARNING: Missing column name was filled in
#<DATA-FRAME (32 observations of 12 variables)>

you can recover it using one of the history variables:

(def mtcars *)
; MTCARS

The symbol MTCARS now has the data-frame object as its value.

Like most interactive systems, Common Lisp needs a system for dynamically managing memory. The system used depends on the implementation. The most common way (SBCL, CCL) is to grab memory out of a fixed bin until the bin is exhausted. At that point the system pauses to reclaim memory that is no longer being used. This process, called garbage collection, will occasionally cause the system to pause if you are using large amounts of memory.

Loading Files

The data for the examples and exercises in this tutorial, when not loaded from the network, have been stored on files with names ending in .lisp. In the LISP-STAT system directory they can be found in the folder Datasets. Any variables you save (see the next subsection for details) will also be saved in files of this form. The data in these files can be read into LISP-STAT with the load function. To load a file named randu.lisp type the expression

(load #P"LS:DATASETS;RANDU.LISP")

or just

(load "randu")

If you give load a name that does not end in .lisp then load will add this suffix.

Saving Your Work

If you want to record a session with LISP-STAT you can do so using the dribble function. The expression

(dribble "myfile")

starts a recording. All expressions typed by you and all results printed by LISP-STAT will be entered into the file named myfile. The expression

(dribble)

stops the recording. Note that (dribble "myfile") starts a new file by the name myfile. If you already have a file by that name its contents will be lost. Thus you can’t use dribble to toggle on and off recording to a single file.

dribble only records text that is typed, not plots. However, you can use the buttons displayed on a plot to save in SVG or PNG format. The original HTML plots are saved in your operating system’s cache directory and can be viewed again until the cache is cleared during a system reboot.

Variables you define in LISP-STAT only exist for the duration of the current session. If you quit from LISP-STAT your data will be lost. To preserve your data you can use the savevar function. This function allows you to save one or more variables into a file. Again a new file is created and any existing file by the same name is destroyed. To save the variable precipitation in a file called precipitation.lisp type

(savevar 'precipitation "precipitation")

Do not add the .lisp suffix yourself; savevar will supply it. To save the two variables precipitation and purchases in the file examples.lisp type 9.

(savevar '(purchases precipitation) "examples")

The files precipitation.lisp and examples.lisp now contain a set of expressions that, when read in with the load command, will recreate the variables precipitation and purchases. You can look at these files with an editor such as Emacs, and you can prepare files with your own data by following these examples.

To save a data frame, use the write-csv function.

Reading Data Files

The data files we have used so far in this tutorial have contained Common Lisp expressions. LISP-STAT also provides functions for reading raw data files. The most commonly used is read-csv.

(read-csv stream)

where stream is a Common Lisp stream containing the data. Streams can be obtained from files, strings or a network, and the data are in comma separated value (CSV) format. The parser also supports delimiters other than comma.
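For example, a small data frame can be read from a string stream (the data here are made up; depending on the version, read-csv may also accept a string or pathname directly):

(read-csv (make-string-input-stream "name,age
alice,30
bob,25"))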

The character delimited reader should be adequate for most purposes. If you have to read a file that is not in a character delimited format you can use the raw file handling functions of Common Lisp.

User Initialization File

Each Common Lisp implementation provides a way to execute initialization code upon start-up. You can use this file to load any data sets you would like to have available or to define functions of your own.
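For example, with SBCL the user initialization file is ~/.sbclrc (other implementations use a different file). A minimal sketch, assuming Quicklisp is installed, might be:

;; in ~/.sbclrc
(ql:quickload :lisp-stat)        ; make LISP-STAT available in every session
(load #P"LS:DATASETS;TUTORIAL")  ; optionally pre-load data sets or your own functions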

Defining Functions & Methods

This section gives a brief introduction to programming LISP-STAT. The most basic programming operation is to define a new function. Closely related is the idea of defining a new method for an object. 10

Defining Functions

You can use the Common Lisp language to define functions of your own. Many of the functions you have been using so far are written in this language. The special form used for defining functions is called defun. The simplest form of the defun syntax is

(defun fun args expression)

where fun is the symbol you want to use as the function name, args is the list of the symbols you want to use as arguments, and expression is the body of the function. Suppose for example that you want to define a function to delete a case from a list. This function should take as its arguments the list and the index of the case you want to delete. The body of the function can be based on either of the two approaches described in Section 4.3 above. Here is one approach:

(defun delete-case (x i)
  ;; indices 0 ... (length x) - 1, with index i removed
  (select x (remove i (iota (length x)))))

I have used the function length in this definition to determine the length of the argument x. Note that none of the arguments to defun are quoted: defun is a special form that does not evaluate its arguments.
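For example, applying the function to the list used in the select examples above should drop the element at index 4 (the 12):

(delete-case (list 3 7 5 9 12 3 14 2) 4) ; => (3 7 5 9 3 14 2)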

Unless the functions you define are very simple you will probably want to define them in a file and load the file into LISP-STAT with the load command. You can put the functions in the implementation’s initialization file, or include in that file a load command that loads another file. The version of Common Lisp for the Macintosh, CCL, includes a simple editor that can be used from within LISP-STAT.

Matrices and Arrays

LISP-STAT includes support for multidimensional arrays. In addition to the standard Common Lisp array functions LISP-STAT also includes a system called array-operations.

An array is printed using the standard Common Lisp format. For example, a 2 by 3 matrix with rows (1 2 3) and (4 5 6) is printed as

#2A((1 2 3)(4 5 6))

The prefix #2A indicates that this is a two-dimensional array. This form is not particularly readable, but it has the advantage that it can be pasted into expressions and will be read as an array by the LISP reader.11 For matrices you can use the function print-matrix to get a slightly more readable representation:

LS-USER> (print-matrix '#2a((1 2 3)(4 5 6)))
#2a(
    (1 2 3)
    (4 5 6)
   )
NIL

The select function can be used to extract elements or sub-arrays from an array. If A is a two dimensional array then the expression

(select a 0 1)

will return element 1 of row 0 of A. The expression

(select a (list 0 1) (list 0 1))

returns the upper left hand corner of A.
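Putting these pieces together (assuming sub-array selection prints in the #2A form shown above; the exact output may vary):

(def a '#2A((1 2 3) (4 5 6)))
(select a 0 1)                   ; => 2, the element in row 0, column 1
(select a (list 0 1) (list 0 1)) ; => #2A((1 2) (4 5)), the upper left corner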

References

Bates, D. M. and Watts, D. G., (1988), Nonlinear Regression Analysis and its Applications, New York: Wiley.

Becker, Richard A., and Chambers, John M., (1984), S: An Interactive Environment for Data Analysis and Graphics, Belmont, Ca: Wadsworth.

Becker, Richard A., Chambers, John M., and Wilks, Allan R., (1988), The New S Language: A Programming Environment for Data Analysis and Graphics, Pacific Grove, Ca: Wadsworth.

Becker, Richard A., and William S. Cleveland, (1987), “Brushing scatterplots,” Technometrics, vol. 29, pp. 127-142.

Betz, David, (1985) “An XLISP Tutorial,” BYTE, pp 221.

Betz, David, (1988), “XLISP: An experimental object-oriented programming language,” Reference manual for XLISP Version 2.0.

Chaloner, Kathryn, and Brant, Rollin, (1988) “A Bayesian approach to outlier detection and residual analysis,” Biometrika, vol. 75, pp. 651-660.

Cleveland, W. S. and McGill, M. E., (1988) Dynamic Graphics for Statistics, Belmont, Ca.: Wadsworth.

Cox, D. R. and Snell, E. J., (1981) Applied Statistics: Principles and Examples, London: Chapman and Hall.

Dennis, J. E. and Schnabel, R. B., (1983), Numerical Methods for Unconstrained Optimization and Nonlinear Equations, Englewood Cliffs, N.J.: Prentice-Hall.

Devore, J. and Peck, R., (1986), Statistics, the Exploration and Analysis of Data, St. Paul, Mn: West Publishing Co.

McDonald, J. A., (1982), “Interactive Graphics for Data Analysis,” unpublished Ph. D. thesis, Department of Statistics, Stanford University.

Oehlert, Gary W., (1987), “MacAnova User’s Guide,” Technical Report 493, School of Statistics, University of Minnesota.

Press, Flannery, Teukolsky and Vetterling, (1988), Numerical Recipes in C, Cambridge: Cambridge University Press.

Steele, Guy L., (1984), Common Lisp: The Language, Bedford, MA: Digital Press.

Stuetzle, W., (1987), “Plot windows,” J. Amer. Statist. Assoc., vol. 82, pp. 466 - 475.

Tierney, Luke, (1990) LISP-STAT: Statistical Computing and Dynamic Graphics in Lisp. Forthcoming.

Tierney, L. and J. B. Kadane, (1986), “Accurate approximations for posterior moments and marginal densities,” J. Amer. Statist. Assoc., vol. 81, pp. 82-86.

Tierney, Luke, Robert E. Kass, and Joseph B. Kadane, (1989), “Fully exponential Laplace approximations to expectations and variances of nonpositive functions,” J. Amer. Statist. Assoc., to appear.

Tierney, L., Kass, R. E., and Kadane, J. B., (1989), “Approximate marginal densities for nonlinear functions,” Biometrika, to appear.

Weisberg, Sanford, (1982), “MULTREG Users Manual,” Technical Report 298, School of Statistics, University of Minnesota.

Winston, Patrick H. and Berthold K. P. Horn, (1988), LISP, 3rd Ed., New York: Addison-Wesley.

Appendix A: LISP-STAT Interface to the Operating System

A.1 Running System Commands from LISP-STAT

The uiop:run-program function can be used to run operating system commands from within LISP-STAT. It takes a command string as its argument and returns, among its values, the exit code of the command. For example, you can print the date using the UNIX date command:

LS-USER> (uiop:run-program "date" :output *standard-output*)
Wed Jul 19 11:06:53 CDT 1989
0

The exit code is 0, indicating successful completion of the command.


  1. It is possible to make a finer distinction. The reader takes a string of characters from the listener and converts it into an expression. The evaluator evaluates the expression and the printer converts the result into another string of characters for the listener to print. For simplicity I will use evaluator to describe the combination of these functions. ↩︎

  2. def acts like a special form, rather than a function, since its first argument is not evaluated (otherwise you would have to quote the symbol). Technically def is a macro, not a special form, but I will not worry about the distinction in this tutorial. def is closely related to the standard Lisp special forms setf and setq. The advantage of using def is that it adds your variable name to a list of def’ed variables that you can retrieve using the function variables. If you use setf or setq there is no easy way to find variables you have defined, as opposed to ones that are predefined. def always affects top level symbol bindings, not local bindings. It cannot be used in function definitions to change local bindings. ↩︎

  3. Use the function load. For example, evaluating the expression (load #P"LS:DATASETS;CAR-PRICES") should load the file car-prices.lisp. ↩︎

  4. The generator used is Marsaglia’s portable generator from the Core Math Libraries distributed by the National Bureau of Standards. A state object is a vector containing the state information of the generator. “Random” reseeding occurs off the system clock. ↩︎

  5. Help is available both in the REPL, and online at https://lisp-stat.dev/ ↩︎

  6. The notation used corresponds to the specification of the argument lists in Lisp function definitions. See the section Defining Functions & Methods for more information on defining functions. ↩︎

  7. Note that the keyword :title has not been quoted. Keyword symbols, symbols starting with a colon, are somewhat special. When a keyword symbol is created its value is set to itself. Thus a keyword symbol effectively evaluates to itself and does not need to be quoted. ↩︎

  8. Actually pi represents a constant, produced with defconst. Its value cannot be changed by simple assignment. ↩︎

  9. I have used a quoted list '(purchases precipitation) in this expression to pass the list of symbols to the savevar function. A longer alternative would be the expression (list 'purchases 'precipitation). ↩︎

  10. The discussion in this section only scratches the surface of what you can do with functions in the XLISP language. To see more examples you can look at the files that are loaded when XLISP-STAT starts up. For more information on options of function definition, macros, etc. see the XLISP documentation and the books on Lisp mentioned in the references. ↩︎

  11. You should quote an array if you type it in using this form, as the value of an array is not defined. ↩︎

6 - Reference

API documentation for Lisp-Stat systems

6.2 - Code Repository

Collection of XLisp and Common Lisp statistical routines

Below is a partial list of the consolidated XLispStat packages from UCLA and CMU repositories. There is a great deal more XLispStat code available that was not submitted to these archives, and a search for an algorithm or technique that includes the term “xlispstat” will often turn up interesting results.

Artificial Intelligence

Genetic Programming

Cerebrum
A Framework for the Genetic Programming of Neural Networks. Peter Dudey. No license specified.
[Docs]
GAL
Functions useful for experimentation in Genetic Algorithms. It is hopefully compatible with Lucid Common Lisp (also known as Sun Common Lisp). The implementation is a “standard” GA, similar to Grefenstette’s work. Baker’s SUS selection algorithm is employed, 2 point crossover is maintained at 60%, and mutation is very low. Selection is based on proportional fitness. This GA uses generations. It is also important to note that this GA maximizes. William M. Spears. “Permission is hereby granted to copy all or any part of this program for free distribution, however this header is required on all copies.”
mGA
A Common Lisp Implementation of a Messy Genetic Algorithm. No license specified.
[Docs, errata]

Machine Learning

Machine Learning
Common Lisp files for various standard inductive learning algorithms that all use the same basic data format and same interface. It also includes automatic testing software for running learning curves that compare multiple systems and utilities for plotting and statistically evaluating the results. Included are:
  • AQ: Early DNF learner.
  • Backprop: The standard multi-layer neural-net learning method.
  • Bayes Indp: Simple naive or “idiot’s” Bayesian classifier.
  • Cobweb: A probabilistic clustering system.
  • Foil: A first-order Horn-clause learner (Prolog and Lisp versions).
  • ID3: Decision tree learner with a number of features.
  • KNN: K nearest neighbor (instance-based) algorithm.
  • Perceptron: Early one-layer neural-net algorithm.
  • PFOIL: Propositional version of FOIL for learning DNF.
  • PFOIL-CNF: Propositional version of FOIL for learning CNF.

Raymond J. Mooney. “This program may be freely copied, used, or modified provided that this copyright notice is included in each copy of this code and parts thereof.”

Neural Networks

QuickProp
Common Lisp implementation of “Quickprop”, a variation on back-propagation. For a description of the Quickprop algorithm, see Faster-Learning Variations on Back-Propagation: An Empirical Study by Scott E. Fahlman in Proceedings of the 1988 Connectionist Models Summer School, Morgan-Kaufmann, 1988. Scott E. Fahlman. Public domain.
[README]

Fun & Games

Towers of Hanoi
Tower of Hanoi plus the Queens program explained in Winston and Horn. No license specified.

Mathematics

Combinatorial
Various combinatorial functions for XLispStat. There are other Common Lisp libraries for this, for example cl-permutation. It’s worth searching for something in Quicklisp too. No license specified.
functions
Bessel, beta, erf, gamma and horner implementations. Gerald Roylance. License restricted to non-commercial use only.
integrate
gauss-hermite.lsp is by Jan de Leeuw.

runge.lsp and integr.lsp are from Gerald Roylance’s 1982 CLMATH package. integr.lsp has Simpson’s rule and the trapezoid rule. runge.lsp integrates differential equations using Runge-Kutta and other methods.

Roylance code is non-commercial use only. Jan de Leeuw’s code has no license specified.

lsqpack
This directory contains the code from the Lawson and Hanson book, Solving Least Squares Problems, translated with f2cl, tweaked for Xlisp-Stat by Jan de Leeuw. No license specified.
nswc
This is an f2cl translation, very incomplete, of the NSWC mathematics library. The fortran, plus a great manual, is available on github. The report is NSWCDD/TR-92/425, by Alfred H. Morris, Jr. dated January 1993. No license specified, but this code is commonly considered public domain.
Numerical Recipes
Code from Numerical Recipes in FORTRAN, first edition, translated with Waikato’s f2cl and tweaked for Xlisp-Stat by Jan de Leeuw. No license specified.
optimization
Code for annealing, simplex and other optimization problems. Various licenses. These days, better implementations are available, for example the linear-programming library.

Statistics

Algorithms

  • AS 190 Probabilities and Upper Quantiles for the Studentized Range.
  • AS 226 Computing Noncentral Beta Probabilities
  • AS 241 The Percentage Points of the Normal Distribution
  • AS 243 Cumulative Distribution Function of the Non-Central T Distribution
  • TOMS 744 A stochastic algorithm for global optimization with constraints

AS algorithms: B. Narasimhan (naras@euler.bd.psu.edu) “You can freely use and distribute this code provided you don’t remove this notice. NO WARRANTIES, EXPLICIT or IMPLIED”

TOMS: F. Michael Rabinowitz. No license specified.

Categorical

glim
Glim extension for log-linear models. Jan de Leeuw. No license specified.
IPF
Fits Goodman’s RC model to the array X. Also included is a set of functions for APL-like array operations. The four basic APL operators (see, for example, Garry Helzel, An Encyclopedia of APL, 2nd edition, 1989, I-APL, 6611 Linville Drive, Weed, CA) are inner-product, outer-product, reduce, and scan. They can be used to produce new binary and unary functions from existing ones. Unknown author. No license specified.
latent-class
One file with the function latent-class. Unknown author. No license specified.
max
Functions to do quantization and cluster analysis in the empirical case. Jan de Leeuw. No license specified.
write-profiles
A function. The argument is a list of lists of strings. Each element of the list corresponds with a variable, the elements of the list corresponding with a variable are the labels of that variable, which are either strings or characters or numbers or symbols. The program returns a matrix of strings coding all the profiles. Unknown author. License not specified.

Distributions

The distributions repository contains single file implementations of:

density demo
Demonstrations of plots of density and probability functions. Requires XLispStat graphics. Jan de Leeuw. No license specified.
noncentral t-distribution
noncentral-t distribution by Russ Lenth, based on Applied Statistics Algorithm AS 243. No license specified.
probability-functions
A compilation of probability densities, cumulative distribution functions, and their inverses (quantile functions), by Jan de Leeuw. No license specified.
power
This appears to test the powers of various distribution functions. Unknown author. No license specified.
weibull-mle
Maximum likelihood estimation of Weibull parameters. M. Ennis. No license specified.

Classroom Statistics

The systems in the introstat directory are meant to be used in teaching situations. For the most part they use XLispStat’s graphical system to introduce students to statistical concepts. They are generally simple in nature from the perspective of a statistical practitioner.

ElToY
ElToY is a collection of three programs written in XLISP-STAT. Dist-toy displays a univariate distribution dynamically linked to its parameters. CLT-toy provides an illustration of the central limit theorem for univariate distributions. ElToY provides a mechanism for displaying the prior and posterior distributions for a conjugate family dynamically linked so that changes to the prior affect the posterior and vice versa. Russell Almond almond@stat.washington.edu. GPL v2.

Multivariate

Dendro
Dendro is for producing dendrograms for agglomerative clustering in XLISP-STAT.

Plotting

Boxplot Matrix
Graphical Display of Analysis of Variance with the Boxplot Matrix. Extension of the standard oneway boxplot to cross-classified data with multiple observations per cell. Richard M. Heiberger rmh@astro.ocis.temple.edu No license specified.
[Docs]
Dynamic Graphics and Regression Diagnostics
Contains methods for regression diagnostics using dynamic graphics, including all the methods discussed in Cook and Weisberg (1989) Technometrics, 277-312. Includes documentation written in LaTeX. sandy@umnstat.stat.umn.edu No license specified.
[Docs]
FEDF
Flipped Empirical Distribution Function. Parallel-FEDF, FEDF-ScatterPlot, FEDF-StarPlot written in XLISP-STAT. These plots, suggested for exploring multidimensional data, are described in the “Journal of Computational and Graphical Statistics”, Vol. 4, No. 4, pp. 335-343, 97/07/18. Lee, Kyungmi & Huh, Moon Yul myhuh@yurim.skku.ac.kr No license specified.
PDFPlot
PDF graphics output from XlispStat PDFPlot is a XlispStat class to generate PDF files from LispStat plot objects. Steven D. Majewski sdm7g@virginia.edu. No license specified.
RXridge
RXridge.LSP adds shrinkage regression calculation and graphical ridge “trace” display functionality to the XLisp-Stat, ver2.1 release 3+ implementation of LISP-STAT. Bob Obenchain. No license specified.

Regression

Bayes-Linear
BAYES-LIN is an extension of the XLISP-STAT object-oriented statistical computing environment, which adds to XLISP-STAT some object prototypes appropriate for carrying out local computation via message-passing between clique-tree nodes of Bayes linear belief networks. Darren J. Wilkinson. No license specified. [Docs]
Bayesian Poisson Regression
Bayesian Poisson Regression using the Gibbs Sampler Sensitivity Analysis through Dynamic Graphics. A set of programs that allow you to do Bayesian sensitivity analysis dynamically for a variety of models. B. Narasimhan (naras@stat.fsu.edu) License restricted to non-commercial use only.
[Docs]
Binary regression
Smooth and parametric binary regression code. Unknown author. License not specified.
Cost of Data Analysis
A regression analysis usually consists of several stages such as variable selection, transformation and residual diagnosis. Inference is often made from the selected model without regard to the model selection methods that preceded it. This can result in overoptimistic and biased inferences. We first characterize data analytic actions as functions acting on regression models. We investigate the extent of the problem and test bootstrap, jackknife and sample splitting methods for ameliorating it. We also demonstrate an interactive XLISP-STAT system for assessing the cost of the data analysis while it is taking place. Julian J. Faraway. BSD license.
[Docs]
Gee
Lisp-Stat code for generalised estimating equation models. Thomas Lumley thomas@biostat.washington.edu. GPL v2.
[Docs]
GLIM
Functions and prototypes for fitting generalized linear models. Contributed by Luke Tierney luke@umnstat.stat.umn.edu. No license specified.
[Docs]
GLMER
A function to estimate coefficients and dispersions in a generalized linear model with random effects. Guanghan Liu gliu@math.ucla.edu. No license specified.
Hasse
Implements Taylor & Hilton’s rules for balanced ANOVA designs and draws the Hasse diagram of nesting relationships. Philip Iversen piversen@iastate.edu. License restricted to non-commercial use only.
monotone
Implementation of an algorithm to project on the intersection of r closed convex sets. Further details and references are in Mathar, Cyclic Projections in Data Analysis, Operations Research Proceedings 1988, Springer, 1989. Jan de Leeuw. No license specified.
OIRS
Order and Influence in Regression Strategy. The methods (tactics) of regression data analysis such as variable selection, transformation and outlier detection are characterised as functions acting on regression models and returning regression models. The ordering of the tactics, that is the strategy, is studied. A method for the generation of acceptable models supported by the choice of regression data analysis methods is described with a view to determining if two capable statisticians may reasonably hold differing views on the same data. Optimal strategies are considered. The idea of influential points is extended from estimation to the model building process itself both quantitatively and qualitatively. The methods described are not intended for the entirely automatic analysis of data, rather to assist the statistician in examining regression data at a strategic level. Julian J. Faraway julian@stat.lsa.umich.edu. BSD license.
oneway
Additions to Tierney’s one way ANOVA. B. Narasimhan naras@euler.bd.psu.edu. No license specified.
Regstrat
A XLispStat tool to investigate order in Regression Strategy particularly for finding and examining the models found by changing the ordering of the actions in a regression analysis. Julian Faraway julian@stat.lsa.umich.edu. License restricted to non-commercial use only.
Simsel
XLISP-STAT software to perform Bayesian Predictive Simultaneous Variable and Transformation Selection for regression. A criterion-based model selection algorithm. Jennifer A. Hoeting jah@stat.colostate.edu. License restricted to non-commercial use only.

Robust

There are three robust systems in the robust directory:

robust regression
This is the Xlisp-Stat version of ROSEPACK, the robust regression package developed by Holland, Welsch, and Klema around 1975. See Holland and Welsch, Commun. Statist. A6, 1977, 813-827. See also the Xlisp-Stat book, pages 173-177, for an alternative approach. Jan de Leeuw. No license specified.

There is also robust statistical code for location and scale.

Simulation

The simulation directory contains bootstrapping methods, variable imputation, jackknife resampling, Monte Carlo simulations and a general purpose simulator. There are also discrete finite state Markov chains in the temporal directory.

Smoothers

kernel density estimators
KDEs based on Wand, CFFI based KDEs by B. Narasimhan, and graphical univariate density estimation.
spline
Regularized bivariate splines with smoothing and tension according to Mitasova and Mitas. Cubic splines according to Green and Silverman. Jan de Leeuw. No license specified.
super-smoother
The super smoothing algorithm, originally implemented in FORTRAN by Jerome Friedman of Stanford University, is a method by which a smooth curve may be fitted to a two-dimensional array of points. Its implementation is presented here in the XLISP-STAT language. Jason Bond. No license specified.
[DOCS]
Variable Bandwidth
XLispStat code to facilitate interactive bandwidth choice for estimator (3.14), page 44 in Bagkavos (2003), “BIAS REDUCTION IN NONPARAMETRIC HAZARD RATE ESTIMATION”. No license specified.

Spatial

livemap
LiveMap is a tool for exploratory spatial data analysis. Dr. Chris Brunsdon. No license specified.
[DOCS]
variograms
Produces variograms using algorithms from C.V. Deutsch and A.G. Journel, “GSLIB: Geostatistical Software Library and User’s Guide”, Oxford University Press, New York, 1992. Stanley S. Bentow. No license specified.
[DOCS]

Temporal

Exploratory survival analysis
A set of XLISP-STAT routines for the interactive, dynamic, exploratory analysis of survival data. E. Neely Atkinson (neely@odin.mda.uth.tmc.edu) “This software may be freely redistributed.”
[Docs]
Markov
Simulate some Markov chains in Xlisp-Stat. Complete documentation and examples are included. B. Narasimhan (naras@sci234e.mrs.umn.edu). GPL.
[Docs]
SAPA
Sapaclisp is a collection of Common Lisp functions that can be used to carry out many of the computations described in the SAPA book:

Donald B. Percival and Andrew T. Walden, “Spectral Analysis for Physical Applications: Multitaper and Conventional Univariate Techniques”, Cambridge University Press, Cambridge, England, 1993.

The SAPA book uses a number of time series as examples of various spectral analysis techniques.

From the description:

Sapaclisp features functions for converting to/from decibels, the Fortran sign function, log of the gamma function, manipulating polynomials, root finding, simple numerical integration, matrix functions, Cholesky and modified Gram-Schmidt (i.e., Q-R) matrix decompositions, sample means and variances, sample medians, computation of quantiles from various distributions, linear least squares, discrete Fourier transform, fast Fourier transform, chirp transform, low-pass filters, high-pass filters, band-pass filters, sample autocovariance sequence, autoregressive spectral estimates, least squares, forward/backward least squares, Burg’s algorithm, the Yule-Walker method, periodogram, direct spectral estimates, lag window spectral estimates, WOSA spectral estimates, sample cepstrum, time series bandwidth, cumulative periodogram test statistic for white noise, and Fisher’s g statistic.

License: “Use and copying of this software and preparation of derivative works based upon this software are permitted. Any distribution of this software or derivative works must comply with all applicable United States export control laws.”

Times
XLispstat functions for time series analysis, data editing, data selection, and other statistical operations. W. Hatch (bts!bill@uunet.uu.net). Public Domain.

Tests

The tests directory contains code for the one-sample and two-sample Kolmogorov-Smirnov tests (with no estimated parameters), and code for the Mann-Whitney and Wilcoxon signed-rank tests.

Training & Documentation

ENAR Short Course
This directory contains slides and examples used in a short course on Lisp-Stat presented at the 1992 ENAR meetings in Cincinnati, 22 March 1992.
ASA Course
Material from an ASA course given in 1992.
Tech Report
A 106-page mini manual on XLispStat.

Utilities

The majority of the files in the utilities directory are specific to XLISP-STAT and unlikely to be useful. In most cases better alternatives now exist for Common Lisp. A few that may be worth investigating have been noted below.

Filters

XLisp-S
A series of routines to allow users of Xlisp or LispStat to interactively transfer data to and access functions in New S. Steve McKinney kilroy@biostat.washington.edu. License restricted to non-commercial use only.

I/O

formatted-input
A set of XLISP functions that can be used to read ASCII files into lists of lists, using formatted input. The main function is read-file, which takes as arguments a filename and a Fortran-style format string (with f, i, x, t, and a formats). Jan Deleeuw deleeuw@laplace.sscnet.ucla.edu. “THIS SOFTWARE CAN BE FREELY DISTRIBUTED, USED, AND MODIFIED.”
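
A hypothetical call, assuming a data file whose records hold two floating-point fields and one integer field, might look like this (the file name and format string are illustrative):

;; Hypothetical usage; the format string follows the Fortran edit
;; descriptors (f, i, x, t, a) described above.
(read-file "measurements.dat" "(2f10.3,i5)")
;; => a list of lists, one inner list per record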

Memoization

automatic memoization
As the name suggests. Marty Hall hall@aplcenmp.apl.jhu.edu. “Permission is granted for any use or modification of this code provided this notice is retained.”
[OVERVIEW]

8 - Contribution Guidelines

How to contribute to Lisp-Stat

This section describes, at a high level, how to contribute code to Lisp-Stat: legal requirements, community guidelines, the code of conduct, and so on. For details on how to contribute code and documentation, see the links under Contributing in the navigation sidebar to the left.

For ideas about what you might contribute, please see open issues on GitHub and the ideas page. The organisation repository contains the individual sub-projects. Contributions to documentation are especially welcome.

Contributor License Agreement

Contributor License Agreements (CLAs) are common and accepted in open source projects. We all wish for Lisp-Stat to be used and distributed as widely as possible, and for its users to be confident about the origins and continuing existence of the code. The CLA helps us achieve that goal.

The Lisp-Stat project uses CLAs to accept regular contributions from individuals and corporations, and to accept larger grants of existing software products, for example if you wished to contribute a large XLISP-STAT library.

Contributions to this project must be accompanied by a Contributor License Agreement. You (or your employer) retain the copyright to your contribution; this simply gives us permission to use and redistribute your contributions as part of the project.

You generally only need to submit a CLA once, so if you have already submitted one (even if it was for a different project), you probably do not need to do it again. To get the process started, download and sign the CLA (A4, US-Letter), then open an issue with the title Contributor License Agreement on Github and upload the agreement as an attachment.

Code of Conduct

The following code of conduct is not meant as a means for punishment, action or censorship for the mailing list or project. Instead, it is meant to set the tone, expectations and comfort level for contributors and those wishing to participate in the community.

  • We ask everyone to be welcoming, friendly, and patient.
  • Flame wars and insults are unacceptable in any fashion, by any party.
  • Anything can be asked, and “RTFM” is not an acceptable answer.
  • Neither is “it’s in the archives, go read them”.
  • Statements made by core developers can be quoted outside of the list.
  • Statements made by others cannot be quoted outside the list without explicit permission. Anonymised, paraphrased statements (“someone asked about…”) are OK; direct quotes, with or without names, are not appropriate.
  • The community administrators reserve the right to revoke the subscription of members (including mentors) that persistently fail to abide by this Code of Conduct.

8.1 - Contributing Code

How to contribute code to Lisp-Stat

First, ensure you have signed a contributor license agreement. Then follow the steps in the sections below to contribute to Lisp-Stat.

You may also be interested in the additional information at the end of this document.

Get source code

First you need the Lisp-Stat source code. The core systems are found on the Lisp-Stat GitHub page. For an individual system, just check out the one you are interested in. To work on the entire Lisp-Stat system, at a minimum you will need the systems cloned by the command below.

Other dependencies will be pulled in by Quicklisp.

Development occurs on the “master” branch. To get all the repos, you can use the following command in the directory you want to be your top level dev space:

git clone https://github.com/Lisp-Stat/data-frame.git && \
git clone https://github.com/Lisp-Stat/dfio.git && \
git clone https://github.com/Lisp-Stat/special-functions.git && \
git clone https://github.com/Lisp-Stat/numerical-utilities.git && \
git clone https://github.com/Lisp-Stat/documentation.git && \
git clone https://github.com/Lisp-Stat/lisp-stat.git && \
git clone https://github.com/Lisp-Stat/plot.git && \
git clone https://github.com/Lisp-Stat/select.git && \
git clone https://github.com/Lisp-Stat/array-operations.git

Modify the source

Before you start, send a message to the Lisp-Stat mailing list or file an issue on Github describing your proposed changes. Doing this helps to verify that your changes will work with what others are doing and have planned for the project. Importantly, there may be some existing code or design work for you to leverage that is not yet published, and we’d hate to see work duplicated unnecessarily.

Be patient; it may take folks a while to understand your requirements. For large systems or design changes, a design document is preferred. For small changes, issues and the mailing list are fine.

Once your suggested changes are agreed, you can modify the source code and add some features using your favorite IDE.

The following sections provide tips for working on the project:

Coding Convention

Please consider the following before submitting a pull request:

  • Code should be formatted according to the Google Common Lisp Style Guide.
  • All code should include unit tests. Currently we use fiveam as the test framework for new projects, but are looking at Parachute and Rove as more extensible alternatives. A minimal test sketch follows this list.
  • Contributions should pass existing unit tests.
  • New unit tests should be provided to demonstrate bugs and fixes.
  • Indentation in Common Lisp is important for readability. Contributions should adhere to these guidelines. For the most part, a properly configured Emacs will do this automatically.
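
For illustration, a minimal fiveam test might look like the following sketch; the package, suite, and test names are hypothetical:

;; Minimal fiveam sketch; my-system and the test below are illustrative.
(defpackage :my-system/tests
  (:use :cl :fiveam))
(in-package :my-system/tests)

(def-suite my-system-tests
  :description "Unit tests for my-system.")
(in-suite my-system-tests)

(test addition-sanity-check
  "A trivial example assertion."
  (is (= 4 (+ 2 2))))

;; Run the suite from the REPL with (run! 'my-system-tests)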

Code review

Github includes code review tools that can be used as part of a pull request. We recommend using a triangular workflow and feature/bug branches in your own repository to work from. Once you submit a pull request, one of the committers will review it and possibly request modifications.

As a contributor you should organise (squash) your git commits to make them understandable to reviewers:

  • Combine WIP and other small commits together.
  • Address multiple issues, for smaller bug fixes or enhancements, with a single commit.
  • Use separate commits to allow efficient review, separating out formatting changes or simple refactoring from core changes or additions.
  • Rebase this chain of commits on top of the current master.
  • Write a good git commit message.

Once all the comments in the review have been addressed, a Lisp-Stat committer completes the following steps to commit the patch:

  • If the master branch has moved forward since the review, rebase the branch from the pull request on the latest master and re-run tests.
  • If all tests pass, the committer amends the last commit message in the series to include “this closes #1234”. This can be done with an interactive rebase: while on the branch, issue git rebase -i HEAD^.
    • On the line with the last commit, change “pick” to “r” (reword). Git replays the commit, giving you the opportunity to change the commit message.
  • The committer pushes the commit(s) to the GitHub repo.
  • The committer resolves the issue with a message like “Fixed in <Git commit SHA>”.

Additional Info

Where to start?

If you are new to statistics or Lisp, documentation updates are always a good place to start. You will become familiar with the workflow, learn how the code functions and generally become better acquainted with how Lisp-Stat operates. Besides, any contribution will require documentation updates, so it’s good to learn this system first.

If you are coming from an existing statistical environment, consider porting an XLispStat package that you find useful to Lisp-Stat. Use the XLS compatibility layer to help. If there is a function missing in XLS, raise an issue and we’ll create it. Some XLispStat code to browse:

Keep in mind that some of these rely on the XLispStat graphics system, which was native to the platform. LISP-STAT uses Vega for visualizations, so there isn’t a direct mapping. Non-graphical code should be a straightforward port.

You could also look at CRAN, which contains thousands of high-quality packages.

For specific ideas that would help, see the ideas page.

Issue Guidelines

Please comment on issues in GitHub, making your concerns known. Please also vote for issues that are a high priority for you.

Please refrain from editing descriptions and comments if possible, as edits spam the mailing list and clutter the audit trail, which is otherwise very useful. Instead, preview descriptions and comments using the preview button (on the right) before posting them. Keep descriptions brief and save more elaborate proposals for comments, since descriptions are included in the messages GitHub sends automatically. If you change your mind, note this in a new comment, rather than editing an older comment. The issue should preserve this history of the discussion.

8.2 - Contributing to Documentation

You can help make Lisp-Stat documentation better

Creating and updating documentation is a great way to learn. You will not only become more familiar with Common Lisp, you will also have a chance to investigate the internals of all parts of a statistical system.

We use Hugo to format and generate the website, the Docsy theme for styling and site structure, and Netlify to manage the deployment of the documentation site (what you are reading now). Hugo is an open-source static site generator that provides us with templates, content organisation in a standard directory structure, and a website generation engine. You write the pages in Markdown (or HTML if you want), and Hugo wraps them up into a website.

All submissions, including submissions by project members, require review. We use GitHub pull requests for this purpose. Consult GitHub Help for more information on using pull requests.

Repository Organisation

Declt generates documentation for individual systems in Markdown format. These are kept with the project, e.g. select/docs/select.md.

Conventions

Please follow the Microsoft Style Guide for technical documentation.

Quick Start

Here’s a quick guide to updating the docs. It assumes you are familiar with the GitHub workflow and you are happy to use the automated preview of your doc updates:

  1. Fork the Lisp-Stat documentation repo on GitHub.
  2. Make your changes and send a pull request (PR).
  3. If you are not yet ready for a review, add “WIP” to the PR name to indicate it’s a work in progress. (Don’t add the Hugo property “draft = true” to the page front matter, because that prevents the auto-deployment of the content preview described in the next point.)
  4. Wait for the automated PR workflow to do some checks. When it’s ready, you should see a comment like this: deploy/netlify — Deploy preview ready!
  5. Click Details to the right of “Deploy preview ready” to see a preview of your updates.
  6. Continue updating your doc and pushing your changes until you’re happy with the content.
  7. When you’re ready for a review, add a comment to the PR, and remove any “WIP” markers.

Updating a single page

If you’ve just spotted something you’d like to change while using the docs, Docsy has a shortcut for you (do not use this for reference docs):

  1. Click Edit this page in the top right hand corner of the page.
  2. If you don’t already have an up to date fork of the project repo, you are prompted to get one - click Fork this repository and propose changes or Update your Fork to get an up to date version of the project to edit. The appropriate page in your fork is displayed in edit mode.
  3. Follow the rest of the Quick Start process above to make, preview, and propose your changes.

Previewing locally

If you want to run your own local Hugo server to preview your changes as you work:

  1. Follow the instructions in Getting started to install Hugo and any other tools you need. You’ll need at least Hugo version 0.45 (we recommend using the most recent available version), and it must be the extended version, which supports SCSS.

  2. Fork the Lisp-Stat documentation repo into your own repository project, then create a local copy using git clone. Don’t forget to use --recurse-submodules or you won’t pull down some of the code you need to generate a working site.

    git clone --recurse-submodules --depth 1 https://github.com/lisp-stat/documentation.git
    
  3. Run hugo server in the site root directory. By default your site will be available at http://localhost:1313/. Now that you’re serving your site locally, Hugo will watch for changes to the content and automatically refresh your site.

  4. Continue with the usual GitHub workflow to edit files, commit them, push the changes up to your fork, and create a pull request.

Creating an issue

If you’ve found a problem in the docs, but are not sure how to fix it yourself, please create an issue in the Lisp-Stat documentation repo. You can also create an issue about a specific page by clicking the Create Issue button in the top right hand corner of the page.

Useful resources

8.3 - Contribution Ideas

Some ideas on how to contribute to Lisp-Stat

SQLite

There isn’t a good, maintained wrapper for SQLite that doesn’t have a restricted license. Using CFFI and autowrap, create a lisp interface for SQLite. This will allow us to use sqldf with something other than PostgreSQL.
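
As a rough sketch of the direction (not a finished design), a hand-written CFFI binding for a single SQLite entry point might look like this; autowrap would instead generate such definitions from the sqlite3.h header:

;; sqlite3_libversion is part of the public SQLite C API; the library
;; designator and Lisp names here are illustrative.
(cffi:define-foreign-library libsqlite3
  (:darwin "libsqlite3.dylib")
  (:unix "libsqlite3.so")
  (t (:default "libsqlite3")))

(cffi:use-foreign-library libsqlite3)

;; const char *sqlite3_libversion(void);
(cffi:defcfun ("sqlite3_libversion" sqlite3-libversion) :string)

;; (sqlite3-libversion) => e.g. "3.39.2"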

Special Functions

The functions underlying the statistical distributions require skills in numerical programming. If you like being ‘close to the metal’, this is a good area for contributions. Suitable for intermediate to advanced programmers. In particular we need implementations of:

  • gamma
  • incomplete gamma (upper & lower)
  • inverse incomplete gamma

This work is partially complete and makes a good starting point for someone who wants to make a substantial contribution.
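
For reference, the lower and upper incomplete gamma functions listed above are defined by

$$\gamma(s,x) = \int_0^x t^{s-1} e^{-t}\,dt, \qquad \Gamma(s,x) = \int_x^\infty t^{s-1} e^{-t}\,dt,$$

with $\gamma(s,x) + \Gamma(s,x) = \Gamma(s)$. The inverse incomplete gamma function returns, for a given $s$ and probability $p$, the $x$ satisfying $\gamma(s,x)/\Gamma(s) = p$.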

Documentation

Better and more documentation is always welcome, and a great way to learn. Suitable for beginners to Common Lisp or statistics.

Jupyter-Lab Integrations

Jupyter Lab has two nice integrations with Pandas, the Python equivalent of data-frame, that would make great contributions: Qgrid, which allows editing a data frame in Jupyter Lab, and Jupyter DataTables. There are many more Pandas/Jupyter integrations, and any of them would be welcome additions to the Lisp-Stat ecosystem.

Plotting

LISP-STAT has a basic plotting system, but there is always room for improvement. An interactive, REPL-based plotting system should be possible with a medium amount of effort. Remote-js provides a working example of running JavaScript in a browser from a REPL, and could be combined with something like Electron and a DSL for Vega-Lite specifications. This may be a 4-6 week project for someone with JavaScript and HTML skills. There are other Plotly/Vega options, so if this interests you, open an issue and we can discuss. I have working examples of much of this, but they are all fragmented. Skills: good web/JavaScript, beginner Lisp.
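
Purely as an illustration of the kind of DSL under discussion (not an existing Lisp-Stat API), a Vega-Lite scatter-plot specification could be expressed as a plist and serialised to JSON for the browser:

;; Illustrative sketch only: the plist mirrors a Vega-Lite JSON spec;
;; the Lisp representation and any serialiser are hypothetical.
(defparameter *scatter-spec*
  '(:mark "point"
    :data (:values ((:x 1 :y 2) (:x 2 :y 4) (:x 3 :y 9)))
    :encoding (:x (:field "x" :type "quantitative")
               :y (:field "y" :type "quantitative"))))
;; A renderer would serialise this to JSON and hand it to vega-embed in
;; the browser, e.g. via remote-js or an Electron window.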

Regression

We have some code for ‘quick & dirty’ regressions and need a more robust DSL (Domain Specific Language). As a prototype, the -proto regression objects from XLISP-STAT would be both useful and a good experiment to see what the final form should take. This is a relatively straightforward port, e.g. defproto -> defclass and defmeth -> defmethod, as sketched below. Skill level: medium in both Lisp and statistics, or willing to learn.
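
As a hedged illustration of the kind of mapping involved (the class, slot, and method names below are hypothetical, not the final DSL):

;; XLISP-STAT: (defproto regression-model-proto '(x y coefficients))
(defclass regression-model ()
  ((x            :initarg :x :accessor model-x)
   (y            :initarg :y :accessor model-y)
   (coefficients :initform nil :accessor model-coefficients)))

;; XLISP-STAT: (defmeth regression-model-proto :coef-estimates () ...)
(defmethod coef-estimates ((model regression-model))
  "Return the estimated coefficients of MODEL."
  (model-coefficients model))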

Vector Mathematics

We have code for vectorized versions of the Common Lisp mathematical functions, living in the elmt package. It currently works only on vectors. Shadowing the Common Lisp mathematical operators is possible, and more natural. This task is to make the elmt vectorized math functions work on lists as well as vectors, and to implement the shadowing of the Common Lisp operators; a minimal sketch of the idea follows. This task requires at least medium-high level Lisp skills, since you will be working with both packages and shadowing. We also need to run the ANSI Common Lisp conformance tests on the results to ensure nothing gets broken in the process.
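
A minimal sketch of the list/vector part of the task, using a hypothetical elementwise operator (elmt's actual internals will differ):

;; Dispatch the result type on the first argument so the same
;; elementwise operator accepts lists and vectors.
(defun e+ (a b)
  "Elementwise addition over two lists or two vectors."
  (map (if (listp a) 'list 'vector) #'+ a b))

;; (e+ #(1 2 3) #(4 5 6)) => #(5 7 9)
;; (e+ '(1 2 3) '(4 5 6)) => (5 7 9)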

Continuous Integration

If you have experience with GitHub’s CI tools, a CI setup for Lisp-Stat would be a great help. This allows people making pull requests to immediately know whether their patches break anything. Beginner level Lisp.