SQLDF

Selecting subsets of data using SQL

Overview

sqldf is a library for querying data in a data-frame using SQL, optimised for memory consumption. Any query that can be done in SQL can also be done in the API, but since SQL is widely known, many developers find it more convenient to use.

To use SQL to query a data frame, the developer uses the sqldf function, using the data frame name (converted to SQL identifier format) in place of the table name. sqldf will automatically create an in-memory SQLite database, copy the contents of the data frame to it, perform the query, return the results as a new data frame and delete the database. We have tested this with data frames of 350K rows and there is no noticeable difference in performance compared to API based queries.

See the cl-sqlite documentation for additional functionality provided by the SQLite library. You can create databases, employ multiple persistent connections, use prepared statements, etc. with the underlying library. sqldf is a thin layer for moving data to/from data-frames.

Basic Usage

sqldf requires the sqlite shared library from the SQLite project. It may also be available via your operating systems package manager.

To load sqldf:

(asdf:load-system :sqldf)
(use-package 'sqldf) ;access to the symbols

Examples

These examples use the R data sets that are loaded using the example ls-init file. If your init file doesn’t do this, go now and load the example datasets in the REPL. Mostly these examples are intended to demonstrate commonly used queries for users who are new to SQL. If you already know SQL, you can skip this section.

Ordering & Limiting

This example shows how to limit the number of rows output by the query. It also illustrates changing the column name to meet SQL identifier requirements. In particular, the R CSV file has sepal.length for a column name, which is converted to sepal-length for the data frame, and we query it with sepal_length for SQL because ‘-’ is not a valid character in SQL identifers.

First, let’s see how big the iris data set is:

LS-USER> iris
#<DATA-FRAME (150 observations of 6 variables)>

and look at the first few rows:

(head iris)
;;   X7 SEPAL-LENGTH SEPAL-WIDTH PETAL-LENGTH PETAL-WIDTH SPECIES
;; 0  1          5.1         3.5          1.4         0.2 setosa
;; 1  2          4.9         3.0          1.4         0.2 setosa
;; 2  3          4.7         3.2          1.3         0.2 setosa
;; 3  4          4.6         3.1          1.5         0.2 setosa
;; 4  5          5.0         3.6          1.4         0.2 setosa
;; 5  6          5.4         3.9          1.7         0.4 setosa

X7 is the row name/number from the data set. Since it was not assigned a column name in the data set, lisp-stat gives it a random name upon import (X1, X2, X3, …).

Now use sqldf for a query:

(pprint
  (sqldf "select * from iris order by sepal_length desc limit 3"))

;;    X7 SEPAL-LENGTH SEPAL-WIDTH PETAL-LENGTH PETAL-WIDTH SPECIES
;; 0 132          7.9         3.8          6.4         2.0 virginica
;; 1 118          7.7         3.8          6.7         2.2 virginica
;; 2 119          7.7         2.6          6.9         2.3 virginica

Averaging & Grouping

Grouping is often useful during the exploratory phase of data analysis. Here’s how to do it with sqldf:

(pprint
  (sqldf "select species, avg(sepal_length) from iris group by species"))

;;   SPECIES    AVG(SEPAL-LENGTH)
;; 0 setosa                5.0060
;; 1 versicolor            5.9360
;; 2 virginica             6.5880

Nested Select

For each species, show the two rows with the largest sepal lengths:

(pprint
  (sqldf "select * from iris i
	      where x7 in
		  (select x7 from iris where species = i.species order by sepal_length desc limit 2) order by i.species, i.sepal_length desc"))

;;    X7 SEPAL-LENGTH SEPAL-WIDTH PETAL-LENGTH PETAL-WIDTH SPECIES
;; 0  15          5.8         4.0          1.2         0.2 setosa
;; 1  16          5.7         4.4          1.5         0.4 setosa
;; 2  51          7.0         3.2          4.7         1.4 versicolor
;; 3  53          6.9         3.1          4.9         1.5 versicolor
;; 4 132          7.9         3.8          6.4         2.0 virginica
;; 5 118          7.7         3.8          6.7         2.2 virginica

Recall the note above about X7 being the row id. This may be different depending on how many other data frames with an unnamed column have been imported in your Lisp-Stat session.

SQLite access

sqldf needs to read and write data frames to the data base, and these functions are exported for general use.

Write a data frame

create-df-table and write-table can be used to write a data frame to a database. Each take a connection to a database, which may be file or memory based, a table name and a data frame. Multiple data frames, with different table names, may be written to a single SQLite file this way. For example, to write iris to disk:

LS-USER> (defparameter *conn* (sqlite:connect #P"c:/Users/lisp-stat/data/iris.db3")) ;filel to save to
*CONN*

LS-USER> (sqldf::create-df-table *conn* 'iris iris) ; create the table * schema
NIL
LS-USER> (sqldf:write-table *conn* 'iris iris) ; write the data
NIL

Read a data frame

read-table will read a database table into a data frame and update the column names to be lisp like by converting “.” and “_” to “-”. Note that the CSV reading tools of SQLite (for example, DB-Browser for SQLite are much faster than the lisp libraries, sometimes 15x faster. This means that often the quickest way to load a data-frame from CSV data is to first read it into a SQLite database, and then load the database table into a data frame. In practice, SQLite also turns out to be a convenient file format for storing data frames.

Roadmap

SQLDF is currently written using an apparently abandoned library, cl-sqlite. Pull requests from 2012 have been made with no response from the author, and the SQLite C API has improved considerably in the 12 years since the cl-sqlite FFI was last updated.

We choose CL-SQLite because, at the time of writing, it was the only SQLite library with a commercially acceptable license. Since then CLSQL has migrated to a BSD license and is a better option for new development. Not only does it support CommonSQL, the de-facto SQL query syntax for Common Lisp, it also supports several additional databases.

Version 2 of SQLDF will use CLSQL, possibly including some of the CSV and other extensions available in SQLite. Benchmarks show that SQLite’s CSV import is about 15x faster than cl-csv, and a FFI wrapper of SQLite’s CSV importer would be a good addition to Lisp-Stat.

Joins

Joins on tables are not implemented in SQLDF, though there is no technical reason they could not be. This will be done as part of the CLSQL conversion and involves more advanced SQL parsing. SXQL is worth investigating as a SQL parser.


Last modified 11 February 2023: Add LLA manual (2012036)