Select
Overview
Select provides:
- An API for taking slices (elements selected by the Cartesian product of vectors of subscripts for each axis) of array-like objects. The most important function is select. Unless you want to define additional methods for select, this is pretty much all you need from this library. See the API reference for additional details.
- An extensible DSL for selecting a subset of valid subscripts. This is useful if, for example, you want to resolve column names in a data frame in your implementation of select.
- A set of utility functions for traversing selections in array-like objects.
It combines the functionality of dplyr’s slice, select and sample methods.
Basic Usage
The most frequently used form is:
(select object selection1 selection2 ...)
where each selection specifies a set of subscripts along the corresponding axis. The selection specifiers are described below.
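To make the Cartesian-product idea concrete, here is a minimal sketch in plain Common Lisp that does not use the library. The name cartesian-select is hypothetical, and the sketch only handles two-dimensional arrays with explicit lists of subscripts; the library's select is more general.

```lisp
;; Hypothetical helper, not part of select: collect the elements of a 2D
;; array at every (row, column) pair drawn from two lists of subscripts.
(defun cartesian-select (array rows columns)
  (let ((result (make-array (list (length rows) (length columns)))))
    (loop for r in rows
          for i from 0
          do (loop for c in columns
                   for j from 0
                   do (setf (aref result i j) (aref array r c))))
    result))

(cartesian-select #2A((0 1 2)
                      (3 4 5)
                      (6 7 8))
                  '(0 2) '(1 2))
;; => #2A((1 2) (7 8))
```

Rows 0 and 2 combined with columns 1 and 2 give the four elements at (0,1), (0,2), (2,1) and (2,2).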
To select a column, pass t for the rows (selection1) and the column name (for a data frame) or column number (for an array) for selection2. For example, to select the column at index 1 of this array (the second column, since subscripts are 0-based):
(select #2A((C0  C1  C2)
            (v10 v11 v12)
            (v20 v21 v22)
            (v30 v31 v32))
        t 1)
; #(C1 V11 V21 V31)
To select a column from the mtcars data frame:
(ql:quickload :data-frame)
(data :mtcars)
(select mtcars t 'mpg)
If you’re selecting from a data frame, you can also use the column or columns function:
(column mtcars 'mpg)
To select an entire row, pass t for the column selector and the row(s) you want for selection1. This example selects the first data row (index 1; index 0 holds the column labels, since array subscripts are 0-based):
(select #2A((C0  C1  C2)
            (v10 v11 v12)
            (v20 v21 v22)
            (v30 v31 v32))
        1 t)
; #(V10 V11 V12)
Selection Specifiers
Selecting Single Values
A non-negative integer selects the corresponding index, while a negative integer selects an index counting backwards from the last index. For example:
(select #(0 1 2 3) 1) ; => 1
(select #(0 1 2 3) -2) ; => 2
These are called singleton slices. Each singleton slice drops the dimension: vectors become atoms, matrices become vectors, etc.
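The way a negative subscript maps onto an ordinary index can be sketched in plain Common Lisp. The function resolve-index is hypothetical and only illustrates the rule described above; it is not the library's implementation.

```lisp
;; Hypothetical helper, not part of select: map a possibly negative
;; subscript onto a valid 0-based index for an axis of size DIMENSION.
(defun resolve-index (index dimension)
  (let ((resolved (if (minusp index)
                      (+ dimension index)  ; count back from the end
                      index)))
    (assert (< -1 resolved dimension) ()
            "Subscript ~a is out of bounds for dimension ~a." index dimension)
    resolved))

(resolve-index 1 4)  ; => 1
(resolve-index -2 4) ; => 2
```

For a vector of length 4, -2 resolves to index 2, which matches the second select example above.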
Selecting Ranges
(range start end) selects subscripts i where start <= i < end. When end is nil, the last index is included (cf. subseq). Each boundary is resolved according to the other rules, if applicable, so you can use negative integers:
(select #(0 1 2 3) (range 1 3)) ; => #(1 2)
(select #(0 1 2 3) (range 1 -1)) ; => #(1 2)
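As a rough illustration of the half-open rule, the following plain Common Lisp sketch lists the subscripts a range covers. The function range-subscripts is hypothetical and does not use the library; it assumes both boundaries are integers or, for the end, nil.

```lisp
;; Hypothetical helper, not part of select: list the subscripts covered by
;; the half-open range [START, END) on an axis of size DIMENSION.
;; A nil END means "up to and including the last index".
(defun range-subscripts (start end dimension)
  (flet ((resolve (index)
           (if (minusp index) (+ dimension index) index)))
    (let ((first-index (resolve start))
          (limit (if end (resolve end) dimension)))
      (loop for i from first-index below limit
            collect i))))

(range-subscripts 1 3 4)   ; => (1 2)
(range-subscripts 1 -1 4)  ; => (1 2)
(range-subscripts 1 nil 4) ; => (1 2 3)
```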
Selecting All Subscripts
t selects all subscripts:
(select #2A((0 1 2)
            (3 4 5))
        t 1) ; => #(1 4)
Selecting with Sequences
Sequences can be used to make specific selections from the object. For example:
(select #(0 1 2 3 4 5 6 7 8 9)
        (vector (range 1 3) 6 (range -2 -1))) ; => #(1 2 6 8)
(select #(0 1 2) '(2 2 1 0 0)) ; => #(2 2 1 0 0)
Masks
Bit Vectors
Bit vectors can be used to select elements of arrays and sequences as well:
(select #(0 1 2 3 4) #*00110) ; => #(2 3)
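A mask of this kind can be built from a predicate with standard Common Lisp. The sketch below is independent of the library; the names *numbers*, *even-mask* and mask-subscripts are hypothetical and exist only for illustration.

```lisp
(defparameter *numbers* #(0 1 2 3 4))

;; Build a bit vector with a 1 wherever the element is even.
(defparameter *even-mask*
  (map 'bit-vector (lambda (x) (if (evenp x) 1 0)) *numbers*))
;; *even-mask* => #*10101

;; Hypothetical helper, not part of select: positions of the 1 bits,
;; i.e. the subscripts that the mask keeps.
(defun mask-subscripts (mask)
  (loop for b across mask
        for i from 0
        when (= b 1) collect i))

(mask-subscripts *even-mask*) ; => (0 2 4)
(mask-subscripts #*00110)     ; => (2 3)
```

The second call shows why the mask #*00110 in the example above picks out the elements at positions 2 and 3.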
Which
which returns a vector of the indices of the elements of SEQUENCE that satisfy PREDICATE.
(defparameter data
#(12 127 28 42 39 113 42 18 44 118 44 37 113 124 37 48 127 36 29 31 125
139 131 115 105 132 104 123 35 113 122 42 117 119 58 109 23 105 63 27
44 105 99 41 128 121 116 125 32 61 37 127 29 113 121 58 114 126 53 114
96 25 109 7 31 141 46 13 27 43 117 116 27 7 68 40 31 115 124 42 128 146
52 71 118 117 38 27 106 33 117 116 111 40 119 47 105 57 122 109 124
115 43 120 43 27 27 18 28 48 125 107 114 34 133 45 120 30 127 31 116))
(which data :predicate #'evenp)
; #(0 2 3 6 7 8 9 10 13 15 17 25 26 30 31 34 40 44 46 48 55 56 57 59 60 66 71 74
; 75 78 79 80 81 82 84 86 88 91 93 98 100 103 107 108 109 112 113 116 117 120)
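For readers who want to see the idea without the library, the same kind of index vector can be computed in plain Common Lisp. The function indices-where is a hypothetical stand-in, not the library's implementation of which.

```lisp
;; Hypothetical stand-in for which: return a vector of the positions in
;; SEQUENCE whose elements satisfy PREDICATE.
(defun indices-where (predicate sequence)
  (coerce (loop for i from 0 below (length sequence)
                when (funcall predicate (elt sequence i))
                  collect i)
          'vector))

;; The first five elements of the data vector above.
(indices-where #'evenp #(12 127 28 42 39)) ; => #(0 2 3)
```

The result agrees with the first three indices in the output of which shown above.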
Sampling
You may sample sequences, arrays and data frames with the sample generic function, and extend it for your own objects. The function signature is:
(defgeneric sample (data n &key
                               with-replacement
                               skip-unselected))
In Common Lisp, keyword arguments that are not provided default to nil, so you need to turn these options on if you want them.
:skip-unselected t means that the elements of the object that were not part of the sample are not returned. This is turned off by default because a common use case is splitting a data set into training and test groups: the unselected elements come back as a second value, and Common Lisp ignores a second value unless the caller asks for it. The let-plus package, imported by default in select, makes it easy to destructure the result into training and test sets. This example is from the tests included with select:
(let+ ((*random-state* state)
((&values train test) (sample arr35 2))
...
Note the binding of *random-state*. You should use this pattern of binding *random-state* to a saved state any time you need reproducible results (for example, in a testing scenario).
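The pattern can be sketched with standard Common Lisp alone. The names *saved-state* and reproducible-draws are hypothetical; the point is only that rebinding *random-state* to a fresh copy of a saved state makes every run produce the same sequence.

```lisp
;; Save one random state up front. (make-random-state t) creates a new,
;; randomly seeded state.
(defparameter *saved-state* (make-random-state t))

;; Hypothetical helper: draw N pseudo-random integers below 100, always
;; starting from a copy of the saved state so the result is repeatable.
(defun reproducible-draws (n)
  (let ((*random-state* (make-random-state *saved-state*))) ; copy, not the original
    (loop repeat n collect (random 100))))

;; Both calls return the same list, because each starts from an identical state.
(equal (reproducible-draws 5) (reproducible-draws 5)) ; => T
```

Copying the saved state matters: binding *random-state* directly to *saved-state* would advance the saved object itself, and later runs would no longer match.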
The size of the sample is determined by the value of n, which must be between 0 and the number of rows (for an array) or the length (for a sequence). If (< n 1), then n is interpreted as a proportion of the data, e.g. 2/3; values of n less than one may be rational or float. For example, let’s take a training sample of 2/3 of the rows in the mtcars dataset:
LS-USER> (sample mtcars 2/3)
#<DATA-FRAME (21 observations of 12 variables)>
#<DATA-FRAME (11 observations of 12 variables)>
LS-USER> (dims mtcars)
(32 12)
You can see that mtcars has 32 rows, and has been divided into roughly 2/3 and 1/3 proportional samples (21 and 11 rows) for training / test.
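The two data frames printed above are the two return values of sample. How multiple values behave can be seen with a small plain Common Lisp sketch; split-at is a hypothetical stand-in for sample that involves no randomness and is used only to show how the second value is picked up or dropped.

```lisp
;; Hypothetical stand-in for sample: return two values, the first N
;; elements of LIST and the remainder.
(defun split-at (list n)
  (values (subseq list 0 n) (nthcdr n list)))

;; In an ordinary argument position only the primary value is used;
;; the second value is silently dropped.
(list (split-at '(a b c d e) 3))
;; => ((A B C))

;; multiple-value-bind captures both values.
(multiple-value-bind (train test) (split-at '(a b c d e) 3)
  (list train test))
;; => ((A B C) (D E))
```

The (&values train test) form of let+ shown earlier plays the same role as multiple-value-bind here.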
You can also take samples of sequences (lists and vectors), for example using the data variable defined above:
LS-USER> (length data)
121
LS-USER> (sample data 10 :skip-unselected t)
#(43 117 42 29 41 105 116 27 133 58)
LS-USER> (sample data 1/10 :skip-unselected t)
#(119 116 7 53 27 114 31 23 121 109 42 125)
Lists can also be sampled:
(sample '(a b c d e f g) 0.5)
(A E G B)
(F D C)
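Note that when a proportion is requested and it does not correspond to a whole number of elements, the sample size is rounded to a whole number. Here, 0.5 of 7 elements is 3.5, which is rounded up to a sample of 4.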
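The arithmetic behind the sample sizes in this section can be checked with a short sketch. The function sample-size is hypothetical, and the use of round is an assumption chosen only because it reproduces the sizes shown in the examples; the library may compute the size differently.

```lisp
;; Hypothetical helper, not part of select. ASSUMPTION: a proportional N
;; is converted to a count with ROUND, which matches the examples in this
;; section but is not confirmed as the library's rule.
(defun sample-size (n total)
  (if (< n 1)
      (values (round (* n total))) ; keep only the rounded integer
      n))

(sample-size 2/3 32)   ; => 21  (mtcars: 21 of 32 rows)
(sample-size 1/10 121) ; => 12  (data: 12 of 121 elements)
(sample-size 0.5 7)    ; => 4   (list of 7 elements)
(sample-size 10 121)   ; => 10  (n >= 1 is used as a count)
```

Note that round in Common Lisp rounds exact halves to the nearest even integer, so this sketch would turn 2.5 into 2 rather than 3.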
Extensions
The previous sections describe the core functionality. The semantics can be extended. The extensions in this section are provided by the library and prove useful in practice. Their implementations provide good examples of extending the library.
including is convenient if you want the selection to include the end of the range:
(select #(0 1 2 3) (including 1 2))
; => #(1 2), cf. (select ... (range 1 3))
nodrop is useful if you do not want to drop dimensions:
(select #(0 1 2 3) (nodrop 2))
; => #(2), cf. (select ... (range 2 3))
All of these are trivial to implement. If there is something you are missing, you can easily extend select. Pull requests are welcome.
ref is a version of select that always returns a single element, so it can only be used with singleton slices.
Select Semantics
Arguments of select, except the first one, are meant to be resolved using canonical-representation, in the select-dev package. If you want to extend select, you should define methods for canonical-representation. See the source code for the best examples. Below is a simple example that extends the semantics with ordinal numbers.
(defmacro define-ordinal-selection (number)
  (check-type number (integer 0))
  ;; "~:@(~:r~)" prints NUMBER as an upper-case ordinal, e.g. 1 -> "FIRST"
  `(defmethod select-dev:canonical-representation
       ((axis integer) (select (eql ',(intern (format nil "~:@(~:r~)" number)))))
     (assert (< ,number axis))
     (select-dev:canonical-singleton ,number)))
(define-ordinal-selection 1)
(define-ordinal-selection 2)
(define-ordinal-selection 3)
(select #(0 1 2 3 4 5) (range 'first 'third)) ; => #(1 2)
Note the following:
- The value returned by canonical-representation needs to be constructed using canonical-singleton, canonical-range, or canonical-sequence. You should not use the internal representation directly as it is subject to change.
- You can assume that axis is an integer; this is the default. An object may define a more complex mapping (such as, for example, named rows & columns), but unless a method specialized to that is found, canonical-representation will just query its dimension (with axis-dimension) and try to find a method that works on integers.
- You need to make sure that the subscript is valid, hence the assertion.