Dependencies

plugin tuple;
plugin refly;
plugin vec;
plugin cps;
plugin affine;
plugin buffer;
plugin matrix;

Types

%tensor.Ring

Represents an algebraic ring as a tuple of a carrier type T, its zero element _0, and the operations add and mul. E.g., let nat_ring = (Nat, 0, %core.nat.add, %core.nat.mul);

let %tensor.Ring = [
    T: *,
    _0: T,
    add: [T, T] → T,
    mul: [T, T] → T,
];

Operations

%tensor.get

Extracts the element at index from arr: %tensor.get (arr, (a, b, c)) ≡ arr#a#b#c.

axm %tensor.get: {T: *, r: Nat, s: «r; Nat»} → [arr: «s; T», index: «i: r; Idx (s#i)»] → T, normalize_get;

%tensor.set

Inserts the element x in arr at index: %tensor.set (arr, (a, b, c), x) ≡ Insert arr a (Insert (arr#a) b (Insert (arr#a#b) c x)).

axm %tensor.set: {T: *, r: Nat, s: «r; Nat»} → [arr: «s; T», index: «i: r; Idx (s#i)», x: T] → «s; T», normalize_set;
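The two equations for %tensor.get and %tensor.set can be modelled on nested Python tuples. This is only a reference sketch of the stated semantics, not the normalizers themselves; `insert`, `tensor_get` and `tensor_set` are hypothetical names.

```python
def insert(arr, i, x):
    """Functional update: a copy of `arr` with position `i` replaced by `x`."""
    return arr[:i] + (x,) + arr[i + 1:]

def tensor_get(arr, index):
    """arr#a#b#c for index = (a, b, c)."""
    for i in index:
        arr = arr[i]
    return arr

def tensor_set(arr, index, x):
    """Insert arr a (Insert (arr#a) b (... x)) for index = (a, b, ...)."""
    if not index:
        return x
    head, rest = index[0], index[1:]
    return insert(arr, head, tensor_set(arr[head], rest, x))

t = ((1, 2, 3), (4, 5, 6))
u = tensor_set(t, (1, 2), 60)   # t itself is left unchanged
```

The nested inserts rebuild only the path from the root to the updated cell, which mirrors the right-hand side of the %tensor.set equation.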

%tensor.shape

Returns the shape of arr as a length-r array of Nats: (s#0, …, s#(r-1)).

The rank r is given explicitly. A tensor's shape vector lives in its (nested array) type, which normalize_shape reads off by peeling r levels — it cannot be recovered by inference, since «s; T» with a vector s expands to a nested array and does not reverse-unify. (Size-1 axes collapse out of the type, so arr must have no literal-1 dims among the first r.)

axm %tensor.shape: {A: *} → [r: Nat] → [arr: A] → «r; Nat», normalize_shape;

%tensor.map_reduce

The workhorse behind every tensor operation: a reduction fold over an affinely-indexed loop nest.

The iteration domain and the result shape are stated explicitly:

Sr: the bounds of the full loop nest of length Ro + Rr. Its leading Ro dimensions are the parallel output loops; the trailing Rr dimensions are the reduction loops that are folded away.
So: the shape of the result tensor (rank Ro). It may differ from the leading Ro bounds of Sr, which is what lets map_out write to a re-shaped output (e.g. a transpose).

The full loop iteration vector (o…, r…) has length Ro + Rr.

f folds one element of each input into the accumulator (seeded per output cell with init).
map_out maps the full loop vector to the Ro write coordinates in the result «So». As the reduction is folded away before the write, it must depend only on the leading Ro output indices.
maps#i maps the full loop vector to the Ris#i read coordinates of input i. Transposes, slices and broadcasts are expressed here, by reading the inputs accordingly.

nis, Ro, Rr and the Ris must be literals before entering the tensor::Lower phase for the lowering to succeed; the bounds So and Sr may be symbolic.

axm %tensor.map_reduce: {nis: Nat}
                      → {To: *, Ro Rr: Nat}
                      → [So: «Ro; Nat», Sr: «%core.nat.add (Ro, Rr); Nat»]
                      → {Tis: «nis; *», Ris: «i: nis; Nat», Sis: «i: nis; «Ris#i; Nat»»}
                      → [f: Fn [To, «i: nis; Tis#i»] → To, init: To]
                      → [map_out: [«%core.nat.add (Ro, Rr); %affine.index»] → «Ro; %affine.index»]
                      → [maps: «i: nis; ([«%core.nat.add (Ro, Rr); %affine.index»] → «Ris#i; %affine.index»)»]
                      → [is: «i: nis; «Sis#i; Tis#i»»]
                      → «So; To», normalize_map_reduce;
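The intended meaning of %tensor.map_reduce can be sketched as a small Python interpreter. It is a reference model under stated assumptions, not the lowering: tensors are modelled as dicts from index tuples to values, access maps as plain Python functions instead of %affine.index expressions, and the function name and keyword names are chosen for illustration.

```python
from itertools import product

def map_reduce(So, Sr, f, init, map_out, maps, inputs):
    """Reference semantics only: visit every point of the loop nest Sr."""
    out = {w: init for w in product(*map(range, So))}   # seed each output cell
    for o in product(*map(range, Sr)):                  # full loop vector (o..., r...)
        w = tuple(map_out(o))                           # Ro write coordinates
        elems = tuple(x[tuple(m(o))] for x, m in zip(inputs, maps))
        out[w] = f(out[w], elems)                       # fold into the accumulator
    return out

# Matrix product c[m, l] = sum_k a[m, k] * b[k, l] over the loop vector (m, l, k).
a = {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}
b = {(0, 0): 5, (0, 1): 6, (1, 0): 7, (1, 1): 8}
c = map_reduce(
    So=(2, 2), Sr=(2, 2, 2),
    f=lambda acc, ab: acc + ab[0] * ab[1], init=0,
    map_out=lambda o: (o[0], o[1]),
    maps=(lambda o: (o[0], o[2]), lambda o: (o[2], o[1])),
    inputs=(a, b),
)
```

Here Ro = 2 and Rr = 1: the trailing loop k is folded away, and map_out only looks at the two leading loop variables, as required.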

subs ↦ λ o. ‹j; o#(subs#j)›: the access map that reads input axis j at loop variable subs#j, e.g. ((0, 2), (2, 1)) for the two inputs of a matrix product.

lam %tensor.proj_map {n r: Nat} (subs: «r; Nat») (o: «n; %affine.index»): «r; %affine.index» =
    ‹j: r; o#(%core.idx n 0 (subs#j))›;
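As a quick illustration of the projection, the following Python sketch applies the two subscript rows of the matrix-product example to a concrete loop vector. The Python `proj_map` is a hypothetical stand-in that works on plain integers rather than %affine.index values.

```python
def proj_map(subs):
    """Access map that reads input axis j at loop variable subs[j]."""
    return lambda o: tuple(o[j] for j in subs)

o = (10, 11, 12)               # sample values for the loop vector (m, l, k)
read_a = proj_map((0, 2))(o)   # a is read at (m, k)
read_b = proj_map((2, 1))(o)   # b is read at (k, l)
```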

%tensor.repeat

Tiles input to the larger shape s_out (each s_out#d must be a multiple of s_in#d): out[…d…] = input[… d mod s_in#d …].

axm %tensor.repeat: {T: *, r: Nat}
                  → [s_in: «r; Nat»]
                  → [s_out: «r; Nat»]
                  → [input: «s_in; T»]
                  → «s_out; T», normalize_repeat;
 
fun tensor_copy {T: *} (acc: T, y: «1; T»)@tt: T = return (y#0_1);

o ↦ (o#d mod s_in#d)_d: the wrap-around (tiling) read map.

lam repeat_map {r: Nat} (s_in: «r; Nat») (o: «r; %affine.index»): «r; %affine.index» =
    ‹d: r; %affine.semiop.mod (o#d, s_in#d)›;
 
lam %tensor.repeat_impl {T: *, r: Nat} (s_in: «r; Nat») (s_out: «r; Nat») (input: «s_in; T»): «s_out; T» =
    %tensor.map_reduce @1 @(T, r, 0) (s_out, s_out)
        @(T, r, s_in) (tensor_copy @T, ⊥: T)
        (%affine.id @r) (repeat_map @r s_in,) input;
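The wrap-around read can be checked with a small Python model. It is a sketch of the stated equation only; tensors are modelled as dicts from index tuples to values and `repeat` is a hypothetical name.

```python
from itertools import product

def repeat(s_in, s_out, inp):
    # Each output extent must be a multiple of the matching input extent.
    assert all(o % i == 0 for i, o in zip(s_in, s_out))
    return {o: inp[tuple(od % sd for od, sd in zip(o, s_in))]
            for o in product(*map(range, s_out))}

row = {(0,): "a", (1,): "b"}
tiled = repeat((2,), (6,), row)
```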

%tensor.reshape

Row-major reshape: prod(s_in) must equal prod(s_out). Loops over the result and reads the input at the delinearized linear index, so it works for arbitrary rank changes (e.g. flatten/unflatten).

s_in is explicit because s_out has a different rank and so cannot pin the input's rank/element-type during inference.

axm %tensor.reshape: {T: *, r_in r_out: Nat}
                   → [s_in: «r_in; Nat»]
                   → [s_out: «r_out; Nat»]
                   → [input: «s_in; T»]
                   → «s_out; T», normalize_reshape;

o ↦ delinearize(linearize(o, s_out), s_in): the row-major output→input index map.

lam reshape_map {r_in r_out: Nat} (s_in: «r_in; Nat», s_out: «r_out; Nat»)
                (o: «r_out; %affine.index»): «r_in; %affine.index» =
    %affine.delinearize (%affine.linearize (o, s_out), s_in);
 
lam %tensor.reshape_impl {T: *, r_in r_out: Nat}
                         (s_in: «r_in; Nat») (s_out: «r_out; Nat») (input: «s_in; T»): «s_out; T» =
    %tensor.map_reduce @1 @(T, r_out, 0) (s_out, s_out)
        @(T, r_in, s_in) (tensor_copy @T, ⊥: T)
        (%affine.id @r_out) (reshape_map (s_in, s_out),) input;
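The row-major index map can be worked through in Python. This sketch models only the row-major meaning described above; the Python `linearize` and `delinearize` are assumptions written for illustration and are not the %affine operations themselves.

```python
def linearize(index, shape):
    """Row-major linear offset of `index` in a tensor of the given shape."""
    lin = 0
    for i, n in zip(index, shape):
        lin = lin * n + i
    return lin

def delinearize(lin, shape):
    """Inverse of linearize: recover the multi-index from a linear offset."""
    index = []
    for n in reversed(shape):
        index.append(lin % n)
        lin //= n
    return tuple(reversed(index))

def reshape_map(s_in, s_out):
    return lambda o: delinearize(linearize(o, s_out), s_in)

# Reshape a 2x3 tensor to 3x2: output cell (2, 1) has linear index 5,
# which is input cell (1, 2).
read = reshape_map((2, 3), (3, 2))
```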

%tensor.slice

Strided slice / narrow: out[…d…] = input[… start#d + step#d · d …]. Same rank in/out; s_out gives the sliced extents (the caller must ensure start#d + step#d · (s_out#d − 1) < s_in#d).

axm %tensor.slice: {T: *, r: Nat}
                 → [s_in: «r; Nat»]
                 → [start: «r; Nat», step: «r; Nat», s_out: «r; Nat»]
                 → [input: «s_in; T»]
                 → «s_out; T», normalize_slice;

o ↦ (start#d + step#d · o#d)_d: the strided read map.

lam slice_map {r: Nat} (start: «r; Nat», step: «r; Nat») (o: «r; %affine.index»): «r; %affine.index» =
    ‹d: r; %affine.op.add (%affine.semiop.mul (o#d, step#d), %affine.constant (start#d))›;
 
lam %tensor.slice_impl {T: *, r: Nat} (s_in: «r; Nat»)
                       (start: «r; Nat», step: «r; Nat», s_out: «r; Nat») (input: «s_in; T»): «s_out; T» =
    %tensor.map_reduce @1 @(T, r, 0) (s_out, s_out)
        @(T, r, s_in) (tensor_copy @T, ⊥: T)
        (%affine.id @r) (slice_map (start, step),) input;
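A Python model of the strided read, including the bounds condition the caller is responsible for. Tensors are modelled as dicts from index tuples to values; `slice_tensor` is a hypothetical name, and the assertion is part of the sketch rather than something the axiom is known to check.

```python
from itertools import product

def slice_tensor(s_in, start, step, s_out, inp):
    for d in range(len(s_in)):
        # Caller obligation: the last element read must lie inside the input.
        assert start[d] + step[d] * (s_out[d] - 1) < s_in[d], "slice out of bounds"
    return {o: inp[tuple(st + sp * od for st, sp, od in zip(start, step, o))]
            for o in product(*map(range, s_out))}

vec = {(i,): 10 * i for i in range(7)}
picked = slice_tensor((7,), (1,), (2,), (3,), vec)   # reads indices 1, 3, 5
```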

%tensor.flip

Reverses every axis: out[…d…] = input[… (s_in#d − 1) − d …].

axm %tensor.flip: {T: *, r: Nat}
                → [s_in: «r; Nat»]
                → [input: «s_in; T»]
                → «s_in; T», normalize_flip;

o ↦ ((s_in#d − 1) − o#d)_d: the reverse read map.

lam flip_map {r: Nat} (s_in: «r; Nat») (o: «r; %affine.index»): «r; %affine.index» =
    ‹d: r; %affine.op.sub (%affine.constant (%core.nat.sub (s_in#d, 1)), o#d)›;
 
lam %tensor.flip_impl {T: *, r: Nat} (s_in: «r; Nat») (input: «s_in; T»): «s_in; T» =
    %tensor.map_reduce @1 @(T, r, 0) (s_in, s_in)
        @(T, r, s_in) (tensor_copy @T, ⊥: T)
        (%affine.id @r) (flip_map @r s_in,) input;

%tensor.pad

Pads each axis: s_out#d = lo#d + s_in#d + hi#d, rank preserved. out[…o…] = input[… o#d − lo#d …] whenever every axis lands inside the input region (lo#d ≤ o#d < lo#d + s_in#d for all d); otherwise the value depends on mode:

mode = 0 (constant): out-of-region cells take the scalar operand value.
mode = 1 (replicate): out-of-region cells take the nearest edge element (the read index is clamped to [0, s_in#d − 1] per axis); value is ignored.

The output shape is deduced from lo, s_in and hi (see pad_shape), so it is not passed explicitly.

// `(lo, s_in, hi) ↦ ‹d; lo#d + s_in#d + hi#d›`: the padded shape.
lam pad_shape {r: Nat} (lo: «r; Nat», s_in: «r; Nat», hi: «r; Nat»): «r; Nat» =
    ‹d: r; %core.nat.add (%core.nat.add (lo#d, s_in#d), hi#d)›;
 
axm %tensor.pad: {T: *, r: Nat}
               → [s_in: «r; Nat»]
               → [mode: Nat, lo: «r; Nat», hi: «r; Nat»]
               → [input: «s_in; T», value: T]
               → «pad_shape (lo, s_in, hi); T», normalize_pad;
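Both padding modes can be illustrated with a short Python model. It is a sketch of the description above, with tensors modelled as dicts from index tuples to values; the Python `pad` is a hypothetical helper that also returns the deduced shape.

```python
from itertools import product

def pad(s_in, mode, lo, hi, inp, value):
    s_out = tuple(l + s + h for l, s, h in zip(lo, s_in, hi))   # pad_shape
    out = {}
    for o in product(*map(range, s_out)):
        src = tuple(od - l for od, l in zip(o, lo))
        if all(0 <= c < s for c, s in zip(src, s_in)):
            out[o] = inp[src]                  # inside the input region
        elif mode == 0:
            out[o] = value                     # constant mode
        else:
            clamped = tuple(min(max(c, 0), s - 1) for c, s in zip(src, s_in))
            out[o] = inp[clamped]              # replicate mode: nearest edge
    return s_out, out

vec = {(0,): 7, (1,): 8, (2,): 9}
shape, const = pad((3,), 0, (2,), (1,), vec, 0)
_, repl = pad((3,), 1, (2,), (1,), vec, 0)
```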

%tensor.concat

Joins nis inputs along axis ax. All inputs share rank r and agree on every axis except ax; the output extent along ax is the sum of the inputs' extents there (s_out#ax = Σ_i Sis#i#ax). out[…o…] = is#k[… o with o#ax ↦ o#ax − off#k …], where k is the input whose range o#ax lands in (off are the prefix sums of the per-input extents along ax).

The output shape is deduced from ax and the per-input shapes Sis (see concat_shape), so it is not passed explicitly.

// Accumulator for the axis-sum: adds `row#ax` to the running total.
lam concat_acc {r: Nat} (ax: Idx r) (acc: Nat, row: «r; Nat»): Nat = %core.nat.add (acc, row#ax);
 
// `(ax, Sis) ↦ s_out`: sums the inputs' extents along `ax`, keeps the shared extents on every other axis.
lam concat_shape {nis r: Nat} (ax: Idx r, Sis: «nis; «r; Nat»»): «r; Nat» =
    let sum_ax = %vec.fold.l (concat_acc ax) (0, Sis);
    let row_0  = Sis#(%core.idx nis 0 0);
    ‹d: r; %core.select (%core.icmp.e @r (d, ax), sum_ax, row_0#d)›;
 
axm %tensor.concat: {T: *, nis r: Nat}
                  → [ax: Idx r]
                  → {Sis: «i: nis; «r; Nat»»}
                  → [is: «i: nis; «Sis#i; T»»]
                  → «concat_shape (ax, Sis); T», normalize_concat;
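The shape deduction and the prefix-sum offsets can be modelled in Python as follows. This is a sketch of the description above, not the normalizer; tensors are modelled as dicts from index tuples to values and the Python `concat` is a hypothetical name.

```python
from itertools import product

def concat_shape(ax, Sis):
    total = sum(row[ax] for row in Sis)            # summed extent along ax
    return tuple(total if d == ax else n for d, n in enumerate(Sis[0]))

def concat(ax, Sis, inputs):
    offs = [0]                                      # prefix sums along ax
    for row in Sis:
        offs.append(offs[-1] + row[ax])
    out = {}
    for o in product(*map(range, concat_shape(ax, Sis))):
        # k: the input whose range along ax contains o[ax]
        k = max(i for i in range(len(Sis)) if offs[i] <= o[ax])
        src = o[:ax] + (o[ax] - offs[k],) + o[ax + 1:]
        out[o] = inputs[k][src]
    return out

a = {(0, 0): 1, (0, 1): 2}                          # shape (1, 2)
b = {(0, 0): 3, (0, 1): 4, (1, 0): 5, (1, 1): 6}    # shape (2, 2)
Sis = ((1, 2), (2, 2))
joined = concat(0, Sis, (a, b))
```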

%tensor.conv

2-D convolution (cross-correlation), groups = 1, no bias. Layout: input: «N, Cin, H, W; R#T», weight: «Cout, Cin, KH, KW; R#T», output: «N, Cout, OH, OW; R#T». The multiply-accumulate runs in the ring R.

out[n,co,oh,ow] = Σ_{ci,kh,kw} padded[n, ci, stride#0·oh + dilation#0·kh, stride#1·ow + dilation#1·kw] · weight[co,ci,kh,kw]

padding zero-pads the spatial axes (composed via %tensor.pad) so the windowed read is a pure affine map. The output shape is deduced from the input/weight extents and the window parameters (see conv_shape), so it is not passed explicitly.

// `(n, cout, (H, W), (KH, KW), stride, dilation, padding) ↦ (n, cout, OH, OW)`, where
// `O#d = (s#d + 2·padding#d − (dilation#d·(k#d − 1) + 1)) / stride#d + 1`: the NCHW output shape.
lam conv_shape (n cout: Nat, s: «2; Nat», k: «2; Nat», stride: «2; Nat», dilation: «2; Nat», padding: «2; Nat»): «4; Nat» =
    let hw = ‹d: 2;
        %core.nat.add (
            %core.nat.div (
                %core.nat.sub (
                    %core.nat.add (s#d, %core.nat.mul (2, padding#d)),
                    %core.nat.add (%core.nat.mul (dilation#d, %core.nat.sub (k#d, 1)), 1)),
                stride#d),
            1)›;
    (n, cout, hw#0_2, hw#1_2);
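The output-extent formula is easy to sanity-check numerically. The Python sketch below mirrors conv_shape; it assumes the dilated window fits into the padded input (so the subtraction does not underflow) and uses floor division as a stand-in for the Nat division.

```python
def conv_out_extent(s, k, stride, dilation, padding):
    # (s + 2*padding - (dilation*(k - 1) + 1)) / stride + 1
    return (s + 2 * padding - (dilation * (k - 1) + 1)) // stride + 1

def conv_shape(n, cout, s, k, stride, dilation, padding):
    oh, ow = (conv_out_extent(s[d], k[d], stride[d], dilation[d], padding[d])
              for d in range(2))
    return (n, cout, oh, ow)

same = conv_shape(1, 8, (32, 32), (3, 3), (1, 1), (1, 1), (1, 1))       # 3x3, pad 1
halved = conv_shape(1, 64, (224, 224), (7, 7), (2, 2), (1, 1), (3, 3))  # 7x7, stride 2
```

A 3x3 window with padding 1 and stride 1 preserves the spatial size, while the 7x7 stride-2 case halves it.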

Fused multiply-add over the ring R; ab#0 is the padded-input element, ab#1 the weight.

fun conv_fun [R: %tensor.Ring] (acc: R#T, ab: «2; R#T»)@tt: R#T = return (R#add (acc, R#mul (ab#0_2, ab#1_2)));

The full «7; %affine.index» loop vector is o = (n, co, oh, ow, ci, kh, kw), i.e. o#0 = n, o#1 = co, o#2 = oh, o#3 = ow, o#4 = ci, o#5 = kh, o#6 = kw. The weight read map (co, ci, kh, kw) and the output write map (n, co, oh, ow) are %tensor.proj_map projections of it; only the padded-input read needs arithmetic.

Padded-input read map: (n, ci, stride#0·oh + dilation#0·kh, stride#1·ow + dilation#1·kw).

lam conv2d_in_map (stride: «2; Nat», dilation: «2; Nat») (o: «7; %affine.index»): «4; %affine.index» =
    (o#0_7, o#4_7,
     %affine.op.add (%affine.semiop.mul (o#2_7, stride#0_2), %affine.semiop.mul (o#5_7, dilation#0_2)),
     %affine.op.add (%affine.semiop.mul (o#3_7, stride#1_2), %affine.semiop.mul (o#6_7, dilation#1_2)));
 
axm %tensor.conv: [R: %tensor.Ring]
                → {n cin cout h w kh kw: Nat}
                → [stride: «2; Nat», dilation: «2; Nat», padding: «2; Nat»]
                → [input: «n, cin, h, w; R#T», weight: «cout, cin, kh, kw; R#T»]
                → «conv_shape (n, cout, (h, w), (kh, kw), stride, dilation, padding); R#T»;
 
lam %tensor.conv_impl [R: %tensor.Ring] {n cin cout h w kh kw: Nat}
                      (stride: «2; Nat», dilation: «2; Nat», padding: «2; Nat»)
                      (input: «n, cin, h, w; R#T», weight: «cout, cin, kh, kw; R#T»)
                    : «conv_shape (n, cout, (h, w), (kh, kw), stride, dilation, padding); R#T» =
    // Zero-pad only the spatial axes (lo = hi = (0, 0, pad_h, pad_w)); the pad value is the ring zero.
    let s_in     = (n, cin, h, w);
    let pad_lohi = (0, 0, padding#0_2, padding#1_2);
    let padded   = %tensor.pad s_in (0, pad_lohi, pad_lohi) (input, R#_0);
    let s_padded = %tensor.shape 4 padded;
    let s_out    = conv_shape (n, cout, (h, w), (kh, kw), stride, dilation, padding);
    let oh       = s_out#2_4;
    let ow       = s_out#3_4;
 
    %tensor.map_reduce @2 @(R#T, 4, 3)
        ((n, cout, oh, ow), (n, cout, oh, ow, cin, kh, kw))
        @((R#T, R#T), (4, 4), (s_padded, (cout, cin, kh, kw)))
        (conv_fun R, R#_0)
        (%tensor.proj_map @(7, 4) (0, 1, 2, 3))
        (conv2d_in_map (stride, dilation), %tensor.proj_map @(7, 4) (1, 4, 5, 6))
        (padded, weight);

%tensor.pool

2-D pooling. Reduces each (kernel#0, kernel#1) window of input: «N, C, H, W; T» with the fold g (seeded by init) to output: «N, C, OH, OW; T», pooling each channel independently. padding pads the spatial axes with init (composed via %tensor.pad), so padded cells are inert provided init is a neutral element of g. For example, max-pooling is g = max, init = −∞, and sum-/average-pooling is g = +, init = 0 (divide by the window size afterwards for the average).

Unlike %tensor.conv there is no weight, so the window size kernel is an explicit operand. The output spatial size s_out = (OH, OW) is passed explicitly (no Nat division to compute it).

Wraps a binary fold g as a %tensor.map_reduce reduction Fn [T, «1; T»] → T.

fun pool_fun {T: *} (g: [T, T] → T) (acc: T, y: «1; T»)@tt: T = return (g (acc, y#0_1));

Windowed input read over the loop vector (n, c, oh, ow, kh, kw): (n, c, stride#0·oh + dilation#0·kh, stride#1·ow + dilation#1·kw).

lam pool2d_in_map (stride: «2; Nat», dilation: «2; Nat») (o: «6; %affine.index»): «4; %affine.index» =
    (o#0_6, o#1_6,
     %affine.op.add (%affine.semiop.mul (o#2_6, stride#0_2), %affine.semiop.mul (o#4_6, dilation#0_2)),
     %affine.op.add (%affine.semiop.mul (o#3_6, stride#1_2), %affine.semiop.mul (o#5_6, dilation#1_2)));
 
axm %tensor.pool: {T: *}
                → [g: [T, T] → T, init: T]
                → {n c h w: Nat}
                → [kernel: «2; Nat», stride: «2; Nat», dilation: «2; Nat», padding: «2; Nat», s_out: «2; Nat»]
                → [input: «n, c, h, w; T»]
                → «n, c, s_out#0_2, s_out#1_2; T»;
 
lam %tensor.pool_impl {T: *} (g: [T, T] → T, init: T) {n c h w: Nat}
                      (kernel: «2; Nat», stride: «2; Nat», dilation: «2; Nat», padding: «2; Nat», s_out: «2; Nat»)
                      (input: «n, c, h, w; T»)
                    : «n, c, s_out#0_2, s_out#1_2; T» =
    let s_in     = (n, c, h, w);
    let pad_lohi = (0, 0, padding#0_2, padding#1_2);
    let padded   = %tensor.pad s_in (0, pad_lohi, pad_lohi) (input, init);
    let s_padded = %tensor.shape 4 padded;
 
    %tensor.map_reduce @1 @(T, 4, 2)
        ((n, c, s_out#0_2, s_out#1_2), (n, c, s_out#0_2, s_out#1_2, kernel#0_2, kernel#1_2))
        @(T, 4, s_padded) (pool_fun g, init)
        (%tensor.proj_map @(6, 4) (0, 1, 2, 3))
        (pool2d_in_map (stride, dilation),)
        padded;
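The windowed read and the fold can be modelled in Python on an already padded input. This is a reference sketch only: tensors are modelled as dicts from index tuples to values, the padding step is left out, and `pool2d` is a hypothetical name.

```python
from itertools import product

def pool2d(g, init, n, c, kernel, stride, dilation, s_out, padded):
    out = {}
    for b, ch, oh, ow in product(range(n), range(c), range(s_out[0]), range(s_out[1])):
        acc = init
        for kh, kw in product(range(kernel[0]), range(kernel[1])):
            h = stride[0] * oh + dilation[0] * kh   # windowed row read
            w = stride[1] * ow + dilation[1] * kw   # windowed column read
            acc = g(acc, padded[(b, ch, h, w)])
        out[(b, ch, oh, ow)] = acc
    return out

# 4x4 single-channel image with value 4*h + w, max-pooled with a 2x2 window, stride 2.
img = {(0, 0, h, w): 4 * h + w for h in range(4) for w in range(4)}
pooled = pool2d(max, float("-inf"), 1, 1, (2, 2), (2, 2), (1, 1), (2, 2), img)
```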

%tensor.dot_product

Returns the generalized dot product of a and b:

R: the ring in which the dot product is performed
r1/r2: the ranks of the two inputs
nc/nb: the number of contracting/batching dimensions
c1/c2: the contracting dimensions of the left/right input
b1/b2: the batching dimensions of the left/right input
s1/s2: the shape of the left/right input
a/b: the left/right input

fun dot_general_fun [R: %tensor.Ring] [x: R#T, [y: R#T, z: R#T]]@tt: R#T = return (R#add (x, (R#mul (y, z))));
lam dot_general_pick {n: Nat} {na nb nc: Nat} (a: «na; Idx n», b: «nb; Idx n», c: «nc; Idx n») (off_1 off_2: Nat) (i: Idx n): Nat =
  let (ab, ai) = %vec.first (%core.icmp.e @n) (a, i);
  let (bb, bi) = %vec.first (%core.icmp.e @n) (b, i);
  let (cb, ci) = %vec.first (%core.icmp.e @n) (c, i);
  let ai_nat = %core.bitcast Nat ai;
  let ai_out = %core.nat.add (off_1, ai_nat);
  let bi_nat = %core.bitcast Nat bi;
  let ci_nat = %core.bitcast Nat ci;
  let ci_out = %core.nat.add (off_2, ci_nat);
  ((ai_out, ci_out)#cb, bi_nat)#bb;
lam dot_general_shape {r1 r2: Nat} {nc nb: Nat}
  (c1: «nc; Idx r1», c2: «nc; Idx r2», b1: «nb; Idx r1», b2: «nb; Idx r2»)
  (s1: «r1; Nat», s2: «r2; Nat») =
      let bs_check = ‹i: nb; %refly.check (%core.ncmp.e ((s1#(b1#i), s2#(b2#i))), s1#(b1#i), "batching dims don't match")›;
      let cs_check = ‹i: nc; %refly.check (%core.ncmp.e ((s1#(c1#i), s2#(c2#i))), s1#(c1#i), "contracting dims don't match")›;
      let bs       = ‹i: nb; s1#(b1#i)›;
      let bc_1     = %vec.cat (b1, c1);
      let s1_res   = %vec.diff (s1, bc_1);
      let bc_2     = %vec.cat (b2, c2);
      let s2_res   = %vec.diff (s2, bc_2);
      let s12_res  = %vec.cat (s1_res, s2_res);
      let s_out    = %vec.cat (bs, s12_res);
      s_out;
axm %tensor.dot_product: [R: %tensor.Ring]
                       → {r1 r2: Nat}
                       → {nc nb: Nat}
                       → [c1: «nc; Idx r1», c2: «nc; Idx r2», b1: «nb; Idx r1», b2: «nb; Idx r2»]
                       → {s1: «r1; Nat», s2: «r2; Nat»}
                       → [a: «s1; R#T», b: «s2; R#T»]
                       → «dot_general_shape (c1, c2, b1, b2) (s1, s2); R#T»;
 
lam %tensor.dot_product_impl
  (R: %tensor.Ring) {r1 r2: Nat} {nc nb: Nat}
  (c1: «nc; Idx r1», c2: «nc; Idx r2», b1: «nb; Idx r1», b2: «nb; Idx r2»)
  {s1: «r1; Nat», s2: «r2; Nat»} (a: «s1; R#T», b: «s2; R#T»)
    : let s_out = dot_general_shape (c1, c2, b1, b2) (s1, s2);
      «s_out; R#T»
    = let s_out    = dot_general_shape (c1, c2, b1, b2) (s1, s2);
      let bc_1     = %vec.cat (b1, c1);
      let s1_res   = %vec.diff (s1, bc_1);
      let n_s1_res = %vec.len s1_res;
      let bc_2     = %vec.cat (b2, c2);
      let f        = dot_general_fun R;
      let r_out    = %vec.len s_out;
      let a1       = %vec.diff (‹i: r1; i›, bc_1);
      let a2       = %vec.diff (‹i: r2; i›, bc_2);
      let subs_1   = ‹i: r1; dot_general_pick (a1, b1, c1) (nb, r_out) i›;
      let off_2    = %core.nat.add (nb, n_s1_res);
      let subs_2   = ‹i: r2; dot_general_pick (a2, b2, c2) (off_2, r_out) i›;
      // The subscript rows use consecutive loop ids: 0 … r_out−1 are the output loops
      // and r_out … r_out+nc−1 the contractions, whose bounds are the contracting extents of `a`.
      let s_red    = ‹i: nc; s1#(c1#i)›;
      let s_full   = %vec.cat (s_out, s_red);
      let n_loops  = %core.nat.add (r_out, nc);
      %tensor.map_reduce @2 @(R#T, r_out, nc) (s_out, s_full)
          @((R#T, R#T), (r1, r2), (s1, s2)) (f, R#_0)
          (%tensor.proj_map @(n_loops, r_out) ‹i: r_out; %core.bitcast Nat i›)
          (%tensor.proj_map @(n_loops, r1) subs_1, %tensor.proj_map @(n_loops, r2) subs_2)
          (a, b);
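The shape rule and the subscript rows can be reproduced in Python. The sketch assumes that %vec.diff drops the listed axes while keeping the remaining ones in order, and that loop ids are laid out as batch dims, then the free dims of a, then the free dims of b, then the contractions; `dot_general_subs` is a hypothetical helper that combines what dot_general_pick computes per axis. All dimension lists are passed as tuples.

```python
def dot_general_shape(c1, c2, b1, b2, s1, s2):
    assert all(s1[i] == s2[j] for i, j in zip(b1, b2)), "batching dims don't match"
    assert all(s1[i] == s2[j] for i, j in zip(c1, c2)), "contracting dims don't match"
    free1 = [d for d in range(len(s1)) if d not in b1 + c1]
    free2 = [d for d in range(len(s2)) if d not in b2 + c2]
    return (tuple(s1[d] for d in b1) + tuple(s1[d] for d in free1)
            + tuple(s2[d] for d in free2))

def dot_general_subs(c1, c2, b1, b2, r1, r2):
    free1 = tuple(d for d in range(r1) if d not in b1 + c1)
    free2 = tuple(d for d in range(r2) if d not in b2 + c2)
    nb = len(b1)
    r_out = nb + len(free1) + len(free2)

    def pick(free, batch, contract, off_free, i):
        if i in batch:                      # batch axis: loop id = position in batch
            return batch.index(i)
        if i in contract:                   # contraction: loop ids start at r_out
            return r_out + contract.index(i)
        return off_free + free.index(i)     # free axis: offset into the output loops

    subs_1 = tuple(pick(free1, b1, c1, nb, i) for i in range(r1))
    subs_2 = tuple(pick(free2, b2, c2, nb + len(free1), i) for i in range(r2))
    return subs_1, subs_2

bmm_shape = dot_general_shape((2,), (1,), (0,), (0,), (4, 2, 3), (4, 3, 5))
matmul_subs = dot_general_subs((1,), (0,), (), (), 2, 2)
bmm_subs = dot_general_subs((2,), (1,), (0,), (0,), 3, 3)
```

For the plain matrix product this yields the rows (0, 2) and (2, 1), the same pair used as the example for %tensor.proj_map.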

%tensor.product_2d

Computes the matrix product of two 2-dimensional tensors - a special case of %tensor.dot_product.

axm %tensor.product_2d: [R: %tensor.Ring]
                      → {m k l: Nat}
                      → [t1: «m, k; R#T», t2: «k, l; R#T»]
                      → «m, l; R#T»;
 
lam %tensor.product_2d_impl (R: %tensor.Ring) {m k l: Nat} (t1: «m, k; R#T», t2: «k, l; R#T»): «m, l; R#T»
        = %tensor.dot_product_impl R @(2, 2) (1_2, 0_2, (), ()) @((m, k), (k, l)) (t1, t2);

%tensor.bmm

Computes the batch matrix multiplication (BMM) of two 3-dimensional tensors - a special case of %tensor.dot_product.

axm %tensor.bmm: [R: %tensor.Ring]
              → {B M K N: Nat}
              → [t1: «B, M, K; R#T», t2: «B, K, N; R#T»]
              → «B, M, N; R#T»;
 
lam %tensor.bmm_impl (R: %tensor.Ring) {B M K N: Nat} (t1: «B, M, K; R#T», t2: «B, K, N; R#T»): «B, M, N; R#T»
        = %tensor.dot_product_impl R @(3, 3) (2_3, 1_3, 0_3, 0_3) @((B, M, K), (B, K, N)) (t1, t2);

%tensor.transpose

Permutes the dimensions of input according to permutation.

lam transpose_shape {r: Nat} (s: «r; Nat», permutation: «r; Idx r»): «r; Nat» =
  let shape_permutation = ‹i: r; (%vec.first (%core.icmp.e @r) (permutation, i))#tt›;
  ‹i: r; s#(shape_permutation#i)›;
axm %tensor.transpose: {T: *, r: Nat, s: «r; Nat»}
                     → [input: «s; T», permutation: «r; Idx r»]
                     → «transpose_shape (s, permutation); T»;

o ↦ (o#(perm#0), …, o#(perm#(r−1))): input axis j reads output-loop var o#(perm#j).

lam transpose_map {r: Nat} (perm: «r; Idx r») (o: «r; %affine.index»): «r; %affine.index» = ‹j: r; o#(perm#j)›;
 
lam %tensor.transpose_impl {T: *, r: Nat, s: «r; Nat»}
                           (input: «s;T», permutation: «r; Idx r»)
                         : «transpose_shape (s, permutation); T»
                         = let out_s = transpose_shape (s, permutation);
                           %tensor.map_reduce @1 @(T, r, 0) (out_s, out_s)
                               @(T, r, s) (tensor_copy @T, ⊥: T)
                               (%affine.id @r) (transpose_map permutation,) input;
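The relation between transpose_shape (which uses the inverse permutation) and transpose_map (which uses the permutation itself) can be checked in Python. This is a sketch of the definitions above; tensors are modelled as dicts from index tuples to values and the Python `transpose` is a hypothetical name.

```python
from itertools import product

def transpose_shape(s, perm):
    inv = [perm.index(i) for i in range(len(s))]    # position of i in perm
    return tuple(s[inv[i]] for i in range(len(s)))

def transpose(s, perm, inp):
    out_s = transpose_shape(s, perm)
    # Input axis j reads output-loop variable o[perm[j]].
    out = {o: inp[tuple(o[perm[j]] for j in range(len(s)))]
           for o in product(*map(range, out_s))}
    return out_s, out

inp = {(i, j): 10 * i + j for i in range(2) for j in range(3)}
out_s, out = transpose((2, 3), (1, 0), inp)
```

With a non-involutive permutation such as (1, 2, 0) the inverse matters: transpose_shape((2, 3, 4), (1, 2, 0)) evaluates to (4, 2, 3), so every read o[perm[j]] stays within the extent of input axis j.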

%tensor.transpose_2d

Swaps the two dimensions of a 2-dimensional tensor, i.e. %tensor.transpose with the permutation (tt, ff).

axm %tensor.transpose_2d: {T: *}
                        → {s: «2; Nat»}
                        → [input: «s; T»]
                        → «s#tt, s#ff; T»;
 
lam %tensor.transpose_2d_impl {T: *} {s: «2; Nat»} (input: «s; T»): «s#tt, s#ff; T» =
  %tensor.transpose_impl @(T, 2, s) (input, (tt, ff));

%tensor.broadcast

Expands the dimensions of input to fit s_out: for all i, either s_in#i = s_out#i or s_in#i = 1, and the size-1 dimensions are expanded to fit s_out.

Lowered directly to packs by the LowerMapReduce phase (lower_broadcast): a size-1 axis broadcast to n becomes a ‹n; …› pack rather than a materialized/looped copy, which %tensor.map_reduce cannot express.

axm %tensor.broadcast:  {T: *, r: Nat}
                      → [s_in: «r; Nat», s_out: «r; Nat», input: «s_in; T»]
                      → «s_out; T», normalize_broadcast;

%tensor.broadcast_in_dim

Transposes and expands the dimensions of input to fit s_out. Each input dimension is mapped to the output dimension given by index; the holes are filled by broadcasting.

Todo: We could probably implement this in terms of %tensor.broadcast and %tensor.transpose directly.

axm %tensor.broadcast_in_dim: {T: *, r_in r_out: Nat}
                            → [s_in: «r_in; Nat», s_out: «r_out; Nat», input: «s_in; T», index: «r_in; Idx r_out»]
                            → «s_out; T», normalize_broadcast_in_dim;

o ↦ (o#(index#i) · [s_in#i ≠ 1])_i: input axis i reads output-loop var o#(index#i) (the transpose), unless it is a broadcast axis (s_in#i = 1), which reads index 0.

lam bid_map {r_in r_out: Nat} (s_in: «r_in; Nat», index: «r_in; Idx r_out»)
            (o: «r_out; %affine.index»): «r_in; %affine.index» =
    ‹i: r_in; %affine.semiop.mul (o#(index#i), %core.select (%core.ncmp.e (s_in#i, 1), 0, 1))›;
 
lam %tensor.broadcast_in_dim_impl {T: *, r_in r_out: Nat}
    (s_in: «r_in; Nat», s_out: «r_out; Nat», input: «s_in; T», index: «r_in; Idx r_out»): «s_out; T» =
    %tensor.map_reduce @1 @(T, r_out, 0) (s_out, s_out)
        @(T, r_in, s_in) (tensor_copy @T, ⊥: T)
        (%affine.id @r_out) (bid_map (s_in, index),) input;
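The masked read of bid_map can be modelled in Python. It is a sketch of the formula above, with tensors modelled as dicts from index tuples to values; the Python function names mirror the Mim ones but are illustrative only.

```python
from itertools import product

def bid_map(s_in, index):
    # Broadcast axes (extent 1) always read index 0; other axes read o[index[i]].
    return lambda o: tuple(o[index[i]] * (0 if s_in[i] == 1 else 1)
                           for i in range(len(s_in)))

def broadcast_in_dim(s_in, s_out, inp, index):
    read = bid_map(s_in, index)
    return {o: inp[read(o)] for o in product(*map(range, s_out))}

# A (1, 3) row broadcast to (2, 3): input axis 0 has extent 1 and is masked to 0.
row = {(0, j): 10 * j for j in range(3)}
out = broadcast_in_dim((1, 3), (2, 3), row, (0, 1))

# A rank-1 vector placed on output axis 1 of a (2, 3) result.
vec = {(j,): j for j in range(3)}
lifted = broadcast_in_dim((3,), (2, 3), vec, (1,))
```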

%tensor.map

Maps a function elementwise over ni tensors of the same shape s: out[o] = app (is#0[o], …, is#(ni−1)[o]).

axm %tensor.map: {T: *, ni: Nat, Is: «ni; *»}
               → [app: «i: ni; Is#i» → T]
               → {r: Nat, s: «r; Nat»}
               → [is: «i: ni; «s; Is#i»»]
               → «s; T»;
 
lam %tensor.map_impl {T: *, ni: Nat, Is: «ni; *»}
                     (app: «i: ni; Is#i» → T)
                     {r: Nat, s: «r; Nat»}
                     (is: «i: ni; «s; Is#i» »)
                     : «s; T» =
  fun app_mr [x: T, y: «i: ni; Is#i»]@tt = return (app y);
  %tensor.map_reduce @ni @(T, r, 0) (s, s) @(Is, ‹ni; r›, ‹ni; s›) (app_mr, ⊥: T)
      (%affine.id @r) ‹ni; %affine.id @r› is;

%tensor.unary

Maps a unary function over a tensor.

axm %tensor.unary: {Ti To: *}
                 → [app: Ti → To]
                 → {r: Nat, s: «r; Nat»}
                 → [i: «s; Ti»]
                 → «s; To»;
 
lam %tensor.unary_impl {Ti To : *} [app: Ti → To] {r: Nat, s: «r; Nat»} (i: «s; Ti»): «s; To» =
  %tensor.map_impl @(To, 1, Ti) app @(r, s) i;

%tensor.binary

Maps a binary function over a pair of tensors.

axm %tensor.binary: {Ti1 Ti2 To: *}
                  → [app: [Ti1, Ti2] → To]
                  → {r: Nat, s: «r; Nat»}
                  → [is: [«s; Ti1», «s; Ti2»]]
                  → «s; To»;
 
lam %tensor.binary_impl {Ti1 Ti2 To: *} (app: [Ti1, Ti2] → To) {r: Nat, s: «r; Nat»} (is: [«s; Ti1», «s; Ti2»]): «s; To» =
  %tensor.map_impl @(To, 2, (Ti1, Ti2)) app @(r, s) is;

%tensor.select

Maps %core.select elementwise over a Bool tensor and two value tensors, all of the same shape s.

axm %tensor.select: {T: *}
                  → {r: Nat, s: «r; Nat»}
                  → [is: [«s; Bool», «s; T», «s; T»]]
                  → «s; T»;
 
lam %tensor.select_impl {T: *} {r: Nat, s: «r; Nat»} (is: [«s; Bool», «s; T», «s; T»]): «s; T» =
  %tensor.map_impl @(T, 3, (Bool, T, T)) (%core.select @T) @(r, s) is;

Phases

lower_tensor: lowers the high-level tensor axioms to the low-level ones (map_reduce, ...) by re-applying the matching *_impl annexes.
lower_map_reduce: lowers the low-level tensor axioms directly to their underlying primitives (loops, extract, insert, packs, ...).
lower_get_set: lowers %tensor.get / %tensor.set to their underlying primitives (extract, insert).
fuse_tensor: fuses producer map_reduces into their consumers' access maps.
lower_to_mem: bufferizes the low-level tensor axioms (get, set, map_reduce) onto the buffer layer: tensor array values become %buffer.Buf handles and the operations become %buffer.read / %buffer.write / %buffer.alloc (threading %mem.M). Memory is threaded by %mem.add_mem (scheduled next in the pipeline); run %buffer.lower_ptr afterwards to reach %mem.Ptr.

axm %tensor.lower_tensor:     %compile.Phase;
axm %tensor.lower_map_reduce: %compile.Phase;
axm %tensor.lower_get_set:    %compile.Phase;
axm %tensor.fuse_tensor:      %compile.Phase;
axm %tensor.lower_to_mem:     %compile.Phase;
