Thoughts on readable data processing
June 25, 2020.

TLDR: dynamic views as a composable zero-cost abstraction. Or: like SQL but easy to read. (gosh, I should get into writing paper titles)
the big idea
Late last night I thought of an idea for a data processing language. Consider the following problem:
```
"just a second"             -> "JUST a second"
"bone apple tea"            -> "BONE apple tea"
"capitalize the first word" -> "CAPITALIZE the first word"
```
In most languages (like Python or regex), this would require one or more of the following:

- finding where to stop capitalizing (via a loop or via string matching)
- working with indices up to that point
- using a loop to capitalize each letter

and the resulting performance depends on how you wrote all of that.
What if, instead of doing any of that, we could naively operate on a specific view of the string, namely the first word:
```
# run toUpper() on each char of the first word
toUpper(input.words.first.chars);
```
and have it behave exactly as if we wrote a tight loop in C? Like:

```c
for (int i = 0; input[i] != '\0'; i++) {
    if (!isspace((unsigned char)input[i])) {
        input[i] = toupper((unsigned char)input[i]);
    } else break;
}
```
I doubt this is a new idea, since my thoughts here are very inspired by the idea of dynamic views and by the J language’s composability. It’s like SQL, except queries are performant AND readable.
fleshing out the concept
To simplify things, let’s work with flat, nonempty integer arrays. So the interesting problems include things like taking k-maximum, or finding the cumulative sum.
```
1 6 2 3 4.maximum(2)    == 4 6
1 6 2 3 4.cumulativeSum == 1 7 9 12 16
```
How would we express these using the idea of ‘operating on a specified view’? I would imagine something like
```
input.maximum(k)    := input.sorted.last(k)
input.cumulativeSum := input.prefixes.sum
```
Takeaways:

- `maximum`: applying something like `.sorted.last(k)` should not actually perform a sort, but instead find the k-maximum directly (e.g. via a min heap when k > 3).
- `cumulativeSum`: clearly I’m cheating a bit, since I said we’re working with flat arrays, but taking `prefixes` of a flat array sounds like a jagged 2D array. I’m thinking that `prefixes` is actually some special intermediate object that expects a folding operation (`sum`) to be performed on the prefixes, and then uses that fold to produce the cumulative sum. It’s directly inspired by J’s `\` operator. This reminds me of Clojure’s transducers…
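To make the `cumulativeSum` takeaway concrete, here is a sketch of what a compiler might emit for `input.prefixes.sum` (the function name and signature are my own invention, not a worked-out design): the fold over each prefix collapses into a single running accumulator, so the jagged 2D array of prefixes is never materialized.

```c
/* Hypothetical lowering of input.prefixes.sum: rather than building
   every prefix, reuse the fold's accumulator across prefixes.
   One O(n) pass, O(1) extra space. */
static void cumulative_sum(const int *input, int n, int *out) {
    int acc = 0;
    for (int i = 0; i < n; i++) {
        acc += input[i];  /* acc is now the sum of input[0..i] */
        out[i] = acc;
    }
}
```

This works because running a left fold over the array yields, at step i, exactly the fold of the i-th prefix; any fold admits the same O(n) translation, which is what J’s `\` exploits.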
zero-cost property
How do we best-effort ensure that this dynamic view abstraction is zero-cost, meaning that everything we write using dynamic views performs the same as a hand-written tight loop? The indirection should be invisible after compilation.
For starters, every dynamic view should be managed at compile time. Under the hood, the compiler should translate `input.first.add(2)` to `input[0] += 2` in C. How?
I’m thinking that each hardcoded view, like `.first`, should (at compile time) change the modified data to be the first element. This sounds obvious, but becomes non-trivial when we get views like `.sorted`, where the modified data is now the sorted version of the array. `.sorted` requires some indirection under the hood. Taking another page from J, we could keep around a permutation vector of indices, so `10 30 40 20` would have the permutation vector `0 3 1 2`, meaning “smallest element is at index 0, second smallest is at index 3, …”. The compiler handles the rest.
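As a sketch of that permutation-vector indirection (the function name is mine, and a real compiler would presumably emit something smarter than insertion sort), computing the vector is just an argsort over the original data:

```c
/* Hypothetical helper behind .sorted: build perm such that
   input[perm[0]] <= input[perm[1]] <= ... without moving any data.
   Insertion sort on indices keeps the sketch self-contained. */
static void argsort(const int *input, int n, int *perm) {
    for (int i = 0; i < n; i++) perm[i] = i;  /* identity permutation */
    for (int i = 1; i < n; i++) {
        int p = perm[i];
        int j = i - 1;
        /* Shift indices of larger elements right, then drop p in place. */
        while (j >= 0 && input[perm[j]] > input[p]) {
            perm[j + 1] = perm[j];
            j--;
        }
        perm[j + 1] = p;
    }
}
```

A later `.last(k)` can then read `input[perm[n-k]] … input[perm[n-1]]` directly, with the original array untouched.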
This ad-hoc method needs to be implemented for every dynamic view, and maybe there should be optimizations for common compositions like `.sorted.last`.
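For example, a `.sorted.last(k)` composition could plausibly lower to the min-heap selection mentioned earlier, instead of a full sort. A hand-written sketch of what such a translation might emit (all names are mine; assumes n >= k >= 1):

```c
/* Restore the min-heap property at index i within heap[0..size). */
static void sift_down(int *heap, int size, int i) {
    for (;;) {
        int smallest = i, l = 2 * i + 1, r = 2 * i + 2;
        if (l < size && heap[l] < heap[smallest]) smallest = l;
        if (r < size && heap[r] < heap[smallest]) smallest = r;
        if (smallest == i) return;
        int tmp = heap[i]; heap[i] = heap[smallest]; heap[smallest] = tmp;
        i = smallest;
    }
}

/* Hypothetical lowering of input.sorted.last(k): keep a min-heap of the
   k largest elements seen so far, O(n log k) instead of an O(n log n)
   full sort. out[0..k) ends up holding the k maxima in ascending order. */
static void k_maximum(const int *input, int n, int k, int *out) {
    for (int i = 0; i < k; i++) out[i] = input[i];
    for (int i = k / 2 - 1; i >= 0; i--) sift_down(out, k, i);  /* heapify */
    for (int i = k; i < n; i++) {
        if (input[i] > out[0]) {  /* bigger than the smallest survivor */
            out[0] = input[i];
            sift_down(out, k, 0);
        }
    }
    /* Heap-sort the survivors (yields descending order), then reverse. */
    for (int size = k; size > 1; size--) {
        int tmp = out[0]; out[0] = out[size - 1]; out[size - 1] = tmp;
        sift_down(out, size - 1, 0);
    }
    for (int i = 0, j = k - 1; i < j; i++, j--) {
        int tmp = out[i]; out[i] = out[j]; out[j] = tmp;
    }
}
```

Matching the earlier example, `1 6 2 3 4` with k = 2 yields `4 6`.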
composition property
This approach to dynamic views is cool, but it has one flaw: composition is non-trivial. There are actually two parts to this:
- Language-level composition: how do we make it easy to work with a view of a view of a view?
- Compiler-level composition: how do we make that work perfectly under the hood?
language-level composition
My favorite example here would be taking the integer mean of an integer array. This requires summing an array, finding its length, and integer dividing these two numbers.
How would you express that? In J, it would look like `(+/ % #) input`, meaning `(sum divide length) input`. So J has composition down: it has a neat way to compose operators (the fork combinator).
How do we get dynamic views to be just as powerful? For reference, a tight loop in another language would look like:
```c
int total = 0;
for (int i = 0; i < n; i++) {  /* n = length of input */
    total += input[i];
}
int mean = total / n;
```
Would it just be `input.mean := input.sum.divide(input.length)`? It sounds messy. This needs some thinking.
compiler-level composition
I mentioned earlier that common compositions of primitives, like `.sorted.last`, should be identified and replaced with a more performant version. Idempotent operations like `.sorted.sorted` should translate to `.sorted`. Involutions like `.reverse.reverse` should translate to a no-op. These are hardcoded solutions, corresponding to Special Combinations in J.
But how do we make things like this work in general? Take `toUpper(input.words.first.chars)` from the first example, using strings. How the hell do we go from that to…?
```c
for (int i = 0; input[i] != '\0'; i++) {
    if (!isspace((unsigned char)input[i])) {
        input[i] = toupper((unsigned char)input[i]);
    } else break;
}
```
My current thought for this is to make every view apply a layer of indirection, which will get resolved at compile time. So `.sorted` generates an array of permutation indices, which the compiler can traverse later to get the actual sorted list representation. (This approach is very amenable to extension, since all that’s needed to support new view operations is to define the indices.)
In our `toUpper` example, `.words` could be a list of indices that point to the start of each word. `.first` gets the first index of this list. `.chars` gets the list of indices for each letter. The compiler will have all this information available. The challenge is: how do we skip computing the unused information, like indices for everything that is not the first word? I need to do more research.
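One way I could picture that laziness (purely my own sketch; `span_t` and the function names are invented): if `.words` yields (start, length) spans on demand, then asking for `.first` only ever scans up to the end of the first word, and the indices for every other word are never computed.

```c
#include <ctype.h>

typedef struct { int start, len; } span_t;  /* a view as an index range */

/* Demanding only .words.first: the scan stops at the end of the first
   word, so no later word boundary is ever computed. */
static span_t first_word(const char *input) {
    span_t w = {0, 0};
    while (input[w.start] != '\0' && isspace((unsigned char)input[w.start]))
        w.start++;  /* skip leading whitespace */
    int i = w.start;
    while (input[i] != '\0' && !isspace((unsigned char)input[i]))
        i++;        /* scan to the first word's end */
    w.len = i - w.start;
    return w;
}

/* .chars of a span is just the index range [start, start + len). */
static void upper_span(char *input, span_t w) {
    for (int i = w.start; i < w.start + w.len; i++)
        input[i] = (char)toupper((unsigned char)input[i]);
}
```

The early `break` falls out naturally here: the demand for `.first` bounds the scan, which is exactly the skipping question above.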
(side note: in J we would turn `notIsSpace "abc 123"` into a predicate array `1 1 1 0 1 1 1`, which can be used by other operations to split into words, for example. In our example, I’d imagine we’d throw out everything after seeing the first 0. How do we avoid computing the remaining 1s? Where does the early `break` come in? This needs research.)
practical data representations
I really want this dynamic view idea for more than just integer arrays. I’m thinking strings obviously, but also associative arrays, custom ADTs, json, bit fields, dataframes, file handles.
Well, one can’t possibly write them all, but what if there were an API that allowed writing this stuff at a low level? Like having a low-level way to specify views as lists of indices. The easiest solution would be exposing operations that the compiler can call during the compile step, though this makes compilation dangerous (might allow arbitrary code execution?).
closing thoughts
This was just a thought experiment for a small data-processing language that compiles down to a fast language like C. If it turns out to be a solvable problem, then I wouldn’t feel comfortable about the implications. For one, if these operations are easy to specify while being zero-cost, then there’s no point in learning algorithms for e.g. finding k-maximum, since you would just use an equivalent formulation like `.sorted.last(k)` at zero cost!
If we know all the fast algorithms, and this syntax saves you from handwriting speed-saving constructions (like constructing min heaps by hand) and doesn’t require importing a library, then why ever learn algorithms outside of academia?
If most data-processing is expressible via manipulation of dynamic views, then why ever learn other forms of declarative programming with high level concepts like filters and folds?
Thoughts like this keep a practical computer science educator up at night.
addendum: approximate algorithms
One thing this abstraction does not allow is approximate algorithms, which may be much, much faster than the best exact algorithm both time- and space-wise, at the cost of not guaranteeing correctness.
Maybe approximate algorithms could be a library? Adding things like `approxSort`? Maybe we do need libraries after all.
addendum: time-efficiency vs space-efficiency
All this time I kind of ignored space as a problem. In the interest of optimizing for time, I silently assumed constructing auxiliary data structures would not be a big issue. Which is not true. Allocations are not free, and ignoring allocation might make the time-efficient algorithms slower.
This will definitely have to be a user setting. Maybe optionally annotate files or LOC or individual functions to run with certain efficiency constraints, so the compiler picks the more space-efficient translation over the time-efficient translation. Maybe have different levels of efficiency annotations. Will need to think about this too.