Description
As you must know by now, Erk Meijer and team spend time thinking about and discovering the Essence in things. One year ago today, Bart De Smet blogged about the notion of a core set of LINQ operators, MinLINQ, the essence of LINQ.
"Hey Bart, what is MinLINQ, exactly?"
"MinLINQ is an implementation of the LINQ to Objects Standard Query Operators using a functionoriented layered approach. Based on three essential operators called Ana, Bind and Cata, others are implemented. While the current implementation focuses on IEnumerable<T> exclusively, the same layering can be used to a dual IObservable<T> implementation."
Sit back, relax and jump into the layered functional rabbit hole where we meet some interesting characters named Ana, Bind and Cata. It's all about the essence, Alice.
Happy New Year!
Download
Right click to download this episode
 High Quality WMV (901.7 MB)
 MP3 (36.6 MB)
 Mid Quality WMV (951.5 MB)
 High Quality MP4 (1.8 GB)
 Low Quality MP4 (453.6 MB)
The Discussion

Can't wait to see what Bart has for us next.

Ana, Bind, Cata
Shape, Wrap, Roll
CAR, CDR, CONS
balabing, balabang, balaboom

Great! @Bart you mention this as an academic exercise. Don't you see realworld significance for this kind of simplification or is it perhaps at odds with things like optimal translation of reified expressions into SQL? @Felix don't forget hylo, synchro, etc.

Mindblowing, as usual
Great work Bart!

Great way to start the year (currently half way through the video).
@Bart: I'm amazed by your simple IE>IO>Amb>IE roundtrip example. It is concise yet clear, demonstrating the importance of orthogonality.

@exoteric: Sure, there's realworld significance, ranging from setting a design philosophy for a query language or processor with proper layering of operators and their semantics, where compositionality rules the world, onto implementations itself. I won't yet comment on the latter category at this stage, rest assured we can keep ourselves more than busy on our team.
For now, it should be clear that the MinLINQ implementation is truly meant as an academic exercise, in particular when it comes to using the implementation for realworld scenarios, due to the various shortcuts that have been taken (no disposable nature, error handling omitted, etc.).
Exercise for the viewers: what'd be a good generalization of the set of GroupBy operators? How about other kinds of partitioning of an input sequence in chunks (e.g. Buffer* operators in Rx)?

@ShinNoNoir: Thanks for watching in 2011 too! The Amb roundtrip is indeed a beautiful one.
There are plenty of other ones, so here's one exercise for the viewers: assume your team has only the budget to use one try...catch block (assume each block costs as $1,000 and the licensing model of the C# compiler is to pay per try...catch used) in the entire codebase of your LINQ to Objects implementation of the Standard Query Operators (i.e. IEnumerable<T> based). How'd you spend it (*) such that you can easily implement errorhandling operators in the future? Examples include operators such as Catch :: IE<T> > (Exception > IE<T>) > IE<T>, but also Retry, etc.
Tip: you want to realize Catch, Retry, etc. using a roundtrip scheme of IE<T> > ??<T> > (wow, error handling got so much easier here thanks to the $1,000 wellspent) > ??<T> > IE<T>. Try to find what I've highlighted in bold and you may be onto something (nested tip: we got this in Rx and Ix).
(*) Constraint: use the try...catch inside a proper operator/combinator that deals with sequences. I.e. simply wrapping try...catch in a generic helper method such as the one shown below is not an option:
static void Catch<E>(Action action, Action<E> handler) where E : Exception { try { action(); } catch (E e) { handler(e); } }

Great video! Mindblowing and somehow back to basics.

Great interview and video to start the new year off with.
Max and Min in terms of using Aggregate could be written like so.
Function Max(Of T As IComparable)(ByVal c As IEnumerable(Of T)) As T Return Enumerable.Aggregate(c, c.First, Function(a, e) ( Function(_x As T, _y As T) {_y, _x, _x}(Math.Sign(_x.CompareTo(_y)) + 1) )(a, e) ) End Function Function Min(Of T As IComparable)(ByVal c As IEnumerable(Of T)) As T Return Enumerable.Aggregate(c, c.First, Function(a, e) ( Function(_x As T, _y As T) {_x, _x, _y}(Math.Sign(_x.CompareTo(_y)) + 1) )(a, e) ) End Function
I'm using
Math.Sign(_x.CompareTo(_y)) + 1
because the IComparable interface doesn't require the return value to limited to the values 1, 0 and +1, so I have to normalise it to this range.
Note that the above implementation need the extension method .First. So this requires a function to exist which returns the first value of the Enumerable. According to my understanding this function would be a CATA, so it is based off an Aggregate.
Now I think that be challenge. How do write First in terms of Aggregate. (Can it be done without the indexing overload of SelectMany?)
Thinking Headwear on.

Done First but it used the Indexing overload.
Function First(Of T)(ByVal src As IEnumerable(Of T)) As T Return Enumerable.Aggregate(Of T)( Enumerable.SelectMany(Of T, T)(src, Function(curr As T, index As Integer) As IEnumerable(Of T) Return If(index = 0, Enumerable.Repeat(curr, 1), Enumerable.Empty(Of T)) End Function), Function(accum, curr) (curr)) End Function

How would Sum work?
Using Aggregate it would simply be.
Function Sum(Of T )(ByVal src As IEnumerable(Of T)) As Integer Return Enumerable.Aggregate(src, Function(accum, curr) accum + curr) End Function
But the problem is the addition, it would work if the Generic Type could constrained to only type that implement an addition operator.
Would this suggest that .Sum() isn't generic and is an Extension method with multiple overloads, one for each arithmetical type. Eg Long, Integer, Double etc?

The problem is that operators are static in C# (Shared in VB), so you are pretty much forced to multiple overloads.
Here's something I found about generic operators written by Marc Gravell and Jon Skeet: http://www.yoda.arachsys.com/csharp/genericoperators.html

3 hours ago, bdesmet wrote
assume your team has only the budget to use one try...catch block in the entire codebase of your LINQ to Objects implementation of the Standard Query Operators (i.e. IEnumerable<T> based). How'd you spend it (*) such that you can easily implement errorhandling operators in the future?
That can be done using operator analogous to Materialize of Ix. It makes transition from implicit exceptionhandling mode to explicit Error monad.

So I've watched the video and been pondering about something. Bart, you use Ana instead of Unit (or Return). As you explained in the video, you can define Unit in terms of Ana. So far's everything fine.
The thing I've been pondering about, can you always define Ana for each monad M?

@AdamSpeight2008: To define First, think of an Aggregate whose accumulation object tracks the first element and whether it has seen it already. However, keep in mind this is not the best definition of First by any means, because Aggregate will drain the entire sequence before it can produce the result. This doesn't work for sideeffecting iterators (in LINQ to Objects, you won't see the effects for anything but enumeration to the first element), and infinite sequences. Notice how First is one of the only operators in Rx that leaves the monad; aggregates in the Ix world are a bit of a cata* (fill in the blanks yourself) from a monadic point of view. In Rx, aggregates return single values inside the monad, which allows for cancellation, and a Cata (in the sense used in the interview) has less value in this world.

@Gregory81: Perfect answer. Materialize  Where (or SelectMany of course...)  Dematerialize is a roundtrip scheme that can be used to implement exception handling for sequences. T is turned into Notifcation<T> and back.

Dear Niners,
You can actually find an implementation of FEnumerable and a bunch of operators on www.codeplex.com/linqsqo. To show it truly works, I've provided a port of Luke Hoban's LINQbased Ray Tracer as well, as part of the code. More information on conversion operators between classic IEnumerable and purist FEnumerable can be found in my blog post, mentioned in the description of the video above. I also describe how to get query expression syntax working on top of the FEnumerable implementation.
Cheers,
Bart 
Hey Bart,
Completely tangential and just out of curiosity ... you think in English? Not in French or Flemish? (You're from Belgium*, right?)
be well
h 
@corvuscorax: I'm Belgian, from the Flemish part, near Ghent, where we speak Flemish, a variant of Dutch. I mostly think in English when I'm in the US or outside my home country, yes (though I curse in Flemish occasionally).
Met vriendelijke groeten,
Bart 
Amazing work Bart! Looking forward to more lectures. Keep the functional speak coming.

Interesting. Just another idea to get early stopping, you could instead of passing in the aggregate so far, pass in a function that when called computes the rest. Then you can choose to call this function or not.
Instead of this:
let fold xs init f = match xs with  [] > init  x::r > f x (fold r init f)
You do this:
let fold xs init f = match xs with  [] > init  x::r > f x (fun () > (fold r init f))
If we assume that the && operator is short circuiting, then the following is a short circuiting All function:
let all p xs = fold (fun x r > p x && r()) true xs
i.e. as soon as the predicate returns false, the loop stops and returns false. 
I see this pseudosyntactic definition
() > (() > T?)
but raise you this haxesyntactic definition (If I recall syntax correctly)
alias Streamer = Void > Stream<T>
enum Stream<T>
{
none;
some(x: T, xs: Void > Stream<T>);
}This is a more purely functional style as a Streamer would normally always have the same result, every time you call it and you have to deconstruct the stream to get to successive elements, whereas with FEnumerable, somehow magically, the same enumerator function will return different results every time you call it.
It just happens there's no natural way to model this kind of structure in C#.

@exoteric: Don't stare too blind at the syntax here, it's merely a way to show the IE/IO duality in the simplest way possible by just reversing arrows. Other than that, we're simply working based on the Func<Func<T>> type.
We fully embrace sideeffects in this setting, as C# is good at. The arrows are not purist functional arrows, but rather fat arrows that can contain what you refer to as "magic". I believe I mentioned one way to turn it into a functional Haskellstyle setting is to decorate the arrows (or rather their return type) with the IO monad. All we're doing here is bringing laziness to the table in one suitable way, but we're completely honest about sideeffects induced by the cursorbased nature of enumerators.
A head/tail decomposition at the function level, as done by your Stream<T> sample, is another possibility, introducing recursion at a type level. Whether it's strictly better is subject to debate though. There's only one value of type Void, hence a purist Streamer can only return the same Stream<T> every time the function is called. In fact, () > T is a lazily computed constant; when a sideeffect is desired, one would simply write IO T in Haskell (even the () > is redundant due to the lazy nature of Haskell's evaluation mechanism).
This gets into a chickenoregg IO problem (somehow I like to associate IO with chickens and eggs) for realistic scenarios where you're querying data from some store such as a database, webservice, and whatnot. There you'll have to have an IO somewhere. After all, the contents of the store can change and you'll want a new enumerator that represents a cursor on the world at that point in time (time is another way to think about implicit sideeffects, where an impure function is implicitly parameterized on time). Related to the enumerator is some state that reflects serverside state too, moving over the result set as it was obtained at the point of the (initial?) request (providing a consistent  or not, depending on the guarantees provided by the store  view of the world at that time).
For inmemory lazily computed streams, such as prime numbers to use an academic sample, we can do better. There all the properties of a purist functional and lazily evaluated language such as Haskell shine.

Thanks so much for your explanations bart!
After some introduction to monads i once tried to understand linq exactly in that primitive building block manner but gave up in the middle. so this was really helping!
However i would like to conclude IEnumerable<T> is a monad. But isn't it more also  at least when you count in its extension methods? To understand if it is exactly a monad or can be viewed as a construct based upon a monad, it still would be great to have a list of all operators that are based upon the 3 primitives.
Or maybe even better: It would help to know all other paradigms (or their primitives) of those operators not covered by the monad aspect of IEnumerable<T> / IObservable<T>.
And please give me a hint: Is it possible to write an elegant short version of "zip" based on ana, bind and cata? I would like to try that... 
@Sebastian Gregor: IEnumerable<T> is a monad. It has a Return (new[] { value }) and a Bind (from x in xs from y in ys(x) select f(x, y) turns into xs.SelectMany(x => ys(x), (x, y) => f(x, y))) which is sufficient for our conclusion.
What's extra though is the sequence part of an IEnumerable<T>, or any other sequence type we're talking about (I know at least three more in the Rx/Ix puzzle, in fact I do know even more ). This is where the Ana abstraction can be seen as a sequencecentric Return, in that many values can be computed to be turned into an object that's inside the monad.
To see the layer map, check out the www.codeplex.com/LINQSQO project and use Visual Studio 2010 architecture explorer facilities to see the dependency graph of the methods in MinLinq. This will clearly show how all of the operators I've implemented there find their roots in Ana, Bind, and Cata.
For your requested tip: Zip is a nice one you bring up. I'm emphasizing nice a bit sarcastically. What Zip does is take a pair of sequences and turn it into a sequence of pairs (generalized using a projection function): Tuple<IEnumerable<A>, IEnumerable<B>> > IEnumerable<Tuple<A, B>>. To achieve this (lazily and whatnot), you got to obtain elements from both sequence at the "same pace". Now have a look whether you got such a primitive hanging in there somewhere . Notice though a Zip can be used to build the fancy Where and Select overloads that take a natural index number in their function parameter (Zip with the infinite Ana of natural numbers)...
Using Cata is rarely useful in operators itself, and in fact harmful as it breaks laziness. One exception is where you got to see the whole sequence anyway before you can produce any output. Examples include OrderBy and Reverse (exercise: write those using Ana and Cata).

@Bart yes, Streamer should be something like S > Stream<T> or it will be pretty boring, but that should do it, I think

@bdesmet: On difference between GroupBy and Buffer* methods. GroupBy adds new "monadic layer", so it's signature (not taking Keys into account) should be M<T> > T > TKey > M<M<T>>. It looks like specialized version of Ana applied to already existing monadic value as a seed.
Buffer is a different beast as it adds takes M<T> and produces M<List<T>>.
2 hours ago, bdesmet wrote
Using Cata is rarely useful in operators itself, and in fact harmful as it breaks laziness. One exception is where you got to see the whole sequence anyway before you can produce any output.
Didn't get that part. If Cata itself is lazy there should not be any problems. Or you mean that lazy invocations of Cata are usefull and direct invocations are harmfull?
Correct me if I'm wrong, but I thought that for every monad there exist AnaCata pair and every monadic operation may be written in terms of Ana and Cata. However in general there's no easy way to find Ana and Cata for a given monad.

@Gregory81: Let's be clear, I'm talking here about Cata as the folding (for LINQ to Objects people, think Aggregate) operator for sequences of data, which leaves the IEnumerable<T> world in favor of a value of type R, outside the monad. In this interpretation, it's like unsafePerformIO in Haskell:
 Ana (and return) gets you in. A > (A > T) > (A > A) > M<T>
 Bind (and >>=) keeps you in. M<T> > (T > M<R>) > M<R>
 Cata (and unsafePerformIO) get you out. M<T> > C > (C > T > C) > C
In our world of Rx and Ix, the monad's "aspect" is about lazy acquisition of data (which could go with lazy computation of the elements in the stream of course). The typing point of view is shown above too. If you make Cata lazy, it will be not the "streaming multiple values" kind of laziness that results; instead, you're deferring the catastrophy that's about to happen to the sequence being folded into a single value. (Exercise: There is a more mild form of aggregation in Rx though, but I wouldn't call the a Cata anymore. Go and find it.)
All the sideeffects that are due to iteration of the sequence (till the very end, provided there's even an end to it) will be surfaced at the point this kind of Cata is executed. It's like building LINQ to Objects on IEnumerable<T> by having operators turn their input sequence into a List<T> first (which is a Cata using Add calls), apply things like a filter or a map on the entire inmemory list, and then unfolding it again into a sequence (which is an Ana using the enumerator as the state, MoveNext as a terminating condition and Current as the selector). Well, if the user only wanted the projection of the first element, we definitely did too much work now (and we could get stuck if the sequence is infinite). Laziness was dropped inside the implementation, and the result of that leaks to the users (sideeffects, bottoming out on infinite sequences, etc.). In that sense, using a Cata is harmful when used in inappropriate places.
Have a look at Erik's paper titled "Functional Programming with Bananas, Lenses, Envelopes and Barbed Wire" for some interesting insights (warning: Greek letters). Tip: there's more than ana and cata...
A few practical things related to this discussion:
 LINQ to SQL, and IQueryable<T> in general, leaves the monad early for aggregation operators. For example, products.Sum(p => p.Price) returns a decimal on the spot. Laziness is gone. The IEnumerable<T> monad retains laziness in the face of composition of such sequences. Because of this, you can't do something like dividing Sum and Count and have the whole thing executed remotely. Instead, you'll have two roundtrips. This is like "forgetting" about the M in M<T>, hence the unsafePerformIO analogy.
 In Rx, we leave aggregates in the monad. Only First, Last, and Single are permitted to leave the monad (ignoring nonoperators such as Subscribe). This way you retain the composition over such sequences, which is where monadic computation shines. For example, you could apply the Timeout operator to computation of an aggregate and unsubscribe before it actually comes with the result.
For what GroupBy is concerned, you're absolutely right about it adding another layer. Keep thinking . Buffer is a different beast indeed  did you already look at the changes we made in the last Rx release? That may give some insightful clues... (Btw, a Channel 9 video on the latest Rx video is still coming up.)

@bdesmet:
First of all the easy part: not so catastrophic version of Aggregate is Scan.
On lazyness as the main monadic aspect of IEnumerable. The problem is that both IEnumerable and IObservable are quite complex creations. There are at least three (maybe more) aspects in IEnumerable (like returning multiple values, handling exceptions and deferring sideeffects) and at least four (or even five) in IObservable. Had we used haskell (not that I know it), instead of IEnumerable we'd written something like List<Error<IO<T>>> (or sometimes Set<Error<IO<T>>>). In Ix there are a lot of operators that leave one monadic aspect or the other. Like Catch leaves Error monad, or SumEnumerable leaves List monad (but not IO monad)  in some sense these are catamorphisms. Of course, if we're talking of IO as a main part of IEnumerable monad then indeed Cata would equal to unsafePerformIO.
What I was trying to say in previous post, given that the only way to deconstuct a monadic value is Cata, you'd had to use it somehow in the implementation of Bind.
By the way I think it can be good exercise to write approximate signatures of Ix and Rx operators in haskell, taking sideeffects and exceptions into account.
Erik's paper is my reading list (already quite some time). I yet had to get used to all hylomorphisms and co.

Like Catch leaves Error monad, or SumEnumerable leaves List monad (but not IO monad)  in some sense these are catamorphisms.

Have you considered "bottom" to express null equivalency in your syntax for your core implementation? As a singular binding bottom function application equal to null might do the trick.
http://en.wikipedia.org/wiki/Bottom_type
Im no LINQ or C# pro, do you think it would be a suitable solution for you Bart?

@Gregory81: Well, you're definitely welcome to do this stuff in Haskell. Let me say you won't be the first (hint ). Stay tuned.
You're right that IE<T> and IO<T> are a bit more complex than what we've discussed here. I conveniently swept errors and disposing under the carpet. The former can be deal with by treating it as a value that can be hidden behind the T (i.e. Materialize approach, T is Notification<T>). The latter (disposable aspect) pops up in the types for sure; disposables have a very imperative nature to it and can be represented as IO () in Haskell.

@bdesmet:About GroupBy and Buffer methods. That's the best what I can come up with:
The elements of IE<T> and IO<T> can be associated with additional values: index and time (for IO<T>). These values arise from specific properties of Bind operator of each monad.
For IE<T> Bind preserves internal ordering of elements within containers. It is possible to write nonpreserving version too (and Px has that version AFAIR). Even inside standart LINQ not all operators respect the order (for example, Union and other setrelated operators but that's because they should belong to different monad). However basic operations do preserve the order.
For IO<T> as events have global order (they can be ordered by time when an event happened) Bind preserves that global order.
Buffer operators do partitioning based on these additional values and preserve the ordering while GroupBy family does not.

Can we please get Bart back on soon! His stuff is always by turns enlightening and inspiring.

@dot_tom:I'm most interested in the "LINQ 2 MSIL" that Bart has alluded to previously, but expect that if anything comes out of that idea, we'll see it when it's ready.

thanks so much Bart, was an extremely en lightning video, I always like to go down the layers of abstraction sometime for fun, but sometimes the process can be painful, you did a great job simplifying everything and made it so exciting.
Finally have the courage to read up and follow Jon Skeet's Reimplementation of LINQ on his blog.

@bart: zip is special and can't be implemented by a combination of ana and bind, right? maybe already the term monad (mono) should make it obvious that combining two monads into one is not possible by its atomic operations. And ana / generate is just a more generative version of the atomic monad operation "return". since there is no atom that can handle two M<T> the only ways of building such a zip function would either force us to make use of cata and ana (breaking lazyness) or  better  make use of the special sequence centric additions to Enumerable<T> or Observable<T>. we are forced to manually iterate at same pace over the sequences and since i don't see any way to do that with one single foreach loop (it only takes one IEnumerable<T>), i assume i would need to implement IEnumerable by manually working with the Enumerators of the incoming enumerables. as far i can see yield return could help me with constructing the new enumerable so that i don't need to implement an own enumerator.
is that right? 
wouldn't it be nice to have a foreach loop in c# that can iterate over several enumerables at the same pace? i know that the term for"each" is misleading then because it doesn't express which permutations of the single sequence elements are included... but it would feel natural that this no cross operation like nested foreach loops, but more a single 1D loop cutting through all dimensions at the same pace. so the only semantic question would be: iterate until the smallest or the largest sequence is finished iterating. but i would guess that 95% would be happy with smallest sequence...

@dot_tom: There will be a lot of Bart on 9 in 2011. You can count on that.
C 
@Sebastian Gregor: Your analysis is right. Concerning language syntax, one question is whether you want one particular control flow primitive (foreach) to know about some kind of multivalue simultaneous treatment (zip), or you want to shift that into its own construct. I'm more in the latter camp where firstclass tuples (with pattern matching) allow one to combine all the needed building blocks. Here's what Zip looks like in hypothetical syntax where * is a suffix type constructor for sequence and (,) is tupling syntax both in the type world and the value world:
(T1, T2)* Zip<T1, T2>(this (T1*, T2*) source);
==
IEnumerable<Tuple<T1, T2>> Zip<T1, T2>(this Tuple<IEnumerable<T1>, IEnumerable<T2>> source);
Notice the beautiful symmetry of the lightweight signature shown above. Zip takes a tuple of stars to a star of a tuples. Assume the following use:
foreach (var (i, c) in (1..10, 'a'..'j').Zip())
f(i, c);== (ignore expansion of start..end range expressions)
foreach (var __t in Tuple.Create(new[] { 1, 2, ..., 10 }, new[] { 'a', 'b', ..., 'j' }).Zip())
{
var i = __t.Item1;
var c = __t.Item2;
f(i, c);
}Now the question is whether or not you want the GetEnumerator pattern to work on a "tuple of stars". If so, the Zip() call becomes redundant (if "foreach" were to resolve against extension methods, you could even do this right now). However, there are other ways of combining two sequences, such as Rx's CombineLatest. So this lightweight explicitness could be considered a good thing.