Adds Interval Tree Clocks and Vector Clocks with an algebraic approach #333

Open
wants to merge 1 commit into base: develop

Collaborator

@johnynek johnynek commented Aug 8, 2014

I've been geeking out on this for the past few nights.

  1. implementing Interval Tree Clock was really fun.

  2. I noticed a new concept: a partial semigroup. It seems useful to build operations that are robust to duplication.

  3. I think that with this we can make storehaus work with partial semigroups: if we store a value of type (C, T), with a clock for C and a semigroup for T, we can get duplication tolerance. This is still a little vague to me, but I'd appreciate feedback.
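
A minimal sketch of the (C, T) idea, not the PR's actual code: the `PartialSemigroup` trait, the `versioned` instance, and `applyAll` are hypothetical names, C is simplified to a `Long` issued by a single writer, and T is an `Int` under addition. An update whose clock is not newer than the stored one is treated as a duplicate and ignored.

```scala
object VersionedSum {
  // Hypothetical shape; the trait in the PR may look different.
  trait PartialSemigroup[T] {
    def tryPlus(left: T, right: T): Option[T]
  }

  // Assumption: a single writer stamps updates with strictly increasing clocks,
  // so the sum is only defined when the right-hand clock is newer.
  val versioned: PartialSemigroup[(Long, Int)] = new PartialSemigroup[(Long, Int)] {
    def tryPlus(left: (Long, Int), right: (Long, Int)): Option[(Long, Int)] =
      if (right._1 > left._1) Some((right._1, left._2 + right._2)) else None
  }

  // Fold a stream of updates into the stored value, keeping the stored
  // value whenever the combination is undefined (a duplicate or stale update).
  def applyAll(init: (Long, Int), updates: List[(Long, Int)]): (Long, Int) =
    updates.foldLeft(init)((acc, u) => versioned.tryPlus(acc, u).getOrElse(acc))
}
```

With this instance, `applyAll((0L, 0), List((1L, 5), (1L, 5), (2L, 7)))` and `applyAll((0L, 0), List((1L, 5), (2L, 7)))` give the same result, which is the duplication tolerance described above.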

@johnynek
Collaborator Author

johnynek commented Aug 8, 2014

In particular, look at Clock[T]:

https://github.com/twitter/algebird/blob/oscar-interval-tree/algebird-core/src/main/scala/com/twitter/algebird/clock/Clock.scala

Is this a good structure? Is it missing some essence of what clocks are?

* where you update if possible).
* Note that this is not generally associative.
*/
def plusOrLeft(left: T, right: T): T =
Contributor

Does this have more general use outside of KV stores? Given that its functionality is defined in terms of tryPlus, do we want it in the base trait? I only ask because it seems quite specialized...

Collaborator Author

Good point. I don't know yet. Maybe we should remove it from the base trait.

Contributor

Given you have the method in the companion object I think tryPlus is good enough.
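
A sketch of what keeping only tryPlus in the trait could look like, with plusOrLeft as a helper in the companion object. This assumes plusOrLeft means "use the combined value when tryPlus is defined, otherwise keep the left value"; the trait shape and the `capped` instance are hypothetical. It also illustrates the note in the doc comment that plusOrLeft is not generally associative.

```scala
object PlusOrLeftSketch {
  trait PartialSemigroup[T] {
    def tryPlus(left: T, right: T): Option[T]
  }

  object PartialSemigroup {
    // Derived helper: fall back to the left value when the sum is undefined.
    def plusOrLeft[T](left: T, right: T)(implicit ps: PartialSemigroup[T]): T =
      ps.tryPlus(left, right).getOrElse(left)
  }

  // Hypothetical instance: Int addition that is only defined while the sum is <= 10.
  implicit val capped: PartialSemigroup[Int] = new PartialSemigroup[Int] {
    def tryPlus(left: Int, right: Int): Option[Int] = {
      val sum = left + right
      if (sum <= 10) Some(sum) else None
    }
  }

  import PartialSemigroup.plusOrLeft

  // 5 + 6 is undefined, so 5 is kept; then 5 + (-3) = 2.
  val leftFirst: Int = plusOrLeft(plusOrLeft(5, 6), -3)
  // 6 + (-3) = 3; then 5 + 3 = 8.
  val rightFirst: Int = plusOrLeft(5, plusOrLeft(6, -3))
}
```

The two groupings give 2 and 8, so the derived operation depends on evaluation order even for this small instance.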

@non
Contributor

non commented Aug 8, 2014

My first question is, is tryPlus associative in general, or only when both sides are defined?

If not, I wonder how you can safely rely on associativity in any case, since it is possible that (... + c) + d is defined, but ... + (c + d) is undefined. In that case, this starts looking more like PartialMagma or something, right?

If it is associative, I think it might be a semigroupoid: http://en.wikipedia.org/wiki/Groupoid. But maybe I'm mistaken?

EDIT: The criterion I would like to have is something like this: for (a + b) + c and a + (b + c), either they are both defined and equal, or neither is defined. Is this too strong?
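
The criterion above can be written as a check on a single triple. This is only a sketch: `holds`, `concat`, and `capped` are hypothetical names, and a real law test would run the check over many generated triples.

```scala
object StrongAssociativity {
  // True when (a + b) + c and a + (b + c) are both defined and equal,
  // or both undefined. Comparing the two Options covers both cases.
  def holds[T](a: T, b: T, c: T)(tryPlus: (T, T) => Option[T]): Boolean = {
    val leftFirst = tryPlus(a, b).flatMap(ab => tryPlus(ab, c))
    val rightFirst = tryPlus(b, c).flatMap(bc => tryPlus(a, bc))
    leftFirst == rightFirst
  }

  // Interval concatenation: defined only when the end of x meets the start of y.
  def concat(x: (Int, Int), y: (Int, Int)): Option[(Int, Int)] =
    if (x._2 == y._1) Some((x._1, y._2)) else None

  // Addition that is only defined while the sum stays <= 10.
  def capped(x: Int, y: Int): Option[Int] =
    if (x + y <= 10) Some(x + y) else None
}
```

Here `holds((0, 1), (1, 2), (2, 3))(concat)` is true, while `holds(5, 6, -3)(capped)` is false: (5 + 6) + (-3) is undefined but 5 + (6 + (-3)) is 8, which is the kind of case described in the first question.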

@non
Contributor

non commented Aug 8, 2014

(Also, I haven't finished reading it, but this MathOverflow article seems promising: http://mathoverflow.net/questions/123614/on-the-notion-of-partial-semigroup)

@avibryant
Contributor

So, this may well be the wrong angle, but when I think about the essence of clocks, it's something like:

There is some type V which is the value we actually care about (for a counter, this might be a Long).

A clock is going to, somehow, store multiple V values (eg in a vector). We might combine these Vs in two ways:

  • element-wise (for some notion of element), when joining to another clock, using some semilattice on V
  • merging together two or more elements when reorganizing the internals of the clock, or when reading the current total out of the clock. This uses some monoid on V, which may or may not be the same as the semilattice above.

The internals of the clock (how they maintain separate Vs, how they decide which ones to merge and when, or which ones match when doing a join, etc) are implementation-specific.

This is a pretty different perspective from what you've taken here, I think. In particular, a few things that seem strange to me, given where I'm coming from:

  • The T in Clock[T] is not the V I actually care about. Instead it feels more like an implementation artifact (eg VectorClock.Stamp).
  • There doesn't seem to be any easy way to get the "final" answer out (what's the sum of the Vector[Long]), or anything in the Clock abstraction that even contemplates this being the use case.
  • Most importantly, the V seems to be hardcoded in the clock implementations (eg for VectorClock it's always Long). Why can't I define a VectorClock on HLL? There's a well defined semilattice for it, and a well defined merge operation to get the "total" (which unlike for Long, happens to be the same as the semilattice).

@johnynek
Collaborator Author

johnynek commented Aug 8, 2014

Good feedback, sirs.

@non it seems you want strongly associative, in the terminology of the MathOverflow question, while I think I only promised properly associative. I think the cases I have here are strongly associative, so I could strengthen it. I don't think I want so many varieties; I'd rather have the strongest notion that gives us duplicate-message handling. Also, what do you think of using descriptive vs. canonical names (Semigroupoid vs. PartialSemigroup)?

@avibryant Yes, so you are interested in the cases where the vector clocks are not over integers. That does look interesting, but I wonder how you can increment the clock in general. HLL has a semilattice, but I don't see a natural way for it to be "successible" (or incrementable, or countable, or whatever you wish to call it). So, is this notion of being able to create a next-largest time needed? Our normal human notion of time does not have that: it is a real number that is continuously integrating forward.

I was interested in the classical application of vector clocks to deal with duplicated and out-of-order messages. But you are right: if there is a semilattice on T, then there is a semilattice on Vector[T]. In your picture, each node has a support (an id, for instance) and it can apply values. Is your notion of a clock stronger than a semilattice? I think I could improve my implementation so that IntervalTree and VectorClock stamps have types V which themselves must be clocks, and I think that is enough to implement them (lift and shrink in IntervalTree.Event will be a bit tricky...).

Avi, can you talk more about your vision for the application? My main vision was for messaging in summingbird to give us at-least-once semantics, and then to use the clock to remove duplicates in order to get at-most-once. I recall you talking about distributed clocks where the values are not longs, but anything with a semilattice. What more did you have in mind there? (Other than K-V stores of Bloom filters, HLLs, Sets, maximums, minimums or vectors of these.)

I guess I was trying to get a partial band (idempotent but not commutative) from a general semigroup using the clock.

@non
Contributor

non commented Aug 8, 2014

@johnynek If something like the strongly associative property works here, I think that would be a good property, since it is relatively easy to reason about.

I am inclined to prefer Semigroupoid only because I can imagine wanting to integrate with Groupoid and other related types, and I think the naming scheme there is a bit nicer. But I don't really have strong feelings about it, especially since it seems like there is not a single canonical definition here. I don't think PartialSemigroup will confuse anyone, especially if you are explicit about the kind of associativity it has.

@avibryant
Contributor

Yes, I was wondering about exactly that recursive representation of having a V which is a Clock; that seems like the shortest path from where you are now to having clocks on non-integers, but I'm not convinced it's the globally optimal design.

In my vision, incrementing is not fundamental; rather, it falls out of the fact that you have a merge monoid for V. That lets you add(inc: V), and you can add(1L) if you want, but there's no equivalent increment op for HLL, and that's fine.

The motivation here comes from using vector clocks as CRDTs in Dynamo-like systems. The canonical example is having a distributed eventually-consistent (AP) counter. So (forgive me if this is obvious), you have a vector clock for the counter, each node has a corresponding element in the vector, asking a node to increment the counter will increment its element, and then whenever you have the opportunity to sync up the nodes, you join vector clocks as usual with element-wise max, and get the total counter value by summing across all the elements. But this generalizes nicely to HLLs etc.
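
A rough sketch of that counter, generalized over V. The `Clock` case class and its method names are hypothetical, node ids are plain strings, the join semilattice (`lub`) and the merge monoid (`plus`, `zero`) are passed explicitly, and a `Set[Int]` under union stands in for an HLL.

```scala
object GeneralizedCounter {
  final case class Clock[V](elems: Map[String, V]) {
    // A node merges an increment into its own element using the merge monoid.
    def add(node: String, inc: V)(plus: (V, V) => V): Clock[V] =
      Clock(elems.updated(node, elems.get(node).map(plus(_, inc)).getOrElse(inc)))

    // Syncing two replicas: element-wise join using the semilattice on V.
    def join(that: Clock[V])(lub: (V, V) => V): Clock[V] =
      Clock(that.elems.foldLeft(elems) { case (acc, (k, v)) =>
        acc.updated(k, acc.get(k).map(lub(_, v)).getOrElse(v))
      })

    // Reading the total: merge all elements with the monoid.
    def total(zero: V)(plus: (V, V) => V): V = elems.values.foldLeft(zero)(plus)
  }

  // Long counter: join is max, total is sum.
  val a: Clock[Long] = Clock[Long](Map.empty).add("a", 2L)(_ + _)
  val b: Clock[Long] = Clock[Long](Map.empty).add("b", 3L)(_ + _)
  val synced: Clock[Long] = a.join(b)(_ max _)

  // Set stand-in for an HLL: join and total are both union.
  val sa: Clock[Set[Int]] = Clock[Set[Int]](Map.empty).add("a", Set(1, 2))(_ union _)
  val sb: Clock[Set[Int]] = Clock[Set[Int]](Map.empty).add("b", Set(2, 3))(_ union _)
  val setTotal: Set[Int] = sa.join(sb)(_ union _).total(Set.empty[Int])(_ union _)
}
```

Joining `synced` with `b` a second time leaves it unchanged, since the element-wise max is idempotent; that is what makes repeated syncs harmless.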

@jnievelt
Contributor

I don't think we want to design our clock around being used as a distributed counter. Couldn't one just use a latticed Monoid[Vector[T]] with a summing Monoid[T] instead? If we did want to build an intermediate structure, should it also meet the CMS use case of a latticed Monoid[Vector[T]] with a latticed Monoid[T]?

Anyway, the issue with nested clocks is in forking/joining. For example, if you have a VectorClock of VectorClocks, how do you do an inner fork? Even if you don't fork, does it make more sense than simply having a flat identifier space?
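
For comparison, forking and joining in a flat identifier space can be sketched along the lines of the id trees in the Interval Tree Clocks paper. This is an illustration only, not the PR's IntervalTree code: the type and function names are hypothetical, only the id component is shown (no event tree), and throwing on overlapping ids is a choice made for the sketch.

```scala
object ItcId {
  sealed trait Id
  case object Zero extends Id                            // owns nothing
  case object One extends Id                             // owns the whole interval
  final case class Node(left: Id, right: Id) extends Id  // owns parts of each half

  // Collapse redundant nodes so that equal ids have equal representations.
  def norm(id: Id): Id = id match {
    case Node(Zero, Zero) => Zero
    case Node(One, One)   => One
    case other            => other
  }

  // Split an id into two ids that own disjoint parts of the original.
  def fork(id: Id): (Id, Id) = id match {
    case Zero => (Zero, Zero)
    case One  => (Node(One, Zero), Node(Zero, One))
    case Node(Zero, r) =>
      val (r1, r2) = fork(r)
      (Node(Zero, r1), Node(Zero, r2))
    case Node(l, Zero) =>
      val (l1, l2) = fork(l)
      (Node(l1, Zero), Node(l2, Zero))
    case Node(l, r) => (Node(l, Zero), Node(Zero, r))
  }

  // Merge two ids that own disjoint parts of the interval.
  def join(a: Id, b: Id): Id = (a, b) match {
    case (Zero, x) => x
    case (x, Zero) => x
    case (Node(l1, r1), Node(l2, r2)) => norm(Node(join(l1, l2), join(r1, r2)))
    case _ => throw new IllegalArgumentException("ids overlap")
  }

  val (left, right) = fork(One)
  val (leftLeft, leftRight) = fork(left)
}
```

Joining the two halves of any fork gives back the original id, for example `join(leftLeft, leftRight) == left`, and forks can be repeated without coordinating a nested structure.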

In terms of my view on what these things really are, Stamps are a representation of a set of events which are guaranteed to have happened. The increment is a stand-in for adding a new event to the set, with the assumption that it's being added in order (within its identifier context) and without duplication. The unions and comparisons are completely in line with those of sets as well.

Clocks, then, are ways of storing those sets given that we can provide a reduced interface, though we don't want any inherent loss of accuracy from the clock itself. Thus we can use stamp structures that can easily "add the next event for a context", combine with another stamp, and check for ordering with another stamp. But we can't enumerate events or even test for their membership (unless we also have a full listing of each id's event sequences) within the structure.
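
To make the set view concrete, here is a small sketch using a vector-clock-style stamp. The names are hypothetical and `events` exists only to spell out the interpretation: a count of n for an id stands for that id's events 1 to n. A real stamp would never enumerate them.

```scala
object StampAsEventSet {
  type Stamp = Map[String, Long]

  // "Add the next event for a context".
  def increment(s: Stamp, id: String): Stamp =
    s.updated(id, s.getOrElse(id, 0L) + 1L)

  // Combine with another stamp: element-wise max.
  def join(a: Stamp, b: Stamp): Stamp =
    b.foldLeft(a) { case (acc, (id, n)) =>
      acc.updated(id, math.max(acc.getOrElse(id, 0L), n))
    }

  // Ordering check: a happened before or equals b.
  def lteq(a: Stamp, b: Stamp): Boolean =
    a.forall { case (id, n) => n <= b.getOrElse(id, 0L) }

  // The set of events a stamp stands for (illustration only).
  def events(s: Stamp): Set[(String, Long)] =
    s.iterator.flatMap { case (id, n) => (1L to n).iterator.map(i => (id, i)) }.toSet

  val a: Stamp = Map("x" -> 2L, "y" -> 1L)
  val b: Stamp = Map("x" -> 1L, "z" -> 3L)
}
```

Under this reading, `events(join(a, b))` equals the union of `events(a)` and `events(b)`, and `lteq(a, b)` agrees with `events(a).subsetOf(events(b))`.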

Coming to duplication tolerance, it's difficult for me to imagine how clocks will be used here. Can you describe the logic that might be used and the scenario that it would solve?

* See the very readable paper:
* http://gsd.di.uminho.pt/members/cbm/ps/itc2008.pdf
*/
object IntervalTree {
Contributor

no suuuuper strong opinions here but imho there's enough code here that this could be a package and you could split things out a bit. Might be easier to read if they're in separate files and stuff. This is a giant object. But they are all heavily related so I get the desire to keep them together.

@ianoc
Collaborator

ianoc commented Aug 4, 2015

Sorry my bad, git foo on cmd line broke stuff and closed all of these

@ianoc ianoc reopened this Aug 4, 2015
@CLAassistant

CLAassistant commented Nov 16, 2019

CLA assistant check
All committers have signed the CLA.

7 participants