Adds Interval Tree Clocks and Vector Clocks with an algebraic approach #333
In particular, look at Clock[T]: Is this a good structure? Is it missing some essence of what clocks are?
 * where you update if possible).
 * Note that this is not generally associative.
 */
def plusOrLeft(left: T, right: T): T =
Does this have more general use outside of KV stores? Given it is functionality in terms of tryPlus, do we want it to be in the base trait? I only ask because it seems quite specialized...
Good point. I don't know yet. Maybe we should remove it from the base trait.
Given you have the method in the companion object I think tryPlus is good enough.
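For readers following this thread, here is a minimal sketch of how `plusOrLeft` could be derived from `tryPlus` in a companion object. The trait shape, the `Span` type, and the instance are assumptions made for illustration; they are not the code in this PR.

```scala
object PartialSemigroupSketch {
  // Hypothetical trait: tryPlus is the only primitive operation.
  trait PartialSemigroup[T] {
    // Some(sum) when the two values can be combined, None otherwise.
    def tryPlus(left: T, right: T): Option[T]
  }

  object PartialSemigroup {
    // Derived helper: keep the left value when the combination is undefined.
    // As the diff comment says, this is not generally associative.
    def plusOrLeft[T](left: T, right: T)(implicit ps: PartialSemigroup[T]): T =
      ps.tryPlus(left, right).getOrElse(left)
  }

  // Illustrative type (an assumption): half-open integer spans that
  // combine only when the left span ends where the right one starts.
  final case class Span(from: Int, until: Int)

  implicit val spanPartialSemigroup: PartialSemigroup[Span] =
    new PartialSemigroup[Span] {
      def tryPlus(l: Span, r: Span): Option[Span] =
        if (l.until == r.from) Some(Span(l.from, r.until)) else None
    }
}
```

With this instance, `plusOrLeft(plusOrLeft(Span(0, 5), Span(7, 9)), Span(5, 7))` gives `Span(0, 7)`, while `plusOrLeft(Span(0, 5), plusOrLeft(Span(7, 9), Span(5, 7)))` gives `Span(0, 5)`, which shows the non-associativity noted in the diff comment.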
My first question is, is If not, I wonder how you can safely rely on associativity in any case, since it is possible that If it is associative, I think it might be a semigroupoid: http://en.wikipedia.org/wiki/Groupoid. But maybe I'm mistaken? EDIT: The criterion I would like to have is something like this: for
(Also, I haven't finished reading it, but this MathOverflow article seems promising: http://mathoverflow.net/questions/123614/on-the-notion-of-partial-semigroup)
So, this may well be the wrong angle, but when I think about the essence of clocks, it's something like: There is some type V which is the value we actually care about (for a counter, this might be a Long). A clock is going to, somehow, store multiple V values (eg in a vector). We might combine these Vs in two ways:
The internals of the clock (how they maintain separate Vs, how they decide which ones to merge and when, or which ones match when doing a join, etc) are implementation-specific. This is a pretty different perspective from what you've taken here, I think. In particular, a few things that seem strange to me, given where I'm coming from:
Good feedback, sirs.

@non It seems you want strongly associative, in the terminology of the MathOverflow question, while I think I only promised properly associative. I think the cases I have here are strongly associative, so I could strengthen it. I don't think I want so many varieties; I'd rather have the strongest notion that gives the duplicate-message handling. Also, what do you think of using descriptive vs. canonical names (Semigroupoid vs. PartialSemigroup)?

@avibryant Yes, so you are interested in the cases where the vector clocks are not over integers. That does look interesting, but I wonder how you can increment the clock generally. HLL has a semilattice, but it does not have "successible" (or incrementable, or countable, or whatever you wish to call it) in a natural way that I can see. So, is this notion of being able to create a next-largest time needed? Our normal human notion of time does not have that: it is a real number that is continuously integrating forward.

I was interested in the classical application of vector clocks to deal with duplicated and out-of-order messages. But you are right: if there is a semilattice on T, then there is a semilattice on

Avi, can you talk more about your vision for the application? My main vision was for messaging in summingbird to give us at-least-once semantics, and then to use the clock to remove duplicates in order to get at-most-once. I recall you talking about distributed clocks where the values are not longs, but anything with a semilattice. What more did you have in mind there (other than K-V stores of Bloom filters, HLLs, Sets, maximums, minimums, or vectors of these)?

I guess I was trying to get a partial band (idempotent but not commutative) from a general semigroup using the clock.
@johnynek If something like the strongly associative property works here, I think that would be a good property, since it is relatively easy to reason about. I am inclined to prefer Semigroupoid only because I can imagine wanting to integrate with Groupoid and other related types, and I think the naming scheme there is a bit nicer. But I don't really have strong feelings about it, especially since it seems like there is not a single canonical definition here. I don't think PartialSemigroup will confuse anyone, especially if you are explicit about the kind of associativity it has.
Yes, I was wondering about exactly that recursive representation of having a

In my vision, incrementing is not fundamental; rather, it falls out of the fact that you have a merge monoid for

The motivation here comes from using vector clocks as CRDTs in Dynamo-like systems. The canonical example is having a distributed eventually-consistent (AP) counter. So (forgive me if this is obvious), you have a vector clock for the counter, each node has a corresponding element in the vector, asking a node to increment the counter will increment its element, and then whenever you have the opportunity to sync up the nodes, you join vector clocks as usual with element-wise max, and get the total counter value by summing across all the elements. But this generalizes nicely to HLLs etc.
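The counter described above can be sketched in a few lines. This is only an illustration under assumed names (`VClock`, `increment`, `join`, `total`), with node ids as strings and per-node counts as `Long`; it is not code from the PR.

```scala
object CounterSketch {
  // Assumed representation: node id -> that node's local count.
  type VClock = Map[String, Long]

  // A node only ever bumps its own entry.
  def increment(clock: VClock, node: String): VClock =
    clock.updated(node, clock.getOrElse(node, 0L) + 1L)

  // Join is element-wise max, so merging the same state twice changes nothing.
  def join(a: VClock, b: VClock): VClock =
    (a.keySet ++ b.keySet).iterator.map { k =>
      k -> math.max(a.getOrElse(k, 0L), b.getOrElse(k, 0L))
    }.toMap

  // The counter's value is the sum across all nodes' entries.
  def total(clock: VClock): Long = clock.values.sum
}
```

Two increments on node "a" and one on node "b", joined in either order and any number of times, give a total of 3. Swapping `Long` with max for another semilattice (an HLL, for example) is the generalization mentioned above.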
I don't think we want to design our clock around being used as a distributed counter. Couldn't one just use a latticed Monoid[Vector[T]] with a summing Monoid[T] instead? If we did want to build an intermediate structure, should it also meet the CMS use case of a latticed Monoid[Vector[T]] with a latticed Monoid[T]?

Anyway, the issue with nested clocks is in forking/joining. For example, if you have a VectorClock of VectorClocks, how do you do an inner fork? Even if you don't fork, does it make more sense than simply having a flat identifier space?

In terms of my view on what these things really are, Stamps are a representation of a set of events which are guaranteed to have happened. The increment is a stand-in for adding a new event to the set, with the assumption that it's being added in order (within its identifier context) and without duplication. The unions and comparisons are completely in line with those of sets as well. Clocks, then, are ways of storing those sets given that we can provide a reduced interface, though we don't want any inherent loss of accuracy from the clock itself. Thus we can use stamp structures that can easily "add the next event for a context", combine with another stamp, and check for ordering with another stamp. But we can't enumerate events or even test for their membership (unless we also have a full listing of each id's event sequences) within the structure.

Coming to duplication tolerance, it's difficult for me to imagine how clocks will be used here. Can you describe the logic that might be used and the scenario that it would solve?
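To make the "stamps are sets of events" reading concrete, here is a small sketch using a flat identifier space. The `Stamp` alias and the function names are assumptions for illustration; `events` expands a stamp into the explicit set it stands for, which a real clock would never materialize.

```scala
object StampSketch {
  // Assumed representation: id -> number of events seen for that id,
  // standing for the events (id, 1), (id, 2), ..., (id, n).
  type Stamp = Map[String, Long]

  // The explicit event set a stamp represents (for illustration only).
  def events(s: Stamp): Set[(String, Long)] =
    s.toSeq.flatMap { case (id, n) => (1L to n).map(i => (id, i)) }.toSet

  // a <= b exactly when a's event set is a subset of b's.
  def lteq(a: Stamp, b: Stamp): Boolean =
    a.forall { case (id, n) => n <= b.getOrElse(id, 0L) }

  // Partial comparison: None means the stamps are concurrent.
  def tryCompare(a: Stamp, b: Stamp): Option[Int] =
    (lteq(a, b), lteq(b, a)) match {
      case (true, true)   => Some(0)
      case (true, false)  => Some(-1)
      case (false, true)  => Some(1)
      case (false, false) => None
    }
}
```

In this model `lteq(a, b)` agrees with `events(a).subsetOf(events(b))`, and two stamps that each contain an event the other lacks compare as `None`.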
 * See the very readable paper:
 * http://gsd.di.uminho.pt/members/cbm/ps/itc2008.pdf
 */
object IntervalTree {
no suuuuper strong opinions here but imho there's enough code here that this could be a package and you could split things out a bit. Might be easier to read if they're in separate files and stuff. This is a giant object. But they are all heavily related so I get the desire to keep them together.
Sorry, my bad: git foo on the command line broke stuff and closed all of these.
I've been geeking out on this for the past few nights.
Implementing Interval Tree Clocks was really fun.
I noticed a new concept: a partial semigroup. It seems useful to build operations that are robust to duplication.
I think with this, we can make storehaus work with partial semigroups, which means that if we store a value of type (C, T), where we have a clock for C and a semigroup for T, we can get duplication tolerance. This is still a little vague to me, but I'd appreciate feedback.
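One way to read the (C, T) idea is sketched below. Everything here is an assumption made for illustration: the `Semigroup` stand-in, the `Stamped` class, and the use of a plain `Long` sequence number as the clock. The rule is that two stamped values combine only when the right-hand clock is strictly newer, so replaying a message leaves the stored value unchanged.

```scala
object DedupSketch {
  // Minimal stand-in so the sketch is self-contained (not Algebird's trait).
  trait Semigroup[T] { def plus(l: T, r: T): T }

  // isBefore(a, b) is the clock's strict "a happened before b" test.
  final class Stamped[C, T](isBefore: (C, C) => Boolean, sg: Semigroup[T]) {
    // Defined only when the right value carries a strictly newer clock.
    def tryPlus(l: (C, T), r: (C, T)): Option[(C, T)] =
      if (isBefore(l._1, r._1)) Some((r._1, sg.plus(l._2, r._2))) else None

    // Duplicates and stale messages are dropped by keeping the left value.
    def plusOrLeft(l: (C, T), r: (C, T)): (C, T) =
      tryPlus(l, r).getOrElse(l)
  }

  val longSum: Semigroup[Long] =
    new Semigroup[Long] { def plus(l: Long, r: Long): Long = l + r }

  // Simplest possible clock for the example: a single sequence number.
  val counter = new Stamped[Long, Long](_ < _, longSum)
}
```

Folding the messages `(1, 10), (2, 5), (2, 5), (3, 1)` onto `(0, 0)` with `counter.plusOrLeft` yields `(3, 16)`: the replayed `(2, 5)` is ignored. Note that this rule also drops a message that arrives late and out of order, so it is a sketch of the at-most-once behaviour rather than a full solution.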