
Brewer’s CAP Theorem

Brewer's CAP theorem is an important concept in scalability discussions. The theorem states that a distributed system can provide at most two of the three guarantees: Consistency, Availability, and Partition tolerance. Below are three illustrations of how this plays out. For the purposes of these examples, we will imagine a cluster of three storage nodes used to store user profiles.

Scenario A: Sacrificing Partition Tolerance

On each of the three nodes, we will store only a subset of the user profiles. This is called sharding. Node one will have users A-H, node two I-S, and node three T-Z. As long as each node is up and running, we have achieved three times the throughput of a single node, as each node serves only a third of the traffic (assuming, of course, that user profile querying and updating is uniformly distributed through the alphabet). Consistency is achieved because immediately after data is written, it is accessible. Availability is achieved because each server is accessible in real time. However, we have lost partition tolerance: the loss of one server renders a whole section of users unreachable, and a hardware failure could mean data is permanently lost. All in all, not a good sacrifice under most circumstances.
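
A minimal sketch of this routing (the node names and the helper method are hypothetical, not from any particular system) might look like:

// Hypothetical sketch: route a profile to one of three nodes by the first
// letter of the user's name. If a node goes down, its slice of users is
// simply unreachable; there is no partition tolerance.
public static string NodeFor(string userName)
{
    char first = char.ToUpperInvariant(userName[0]);

    if (first >= 'A' && first <= 'H') return "node1"; // users A-H
    if (first >= 'I' && first <= 'S') return "node2"; // users I-S
    return "node3";                                   // users T-Z
}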

Scenario B: Sacrificing Availability

On each of the three nodes, we will store all the user profiles. Furthermore, to guarantee data consistency and prevent data loss, we will ensure that every write into the system happens on all three nodes before it is acknowledged as complete. So, if we were to update a profile for Bob McBob, any subsequent queries or writes on Bob McBob's profile would be blocked until the update has completed. Even worse, if one of the nodes is lost while all three writes are still required, the entire system is unavailable until that node is restored. This means that while our data is consistent and protected, we have sacrificed the availability of the data. This is a reasonable sacrifice in some systems. However, our goal is scalability, and this does not fit that requirement.
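
As a rough sketch (the IStorageNode interface and ProfileStore class below are invented purely for illustration), the write path that causes this unavailability might look like:

using System.Linq;
using System.Threading.Tasks;

public interface IStorageNode
{
    Task WriteAsync(string userId, string profileJson);
}

public class ProfileStore
{
    private readonly IStorageNode[] _nodes; // all three replicas

    public ProfileStore(IStorageNode[] nodes) { _nodes = nodes; }

    // The write completes only after every replica acknowledges it.
    // If any node is down, Task.WhenAll faults and the write cannot finish,
    // so the system is unavailable until the node is restored.
    public Task UpdateProfileAsync(string userId, string profileJson)
    {
        return Task.WhenAll(_nodes.Select(n => n.WriteAsync(userId, profileJson)));
    }
}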

Scenario C: Sacrificing Consistency

On each of the three nodes, we will store all the user profiles. However (and unlike scenario B), we will acknowledge a completed write immediately and not wait for the other two nodes. This means that if a read comes in on node two for data written on node one, it may or may not be up-to-date depending on the latency of replication. We are still highly available and still partition tolerant (within the latency it takes to replicate to a second node).
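
Sketched the same way (again reusing the hypothetical IStorageNode interface from the previous example), the only change is acknowledging after the local write and letting replication catch up in the background:

using System.Threading.Tasks;

public class EventuallyConsistentStore
{
    private readonly IStorageNode _local;      // node that received the write
    private readonly IStorageNode[] _replicas; // the other two nodes

    public EventuallyConsistentStore(IStorageNode local, IStorageNode[] replicas)
    {
        _local = local;
        _replicas = replicas;
    }

    // Acknowledge as soon as the local node has the data; replicate in the
    // background. A read against a replica may briefly return stale data.
    public async Task UpdateProfileAsync(string userId, string profileJson)
    {
        await _local.WriteAsync(userId, profileJson);

        foreach (var replica in _replicas)
        {
            _ = replica.WriteAsync(userId, profileJson); // fire and forget
        }
    }
}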

A majority of the time, scenario C is the chosen path, for a couple of reasons. First, most business use cases do not require up-to-the-second information. Take, for instance, a generated report on the sales in a given region. While the business user may request "live" data, monitoring the usage of such a report will likely look as follows: 1) the user prints the report and waits for time x (perhaps some coffee is obtained); 2) the user imports the report into Excel and slices and dices the data for time y; 3) the user acts upon the information. Overall, the decision is delayed from the "live" data by x + y, which is most likely on the order of hours, so it was never really based on "live" data anyway.

Second, in some cases a business benefit can even be gained. Let's take an ATM, for instance. Upon a withdrawal of funds, the ATM looks at the data available to it to decide whether or not to allow the transaction to proceed. It is not aware of any pending transfers to or from the account, and is definitely not aware of what occurred in the last x minutes. If you were to use a mobile phone to move some money out of your account and then ask the ATM for a balance, the account would look no different, and the ATM would allow an overdraft. Ultimately, the bank, by choosing "eventual" consistency, has earned itself an overdraft fee.

In my next post, I’d like to discuss how this eventually consistent model applies to Event Sourcing and how we can structure our applications to take advantage of this in the context of enforcing transactional consistency.


Event Sourcing as the Canonical Source of Truth

Event Sourcing (ES) is a concept that enables us to peruse the history of our system and know its state at any point in time. A few reasons this is important range from investigating a bug that only occurs under certain conditions to understanding why something was changed (why was customer X's address changed?). Another distinct advantage of event sourcing is that we can rebuild an entire data store (SQL tables, MongoDB collections, flat files, etc.) by replaying each event against a listener. This would look something like the code below:

var listeners = GetAllListeners();

// "event" is a reserved word in C#, so it is escaped with @
foreach (var @event in GetAllEvents())
{
    foreach (var listener in listeners)
    {
        listener.Handle(@event);
    }
}

It's quite simple and elegant, but more important is that the event log becomes the canonical data store: the single source of truth. That alone yields some interesting possibilities. One I'm quite fond of is recovering from a poorly conceived database schema: it is quite simple to redesign the schema and build it up as if it had been in place on day one, using something like the above code snippet. In the same vein, imagine two separate applications needing access to the same data but having very different business models. Instead of each application consuming a schema that makes sense for only one of them (or for neither, due to compromise), each has its own model serving its own needs. Since neither is the canonical store of the data, duplication of data isn't something to be frightened of.

One thing to note as the discussion moves into scalability is that the denormalized schema design enabled by ES already increases our ability to scale. When intersections of sets (SQL joins) are unnecessary, queries against relational data sources perform much faster.
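
As a hedged illustration (the event, row, and listener types below are invented for this post, not taken from a real system), a listener can project each event into a flat, query-ready row so that reads need no joins at all:

using System.Collections.Generic;

// Hypothetical events.
public class CustomerMoved { public int CustomerId; public string NewStreet; }
public class OrderPlaced  { public int CustomerId; }

// A flat, denormalized row: answering "where does this customer live and how
// many orders do they have?" requires no joins.
public class CustomerSummary
{
    public int CustomerId;
    public string Street;
    public int OrderCount;
}

public class CustomerSummaryListener
{
    private readonly Dictionary<int, CustomerSummary> _rows =
        new Dictionary<int, CustomerSummary>();

    // Replaying every historical event through these handlers rebuilds the
    // read model from scratch, exactly as in the snippet above.
    public void Handle(CustomerMoved e) { Row(e.CustomerId).Street = e.NewStreet; }

    public void Handle(OrderPlaced e) { Row(e.CustomerId).OrderCount++; }

    private CustomerSummary Row(int id)
    {
        if (!_rows.TryGetValue(id, out var row))
        {
            _rows[id] = row = new CustomerSummary { CustomerId = id };
        }
        return row;
    }
}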

At this point, I have posted three articles, an introduction on how I got to where I am now, a discussion of CQRS, and now a discussion of ES.  I’d like to come full circle and discuss how CQRS + ES can be used to achieve further scalability, but first I need to address Brewer’s CAP Theorem and how it forms the backbone of many design decisions related to scalability.

Managing Complexity with CQRS

CQRS stands for command-query responsibility segregation.  It literally means to separate your commands from your queries, your reads from your writes.

This can take on many forms, the simplest being that command messages differ from query messages. It might seem obvious when stated like this, but I guarantee you have violated this idea numerous times. I know I have. For instance, take the example below of a client using a service to work with a customer.

var customer = _service.GetCustomer(10);

customer.Address.Street = "1234 Blah St.";

_service.UpdateCustomer(customer);

The example above has two interesting characteristics. First, we are sending the entire customer object back just to update a single field in the address. This isn't necessarily bad, but it brings me to the second observation.

When looking through the history of a customer, it is impossible to discern why the street was changed (or whether it really changed at all). The business intent of the change is missing. Did the customer move? Was there a typo in the street? These are very different intentions that mean very different things. Take, for instance, a business which sends a letter confirming the change whenever a customer moves. The above snippet of code only tells us that the address changed, so we cannot distinguish a move from a typo correction when deciding whether to send the letter.

Perhaps this would look a little better.

var customer = _service.GetCustomer(10);

var address = customer.Address;

address.Street = "1234 Blah St.";

_service.CustomerHasMovedTo(customer.Id, address);

//or for a typo

_service.CorrectAddress(customer.Id, address);

While the above code may be a little more verbose, the intent is clear and the business can act accordingly.  Perhaps the business could disallow the move of a customer from Arizona (AZ) to Arkansas (AR) while still allowing a typo correcting an address that was supposed to be in Arkansas but was input incorrectly as Arizona.
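
Taken one step further (these message classes are my own illustration, not part of the service above), the intent can be captured in explicit command messages, which is exactly the "command" half of CQRS:

// Hypothetical command messages; Address is the same type used in the snippets above.
// Each class names a distinct business intent, so a handler (or a later reader of
// the history) can treat a move differently from a typo correction.
public class CustomerHasMoved
{
    public int CustomerId { get; set; }
    public Address NewAddress { get; set; }
}

public class AddressCorrected
{
    public int CustomerId { get; set; }
    public Address CorrectedAddress { get; set; }
}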

In addition to business intent being important in the present, it is also important in the past.  The ability to reflect over historical events can prove an invaluable asset to a business.  In my next post, I’d like to discuss the Event Sourcing pattern.

My Road to CQRS

I remember that my first job after graduating from college was building an internal .NET application that interfaced with a legacy system. This legacy system ran on an AS/400 and was written in RPG, using a DB2 database for data storage. When I looked at the database schema, I was horrified: it was the opposite of any form of normalization. Apparently, the real world didn't build applications the way my school had taught.

That wasn't true, however. The remainder of that job, every job thereafter, and most of the articles I read on the interwebs espoused the same principles I was taught: normalize your data, maintain referential integrity, run write operations inside a transaction, etc. This seemed the universally accepted way to build systems. So I continued on this learned trajectory, churning out quality software that met business requirements as specified. When the database had trouble handling my 17-join query, we cached the results at the application layer. When we could deal with day-old data, we ran ETL processes at night to pre-calculate the data behind the burdensome queries.

In retrospect, I wish these situations had triggered my memory of that first legacy system. The application cache and the extracted tables mirrored those early schemas that kept me up at night. Worse still, these objects were not just used to read the data; they were also used to update it. While I preached the principles of SOLID, I ignorantly violated the first letter in the acronym: single responsibility.

So, what did those original developers know that I didn't? The original system was built to run on a mainframe with distributed terminal clients. The mainframe does all the work while the clients simply view screens and then issue commands or queries to update the data or change the view, respectively. That very closely resembles the architecture of the web: a web server on a box, with a number of browsers connecting to it to view data or post forms. These days our web servers can handle a whole lot of load (especially when load balanced), far more than the original mainframes. So the mainframe guys supporting distributed terminals were like a website supporting gazillions of hits a day (an hour?). How did they manage this complexity?

CQRS stands for command-query responsibility segregation.  It literally means separating your commands from your queries; your reads from your writes.  They are responsible for different things.  Reads don’t have any business logic in them (aside from authorization perhaps).  So why did I keep insisting on a single model to rule them all?

This may sound complex and it can be.  In my next post, I want to delve into how CQRS can help us manage this complexity.

Ideas in the Organization: Part 2

So, after sharing my last post with my colleagues, two thoughts emerged. First, how do you tell if a person is the type that will have ideas, and second, what do you do with people who don't meet those criteria?

People with ideas are generally people who are able to apply a new concept to a given situation. Whether or not it is the best application of the concept, the person was able to apply it. But being able to apply a concept first requires two things.

First, a person should be eager to learn. Too many times, I find people who are very bright but simply show up to work in order to earn a paycheck. While that is a noble goal (providing for one's family), it doesn't benefit the organization. Second, a person needs to have the capability to learn. It's great if someone wants to learn, but if they don't have the aptitude, it's a pointless endeavor.

Therefore, we can tell the "idea people" from the "non-idea people" by their application of acquired knowledge. Having a college degree is a good starting point because it shows eagerness and capability. Certifications are the same. But the ability to apply knowledge is difficult to judge from a resume.

So, ultimately, the question is what to do with “non-idea people”.  Well, while there are some uses for them, I would postulate that the organization is better off without them.  They will simply take instruction and execute it blindly, without thinking about the consequences or whether or not a question should be raised.

Ideas in the Organization

Scott has a great article here where he talks about Toyota's chief engineer. I would like to focus on a small portion of one of the attributes of a chief engineer.

Innovative yet skeptical of unproven technology.

This is important because it communicates two things. The first is that he tests unproven technologies exhaustively before allowing them into the organization.  The second (and the one I want to focus on) is that he is innovative.  This implies that ideas are thought up and either implemented or discarded.  But where do these ideas come from?  I would say that these ideas come from two places, the bottom and the top.

The bottom has to be willing to bring new ideas or revisit old ideas with a shiny new exterior. And it cannot be just a few; everyone needs to have ideas. Ideas are the things that make people better, and when the people are better, organizations are better. People without ideas cost the organization because they never progress to a higher level of understanding and subsequently aren't able to take in something new; people with ideas, by contrast, tend to be more open to other ideas because they are able to see the bigger picture. Both of these problems stifle adoption and change, and ultimately cost the organization financially by accumulating the "technical debt" the company must climb out of.

Conversely, ideas have to come from the top. If no ideas come out of the top, people at the bottom are discouraged from announcing theirs. Even when ideas are "welcome," if nothing ever happens with them, people will stop having ideas. The top is also where bad ideas get thrown out and good ideas get adopted.

In summary, the worst thing that can happen is an organization without ideas. The second worst thing that can happen is an organization with lots of ideas and very little adoption.