Archive for the ‘architecture’ Category

Does the CAP Theorem have a Second Order?

May 25, 2014

A couple of years ago, we decided at Central 1 that our services should fall on the Availability-Partition Tolerance (AP) side of the CAP Theorem. The assertion at the time was that, at a business level, it is reasonable to accept eventual consistency if we can be always available and partition tolerant. With our old systems, we made that tradeoff all the time, and sorted out the reconciliation issues the next day.

Recently, we were working on implementing Interac Online Payments, which has a fairly complex message flow that includes the POS switching network. The details aren’t important here, but the net result was that we needed to handle a scenario where the first part of a transaction might come to one data center, and the second part would come to the other. Conceptually, it was a bit like the propose and commit in 2-phase commit coming to different data centers.

The system is based on an Active-Active database server pair with two-way replication between them. Unfortunately, we were seeing the commit message come to the remote data center before the propose message was replicated there. Our solution is to try to route the commit message to the same data center as the original propose message. The result is that if the service is unavailable at the location that received the propose message (even if the propose was replicated), we respond negatively to the commit: we answer inconsistently. Having said that, we can always receive a message, and our system continues to function if the network gets partitioned.
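A minimal sketch of that routing fix, assuming a lot: the class, method names, and the peer client are all invented for illustration, not Central 1’s actual implementation.

```python
# Hypothetical sketch: pin the "commit" leg of a transaction to the data
# center that handled the "propose" leg. If the originating data center
# is unreachable, decline the commit even though the propose may already
# have replicated: availability over consistency.

class StickyRouter:
    def __init__(self, local_dc, peer):
        self.local_dc = local_dc   # e.g. "DC_A"
        self.peer = peer           # client for the other data center
        self.proposed = set()      # txn ids whose propose we handled here

    def handle_propose(self, txn_id):
        self.proposed.add(txn_id)
        # ... persist locally; replication to the peer is asynchronous ...
        return "proposed"

    def handle_commit(self, txn_id):
        if txn_id in self.proposed:
            return "committed"     # the propose originated here
        if self.peer.is_available():
            return self.peer.forward_commit(txn_id)
        # The originating data center is down or partitioned away: we
        # answer negatively rather than risk committing without the propose.
        return "declined"
```

The interesting case is the last branch: the router stays available and partition tolerant, but the answer it gives may be inconsistent with what the other data center knows.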

This leads me to wonder if the CAP Theorem has a second order. That is, if I have a data service that is AP, is it impossible for me to create a service on top of it that is Available-Consistent or Consistent-Partition Tolerant?

The Myth of Governance

December 5, 2011

After the previous post regarding requirements, it is tempting to think that you could avoid prescriptive or unnecessary requirements with a proper governance structure in place.  In fact, that is the fashionable reaction when any project artifact is found to have deviated from the path.  If only we had a proper review and sign-off procedure, everything would stay the course.

Now, anyone knows that review and sign-off takes time.  If you want my time to review a 50-page document, you’ll be waiting three days to a week. If it’s 100 pages, I’ll get it back to you in at least a week.

The requirements document in the previous post was about 200 pages long.  Think about that.  200 pages is the length of a novel. But if you picked up a novel that had the same character development and story arc as a typical work document, you’d put it down after reading the first chapter. The quality of the attention you’re able to give the work drops off significantly after about 40 pages.

Even the author can’t pay attention past page 40. That’s why it’s common to find documents that contradict themselves.

This, along with a desire to parallelize writing and reviewing, is why we often see these big documents released in chapters.  But splitting them up creates opportunities to miss whole aspects of the subject. The document really needs to be considered as a whole.

So, governance in the form of review and sign-off is slow and error-prone.  You might be able to compensate for the errors and inattention by slowing down further.  Give me more time to review, and maybe I’ll be more careful and won’t miss things.

The real problem, however, is that review-based governance doesn’t scale. If the overall direction sits with one person, and they must review every decision, then the organization is limited to that reviewer’s capacity.

Well, obviously you scale by adding more reviewers.  But how do you ensure that the reviewers all agree on the same direction and vision?  Even if they all think they agree on the direction and vision, they will have to interpret it and apply it in specific circumstances.  Who watches the watchers?

In the end, we introduce documentation and review because we don’t know how else to ensure that our staff are producing what we expect.  However, if we think we’re going to actually ensure they produce what we expect through review, we’re dreaming.

What we really want is self-government, and I think a few organizations have done this well.  With self-government, the leadership clearly communicate a broader vision or path toward the future, and then motivate their staff to work toward the shared goal.  If you can sufficiently communicate the idea, and convince everyone to support it, then you should not need governance.

Technical Debt and Interest

August 9, 2011

Since installing Sonar over a year ago, we’ve been working to reduce our technical debt.  In some of our applications, which have been around for nigh on a decade, we have accumulated huge amounts of technical debt.  I don’t hold much faith in the numbers produced by Sonar in absolute terms, but it is encouraging to see the numbers go down little by little.

Our product management team seems to have grabbed onto the notion of technical debt.  Being from a financial institution, they even get the notion that bad code isn’t so much a debt as an un-hedged call option, but they also recognize that it’s much easier to explain (and say) “technical debt” than “technical unhedged call option.”  They get this idea, and like it, but the natural question they should be asking is, “How much interest should we expect to pay if we take on some amount of technical debt?”

In the real world, debt upon which we pay no interest is like free money: you could take that loan and invest it in a sure-win investment, and repay your debt later, pocketing whatever growth you were able to get from the investment.  It’s the same with code: technical debt on which you pay no interest was probably incurred to get the code out faster, leaving budget and time for other money-making features.

How do we calculate interest, then?  The interest is a measure of how much longer it takes to maintain the code than it would if the code were idealized.  If the debt itself, the principal as it were, corresponds to the amount of time it would take to rectify the bad code, the interest is only slightly related to the principal.  And thus you see, product management’s question is difficult to answer.

Probably the easiest technical debt and interest to understand is that from duplicate code.  The principal for duplicate code is the time it would take to extract a method and replace both duplicates with a call to the method.  The interest is the time it takes to determine that duplicate code exists and replicate and test the fix in both places.  The tough part is determining that the duplicate code exists, and this may not happen until testing or even production.  Of course, if we never have to change the duplicate code, then there is no effort for fixing it, and so, in that case, the interest is zero.

So, I propose that the technical interest is something like

Technical Interest = Cost of Maintaining Bad Code * Probability that Maintenance is Required

You quickly realize then that it’s not enough to talk about the total debt in the system; indeed, it’s useless to talk about the total debt as some of it is a zero-interest, no down-payment type of loan.  What is much more interesting is to talk about the total interest payments being made on the system, and for that, you really need to decompose the source code into modules and analyze which modules incur the most change.

It’s also useful to look at the different types of debt and decide which of them are incurring the most interest.  Duplicate code in a quickly changing codebase, for example, is probably incurring more interest than even an empty catch block in the same codebase.  However, they both take about the same amount of time to fix.  Which should you fix first?  Because the interest on technical debt compounds, you should always pay off the high-interest loan first.
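The comparison above can be made concrete with a toy calculation of the interest formula. The hours and probabilities here are invented for illustration; the point is the ranking, not the figures.

```python
# Toy illustration of: interest = maintenance cost * probability of maintenance.

def technical_interest(maintenance_cost_hours, p_maintenance):
    """Expected extra effort per period caused by a piece of bad code."""
    return maintenance_cost_hours * p_maintenance

# Duplicate code in a quickly changing module: expensive to touch,
# and touched often.
duplicate_in_hot_code = technical_interest(8.0, 0.9)

# Empty catch block in the same codebase: cheap to maintain and
# rarely in the way, so it accrues little interest.
empty_catch = technical_interest(2.0, 0.1)

# Both take about the same time to fix, but the interest differs a lot:
# pay down the high-interest debt first.
assert duplicate_in_hot_code > empty_catch
```

Note that a zero probability of maintenance gives zero interest, which is exactly the “zero-interest loan” case: debt in code you never touch costs you nothing until you touch it.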

The Problem with Templates

March 17, 2010

As technical teams mature, one of the remedies for the many ills that come from growth is the addition of process.  These processes call for documentation, and someone generally kicks off a template to make these documents easier to produce.  As we learn more, we add sections to the templates to ensure we don’t repeat mistakes, or at least remember to consider the factors in subsequent initiatives.

So far so good.  The organization is learning and improving with every project.

Unfortunately, document templates often wind up looking a lot like forms.  That makes people want to fill in all the sections (often improperly), and that leads to bloated documents that don’t even fulfill their purpose.

Take, for example, a fairly typical waterfall model of software development.  There is a requirements document, followed by a design document.  Often the design document template will include a section called something like, “architecturally significant use cases.”  It is tempting to simply grab all the use cases from the requirements document and paste them into this section, especially when there are sections on logical, physical, deployment, data and code architecture yet to write.

Apart from the obvious problem with cut and paste, the inclusion of all the use cases fails at the most basic level to communicate the significant use cases.   The document fails.

I don’t have a good answer to this, other than to provide only a high-level template for much of the document along with a description of how the document should work.

For example, that design document starts with architecturally significant use cases that drive the choice of logical components.  The logical components find places to live in executables and libraries, which are documented in the physical architecture section and those executables find homes in the deployment architecture.  In order to write a sensible design document, an author has to understand this flow; and seeing the headings in the template isn’t going to help.

In most cases, the document template is not the place to learn.  It should stay high-level, and force its authors to think through the process of writing the document.  We still need a place to ensure projects can impart their wisdom to subsequent projects, but the place to do this is in a checklist, not in a document template.

So, if you’re thinking of creating a template, think about creating a short (!) explanation of how a document of this type should be organized so that it communicates.  Add a checklist to the explanation, and do it all in a wiki so that those who come after you can help the organization learn.

Architecture or Design

January 23, 2010

We’re looking to hire an architect these days, an enterprise architect to be precise, and one of the questions I’ve been asking people is what is the difference between architecture and design.  Nobody has come up with my answer yet, and so, I’ll post it here and see if that improves the odds.

Much like the standard answer regarding the difference between a program and a script, the standard answer here is deeply unsatisfying.  Most people, when asked about the difference between architecture and design, respond that it’s about the size of the components being designed.  Generally, folks are pretty sure that design at the class or method level is definitely design, while figuring out how processes work together, especially over distributed systems, is definitely architecture.  Where is the line, though?  If I package my classes together into libraries, and think about them at that level, does it now become architecture?  What if my system happens to consist of two short batch programs that write and read a file, perhaps across a network, is that enough to be architecture?

The question of design or architecture is an important question to consider, not because we’re hiring an architect right now, but because one of the questions that developers rightly ask as they become more and more proficient is “what do I need to demonstrate in order to become an architect?”  The short answer is that minimally they need to demonstrate some architectural skill, but then it’s really hard to define exactly what that is, or how to compare one person’s architectural skill with another’s.  It’s even harder to tell them how they can go and get some architectural skill!

Thinking merely about the size of components doesn’t help those developers because, well, what’s big enough to be considered architecture?  And anyway, what specific skill do you expect them to pick up as they become architects?

The specific skill that separates design from architecture is considering the business side of the technological proposal.  Specifically, a designer who is thinking in terms of initial and ongoing costs of their solution, and whether the value delivered by the solution is sufficient to warrant those costs is doing architecture.  The designer who draws components on a whiteboard, no matter how grandiose those components might be, is only doing design until they start thinking about the implied costs.

Thinking about costs leads to all the other aspects that usually fall into solutions architecture.  In order to understand the initial build cost, the architect has to make sourcing decisions.  Hardware costs come from the deployment view of the architecture, which means the designer needs to understand how scale and fault tolerance will be achieved.  The distribution model and the data architecture imply network costs.  A design that includes a manual step in the business process will incur ongoing labour costs, which may or may not be justifiable relative to the initial build cost.  An architect who isn’t thinking about these costs in relation to the business is only doing part of the job, and their solution, while technically superb, will fail.

Note that the intuitive answer still works.  Design at the class level rarely has real cost implications.  Design at the component level, on the other hand, often does.  The design that specifies two short batch programs that access a file over the network might be architecture, but probably not a very good one.

So, when your developer asks, “what do I need to do to become an architect?” the answer is not “draw bigger boxes on the whiteboard,” but something along the lines of “talk to me about the cost of your solution: what other approaches did you consider, and what would their relative costs be?”  Get your developers to think along these lines, and they may grow up into architects who deliver successful solutions.

Availability: it’s expensive

March 12, 2008

I was jawing today with a developer who wanted to build something a little more robust than we needed. He didn’t build it, but he thought it would be interesting and fun (the root of another blog post one day, perhaps).

The thing is, cost generally grows exponentially with the uptime requirement. This is why uptime is so often expressed in nines: if 90% uptime costs 1 dollar (say), then 99% costs 10 dollars, 99.9% costs 100 dollars, 99.99% costs 1,000 and 99.999% costs $10,000. Now think of what an underpaid junior developer can develop in under 6 minutes, because that’s what you can build robust enough to stay up for all but five minutes per year for $10,000.

Don’t believe me? I bet your junior developer could develop something that continually writes incremental integers to the screen in six minutes. Now, how would you make that survive power failures? Screen failures? How about hot-replacement of parts? Yeah, now you’re talking about $10,000.
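The rule of thumb can be written down in a few lines. This is a sketch of the post’s toy numbers, not a real cost model; the $1 base and the factor of ten are assumptions straight from the example above.

```python
# Each additional nine of uptime multiplies cost by roughly ten.
# One nine = 90%, two nines = 99%, three nines = 99.9%, and so on.

def availability_cost(nines, base_cost=1.0):
    """Rough cost of achieving `nines` nines of uptime."""
    return base_cost * 10 ** (nines - 1)

for nines, uptime in [(1, "90%"), (2, "99%"), (3, "99.9%"),
                      (4, "99.99%"), (5, "99.999%")]:
    print(f"{uptime:>7} uptime: ${availability_cost(nines):,.0f}")
```

Run it and you get the same ladder as the paragraph above: $1, $10, $100, $1,000, $10,000. The lesson is that the last nine always costs more than all the previous ones combined.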

A couple of questions to consider in acquisition

November 20, 2007

When your CEO goes to buy another company, the chances are they’re going to ask you, their trusted technical advisor, to have a look and decide if the target’s technology is worth acquiring. Now, I’ve never heard of an acquisition stopping at this point, but I suppose it could affect the offering price. Here are some questions that are relevant:

  1. How is the company protecting their intellectual property? What patents does the company hold? What are the relevant patents in the space, and why does this company not infringe on those patents?
  2. How much can the company and their systems scale? Can they readily accommodate the increased demands of an amalgamated company? What are their plans for scaling beyond the handful of customers they have today?  Has this scaling been proven?
  3. What is the state of the company’s equipment?  Is it leased or owned?  How much life is left in it?
  4. What are the development processes that are currently in place?

Any company should be able to provide you with a number of architectural diagrams.  I would look for logical and deployment views to get you started.  There should also be a good entity relationship diagram for each database.

One Big Message?

August 16, 2007

This is the scenario: periodically, one component is going to look for objects that satisfy some criteria and tell the others. For example, perhaps it is the accounts whose billing anniversary is today. A bunch of other components are interested in these accounts, and so we’re going to send a message out that they can subscribe to and be notified of billing anniversaries. Clearly the initiating component is going to do a single query and get back a big list of accounts; should it send out a single big message, or many small ones?

Ordinarily, when one component is talking to another, I like to see as much go into each message as possible. This was a lesson we learned back in the days of CORBA (which I realize may not be completely over everywhere) when, based on some particularly optimistic advice from authors who had apparently never actually built more than sample code, everyone I worked with was composing interfaces for inter-process communication that exposed myriad small methods. The key to killing performance of distributed systems, it turns out, is to make communication synchronous and chatty so that the communication overhead dominates the payload. Maybe this is what killed CORBA for so many people; I have no doubt their early experiences with performance sowed some initial seeds of doubt.

However, in this case, we don’t have the same scenario. First of all, we’re talking about asynchronous communication. So, while we would add communication overhead by going with multiple small messages, it’s overhead that will only affect total processing resources, and not latency of individual requests. On the other hand, there is quite a bit to be gained from using multiple small messages —

  • we can route and filter the individual messages based on their contents.
  • we can scale the message processing by “simply” adding more sinks.
  • we can be much less clever about interrupted processing: the sink consumes each message in a transaction, and so a failed transaction will be restarted automatically (perhaps by another recipient).
  • the constant-size messages mean that the sink does not have to scale. So, if we have a simple in-memory reader for the message that works for 20 accounts today, the same reader will work tomorrow when we have 20k accounts.
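The many-small-messages approach can be sketched in a few lines. The `publish` callable here is a stand-in for whatever topic or queue API you actually use, and the message fields are invented for illustration.

```python
import json

def publish_billing_anniversaries(accounts, publish):
    """Emit one small, self-contained message per matching account,
    rather than one big message carrying the whole query result."""
    for account in accounts:
        publish(json.dumps({
            "event": "billing_anniversary",
            "account_id": account["id"],
        }))

# Each message is constant-size, so a simple in-memory sink that works
# for 20 accounts today will still work for 20k accounts tomorrow.
sent = []
publish_billing_anniversaries([{"id": 1}, {"id": 2}], sent.append)
assert len(sent) == 2
```

Because each message stands alone, the sinks can filter on content, scale out by adding consumers, and retry a single failed account without re-processing the whole batch.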

It’s a small question in the grand scheme of enterprise architecture, but all these small things add up.

Perhaps the next question is: how small is too small?

Scripts and Programs

August 10, 2007

What’s the difference between a script and a program?

Most people to whom I ask this question jump at the idea that a script is interpreted, and indeed, the Wikipedia page on scripting languages currently makes that distinction.  However, there are lots of interpreted programming languages — Basic, Prolog and Smalltalk all leap to mind — so that can’t be it.

So then, most people tend to say something to the effect that scripting isn’t serious, while programming is.  Well, that might be true, but it also has huge implications!  If scripting isn’t serious, then is there any need to apply proper software development practices to scripts?  Do you need to keep scripts under source control?  Do they need requirements?  Do you need to test them thoroughly?  What about documentation?

The assumption that scripts are not serious has serious implications for an organization’s stability.  Organizations that don’t value scripts wind up with a host of arcane code snippets sitting on servers, known only to the handful of sysadmins who composed them and shared them with others.  When those sysadmins leave, their tools wither in lost directories and the organization breaks.

My own definition is this: a script is a sequence of instructions that is only intended to run once without modification.  Any script that you plan to use more than once is a program and deserves to be treated and cared for with the appropriate respect.

This has the interesting side effect that you can write programs in languages that are called “scripting languages,” and may, like JavaScript, even have “script” in their names.  Perhaps I am swimming upstream, but I would say that the standard definition is useless and indeed dangerous.

So don’t tell me, “there’s just a little script that runs nightly and downloads the billing information,” when what you mean is “there’s a critical csh program…”  Such scripts deserve all our respect!


Purchase is just a special case of Exchange

July 13, 2007

When you forget something that you learned before, perhaps it’s time to write it down.  So here goes.

It turns out that what most people think of as e-Commerce, and in fact, any kind of commerce, is just the very tip of the iceberg.  What is below the surface, behind shopping carts and packing and shipping is a huge ugly monster – the exchange process.

An exchange is a process where a customer purchases some items, returns some of them and receives replacements for some of the returned items.  So, you see, both Return and Purchase are special cases of this flow. 

The first time I came to realize this, we were building a very nice e-store.  Unfortunately, no-one had thought about returns or exchanges at all.  Indeed, it was unclear whose responsibility they were!  So, the store kept getting more and more elaborate, and no one thought much about returns until we came to writing the terms and conditions (how this happened is a story for another day).  Needless to say, that was much much much too late to make a really snazzy experience for the customer or indeed for the poor service agent.

What we should have done is started thinking about exchanges up front.  How would inventory move from the warehouse to the customer and back again?  How would the money work, and what about the taxes? 

Now, I’m working on another offering, and realized, after a two-hour session on the purchase, provisioning and billing of a service, that we hadn’t yet thought about the exchange side.  Fortunately, it happened before we got too far down the road, but it would have been so much more efficient to have come up with it up front.

If you concentrate on the exchange process up front, you may be able to build hooks into the purchase process or even the product design to make the exchange easier (or indeed possible).  The simplest example is a shipping label that goes into the box to enable the customer to return items — this can be barcoded to identify the items and original shipment.

So next time, don’t have a meeting to discuss the purchase process, but to discuss the exchange process.  You can touch on the purchase if you want, but it’s the exchange where the real issues show up.
