Archive for the ‘programming’ Category

Does the CAP Theorem have a Second Order?

May 25, 2014

A couple of years ago, we decided at Central 1 that our services should fall on the Availability-Partition Tolerance (AP) side of the CAP Theorem. The assertion at the time was that, at a business level, it is reasonable to accept eventual consistency if we can be always available and partition tolerant. With our old systems, we made that tradeoff all the time, and sorted out the reconciliation issues the next day.

Recently, we were working on implementing Interac Online Payments, which has a fairly complex message flow that includes the POS switching network. The details aren’t important here, but the net result was that we needed to handle a scenario where the first part of a transaction might come to one data center, and the second part would come to the other. Conceptually, it was a bit like the propose and commit in 2-phase commit coming to different data centers.

The system is based on an Active-Active database server pair with two-way replication between them. Unfortunately, we were seeing the commit message come to the remote data center before the propose message was replicated there. Our solution is to try to route the commit message to the same data center as the original propose message. The result is that if the service is unavailable at the location that received the propose message, (even if the propose was replicated) we respond negatively to the commit: we answer inconsistently. Having said that, we can always receive a message, and our system continues to function if the network gets partitioned.

This leads me to wonder if the CAP Theorem has a second order. That is, if I have a data service that is AP, is it impossible for me to create a service on top of it that is Available-Consistent or Consistent-Partition Tolerant?


Inspired by Atlassian’s Fedex Day

December 21, 2011

My team has been after me for years to implement something like the Google 20.  Well, I’ve never felt I can afford to work only four days a week on delivering what the business asks for — 6 would be preferable.  So, we never did it.  However, Atlassian came up with the idea for Fedex days a few years ago, and this seemed a much more sellable idea, especially if we did it in December when things are starting to slow down a bit.  This year we tried it out.

We changed it a little from their format, but looking at their FAQ, there are some things we should adopt, like grabbing screenshots on the last morning in case the project breaks a few minutes before the deadline.  We also made it a two-day event, rather than just 24 hours.  Our environment is complex, and it could easily take a day just to get started.

Noon on Wednesday hit and the energy on the development floor went through the roof! Suddenly little teams of two or three formed all over the place, laptops emerged so people could work at each others’ desks, developers were huddling. Work continued into the wee hours of the morning both days. It was great!

Being the director, I decided to lead by example, and came up with my own project.  Part of my time was eaten by meetings that I couldn’t avoid, but for much of those two days I managed to roll up my sleeves and do some development.  True to form, I decided to start by learning a new language and development environment, and implemented my project in Grails.

By the end of Wednesday afternoon, I’d gone through all the tutorials I felt I needed and started on my actual project, which was to call the Innotas API to create a tool to simplify accepting timesheets.  That’s more or less when I found out that Grails is not all that much help for calling web services.  Oh well, I persevered, and thanks to Adrian Brennan, who was working on another integration with Innotas, I got my application to talk to Innotas by the time I went home, around 3 AM.

The Innotas API is a poster child for the worst API ever.  To do remarkably simple things, you need to cross the network tens of times.  It’s like traversing an XML document one node at a time over the Internet.  But I digress.

Thursday dawned earlier than expected and some of the teams were starting to struggle, including me.  I had more than half the day devoted to meetings that I couldn’t avoid.  Worse, there were no good blocks of time to get in the zone.  I was experiencing first-hand the difficulty with context-switching that my developers go through every day.  Indeed, I only got about two hours of productive time during the day, and came back in the evening.  When I left at 2 AM, I wasn’t the last to leave, and I suspect there were more working from home.

Friday morning flew by, and some of the organizational items that I’d left until the last minute became minor crises – mental note for next year!  However, I managed to get a partial demo working, which meant that at least I wouldn’t embarrass myself in the afternoon.

Suddenly it was noon, and a mountain of pizza was being delivered to our largest meeting room, which attracted the whole team very effectively.  Everyone grabbed some pizza and we called into the conference bridge for the handful of remote workers.  The afternoon would be long.

Atlassian limits their demos to three minutes.  We didn’t limit the demos this year, but next year we will.  A couple of people chose to show documents or presentations that they’d worked on, which I feel is counter to the spirit of the event.  We won’t accept those next year either.

One of the things I’d left until the last minute was figuring out exactly how we would finagle our way into the development VLAN from the conference room.  The challenges of seeing demos on various developer machines while simultaneously using or gotomeeting ate up too much time.  So next year we’ll do a little practice in the week before, and we’ll get two computers going so we don’t have to wait for each demo to set up.  Well, lessons learned.

I hoped for team engagement, skills development and demonstration, and we got those in spades.  I thought we might perhaps get a product idea or two, but I was completely blown away by the number of projects that resulted in something that is almost usable in our products. We got way more value out of this initiative than I expected, and I fully several projects to graduate into our products after a little refinement.

If you’ve thought about Fedex Days for your organization, I heartily recommend finding a quiet time of the year and going for it.

Technical Debt and Interest

August 9, 2011

Since installing Sonar over a year ago, we’ve been working to reduce our technical debt.  In some of our applications, which have been around for nigh on a decade, we have accumulated huge amounts of technical debt.  I don’t hold much faith in the numbers produced by Sonar in absolute terms, but it is encouraging to see the numbers go down little by little.

Our product management team seems to have grabbed onto the notion of technical debt.  Being from a financial institution they even get the notion that bad code isn’t so much a debt as an un-hedged call option, but they also recognize that it’s much easier to explain (and say) “technical debt” than “technical unhedged call option.”  They get this idea, and like it, but the natural question they should be asking is, “How much interest should they expect to pay should we take on some amount of technical debt?”

In the real world, debt upon which we pay no interest is like free money: you could take that loan and invest it in a sure-win investment, and repay your debt later, pocketing whatever growth you were able to get from the investment.  It’s the same with code: technical debt on which you pay no interest was probably incurred to get the code out faster, leaving budget and time for other money-making features.

How do we calculate interest, then?  The interest is a measure of how much longer it takes to maintain the code than it would if the code were idealized.  If the debt itself, the principal as it were, corresponds to the amount of time it would take to rectify the bad code, the interest is only slightly related to the principal.  And thus you see, product management’s question is difficult to answer.

Probably the easiest technical debt and interest to understand is that from duplicate code.  The principal for duplicate code is the time it would take to extract a method and replace both duplicates with a call to the method.  The interest is the time it takes to determine that duplicate code exists and replicate and test the fix in both places.  The tough part is determining that the duplicate code exists, and this may not happen until testing or even production.  Of course, if we never have to change the duplicate code, then there is no effort for fixing it, and so, in that case, the interest is zero.

So, I propose that the technical interest is something like

Technical Interest = Cost of Maintaining Bad Code * Probability that Maintenance is Required

You quickly realize then that it’s not enough to talk about the total debt in the system; indeed, it’s useless to talk about the total debt as some of it is a zero-interest, no down-payment type of loan.  What is much more interesting is to talk about the total interest payments being made on the system, and for that, you really need to decompose the source code into modules and analyze which modules incur the most change.

It’s also useful to look at the different types of debt and decide which of them are incurring the most interest.  Duplicate code in a quickly changing codebase, for example, is probably incurring more interest than even an empty catch block in the same codebase.  However, they both take about the same amount of time to fix.  Which should you fix first?  Because the interest on technical debt compounds, you should always pay off the high-interest loan first.

Availability: it’s expensive

March 12, 2008

I was jawing today with a developer who wanted to build something a little more robust than we needed it. He didn’t do it, but he thought it would be interesting and fun (the root of another blog post one day, perhaps).

The thing is, generally cost is proportional to something like the exponent of the uptime requirement. This is why uptime is so often expressed in terms of percentage: if 9% uptime costs 1 dollar (say), then 99% costs 10 dollars, 99.9 costs 100 dollars, 99.99 costs 1000 and 99.999 costs $10,000. Now think of what an underpaid junior developer can develop in under 6 minutes, because that’s what you can build robust enough to stay up for all but five minutes per year for $10,000.

Don’t believe me? I bet your junior developer could develop something that continually writes incremental integers to the screen in six minutes. Now, how would you make that survive power failures, screen failures, how about hot-replacement of parts? Yeah, now you’re talking about $10,000.

Scripts and Programs

August 10, 2007

What’s the difference between a script and a program?

Most people to whom I ask this question jump at the idea that a script is interpreted, and indeed, the Wikipedia page on scripting languages currently makes that distinction.  However, there are lots of interpreted programming languages — Basic, Prolog and Smalltalk all leap to mind — so that can’t be it.

So then, most people tend to say something to the effect that scripting isn’t serious, while programming is.  Well, that might be true, but it also has huge implications!  If scripting isn’t serious, then is there any need to apply proper software development practices to scripts?  Do you need to keep scripts under source control?  Do they need requirements?  Do you need to test them thoroughly?  What about documentation?

The assumption that scripts are not serious has serious implications to an organization’s stability.  Organizations that don’t value scripts wind up with a host of arcane code snippets sitting on servers, known only to a handful of sysadmins who have composed them and shared them with others.  When those sysadmins leave, their tools whither in lost directories and the organization breaks.

My own definition is this: a script is a sequence of instructions that is only intended to run once without modification.  Any script that you plan to use more than once is a program and deserves to be treated and cared for with the appropriate respect.

This has the interesting side effect that you can write programs in languages that are called “scripting languages,” and may, like JavaScript, even have “script” in their names.  Perhaps I am swimming upstream, but I would say that the standard definition is useless and indeed dangerous.

So don’t tell me, “there’s just a little script that runs nightly and downloads the billing information,” when what you mean is “there’s a critical csh program…”  Such scripts deserve all our respect!

Blogged with Flock

A common trap

May 9, 2007

Maybe I’ll make this into an interview question one day. Yesterday I was trolling through some code for a little script that ETLs from one database to another. It looked a bit like this:

void quit() {
db1.rollback(); db2.rollback();
db1.disconnect(); db2.disconnect();
void main() {
if (!db1.connect()) exit;
if (!db2.connect()) exit;
if !(db1.commit()) quit();
if !(db2.commit()) quit();
db1.disconnect(); db2.disconnect();

Now, I’ve simplified tremendously here, and there was actually some error handling in getStuff and writeStuff, which made the code look somewhat robust, but what about those two commits? Now, the interesting thing about this was not so much the code, which is wrong, but but rather that I was talking to a colleague who I regard as a highly talented and extremely productive developer, and he didn’t see anything wrong with it. It all makes me wonder how much of the world’s code is rife with similar errors.
So, if you’re ever in an interview with me, get ready for this question.

TortoiseSVN SSH and Certificates

April 30, 2007

Right, so, there is an excellent post here: that describes how to get Tortoise Subversion to authenticate with a certificate.

The problem I have with many how-to articles is that there is not enough room among all the steps to explain why the magic works. I’ve written how-to articles as well, and I know that this is an easy trap; it is, after all, time consuming to enumerate all those steps. However, especially with software systems, where there is often magic involved, it’s important to explain the trick so someone can trouble-shoot the inevitable problems.

In this case, on the client side, there is some magic around how Tortoise figures out where your certificate is. It turns out that Tortoise is looking for a registry entry /SimonTatham/PuTTY/Sessions/sessionname/PublicKeyFile, where sessionname is the host name in the repository url. Putty creates one of these entries for each session that you save, which for me, is for each server that I try to connect to, and so the distinction is a little bit subtle. Tortoise, is trying to find a repository at svn+ssh://username@sessionname/repository.

So now, if you are having trouble with TortoiseSVN asking for a password over and over again, make sure you’re using the name of the PuTTY session, and not, say, the name of the server.