Tuesday, 4 March 2014

The Importance of a Good Shard Key

In MongoDB (and many other database solutions) scalability is achieved by dividing your database into a number of smaller portions. Each portion is stored on a separate set of database nodes. The theory is that this approach allows your database to scale horizontally beyond the capacity of a single node and allows load to be spread more evenly across multiple clusters of nodes. In MongoDB this approach is known as sharding (many other databases call a similar concept partitioning).

When you deploy shards in MongoDB you have to select a field (or fields) present in every document as the shard key. The value of this field determines which set of database nodes holds a given document. If you get this key selection wrong then the impact on your system can be huge.

To see how vital shard key selection is, consider this real-world, non-IT example, which clearly highlights how a poorly selected shard key can massively impact the performance of a system:

At the weekend I participated in a large Half Marathon event, with about 20,000 other runners. The number of bags to be stored while the race was on was too large for a single baggage tent, so the organisers had wisely decided to introduce a sharded approach by having two tents, each holding half the bags. Additionally, each tent was further sub-divided into separate evenly-sized sharded collections of bags, each managed by its own team of helpers.

Now, as the shard key the organisers had decided to use race number: a seemingly sensible choice, given that this would be the single unique piece of information that every runner would have. The bag tents were therefore arranged so that tent 1 was for numbers 1 to 10,000 and tent 2 was for numbers 10,001 to 20,000. Inside the tents, the sub-divisions were broken down into 1,000-number blocks (i.e. 1 to 1,000, 1,001 to 2,000 and so on).

For pre-race bag drop off this sharding approach worked really well (the write scenario). Runners dropping off bags were nicely distributed across both time and the full range of values, so each tent and sub-division could work in parallel for the maximum write performance. This was clearly an excellent shard key selection for the write case.

The problem occurred at the end of the race when runners returned to the tents to collect their bags (the read scenario). Now, the organisers of the race had issued the numbers based on predicted finish time (1 for the fastest predicted finisher, 20,000 for the slowest): the result being that runners finished roughly in number order.

So, what happened was that there was initially a massive read queue at tent 1 for the 1 to 2,000 numbers, while the other shards were almost completely idle. Then the queue moved to the 2,001 to 4,000 shards, and so on. Towards the end of the race, tent 1 was idle while tent 2 had the read queues as the later-numbered runners all finished.

We have a perfect case of a shard key that seems quite sensible by design but is actually fundamentally broken by the usage scenario of the system, in this case the (near) sequential arrival of numbers for retrieving the bags.

So, how can we solve this? Issuing the race numbers in a different order is one option, but a sequential numbering system based on predicted finish time is quite sensible for many other race organisation requirements. A better option is to change the shard key used by the baggage tents.

Basing the shard key around race number is still a good approach as this is the one piece of information guaranteed to be unique for each runner and the range of possible values is clearly defined and well distributed. Bag drop off (write performance) is never going to be a problem as runners will arrive fairly well distributed across both time and the race number range.

The challenge is coming up with a way of better distributing the bag retrieval (read performance). Fortunately with a sequential numbering system this is pretty easy. Rather than shard by the whole number, just shard by the last digit of the number: tent 1 – numbers ending 0 to 4; tent 2 – numbers ending 5 to 9. Then in each tent further divide into five separate shards, one for each ending digit. If finer granularity is required then within each shard it should be easy enough to sort and index by full race number.
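The two schemes can be sketched in a few lines of Scala. This is purely illustrative (the function names and block sizes are made up to match the story, not taken from any real system):

```scala
// Range-based key: shards on the high-order part of the race number, so
// runners finishing in (near) number order all queue at the same shard.
def rangeShard(raceNumber: Int): Int = (raceNumber - 1) / 1000

// Last-digit key: consecutive race numbers map to different shards, so a
// near-sequential stream of arrivals is spread across all ten shards.
def lastDigitShard(raceNumber: Int): Int = raceNumber % 10
```

Runners 1,001, 1,002 and 1,003 all land in range shard 1 and queue together, but in last-digit shards 1, 2 and 3, so they are served in parallel.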

Selecting a good shard key is hard. Considering both the write and read scenarios, and how the shard key will be utilised within them, is critical. It can be the difference between a well-functioning system and one that fails to gain the benefits of the sharded approach.

Wednesday, 4 December 2013

Scala's Maturing Community

I've been involved in the Scala community and attending Scala conferences for about four years now. I've just come back from 2013 Scala eXchange, hosted by Skillsmatter in London. It was a great conference, one of the best I’ve been to, and the organisers and speakers should be thoroughly pleased with their efforts. One thing that I did think quite interesting was a significant change in emphasis from all of the previous Scala conferences that I’ve attended.

When I first started going to Scala conferences four years ago, the emphasis was definitely on an introduction to Scala. The talks focused on basic language features, how to use the collections libraries and core functional programming concepts. There were also some side talks about interesting libraries built in Scala, like Akka and Lift.

Last year the focus moved to a higher level, with more time spent talking about where Scala was going and on more advanced functional programming concepts. There were talks focusing on things like Lenses and Monads, and lots of detail about highly functional libraries developed using Scala. Presentations about Akka and Lift were still present, and people were starting to talk about Futures. In all of these talks, however, functional programming was the primary focus.

At Scala eXchange this year the emphasis was almost entirely flipped around. Most talks were about reactive programming using asynchronous approaches. Loads of stuff about Akka, actors, futures, events and the like. Some talks focused on Scala features such as macros and using type classes. However, there was very little direct talk of functional programming: it was just assumed that everyone present was using immutable data and following a functional programming approach. A huge shift in perspective from just a year ago.

I believe that this represents a massive shift in the maturity of the Scala community. We have moved from being an immature group learning how to use this exciting new language, how to work with immutable data and how to blend functional approaches into our code, to a group using this fantastic language and a set of fairly mature products and reactive techniques to build highly scalable systems. A huge leap in just a couple of years.

I remember a similar transition in the Java community when conferences went from talking about basic Java concepts like POJOs, collections and design patterns to talking about building complex solutions using advanced libraries like Spring and JMS. The Scala community seems to have made this leap much more quickly than the Java community did.

The one thing that worries me slightly about this change is that using immutable data and functional programming has almost become an unwritten assumption. Each presentation just seemed to assume that you would be programming in this way by default. While this is great for those of us in the community who have made this transition, I think we need to be careful not to skip this step for those people just transitioning into the Scala world.

The worst case would be developers transitioning from a language like Java straight into an asynchronous reactive programming model without first going through the transformation of thinking in terms of immutable data and functional programming. Bringing the concepts of mutable state and imperative code straight into an async world is a recipe for disastrous software projects. These in turn could tarnish Scala's reputation and significantly complicate the lives of those of us trying to bring Scala into traditionally conservative organisations and companies.

It's great that the Scala community has come so far and matured so fast. But let's not forget the core concepts that underlie this transformation. We must ensure that we continue to emphasise their importance as we move forward into the brave new world of highly scalable, asynchronous, reactive systems that Scala seems to be targeting so directly and successfully.

Monday, 15 April 2013

Coaching: How Not What

Yesterday evening I was playing a game of 5-a-side football (soccer to my American readers) with some friends. One of the players on my team was a particularly competent player who was also very vocal in offering advice to his team mates on what to do.

For example, when a player from the opposing team was approaching me with the ball he would offer sage advice such as "Don't let them get past you!". This was rather stating the obvious, as I was quite clearly not planning on letting the opponent go past me and take a clear shot on goal.

In this particular situation my challenge was not in knowing what I should do, but in having the skill and experience necessary to carry out the task. No amount of shouted commands was suddenly going to increase my skill beyond the level I currently possess.

This got me thinking about how we coach people in programming and agile practices. How often do we say to someone something like "that needs to be refactored" or "you need to keep your functions small"? Or, we might say to an agile team "don't overcommit yourselves this sprint". All of these are instructions to do something, made with the assumption that the people receiving them have the skill and experience necessary to carry out the actions.

What if, like me on the football field, the recipients of these instructions know what they should do, but not how to do it? Clearly we are not being successful coaches in these cases. We need to focus much more on the how rather than the what. Advice offered should be enabling, providing the recipient with a way of learning and gaining experience.

For example, "if you pull this bit of code out into a separate function then this loop becomes much simpler" is much more helpful than "that needs to be refactored".
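To make that concrete, here is a hypothetical before/after in Scala of exactly that piece of advice (the domain, names and the 20% tax rate are all invented for the example):

```scala
// Before: the loop body mixes iteration with the calculation detail.
def totalBefore(orders: List[Double]): Double = {
  var total = 0.0
  for (o <- orders) {
    val withTax = o + o * 0.2
    total += withTax
  }
  total
}

// After: pulling the calculation out into its own function makes the
// remaining code much simpler and the calculation testable in isolation.
def withTax(net: Double): Double = net * 1.2

def totalAfter(orders: List[Double]): Double = orders.map(withTax).sum
```

The "after" version shows the how, not just the what: the learner can see the mechanical step (extract a function) and why it helps.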

As coaches we need to be mindful of how we communicate in order to improve the people under our guidance. It's very easy to let it slip and just become another player shouting advice to those who haven't yet gained the skill necessary to implement that advice.

Wednesday, 6 February 2013

Fast Track to Haskell Course

I've spent the last two days on a Fast Track to Haskell course. It was a very interesting two days and I learnt a huge amount. The course was taught by Andres Löh from Well-Typed and hosted by the excellent people at Skillsmatter.

I went into the course having read the superb Learn You A Haskell book and having built a couple of experimental Haskell programs. From that starting point there was not a huge amount of content in the course that I was not aware of. However, it was great to have my learning verified and my understanding in many areas significantly improved. There were also a few things that changed my thinking on certain topics. All in all, a well spent two days.

The first day started with a whistle stop tour of the Haskell language and how to build things in it. There were plenty of excellent exercises to help cement learning and understanding. Then we looked in more detail at types and how to reason about software using types. The day concluded with looking at some more advanced type concepts including higher-order functions and type classes. The second day followed on with some more about types and then went on to look at how Haskell deals with I/O (which is very different to most other languages I have encountered). The rest of the day was then spent looking at common patterns that can be found in Haskell code and how these can be generalised into concepts such as Functors and Monads. Plenty more exercises followed.

I will now be going away and working on some private projects in Haskell, with the hope of attending the advanced course some time later this year. In the meantime, this blog post looks at some of the key things that I came away from the course with and how they relate back to the Scala and Java that I tend to do for my day job.

Focus on the Types, they are the API

When using an OO language, such as Java (or even Scala), we tend to think about classes, their interfaces and what behaviour they have, and then use this as a starting point to develop from. In Haskell you tend to think first about the types that are involved and then work forward from there. This is a very interesting approach as it makes it possible to think about a problem in a very general and quite abstract way without getting too bogged down in detail too quickly. The types become an API to your system (rather than the interfaces, methods and domain model in an OO solution).

My experience from the course was that this tends to lead to many more types and type aliases in Haskell than I would typically define in Java or Scala. However, this soon proves to be a huge advantage as it then makes it much easier to reason about what each function does/should do and makes its behaviour very clear via the type signature. I will certainly be looking at following this practice in any Haskell programs I write and also trying to introduce more types into my Scala code.
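As a small sketch of what this looks like in Scala, here is the kind of alias-heavy style the course encouraged (the domain and all names are invented for illustration):

```scala
// Type aliases turn primitive types into domain vocabulary.
type CustomerId = Long
type Email = String

case class Customer(id: CustomerId, email: Email)

// The signature alone documents what the function needs and may return.
def emailFor(customers: Map[CustomerId, Customer], id: CustomerId): Option[Email] =
  customers.get(id).map(_.email)
```

Compare `Map[Long, Customer] => Long => Option[String]` with the aliased version: the types themselves now describe the API.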

It's Pattern Matching and Recursion All The Way Down

In a language like Java there is nothing like pattern matching and recursion has to be used sparingly if you want to avoid blowing up the stack and writing poorly performing code. In Scala there is much more scope for both, although writing tail recursive functions is still very important to ensure that you get the best optimisations from the compiler. In Haskell, pattern matching and recursion are the bread and butter of the language. Almost every single function tends to be a pattern match on arguments and the language supports a naturally recursive style across the board. Given the lazy semantics of Haskell and the optimisation in the runtime for recursive algorithms this is the best approach for building programs in Haskell.

One area for improvement that I can see in my Scala code is to make more use of pattern matching, especially partial functions.
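Both ideas can be shown in a few lines of Scala. This is a minimal sketch of the Haskell style described above, transplanted into Scala:

```scala
import scala.annotation.tailrec

// Sum of a list written as a pattern match on its structure; tail
// recursive, so the compiler optimises it into a loop.
@tailrec
def sumList(xs: List[Int], acc: Int = 0): Int = xs match {
  case Nil          => acc
  case head :: tail => sumList(tail, acc + head)
}

// A partial function: defined only for even numbers, handy with collect.
val halveEvens: PartialFunction[Int, Int] = {
  case n if n % 2 == 0 => n / 2
}
```

`List(1, 2, 3, 4).collect(halveEvens)` keeps only the even elements and halves them, combining the filter and the transform in one pattern match.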

Thinking Curried Helps a Lot

In the Scala world we typically only curry functions when there are very obvious places to partially apply them. More often than not we just partially apply a function using the underscore for one of its parameters. This is needed to ensure compatibility with Java and the object-oriented model. However, in Haskell every function is completely curried. There are no multi-argument functions. Everything is a single argument function that returns either a result or another function.

Initially I found that this made sense in the model that I'm used to: partially applying functions. However, once it came to defining new types as actually being functions as well I found that this started to fry my mind slightly. The conclusion that I came to is that it's best to just think of everything from a curried mindset and throughout the course things gradually became clearer. This thinking is essential to understand things like the State monad (which I'm still not 100% confident about to be totally honest).
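In Scala the curried mindset maps onto multiple parameter lists. A minimal sketch:

```scala
// add is conceptually a one-argument function that returns another
// function Int => Int, just as every Haskell function is.
def add(x: Int)(y: Int): Int = x + y

// Partial application yields a new function, which is what every
// Haskell function application does implicitly.
val addFive: Int => Int = add(5)
```

So `add(2)(3)` and `addFive(3)` are the same computation viewed through the curried lens.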

Parametric Polymorphism is Very Useful

One of my biggest take-aways from the course was how useful and important parametric polymorphism is. As an (overly) simple example, what does a function from Int -> Int do? It could return the same number, add something to it, multiply it, factor it, take its modulus, ignore the input and return a constant - the possibilities are nearly endless. However, what does a function from a -> a do? Given no other constraints it can only do one thing - return its argument unchanged (this is called the identity function, id in Haskell).

The ability to define functions in this polymorphic way makes it much easier to reason about what a function might do and also to restrict what a function actually can do. This second point greatly reduces the possibility of introducing unexpected defects or creating functions that have multiple interweaved concerns. As another example, consider the function: Ord a => [a] -> [a]. What does this do? Well, it could just return the same input list, but because it takes the Ord type class as a constraint it's quite likely to be some form of sorting or filtering by order function. As the writer of this function I'm restricted to working with lists and just using equality and ordering functionality so the possibility of doing anything unexpected is greatly reduced.
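The same two signatures can be written in Scala, using an implicit Ordering parameter as the rough analogue of Haskell's Ord constraint (a sketch, not a full type class treatment):

```scala
// With no constraints on A, this function can only return its argument.
def ident[A](a: A): A = a

// Adding an Ordering context permits sorting and comparing, but still
// nothing specific to any concrete element type.
def sortAll[A](xs: List[A])(implicit ord: Ordering[A]): List[A] = xs.sorted
```

The signatures constrain the implementations: `ident` cannot fabricate an `A`, and `sortAll` can only rearrange elements using the ordering it is given.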

Parametric polymorphism is possible in Scala, but not quite as powerful, as there aren't type classes for many common concepts (e.g. equality, ordering, numbers, functor, applicative, monoid, monad etc.). Introducing Scalaz fixes this for quite a lot of cases, but that's not to the taste of many projects, and mixed Scala/Java teams may find it a step too far. However, there's certainly more scope for using parametric polymorphism in everyday Scala code, and this is something I'm going to work on over the coming weeks (expect some more blog posts).

Don't Try to Over Generalise

Something that I've seen in every language is that once developers discover the power of generalising concepts they tend to go overboard with this. Remember the time when every piece of code was a combination of seven billion design patterns? The power of Haskell's type classes and parametric polymorphism makes it very easy to see patterns that may not actually hold. I can see that it would be very easy to waste a lot of time and effort trying to make a data type fit into various different type classes or trying to spot type class patterns in your code and pull them out into generic concepts.

As with any kind of design pattern, generalisation or similar, the best approach (in any language) is to just build what you need to solve a problem and then when it becomes apparent that it either fits an existing pattern or represents a general pattern then this change can be made as part of a refactoring process.

Try to Keep Pure and I/O Code Separate

Haskell's I/O model at first seems very different, unusual and restrictive. Once you get your head around it, though, it makes so much sense. Separating pure code from code that talks about doing an I/O action from code that actually does the I/O makes for software that is much easier to reason about, test and reuse. Very powerful, and something I'm looking forward to exploring more during my journey into Haskell.

In the Scala world, I'm not sure about the need to go to such lengths as using an I/O Monad or similar. However, I can really see the advantage of separating pure code from code that performs I/O. Having pure functions that transform data and then wiring them together with the I/O performing functions at a higher level seems a natural way to create software that is better structured and easier to test. I'll certainly be playing with some approaches in Scala to experiment with this concept further.
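A minimal sketch of that structure in Scala (the word-counting domain is invented for the example):

```scala
// Pure core: transforms data, trivially testable with no file system.
def wordCounts(text: String): Map[String, Int] =
  text.toLowerCase.split("\\s+").toList
    .filter(_.nonEmpty)
    .groupBy(w => w)
    .map { case (w, ws) => w -> ws.length }

// Impure shell: performs the I/O, then delegates to the pure core.
def countWordsInFile(path: String): Map[String, Int] =
  wordCounts(scala.io.Source.fromFile(path).mkString)
```

All of the interesting logic lives in `wordCounts`, which can be tested with plain strings; the I/O function is a thin wrapper with almost nothing left to get wrong.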

Functors, Applicatives, Monoids and Monads are Nothing Special

A final observation from the course is that many people learning and talking about functional programming get a bit obsessed with concepts like Functors, Applicatives, Monoids and Monads (especially Monads). What I learnt is that most of these are just simple patterns that can be represented by one or two basic functions and a few simple rules. Nothing more to them really. Certainly nothing to get all worked up about!

Yes, there's a great deal of skill involved in knowing when to use these patterns, when to make your own code follow a pattern and so on, but that's true of any programming language or design pattern. It's called experience.
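To back up the "nothing special" claim, here is a Monoid written from scratch in Scala: two functions plus the associativity and identity laws, and that's the whole pattern (a hand-rolled sketch, not the Scalaz definition):

```scala
// A Monoid is just an identity element and an associative operation.
trait Monoid[A] {
  def empty: A                  // identity element
  def combine(x: A, y: A): A    // associative operation
}

val intAddition: Monoid[Int] = new Monoid[Int] {
  def empty = 0
  def combine(x: Int, y: Int) = x + y
}

// Knowing only the Monoid pattern, we can fold any list of As.
def combineAll[A](xs: List[A])(m: Monoid[A]): A =
  xs.foldLeft(m.empty)(m.combine)
```

`combineAll` works for integer addition, string concatenation, map merging or anything else with a lawful instance; the skill is in spotting where the pattern genuinely applies.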


If you are a Scala developer who is enjoying the functional programming aspects of the language I would certainly recommend taking a look at Haskell. I found that it clarified lots of things that I thought I understood from Scala but still had some doubts about. Haskell is, in my opinion, a more elegant language than Scala and once you start using it you realise just how many hoops the Scala team had to jump through to maintain compatibility with the JVM, Java code and object-oriented features. Over the two days I learnt a lot about Haskell, a lot about functional programming and a lot about how to get even more value from Scala.

Thursday, 24 January 2013

Type Classes Make For Better Software Structure

When people usually talk about type classes in languages like Haskell and Scala they tend to be thinking of highly generic concepts like Ordering or Numeric operations. However, in some recent experiments with Haskell I have found that type classes can be very useful for more concrete things. When used appropriately they can lead to more loosely coupled software that is more flexible and easier to change.

I Just Met You, This Is Crazy, Here's My Number, Call Me Function?

As an example, let's consider a telephone number. Fairly simple, right? It's just a string of digits:

def sendMessage(toNumber: String, theMessage: Message)

Or perhaps we should create a type alias to give us some better info and scope for change:

type TelephoneNumber = String

def sendMessage(toNumber: TelephoneNumber, theMessage: Message)

But wait, I'm trying to build a library that sends messages to telephones and I need to know much more about a telephone number than this 'stringly' typed data is able to tell me. Things like:

  • Is it in international format?
  • What's the country prefix code?
  • Is it a mobile number or a landline number?
  • What is the area code?
  • Is it some premium rate number?
  • and so on…

The Naive Solution

It's clear that my 'stringly' typed representation is not sufficient for this task unless I want to encode a whole load of rules about the world's telephone numbering systems into my library (believe me when I say I don't!). So, a naive approach would be to introduce a new type that holds the information in a more usable format. Something like:

case class TelephoneNumber(countryCode: String, areaCode: String, number: String, isMobile: Boolean, isLandline: Boolean, …)

Unfortunately, this approach suffers from a whole host of problems, including:

  • If I build my library implementation against this class then I can't easily change implementation details without breaking the API
  • If I copy from an instance of this type into an internal representation then I have a lot of duplication and extra code to test
  • If a client uses this type in their domain model they are tightly coupled to my library (and also a specific version of it)
  • If a client copies from their own representation into an instance of this type there is more duplication and extra code to test

A More Object-Oriented Approach

So, how do we deal with this? Well what we need to do is separate the behaviour of a telephone number from its actual data representation. In an object-oriented language we would do this by introducing an interface (or in Scala, a trait):

trait TelephoneNumber {
  def countryCode: String
  def areaCode: String
  def number: String
  def isMobile: Boolean
  def isLandline: Boolean
  // and so on...
}

Internally in my library I write all my code against this interface without any care as to how the underlying telephone number data is structured. Client applications are free to implement telephone numbers in whatever way they wish, provided they also implement the interface that exposes the behaviour my library needs. Yay, problem solved. Or is it? There's still one problem with this object-oriented approach: somewhere in the client code there must be a class that implements the TelephoneNumber interface. This is problematic in a number of ways:

  • The class hierarchy of the client domain model is still quite tightly coupled to my library
  • It might not be possible to modify an existing domain model to implement the TelephoneNumber interface (no access to source code, coupling restrictions etc.)
  • If the client domain represents telephone numbers as Strings then this class can't be extended to implement the interface (String is final)

Both of the last two points would result in needing to implement the same sort of mapping layer for copying between representations, which we already said leads to duplication and extra code to test.

How Haskell Does It

So, in a purely functional language that doesn't have the concept of interfaces, how do we solve this problem? Enter the type class. These clever little beasties provide a mechanism whereby the required behaviour is captured as a series of function definitions (with default implementations where available). For example:

class TelephoneNumber a where
    countryCode :: a -> String
    areaCode :: a -> String
    number :: a -> String
    isMobile :: a -> Bool
    isLandline :: a -> Bool

So far this looks pretty much like our interface definition above, except that each function takes an instance of type 'a' as a parameter and returns the result. However, the clever bit is that we can implement the type class for any type in Haskell without needing to modify that type in any way. For example, we could easily implement it for Strings:

instance TelephoneNumber String where
    countryCode s = ...
    areaCode s = ...

Or our client code could define its own telephone number data type and an implementation of the type class alongside it:

data UKTelephoneNumber = ...

instance TelephoneNumber UKTelephoneNumber where
    countryCode t = ...

The final change is then that we have our library function require that something implementing the type class is provided rather than some concrete data type:

sendMessage :: (TelephoneNumber t) => t -> Message -> IO ()
sendMessage to msg = do
    let cc = countryCode to  -- call the type class function, passing the instance we have

In this Haskell solution we have neatly decoupled the behaviour that our library requires of telephone numbers from the data type used to represent them. Also, any client code of our library does not need to couple its data structures/types directly to the library. They can define and modify them in any way they like. All they need to do is keep the instance of the type class up-to-date.

And Finally, A Better Scala Solution

As I have hopefully shown, the Haskell solution to the problem is both elegant and leads to cleaner, more loosely coupled code. So can we do something similar in Scala? Well, yes we can, because Scala also supports type classes. They are a bit less elegant than in Haskell, but they work just fine. First, we need to change the TelephoneNumber trait into a type class:

trait TelephoneNumber[T] {
  def countryCode(t: T): String
  def areaCode(t: T): String
  def number(t: T): String
  def isMobile(t: T): Boolean
  def isLandline(t: T): Boolean
  // and so on...
}

Note that all we have done is add a type parameter and modify all the functions to take an instance of that type; the trait is now almost identical in structure to the Haskell equivalent. Next we need an implementation of this type class. Let's define one for String:

object TelephoneNumber {

  object StringIsATelephoneNumber extends TelephoneNumber[String] {
    def countryCode(t: String) = ...
    // and so on...
  }

  implicit val stringIsATelephoneNumber: TelephoneNumber[String] = StringIsATelephoneNumber
}

I've put the implementation inside a TelephoneNumber object. If you are defining your own types, the convention is usually that the type class instances go in the companion object (e.g. for a case class UkTelephoneNumber the type class implementations would be in the UkTelephoneNumber companion object). This ensures that the type class instances are always in scope when the type is used.

Finally, let's update our sendMessage function to work with type classes:

def sendMessage[T : TelephoneNumber](toNumber: T, theMessage: Message) = {
  val telNo = implicitly[TelephoneNumber[T]]
  val cc = telNo countryCode toNumber
  // and so on...
}

Note the use of the context bound symbol ':' plus the implicitly function to declare the need for, and gain access to, the correct instance of the type class.
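Putting it together, here is a hypothetical, self-contained usage sketch showing the client side of the pattern. The trait is a minimal one-method version of the one above, and UkNumber and its "+44" country code are invented for illustration:

```scala
trait TelephoneNumber[T] {
  def countryCode(t: T): String
}

// A client's own data type: it inherits nothing from the library.
case class UkNumber(digits: String)

// The type class instance lives alongside the client type.
implicit val ukIsATelephoneNumber: TelephoneNumber[UkNumber] =
  new TelephoneNumber[UkNumber] {
    def countryCode(t: UkNumber) = "+44"
  }

// A library-style function using the context bound and implicitly.
def describe[T: TelephoneNumber](number: T): String =
  s"country code ${implicitly[TelephoneNumber[T]].countryCode(number)}"
```

The client can restructure UkNumber however it likes; only the type class instance needs to be kept in step.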

So there we have it: an elegant Scala solution that overcomes all of the coupling problems inherent in the object-oriented approach of implementing a required interface. When should you use this? My opinion is anywhere that two system components communicate, where that communication can be defined in terms of behavioural functions rather than actual data, and where you want to minimise the coupling between the components and the leaking of types from one to the other.