Monday 13 February 2012

Sometimes you just have to be lazy...

...or, solving problems with lazy evaluation.

I’ve recently been working on a simple template system, which also has the potential to grow into a full-blown document generation system. The basic principle is that a document is composed of a number of fragments. Each fragment is called in turn to generate its piece of the document and these pieces are concatenated together to produce the final result.

My first implementation was a super simple one. We just fold over the list of fragments, accumulating the output from each into a string builder. Each fragment may alternatively generate an error, which we collect along the way as well. At the end of the fold we return either the string document or a list of errors.

  def render(fragments: List[Fragment]): Either[List[String], String] = {
    fragments.foldLeft(RenderResult()) { (acc, fragment) =>
      fragment.generate fold (
        { err => acc.copy(errors = err :: acc.errors) },
        { text => acc.doc.append(text); acc }     
      )
    }.toEither
  }

  case class RenderResult(doc: StringBuilder = new StringBuilder(), errors: List[String] = List()) {
    def toEither: Either[List[String], String] = ...
  }
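
For anyone who wants to run this first version end to end, here is one way the elided pieces might be filled in. The Fragment trait, the Literal and Broken fragments and the body of toEither are assumptions made for illustration, not the real code.

```scala
object EagerRender {
  // Hypothetical fragment interface: each fragment yields an error or its text.
  trait Fragment { def generate: Either[String, String] }

  // Hypothetical fragments used only for this example.
  case class Literal(text: String) extends Fragment {
    def generate: Either[String, String] = Right(text)
  }
  case class Broken(reason: String) extends Fragment {
    def generate: Either[String, String] = Left(reason)
  }

  case class RenderResult(doc: StringBuilder = new StringBuilder(), errors: List[String] = List()) {
    // Assumed behaviour: any error wins, and errors are reported in fragment order.
    def toEither: Either[List[String], String] =
      if (errors.isEmpty) Right(doc.toString) else Left(errors.reverse)
  }

  def render(fragments: List[Fragment]): Either[List[String], String] =
    fragments.foldLeft(RenderResult()) { (acc, fragment) =>
      fragment.generate.fold(
        err => acc.copy(errors = err :: acc.errors),
        text => { acc.doc.append(text); acc }
      )
    }.toEither
}
```

Note that the whole document lives in the StringBuilder until toEither is called, which is exactly the property that becomes a problem for large documents.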

Now, this works great for documents of only a few kilobytes in size. However, as the documents grow to multi-megabyte sizes this approach becomes infeasible. The amount of memory required to hold the document being generated becomes huge. Multiple concurrent generations become impossible. We need a better solution.

The typical suggestion at this point is to stream output from each fragment straight into an output writer of some form rather than collect and build a string.

  def render(fragments: List[Fragment], writer: Writer): Option[List[String]] = {
    val errors = fragments.foldLeft(List[String]()) { (acc, fragment) =>
      fragment.generate fold (
        { err => err :: acc },
        { text => writer.write(text); acc }
      )
    }
    if (errors.isEmpty) None else Some(errors)
  }

Unfortunately this doesn’t work here because of the possibility that fragments may generate errors. We don’t want to stream out partially complete documents. We could output to a temporary file and then copy this to the ultimate output on success, but this seems like a less than ideal solution.

A common approach in an imperative style would be to first run a check phase to see if there are any errors and then run a second render phase to produce the output.

  def check(fragments: List[Fragment]): List[String] = 
    fragments.foldLeft(List[String]()) { (acc, fragment) =>
      fragment.checkForErrors map (_ :: acc) getOrElse acc
    }

  def render(fragments: List[Fragment], writer: Writer): Unit = 
    fragments foreach { fragment => writer.write(fragment.render) }

However, in this case the building of fragments might be quite costly and we don’t want to have to process them once to see if there is an error and then again to render.

As we are using Scala and programming in a functional style, though, we have an alternative approach. Instead of returning a string or writing directly to a writer, we can return a function whose evaluation is deferred until we choose to call it. This function encapsulates all the side-effecting logic of doing the actual write, which also makes the rendering code easier to reason about.

  def render(fragments: List[Fragment]): Either[List[String], (Writer) => Unit] = {
    fragments.foldLeft(RenderResult()) { (acc, fragment) =>
      fragment.generate fold (
        { err => acc.copy(errors = err :: acc.errors) },
        // Prepending reverses the order, so toEither must restore it before writing.
        { func => acc.copy(fragmentFuncs = func :: acc.fragmentFuncs) }
      )
    }.toEither
  }

  case class RenderResult(fragmentFuncs: List[() => String] = List(), errors: List[String] = List()) {
    def toEither: Either[List[String], (Writer) => Unit] = {
      ...
      (writer: Writer) => fragmentFuncs foreach { f => writer.write(f()) }
      ...
    }
  }

The way this works is that the render method folds over each fragment, asking the fragment to do enough work to be certain that no errors will occur (e.g. validate data, take a temporary copy of any volatile values and so on). However, it doesn’t need to do the actual string generation. Instead, each fragment returns a function that will be called at some time in the future (or an error). The render code then accumulates all these small functions into one larger function and returns this.
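
To make the fragment's side of this contract concrete, here is a minimal sketch. The Fragment trait, the PriceFragment class, its mutable price field and the buildCount counter are all hypothetical. The point is only that validation and the snapshot of the volatile value happen when generate is called, while the string itself is built later.

```scala
object LazyFragment {
  // Hypothetical interface: an error now, or a function that builds the text later.
  trait Fragment { def generate: Either[String, () => String] }

  // Hypothetical fragment backed by a value that may change after generate is called.
  class PriceFragment(label: String) extends Fragment {
    @volatile var price: Int = 0
    // Counts how often the (potentially expensive) string is actually built.
    var buildCount: Int = 0

    def generate: Either[String, () => String] = {
      val snapshot = price // copy the volatile value eagerly
      if (snapshot < 0) Left(s"$label: negative price $snapshot") // validate eagerly
      else Right { () =>
        buildCount += 1
        s"$label: $snapshot\n" // build the string only on demand
      }
    }
  }
}
```

Because the returned function closes over the snapshot rather than the field, later changes to price do not affect what is eventually written.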

The returned function, when called, takes a writer and iterates over all the small string generating functions, writing the output of each straight into the writer. Job done and in a mere fraction of the memory required by the original solution.
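
Putting the pieces together, a self-contained version might look like the sketch below. The Fragment trait, the Text and Failing fragments, the renderToString helper and the body of toEither are assumptions for illustration; the real code is more involved.

```scala
import java.io.{StringWriter, Writer}

object LazyRender {
  // Hypothetical interface: an error now, or a deferred string builder.
  trait Fragment { def generate: Either[String, () => String] }

  case class Text(value: String) extends Fragment {
    def generate: Either[String, () => String] = Right(() => value)
  }
  case class Failing(reason: String) extends Fragment {
    def generate: Either[String, () => String] = Left(reason)
  }

  case class RenderResult(fragmentFuncs: List[() => String] = List(), errors: List[String] = List()) {
    // Both lists were built by prepending, so reverse them to restore fragment order.
    def toEither: Either[List[String], Writer => Unit] =
      if (errors.nonEmpty) Left(errors.reverse)
      else {
        val ordered = fragmentFuncs.reverse
        Right((writer: Writer) => ordered.foreach(f => writer.write(f())))
      }
  }

  def render(fragments: List[Fragment]): Either[List[String], Writer => Unit] =
    fragments.foldLeft(RenderResult()) { (acc, fragment) =>
      fragment.generate.fold(
        err => acc.copy(errors = err :: acc.errors),
        func => acc.copy(fragmentFuncs = func :: acc.fragmentFuncs)
      )
    }.toEither

  // Usage: the writer is only touched once every fragment has succeeded.
  def renderToString(fragments: List[Fragment]): Either[List[String], String] =
    render(fragments) match {
      case Left(errors) => Left(errors)
      case Right(write) =>
        val out = new StringWriter()
        write(out)
        Right(out.toString)
    }
}
```

A StringWriter is used here only to keep the example easy to check. In practice the same function would be handed a writer backed by a file or a response stream, so only one fragment's text needs to be in memory at a time.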

In cases where you need to generate large amounts of data, or where you want to delay side-effecting execution to a well-defined location in your code, lazy evaluation is a great approach to follow.

NOTE: All of the code examples above are significantly simplified from the real code in order to demonstrate the approach.

2 comments:

  1. To my mind, this is what I would call a brilliant blog article! Do you run this site purely for personal reasons, or do you use it as a source of income?

    1. Hi. Thanks for your comment. Very glad you liked the post.

      My blog is all out of personal interest and sharing with the community. I don't make any income from it other than the fact that it raises my profile and helps me find good consultancy jobs.

      Most of the stuff I write is based on work that I'm doing, experiments or personal projects.