In this post I set out a challenge to come up with the best solution to the following problem: how to find the most frequently multiple occurring item in a list.
I had created my own solution, but felt that a better one was possible. The criteria specified were simple code, elegance and performance.
I have had two submissions to the problem so far and this post offers a comparison of them along with my original solution. My comparisons were carried out using the following two test lists:
val random = new java.util.Random(2L) val list1 = (for ( i <- 1 to 100000 ) yield random.nextInt).toList.distinct val list2 = (for ( i <- 1 to 100000 ) yield random.nextInt(1000)).toList
The first list contains no duplicate elements, so the result should always be None. The second list will contain multiple duplicate elements.
My Solution
My solution to the problem was based around building a map of list items to frequency count and then finding the most frequent multiply occurring item in that map. My implementation was achieved with two small functions and two foldLeft operations:
def highestMultipleFrequency1[T](items: List[T]): Option[T] = { type Frequencies = Map[T, Int] type Frequency = Pair[T, Int] def freq(acc: Frequencies, item: T) = acc.contains(item) match { case true => acc + Pair(item, acc(item) + 1) case _ => acc + Pair(item, 1) } def mostFrequent(minOccurs: Int)(acc: Option[Frequency], item: Frequency) = acc match { case None if item._2 >= minOccurs => Some(item) case Some((value, count)) if item._2 > count => Some(item) case _ => acc } items.foldLeft(Map[T, Int]())(freq).foldLeft[Option[Frequency]](None)(mostFrequent(2)) match { case Some((value, count)) => Some(value) case _ => None } }
In the tests, my solution performs fairly consistently across both test lists, the second being slightly faster due to the smaller size of the map being used by the second fold function.
Solution from Nilanjan R
This solution send by Nilanjan R is significantly simpler and shorter in the amount of code it uses. It takes the approach of grouping items by themselves and then sorting the resulting lists by their size and taking the first item:
def highestMultipleFrequency2[T](items: List[T]): Option[T] = { items.groupBy(x => x).values.toList.sortWith((f, s) => f.size > s.size) match { case x :: xs if x.size >= 2 => Some(x.head) case _ => None } }
Under test, this solution performs significantly slower than my solution on the list without duplicates. However, when the list does contain significant multiple occurring elements then it performs slightly better than my solution. To me this tends to indicate that the sort function is the major performance factor in this solution. I do really like this due to its very clean, simple and easy to understand code. It's certainly my choice for the Noughts and Crosses application where simple code is more important than raw performance and where I am only dealing with very small lists.
Solution from Antonin Brettsnajdr
The solution offered by Antonin follows a similar approach to my solution in that it builds a map of item frequencies and then find the most frequently occurring element using this map. However, this solution implements the whole algorithm in a single tail-recursive function:
def highestMultipleFrequency3[T](items: List[T]): Option[T] = { def mff(l: List[T], m: Map[T, Int]): Option[T] = l match { case element :: tail => if (m contains element) mff(tail, m.updated(element, m(element) + 1)) else mff(tail, m + (element -> 1)) case Nil => val occurences = (m map { item => item._2 }) if (occurences forall (_ == 1)) None else { val maxOccur = occurences.max m find ( letter => letter._2 == maxOccur) match { case Some(x) => Option(x._1) case None => None } } } mff(items, Map()) }
This solution is by far the best performing. It is two to three times faster than my solution. There is a significant performance benefit when dealing with a list that contains multiple duplicated items. While the code is not as simple as the previous solution, it is very readable and is a nice example of tail recursion.
A big thanks to Nilanjan R and Antonin Brettsnajdr for their solutions. I found it very interesting looking at the different way people approach a problem such as this. Can you do better? Let me know your solution.
No comments:
Post a Comment