Thursday, October 8, 2009

Efficiently selecting random sub-collections.

Here's a handy algorithm for randomly choosing k elements from a collection of n elements (assume k < n)


public static <T> List<T> pickRandomSubset(Collection<T> source, int k, Random r) {
  List<T> toReturn = new ArrayList<T>(k);
  double remaining = source.size();
  for (T item : source) {
    double nextChance = (k - toReturn.size()) / remaining;
    if (r.nextDouble() < nextChance) {
      toReturn.add(item);
      if (toReturn.size() == k) {
        break;
      }
    }
    --remaining;
  }
  return toReturn;
}

The basic idea is to iterate through the source collection only once. For each element, we can compute the probability that it should be selected, which simply equals the number of items left to pick divided by the total number of items left.

Another nice thing about this algorithm is that it also works efficiently if the source is too large to fit in memory, provided you know (or can count) how many elements are in the source.

This isn't exactly anything groundbreaking, but it's far better than my first inclination to use library functions to randomly sort my list before taking a leading sublist.

No comments: