Sunday, August 31, 2008

Fast sets intersect detection in C#

While developing our latest version I encountered a scenario with i think general enough to be interesting for general public. I had two arrays of integers and was needed to find if they overlaps , meaning they intersect is not empty set :). Or in plain English what one array has members what also members of the other array. This operation is going to be performed a couple of millions of times every run of a program so I wanted it to be as fast as possible.

The "Naive" approach for this problem is to just run 2 loops where you take each of the elements of a first array and check it against element of a second array.
bool NaiveIntersect(List<int> setOne, List<int> setTwo)
{
for(int i1=0;i1<setOne.Count;i1++)
{
for(int i2=0;i2<setTwo.Count;i2++)
{
if (setTwo[i2]==setOne[i1])
{
return true;
}
}
}

return false;
}

Obviously this is a not optimal solution with complexity of O(n*m) .NET framework provide number of build in classes what give ability to test overlap of two arrays.

The list I came up with:

  • LINQ intersect extension method
  • Hashset collection
  • BitArray collection
LINQ don't have Overlap method but it do have Intersect method , so we need just to intersect the two sets and see if intersect is not empty set.

bool LinqOverlap(List<int> setOne, List<int> setTwo)
{
return setOne.Intersect(setTwo).Any();
}
HashSet it's a hashing data structure similar to HashTable or Dictionary collection, the main difference is what HashSet don't have key and values it has only values ,and hash is performed on them. In hash set you to find an element you just need to compute the hash of this element and get it from appropriate it's much faster that to iterate over a collection. HashSet collection provides a convenient Overlaps method to test this exact scenario so code is very short and obvious.

bool HashSetIntersect(HashSet<int> set1, HashSet<int> set2)
{
return set1.Overlaps(set2);
}
Another interesting collection is BitArray , underling data structure of BitArray is simple array of integers. The main strange of BitArray is in ability to perform bit operations across two arrays of number, so in our case overlap method will look like this. We perform bitwise AND operation and check if one of the bits in the intersect is set to true.

bool BitArrayOverlap(BitArray set1, BitArray set2)
{
var intersect=set1.And(set2);

for(int j=0;j<intersect.Length;j++)
{
if(intersect[j])
return true;
}

return false;
}

I took the code above and run it on 2 arrays of integers containing 125 elements each for 100,000 times. So lets se who is the winner

NaiveDoubleLoop - 7995
Linq- 1915
HashSet Overlap - 688
BitArray - 38

As you can see from the table the obvious winner is a BitArray. And it's true for a general case , but in some special case we can get much better performance , lets talk about one more method , it's an optimization of a BitArray method. It suited for cases where sets of integers we going to test for overlap can contain only values we will know in advance and the list of this values is small (couple of hundreds). If we have only 64 possible values we can just assign a bit mask for each of the possible integers values in our sets so a set will be represented by a single long value and testing for overlap is just a binary AND between 2 longs. If we have more that 64 possible values it's possible to extend this method by giving each value in addition to bit mask another parameter that will tell us for what group it belongs so our sets will be a short array of long and we will need to perform AND operation on each pair of values in this arrays.

Here is the code that shows the implementation

bool BinaryOverlap(long[] setOne, long[] setTwo)
{
return
((setOne[0] & setTwo[0])
(setOne[1] & setTwo[1])
(setOne[2] & setTwo[2])) != 0 ;
}
So what about performance you say ? For 192 values it's performs 100,000 overlap tests in 1 ms. This approach will not fit every scenario , we can't always control the integers that come to us but if we can it will give us blazingly fast performance.