Given an unsorted sequence, remove_duplicates would sort it using the
pointer value of attributes/nodes and then remove consecutive
duplicates.
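A minimal sketch of that old behavior, assuming a simplified two-pointer handle in place of pugixml's internal xpath_node (names here are illustrative, not the library's):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for xpath_node: two pointers wide.
struct node_ref { const void* attr; const void* node; };

static std::uintptr_t key(const void* p) { return reinterpret_cast<std::uintptr_t>(p); }

void remove_duplicates_sorted(std::vector<node_ref>& seq)
{
    // Sort by raw pointer value; the resulting order depends on where the
    // allocator happened to place each object in memory.
    std::sort(seq.begin(), seq.end(), [](const node_ref& l, const node_ref& r) {
        return key(l.attr) != key(r.attr) ? key(l.attr) < key(r.attr)
                                          : key(l.node) < key(r.node);
    });

    // Duplicates are now adjacent, so one pass removes them.
    seq.erase(std::unique(seq.begin(), seq.end(), [](const node_ref& l, const node_ref& r) {
        return l.attr == r.attr && l.node == r.node;
    }), seq.end());
}
```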
This was problematic because it meant that the order of XPath query results
depended on the memory allocation pattern. While it is technically
incorrect to rely on that order, the non-determinism leads to easy-to-miss
bugs.
This is particularly common when XPath queries use union operators,
although we call remove_duplicates in other cases as well.
This change reworks the code to use a hash set instead, using the same
hash function we use for compact storage. To make sure it performs well,
we allocate enough buckets for count * 1.5 (assuming all elements are
unique); since each bucket is a single pointer, unlike xpath_node which
is two pointers wide, we need somewhere between size * 0.75 and
size * 1.5 of temporary storage.
The resulting filtering is stable - we remove elements we have seen
before without changing the order of the remaining elements - and it is
actually significantly faster than sorting was.
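A minimal sketch of the hash-based stable filter, under the same assumptions as the sketch above (illustrative names; the real implementation reuses pugixml's compact-storage hash function and its own allocator, so hash_ptr below is only a placeholder):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct node_ref { const void* attr; const void* node; };

// Placeholder hash; the real code reuses the compact-storage hash function.
static std::size_t hash_ptr(const void* p)
{
    std::uintptr_t h = reinterpret_cast<std::uintptr_t>(p);
    h ^= h >> 16;
    h *= 0x85ebca6bu;
    h ^= h >> 13;
    return static_cast<std::size_t>(h);
}

void remove_duplicates_stable(std::vector<node_ref>& seq)
{
    // Power-of-two bucket count of at least count * 1.5, so everything
    // fits even if all elements are unique.
    std::size_t buckets = 1;
    while (buckets < seq.size() + seq.size() / 2) buckets *= 2;

    // One pointer per bucket: the attribute pointer (or, failing that,
    // the node pointer) uniquely identifies an element, so the table does
    // not need to store the full two-pointer node_ref.
    std::vector<const void*> table(buckets, nullptr);

    std::size_t write = 0;
    for (std::size_t i = 0; i < seq.size(); ++i)
    {
        const void* key = seq[i].attr ? seq[i].attr : seq[i].node;

        // Open addressing with linear probing.
        std::size_t slot = hash_ptr(key) & (buckets - 1);
        bool seen = false;
        while (table[slot])
        {
            if (table[slot] == key) { seen = true; break; }
            slot = (slot + 1) & (buckets - 1);
        }

        if (!seen)
        {
            table[slot] = key;
            seq[write++] = seq[i]; // keep the first occurrence, in order
        }
    }

    seq.resize(write);
}
```

With the bucket count rounded up from count * 1.5 to the next power of two, the table occupies between 0.75x and 1.5x of the sequence's own storage, matching the numbers above.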
With a large union operation, before this change it took ~56 ms per 100
query invocations to remove duplicates, and after this change it takes
~20 ms.
Fixes #254.
gcov -b surfaced many lines with partial coverage, where a branch is only
ever taken or never taken, or one of the expressions in a complex
conditional always evaluates to the same value. This change adds a series
of tests (mostly focusing on XPath) to reduce the number of partially
covered lines.
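As a hypothetical illustration (not code from the library), gcov -b flags a line like the following as partially covered when the existing tests never reach it with a non-null empty string, so the second operand of the || only ever evaluates one way:

```cpp
// Partially covered if tests only ever pass nullptr or a non-empty name:
// the `name[0] == 0` branch is then never taken.
bool is_unnamed(const char* name)
{
    return !name || name[0] == 0;
}
```

Adding a test that exercises the missing outcome turns the partial branch into a fully covered one.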
Some compilers don't handle NaNs properly.
Some compilers don't implement fmod in an IEEE-compatible way.
Some compilers have exception handling codegen bugs (DMC...).
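For the first two points, a small check like the following (illustrative, not part of the test suite) shows the IEEE-style results the XPath tests expect; on affected compilers these can come out differently:

```cpp
#include <cmath>
#include <cstdio>
#include <limits>

int main()
{
    double nan = std::numeric_limits<double>::quiet_NaN();

    // NaN compares unequal to everything, including itself.
    std::printf("%d\n", nan == nan ? 1 : 0); // expected: 0

    // fmod keeps the sign of the dividend, per IEEE semantics.
    std::printf("%g\n", std::fmod(5.0, -3.0));  // expected: 2
    std::printf("%g\n", std::fmod(-5.0, 3.0));  // expected: -2
}
```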