comparing two similar strings

I’ve advertised GetReuse once again in response to the following question:

How do you compare 2 strings, and determine how much they are “close” to each other?

Here is my answer:

If your strings are between 1 and 16 Kb, look at GetReuse SDK:

It has Perl and Python bindings. You might be interested in the latter as you posted to comp.lang.python.

I can’t disclosure the algorithm. Here is an excerpt from the home page: The formula for the calculation of the similarity is based on the scientific research. Any other “good” method of calculations should produce results that are equivalent in some terms to the GetReuse results.

I also commented the following message:

Here’s a really weird idea: Measure the size difference between the pair of strings compressed together and compressed separately.

My answer:

The idea isn’t weird. The only problem is that naive approach failed. compress(a,b) != compress(b,a). Having an assumption that compressing is a good approximation for the Kolmogorov complexity, the correct formula is a bit more complicated.

Leave a Reply