Xaprb

Stay curious!

Finding things within some distance in SQL

with 14 comments

One of the query optimization scenarios I’ve seen a lot over the years is finding something within some distance from a point. For example, finding people within some distance of yourself, apartments in a radius from a postal code, and so on.

These queries usually use the great-circle formula. That might be because Google finds lots of pages claiming that this is the right way to do a radius search. “The earth is not flat!”, they all say. That’s true, but it doesn’t mean that the great-circle formula is a good approach. It’s usually a really bad approach, in fact. It’s needlessly precise for most things, not precise enough for others, and it’s an expensive query to execute; all the trig functions tend to eat a bunch of CPU, and make it impossible to use ordinary indexes. This is true for all of the databases I’ve used — MySQL, Postgres, and SQL Server.

The great-circle formula is needlessly precise for a few reasons:

  1. Within the radiuses I’ve usually seen, the earth is flat, or close enough that it doesn’t make a difference. Looking for an apartment within 25 miles of downtown? The error introduced by pretending that the earth is flat on such a small scale doesn’t matter. The Pythagorean theorem would work just as well.
  2. “Downtown” is not a point, it’s an area. Nobody is going to argue if you return search results that vary by a few miles, or even more.
  3. Nobody drives in a straight line from downtown to their apartment. People usually search within a physical radius as a proxy for “find me something conveniently close.” They don’t really expect the miles as-the-crow-flies to be a good proxy — it’s just one they’re used to. In reality, that apartment just across the river might be too far away from work, because you’d have to drive a long way to get to a bridge. (Unless you want to swim to the office every day, that is.)

In cases where you really do need precision, there’s a reasonable chance that the great-circle formula still isn’t right for you, because not only is the earth not flat, the earth isn’t a sphere either.

What’s the optimization I usually suggest? It’s usually perfectly acceptable to just return results within a square centered on the point of interest. In most cases, the results will be just as satisfactory to the users. The remainder are usually very special cases.

Written by Xaprb

January 2nd, 2011 at 4:59 pm

14 Responses to 'Finding things within some distance in SQL'

Subscribe to comments with RSS

  1. I agree most don’t really need the precision of the Haversine formula for simple nearby queries. However, a latitude degree remains the same distance everywhere, but longitude degree gets much smaller as you get close to the north pole. Multiplying by 1/cos(latitude) for the width of the box will account for this change and still keep things pretty simple.

    Joshua

    2 Jan 11 at 7:36 pm

  2. Readers may also be interested that PostgreSQL just added support for KNN (K nearest neighbor) searches:

    http://git.postgresql.org/gitweb?p=postgresql.git;a=commit;h=554506871bd3a73a5d9fa4ee3aa0b0fbf406f8ab

    Rather than searching within a radius, you can say “give me the 5 closest”, which is often closer to what you really want the query to do.

    Jeff Davis

    2 Jan 11 at 8:30 pm

  3. Thanks for the post. I find it a bit difficult to find user experience on GIS through Google.

    Have you used MySQL with MyISAM and index-optimized MBRContains()/etc. queries in production? Any caveats?

    Edmar

    2 Jan 11 at 10:07 pm

  4. I agree. Manhattan distance works just fine for such popular cases as “find me a restaurant in 10 miles radius”. Baron’s observation about driving not being in straight line is a good point.

    BTW, is easy enough to cut down the surrounding square into an octagon by trimming its points. This makes for a closer-to-circle shape, without too much overhead.

    @Joshua: again, this calls for approximation. You can store a dozen or so “longitude factors”, pre-calculated, and multiply by the one representing the nearest latitude line.

    Finally, just as PI is a known number (no one calculates the Taylor sequence to the 100,000 position to get PI), it’s a good idea to cache as constants other common values like sqrt(2), sqrt(3), golden ratio, etc.

    Shlomi Noach

    3 Jan 11 at 2:08 am

  5. Jeff, I agree — finding the N closest is often a good user experience. Combine that with some relevance ranking to search keywords, and users will likely be very happy.

    Edmar, I don’t have many good things to say about MySQL’s support for geospatial queries, starting with “it requires MyISAM.” Non-crash-proof tables are hard for me to justify in production.

    Xaprb

    3 Jan 11 at 11:44 am

  6. Hi Baron, MyISAM sucks, I agree. That said, I was hoping to hear about MySQL geospatial usage relying on MyISAM as a sort of “cache table only”, solely for the r-tree index. Setting up a whole new DB ecosystem just to get GIS indexing may be a solution, but I was hoping to avoid that.

    Edmar

    3 Jan 11 at 2:46 pm

  7. For a real user story, check out http://www.mysqlfanboy.com/2010/08/mysql-gis-part-1/ once the database error is solved :)

    Xaprb

    3 Jan 11 at 2:50 pm

  8. Great article Baron. I had a similar situation with a client about a year ago. They were having performance problems and some of the largest queries showing up were these radius calculations. I proposed this solution as a start, and then alternatively to precompute boxes within a city region. Their application was an iphone city and nightlife guide, so they only really needed city regions.

    All your points are so often lost on us when we’re trying to solve a technical problem, and lose site of the real business problem that is being solved!

    Regards,
    Sean

    Sean Hull

    3 Jan 11 at 3:43 pm

  9. For some of our apps, we need to return cities, stores, hotels, airports, etc., up to 50 miles away for the requested location. Our apps support locations all over the world, so variances in degrees longitude versus miles/kilometers can be very significant. Here’s what we do:

    * Use great circle formula to calculate min and max latitude and longitude
    * Search city, store, etc. table for locations within that lat-long bounding box. There is a composite index on lat and long.
    * Calculate distance to each location. We need to do this anyway to tell the user approximately how far away each location actually is.
    * Sort nearest to farthest and discard all entries farther than the limit

    All the calculations are done in application code. Only the second step above is done on the database server. So, maybe we avoided the issue by moving a lot of the work to the app, but it works very well for us.

    Of course, that’s just our scenario. As suggested, you might be able to simplify this.

    Robert

    3 Jan 11 at 6:54 pm

  10. Robert, thanks for the info! If you have the time, I’ve got a couple questions about it:

    1) Which database are you using? MySQL?

    2) How about the performance of the range query on latitude, longitude? I’m assuming the composite index is a regular B-tree (not spatial).

    3) What datatype/precision do you use to store latitude and longitude?

    Thanks!

    Edmar

    3 Jan 11 at 7:29 pm

  11. @Edmar

    1) MySQL

    2) Regular B-Tree. It’s hard to give a meaningful description of the performance other than to say the query almost always takes just a few milliseconds, assuming the server isn’t heavily loaded from other queries. Since the app (these are telephone-based speech recognition applications) is presenting a list to the user, it doesn’t make sense to use a radius that will return a huge list, since people are rarely interested in hearing about more than the first 10 or so locations. The majority of the time, they don’t go past the first window of 3 locations. The main exception would be one of our hotel locators when the user is searching in a large city. Even then, the query time is a relatively small fraction of the total time to generate the list.

    3) Float

    Robert

    4 Jan 11 at 1:25 pm

  12. Robert, thanks a lot! Agree that measuring query performance is hard, your answer with context is perfect for me. Cheers!

    Edmar

    4 Jan 11 at 7:02 pm

  13. @Edmar
    I’ve developed a method using MyISAM with R-trees as “cache”: http://www.xarg.org/2009/12/people-near-you-with-mysql/

    @Baron
    I think it’s not the best advice to completely avoid MyISAM tables in production as long as there are features that can not be reproduced by any other SE. You must know what data is business critical (for example user information) and treat it like your eyeball. Anything else which can be obtained within minutes (where I would see such a GEO-cache table) could also rest in a relativeley “insecure” table type.

    Robert

    10 Jan 11 at 7:43 pm

  14. Robert, yes, I agree. I do not completely avoid MyISAM in production. It’s just a nightmare to repair a big table after a crash, for example. But if scenarios like that aren’t a problem, then it’s okay.

    Xaprb

    10 Jan 11 at 8:51 pm

Leave a Reply