Archive for September, 2009
Observations on key-value databases
Key-value databases are catching fire these days. Memcached, Redis, Cassandra, Keyspace, Tokyo Tyrant, and a handful of others are surging in popularity, judging by the contents of my feed reader.
I find a number of things interesting about these tools.
- There are many more of them than open-source traditional relational databases. (edit: I mean that there are many options that all seem similar to each other, instead of 3 or 4 standing out as the giants.)
- It seems that a lot of people are simultaneously inventing solutions to their problems in private without being aware of each other, then open-sourcing the results. That points to a sudden sea change in architectures. Tipping points tend to be abrupt, which would explain isolated redundant development.
- Many of the products are feature-rich with things programmers need: diverse language bindings, APIs, embeddability, and the ability to speak familiar protocols such as memcached protocol.
- I think there are more solutions here than the ecosystem will support, and in five years a few will stand out as the most popular.
- This process of paring down the gene pool is win-win because they’re open-source, and nothing will be lost.
- Choosing which one to use is no easy task even for a highly skilled, technical, up-to-date person. Perhaps the decision-makers will choose on the availability of commercial support and consulting.
- Many of them offer built-in, dead-simple, distributed, synchronous replication. This is very difficult to achieve with traditional relational databases. What makes key-value databases different? They don’t have MVCC, for one thing; but I’m not sure of the complete answer to that question, to tell the truth.
We live in interesting times.
A fun use of SUBSTRING_INDEX and friends in MySQL
I used to develop with MySQL, and those were the golden days. These days I don’t write queries so much. But yesterday I needed to answer this question: are there any issues in our issue-tracking system that meet the following criteria?
- The last two or more emails are from the customer
- These emails were separated by at least two hours (i.e. it wasn’t a single train of thought)
I could do it with all kinds of correlated subqueries and so on — but maybe I could also just do it without them, no? Can this be done with plain old JOINS and GROUP BY? I’m sure you know the answer.
Here’s my approach: group emails by issue, and concatenate the dates they were sent in reverse order. If an email was sent from Percona to the customer, replace the date with a magical OUTBOUND constant. The result might look like this: “2009-09-11 13:17:34,OUTBOUND,…”. I’ll change this to create a good sample string, which I’ll use from now on:
set @email_dates := '2009-09-11 13:17:34,2009-09-11 12:26:17,2009-09-11 12:10:34,OUTBOUND,2009-09-11 12:02:47';
Now I’m sure you can see the task ahead: a bunch of position calculations, nasty substrings, and so on. But MySQL has some really handy functions that make it easier to work with these kinds of string manipulations, and that’s what I want to write about.
The first question to answer is how long the string of consecutive inbound emails was. We can use the FIND_IN_SET function for this:
SELECT FIND_IN_SET('OUTBOUND', @email_dates) -1 AS inbound;
+---------+
| inbound |
+---------+
| 3 |
+---------+
So the last 3 emails were inbound. But what if there were no emails, or if they were all inbound? The function will return -1 then. I could do some special-case logic to count the number of emails by counting the number of commas in the string, but that’s actually the hard way. I’m already using GROUP BY to concatenate the dates into a string, so it’s easier to just use COUNT(*). I’m working on a small part of the problem with a user variable, but if I put it back into the GROUP BY, the logic would be something like this:
LEAST(COUNT(*), GREATEST(0, FIND_IN_SET(.....)))
I’ll handle that later, and keep working with the contents of the variable for now. Next I’ll extract the first and last dates from that range. The first date is easy:
SELECT SUBSTRING_INDEX(@email_dates, ',', 1) AS most_recent_email;
+---------------------+
| most_recent_email |
+---------------------+
| 2009-09-11 13:17:34 |
+---------------------+
Now to get the Nth, which usually involves a bunch of SUBSTRING, REVERSE, etc. But SUBSTRING_INDEX makes it easier: I’ll get the substring up to the Nth index, and then get the last substring from that. I’ll keep the code simple here by substituting the FIND_IN_SET expression with @nth below:
SELECT SUBSTRING_INDEX(
SUBSTRING_INDEX(@email_dates, ',', @nth), ',', -1) AS oldest_email;
+---------------------+
| oldest_email |
+---------------------+
| 2009-09-11 12:10:34 |
+---------------------+
Now it’s a simple matter of using date and time math functions to compute the elapsed time between the first and last email in the consecutive incoming range. The full query looks like this (slightly altered, and using a subquery in the FROM clause, for this post):
SELECT
email_count,
CONCAT('Last ',
LEAST(email_count, GREATEST(0, FIND_IN_SET('OUTBOUND', email_dates) -1)),
' emails are inbound, with a ',
TIMESTAMPDIFF(HOUR,
SUBSTRING_INDEX(SUBSTRING_INDEX(email_dates, ',',
LEAST(email_count,
GREATEST(0, FIND_IN_SET('OUTBOUND', email_dates) -1))), ',', - 1),
SUBSTRING_INDEX(email_dates, ',', 1)),
' hour spread'
) AS emails_spread
FROM (
SELECT COUNT(*) AS email_count,
GROUP_CONCAT(IF(email_from LIKE '%@percona.com%', 'OUTBOUND', email_date)
ORDER BY email_date DESC) AS email_dates
FROM issue_emails
GROUP BY issue_id
) AS e
+-------------+---------------------------------------------------+
| email_count | emails_spread |
+-------------+---------------------------------------------------+
| 179 | Last 2 emails are inbound, with a 4 hour spread |
The golden days of SQL coding live again!
Stopping spam on Google Groups
Spammers seem to have gotten smarter recently, and able to post garbage to Google Groups. There was some spam on a couple of the groups I’ve created, and members helped me find a way to stop it. I’m glad they did, because since then it’s stopped an increasing flow of spam. I just deleted and banned a whole bunch of emails titled “answer group” across different groups, and I see the same crud got through to a group I’m on but don’t own.
Here’s the recipe:
- Moderate messages from new members.
That’s all. Then you can log in and delete emails from spammers, and ban them, so that at least only you, and not the whole group, gets spammed.





