The sorry state of SQLite full text search on Android
Just spent a few hours prototyping full text search using SQLite’s FTS functionality.
As far as I’m concerned, it’s a dead end for my app, for three reasons:
1. The lack of international-character tokenizer.
The first step when building a full text search index is to break down the textual content into words, aka tokens. Those tokens are then entered into a special index which lets SQLite perform very fast searches based on a token (or a set of tokens).
SQLite has two built-in tokenizers, and they both only consider tokens consisting of US ASCII characters. All other, non-US ASCII characters are considered whitespace.
There is an “icu” (Unicode) tokenizer, but it has to be enabled at compile time. A standard Android build does not have the “icu” tokenizer enabled.
I actually ran some tests, just to make sure, and the results are:
– Samsung Galaxy Ace, S5830, Android 2.3.6
– Sony Ericsson Arc, Android 2.3.4
– Samsung Galaxy Nexus, Android 4.0.2
The above devices do not have the “icu” tokenizer.
– HTC Incredible S, Android 2.3.5
The Incredible S has the “icu” tokenizer.
I hardly think HTC wanted to do a favor to third-party app developers – the “icu” tokenizer is probably used by their Sense UI system.
2. Crashes in the “icu” tokenizer.
This is moot, but I’ve stumbled upon this post on stack overflow. A crash in native code when using the “icu” tokenizer is not something I’m willing to risk.
3. The large size of FTS indexes.
In my tests, ab FTS index table takes about as much space as the original, structured, non-FTS table it was built for.
This is probably to be expected, because it’s possible to query an FTS table for the original text that was used to build the index.
This may be a useful feature (although one could always do a join between FTS and structured data tables), but I’d rather do without it.
Android users are very sensitive to the amount of internal storage used by applications, and putting a SQLite database in external storage on a memory card is something I’m very hesitant to do.
Bottom line – it’s going to be the good old “LIKE ‘%blah%'” queries for me, which have more or less acceptable performance even considering the full table scan they perform, and do not use any extra storage space.