The MIME Glob Match implementation should follow the suggestions from the Shared MIME Database specification as closely as possible (unlike other projects) to make sure the user will be presented with usable results (although the user won’t know where these results come from, since the MIME System will be hidden from the UI).
The implementation should be very fast, since the MIME type detection is done on every single file, that is touched by the file manager (and can thereby be considered a common operation).
An analysis of the globs file
in the Shared MIME Database was done to better understand the requirements
for the MIME globbing implementation.
Another important point to consider while designing/optimizing the Pattern Matching algorithm is the average filename length. For example, in case of my home folder (on my workstation at home), the average length of filenames is 12-13 characters. I may not be very representative for the average user, so you may want to repeat this simple test on your home desktop and send your results to benny@xfce.org to help us. Just run the simple shell code below (be aware that this can take some time, dependending on the number of files in your home directory).
find $HOME -exec basename '{}' ';' | \ awk '{total += length($0); count += 1} END {print total / count}'
Below is a table of results:
| Name | Average filename length |
|---|---|
| Benedikt Meurer | 12.491 |
| Biju Chacko | 12.777 |
| Brian Tarricone | 18.8394 |
| Danny Milosavljevic | 15.4941 |
| Jasper Huijsmans | 15.1931 |
| Jens Luedicke (Home folder) | 10.9706 |
| Jens Luedicke (MP3 folder) | 37.0887 |
| Mantas (Home folder) | 17.6998 |
| Mantas (MP3 folder) | 31.5892 |
| Matthew Young | 24.0075 |
| Yves-Alexis Perez (Home folder) | 16.0649 |
| Yves-Alexis Perez (MP3 folder) | 34.6277 |
The Shared MIME Database specification mentions three different types of patterns (Literal Patterns, Simple Patterns and Complex Patterns) into which the allowed patterns can be divided. The Shared MIME Database specification uses glob-style Pattern Matching; supported wildcards are:
* matches zero or more occurences of arbitrary characters? matches exactly one occurence of an arbitrary character[..] defines a character class
In general, the pattern data (as loaded from the glob database)
should be stored in a cache-friendly way (read the
article about AlphaSort for
an explanation); this is especially important for the Simple
Patterns. Thunar should therefore store all patterns in one huge
memory chunk and the items within that chunk should be properly
aligned in memory. This approach offers several advantages, where
the two most important ones are:
A Literal Pattern is a pattern that does not include any of the possible
wildcards, but only printable characters in the UTF-8 encoding, for
example README is such a pattern, or .DirIcon.
The Shared MIME Database contains only a few literal patterns (10 literal
patterns in shared-mime-info-0.15) and it is unlikely that vendors will
add many literal patterns. Therefore we should go with a simple string
comparison here. The important fact is that we need to sort the list of
literal patterns by their length to make sure we check the longest pattern
first.
The literal patterns are arranged in a list, like shown on the picture above. A literal pattern is represented by a C structure, like this:
struct LiteralPattern { LiteralPattern *next; MimeType *mime_type; // this could also point to the mime type name char pattern[1]; // use variable-length array if available };
The literal patterns should be stored in one memory chunk (for the reasons
stated above); on ia32, for each literal pattern, there’ll be two 4 byte
pointers and the pattern string (zero-terminated), where the string should
be padded with zero to the next multiple of 4. The memory chunk should have
a base alignment of 16 byte (mmap() does this and malloc() on BSD and
Solaris, dunno about glibc). For example, if you have two literal patterns
- Makefile and core - the memory layout will look like this:
This approach consumes about 168 bytes of memory for shared-mime-info-0.15
on ia32.
Other platforms may benefit from a different memory-layout. The best layout should be determined at configure time (this would probably break cross-compiling).
With such a sorted list, a possible implementation of the pattern matching function will look like this:
for (LiteralPattern *lp = literal_patterns; lp != NULL; lp = lp->next) { const gchar *f; const gchar *p; for (f = filename, p = lp->pattern; *f != '\0'; ++f, ++p) { if (*p == '\0') goto no_match; else if (*p != *f) break; } if (*f == '\0' && *p == '\0') goto match; }
Since the average filename is longer than 10 characters, but
the longest literal pattern in the Shared MIME Database (as of
version 0.15) has only 8 characters, this implementation
will stop with no_match after the first comparison in many
cases, and we won’t waste CPU cycles while comparing patterns
that cannot match.
This is the most common case of a pattern in the Shared MIME Database
(version 0.15 has about 400 Simple Patterns), and it is basicly
a pattern in the form *.ext, where ext does not contain any
of the wildcards mentioned above (unlikely with earlier attempts, a
Simple Pattern starts with *. instead of just *). Due to the
huge number of Simple Patterns we’ll first concentrate on reducing
the number of required comparisons - for both the case where a pattern
matches and the case where no pattern matches - and afterwards
concentrate on improving the comparison speed (if necessary).
Since we reduced the Simple Patterns to patterns in the form *.ext,
our implementation will first search for dots in the input file name.
If the input file name does not contain any dots, none of the Simple
Patterns can match and we’re done with the Simple Patterns. Else, if
the file name contains one or more dots, we’ll perform a tree lookup
for every substring that follows a dot, until we find a match or reach
the end of the file name. The following simple pseudo-code illustrates this
procedure:
for (f = filename; *f != '\0';) { if (*f++ != '.') continue; /* perform the tree lookup on the * substring f ... */ }
Now, let’s see to this obscure tree lookup. Most probably the best way to match an extension-based file name pattern to a file name is to use a n-ary tree and perform a lookup character by character.
The following figure illustrates the search tree built from the
7 Simple Patterns *.C, *.htm, *.html, *.xm, *.xml,
*.xsl and *.xslt.
: Put everything online concerning this tree idea.
While reducing the number of comparisons, it’s important to keep an eye on data locality; else, if we stall the CPU 50 times waiting for main memory, we’ll surely experience a huge performance problem with large directories in the end.
A simple worst case scenario here would be a MP3 directory with all
extensions in upper case (*.MP3) and say about 3000 such mp3
files. Since the extension is upper-cased, we’d need the following
steps:
(plus the magic rules > 80 checks first). Performing these actions on 3000 files will take time; it’s therefore important to optimize the Glob Pattern Matching as much as possible.
More research required!
This is the least common pattern (only 5 complex patterns in the
shared-mime-info-0.15 package). It needs to be checked using
fnmatch() (the glob-style pattern matching in GLib does not
support […] character ranges). Due to the nature of the
patterns there’s not much that can be done here to optimize the
pattern matching, and since its the least common case, it may
not be worth to spent much effort on this.
For filenames that would match only in steps 3-6 (e.g.
IMAGE.GIF, which will normally match the simple pattern
*.gif in step 4), all the complex patterns will need to
be tested first, so it may actually be a good idea to waste
some more thoughts about this.
The required steps for the Glob Match detection are as follows
The ideas presented below are mostly obsoleted by the additional analysis performed on the Shared MIME Database.
The Literal patterns, like Makefile or
README can easily be checked using string comparisons. To further
speed up the Literal patterns-steps, the patterns should be
sorted by length, where the longest patterns should appear first, and
in addition to the string value, the lookup database should also store
the length of the string.
When comparing the filename to a literal pattern entry, the code should first check if the lengths match, then if both are of the same length, it should perform a string comparison. If the code detects that the filename length is less than the length of the current literal pattern, then it should stop the lookup and continue with the next step. This optimization is based on the proposition that the literal patterns are sorted by their string length (longest first), and so no other literal can match the given filename, because all other literals are shorter than the filename.
The following pseudo code demonstrate this step:
for (all literal patterns) { if (length (literal) < length (filename)) break; else if (length (literal) == length (filename)) { if (literal == filename) return mime type of literal; } }
As mentioned earlier the storage should be an array of Pattern
structures. The PatternLiteral structure should look like
this:
struct PatternLiteral { union { gchar *ptr; gchar buf[8]; } value; gsize length; const gchar *type; };
For the ia32 architecture (and most other 32bit architectures), this will most probably result in a 16 byte struct (if not, your compiler is on crack!), which matches the size of a cacheline of mostly every (all?) ia32 based CPU. In addition, care must taken that the allocated array of structs is also aligned on a 16 byte boundary (IIRC, this is default on 32bit BSDs already; Glibc?). To further speed up things, the implementation can use prefetching for CPUs that support it. The size of buf should probably be detected at configure time somehow, so other architectures can benefit from this optimization as well. For the ia32 case, we can optimize this further using SSE/SSE2 (if available) to load and compare the stuff very fast.
As an explanation for the struct: length includes the length
of the literal in bytes, and type is a pointer to the MIME-type
which is located in the hashtable somewhere in the ThunarMimeDatabase.
The interesting optimization here is the storage of the literal
string itself; if the literal string is no more than 8 bytes in
size, it is embedded into the PatternLiteral memory, and
the string is stored outside only if the literal is 9 or more byte
in size. A quick grep on the existing glob files show
that all currently existing literal patterns are less than
9 byte in size and our optimization works; excellent data
locality!
If some literal patterns are 9 or more byte, they should be collected and stored together in a single string chunk to avoid polluting the cache with random memory accesses.
Since the literal pattern and the input filename size is known
and compared first, the string comparison could be done using
a plain memcmp() instead of using a real string comparison. But
this will only work if both strings are normalized first (remember,
this is UTF-8!).
The above optimizations should do the job, especially, since - as mentioned already - the literal pattern is not the common case.
Open issues:
A possible implementation of the MIME-type detection Glob match step.
This implementation is not well tested, and it’s basicly just a testcase.
It includes the integer comparison-optimization for the common case
where the pattern is something like *.EXT and EXT has 1 to 3
characters. For the final implementation, this should be optimized
even further.
struct _FilerMimeGlob { PatternLiteral *literals; PatternString *simple_long; PatternFast4 *simple_short; PatternGlob *full; }; /* for literal patterns */ struct _PatternLiteral { gchar *value; gsize length; gchar *type; PatternLiteral *next; }; /* for '*.xxx' patterns, where strlen(xxx) > 3 */ struct _PatternString { gchar *value; gsize length; gchar *type; PatternString *next; }; /* for '*.xxx' patterns, where strlen(xxx) <= 3 */ struct _PatternFast4 { guint32 value_cs; guint32 value_ci; guint32 mask; gchar *type; PatternFast4 *next; }; /* for complex glob patterns */ struct _PatternGlob { GPatternSpec *pspec; gchar *type; PatternGlob *next; }; static const gchar* pattern_literal_match (PatternLiteral *pattern, const gchar *name, const gchar *name_reversed, gsize name_length, gboolean case_sensitive) { for (; pattern != NULL; pattern = pattern->next) { if (pattern->length != name_length) continue; if ((case_sensitive && strcmp (name, pattern->value) == 0) || (!case_sensitive && strcasecmp (name, pattern->value) == 0)) return pattern->type; } return NULL; } #define _pattern_append(Type, head, pnew) \ ({ \ Type *__lp; \ if (G_UNLIKELY (head == NULL)) \ { \ head = pnew; \ } \ else \ { \ for (__lp = head; __lp->next != NULL; __lp = __lp->next); \ __lp->next = pnew; \ } \ (head); \ }) static PatternLiteral* pattern_literal_append (PatternLiteral *head, const gchar *type, const gchar *pattern) { PatternLiteral *pnew; pnew = g_new (PatternLiteral, 1); pnew->value = g_strdup (pattern); pnew->type = g_strdup (type); pnew->next = NULL; if (G_UNLIKELY (head == NULL)) { head = pnew; } else { for (lp = head; lp->next != NULL; lp = lp->next); lp->next = pnew; } return head; } static const gchar* pattern_string_match (PatternString *pattern, const gchar *name, const gchar *name_reversed, gsize name_length, gboolean case_sensitive) { for (; pattern != NULL; pattern = pattern->next) { if (pattern->length > name_length) continue; if ((case_sensitive && strcmp (name, pattern->value, pattern->length) == 0) || (!case_sensitive && strncasecmp (name, pattern->value, pattern->length) == 0)) return pattern->type; } return NULL; } static const gchar* pattern_fast4_match (PatternFast4 *pattern, const gchar *name, const gchar *name_reversed, gsize name_length, gboolean case_sensitive) { guint32 value; gchar x[4]; if (G_UNLIKELY (pattern == NULL)) return NULL; if (G_LIKELY (name_length > 3)) { x[0] = name_reversed[0]; x[1] = name_reversed[1]; x[2] = name_reversed[2]; x[3] = name_reversed[3]; } else if (name_length == 3) { x[0] = name_reversed[0]; x[1] = name_reversed[1]; x[2] = name_reversed[2]; x[3] = '\0'; } else if (name_length == 2) { x[0] = name_reversed[0]; x[1] = name_reversed[1]; x[2] = '\0'; x[3] = '\0'; } else if (name_length == 1) { x[0] = name_reversed[0]; x[1] = '\0'; x[2] = '\0'; x[3] = '\0'; } else { return NULL; } if (!case_sensitive) { if (x[0] >= 'a' && x[0] <= 'z') x[0] += ('A' - 'a'); if (x[1] >= 'a' && x[1] <= 'z') x[1] += ('A' - 'a'); if (x[2] >= 'a' && x[2] <= 'z') x[2] += ('A' - 'a'); if (x[3] >= 'a' && x[3] <= 'z') x[3] += ('A' - 'a'); } value = *((const guint32 *) x); if (case_sensitive) { for (; pattern != NULL; pattern = pattern->next) if (pattern->value_cs == (value & pattern->mask)) return pattern->type; } else { for (; pattern != NULL; pattern = pattern->next) if (pattern->value_ci == (value & pattern->mask)) return pattern->type; } return NULL; } static const gchar* pattern_glob_match (PatternGlob *pattern, const gchar *name, const gchar *name_reversed, gsize name_length, gboolean case_sensitive) { for (; pattern != NULL; pattern = pattern->next) if (g_pattern_match (pattern->pspec, name_length, name, name_reversed)) return pattern->type; return NULL; } static void filer_mime_glob_append (FilerMimeGlob *glob, const gchar *type, const gchar *pattern) { if (strpbrk (pattern, "*?[") == NULL) { pattern_literal_append (glob->literals, type, pattern); } else if (pattern[0] == '*' && pattern[1] == '.' && strpbrk (pattern + 2, "*?[") == NULL) { if (strlen (pattern + 2) <= 3) { pattern_fast4_append (glob->simple_short, type, pattern); } else { pattern_string_append (glob->simple_long, type, pattern); } } else { pattern_glob_append (glob->full, type, pattern); } } /** * @glob * @name : The file's name, not its path. * * Return value : **/ const gchar* filer_mime_glob_lookup (FilerMimeGlob *glob, const gchar *name) { const gchar *type; gboolean case_sensitive; gchar *rname; gsize length; gint n; length = strlen (filename); rname = g_utf8_strreverse (name, -1); for (case_sensitive = TRUE, n = 0; n < 2; case_sensitive = !case_sensitive, ++n) { /* literals */ type = pattern_literal_match (glob->literals, name, rname, length, case_sensitive); if (type != NULL) goto done; /* long simple patterns */ type = pattern_string_match (glob->simple_long, name, rname, length, case_sensitive); if (type != NULL) goto done; /* short simple patterns */ type = pattern_fast4_match (glob->simple_short, name, rname, length, case_sensitive); if (type != NULL) goto done; /* complex glob patterns */ type = pattern_glob_match (glob->full, name, rname, length, case_sensitive); if (type != NULL) goto done; } done: g_free (rname); return type; } /** **/ void filer_mime_glob_read_from_file (FilerMimeGlob *glob, const gchar *path) { gchar line[2048]; gchar *pattern; gchar *p; FILE *fp; fp = fopen (path, "r"); if (G_UNLIKELY (fp == NULL)) return; while (fgets (line, 2048, fp) != NULL) { if (G_UNLIKELY (line[0] == '#')) continue; for (p = line; *p != ':' && *p != '\0'; ++p); if (G_UNLIKELY (*p == '\0')) continue; for (*p++ = '\0', pattern = p; *p != '\0' && *p != '\n'; ++p); if (G_UNLIKELY (pattern == p)) continue; *p = '\0'; filer_mime_glob_append (glob, line, pattern); } fclose (fp); }