Implementation - MIME Glob Match

The MIME Glob Match implementation should follow the suggestions from the Shared MIME Database specification as closely as possible (unlike other projects) to make sure the user will be presented with usable results (although the user won’t know where these results come from, since the MIME System will be hidden from the UI).

Ideas for the new Implementation

The implementation should be very fast, since the MIME type detection is done on every single file, that is touched by the file manager (and can thereby be considered a common operation).

An analysis of the globs file in the Shared MIME Database was done to better understand the requirements for the MIME globbing implementation.

Another important point to consider while designing/optimizing the Pattern Matching algorithm is the average filename length. For example, in case of my home folder (on my workstation at home), the average length of filenames is 12-13 characters. I may not be very representative for the average user, so you may want to repeat this simple test on your home desktop and send your results to benny@xfce.org to help us. Just run the simple shell code below (be aware that this can take some time, dependending on the number of files in your home directory).

find $HOME -exec basename '{}' ';' | \
 awk '{total += length($0); count += 1} END {print total / count}'

Below is a table of results:

Name Average filename length
Benedikt Meurer 12.491
Biju Chacko 12.777
Brian Tarricone 18.8394
Danny Milosavljevic 15.4941
Jasper Huijsmans 15.1931
Jens Luedicke (Home folder) 10.9706
Jens Luedicke (MP3 folder) 37.0887
Mantas (Home folder) 17.6998
Mantas (MP3 folder) 31.5892
Matthew Young 24.0075
Yves-Alexis Perez (Home folder) 16.0649
Yves-Alexis Perez (MP3 folder) 34.6277

Types of Patterns

The Shared MIME Database specification mentions three different types of patterns (Literal Patterns, Simple Patterns and Complex Patterns) into which the allowed patterns can be divided. The Shared MIME Database specification uses glob-style Pattern Matching; supported wildcards are:

  • * matches zero or more occurences of arbitrary characters
  • ? matches exactly one occurence of an arbitrary character
  • [..] defines a character class

In general, the pattern data (as loaded from the glob database) should be stored in a cache-friendly way (read the article about AlphaSort for an explanation); this is especially important for the Simple Patterns. Thunar should therefore store all patterns in one huge memory chunk and the items within that chunk should be properly aligned in memory. This approach offers several advantages, where the two most important ones are:

  1. Improved data locality
  2. Reduced heap fragmentation

Literal Patterns

A Literal Pattern is a pattern that does not include any of the possible wildcards, but only printable characters in the UTF-8 encoding, for example README is such a pattern, or .DirIcon.

The Shared MIME Database contains only a few literal patterns (10 literal patterns in shared-mime-info-0.15) and it is unlikely that vendors will add many literal patterns. Therefore we should go with a simple string comparison here. The important fact is that we need to sort the list of literal patterns by their length to make sure we check the longest pattern first.

The literal patterns are arranged in a list, like shown on the picture above. A literal pattern is represented by a C structure, like this:

struct LiteralPattern
{
  LiteralPattern *next;
  MimeType       *mime_type; // this could also point to the mime type name
  char            pattern[1]; // use variable-length array if available
};

The literal patterns should be stored in one memory chunk (for the reasons stated above); on ia32, for each literal pattern, there’ll be two 4 byte pointers and the pattern string (zero-terminated), where the string should be padded with zero to the next multiple of 4. The memory chunk should have a base alignment of 16 byte (mmap() does this and malloc() on BSD and Solaris, dunno about glibc). For example, if you have two literal patterns - Makefile and core - the memory layout will look like this:

This approach consumes about 168 bytes of memory for shared-mime-info-0.15 on ia32.

Other platforms may benefit from a different memory-layout. The best layout should be determined at configure time (this would probably break cross-compiling).

With such a sorted list, a possible implementation of the pattern matching function will look like this:

for (LiteralPattern *lp = literal_patterns; lp != NULL; lp = lp->next)
  {
    const gchar *f;
    const gchar *p;
 
    for (f = filename, p = lp->pattern; *f != '\0'; ++f, ++p)
      {
        if (*p == '\0')
          goto no_match;
        else if (*p != *f)
          break;
      }
 
    if (*f == '\0' && *p == '\0')
      goto match;
  }

Since the average filename is longer than 10 characters, but the longest literal pattern in the Shared MIME Database (as of version 0.15) has only 8 characters, this implementation will stop with no_match after the first comparison in many cases, and we won’t waste CPU cycles while comparing patterns that cannot match.

Simple Patterns

This is the most common case of a pattern in the Shared MIME Database (version 0.15 has about 400 Simple Patterns), and it is basicly a pattern in the form *.ext, where ext does not contain any of the wildcards mentioned above (unlikely with earlier attempts, a Simple Pattern starts with *. instead of just *). Due to the huge number of Simple Patterns we’ll first concentrate on reducing the number of required comparisons - for both the case where a pattern matches and the case where no pattern matches - and afterwards concentrate on improving the comparison speed (if necessary).

Since we reduced the Simple Patterns to patterns in the form *.ext, our implementation will first search for dots in the input file name. If the input file name does not contain any dots, none of the Simple Patterns can match and we’re done with the Simple Patterns. Else, if the file name contains one or more dots, we’ll perform a tree lookup for every substring that follows a dot, until we find a match or reach the end of the file name. The following simple pseudo-code illustrates this procedure:

for (f = filename; *f != '\0';)
  {
    if (*f++ != '.')
      continue;
 
    /* perform the tree lookup on the
     * substring f ...
     */
  }

Now, let’s see to this obscure tree lookup. Most probably the best way to match an extension-based file name pattern to a file name is to use a n-ary tree and perform a lookup character by character.

The following figure illustrates the search tree built from the 7 Simple Patterns *.C, *.htm, *.html, *.xm, *.xml, *.xsl and *.xslt.

FIXME: Put everything online concerning this tree idea.

While reducing the number of comparisons, it’s important to keep an eye on data locality; else, if we stall the CPU 50 times waiting for main memory, we’ll surely experience a huge performance problem with large directories in the end.

A simple worst case scenario here would be a MP3 directory with all extensions in upper case (*.MP3) and say about 3000 such mp3 files. Since the extension is upper-cased, we’d need the following steps:

  1. Case-sensitive Literal pattern matching
  2. Case-sensitive Simple pattern matching
  3. Case-sensitive Complex pattern matching
  4. Case-insensitive Literal pattern matching
  5. Case-insensitive Simple pattern matching

(plus the magic rules > 80 checks first). Performing these actions on 3000 files will take time; it’s therefore important to optimize the Glob Pattern Matching as much as possible.

:!: More research required!

Complex Patterns

This is the least common pattern (only 5 complex patterns in the shared-mime-info-0.15 package). It needs to be checked using fnmatch() (the glob-style pattern matching in GLib does not support […] character ranges). Due to the nature of the patterns there’s not much that can be done here to optimize the pattern matching, and since its the least common case, it may not be worth to spent much effort on this.

:!: For filenames that would match only in steps 3-6 (e.g. IMAGE.GIF, which will normally match the simple pattern *.gif in step 4), all the complex patterns will need to be tested first, so it may actually be a good idea to waste some more thoughts about this.

Steps

The required steps for the Glob Match detection are as follows

  1. Check Literal patterns (case sensitive)
  2. Check Simple patterns (case sensitive)
  3. Check Complex patterns (case sensitive)
  4. Check Literal patterns (case insensitive)
  5. Check Simple patterns (case insensitive)
  6. Check Complex patterns (case insensitive)

Recommended reading

Obsolete Ideas

The ideas presented below are mostly obsoleted by the additional analysis performed on the Shared MIME Database.

Literal patterns

The Literal patterns, like Makefile or README can easily be checked using string comparisons. To further speed up the Literal patterns-steps, the patterns should be sorted by length, where the longest patterns should appear first, and in addition to the string value, the lookup database should also store the length of the string.

When comparing the filename to a literal pattern entry, the code should first check if the lengths match, then if both are of the same length, it should perform a string comparison. If the code detects that the filename length is less than the length of the current literal pattern, then it should stop the lookup and continue with the next step. This optimization is based on the proposition that the literal patterns are sorted by their string length (longest first), and so no other literal can match the given filename, because all other literals are shorter than the filename.

The following pseudo code demonstrate this step:

for (all literal patterns)
  {
    if (length (literal) < length (filename))
      break;
    else if (length (literal) == length (filename))
      {
        if (literal == filename)
          return mime type of literal;
      }
  }

As mentioned earlier the storage should be an array of Pattern structures. The PatternLiteral structure should look like this:

struct PatternLiteral
{
  union
  {
    gchar *ptr;
    gchar  buf[8];
  } value;
 
  gsize length;
 
  const gchar *type;
};

For the ia32 architecture (and most other 32bit architectures), this will most probably result in a 16 byte struct (if not, your compiler is on crack!), which matches the size of a cacheline of mostly every (all?) ia32 based CPU. In addition, care must taken that the allocated array of structs is also aligned on a 16 byte boundary (IIRC, this is default on 32bit BSDs already; Glibc?). To further speed up things, the implementation can use prefetching for CPUs that support it. The size of buf should probably be detected at configure time somehow, so other architectures can benefit from this optimization as well. For the ia32 case, we can optimize this further using SSE/SSE2 (if available) to load and compare the stuff very fast.

As an explanation for the struct: length includes the length of the literal in bytes, and type is a pointer to the MIME-type which is located in the hashtable somewhere in the ThunarMimeDatabase. The interesting optimization here is the storage of the literal string itself; if the literal string is no more than 8 bytes in size, it is embedded into the PatternLiteral memory, and the string is stored outside only if the literal is 9 or more byte in size. A quick grep on the existing glob files show that all currently existing literal patterns are less than 9 byte in size and our optimization works; excellent data locality!

If some literal patterns are 9 or more byte, they should be collected and stored together in a single string chunk to avoid polluting the cache with random memory accesses.

Since the literal pattern and the input filename size is known and compared first, the string comparison could be done using a plain memcmp() instead of using a real string comparison. But this will only work if both strings are normalized first (remember, this is UTF-8!).

The above optimizations should do the job, especially, since - as mentioned already - the literal pattern is not the common case.

Open issues:

  • What about case-insensitive comparisons!? (We could use a 2nd list with the same basic structure as described above here, where all items are lower-cased properly - again, remember UTF-8 - we’d two lists, but only one function!)

Old Implementation (from Filer)

A possible implementation of the MIME-type detection Glob match step. This implementation is not well tested, and it’s basicly just a testcase. It includes the integer comparison-optimization for the common case where the pattern is something like *.EXT and EXT has 1 to 3 characters. For the final implementation, this should be optimized even further.

struct _FilerMimeGlob
{
  PatternLiteral  *literals;
  PatternString   *simple_long;
  PatternFast4    *simple_short;
  PatternGlob     *full;
};
 
/* for literal patterns */
struct _PatternLiteral
{
  gchar           *value;
  gsize            length;
  gchar           *type;
  PatternLiteral  *next;
};
 
/* for '*.xxx' patterns, where strlen(xxx) > 3 */
struct _PatternString
{
  gchar         *value;
  gsize          length;
  gchar         *type;
  PatternString *next;
};
 
/* for '*.xxx' patterns, where strlen(xxx) <= 3 */
struct _PatternFast4
{
  guint32        value_cs;
  guint32        value_ci;
  guint32        mask;
  gchar         *type;
  PatternFast4  *next;
};
 
/* for complex glob patterns */
struct _PatternGlob
{
  GPatternSpec  *pspec;
  gchar         *type;
  PatternGlob   *next;
};
 
 
 
static const gchar*
pattern_literal_match (PatternLiteral *pattern,
                       const gchar    *name,
                       const gchar    *name_reversed,
                       gsize           name_length,
                       gboolean        case_sensitive)
{
  for (; pattern != NULL; pattern = pattern->next)
    {
      if (pattern->length != name_length)
        continue;
 
      if ((case_sensitive && strcmp (name, pattern->value) == 0)
          || (!case_sensitive && strcasecmp (name, pattern->value) == 0))
        return pattern->type;
    }
 
  return NULL;
}
 
 
 
#define _pattern_append(Type, head, pnew) \
({ \
  Type *__lp; \
  if (G_UNLIKELY (head == NULL)) \
    { \
      head = pnew; \
    } \
  else \
    { \
      for (__lp = head; __lp->next != NULL; __lp = __lp->next); \
      __lp->next = pnew; \
    } \
 (head); \
 })
 
 
 
static PatternLiteral*
pattern_literal_append (PatternLiteral *head,
                        const gchar    *type,
                        const gchar    *pattern)
{
  PatternLiteral *pnew;
 
  pnew = g_new (PatternLiteral, 1);
  pnew->value = g_strdup (pattern);
  pnew->type = g_strdup (type);
  pnew->next = NULL;
 
  if (G_UNLIKELY (head == NULL))
    {
      head = pnew;
    }
  else
    {
      for (lp = head; lp->next != NULL; lp = lp->next);
      lp->next = pnew;
    }
 
  return head;
}
 
 
 
static const gchar*
pattern_string_match (PatternString *pattern,
                      const gchar   *name,
                      const gchar   *name_reversed,
                      gsize          name_length,
                      gboolean       case_sensitive)
{
  for (; pattern != NULL; pattern = pattern->next)
    {
      if (pattern->length > name_length)
        continue;
 
      if ((case_sensitive && strcmp (name, pattern->value, pattern->length) == 0)
          || (!case_sensitive && strncasecmp (name, pattern->value, pattern->length) == 0))
        return pattern->type;
    }
 
  return NULL;
}
 
 
 
static const gchar*
pattern_fast4_match (PatternFast4 *pattern,
                     const gchar  *name,
                     const gchar  *name_reversed,
                     gsize         name_length,
                     gboolean      case_sensitive)
{
  guint32 value;
  gchar   x[4];
 
  if (G_UNLIKELY (pattern == NULL))
    return NULL;
 
  if (G_LIKELY (name_length > 3))
    {
      x[0] = name_reversed[0];
      x[1] = name_reversed[1];
      x[2] = name_reversed[2];
      x[3] = name_reversed[3];
    }
  else if (name_length == 3)
    {
      x[0] = name_reversed[0];
      x[1] = name_reversed[1];
      x[2] = name_reversed[2];
      x[3] = '\0';
    }
  else if (name_length == 2)
    {
      x[0] = name_reversed[0];
      x[1] = name_reversed[1];
      x[2] = '\0';
      x[3] = '\0';
    }
  else if (name_length == 1)
    {
      x[0] = name_reversed[0];
      x[1] = '\0';
      x[2] = '\0';
      x[3] = '\0';
    }
  else
    {
      return NULL;
    }
 
  if (!case_sensitive)
    {
      if (x[0] >= 'a' && x[0] <= 'z')
        x[0] += ('A' - 'a');
      if (x[1] >= 'a' && x[1] <= 'z')
        x[1] += ('A' - 'a');
      if (x[2] >= 'a' && x[2] <= 'z')
        x[2] += ('A' - 'a');
      if (x[3] >= 'a' && x[3] <= 'z')
        x[3] += ('A' - 'a');
    }
 
  value = *((const guint32 *) x);
 
  if (case_sensitive)
    {
      for (; pattern != NULL; pattern = pattern->next)
        if (pattern->value_cs == (value & pattern->mask))
          return pattern->type;
    }
  else
    {
      for (; pattern != NULL; pattern = pattern->next)
        if (pattern->value_ci == (value & pattern->mask))
          return pattern->type;
    }
 
  return NULL;
}
 
 
 
static const gchar*
pattern_glob_match (PatternGlob *pattern,
                    const gchar *name,
                    const gchar *name_reversed,
                    gsize        name_length,
                    gboolean     case_sensitive)
{
  for (; pattern != NULL; pattern = pattern->next)
    if (g_pattern_match (pattern->pspec, name_length, name, name_reversed))
 
      return pattern->type;
 
  return NULL;
}
 
 
 
static void
filer_mime_glob_append (FilerMimeGlob *glob,
                        const gchar   *type,
                        const gchar   *pattern)
{
  if (strpbrk (pattern, "*?[") == NULL)
    {
      pattern_literal_append (glob->literals, type, pattern);
    }
  else if (pattern[0] == '*' && pattern[1] == '.'
        && strpbrk (pattern + 2, "*?[") == NULL)
    {
      if (strlen (pattern + 2) <= 3)
        {
          pattern_fast4_append (glob->simple_short, type, pattern);
        }
      else
        {
          pattern_string_append (glob->simple_long, type, pattern);
        }
    }
  else
    {
      pattern_glob_append (glob->full, type, pattern);
    }
}
 
 
 
/**
 * @glob
 * @name        : The file's name, not its path.
 *
 * Return value :
 **/
const gchar*
filer_mime_glob_lookup (FilerMimeGlob *glob,
                        const gchar   *name)
{
  const gchar *type;
  gboolean     case_sensitive;
  gchar       *rname;
  gsize        length;
  gint         n;
 
 
  length = strlen (filename);
  rname = g_utf8_strreverse (name, -1);
 
  for (case_sensitive = TRUE, n = 0; n < 2; case_sensitive = !case_sensitive, ++n)
    {
      /* literals */
      type = pattern_literal_match (glob->literals, name, rname, length, case_sensitive);
      if (type != NULL)
        goto done;
 
      /* long simple patterns */
      type = pattern_string_match (glob->simple_long, name, rname, length, case_sensitive);
      if (type != NULL)
        goto done;
 
      /* short simple patterns */
      type = pattern_fast4_match (glob->simple_short, name, rname, length, case_sensitive);
      if (type != NULL)
        goto done;
 
      /* complex glob patterns */
      type = pattern_glob_match (glob->full, name, rname, length, case_sensitive);
      if (type != NULL)
        goto done;
    }
 
done:
  g_free (rname);
  return type;
}
 
 
 
/**
 **/
void
filer_mime_glob_read_from_file (FilerMimeGlob *glob,
                                const gchar   *path)
{
  gchar  line[2048];
  gchar *pattern;
  gchar *p;
  FILE  *fp;
 
  fp = fopen (path, "r");
  if (G_UNLIKELY (fp == NULL))
    return;
 
  while (fgets (line, 2048, fp) != NULL)
    {
      if (G_UNLIKELY (line[0] == '#'))
        continue;
 
      for (p = line; *p != ':' && *p != '\0'; ++p);
      if (G_UNLIKELY (*p == '\0'))
        continue;
 
      for (*p++ = '\0', pattern = p; *p != '\0' && *p != '\n'; ++p);
      if (G_UNLIKELY (pattern == p))
        continue;
      *p = '\0';
 
      filer_mime_glob_append (glob, line, pattern);
    }
 
  fclose (fp);
}
 
  implementation/mime-glob-match.txt · Last modified: 2005/03/27 20:26 by 217.85.190.105 (benny)
 
Recent changes RSS feed Creative Commons License Driven by DokuWiki