
Globstar For Perl

Using the double star in file name patterns like assets/**/*.css is second nature to many web developers in the Node.js ecosystem. It is also used by gitignore when evaluating exclusion patterns. The new Perl library File-Globstar brings the same functionality to Perl.

The module provides five entry points for related functionality: the functions globstar(), translatestar(), fnmatchstar(), and quotestar(), as well as the high-level class File::Globstar::ListMatch for implementing exclusion or inclusion patterns in the style of gitignore.

Functions and Classes

globstar(PATTERN)

This function behaves like the regular Perl function glob() but has support for the double asterisk feature. You can use it for something like this:

my @files = globstar 'lib/**/*.p[lm]';

The variable @files now contains all Perl source files in the directory lib and all of its descendants.

Internally, the function just expands all occurrences of ** and leaves the rest of the heavy lifting to the regular glob() function.
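To illustrate the idea of expanding ** first and handing the rest to an ordinary glob, here is a minimal sketch that uses only core modules. The helper expand_double_star() is hypothetical and is not the library's code. It assumes a pattern of the form BASE/**/REST with a single double star, and it uses bsd_glob() from File::Glob so that paths containing spaces are not split.

```perl
use strict;
use warnings;
use File::Find ();
use File::Glob qw(bsd_glob);

# Hypothetical helper, not the library's code. It handles one "/**/" only:
# "BASE/**/REST" becomes "DIR/REST" for BASE and every directory below it.
sub expand_double_star {
    my ($pattern) = @_;

    my ($base, $rest) = $pattern =~ m{\A(.*?)/\*\*/(.*)\z}s
        or return bsd_glob($pattern);    # no "/**/": a plain glob is enough
    return unless -d $base;

    # Collect BASE and all of its subdirectories.
    my @dirs;
    File::Find::find({
        no_chdir => 1,
        wanted   => sub { push @dirs, $File::Find::name if -d $File::Find::name },
    }, $base);

    # The ordinary glob does the rest of the work in each directory.
    my @found = map { bsd_glob("$_/$rest") } @dirs;
    return sort @found;
}
```

Called as expand_double_star('lib/**/*.p[lm]'), the sketch returns matching files in lib itself as well as in all of its descendants, because the double star is allowed to stand for zero directories.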

More detailed information is available in the manual page of File::Globstar.

fnmatchstar(PATTERN, STRING[, OPTIONS])

While globstar() examines the file system to expand a pattern into a list of files, fnmatchstar() works in the abstract: it matches one string against a pattern and returns true or false depending on whether the string matched. It is on the one hand an extension of the standard C function fnmatch(3), and on the other hand a subset of it, because fnmatchstar() supports fewer options.

Read the manual page of File::Globstar for more information.

translatestar(PATTERN[, OPTIONS])

The function translatestar() is internally used by fnmatchstar(). It takes a globstar pattern as input and transpiles it into a Perl regular expression.
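What such a translation can look like is shown in the following heavily simplified sketch. The function simple_translate() is hypothetical and does not reproduce the output of translatestar(). It only knows "**", "*", "?" and literal characters, and it ignores character classes, escapes, and options.

```perl
use strict;
use warnings;

# Hypothetical, heavily simplified translator. Not the library's code.
sub simple_translate {
    my ($pattern) = @_;

    my $re = '';
    # The capture group makes split() return the wildcards as tokens, too.
    for my $token (split m{(\*\*/|\*\*|\*|\?)}, $pattern) {
        if    ($token eq '**/') { $re .= '(?:.*/)?' }   # zero or more directories
        elsif ($token eq '**')  { $re .= '.*' }         # anything, slashes included
        elsif ($token eq '*')   { $re .= '[^/]*' }      # anything except a slash
        elsif ($token eq '?')   { $re .= '[^/]' }       # exactly one non-slash character
        else                    { $re .= quotemeta $token }
    }

    return qr{\A$re\z}s;
}

my $css = simple_translate('assets/**/*.css');
for my $path ('assets/site.css', 'assets/vendor/ui/theme.css', 'assets/site.js') {
    printf "%-28s %s\n", $path, $path =~ $css ? 'matches' : 'does not match';
}
```

The important design point is that a single star never crosses a slash, while the double star does.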

Again, all the gory details are described in the manual page of File::Globstar.

quotestar(STRING)

The function quotestar() escapes all characters special to globbing patterns in a string with a backslash. You can pass an optional argument “negatable” if you also want to backslash-escape a leading exclamation mark.

This is needed, for example, when you want to add a path to a list of globbing patterns and that path is meant to be taken literally, not subject to globbing.
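The idea behind such quoting fits into a few lines. The function quote_glob() below is a hypothetical stand-in, not the library's implementation, and the set of characters it escapes (backslash, square brackets, asterisk, question mark) is an assumption about what counts as special.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the quoting idea. Not the library's code.
sub quote_glob {
    my ($string, $negatable) = @_;

    # Put a backslash in front of every character assumed to be special.
    $string =~ s/([\\\[\]*?])/\\$1/g;

    # A leading "!" only needs protection in lists that support negation.
    $string =~ s/\A!/\\!/ if $negatable;

    return $string;
}

print quote_glob('docs/what?[draft].md'), "\n";
print quote_glob('!important.txt', 1), "\n";
```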

File::Globstar::ListMatch

File::Globstar::ListMatch implements the algorithm used by gitignore in Perl. It takes a list of globstar patterns, possibly negated for overriding previous patterns, and converts it into a list of regular expressions. Its match() method takes a string and matches it against all patterns in turn, reporting the overall result as true or false.

File::Globstar::ListMatch can operate in inclusion and exclusion mode. Exclusion mode is compatible with gitignore and differs from inclusion mode by exactly one feature that is meant to improve performance but does not make a lot of sense at first glance: in exclusion mode, you cannot include a file or directory when one of its parent directories has previously been excluded.

Wait a second! Didn’t we say that just strings get compared without the file system being consulted? Yes. But, of course, the normal application is matching file names, and so we use names of files and directories as a metaphor for the string to be matched.

Now why this strange rule with the parent directories? In fact, it makes the matching slower. How can it possibly improve performance?

The motivation for writing File::Globstar::ListMatch was the static site generator Qgoda. Qgoda works similarly to Jekyll. It recursively scans all files and subdirectories in its source directory and copies them over into the output directory _site, possibly processing them from Markdown into HTML and possibly renaming them.

By default, it ignores all top-level files and directories the names of which start with an underscore and all hidden files. Hidden files are files with names starting with a dot (.). The default exclusion pattern therefore looks like this:

/_*
.*

The leading slash anchors a pattern to the top-level directory.

You can override previous patterns with a leading exclamation mark, for example like this:

/_*
.*
!.htaccess

That would still exclude all files starting with a dot but not .htaccess files.

What you cannot do (in exclusion mode!) is this:

/_*
.*
!_assets/fonts

Supporting this would have a very negative impact on performance. This becomes clear when you look at the way Qgoda (and probably git as well) collects the files to process. It recurses through the source directory with File::Find and checks each file or directory against the exclusion pattern list. If the name matches, the file is ignored. In the case of a directory, a match also prevents descending into that directory, so files inside it are never even looked at.
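The following sketch shows this kind of traversal with the core module File::Find. It is not Qgoda's actual code. Both collect_files() and is_excluded() are hypothetical, and is_excluded() merely hard-codes the two default patterns /_* and .* as regular expressions.

```perl
use strict;
use warnings;
use File::Find ();

# Hypothetical stand-in for the exclusion list "/_*" and ".*".
sub is_excluded {
    my ($relative) = @_;
    return $relative =~ m{\A_[^/]*\z}            # top-level names starting with "_"
        || $relative =~ m{(?:\A|/)\.[^/]*\z};    # hidden files at any depth
}

# Collect all regular files below $top that are not excluded.
sub collect_files {
    my ($top, $excluded) = @_;

    my @files;
    File::Find::find({
        no_chdir => 1,    # keep full paths in $File::Find::name
        wanted   => sub {
            my $path = $File::Find::name;
            return if $path eq $top;
            (my $relative = $path) =~ s{\A\Q$top\E/}{};

            if ($excluded->($relative)) {
                # Stop File::Find from descending into an excluded directory.
                $File::Find::prune = 1 if -d $path;
                return;
            }
            push @files, $relative if -f $path;
        },
    }, $top);

    return sort @files;
}
```

A call like collect_files($source_directory, \&is_excluded) never invokes the callback for anything below _assets, because the directory is pruned as soon as it matches.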

In the example above, _assets is ignored because of the first rule /_* that is then later partly overridden by !_assets/fonts. But _assets is already discarded, and therefore _assets/fonts is never visited. The rule is useless.

This could, of course, be prevented by always reading the entire directory tree and deciding individually for every file and subdirectory found. But the directory /node_modules, for example, which is present in many web projects, typically has tens of thousands of files (if not hundreds of thousands). It makes a whole lot of sense not to descend into such directories at all.

But isn’t a useless rule just a cosmetic problem? Not quite. Qgoda usually runs in watch mode. It monitors the source directory for changes and rebuilds the site as necessary on the fly.

“As necessary” means that Qgoda has to check whether the file that was reported as changed needs to be processed at all. It checks just that single file against the exclusion list. If that file is, for example, _assets/fonts/funnyface.ttf, it would normally not be excluded, because !_assets/fonts overrides the general pattern /_*. On the other hand, we have seen that the initial run with File::Find would never pick up that file because the recursion already stops at _assets.

And that is more than a cosmetic problem. If the file was ignored in the initial run that examines the file system, it should also be ignored during the abstract check that only takes the file name into account.

It should now also be clear why neither File::Globstar::ListMatch nor fnmatchstar() ever checks the existence of a file. It simply wouldn’t make sense because the file in question could have been deleted. A deleted file could also trigger a rebuild in Qgoda (or make a git checkout dirty), but all file operations will, of course, fail for deleted files. So why try them in the first place?

By the way, there is a possibility to re-include _assets/fonts in Qgoda. It is a little bit esoteric though:

/_*
.*
!/_assets
/_assets/*
!/_assets/fonts

The pattern !/_assets re-includes the entire subdirectory _assets. The next pattern /_assets/* then puts everything directly inside /_assets back on the ignore list, before !/_assets/fonts ultimately re-includes the desired directory /_assets/fonts.
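The difference between the list that does not work and the one that does can be traced with a small sketch of exclusion mode. It is an assumption about how such a check can be written, not the code of File::Globstar::ListMatch. The regular expressions are hand-written stand-ins for the translated patterns, and both function names are hypothetical.

```perl
use strict;
use warnings;

# Plain list match: the last rule whose regex matches decides.
sub last_match_wins {
    my ($rules, $path) = @_;

    my $excluded = 0;
    for my $rule (@$rules) {
        my ($negated, $re) = @$rule;
        $excluded = $negated ? 0 : 1 if $path =~ $re;
    }
    return $excluded;
}

# Exclusion mode: walk from the top-level directory down to the path itself.
# The first excluded ancestor settles the matter, just like a pruned traversal.
sub excluded_in_exclusion_mode {
    my ($rules, $path) = @_;

    my $prefix = '';
    for my $part (split m{/}, $path) {
        $prefix = length($prefix) ? "$prefix/$part" : $part;
        return 1 if last_match_wins($rules, $prefix);
    }
    return 0;
}

my @useless_rule = (
    [0, qr{\A_[^/]*\z}],           # /_*
    [0, qr{(?:\A|/)\.[^/]*\z}],    # .*
    [1, qr{\A_assets/fonts\z}],    # !_assets/fonts
);
my @step_by_step = (
    [0, qr{\A_[^/]*\z}],           # /_*
    [0, qr{(?:\A|/)\.[^/]*\z}],    # .*
    [1, qr{\A_assets\z}],          # !/_assets
    [0, qr{\A_assets/[^/]*\z}],    # /_assets/*
    [1, qr{\A_assets/fonts\z}],    # !/_assets/fonts
);

my $font = '_assets/fonts/funnyface.ttf';
for my $case (['three patterns', \@useless_rule], ['five patterns', \@step_by_step]) {
    my ($label, $rules) = @$case;
    printf "%-15s %s\n", $label,
        excluded_in_exclusion_mode($rules, $font) ? 'excluded' : 'included';
}
```

With the three-pattern list the walk already stops at _assets, so the font file is reported as excluded. With the five-pattern list every ancestor passes and the file is included.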

Note that this pattern list can be written with fewer leading slashes:

/_*
.*
!/_assets
_assets/*
!_assets/fonts

The leading slash in the last two patterns was removed without changing anything. Read on if you don’t know why …

Slashes

One of the nice things about .gitignore files is that most people intuitively understand their syntax and semantics by simply looking at them, without ever bothering to read documentation. One thing that many people do not fully understand though, is the exact meaning of slashes in patterns.

When File::Globstar::ListMatch (or git) examines a pattern, it first checks whether it ends in a slash. If it does, it strips off the slash and marks it as a directory pattern. That particular pattern can only match a directory.

If the remainder of the pattern contains at least one more slash it is a “full path pattern” (see below).

In the last step, a possible leading slash is thrown away. It may have turned the pattern into a full path pattern but the leading slash will be ignored and not take part in the match.
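These three steps translate almost literally into code. The helper classify_pattern() is hypothetical and only mirrors the description above. It does not handle a leading exclamation mark, and it is not taken from the library.

```perl
use strict;
use warnings;

# Hypothetical helper that mirrors the three steps described above.
sub classify_pattern {
    my ($pattern) = @_;

    # Step 1: a trailing slash marks a directory-only pattern and is removed.
    my $directory_only = $pattern =~ s{/\z}{} ? 1 : 0;

    # Step 2: any slash that is still there makes it a full path pattern.
    my $full_path = $pattern =~ m{/} ? 1 : 0;

    # Step 3: a leading slash has done its job and is dropped.
    $pattern =~ s{\A/}{};

    return {
        directory_only => $directory_only,
        full_path      => $full_path,
        body           => $pattern,
    };
}

for my $pattern ('/_*/', '/_assets', 'js/vendor', '*.bak', 'build/') {
    my $info = classify_pattern($pattern);
    printf "%-10s directory_only=%d full_path=%d body=%s\n",
        $pattern, @{$info}{qw(directory_only full_path body)};
}
```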

That is one of the things that most people intuitively understand but let’s look at an example:

js/vendor
*.bak

When the file js/vendor/awesomelib/index.js is checked against this list, it matches js/vendor and is therefore ignored.

Now take src/js/app/components/menu.js.bak. You probably understand by intuition that it is ignored as well because of the pattern *.bak. But why exactly? The first pattern contains a slash but the second one does not. It is therefore a base name pattern. Only the base name of the file (in this case menu.js.bak) is taken into account, and that base name matches *.bak. The leading directory part is completely ignored.

Another consequence of this is that every slash (except for a trailing slash) forces a full path match and automatically anchors the pattern to the base directory. js/vendor/awesomelib/index.js matches js/vendor but src/js/vendor/awesome/index.js does not match js/vendor. Remember, the slash causes the full path name to be taken into account!
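The difference between base name patterns and full path patterns can be tried out with the following sketch. It is a simplification under stated assumptions and not the library's algorithm: only "*" is treated as a wildcard, the directory-only meaning of a trailing slash is ignored because no file system is consulted, and the names star_to_regex() and pattern_matches() are hypothetical.

```perl
use strict;
use warnings;

# Simplified translation: only "*" is special, and it never matches a slash.
sub star_to_regex {
    my ($glob) = @_;
    my $re = join '[^/]*', map { quotemeta($_) } split /\*/, $glob, -1;
    return qr{\A$re\z};
}

sub pattern_matches {
    my ($pattern, $path) = @_;

    (my $body = $pattern) =~ s{/\z}{};    # a trailing slash does not count
    my $full_path = $body =~ m{/};        # any other slash forces a full path match
    $body =~ s{\A/}{};                    # the leading slash takes no part in the match
    my $re = star_to_regex($body);

    # Test every parent directory and finally the path itself, top-down.
    my $candidate = '';
    for my $part (split m{/}, $path) {
        $candidate = length($candidate) ? "$candidate/$part" : $part;
        my $subject = $full_path ? $candidate : $part;
        return 1 if $subject =~ $re;
    }
    return 0;
}

for my $case (
    ['js/vendor', 'js/vendor/awesomelib/index.js'],
    ['js/vendor', 'src/js/vendor/awesome/index.js'],
    ['*.bak',     'src/js/app/components/menu.js.bak'],
) {
    my ($pattern, $path) = @$case;
    printf "%-10s %-36s %s\n", $pattern, $path,
        pattern_matches($pattern, $path) ? 'ignored' : 'kept';
}
```

A pattern with a slash is compared against the path from the base directory downwards, while a pattern without one is compared against each single path component.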

As a rule of thumb, a pattern needs either a leading slash, a trailing slash, or slashes inside it, but combinations of these three options are almost always wrong (although the error never hurts).

One of the rare exceptions would be, for example, /_*/. This would match all top-level directories that have names beginning with an underscore. Compare that to /_*, which would also match non-directories. On the other hand, /_assets and /_assets/ are almost equivalent. The subtle difference is that the second pattern would only match /_assets if it happens to be a directory. A non-directory /_assets would go through, but in general you will know whether /_assets is a directory or not. It’s your data after all, isn’t it?

Conclusion

Globstar ** patterns are standard for many applications today. File-Globstar makes them available for software written in Perl. Star File-Globstar on GitHub or rate it on CPAN if you think that it fills a gap in the Perl ecosystem.

