Switch rules
1. Concepts
1.1 The switch plugin
With the mfdata module, there is a plugin called switch installed by
default.
The configuration of this default system switch plugin is generated dynamically
from the switch rules read in the config.ini files of the installed plugins.
So there is no editable configuration file for this plugin: its configuration is derived entirely from the configuration files of the other plugins.
The goal of this switch plugin is to provide "loose coupling" and "dynamic business routing rules" between plugins.
It feeds other plugins depending on rules evaluated for each incoming file.
Suppose, for example, that we have 3 rules:
- rule#1 (from the installed plugin plugin1), which is True if the incoming filename starts with A
- rule#2 (from the installed plugin plugin2), which is True if the incoming filename starts with B
- rule#3 (from the installed plugin plugin3), which is True if the incoming filename starts with AB
For an incoming file named Afoo, only rule#1 is True, so the switch plugin routes the incoming file to plugin1.
For an incoming file named Bbar, only rule#2 is True, so the switch plugin routes the incoming file to plugin2.
For an incoming file named Cbar, all rules are False, so the switch plugin simply deletes the incoming file.
For an incoming file named ABfoo, two rules (rule#1 and rule#3) are True, so the switch plugin routes the file to both plugin1 and plugin3 (with a copy).
With a copy?
In this particular case, for performance reasons and to avoid too many copies, some optimizations are applied (hardlinking, copying and then moving the last one...), but as a first approximation you can consider that each plugin receives its own copy of the incoming file.
This is "loose coupling" because switch rules are described in each plugin (and
not in the switch plugin itself).
So if you remove the plugin plugin2 in this example (without changing anything else),
a new configuration for the switch plugin is automatically generated with the two remaining rules, and the routing of any file injected afterwards (another ABfoo, for example) is computed from those two rules only.
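The routing described in this section can be sketched in a few lines of Python. This is only an illustration of the principle, not the switch plugin's implementation: the RULES table and the route() helper are hypothetical names, and the predicates simply restate the three example rules.

```python
# Hypothetical rule table: plugin name -> predicate on the incoming filename.
RULES = {
    "plugin1": lambda name: name.startswith("A"),
    "plugin2": lambda name: name.startswith("B"),
    "plugin3": lambda name: name.startswith("AB"),
}


def route(filename, rules):
    """Return the sorted list of plugins whose rule is True for this filename.

    An empty list stands for "the switch plugin deletes the file".
    """
    return sorted(plugin for plugin, rule in rules.items() if rule(filename))


for name in ("Afoo", "Bbar", "Cbar", "ABfoo"):
    print(name, "->", route(name, RULES) or "deleted")

# Uninstalling plugin2 removes its rule; nothing else has to be edited.
remaining = {plugin: rule for plugin, rule in RULES.items() if plugin != "plugin2"}
print("Bbar", "->", route("Bbar", remaining) or "deleted")
```

Because each rule travels with its plugin, removing a plugin amounts to dropping one entry from the table; the other entries are untouched.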
1.2 The guess_file_type plugin
This plugin is also installed by default. In the default configuration, this
plugin listens to some directories and feeds the switch system plugin.
As its name suggests, it also tries to guess the file type with the file/libmagic Unix tools.
So, before handing the file to the switch plugin, it adds some useful "tags" to the file. These tags can then be used in switch rules, for example to route files of a given type to a specific plugin.
[Figure: default configuration after a clean mfdata installation]
${MFMODULE_RUNTIME_HOME}?
In most cases, ${MFMODULE_RUNTIME_HOME}=/home/mfdata
[Figure: example configuration after installing some plugins and configuring another "incoming" directory (see [internal_plugins]/watched_directories in the mfdata configuration)]
1.3 Tags
Tags are a kind of context attached to each file in a mfdata workflow, in the form of
key/value pairs. "Good" plugins
keep the context from the beginning of the workflow to its end, so you can trace
the complete life of a file in the mfdata workflow.
Here is a little example:
0.guess_file_type.main.ascii_header = mer. juil. 8 12:11:06 CEST 2020
0.guess_file_type.main.enter_step = 2020-07-08T10:11:06:290229
0.guess_file_type.main.size = 33
0.guess_file_type.main.system_magic = ASCII text
1.switch.main.enter_step = 2020-07-08T10:11:06:304117
2.archive.main.enter_step = 2020-07-08T10:11:06:309646
first.core.original_basename = foobar
first.core.original_dirname = incoming
first.core.original_uid = 5479ab3ec3054a499ea81dfda5e7a2bd
latest.core.step_counter = 2
latest.guess_file_type.main.ascii_header = mer. juil. 8 12:11:06 CEST 2020
latest.guess_file_type.main.size = 33
latest.guess_file_type.main.system_magic = ASCII text
Tags follow the form: {step_number_in_the_workflow}.{plugin}.{step}.{tag_name}.
Notable exceptions are:
- {plugin}.{step} is replaced by core for "always available" tags
- {step_number_in_the_workflow} can be replaced or duplicated with:
  - first (for core tags whose values won't change during the workflow)
  - latest (for values corresponding to the latest passage in the given step)
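To make the naming scheme concrete, here is a minimal sketch that splits a tag name into its components. The TagName type and the parse_tag() function are hypothetical helpers written for this page (they are not part of mfdata), and the sketch assumes that no plugin is itself named core.

```python
from typing import NamedTuple, Optional


class TagName(NamedTuple):
    position: str          # "0", "1", ... or "first" / "latest"
    plugin: Optional[str]  # None for core tags
    step: Optional[str]    # None for core tags
    name: str


def parse_tag(full_name: str) -> TagName:
    """Split a tag name according to the two forms described above."""
    position, rest = full_name.split(".", 1)
    if rest.startswith("core."):
        # "always available" tags: {plugin}.{step} is replaced by "core"
        return TagName(position, None, None, rest[len("core."):])
    plugin, step, name = rest.split(".", 2)
    return TagName(position, plugin, step, name)


print(parse_tag("0.guess_file_type.main.ascii_header"))
print(parse_tag("first.core.original_basename"))
```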
Most switch rules are logical expressions on the values of some of these tags.
Particularly useful tags in switch rules are:
- latest.guess_file_type.main.system_magic (available only if the file passed through the guess_file_type plugin): the output of the file command on the file (for example: PNG image data, 48 x 47, 8-bit gray+alpha, non-interlaced)
- latest.guess_file_type.main.size (available only if the file passed through the guess_file_type plugin): the file size (in bytes)
- first.core.original_basename: the basename of the incoming file (at the very beginning of the workflow)
- first.core.original_dirname: the dirname of the incoming file (at the very beginning of the workflow)
- latest.guess_file_type.main.ascii_header (available only if the file passed through the guess_file_type plugin): the first 60 ASCII characters of the file (characters with ASCII codes <32 or >126 are filtered out)
Important
As your own plugins can add custom tags to the file context, you can also use these custom tags in switch rules, so that one plugin's output drives the routing decisions made further down the workflow.
2. How to set your switch rules?
2.1 Introduction
In the config.ini file of your plugin, you can add several rules blocks:
[switch_rules:{rule_type}:{rule_type_param}]
or (depending on the rule type):
[switch_rules:{rule_type}]
Each rules block defines a rule type and some rule parameters. Under a "rules block" you can have one or several switch rules.
A switch rule is a line like:
{pattern} = {step_name1}, {step_name2}*, {step_name3}, ...
What about this * sign after step_name2?
If a step name ends with a *, it means that the switch plugin can use
hardlinking instead of copying (when there are multiple recipients for the file).
This is better for performance, but the target step must not alter the incoming file in any way.
So please DO NOT ADD THE STAR SIGN IF YOU ARE NOT SURE!
So, for a given pattern, you can have one or several copy actions (separated by commas). Each action means: copy (or hardlink) the incoming file to the given step.
What about if there is only one recipient step?
If there is only one recipient step, the switch plugin will use a move
instead of a copy, for performance reasons.
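As an illustration of the syntax above, the right-hand side of a rule can be parsed and turned into a transfer plan as follows. parse_actions() and plan_transfers() are hypothetical helpers, and the plan is a deliberate simplification: the real switch plugin may pick other optimizations (the text only says that it can use hardlinking for starred steps).

```python
def parse_actions(value):
    """Split 'step1*, step2' into [('step1', True), ('step2', False)].

    The boolean says whether hardlinking is allowed (trailing star).
    """
    actions = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        hardlink_ok = item.endswith("*")
        actions.append((item.rstrip("*"), hardlink_ok))
    return actions


def plan_transfers(actions):
    """Map each recipient step to 'move', 'hardlink' or 'copy' (simplified)."""
    if len(actions) == 1:
        # A single recipient: the file can simply be moved.
        return {actions[0][0]: "move"}
    return {step: ("hardlink" if ok else "copy") for step, ok in actions}


print(plan_transfers(parse_actions("step1*, step2")))
print(plan_transfers(parse_actions("step1*")))
```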
Evaluation principles
- every switch rule is evaluated in the context of its rules block. If the pattern matches (in this context), its actions are APPENDED to the recipient list
- there is no way to remove a step from the (already collected) recipient list
- all switch rules are systematically evaluated
- rules are not evaluated in any particular order
- if a given step appears several times in the final recipient list, duplicates are automatically removed (on the technical side, the "recipient list" is a set and not a list)
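These principles can be summed up in a short sketch. It is not the plugin's code: collect_recipients() is a hypothetical function, rules are modelled as (predicate, steps) pairs, and the tag values are made up for the demonstration.

```python
import random


def collect_recipients(rules, tags):
    """Evaluate every rule and return the set of recipient steps."""
    recipients = set()
    for predicate, steps in rules:
        if predicate(tags):           # no short-circuit: every rule is evaluated
            recipients.update(steps)  # steps are only ever added, never removed
    return recipients


def basename_starts_with_a(tags):
    return tags["first.core.original_basename"].startswith("A")


def dirname_is_incoming(tags):
    return tags["first.core.original_dirname"] == "incoming"


rules = [
    (basename_starts_with_a, ["step1", "step2"]),
    (dirname_is_incoming, ["step1"]),  # step1 again: the duplicate is dropped
    (lambda tags: False, ["step3"]),   # never matches
]
tags = {
    "first.core.original_basename": "Afoo",
    "first.core.original_dirname": "incoming",
}

result = collect_recipients(rules, tags)
print(sorted(result))

# Since the result is a set built from all rules, evaluation order is irrelevant.
random.shuffle(rules)
assert collect_recipients(rules, tags) == result
```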
2.2 Example
Let's say you add these lines in your plugin config.ini:
[switch_rules:fnmatch:first.core.original_basename]
A* = step1*, step2
B* = step1*
[switch_rules:regex:first.core.original_dirname]
.foo$ = step3
We have two rules blocks:
- one of fnmatch type (with parameter: first.core.original_basename)
- one of regex type (with parameter: first.core.original_dirname)
2.2.1 First file
Let's say we have an incoming file with:
- first.core.original_basename (original basename of the file) = Bar
- first.core.original_dirname (original dirname of the file) = foo
For the first rules block, we try the pattern A* with fnmatch on the first.core.original_basename value (Bar).
It is evaluated to False.
fnmatch?
The fnmatch rule type is a basic rule type which uses fnmatch patterns
(a kind of "Unix shell-style wildcards"). These patterns are basic but very easy to learn and read.
Then we try the pattern B* with fnmatch on the first.core.original_basename value (Bar). It is evaluated to True.
So step1 is added to the list of recipients for this file (hardlinking is allowed because of the *). But we continue to evaluate
the other rules.
For the second rules block, we try the pattern .foo$ with regex on the first.core.original_dirname value (foo).
It is evaluated to False, because the pattern requires at least one character before foo.
regex?
The regex rule type is a rule type which uses re patterns
(regular expressions). These patterns are more powerful but harder to learn and read than fnmatch ones.
So for this first file, we have only one recipient: step1.
2.2.2 Second file
Let's say we have a different incoming file with:
- first.core.original_basename (original basename of the file) = Abar
- first.core.original_dirname (original dirname of the file) = incoming.foo
In the first rule block, the first line matches so step1 and step2 are added to the recipient list.
In the second rule block, the rule also matches, so we add step3 to the recipient list.
So for this second file, we have three recipients: step1, step2 and step3. Each of them receives the file.
optimization
Saying that every recipient receives a copy is not totally accurate, as the switch plugin will
probably replace the last copy operation with a move operation
(for performance reasons).
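The two walkthroughs of section 2.2 can be replayed with the Python standard library. This is a sketch under assumptions: it uses fnmatch.fnmatchcase for the fnmatch type and re.search for the regex type (search semantics are assumed because .foo$ is said to match incoming.foo), and the hardlink stars are left out since they do not affect which steps are selected. BLOCKS, matches() and recipients() are names invented for the example.

```python
import fnmatch
import re

# Rule blocks from the example: (rule type, tag name, [(pattern, steps), ...])
BLOCKS = [
    ("fnmatch", "first.core.original_basename",
     [("A*", ["step1", "step2"]), ("B*", ["step1"])]),
    ("regex", "first.core.original_dirname",
     [(".foo$", ["step3"])]),
]


def matches(rule_type, pattern, value):
    if rule_type == "fnmatch":
        return fnmatch.fnmatchcase(value, pattern)
    if rule_type == "regex":
        # Assumption: the pattern may match anywhere in the value.
        return re.search(pattern, value) is not None
    raise ValueError(f"unsupported rule type: {rule_type}")


def recipients(tags):
    """Evaluate every rule of every block and collect the recipient steps."""
    found = set()
    for rule_type, tag_name, rules in BLOCKS:
        value = tags[tag_name]
        for pattern, steps in rules:
            if matches(rule_type, pattern, value):
                found.update(steps)
    return found


first_file = {
    "first.core.original_basename": "Bar",
    "first.core.original_dirname": "foo",
}
second_file = {
    "first.core.original_basename": "Abar",
    "first.core.original_dirname": "incoming.foo",
}
print(sorted(recipients(first_file)))   # ['step1']
print(sorted(recipients(second_file)))  # ['step1', 'step2', 'step3']
```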
3. Available rule block types
3.1 equal
[switch_rules:equal:first.core.original_basename]
foo = main
It will return True only if the given tag value (first.core.original_basename
in this example) is equal to foo (python == operator).
bytes/utf8?
In MetWork < 1.0, tag values were compared as bytes (so strings were often
prefixed with b""). In MetWork >= 1.0, tag values are automatically decoded
as utf8 strings, so use standard plain strings here.
3.2 fnmatch
[switch_rules:fnmatch:first.core.original_basename]
A* = step1*, step2
The fnmatch rule type is a basic rule type which uses fnmatch patterns (a kind of "Unix shell-style wildcards"). These patterns are basic but very easy to learn and read.
The example will return True if the given tag value (first.core.original_basename
in this example) starts with A.
3.3 regex
[switch_rules:regex:first.core.original_dirname]
.foo$ = step3
The regex rule type is a rule type which uses re patterns (regular expressions). These patterns are more powerful but harder to learn and read than fnmatch ones.
The example will return True if the given tag value (first.core.original_dirname
in this example) ends with foo preceded by at least one character. The unescaped . matches any character, so incoming.foo matches but foo alone does not.
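The behaviour of this pattern is easy to check with Python's re module. The sketch assumes search semantics (a match anywhere in the value), which is what the examples on this page imply. The strict pattern is only there for comparison and is not taken from the mfdata documentation.

```python
import re

pattern = re.compile(".foo$")    # pattern used in the example
strict = re.compile(r"\.foo$")   # comparison: the dot is escaped (literal ".")

for value in ("foo", "incoming.foo", "incomingXfoo", "foo.bar"):
    print(f"{value!r}: example={bool(pattern.search(value))} "
          f"strict={bool(strict.search(value))}")
```

The unescaped dot accepts any character before foo, so incomingXfoo matches the example pattern but not the strict one.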
3.4 notequal
[switch_rules:notequal:first.core.original_basename]
foo = main
This is the inverse of the equal rule.
Files whose original_basename is different from foo will be routed to the main step.
3.5 notfnmatch
[switch_rules:notfnmatch:first.core.original_basename]
A* = step1*, step2
This is the inverse of the fnmatch rule.
Files whose original_basename doesn't start with A will be routed to the step1 and step2 steps.
3.6 notregex
[switch_rules:notregex:first.core.original_dirname]
.foo$ = step3
This is the inverse of the regex rule.
It is provided for completeness only, as you can express a negative condition in the regex pattern itself.
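One way to express such a negation inside a regular expression is a negative lookahead. The sketch below is an illustration with Python's re module, again assuming search semantics and values without newlines. POSITIVE, NEGATIVE and is_negation() are names chosen for the example.

```python
import re

POSITIVE = re.compile(".foo$")
# Negative lookahead anchored at the start: succeeds only if the
# positive pattern cannot match anywhere in the value.
NEGATIVE = re.compile("^(?!.*.foo$)")


def is_negation(values):
    """Check that NEGATIVE matches exactly when POSITIVE does not."""
    return all(
        bool(NEGATIVE.search(value)) == (POSITIVE.search(value) is None)
        for value in values
    )


print(is_negation(["foo", "incoming.foo", "incoming", ""]))
```

The notregex type remains the more readable choice; the lookahead form is mainly useful when the negation has to be combined with other conditions in a single pattern.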
3.7 alwaystrue
[switch_rules:alwaystrue]
whatever = main
This rule is always True. So the main step will be in the recipient list
for all files.
3.8 python
[switch_rules:python]
foo.myrule = step1
/special/directory:bar.myrule2 = step2
This is probably the most important rule type, as it is the most flexible and the most powerful.
With the line foo.myrule = step1, you tell the switch plugin to call the function myrule() in
the file foo.py at the root of your plugin directory. If the function returns True
for the given file, step1 will be added to the recipient list.
With the line /special/directory:bar.myrule2 = step2, you tell it to call the function myrule2()
in the file bar.py located in the /special/directory/ directory.
The prototype of the function myrule() (or myrule2()) is very basic. Here is an example:
def myrule(xaf):
    # xaf is the incoming file with its context (XattrFile object)
    # see https://github.com/metwork-framework/xattrfile library
    original_basename = xaf.tags.get('original_basename', None)
    original_dirname = xaf.tags.get('original_dirname', None)
    if original_basename and original_dirname:
        # WARNING: original_basename and original_dirname are bytes
        if original_basename.startswith(b'A'):
            if original_dirname.startswith(b'B'):
                return True
    return False
Warning
This function should be fast and must not rely on external services, as a slow call
can block the whole switch plugin.
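Since a rule function only needs an object exposing a tags mapping, it can be exercised outside mfdata with a small stand-in. The FakeXaf class below is hypothetical (it is not part of the xattrfile library) and assumes, as in the example above, that the rule only reads xaf.tags and that tag values are bytes.

```python
class FakeXaf:
    """Hypothetical stand-in exposing only the .tags mapping used by the rule."""

    def __init__(self, **tags):
        self.tags = tags


def myrule(xaf):
    # Same logic as the example rule, written compactly.
    basename = xaf.tags.get('original_basename', None)
    dirname = xaf.tags.get('original_dirname', None)
    if basename and dirname:
        return basename.startswith(b'A') and dirname.startswith(b'B')
    return False


print(myrule(FakeXaf(original_basename=b'Afoo', original_dirname=b'Bdir')))
print(myrule(FakeXaf(original_basename=b'Afoo', original_dirname=b'incoming')))
print(myrule(FakeXaf(original_basename=b'Afoo')))  # missing tag
```

Checking a rule this way before installing the plugin helps catch str/bytes mix-ups, which would otherwise only show up as files silently not being routed.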
4. Advanced usage
4.1 Without any switch plugin
In some use cases (for example, you have only one plugin which takes care of
all incoming files and you are sure it will never change), you can work without
any switch or guess_file_type plugin.
To do that, set in the mfdata configuration:
- install_switch=0 under the [internal_plugins] section
- install_guess_file_type=0 under the [internal_plugins] section
(if you don't do that, these two plugins will be reinstalled after a services restart)
Then, only if they were already installed, remove them (as the mfdata user):
- plugins.uninstall switch
- plugins.uninstall guess_file_type
Last, change your plugin config.ini file to add a new watched directory. For
example, set watched_directories={MFDATA_CURRENT_STEP_DIR},incoming
Warning
Please don't remove {MFDATA_CURRENT_STEP_DIR} from the watched_directories key
unless you know exactly what you are doing.
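[Figure: resulting workflow, in which your plugin watches the incoming directory directly, without any switch or guess_file_type plugin]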
4.2 With multiple switch plugins
Starting with MetWork 1.0, for very complex workflows, you can use several switch plugins.
To install another switch plugin, you first have to find the released .plugin file
corresponding to the switch plugin.
You will find the full path with: ls ${MFMODULE_HOME}/share/plugins/switch-*.plugin.
${MFMODULE_HOME}?
In most cases, ${MFMODULE_HOME}=/opt/metwork-mfdata.
Then install the switch plugin a second time (but with another name). For example:
plugins.install --new-name=myswitch /full/path/of/the/switch.plugin
To configure this new switch plugin called myswitch, you can use the same
principles as described above, but the rules block headers must be changed to something like this:
[switch_rules@myswitch:{rule_type}:{rule_type_param}]
or (depending on the rule type):
[switch_rules@myswitch:{rule_type}]
(we add @myswitch to target the additional switch plugin called myswitch)
what about @switch?
If you add the @switch suffix, you target the default system switch plugin. In that case,
you can omit the @switch suffix altogether.
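The section naming convention can be illustrated with Python's configparser. This is a sketch, not mfdata's own loader: read_rule_blocks() and the sample CONFIG are made up for the example. Two parser settings matter here: only = is used as a delimiter (so that a key such as /special/directory:bar.myrule2 is not split at the colon), and optionxform is overridden so that patterns keep their case.

```python
import configparser

CONFIG = """
[switch_rules:fnmatch:first.core.original_basename]
A* = step1*, step2

[switch_rules@myswitch:regex:first.core.original_dirname]
.foo$ = step3

[switch_rules@switch:alwaystrue]
whatever = main
"""


def read_rule_blocks(text):
    """Return a list of (switch name, rule type, rule param, rules dict)."""
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.optionxform = str  # keep "A*" as is instead of lowercasing it
    parser.read_string(text)
    blocks = []
    for section in parser.sections():
        head, _, rest = section.partition(":")
        if head != "switch_rules" and not head.startswith("switch_rules@"):
            continue  # not a switch rules block
        # No "@name" suffix means the default system switch plugin.
        switch_name = head.partition("@")[2] or "switch"
        rule_type, _, param = rest.partition(":")
        blocks.append((switch_name, rule_type, param or None, dict(parser[section])))
    return blocks


for block in read_rule_blocks(CONFIG):
    print(block)
```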
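With this kind of configuration, you can build very complex workflows with several levels of routing, for example by letting a step routed by the default switch plugin feed the myswitch plugin, which then applies its own rules.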