Switch rules
1. Concepts
1.1 The switch plugin
With the mfdata module, there is a plugin called switch installed by default.
This default system switch plugin has a special, dynamically generated configuration. This configuration is generated from the switch rules read in the config.ini files of the installed plugins.
So, there is no editable configuration file for this plugin. Its configuration is generated from the other plugins' configuration files.
The goal of this switch plugin is to provide "loose coupling" and "dynamic business routing rules" between plugins.
It feeds other plugins depending on rules evaluated for each incoming file.
Let's say, for the sake of example, that we have 3 rules:
- a rule#1 (from the installed plugin plugin1) which is True if the incoming filename starts with A
- a rule#2 (from the installed plugin plugin2) which is True if the incoming filename starts with B
- a rule#3 (from the installed plugin plugin3) which is True if the incoming filename starts with AB
For an incoming file named Afoo, only rule#1 is True, so the switch plugin routes the incoming file to the plugin plugin1:
For an incoming file named Bbar, only rule#2 is True, so the switch plugin routes the incoming file to the plugin plugin2:
For an incoming file named Cbar, all rules are False, so the switch plugin just deletes the incoming file:
For an incoming file named ABfoo, two rules (rule#1 and rule#3) evaluate to True, so the switch plugin routes the file to both plugins (with a copy):
With a copy?
In this particular case, for performance reasons and to avoid too many copies, some optimizations are done (hardlinking, copying and moving the last one...), but as a first approximation you can consider that each plugin receives a copy of the incoming file.
This is "loose coupling" because switch rules are described in each plugin (and not in the switch plugin itself).
So if you remove the plugin plugin2 in this example (without changing anything else), a new configuration for the switch plugin is automatically generated with the two remaining rules, and if you inject another ABfoo filename, the routing will automatically change to:
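The routing logic of this example can be sketched in a few lines of Python. This is only an illustration of the principle, not the switch plugin's actual implementation: the RULES dictionary and the route() function are hypothetical names introduced here.

```python
# Hypothetical stand-ins for the three rules of the example above
# (one rule per installed plugin).
RULES = {
    "plugin1": lambda name: name.startswith("A"),
    "plugin2": lambda name: name.startswith("B"),
    "plugin3": lambda name: name.startswith("AB"),
}


def route(filename, rules):
    """Return the set of plugins whose rule is True for this filename."""
    return {plugin for plugin, rule in rules.items() if rule(filename)}


for name in ("Afoo", "Bbar", "Cbar", "ABfoo"):
    recipients = sorted(route(name, RULES))
    # An empty recipient list means the incoming file is deleted.
    print(name, "->", recipients or "deleted")

# Uninstalling plugin2 only removes its own rule: the rule set is simply
# rebuilt from the plugins that are still installed.
remaining = {p: r for p, r in RULES.items() if p != "plugin2"}
print("Bbar ->", sorted(route("Bbar", remaining)) or "deleted")
```

Because each rule lives with its own plugin, nothing else has to be edited when a plugin is added or removed.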
1.2 The guess_file_type plugin
This plugin is also installed by default. In the default configuration, this plugin listens to some directories and feeds the switch system plugin.
As its name suggests, it also tries to guess the file type with the file/libmagic unix tools.
So before giving the file to the switch plugin, it adds some interesting "tags" to the file. These tags are usable in switch rules to route (for example) files of a given type to a specific plugin.
Default configuration after a clean mfdata installation:
${MFMODULE_RUNTIME_HOME}?
In most cases, ${MFMODULE_RUNTIME_HOME}=/home/mfdata
After installing some plugins and configuring another "incoming" directory (see the mfdata configuration key [internal_plugins]/watched_directories), you can get something like this:
1.3 Tags
Tags are a kind of context for each file in a mfdata workflow, in the form of several key/value pairs. "Good" plugins keep the context from the beginning of the workflow to its end. So, you can trace the complete life of a file in the mfdata workflow.
Here is a little example:
0.guess_file_type.main.ascii_header = mer. juil. 8 12:11:06 CEST 2020
0.guess_file_type.main.enter_step = 2020-07-08T10:11:06:290229
0.guess_file_type.main.size = 33
0.guess_file_type.main.system_magic = ASCII text
1.switch.main.enter_step = 2020-07-08T10:11:06:304117
2.archive.main.enter_step = 2020-07-08T10:11:06:309646
first.core.original_basename = foobar
first.core.original_dirname = incoming
first.core.original_uid = 5479ab3ec3054a499ea81dfda5e7a2bd
latest.core.step_counter = 2
latest.guess_file_type.main.ascii_header = mer. juil. 8 12:11:06 CEST 2020
latest.guess_file_type.main.size = 33
latest.guess_file_type.main.system_magic = ASCII text
Tags follow the form: {step_number_in_the_workflow}.{plugin}.{step}.{tag_name}.
Notable exceptions are:
- {plugin}.{step} is replaced by core for "always available" tags
- {step_number_in_the_workflow} can be replaced or duplicated with:
  - first (for core tags whose values won't change during the workflow)
  - latest (for values corresponding to the latest passage in the given step)
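To make the naming scheme concrete, here is a minimal sketch that splits a tag name into its components. The function parse_tag_name() is a hypothetical helper written for this page (it is not part of mfdata), and it assumes that only the tag_name part may itself contain dots.

```python
def parse_tag_name(name):
    """Split a tag name into (position, plugin, step, tag_name).

    `position` is a step number ("0", "1", ...), "first" or "latest".
    For core tags, plugin is "core" and step is None.
    """
    position, rest = name.split(".", 1)
    if rest.startswith("core."):
        return position, "core", None, rest[len("core."):]
    plugin, step, tag_name = rest.split(".", 2)
    return position, plugin, step, tag_name


print(parse_tag_name("0.guess_file_type.main.size"))
print(parse_tag_name("first.core.original_basename"))
print(parse_tag_name("latest.guess_file_type.main.system_magic"))
```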
Most switch rules are a logical expression on values of some of these tags.
Particularly useful tags in switch rules are:
- latest.guess_file_type.main.system_magic (available only if the file passed through the guess_file_type plugin): the output of the file command on the file (for example: PNG image data, 48 x 47, 8-bit gray+alpha, non-interlaced)
- latest.guess_file_type.main.size (available only if the file passed through the guess_file_type plugin): the file size (in bytes)
- first.core.original_basename: the basename of the incoming file (at the very beginning of the workflow)
- first.core.original_dirname: the dirname of the incoming file (at the very beginning of the workflow)
- latest.guess_file_type.main.ascii_header (available only if the file passed through the guess_file_type plugin): the first 60 ascii characters of the file (ascii codes < 32 or > 126 are filtered out)
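As an illustration of the ascii_header filtering described above, here is a minimal sketch. It is not the guess_file_type implementation: the function name is hypothetical, and it assumes that the first 60 bytes are read first and then filtered (the other possible reading is to filter first and then keep 60 characters).

```python
def ascii_header(data, length=60):
    """Keep the printable ASCII bytes (codes 32..126) found in the
    first `length` bytes of `data` (assumed order: truncate, then filter)."""
    head = data[:length]
    return bytes(b for b in head if 32 <= b <= 126).decode("ascii")


# Control characters (newline, NUL...) and non-ASCII bytes are dropped.
print(ascii_header(b"abc\n\x00def\xff"))
```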
Important
As your own plugins can add some custom tags on the file context, you can also use these custom tags on switch rules to get workflows like this (for example):
2. How to set your switch rules?
2.1 Introduction
In the config.ini file of your plugin, you can add several rules blocks:
[switch_rules:{rule_type}:{rule_type_param}]
or (depending on the rule type):
[switch_rules:{rule_type}]
Each rules block defines a rule type and some rule parameters. Under a "rules block" you can have one or several switch rules.
A switch rule is a line like:
{pattern} = {step_name1}, {step_name2}*, {step_name3}, ...
What about this * sign after step_name2?
If a step name ends with a *, it means that the switch plugin can use hardlinking instead of copying (when there are multiple recipients for the file).
It's better for performance, but the target step must not alter the incoming file in any way.
So please DO NOT ADD THE STAR SIGN IF YOU ARE NOT SURE!
So, for a given pattern, you can have one or several copy actions (separated by commas). Each action means: copy (or hardlink) the incoming file to the given step.
What if there is only one recipient step?
If there is only one recipient step, the switch plugin will use a move instead of a copy for performance reasons.
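The syntax of a switch rule line and the choice between move, hardlink and copy can be sketched as follows. This is a simplified illustration, not the switch plugin's code: parse_rule_line() and plan_operations() are hypothetical helpers, the pattern is assumed not to contain an = sign, and further optimizations are left out.

```python
def parse_rule_line(line):
    """Parse '{pattern} = step1, step2*' into
    (pattern, [(step_name, hardlink_allowed), ...])."""
    pattern, _, actions = line.partition("=")
    recipients = []
    for action in actions.split(","):
        action = action.strip()
        if not action:
            continue
        hardlink_allowed = action.endswith("*")
        recipients.append((action.rstrip("*").strip(), hardlink_allowed))
    return pattern.strip(), recipients


def plan_operations(recipients):
    """Choose one operation per recipient step."""
    if len(recipients) == 1:
        # Single recipient: a move is enough, nothing has to be duplicated.
        return [(recipients[0][0], "move")]
    return [
        (step, "hardlink" if hardlink_allowed else "copy")
        for step, hardlink_allowed in recipients
    ]


pattern, recipients = parse_rule_line("A* = step1*, step2")
print(pattern, plan_operations(recipients))
```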
Evaluation principles
- all switch rules are evaluated in the context of their rules block. If the pattern matches (in this context), the collected actions are APPENDED
- there is no way to remove a step from the (already collected) recipient list
- all switch rules are systematically evaluated
- there is no particular order for rules evaluation
- if a given step appears several times in the final recipient list, duplicates are automatically removed (on the technical side, the "recipient list" is a set and not a list)
2.2 Example
Let's say you add these lines in your plugin config.ini:
[switch_rules:fnmatch:first.core.original_basename]
A* = step1*, step2
B* = step1*
[switch_rules:regex:first.core.original_dirname]
.foo$ = step3
We have two rules blocks:
- one of fnmatch type (with parameter: first.core.original_basename)
- one of regex type (with parameter: first.core.original_dirname)
2.2.1 First file
Let's say we have an incoming file with:
- first.core.original_basename (original basename of the file) = Bar
- first.core.original_dirname (original dirname of the file) = foo
For the first rules block, we try the pattern A* with fnmatch on the first.core.original_basename value (Bar).
It evaluates to False.
fnmatch?
The fnmatch rule type is a basic rule type which uses fnmatch patterns (a kind of "Unix shell-style wildcards"). These patterns are basic but very easy to learn and read.
Then we try the pattern B* with fnmatch on the first.core.original_basename value (Bar). It evaluates to True.
So step1 is added to the list of recipients for this file (hardlinking allowed because of the *). But we continue to evaluate the other rules.
For the second rules block, we try the pattern .foo$ with regex on the first.core.original_dirname value (foo).
It evaluates to False.
regex?
The regex rule type is a rule type which uses re patterns (regular expressions). These patterns are more powerful but harder to learn and read than fnmatch ones.
So for this first file, we have only one recipient, step1, and the following routing:
2.2.2 Second file
Let's say we have a different incoming file with:
- first.core.original_basename (original basename of the file) = Abar
- first.core.original_dirname (original dirname of the file) = incoming.foo
In the first rules block, the first line matches, so step1 and step2 are added to the recipient list.
In the second rules block, the rule also matches, so we add step3 to the recipient list.
So for this second file, we have three recipients (step1, step2, step3) and the following routing:
optimization
The previous diagram is not totally accurate, as the switch plugin will probably replace the last copy operation with a move operation (for performance reasons).
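The two walkthroughs of section 2.2 can be reproduced with the short sketch below. It only illustrates the evaluation principles (every rule is evaluated, recipients are accumulated in a set). It is not how the switch plugin is implemented: load_rules() and recipients_for() are hypothetical helpers, the star suffix is simply dropped, and the regex type is assumed to use "search" semantics, which is what the examples of this page imply.

```python
import configparser
import fnmatch
import re

CONFIG = """
[switch_rules:fnmatch:first.core.original_basename]
A* = step1*, step2
B* = step1*

[switch_rules:regex:first.core.original_dirname]
.foo$ = step3
"""


def load_rules(text):
    """Return a list of (rule_type, tag_name, pattern, steps) tuples."""
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.optionxform = str  # keep the case of patterns ("A*" is not "a*")
    parser.read_string(text)
    rules = []
    for section in parser.sections():
        parts = section.split(":", 2)
        if parts[0] != "switch_rules" or len(parts) != 3:
            continue
        for pattern, actions in parser.items(section):
            steps = {a.strip().rstrip("*") for a in actions.split(",")}
            rules.append((parts[1], parts[2], pattern, steps))
    return rules


def recipients_for(rules, tags):
    """Evaluate every rule and accumulate the recipient steps in a set."""
    found = set()
    for rule_type, tag_name, pattern, steps in rules:
        value = tags.get(tag_name, "")
        if rule_type == "fnmatch":
            matched = fnmatch.fnmatchcase(value, pattern)
        else:  # "regex" (assumed search semantics)
            matched = re.search(pattern, value) is not None
        if matched:
            found |= steps  # actions are appended, never removed
    return found


rules = load_rules(CONFIG)
first_file = {"first.core.original_basename": "Bar",
              "first.core.original_dirname": "foo"}
second_file = {"first.core.original_basename": "Abar",
               "first.core.original_dirname": "incoming.foo"}
print(sorted(recipients_for(rules, first_file)))
print(sorted(recipients_for(rules, second_file)))
```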
3. Available rule block types
3.1 equal
[switch_rules:equal:first.core.original_basename]
foo = main
It will return True only if the given tag value (first.core.original_basename in this example) is equal to foo (python == operator).
bytes/utf8?
In MetWork < 1.0, tag values were compared as bytes (so strings were often prefixed by b""). In MetWork >= 1.0, tag values are automatically decoded as utf8 strings. So use standard plain strings here.
3.2 fnmatch
[switch_rules:fnmatch:first.core.original_basename]
A* = step1*, step2
The fnmatch rule type is a basic rule type which uses fnmatch patterns (a kind of "Unix shell-style wildcards"). These patterns are basic but very easy to learn and read.
The example will return True if the given tag value (first.core.original_basename in this example) starts with A.
3.3 regex
[switch_rules:regex:first.core.original_dirname]
.foo$ = step3
The regex rule type is a rule type which uses re patterns (regular expressions). These patterns are more powerful but harder to learn and read than fnmatch ones.
The example will return True if the given tag value (first.core.original_dirname in this example) ends with .foo. Note that the unescaped . matches any character; write \.foo$ to require a literal dot.
3.4 notequal
[switch_rules:notequal:first.core.original_basename]
foo = main
This is the inverse of the equal rule.
Files whose original_basename is different from foo will be routed to the main step.
3.5 notfnmatch
[switch_rules:notfnmatch:first.core.original_basename]
A* = step1*, step2
This is the inverse of the fnmatch rule.
Files whose original_basename doesn't start with A will be routed to the step1 and step2 steps.
3.6 notregex
[switch_rules:notregex:first.core.original_dirname]
.foo$ = step3
This is the inverse of the regex rule.
This is for completeness only, as you can express a negative regex in the regex pattern itself.
3.7 alwaystrue
[switch_rules:alwaystrue]
whatever = main
This rule is always True, so the main step will be in the recipient list for all files.
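The seven rule types described so far can be summarized by the following sketch. It is an illustration under stated assumptions rather than the switch plugin's code: evaluate() is a hypothetical function, fnmatch matching is assumed to be case-sensitive, and the regex type is assumed to use "search" semantics.

```python
import fnmatch
import re


def evaluate(rule_type, pattern, value=None):
    """Return True if `value` satisfies `pattern` for the given rule type."""
    if rule_type == "alwaystrue":
        return True  # the pattern ("whatever") is ignored
    negate = rule_type.startswith("not")
    base = rule_type[3:] if negate else rule_type
    if base == "equal":
        result = value == pattern
    elif base == "fnmatch":
        result = fnmatch.fnmatchcase(value, pattern)
    elif base == "regex":
        result = re.search(pattern, value) is not None
    else:
        raise ValueError(f"unknown rule type: {rule_type}")
    # notequal, notfnmatch and notregex simply invert the base result.
    return not result if negate else result


print(evaluate("equal", "foo", "foo"))
print(evaluate("notfnmatch", "A*", "Bar"))
print(evaluate("regex", ".foo$", "incoming.foo"))
```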
3.8 python
[switch_rules:python]
foo.myrule = step1
/special/directory:bar.myrule2 = step2
This is probably the most important rule type, as it is the most flexible and the most powerful.
With the line foo.myrule = step1, you ask the switch plugin to call the function myrule() in the file foo.py at the root of your plugin directory. If the function returns True for the given file, step1 will be added to the recipient list.
With the line /special/directory:bar.myrule2 = step2, you ask it to call the function myrule2() in the file bar.py located in the /special/directory/ directory.
The prototype of the function myrule() (or myrule2()) is very basic. Here is an example:
def myrule(xaf):
    # xaf is the incoming file with its context (XattrFile object)
    # see https://github.com/metwork-framework/xattrfile library
    original_basename = xaf.tags.get('original_basename', None)
    original_dirname = xaf.tags.get('original_dirname', None)
    if original_basename and original_dirname:
        # WARNING: original_basename and original_dirname are bytes
        if original_basename.startswith(b'A'):
            if original_dirname.startswith(b'B'):
                return True
    return False
Warning
This function should be fast and must not rely on external services, as its call can block the whole switch plugin.
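Since such a function only reads xaf.tags, it can be tried outside mfdata before installing the plugin. The sketch below uses FakeXaf, a hypothetical stand-in class written for this page (it is not the real XattrFile class and only exposes a tags mapping with bytes values), together with a copy of the example function above.

```python
class FakeXaf:
    """Hypothetical stand-in for an XattrFile: only a `tags` mapping."""

    def __init__(self, tags):
        self.tags = tags


def myrule(xaf):
    # Same logic as the example rule above.
    original_basename = xaf.tags.get('original_basename', None)
    original_dirname = xaf.tags.get('original_dirname', None)
    if original_basename and original_dirname:
        # Tag values are bytes, hence the b'...' prefixes.
        if original_basename.startswith(b'A'):
            if original_dirname.startswith(b'B'):
                return True
    return False


matching = FakeXaf({'original_basename': b'Abar', 'original_dirname': b'Bdir'})
other = FakeXaf({'original_basename': b'Bar', 'original_dirname': b'Bdir'})
print(myrule(matching), myrule(other))
```

A rule function that behaves as expected on a few hand-built contexts is less likely to surprise you once it runs inside the switch plugin.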
4. Advanced usage
4.1 Without any switch plugin
In some use cases (for example, you have only one plugin which takes care of all incoming files and you are sure it will never change), you can work without any switch or guess_file_type plugin.
To do that, set in the mfdata configuration:
- install_switch=0 under the [internal_plugins] section
- install_guess_file_type=0 under the [internal_plugins] section
(if you don't do that, these two plugins will be reinstalled after a services restart)
Then (only if they were already installed), remove them with:
- plugins.uninstall switch (as the mfdata user)
- plugins.uninstall guess_file_type (as the mfdata user)
Last, change your plugin config.ini file to add a new watched directory. For example, set watched_directories={MFDATA_CURRENT_STEP_DIR},incoming
Warning
Please don't remove {MFDATA_CURRENT_STEP_DIR} from the watched_directories key unless you know exactly what you are doing.
You will get the following workflow:
4.2 With multiple switch plugins
Starting with MetWork 1.0, for very complex workflows, you can use several switch plugins.
To install another switch plugin, you first have to find the released .plugin file corresponding to the switch plugin.
You will find the full path with: ls ${MFMODULE_HOME}/share/plugins/switch-*.plugin
${MFMODULE_HOME}?
In most cases, ${MFMODULE_HOME}=/opt/metwork-mfdata
Then install the switch plugin a second time (but with another name). For example:
plugins.install --new-name=myswitch /full/path/of/the/switch.plugin
To configure this new switch plugin called myswitch, you can use the same principles described above, but the rules block headers must be changed to something like this:
[switch_rules@myswitch:{rule_type}:{rule_type_param}]
or (depending on the rule type):
[switch_rules@myswitch:{rule_type}]
(we add @myswitch to target the additional switch plugin called myswitch)
What about @switch?
If you add the @switch suffix, you target the default system switch plugin. In that case, you can omit the @switch suffix altogether.
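The block header syntax, including the optional @name suffix and its default value, can be sketched as follows. parse_block_header() is a hypothetical helper written for this page, not a function provided by mfdata.

```python
def parse_block_header(header):
    """Parse 'switch_rules[@name]:{rule_type}[:{rule_type_param}]'.

    Returns (switch_plugin_name, rule_type, rule_type_param);
    rule_type_param is None for rule types without parameter.
    """
    parts = header.split(":", 2)
    prefix, name = parts[0], "switch"  # default system switch plugin
    if "@" in prefix:
        prefix, name = prefix.split("@", 1)
    if prefix != "switch_rules" or len(parts) < 2:
        raise ValueError(f"not a switch rules block: {header!r}")
    rule_type = parts[1]
    rule_type_param = parts[2] if len(parts) == 3 else None
    return name, rule_type, rule_type_param


print(parse_block_header("switch_rules:fnmatch:first.core.original_basename"))
print(parse_block_header("switch_rules@myswitch:python"))
```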
With this kind of configuration, you can build very complex workflows with several levels of routing like: