External Tags
Added in version 2.10.
This feature lets you attach additional metadata to crawled files, either by
placing a .meta.yml file in the directories being crawled or by pointing the
tags.staticMetaFilename setting at a global tag file.
Note
Only JSON and YAML files are supported.
Here is a list of Tags settings (under tags. prefix):
| Name | Environment Variable | Default value | Documentation |
|---|---|---|---|
| tags.metaFilename | | .meta.yml | Meta Filename |
| tags.staticMetaFilename | | | Static Metadata |
Tip
Use static metadata for configuration-level metadata that applies to all documents,
and use per-directory .meta.yml files for metadata specific to certain directories
or files.
Meta Filename
Whenever a directory is crawled, FSCrawler checks if a file named
.meta.yml is present in the directory. If it is, the content of this file is
used to enrich the document.
For example, if you have a file named .meta.yml in the directory
/path/to/data/dir:
external:
  myTitle: "My document title"
Then the document indexed will have a new field named external.myTitle with the value
My document title.
Only supported fields can be added to the document. If you try to add a field which is not supported, it will be ignored.
For example, if the .meta.yml file contains:
foo: "bar"
external:
  myTitle: "My document title"
The document indexed will have a new field named external.myTitle with the value
My document title. The field foo will be ignored.
If you really want to add a field named foo, you need to add it first as an external tag:
external:
  foo: "bar"
  myTitle: "My document title"
and then use an ingest pipeline to rename the external.foo field to foo. See Using Ingest Node Pipeline.
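As a rough illustration of the rename step, the following Python sketch builds the body of an Elasticsearch ingest pipeline that uses the rename processor to move external.foo to foo. The pipeline id rename_external_foo and the helper build_rename_pipeline are hypothetical names chosen for this example. The script only prints the request and does not contact a cluster. See Using Ingest Node Pipeline for how to attach a pipeline to FSCrawler.

```python
import json

# Hypothetical pipeline id; any valid id can be used instead.
PIPELINE_ID = "rename_external_foo"


def build_rename_pipeline(source_field: str, target_field: str) -> dict:
    """Return an ingest pipeline body that renames source_field to target_field."""
    return {
        "description": f"Rename {source_field} to {target_field}",
        "processors": [
            {
                "rename": {
                    "field": source_field,
                    "target_field": target_field,
                    # Skip documents that carry no such external tag.
                    "ignore_missing": True,
                }
            }
        ],
    }


pipeline = build_rename_pipeline("external.foo", "foo")

# Print the request you would send to Elasticsearch (for example from Kibana Dev Tools).
print(f"PUT _ingest/pipeline/{PIPELINE_ID}")
print(json.dumps(pipeline, indent=2))
```

The ignore_missing option is set so that documents from directories without a foo tag pass through the pipeline unchanged. This is a design choice for the sketch, not a requirement.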
The .meta.yml file can also overwrite existing fields. For example, if you have the following
.meta.yml file:
content: "HIDDEN"
Then the content field will be replaced by HIDDEN even though something else is extracted.
Note
The .meta.yml file is not indexed. It is only used to enrich the document.
You can use another filename for the external tags file. For example, if you want to use
meta_tags.json instead of .meta.yml, you can set:
fs:
  url: "/path/to/docs"
tags:
  metaFilename: "meta_tags.json"
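Since JSON files are supported as well, a meta_tags.json file would presumably carry the same structure as the YAML examples above. The sketch below writes such a file with Python's standard library. It assumes the JSON layout mirrors the YAML one key for key, and the helper name write_meta_file is hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_meta_file(directory: Path, metadata: dict, filename: str = "meta_tags.json") -> Path:
    """Write metadata as JSON into directory and return the path of the new file."""
    target = directory / filename
    target.write_text(json.dumps(metadata, indent=2) + "\n", encoding="utf-8")
    return target


# Same content as the earlier .meta.yml example, expressed as a Python dict.
metadata = {"external": {"myTitle": "My document title"}}

# A temporary directory stands in for a crawled directory such as /path/to/docs.
with tempfile.TemporaryDirectory() as tmp:
    path = write_meta_file(Path(tmp), metadata)
    print(path.name)
    print(path.read_text(encoding="utf-8"))
```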
Static Metadata
Added in version 2.10.
You can define static metadata that will be applied to all documents indexed by FSCrawler.
This is useful when you want to add the same metadata to every document without needing
to create a .meta.yml file in every directory.
For example, to add a hostname and an environment field to all documents, create a file
named /path/to/static_metadata.yml with the following content:
external:
  hostname: "server001"
  environment: "production"
Then, configure FSCrawler to use this static metadata file using the tags.staticMetaFilename setting:
fs:
  url: "/path/to/docs"
tags:
  staticMetaFilename: "/path/to/static_metadata.yml"
All documents indexed will have the fields external.hostname and external.environment
with the values server001 and production respectively.
Note
Static metadata is merged first, and then the content of a .meta.yml file is applied on top.
If the .meta.yml file sets a tag that the static metadata also defines, the .meta.yml
value takes precedence.
Example: If the static metadata file contains:
And the .meta.yml file contains:
The resulting document will have:
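The precedence rule can be illustrated with a small Python sketch. It is not FSCrawler's implementation. It assumes the two sources are merged key by key (a deep merge), and it uses hypothetical values: the static file from the example above, plus a .meta.yml that overrides environment with "staging" and adds myTitle.

```python
from copy import deepcopy


def merge_tags(static: dict, per_directory: dict) -> dict:
    """Apply per_directory on top of static; per_directory wins on conflicts."""
    merged = deepcopy(static)
    for key, value in per_directory.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides hold a mapping: merge them key by key (assumed behavior).
            merged[key] = merge_tags(merged[key], value)
        else:
            # Scalar, list, or new key: the per-directory value replaces the static one.
            merged[key] = deepcopy(value)
    return merged


# Content of the static metadata file (from the example above).
static_metadata = {"external": {"hostname": "server001", "environment": "production"}}

# Hypothetical .meta.yml content that overrides one tag and adds another.
meta_yml = {"external": {"environment": "staging", "myTitle": "My document title"}}

result = merge_tags(static_metadata, meta_yml)
print(result)
```

Under these assumptions, the document keeps external.hostname from the static file, takes external.environment from the .meta.yml file, and gains external.myTitle.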