Local FS settings

Here is the list of Local FS settings (under the fs. prefix):

Name                     Default value    Documentation
fs.url                   "/tmp/es"        Root directory
fs.update_rate           "15m"            Update rate
fs.includes              null             Includes and excludes
fs.excludes              ["*/~*"]         Includes and excludes
fs.filters               null             Filter content
fs.json_support          false            Indexing JSon docs
fs.xml_support           false            Indexing XML docs
fs.add_as_inner_object   false            Add as Inner Object
fs.index_folders         true             Index folders
fs.attributes_support    false            Adding file attributes
fs.raw_metadata          false            Adding raw metadata
fs.filename_as_id        false            Using filename as elasticsearch _id
fs.add_filesize          true             Disabling file size field
fs.remove_deleted        true             Ignore deleted files
fs.store_source          false            Storing binary source document
fs.index_content         true             Ignore content
fs.lang_detect           false            Language detection
fs.continue_on_error     false            Continue on Error
fs.ocr.pdf_strategy      ocr_and_text     OCR integration
fs.indexed_chars         100000.0         Extracted characters
fs.ignore_above          null             Ignore above
fs.checksum              null             File checksum
fs.follow_symlinks       false            Follow Symlinks

Root directory

Define the fs.url property in your ~/.fscrawler/test/_settings.yaml file:

name: "test"
fs:
  url: "/path/to/data/dir"

For Windows users, use a form like c:/tmp or c:\\tmp.
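
For example, the same setting with a Windows path (using the c:/tmp form from above):

name: "test"
fs:
  url: "c:/tmp"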

Update rate

By default, update_rate is set to 15m. You can modify this value using any compatible time unit.

For example, here is a 15-minute update rate:

name: "test"
fs:
  update_rate: "15m"

Or a 3-hour update rate:

name: "test"
fs:
  update_rate: "3h"

update_rate is the pause duration between the end of one file system scan and the start of the next one. This means that if you set it to 15m, the next scan will happen 15 minutes after the end of the current scan, whatever its duration. For example, if a scan starts at 12:00 and finishes at 12:10, the next scan will start at 12:25.

Includes and excludes

Let’s say you want to index only docs matching *.doc and *.pdf, but not resume* files. So resume_david.pdf won’t be indexed.

Define the fs.includes and fs.excludes properties in your ~/.fscrawler/test/_settings.yaml file:

name: "test"
fs:
  includes:
  - "*/*.doc"
  - "*/*.pdf"
  excludes:
  - "*/resume*"

By default, FSCrawler will exclude files starting with ~.

New in version 2.5.

Includes and excludes also apply to directory names. So if you want to ignore the .ignore dir, just add .ignore as an excluded name.

Let’s take the following example with the root dir as /tmp:

/tmp
├── folderA
│   ├── subfolderA
│   ├── subfolderB
│   └── subfolderC
├── folderB
│   ├── subfolderA
│   ├── subfolderB
│   └── subfolderC
└── folderC
    ├── subfolderA
    ├── subfolderB
    └── subfolderC

If you define the following fs.excludes property in your ~/.fscrawler/test/_settings.yaml file:

name: "test"
fs:
  excludes:
  - "/folderB/subfolder*"

Then all files but the ones in /folderB/subfolderA, /folderB/subfolderB and /folderB/subfolderC will be indexed.

Since includes and excludes are matched against the entire path of the file, you must take that into account when using wildcards. Below are some include and exclude patterns to help convey the idea.

Pattern         Includes                                        Excludes
*.jpg           Include all jpg files                           Exclude all jpg files
/images/*.jpg   Include all jpg files in the images directory   Exclude all jpg files in the images directory
*/old-*.jpg     Include all jpg files that start with old-      Exclude all jpg files that start with old-

New in version 2.6.

If a folder contains a file named .fscrawlerignore, this folder and its subfolders will be entirely skipped.

Filter content

New in version 2.5.

You can filter which documents get indexed by adding one or more regular expressions that the extracted content must match. Documents whose content does not match are simply ignored and not indexed.

If you define the following fs.filters property in your ~/.fscrawler/test/_settings.yaml file:

name: "test"
fs:
  filters:
  - ".*foo.*"
  - "^4\\d{3}([\\ \\-]?)\\d{4}\\1\\d{4}\\1\\d{4}$"

With this example, only documents that contain the word foo and a VISA credit card number in a form like 4012888888881881, 4012 8888 8888 1881 or 4012-8888-8888-1881 will be indexed.

Indexing JSon docs

If you want to index JSon files directly without parsing them with Tika, you can set json_support to true. JSon contents will be stored directly under _source. If you need to keep JSon documents synchronized to the index, set the Add as Inner Object option, which stores additional metadata and the JSon contents under the field object.

name: "test"
fs:
  json_support: true

Of course, if you did not define a mapping before launching the crawler, Elasticsearch will automatically guess the mapping.

Indexing XML docs

New in version 2.2.

If you want to index XML files and convert them to JSON, you can set xml_support to true. The content of XML files will be added directly under _source. If you need to keep XML documents synchronized to the index, set the Add as Inner Object option, which stores additional metadata and the XML contents under the field object.

name: "test"
fs:
  xml_support: true

Of course, if you did not define a mapping before launching the crawler, Elasticsearch will automatically guess the mapping.

Add as Inner Object

By default, the contents of json and xml documents are stored directly in the _source element of the elasticsearch document. As a result, no metadata about file and path is stored, yet this metadata is necessary to determine whether a document has been deleted or updated. New files will however still be added to the index (determined by the file timestamp).

If you need to keep json or xml documents synchronized to elasticsearch, you should set this option.

name: "test"
fs:
  add_as_inner_object: true
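
As a sketch of the resulting structure (the field names inside object and the exact metadata fields depend on your documents and on FSCrawler's document model; the values below are illustrative), an indexed document then looks roughly like:

{
  "object" : {
    "title" : "your original JSon content here"
  },
  "file" : {
    "filename" : "doc1.json"
  }
  // ... other file and path metadata here
}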

Index folders

New in version 2.2.

By default, FSCrawler will index folder names in the folder index. If you don’t want to index those folders, you can set index_folders to false.

Note that in that case, FSCrawler won’t be able to detect removed folders, so any document that has been indexed in elasticsearch won’t be removed when you remove or move the folder away.

See elasticsearch.index_folder below for the name of the index used to store the folder data (if fs.index_folders is set to true).

name: "test"
fs:
  index_folders: false

Dealing with multiple types and multiple dirs

If you have more than one type, create as many crawlers as types and/or folders:

~/.fscrawler/test_type1/_settings.yaml:

name: "test_type1"
fs:
  url: "/tmp/type1"
  json_support: true
elasticsearch:
  index: "mydocs1"
  index_folder: "myfolders1"

~/.fscrawler/test_type2/_settings.yaml:

name: "test_type2"
fs:
  url: "/tmp/type2"
  json_support: true
elasticsearch:
  index: "mydocs2"
  index_folder: "myfolders2"

~/.fscrawler/test_type3/_settings.yaml:

name: "test_type3"
fs:
  url: "/tmp/type3"
  xml_support: true
elasticsearch:
  index: "mydocs3"
  index_folder: "myfolders3"

Dealing with multiple types within the same dir

You can also index many types from one single dir using several crawlers scanning the same dir and setting the includes parameter:

~/.fscrawler/test_type1.yaml:

name: "test_type1"
fs:
  url: "/tmp"
  includes:
  - "type1*.json"
  json_support: true
elasticsearch:
  index: "mydocs1"
  index_folder: "myfolders1"

~/.fscrawler/test_type2.yaml:

name: "test_type2"
fs:
  url: "/tmp"
  includes:
  - "type2*.json"
  json_support: true
elasticsearch:
  index: "mydocs2"
  index_folder: "myfolders2"

~/.fscrawler/test_type3.yaml:

name: "test_type3"
fs:
  url: "/tmp"
  includes:
  - "*.xml"
  xml_support: true
elasticsearch:
  index: "mydocs3"
  index_folder: "myfolders3"

Using filename as elasticsearch _id

Please note that the document _id is generated as a hash value of the filename to avoid issues with special characters in filenames. You can force the _id to be the filename instead by using the filename_as_id attribute:

name: "test"
fs:
  filename_as_id: true

Adding file attributes

If you want to add file attributes such as attributes.owner, attributes.group and attributes.permissions, you can set attributes_support to true.

name: "test"
fs:
  attributes_support: true

Note

On Windows systems, attributes.group and attributes.permissions are not generated.
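
For illustration, the generated document then contains an attributes object along these lines (owner, group and permissions are the fields named above; the values and their exact format here are made up):

{
  "attributes" : {
    "owner" : "david",
    "group" : "staff",
    "permissions" : 644
  }
}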

Adding raw metadata

FSCrawler can extract all found metadata within a meta.raw object in addition to the standard metadata fields. If you want to enable this feature, you can set raw_metadata to true.

name: "test"
fs:
  raw_metadata: true

Generated raw metadata depends on the file format itself.

For example, a PDF document could generate:

{
   "date" : "2016-07-07T08:37:42Z",
   "pdf:PDFVersion" : "1.5",
   "xmp:CreatorTool" : "Microsoft Word",
   "Keywords" : "keyword1, keyword2",
   "access_permission:modify_annotations" : "true",
   "access_permission:can_print_degraded" : "true",
   "subject" : "Test Tika Object",
   "dc:creator" : "David Pilato",
   "dcterms:created" : "2016-07-07T08:37:42Z",
   "Last-Modified" : "2016-07-07T08:37:42Z",
   "dcterms:modified" : "2016-07-07T08:37:42Z",
   "dc:format" : "application/pdf; version=1.5",
   "title" : "Test Tika title",
   "Last-Save-Date" : "2016-07-07T08:37:42Z",
   "access_permission:fill_in_form" : "true",
   "meta:save-date" : "2016-07-07T08:37:42Z",
   "pdf:encrypted" : "false",
   "dc:title" : "Test Tika title",
   "modified" : "2016-07-07T08:37:42Z",
   "cp:subject" : "Test Tika Object",
   "Content-Type" : "application/pdf",
   "X-Parsed-By" : "org.apache.tika.parser.DefaultParser",
   "creator" : "David Pilato",
   "meta:author" : "David Pilato",
   "dc:subject" : "keyword1, keyword2",
   "meta:creation-date" : "2016-07-07T08:37:42Z",
   "created" : "Thu Jul 07 10:37:42 CEST 2016",
   "access_permission:extract_for_accessibility" : "true",
   "access_permission:assemble_document" : "true",
   "xmpTPg:NPages" : "2",
   "Creation-Date" : "2016-07-07T08:37:42Z",
   "access_permission:extract_content" : "true",
   "access_permission:can_print" : "true",
   "meta:keyword" : "keyword1, keyword2",
   "Author" : "David Pilato",
   "access_permission:can_modify" : "true"
}

Whereas an MP3 file would generate:

{
   "xmpDM:genre" : "Vocal",
   "X-Parsed-By" : "org.apache.tika.parser.DefaultParser",
   "creator" : "David Pilato",
   "xmpDM:album" : "FS Crawler",
   "xmpDM:trackNumber" : "1",
   "xmpDM:releaseDate" : "2016",
   "meta:author" : "David Pilato",
   "xmpDM:artist" : "David Pilato",
   "dc:creator" : "David Pilato",
   "xmpDM:audioCompressor" : "MP3",
   "title" : "Test Tika",
   "xmpDM:audioChannelType" : "Stereo",
   "version" : "MPEG 3 Layer III Version 1",
   "xmpDM:logComment" : "Hello but reverted",
   "xmpDM:audioSampleRate" : "44100",
   "channels" : "2",
   "dc:title" : "Test Tika",
   "Author" : "David Pilato",
   "xmpDM:duration" : "1018.775146484375",
   "Content-Type" : "audio/mpeg",
   "samplerate" : "44100"
}

Note

All fields are generated as text even though they can be valid booleans or numbers.

The meta.raw.* fields have a default mapping applied:

{
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}

If you want to specifically tell elasticsearch to use a date type or a numeric type for some fields, you need to modify the default template provided by FSCrawler.
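
For example, to have meta.raw.dcterms:created indexed as a real date, the entry for that field in the template mapping could be changed to something like the following sketch (dcterms:created is taken from the PDF example above):

{
  "properties": {
    "meta": {
      "properties": {
        "raw": {
          "properties": {
            "dcterms:created": {
              "type": "date"
            }
          }
        }
      }
    }
  }
}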

Note

Note that dots in metadata names will be replaced by a :. For example PTEX.Fullbanner will be indexed as PTEX:Fullbanner.

Note

Note that if you have a lot of different types of files, they can generate a lot of raw metadata, which can make you hit the total number of fields limit in elasticsearch mappings. In that case you will need to change the index.mapping.total_fields.limit index setting. See the elasticsearch documentation.

Disabling file size field

By default, FSCrawler will create a field to store the original file size in octets. You can disable it using the add_filesize option:

name: "test"
fs:
  add_filesize: false

Ignore deleted files

If you don’t want to remove indexed documents when you remove a file or a directory, you can set remove_deleted to false (defaults to true):

name: "test"
fs:
  remove_deleted: false

Ignore content

If you don’t want to extract file content but only index filesystem metadata such as filename, date, size and path, you can set index_content to false (defaults to true):

name: "test"
fs:
  index_content: false

Continue on Error

New in version 2.3.

By default, FSCrawler will immediately stop indexing if it hits a Permission denied exception. If you want to just skip that file and continue with the rest of the directory tree, you can set continue_on_error to true (defaults to false):

name: "test"
fs:
  continue_on_error: true

Language detection

New in version 2.2.

You can ask for language detection using the lang_detect option:

name: "test"
fs:
  lang_detect: true

In that case, a new field named meta.language is added to the generated JSon document.
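
For example, a document with French content would then contain something like this (the content value is illustrative):

{
  "content" : "Bonjour tout le monde",
  "meta" : {
    "language" : "fr"
  }
  // ... other fields here
}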

If you are using elasticsearch 5.0 or later, you can use this value to send your document to a specific index using a node ingest pipeline.

For example, you can define a pipeline named langdetect with:

PUT _ingest/pipeline/langdetect
{
  "description" : "langdetect pipeline",
  "processors" : [
    {
      "set": {
        "field": "_index",
        "value": "myindex-{{meta.language}}"
      }
    }
  ]
}

In FSCrawler settings, set both fs.lang_detect and elasticsearch.pipeline options:

name: "test"
fs:
  lang_detect: true
elasticsearch:
  pipeline: "langdetect"

And then, a document containing French text will be sent to myindex-fr. A document containing English text will be sent to myindex-en.

You can also imagine changing the field name from content to content-fr or content-en. That will help you to define the correct analyzer to use.
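
One way to sketch that is with the ingest rename processor (the pipeline name langfield is hypothetical; check that your elasticsearch version supports template snippets in target_field):

PUT _ingest/pipeline/langfield
{
  "description" : "rename content field per language",
  "processors" : [
    {
      "rename": {
        "field": "content",
        "target_field": "content-{{meta.language}}"
      }
    }
  ]
}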

Language detection might detect more than one language in a given text, but only the most probable one will be set. This means that if you have a document containing 80% French and 20% English, the document will be marked as fr.

Note that language detection is CPU and time consuming.

Storing binary source document

You can store the binary document itself (BASE64 encoded) in elasticsearch using the store_source option:

name: "test"
fs:
  store_source: true

In that case, a new field named attachment is added to the generated JSon document. This field is not indexed. The default mapping for the attachment field is:

{
  "_doc" : {
    "properties" : {
      "attachment" : {
        "type" : "binary",
        "doc_values" : false
      }
      // ... Other properties here
    }
  }
}

Extracted characters

By default, FSCrawler will extract only the first 100,000 characters. But you can set indexed_chars to 5000 in FSCrawler settings in order to override this default:

name: "test"
fs:
  indexed_chars: "5000"

This number can be either a fixed number of characters or a percentage using the % sign. A percentage is applied to the file size to determine the number of characters the crawler needs to extract.

If you want to index only 80% of the file size, set indexed_chars to "80%", as shown below. Of course, if you want to index the full document, you can set this property to "100%". Decimal values are also supported, so "0.01%" is also a correct value.
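
For example:

name: "test"
fs:
  indexed_chars: "80%"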

Compressed files: If your file is compressed, you might need to increase indexed_chars to more than "100%". For example, "150%".

If you want to extract the full content, define indexed_chars to "-1".

Note

Tika needs to allocate an in-memory data structure to extract text. Setting indexed_chars to a high number will require more memory!

Ignore above

New in version 2.5.

By default (if index_content is set to true), FSCrawler will send every single file to Tika, whatever its size. But some files on your file system might be way too big to parse.

Set ignore_above to the desired size limit:

name: "test"
fs:
  ignore_above: "512mb"

File checksum

If you want FSCrawler to generate a checksum for each file, set checksum to the algorithm you wish to use to compute the checksum, such as MD5 or SHA-1.

Note

You MUST set index_content to true for this feature to work. Nevertheless, you MAY set indexed_chars to 0 if you do not need any content in the index.

You also MUST NOT set json_support or xml_support for this feature to work.

name: "test"
fs:
  # required
  index_content: true
  #indexed_chars: 0
  checksum: "MD5"