Local FS settings

Here is the list of Local FS settings (under the fs. prefix):

Name                     Default value    Documentation
fs.url                   "/tmp/es"        Root directory
fs.update_rate           "15m"            Update rate
fs.includes              null             Includes and excludes
fs.excludes              ["*/~*"]         Includes and excludes
fs.filters               null             Filter content
fs.json_support          false            Indexing JSon docs
fs.xml_support           false            Indexing XML docs
fs.add_as_inner_object   false            Add as Inner Object
fs.index_folders         true             Index folders
fs.attributes_support    false            Adding file attributes
fs.raw_metadata          false            Adding raw metadata
fs.filename_as_id        false            Using filename as elasticsearch _id
fs.add_filesize          true             Disabling file size field
fs.remove_deleted        true             Ignore deleted files
fs.store_source          false            Storing binary source document
fs.index_content         true             Ignore content
fs.lang_detect           false            Language detection
fs.continue_on_error     false            Continue on Error
fs.ocr.pdf_strategy      ocr_and_text     OCR integration
fs.indexed_chars         100000.0         Extracted characters
fs.ignore_above          null             Ignore above
fs.checksum              null             File checksum
fs.follow_symlinks       false            Follow Symlinks

Root directory

Define the fs.url property in your ~/.fscrawler/test/_settings.yaml file:

name: "test"
fs:
  url: "/path/to/data/dir"

For Windows users, use a form like c:/tmp or c:\\tmp.
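
For example, the same setting with a Windows path (using the c:/tmp form from above):

name: "test"
fs:
  url: "c:/tmp"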

Update rate

By default, update_rate is set to 15m. You can modify this value using any compatible time unit.

For example, here is a 15-minute update rate:

name: "test"
fs:
  update_rate: "15m"

Or a 3-hour update rate:

name: "test"
fs:
  update_rate: "3h"

update_rate is the pause duration between the end of one file system scan and the start of the next one. This means that if you set it to 15m, the next scan will happen 15 minutes after the end of the current scan, whatever its duration. For example, if a scan starts at 12:00 and finishes at 12:10, the next scan will start at 12:25.

Includes and excludes

Let’s say you want to index only docs matching *.doc and *.pdf, but not resume* files. So resume_david.pdf won’t be indexed.

Define the fs.includes and fs.excludes properties in your ~/.fscrawler/test/_settings.yaml file:

name: "test"
fs:
  includes:
  - "*/*.doc"
  - "*/*.pdf"
  excludes:
  - "*/resume*"

By default, FSCrawler will exclude files starting with ~.

New in version 2.5.

Includes and excludes also apply to directory names. So if you want to ignore the .ignore dir, just add .ignore as an excluded name.

Let’s take the following example with the root dir as /tmp:

/tmp
├── folderA
│   ├── subfolderA
│   ├── subfolderB
│   └── subfolderC
├── folderB
│   ├── subfolderA
│   ├── subfolderB
│   └── subfolderC
└── folderC
    ├── subfolderA
    ├── subfolderB
    └── subfolderC

If you define the following fs.excludes property in your ~/.fscrawler/test/_settings.yaml file:

name: "test"
fs:
  excludes:
  - "/folderB/subfolder*"

Then all files but the ones in /folderB/subfolderA, /folderB/subfolderB and /folderB/subfolderC will be indexed.

Since includes and excludes are matched against the entire path of the file, you must take that into account when using wildcards. Below are some include and exclude patterns to help convey the idea.

Pattern         Includes                                        Excludes
*.jpg           Include all jpg files                           Exclude all jpg files
/images/*.jpg   Include all jpg files in the images directory   Exclude all jpg files in the images directory
*/old-*.jpg     Include all jpg files that start with old-      Exclude all jpg files that start with old-

New in version 2.6.

If a folder contains a file named .fscrawlerignore, this folder and its subfolders will be entirely skipped.

Filter content

New in version 2.5.

You can filter which documents get indexed by adding one or more regular expressions that the extracted content must match. Documents whose content does not match are simply ignored and not indexed.

If you define the following fs.filters property in your ~/.fscrawler/test/_settings.yaml file:

name: "test"
fs:
  filters:
  - ".*foo.*"
  - "^4\\d{3}([\\ \\-]?)\\d{4}\\1\\d{4}\\1\\d{4}$"

With this example, only documents that contain the word foo and a VISA credit card number in a form like 4012888888881881, 4012 8888 8888 1881 or 4012-8888-8888-1881 will be indexed.

Indexing JSon docs

If you want to index JSon files directly without parsing them with Tika, you can set json_support to true. JSon contents will be stored directly under _source. If you need to keep JSon documents synchronized to the index, set the Add as Inner Object option, which stores additional metadata and the JSon contents under the field object.

name: "test"
fs:
  json_support: true

Of course, if you did not define a mapping before launching the crawler, Elasticsearch will automatically guess the mapping.

Indexing XML docs

New in version 2.2.

If you want to index XML files and convert them to JSON, you can set xml_support to true. The content of XML files will be added directly under _source. If you need to keep XML documents synchronized to the index, set the Add as Inner Object option, which stores additional metadata and the XML contents under the field object.

name: "test"
fs:
  xml_support: true

Of course, if you did not define a mapping before launching the crawler, Elasticsearch will automatically guess the mapping.

Add as Inner Object

By default, the contents of json and xml documents are stored directly in the _source element of the elasticsearch document. As a result, no metadata about file and path is stored, yet this metadata is necessary to determine whether a document has been deleted or updated. New files will however still be added to the index (determined by the file timestamp).

If you need to keep json or xml documents synchronized to elasticsearch, you should set this option.

name: "test"
fs:
  add_as_inner_object: true
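
As a sketch of the resulting structure (the field names inside object and the exact metadata fields depend on your documents and on FSCrawler's document model; the values below are illustrative), an indexed document then looks roughly like:

{
  "object" : {
    "title" : "your original JSon content here"
  },
  "file" : {
    "filename" : "doc1.json"
  }
  // ... other file and path metadata here
}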

Index folders

New in version 2.2.

By default, FSCrawler will index folder names in the folder index. If you don’t want to index those folders, you can set index_folders to false.

Note that in that case, FSCrawler won’t be able to detect removed folders, so any document that has been indexed in elasticsearch won’t be removed when you remove or move the folder away.

See elasticsearch.index_folder below for the name of the index used to store the folder data (if fs.index_folders is set to true).

name: "test"
fs:
  index_folders: false

Dealing with multiple types and multiple dirs

If you have more than one type, create as many crawlers as types and/or folders:

~/.fscrawler/test_type1/_settings.yaml:

name: "test_type1"
fs:
  url: "/tmp/type1"
  json_support: true
elasticsearch:
  index: "mydocs1"
  index_folder: "myfolders1"

~/.fscrawler/test_type2/_settings.yaml:

name: "test_type2"
fs:
  url: "/tmp/type2"
  json_support: true
elasticsearch:
  index: "mydocs2"
  index_folder: "myfolders2"

~/.fscrawler/test_type3/_settings.yaml:

name: "test_type3"
fs:
  url: "/tmp/type3"
  xml_support: true
elasticsearch:
  index: "mydocs3"
  index_folder: "myfolders3"

Dealing with multiple types within the same dir

You can also index many types from one single dir using several crawlers scanning the same dir and setting the includes parameter:

~/.fscrawler/test_type1.yaml:

name: "test_type1"
fs:
  url: "/tmp"
  includes:
  - "type1*.json"
  json_support: true
elasticsearch:
  index: "mydocs1"
  index_folder: "myfolders1"

~/.fscrawler/test_type2.yaml:

name: "test_type2"
fs:
  url: "/tmp"
  includes:
  - "type2*.json"
  json_support: true
elasticsearch:
  index: "mydocs2"
  index_folder: "myfolders2"

~/.fscrawler/test_type3.yaml:

name: "test_type3"
fs:
  url: "/tmp"
  includes:
  - "*.xml"
  xml_support: true
elasticsearch:
  index: "mydocs3"
  index_folder: "myfolders3"

Using filename as elasticsearch _id

Please note that the document _id is generated as a hash value of the filename to avoid issues with special characters in filenames. You can force the _id to be the filename instead by using the filename_as_id attribute:

name: "test"
fs:
  filename_as_id: true

Adding file attributes

If you want to add file attributes such as attributes.owner, attributes.group and attributes.permissions, you can set attributes_support to true.

name: "test"
fs:
  attributes_support: true

Note

On Windows systems, attributes.group and attributes.permissions are not generated.
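
For illustration, the generated document then contains an attributes object along these lines (owner, group and permissions are the fields named above; the values and their exact format here are made up):

{
  "attributes" : {
    "owner" : "david",
    "group" : "staff",
    "permissions" : 644
  }
}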

Adding raw metadata

FSCrawler can extract all found metadata within a meta.raw object in addition to the standard metadata fields. If you want to enable this feature, you can set raw_metadata to true.

name: "test"
fs:
  raw_metadata: true

Generated raw metadata depends on the file format itself.

For example, a PDF document could generate:

{
   "date" : "2016-07-07T08:37:42Z",
   "pdf:PDFVersion" : "1.5",
   "xmp:CreatorTool" : "Microsoft Word",
   "Keywords" : "keyword1, keyword2",
   "access_permission:modify_annotations" : "true",
   "access_permission:can_print_degraded" : "true",
   "subject" : "Test Tika Object",
   "dc:creator" : "David Pilato",
   "dcterms:created" : "2016-07-07T08:37:42Z",
   "Last-Modified" : "2016-07-07T08:37:42Z",
   "dcterms:modified" : "2016-07-07T08:37:42Z",
   "dc:format" : "application/pdf; version=1.5",
   "title" : "Test Tika title",
   "Last-Save-Date" : "2016-07-07T08:37:42Z",
   "access_permission:fill_in_form" : "true",
   "meta:save-date" : "2016-07-07T08:37:42Z",
   "pdf:encrypted" : "false",
   "dc:title" : "Test Tika title",
   "modified" : "2016-07-07T08:37:42Z",
   "cp:subject" : "Test Tika Object",
   "Content-Type" : "application/pdf",
   "X-Parsed-By" : "org.apache.tika.parser.DefaultParser",
   "creator" : "David Pilato",
   "meta:author" : "David Pilato",
   "dc:subject" : "keyword1, keyword2",
   "meta:creation-date" : "2016-07-07T08:37:42Z",
   "created" : "Thu Jul 07 10:37:42 CEST 2016",
   "access_permission:extract_for_accessibility" : "true",
   "access_permission:assemble_document" : "true",
   "xmpTPg:NPages" : "2",
   "Creation-Date" : "2016-07-07T08:37:42Z",
   "access_permission:extract_content" : "true",
   "access_permission:can_print" : "true",
   "meta:keyword" : "keyword1, keyword2",
   "Author" : "David Pilato",
   "access_permission:can_modify" : "true"
}

Whereas an MP3 file would generate:

{
   "xmpDM:genre" : "Vocal",
   "X-Parsed-By" : "org.apache.tika.parser.DefaultParser",
   "creator" : "David Pilato",
   "xmpDM:album" : "FS Crawler",
   "xmpDM:trackNumber" : "1",
   "xmpDM:releaseDate" : "2016",
   "meta:author" : "David Pilato",
   "xmpDM:artist" : "David Pilato",
   "dc:creator" : "David Pilato",
   "xmpDM:audioCompressor" : "MP3",
   "title" : "Test Tika",
   "xmpDM:audioChannelType" : "Stereo",
   "version" : "MPEG 3 Layer III Version 1",
   "xmpDM:logComment" : "Hello but reverted",
   "xmpDM:audioSampleRate" : "44100",
   "channels" : "2",
   "dc:title" : "Test Tika",
   "Author" : "David Pilato",
   "xmpDM:duration" : "1018.775146484375",
   "Content-Type" : "audio/mpeg",
   "samplerate" : "44100"
}

Note

All fields are generated as text even though they can be valid booleans or numbers.

The meta.raw.* fields have a default mapping applied:

{
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    }
  }
}

If you want to specifically tell elasticsearch to use a date type or a numeric type for some fields, you need to modify the default template provided by FSCrawler.
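
For example, to have meta.raw.dcterms:created indexed as a real date, the entry for that field in the template mapping could be changed to something like the following sketch (dcterms:created is taken from the PDF example above):

{
  "properties": {
    "meta": {
      "properties": {
        "raw": {
          "properties": {
            "dcterms:created": {
              "type": "date"
            }
          }
        }
      }
    }
  }
}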

Note

Note that dots in metadata names will be replaced by a :. For example PTEX.Fullbanner will be indexed as PTEX:Fullbanner.

Note

Note that if you have a lot of different types of files, they can generate a lot of raw metadata, which can make you hit the total number of fields limit in elasticsearch mappings. In that case you will need to change the index.mapping.total_fields.limit index setting. See the elasticsearch documentation.

Disabling file size field

By default, FSCrawler will create a field to store the original file size in octets. You can disable it using the add_filesize option:

name: "test"
fs:
  add_filesize: false

Ignore deleted files

If you don’t want to remove indexed documents when you remove a file or a directory, you can set remove_deleted to false (defaults to true):

name: "test"
fs:
  remove_deleted: false

Ignore content

If you don’t want to extract file content but only index filesystem metadata such as filename, date, size and path, you can set index_content to false (defaults to true):

name: "test"
fs:
  index_content: false

Continue on Error

New in version 2.3.

By default, FSCrawler will immediately stop indexing if it hits a Permission denied exception. If you want to just skip that file and continue with the rest of the directory tree, you can set continue_on_error to true (defaults to false):

name: "test"
fs:
  continue_on_error: true

Language detection

New in version 2.2.

You can ask for language detection using the lang_detect option:

name: "test"
fs:
  lang_detect: true

In that case, a new field named meta.language is added to the generated JSon document.
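
For example, a document with French content would then contain something like this (the content value is illustrative):

{
  "content" : "Bonjour tout le monde",
  "meta" : {
    "language" : "fr"
  }
  // ... other fields here
}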

If you are using elasticsearch 5.0 or later, you can use this value to send your document to a specific index using a node ingest pipeline.

For example, you can define a pipeline named langdetect with:

PUT _ingest/pipeline/langdetect
{
  "description" : "langdetect pipeline",
  "processors" : [
    {
      "set": {
        "field": "_index",
        "value": "myindex-{{meta.language}}"
      }
    }
  ]
}

In FSCrawler settings, set both fs.lang_detect and elasticsearch.pipeline options:

name: "test"
fs:
  lang_detect: true
elasticsearch:
  pipeline: "langdetect"

And then, a document containing French text will be sent to myindex-fr. A document containing English text will be sent to myindex-en.

You can also imagine changing the field name from content to content-fr or content-en. That will help you to define the correct analyzer to use.
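
One way to sketch that is with the ingest rename processor (the pipeline name langfield is hypothetical; check that your elasticsearch version supports template snippets in target_field):

PUT _ingest/pipeline/langfield
{
  "description" : "rename content field per language",
  "processors" : [
    {
      "rename": {
        "field": "content",
        "target_field": "content-{{meta.language}}"
      }
    }
  ]
}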

Language detection might detect more than one language in a given text, but only the most probable one will be set. This means that if you have a document containing 80% French and 20% English, the document will be marked as fr.

Note that language detection is CPU and time consuming.

Storing binary source document

You can store the binary document itself (BASE64 encoded) in elasticsearch using the store_source option:

name: "test"
fs:
  store_source: true

In that case, a new field named attachment is added to the generated JSon document. This field is not indexed. The default mapping for the attachment field is:

{
  "_doc" : {
    "properties" : {
      "attachment" : {
        "type" : "binary",
        "doc_values" : false
      }
      // ... Other properties here
    }
  }
}

Extracted characters

By default, FSCrawler will extract only the first 100,000 characters. But you can set indexed_chars to 5000 in FSCrawler settings in order to override this default:

name: "test"
fs:
  indexed_chars: "5000"

This number can be either a fixed number of characters or a percentage using the % sign. A percentage is applied to the file size to determine the number of characters the crawler needs to extract.

If you want to index only 80% of the file size, set indexed_chars to "80%", as shown below. Of course, if you want to index the full document, you can set this property to "100%". Decimal values are also supported, so "0.01%" is also a correct value.
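
For example:

name: "test"
fs:
  indexed_chars: "80%"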

Compressed files: If your file is compressed, you might need to increase indexed_chars to more than "100%". For example, "150%".

If you want to extract the full content, define indexed_chars to "-1".

Note

Tika needs to allocate an in-memory data structure to extract text. Setting indexed_chars to a high number will require more memory!

Ignore above

New in version 2.5.

By default (if index_content is set to true), FSCrawler will send every single file to Tika, whatever its size. But some files on your file system might be way too big to parse.

Set ignore_above to the desired size limit:

name: "test"
fs:
  ignore_above: "512mb"

File checksum

If you want FSCrawler to generate a checksum for each file, set checksum to the algorithm you wish to use to compute the checksum, such as MD5 or SHA-1.

Note

You MUST set index_content to true for this feature to work. Nevertheless, you MAY set indexed_chars to 0 if you do not need any content in the index.

You also MUST NOT set json_support or xml_support for this feature to work.

name: "test"
fs:
  # required
  index_content: true
  #indexed_chars: 0
  checksum: "MD5"