.. _local-fs-settings:

Local FS settings
-----------------

.. contents:: :backlinks: entry

Here is a list of Local FS settings (under ``fs.`` prefix):

+----------------------------+--------------------------------------+---------------+------------------------------+
| Name                       | Environment Variable                 | Default value | Documentation                |
+============================+======================================+===============+==============================+
| ``fs.provider``            | ``FSCRAWLER_FS_PROVIDER``            | ``"local"``   | `Crawler Provider`_          |
+----------------------------+--------------------------------------+---------------+------------------------------+
| ``fs.url``                 | ``FSCRAWLER_FS_URL``                 | ``"/tmp/es"`` | `Root directory`_            |
+----------------------------+--------------------------------------+---------------+------------------------------+
| ``fs.update_rate``         | ``FSCRAWLER_FS_UPDATE_RATE``         | ``"15m"``     | `Update Rate`_               |
+----------------------------+--------------------------------------+---------------+------------------------------+
| ``fs.includes``            | ``FSCRAWLER_FS_INCLUDES``            | ``null``      | `Includes and excludes`_     |
+----------------------------+--------------------------------------+---------------+------------------------------+
| ``fs.excludes``            | ``FSCRAWLER_FS_EXCLUDES``            | ``["*/~*"]``  | `Includes and excludes`_     |
+----------------------------+--------------------------------------+---------------+------------------------------+
| ``fs.filters``             | ``FSCRAWLER_FS_FILTERS``             | ``null``      | `Filter content`_            |
+----------------------------+--------------------------------------+---------------+------------------------------+
| ``fs.json_support``        | ``FSCRAWLER_FS_JSON_SUPPORT``        | ``false``     | `Indexing JSon docs`_        |
+----------------------------+--------------------------------------+---------------+------------------------------+
| ``fs.xml_support``         | ``FSCRAWLER_FS_XML_SUPPORT``         | ``false``     | `Indexing XML docs`_         |
+----------------------------+--------------------------------------+---------------+------------------------------+
| ``fs.add_as_inner_object`` | ``FSCRAWLER_FS_ADD_AS_INNER_OBJECT`` | ``false``     | `Add as Inner Object`_       |
+----------------------------+--------------------------------------+---------------+------------------------------+
| ``fs.index_folders``       | ``FSCRAWLER_FS_INDEX_FOLDERS``       | ``true``      | `Index folders`_             |
+----------------------------+--------------------------------------+---------------+------------------------------+
| ``fs.attributes_support``  | ``FSCRAWLER_FS_ATTRIBUTES_SUPPORT``  | ``false``     | `Adding file attributes`_    |
+----------------------------+--------------------------------------+---------------+------------------------------+
| ``fs.acl_support``         | ``FSCRAWLER_FS_ACL_SUPPORT``         | ``false``     | `Collecting ACL metadata`_   |
+----------------------------+--------------------------------------+---------------+------------------------------+
| ``fs.raw_metadata``        | ``FSCRAWLER_FS_RAW_METADATA``        | ``false``     | `Enabling raw metadata`_     |
+----------------------------+--------------------------------------+---------------+------------------------------+
| ``fs.filename_as_id``      | ``FSCRAWLER_FS_FILENAME_AS_ID``      | ``false``     | :ref:`filename-as-id`        |
+----------------------------+--------------------------------------+---------------+------------------------------+
| ``fs.add_filesize``        | ``FSCRAWLER_FS_ADD_FILESIZE``        | ``true``      | `Disabling file size field`_ |
+----------------------------+--------------------------------------+---------------+------------------------------+
| ``fs.remove_deleted``      | ``FSCRAWLER_FS_REMOVE_DELETED``      | ``true``      | `Ignore deleted files`_      |
+----------------------------+--------------------------------------+---------------+------------------------------+
| ``fs.store_source``        | ``FSCRAWLER_FS_STORE_SOURCE``        | ``false``     | :ref:`store_binary`          |
+----------------------------+--------------------------------------+---------------+------------------------------+
| ``fs.index_content``       | ``FSCRAWLER_FS_INDEX_CONTENT``       | ``true``      | `Ignore content`_            |
+----------------------------+--------------------------------------+---------------+------------------------------+
| ``fs.lang_detect``         | ``FSCRAWLER_FS_LANG_DETECT``         | ``false``     | `Language detection`_        |
+----------------------------+--------------------------------------+---------------+------------------------------+
| ``fs.continue_on_error``   | ``FSCRAWLER_FS_CONTINUE_ON_ERROR``   | ``false``     | :ref:`continue_on_error`     |
+----------------------------+--------------------------------------+---------------+------------------------------+
| ``fs.indexed_chars``       | ``FSCRAWLER_FS_INDEXED_CHARS``       | ``100000.0``  | `Extracted characters`_      |
+----------------------------+--------------------------------------+---------------+------------------------------+
| ``fs.ignore_above``        | ``FSCRAWLER_FS_IGNORE_ABOVE``        | ``null``      | `Ignore above`_              |
+----------------------------+--------------------------------------+---------------+------------------------------+
| ``fs.checksum``            | ``FSCRAWLER_FS_CHECKSUM``            | ``null``      | `File Checksum`_             |
+----------------------------+--------------------------------------+---------------+------------------------------+
| ``fs.temp_dir``            | ``FSCRAWLER_FS_TEMP_DIR``            | ``null``      | `Temporary Directory`_       |
+----------------------------+--------------------------------------+---------------+------------------------------+
| ``fs.follow_symlinks``     | ``FSCRAWLER_FS_FOLLOW_SYMLINKS``     | ``false``     | `Follow Symlinks`_           |
+----------------------------+--------------------------------------+---------------+------------------------------+
| ``fs.tika_config_path``    | ``FSCRAWLER_FS_TIKA_CONFIG_PATH``    | ``null``      | `Tika Config Path`_          |
+----------------------------+--------------------------------------+---------------+------------------------------+
| ``fs.ocr.enabled``         | ``FSCRAWLER_FS_OCR_ENABLED``         | ``true``      | :ref:`ocr_integration`       |
+----------------------------+--------------------------------------+---------------+------------------------------+

.. _crawler-provider:

Crawler Provider
^^^^^^^^^^^^^^^^

.. versionadded:: 2.10

The ``fs.provider`` setting specifies which crawler plugin to use for scanning files.
Available providers are:

* ``local`` (default): Crawl files from the local filesystem
* ``ftp``: Crawl files from a remote FTP server (see :ref:`ftp-settings`)
* ``ssh``: Crawl files from a remote server via SSH/SFTP (see :ref:`ssh-settings`)

.. code:: yaml

   name: "test"
   fs:
     provider: "local"
     url: "/path/to/data/dir"

.. note::

    The ``fs.provider`` setting replaces the deprecated ``server.protocol`` setting.
    If you are currently using ``server.protocol``, you should migrate to ``fs.provider``.

    Old configuration (deprecated):

    .. code:: yaml

       name: "test"
       fs:
         url: "/path/to/data/dir"
       server:
         hostname: "mynode.mydomain.com"
         protocol: "ftp"

    New configuration (recommended):

    .. code:: yaml

       name: "test"
       fs:
         provider: "ftp"
         url: "/path/to/data/dir"
       server:
         hostname: "mynode.mydomain.com"

.. _root-directory:

Root directory
^^^^^^^^^^^^^^

Define the ``fs.url`` property in your ``~/.fscrawler/test/_settings.yaml`` file:

.. code:: yaml

   name: "test"
   fs:
     url: "/path/to/data/dir"

For Windows users, use a form like ``c:/tmp`` or ``c:\\tmp``.

.. _local-fs-update_rate:

Update rate
^^^^^^^^^^^

By default, ``update_rate`` is set to ``15m``. You can modify this
value using any compatible `time unit `__.

For example, here is a 15-minute update rate:

.. code:: yaml

   name: "test"
   fs:
     update_rate: "15m"

Or a 3-hour update rate:

.. code:: yaml

   name: "test"
   fs:
     update_rate: "3h"

``update_rate`` is the pause between the end of one file system scan and the
start of the next one. If you set it to ``15m``, the next scan starts
15 minutes after the current scan ends, however long that scan took.
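As an illustration of this timing, here is a settings sketch with a
30-minute rate. The clock times in the comments are made-up values, chosen
only to show that the pause is counted from the end of a scan:

.. code:: yaml

   name: "test"
   fs:
     # Hypothetical timeline: a scan starts at 10:00 and ends at 10:05.
     # The next scan then starts at 10:35 (end of scan + 30 minutes),
     # not at 10:30.
     update_rate: "30m"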
The supported units for duration are:

* ``d`` for days
* ``h`` for hours
* ``m`` for minutes
* ``s`` for seconds
* ``ms`` for milliseconds

.. note::

    If you don't want to wait for the next scan, you can manually edit the
    ``~/.fscrawler/{job_name}/_checkpoint.json`` file and set ``next_check``
    to the current time or to ``null``. FSCrawler will then start a new scan
    within 5 seconds at most. See :ref:`status-files` for more information.

.. _includes_excludes:

Includes and excludes
^^^^^^^^^^^^^^^^^^^^^

Let’s say you want to index only documents matching ``*.doc`` and ``*.pdf``,
except those whose name starts with ``resume``. In that case,
``resume_david.pdf`` won’t be indexed.

Define ``fs.includes`` and ``fs.excludes`` properties in your
``~/.fscrawler/test/_settings.yaml`` file:

.. code:: yaml

   name: "test"
   fs:
     includes:
     - "*/*.doc"
     - "*/*.pdf"
     excludes:
     - "*/resume*"

By default, FSCrawler excludes files whose name starts with ``~``.

Note that ``includes`` and ``excludes`` apply to directory names as well.
So if you want to ignore a ``.ignore`` dir, just add ``.ignore`` as an
excluded name.

Let's take the following example with the ``root`` dir as ``/tmp``:

.. code::

   /tmp
   ├── folderA
   │   ├── subfolderA
   │   ├── subfolderB
   │   └── subfolderC
   ├── folderB
   │   ├── subfolderA
   │   ├── subfolderB
   │   └── subfolderC
   └── folderC
       ├── subfolderA
       ├── subfolderB
       └── subfolderC

If you define the following ``fs.excludes`` property in your
``~/.fscrawler/test/_settings.yaml`` file:

.. code:: yaml

   name: "test"
   fs:
     excludes:
     - "/folderB/subfolder*"

Then all files but the ones in ``/folderB/subfolderA``, ``/folderB/subfolderB`` and
``/folderB/subfolderC`` will be indexed.

If you want to exclude a specific folder, you need to use a wildcard character
at the end of the folder name, like:

.. code:: yaml

   name: "test"
   fs:
     excludes:
     - "/folderB/subfolderB/*"

Because includes and excludes are matched against the entire *path of the file*,
keep that in mind when using wildcards.
Below are some include and exclude patterns to help convey the idea.

+--------------------+------------------------------------------------+------------------------------------------------+
| Pattern            | Includes                                       | Excludes                                       |
+====================+================================================+================================================+
| ``*.jpg``          | Include all jpg files                          | Exclude all jpg files                          |
+--------------------+------------------------------------------------+------------------------------------------------+
| ``/images/*.jpg``  | Include all jpg files in the images directory  | Exclude all jpg files in the images directory  |
+--------------------+------------------------------------------------+------------------------------------------------+
| ``*/old-*.jpg``    | Include all jpg files that start with ``old-`` | Exclude all jpg files that start with ``old-`` |
+--------------------+------------------------------------------------+------------------------------------------------+

If a folder contains a file named ``.fscrawlerignore``, this folder and its subfolders will be entirely skipped.

Filter content
^^^^^^^^^^^^^^

You can restrict which documents are indexed by adding one or more
regular expressions that must match the extracted content.
Documents that do not match are ignored and not indexed.

If you define the following ``fs.filters`` property in your
``~/.fscrawler/test/_settings.yaml`` file:

.. code:: yaml

   name: "test"
   fs:
     filters:
     - ".*foo.*"
     - "^4\\d{3}([\\ \\-]?)\\d{4}\\1\\d{4}\\1\\d{4}$"

With this example, only documents which contain the word ``foo`` and a VISA credit card number
in a form like ``4012888888881881``, ``4012 8888 8888 1881`` or ``4012-8888-8888-1881``
will be indexed.

Indexing JSon docs
^^^^^^^^^^^^^^^^^^

If you want to index JSon files directly without parsing them with Tika, you
can set ``json_support`` to ``true``. JSon contents will be stored
directly under \_source.
If you need to keep JSon documents synchronized
to the index, set option `Add as Inner Object`_
which stores additional metadata and the JSon contents under field
``object``.

.. code:: yaml

   name: "test"
   fs:
     json_support: true

Of course, if you did not define a mapping before launching the crawler,
Elasticsearch will guess the mapping automatically.

Indexing XML docs
^^^^^^^^^^^^^^^^^

If you want to index XML files and convert them to JSON, you can set
``xml_support`` to ``true``. The content of XML files will be added
directly under \_source. If you need to keep XML documents synchronized
to the index, set option `Add as Inner Object`_
which stores additional metadata and the XML contents under field
``object``.

.. code:: yaml

   name: "test"
   fs:
     xml_support: true

Of course, if you did not define a mapping before launching the crawler,
Elasticsearch will guess the mapping automatically.

Add as Inner Object
^^^^^^^^^^^^^^^^^^^

The default settings store the contents of json and xml documents
directly in the \_source element of elasticsearch documents. As a result,
there is no metadata about the file and its path, which is necessary to
determine whether a document has been deleted or updated. New files will
however still be added to the index (this is determined by the file timestamp).

If you need to keep json or xml documents synchronized to elasticsearch,
you should set this option.

.. code:: yaml

   name: "test"
   fs:
     add_as_inner_object: true

Index folders
^^^^^^^^^^^^^

By default, FSCrawler indexes folder names in the folder index. If
you don’t want to index those folders, you can set ``index_folders`` to
``false``.

Note that in that case, FSCrawler won’t be able to detect removed
folders, so documents that have already been indexed in elasticsearch
won’t be removed when you remove or move the folder away.

See ``elasticsearch.index_folder`` below for the name of the index used to store the folder data
(if ``fs.index_folders`` is set to ``true``).

.. code:: yaml

   name: "test"
   fs:
     index_folders: false

Dealing with multiple types and multiple dirs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you have more than one type, create as many crawlers as types and/or folders:

``~/.fscrawler/test_type1/_settings.yaml``:

.. code:: yaml

   name: "test_type1"
   fs:
     url: "/tmp/type1"
     json_support: true
   elasticsearch:
     index: "mydocs1"
     index_folder: "myfolders1"

``~/.fscrawler/test_type2/_settings.yaml``:

.. code:: yaml

   name: "test_type2"
   fs:
     url: "/tmp/type2"
     json_support: true
   elasticsearch:
     index: "mydocs2"
     index_folder: "myfolders2"

``~/.fscrawler/test_type3/_settings.yaml``:

.. code:: yaml

   name: "test_type3"
   fs:
     url: "/tmp/type3"
     xml_support: true
   elasticsearch:
     index: "mydocs3"
     index_folder: "myfolders3"

Dealing with multiple types within the same dir
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can also index many types from one single dir by using several crawlers
scanning the same dir and by setting the ``includes`` parameter:

``~/.fscrawler/test_type1/_settings.yaml``:

.. code:: yaml

   name: "test_type1"
   fs:
     url: "/tmp"
     includes:
     - "type1*.json"
     json_support: true
   elasticsearch:
     index: "mydocs1"
     index_folder: "myfolders1"

``~/.fscrawler/test_type2/_settings.yaml``:

.. code:: yaml

   name: "test_type2"
   fs:
     url: "/tmp"
     includes:
     - "type2*.json"
     json_support: true
   elasticsearch:
     index: "mydocs2"
     index_folder: "myfolders2"

``~/.fscrawler/test_type3/_settings.yaml``:

.. code:: yaml

   name: "test_type3"
   fs:
     url: "/tmp"
     includes:
     - "*.xml"
     xml_support: true
   elasticsearch:
     index: "mydocs3"
     index_folder: "myfolders3"

.. _filename-as-id:

Using filename as elasticsearch ``_id``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Please note that the document ``_id`` is generated as a hash value
from the filename to avoid issues with special characters in the filename.
You can force FSCrawler to use the filename as the ``_id`` with the
``filename_as_id`` setting:

.. code:: yaml

   name: "test"
   fs:
     filename_as_id: true

Adding file attributes
^^^^^^^^^^^^^^^^^^^^^^

If you want to add file attributes such as ``attributes.owner``, ``attributes.group``
and ``attributes.permissions``, you can set ``attributes_support`` to ``true``.

.. code:: yaml

   name: "test"
   fs:
     attributes_support: true

.. note::

    On Windows systems, ``attributes.group`` and ``attributes.permissions`` are
    not generated.

Collecting ACL metadata
^^^^^^^^^^^^^^^^^^^^^^^

To extract NTFS access control entries (principal, type, permissions and flags), enable both
``attributes_support`` and ``acl_support``:

.. code:: yaml

   name: "test"
   fs:
     attributes_support: true
     acl_support: true

When ``acl_support`` is disabled, FSCrawler skips resolving ACLs even if ``attributes_support`` is active.

Enabling raw metadata
^^^^^^^^^^^^^^^^^^^^^

FSCrawler can extract all found metadata within a ``meta.raw`` object in addition
to the standard metadata fields.
If you want to enable this feature, you can set ``raw_metadata`` to ``true``.

.. code:: yaml

   name: "test"
   fs:
     raw_metadata: true

Generated raw metadata depends on the file format itself.

For example, a PDF document could generate:

.. code:: json

   {
      "date" : "2016-07-07T08:37:42Z",
      "pdf:PDFVersion" : "1.5",
      "xmp:CreatorTool" : "Microsoft Word",
      "Keywords" : "keyword1, keyword2",
      "access_permission:modify_annotations" : "true",
      "access_permission:can_print_degraded" : "true",
      "subject" : "Test Tika Object",
      "dc:creator" : "David Pilato",
      "dcterms:created" : "2016-07-07T08:37:42Z",
      "Last-Modified" : "2016-07-07T08:37:42Z",
      "dcterms:modified" : "2016-07-07T08:37:42Z",
      "dc:format" : "application/pdf; version=1.5",
      "title" : "Test Tika title",
      "Last-Save-Date" : "2016-07-07T08:37:42Z",
      "access_permission:fill_in_form" : "true",
      "meta:save-date" : "2016-07-07T08:37:42Z",
      "pdf:encrypted" : "false",
      "dc:title" : "Test Tika title",
      "modified" : "2016-07-07T08:37:42Z",
      "cp:subject" : "Test Tika Object",
      "Content-Type" : "application/pdf",
      "X-Parsed-By" : "org.apache.tika.parser.DefaultParser",
      "creator" : "David Pilato",
      "meta:author" : "David Pilato",
      "dc:subject" : "keyword1, keyword2",
      "meta:creation-date" : "2016-07-07T08:37:42Z",
      "created" : "Thu Jul 07 10:37:42 CEST 2016",
      "access_permission:extract_for_accessibility" : "true",
      "access_permission:assemble_document" : "true",
      "xmpTPg:NPages" : "2",
      "Creation-Date" : "2016-07-07T08:37:42Z",
      "access_permission:extract_content" : "true",
      "access_permission:can_print" : "true",
      "meta:keyword" : "keyword1, keyword2",
      "Author" : "David Pilato",
      "access_permission:can_modify" : "true"
   }

Whereas an MP3 file would generate:

.. code:: json

   {
      "xmpDM:genre" : "Vocal",
      "X-Parsed-By" : "org.apache.tika.parser.DefaultParser",
      "creator" : "David Pilato",
      "xmpDM:album" : "FS Crawler",
      "xmpDM:trackNumber" : "1",
      "xmpDM:releaseDate" : "2016",
      "meta:author" : "David Pilato",
      "xmpDM:artist" : "David Pilato",
      "dc:creator" : "David Pilato",
      "xmpDM:audioCompressor" : "MP3",
      "title" : "Test Tika",
      "xmpDM:audioChannelType" : "Stereo",
      "version" : "MPEG 3 Layer III Version 1",
      "xmpDM:logComment" : "Hello but reverted",
      "xmpDM:audioSampleRate" : "44100",
      "channels" : "2",
      "dc:title" : "Test Tika",
      "Author" : "David Pilato",
      "xmpDM:duration" : "1018.775146484375",
      "Content-Type" : "audio/mpeg",
      "samplerate" : "44100"
   }

.. note::

    All fields are generated as text even though they can be valid booleans or numbers.

    The ``meta.raw.*`` fields have a default mapping applied:

    .. code:: json

       {
         "type": "text",
         "fields": {
           "keyword": {
             "type": "keyword",
             "ignore_above": 256
           }
         }
       }

    If you want to explicitly tell elasticsearch to use a date type or a
    numeric type for some fields, you need to modify the default template
    provided by FSCrawler.

.. note::

    Dots in metadata names are replaced by a ``:``. For example,
    ``PTEX.Fullbanner`` will be indexed as ``PTEX:Fullbanner``.

.. note::

    If you have a lot of different types of files, they can generate a lot of
    raw metadata, which can make you hit the limit on the total number of
    fields in elasticsearch mappings. In that case, you will need to change
    the index setting ``index.mapping.total_fields.limit``.
    See `elasticsearch documentation `__

Disabling file size field
^^^^^^^^^^^^^^^^^^^^^^^^^

By default, FSCrawler will create a field to store the original file
size in bytes. You can disable it using the ``add_filesize`` option:

.. code:: yaml

   name: "test"
   fs:
     add_filesize: false

Ignore deleted files
^^^^^^^^^^^^^^^^^^^^

If you don’t want to remove indexed documents when you remove a file or
a directory, you can set ``remove_deleted`` to ``false`` (defaults to
``true``):

.. code:: yaml

   name: "test"
   fs:
     remove_deleted: false

.. note::

    Setting ``remove_deleted`` is forced to ``false`` when using the
    Workplace Search output (:ref:`wpsearch-settings`).

Ignore content
^^^^^^^^^^^^^^

If you don’t want to extract file content but only index filesystem
metadata such as filename, date, size and path, you can set
``index_content`` to ``false`` (defaults to ``true``):

.. code:: yaml

   name: "test"
   fs:
     index_content: false

.. _continue_on_error:

Continue on Error
^^^^^^^^^^^^^^^^^

By default, FSCrawler will immediately stop indexing if it hits a
``Permission denied`` exception. If you want to skip the offending file and
continue with the rest of the directory tree, you can set
``continue_on_error`` to ``true`` (defaults to ``false``):

.. code:: yaml

   name: "test"
   fs:
     continue_on_error: true

Language detection
^^^^^^^^^^^^^^^^^^

You can ask for language detection using the ``lang_detect`` option:

.. code:: yaml

   name: "test"
   fs:
     lang_detect: true

In that case, a new field named ``meta.language`` is added to the
generated JSon document.

If you are using elasticsearch 5.0 or later, you can use this value to
send your document to a specific index using a `Node Ingest pipeline <#using-ingest-node-pipeline>`__.

For example, you can define a pipeline named ``langdetect`` with:

.. code:: sh

   PUT _ingest/pipeline/langdetect
   {
     "description" : "langdetect pipeline",
     "processors" : [
       {
         "set": {
           "field": "_index",
           "value": "myindex-{{meta.language}}"
         }
       }
     ]
   }

In FSCrawler settings, set both ``fs.lang_detect`` and
``elasticsearch.pipeline`` options:

.. code:: yaml

   name: "test"
   fs:
     lang_detect: true
   elasticsearch:
     pipeline: "langdetect"

A document containing French text will then be sent to
``myindex-fr``, and a document containing English text will be sent to
``myindex-en``.

You could also change the field name from ``content`` to
``content-fr`` or ``content-en``. That will help you to define the
correct analyzer to use.
Language detection might detect more than one language in a given text,
but only the most likely one is kept. For example, a document containing
80% French and 20% English text will be marked as ``fr``.

Note that language detection is CPU-intensive and time-consuming.

.. _store_binary:

Storing binary source document
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

You can store the binary document (BASE64 encoded) in elasticsearch itself
using the ``store_source`` option:

.. code:: yaml

   name: "test"
   fs:
     store_source: true

In that case, a new field named ``attachment`` is added to the generated
JSon document. This field is not indexed. Default mapping for
``attachment`` field is:

.. code:: json

   {
     "_doc" : {
       "properties" : {
         "attachment" : {
           "type" : "binary",
           "doc_values" : false
         }
         // ... Other properties here
       }
     }
   }

Extracted characters
^^^^^^^^^^^^^^^^^^^^

By default, FSCrawler extracts only the first 100 000 characters.
You can override this default, for example by setting ``indexed_chars``
to ``5000`` in FSCrawler settings:

.. code:: yaml

   name: "test"
   fs:
     indexed_chars: "5000"

This value can be either a fixed size (a number of characters) or a
percentage, using the ``%`` sign. The percentage is applied to the
file size to determine the number of characters the crawler needs to
extract.

If you want to index only ``80%`` of the file size, set ``indexed_chars`` to
``"80%"``. Of course, if you want to index the full document, you can
set this property to ``"100%"``. Decimal values are also supported, so
``"0.01%"`` is also a valid value.

**Compressed files**: If your file is compressed, you might need to
increase ``indexed_chars`` to more than ``"100%"``. For example,
``"150%"``.

If you want to extract the full content, set ``indexed_chars`` to
``"-1"``.

.. note::

    Tika needs to allocate a data structure in memory to extract text.
    Setting ``indexed_chars`` to a high number will require more memory!
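The variants of ``indexed_chars`` described in this section can be gathered
in one settings sketch. Only one value is active at a time; the alternatives
are shown as comments, and the job name ``test`` is just the placeholder used
throughout this page:

.. code:: yaml

   name: "test"
   fs:
     # Extract a number of characters equal to 80% of the file size
     indexed_chars: "80%"
     # Alternatives (use one of these instead of the line above):
     # indexed_chars: "5000"    # fixed number of characters
     # indexed_chars: "150%"    # may be needed for compressed files
     # indexed_chars: "-1"      # extract the full content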
Ignore Above
^^^^^^^^^^^^

By default (if ``index_content`` is set to ``true``), FSCrawler will send every single file to Tika, whatever its size.
But some files on your file system might be far too big to be parsed.

Set ``ignore_above`` to the maximum file size you want to parse.

.. code:: yaml

   name: "test"
   fs:
     ignore_above: "512mb"

File checksum
^^^^^^^^^^^^^

If you want FSCrawler to generate a checksum for each file, set
``checksum`` to the algorithm you wish to use to compute the checksum,
such as ``MD5`` or ``SHA-1``.

.. note::

    You MUST set ``index_content`` to true to allow this feature to work. Nevertheless you MAY set ``indexed_chars`` to 0
    if you do not need any content in the index.

    You also MUST NOT set ``json_support`` or ``xml_support``, otherwise this feature will not work.

.. code:: yaml

   name: "test"
   fs:
     # required
     index_content: true
     #indexed_chars: 0
     checksum: "MD5"

Temporary Directory
^^^^^^^^^^^^^^^^^^^

.. versionadded:: 2.10

When ``checksum`` or ``store_source`` is enabled, FSCrawler may need to create temporary files
to process large documents without loading them entirely into memory.

By default, temporary files are created in ``~/.fscrawler//tmp/``. You can override this location
using the ``temp_dir`` option:

.. code:: yaml

   name: "test"
   fs:
     checksum: "MD5"
     temp_dir: "/path/to/custom/temp"

.. note::

    For small files (64KB or less), FSCrawler uses an in-memory buffer instead of temporary files
    for better performance. Temporary files are only created for larger files to avoid
    ``OutOfMemoryError``.

.. note::

    Temporary files are automatically deleted after processing each document.

Follow Symlinks
^^^^^^^^^^^^^^^

Starting from version 2.7, symbolic links are no longer followed by default.
If you want FSCrawler to follow symbolic links, you need to be explicit about it
and set ``follow_symlinks`` to ``true``.

.. code:: yaml

   name: "test"
   fs:
     follow_symlinks: true

Tika Config Path
^^^^^^^^^^^^^^^^

.. versionadded:: 2.10

If you want to override the default tika parser configuration, you can set the path to a custom tika configuration file,
which will be used instead.

.. code:: yaml

   name: "test"
   fs:
     tika_config_path: '/path/to/tikaConfig.xml'

An example tika config file is shown below. See |Tika_configuring|_ for more information.

.. code:: xml

   application/xhtml+xml