Job file specifications ======================= .. contents:: :backlinks: entry Expected files -------------- FSCrawler expects to find a job directory in the ``~/.fscrawler`` directory or in the directory you defined with the ``-config_dir`` CLI option (see :ref:`cli-options`). The job file could be either: * a ``yaml`` file named ``_settings.yaml`` * a ``json`` file named ``_settings.json`` * a list of files within a directory named ``_settings`` When using a directory, FSCrawler will merge all files found in the directory. Meaning that you can split your settings in multiple files, like: * ``my_job_fs.yaml`` which contains the file system settings * ``my_job_elasticsearch.yaml`` which contains the elasticsearch settings Using placeholders ------------------ .. versionadded:: 2.10 FSCrawler supports placeholders in the job file. This is useful when you want to use environment variables in your job file. For example, you can define the following job file: .. code:: yaml fs: url: "${HOME}/docs" elasticsearch: nodes: - url: "${ES_NODE1:=https://127.0.0.1:9200}" api_key: "${ES_API_KEY}" When running FSCrawler, it will replace ``${HOME}``, ``${ES_NODE1}`` and ``${ES_API_KEY}`` by their respective values which will be read from environment variables and java system properties if not found. If no value is found, it will use the default value after the ``:=`` if any, or it will fail starting if no default value. In the previous example, both ``${HOME}`` and ``${ES_API_KEY}`` are mandatory but ``${ES_NODE1}`` is optional and will be set to ``https://127.0.0.1:9200`` if not set. FSCrawler is using the gestalt-config project to handle placeholders. You can read more about String substitution in the `gestalt-config documentation `_. Default placeholders -------------------- FSCrawler supports a set of default placeholders that you can define using environment variables. The form of those placeholders is the prefix ``FSCRAWLER_`` and the setting name. For example, ``fs.url`` can be set using the environment variable ``FSCRAWLER_FS_URL`` or the system property ``-Dfs.url``. As an example, you can run: .. code:: sh FSCRAWLER_NAME=foo \ FSCRAWLER_FS_URL=/tmp/test \ FSCRAWLER_ELASTICSEARCH_API-KEY=VnVhQ2ZHY0JDZGJrUW0tZTVhT3g6dWkybHAyYXhUTm1zeWFrdzl0dk5udw== \ bin/fscrawler test or: .. code:: sh FS_JAVA_OPTS="-Dname=foo -Dfs.url=/tmp/test -Delasticsearch.api-key=VnVhQ2ZHY0JDZGJrUW0tZTVhT3g6dWkybHAyYXhUTm1zeWFrdzl0dk5udw==" \ bin/fscrawler test .. note:: If you define as well some settings in the job file, the settings in the job file will override the environment variables and system properties. Example job file specification ------------------------------ The job file (``~/.fscrawler/test/_settings.yaml``) for the job name ``test`` must comply to the following ``yaml`` specifications: .. code:: yaml # optional: the name of the crawler. Defaults to the job directory name. name: "test" # required fs: # define a "local" file path crawler, if running inside a docker container this must be the path INSIDE the container (/tmp/es) url: "/path/to/docs" follow_symlinks: false remove_deleted: true continue_on_error: false # scan every 5 minutes for changes in url defined above update_rate: "5m" # opional: define includes and excludes, "~" files are excluded by default if not defined below includes: - "*.doc" - "*.xls" excludes: - "resume.doc" # optional: do not send big files to TIKA ignore_above: "512mb" # special handling of JSON files, should only be used if ALL files are JSON json_support: false add_as_inner_object: false # special handling of XML files, should only be used if ALL files are XML xml_support: false # use MD5 from filename (instead of filename) if set to false filename_as_id: true # include size ot file in index add_filesize: true # inlcude user/group of file only if needed attributes_support: false # collect ACL metadata when available acl_support: false # do you REALLY want to store every file as a copy in the index ? Then set this to true store_source: false # you may want to store (partial) content of the file (see indexed_chars) index_content: true # how much data from the content of the file should be indexed (and stored inside the index), set to 0 if you need checksum, but no content at all to be indexed #indexed_chars: "0" indexed_chars: "10000.0" # usually file metadata will be stored in separate fields, if you want to keep the original set, set this to true raw_metadata: false # optional: add checksum meta (requires index_content to be set to true) checksum: "MD5" # recommmended, but will create another index index_folders: true lang_detect: false ocr.pdf_strategy: noocr #ocr: # language: "eng" # path: "/path/to/tesseract/if/not/available/in/PATH" # data_path: "/path/to/tesseract/tessdata/if/needed" # optional: add static metadata tags to documents tags: metaFilename: "meta_tags.json" # default is ".meta.yml" # optional: add static metadata to all indexed documents staticMetaFilename: "/path/to/static_metadata.json" # optional: specify a crawler provider (default is "local") # available providers: "local", "ftp", "ssh" # provider: "ssh" # optional: only required if you want to SSH/FTP to another server to index documents from there server: hostname: "localhost" port: 22 username: "dadoonet" password: "password" pem_path: "/path/to/pemfile" # required elasticsearch: urls: - "https://127.0.0.1:9200" bulk_size: 1000 flush_interval: "5s" byte_size: "10mb" # choose one of the 2 following options: # 1 - Using Api Key api_key: "VnVhQ2ZHY0JDZGJrUW0tZTVhT3g6dWkybHAyYXhUTm1zeWFrdzl0dk5udw==" # 2 - Using username/password (not recommended / deprecated) username: "elastic" password: "password" # optional, defaults to ``name``-property index: "test_docs" # optional, defaults to "test_folders", used when es.index_folders is set to true index_folder: "test_fold" # optional, defaults to "true" push_templates: "true" # optional, defaults to "true", used with Elasticsearch 8.17+ with a trial or enterprise license semantic_search: "true" # only used when started with --rest option rest: url: "http://127.0.0.1:8080/fscrawler" Here is a list of existing top level settings: +-----------------------------------+-------------------------------+ | Name | Documentation | +===================================+===============================+ | ``name`` (mandatory field) | :ref:`simple_crawler` | +-----------------------------------+-------------------------------+ | ``fs`` | :ref:`local-fs-settings` | +-----------------------------------+-------------------------------+ | ``tags`` | :ref:`tags` | +-----------------------------------+-------------------------------+ | ``elasticsearch`` | :ref:`elasticsearch-settings` | +-----------------------------------+-------------------------------+ | ``server`` | :ref:`ssh-settings` | +-----------------------------------+-------------------------------+ | ``rest`` | :ref:`rest-service` | +-----------------------------------+-------------------------------+ You can define your job settings either in ``_settings.yaml`` (using ``.yaml`` extension) or in ``_settings.json`` (using ``.json`` extension).