.. _installation: .. _docker: Using docker ------------ Pull the Docker image from `Docker Hub `__: .. code:: sh docker pull dadoonet/fscrawler .. note:: This image is very big (500+mb) as it contains `Tesseract `__ and all the `trained language data `__. If you don't want to use OCR at all, you can use a smaller image (around 230mb) by pulling instead ``dadoonet/fscrawler:noocr`` .. code:: sh docker pull dadoonet/fscrawler:noocr Let say your documents are located in ``~/tmp`` dir and you want to store your fscrawler jobs in ``~/.fscrawler``. You can run FSCrawler with: .. code:: sh docker run -it --rm \ -v ~/.fscrawler:/root/.fscrawler \ -v ~/tmp:/tmp/es:ro \ dadoonet/fscrawler .. note:: The configuration file is expected to be stored on your machine in ``~/.fscrawler/fscrawler/_settings.yaml``. Remember to change the URL of your elasticsearch instance as the container won't be able to see it running under the default ``127.0.0.1``. You will need to use the actual IP address of the host. Or use the ``FSCRAWLER_ELASTICSEARCH_URLS`` environment variable to set the elasticsearch URL. See :ref:`docker-options` for more information. If you need to add a 3rd party library (jar) or your Tika custom jar, you can put it in a ``external`` directory and mount it as well: .. code:: sh docker run -it --rm \ -v ~/.fscrawler:/root/.fscrawler \ -v ~/tmp:/tmp/es:ro \ -v "$PWD/external:/usr/share/fscrawler/external" \ dadoonet/fscrawler If you want to use the :ref:`rest-service`, don't forget to also expose the port: .. code:: sh docker run -it --rm \ -v ~/.fscrawler:/root/.fscrawler \ -v ~/tmp:/tmp/es:ro \ -p 8080:8080 \ dadoonet/fscrawler If you want to change the log level for FSCrawler, you can run: .. code:: sh docker run -it --rm \ -v ~/.fscrawler:/root/.fscrawler \ -v ~/tmp:/tmp/es:ro \ -v ~/logs:/root/logs \ -e FS_JAVA_OPTS="-DLOG_LEVEL=debug -DDOC_LEVEL=debug" \ dadoonet/fscrawler And you can read the logs from the ``~/logs`` directory: .. code:: sh tail -f ~/logs/documents.log You can pass all the CLI options to the docker container as well: .. code:: sh docker run -it --rm \ -v ~/.fscrawler:/root/.fscrawler \ -v ~/tmp:/tmp/es:ro \ dadoonet/fscrawler job_name --restart --loop 1 See :ref:`cli-options` for more information. .. _docker-compose: Using docker compose -------------------- In this section, the following directory layout is assumed: .. code-block:: none . ├── .env ├── docs │ └── └── docker-compose.yml The ``.env`` file looks like this: .. code-block:: sh :substitutions: # Password for the 'elastic' user (at least 6 characters) ES_LOCAL_PASSWORD=changeme # Version of Elastic products ES_LOCAL_VERSION=|ES_stack_version| # Set the ES container name ES_LOCAL_CONTAINER_NAME=es-fscrawler # Set to 'basic' or 'trial' to automatically start the 30-day trial ES_LOCAL_LICENSE=basic #ES_LOCAL_LICENSE=trial # Port to expose Elasticsearch HTTP API to the host ES_LOCAL_PORT=9200 ES_LOCAL_DISK_SPACE_REQUIRED=1gb ES_LOCAL_JAVA_OPTS="-XX:UseSVE=0 -Xms128m -Xmx2g" # Project namespace (defaults to the current folder name if not set) COMPOSE_PROJECT_NAME=fscrawler # FSCrawler Settings FSCRAWLER_VERSION=|FSCrawler_version| FSCRAWLER_PORT=8080 # Optionally, you can change the log level settings FS_JAVA_OPTS="-DLOG_LEVEL=debug -DDOC_LEVEL=debug" And, the ``docker-compose.yml`` file looks like this: .. code:: yaml --- services: elasticsearch: image: docker.elastic.co/elasticsearch/elasticsearch:${ES_LOCAL_VERSION} container_name: ${ES_LOCAL_CONTAINER_NAME} volumes: - dev-elasticsearch:/usr/share/elasticsearch/data ports: - 127.0.0.1:${ES_LOCAL_PORT}:9200 environment: - discovery.type=single-node - ELASTIC_PASSWORD=${ES_LOCAL_PASSWORD} - xpack.security.enabled=true - xpack.security.http.ssl.enabled=false - xpack.license.self_generated.type=${ES_LOCAL_LICENSE} - xpack.ml.use_auto_machine_memory_percent=true - ES_JAVA_OPTS=${ES_LOCAL_JAVA_OPTS} - cluster.routing.allocation.disk.watermark.low=${ES_LOCAL_DISK_SPACE_REQUIRED} - cluster.routing.allocation.disk.watermark.high=${ES_LOCAL_DISK_SPACE_REQUIRED} - cluster.routing.allocation.disk.watermark.flood_stage=${ES_LOCAL_DISK_SPACE_REQUIRED} ulimits: memlock: soft: -1 hard: -1 healthcheck: test: [ "CMD-SHELL", "curl --output /dev/null --silent --head --fail -u elastic:${ES_LOCAL_PASSWORD} http://elasticsearch:9200", ] interval: 10s timeout: 10s retries: 30 # FSCrawler fscrawler: image: dadoonet/fscrawler:${FSCRAWLER_VERSION} container_name: fscrawler restart: always environment: - FS_JAVA_OPTS=${FS_JAVA_OPTS} - FSCRAWLER_ELASTICSEARCH_URLS=http://${ES_LOCAL_CONTAINER_NAME}:9200 - FSCRAWLER_ELASTICSEARCH_USERNAME=elastic - FSCRAWLER_ELASTICSEARCH_PASSWORD=${ES_LOCAL_PASSWORD} - FSCRAWLER_REST_URL=http://fscrawler:${FSCRAWLER_PORT} volumes: - ${PWD}/docs:/tmp/es:ro depends_on: elasticsearch: condition: service_healthy ports: - ${FSCRAWLER_PORT}:8080 command: --rest volumes: dev-elasticsearch: Copy your pdf/doc files into the ``docs`` directory and run the full stack, including FSCrawler with: .. code:: sh docker-compose up When the job has finished indexing, you can check your documents in Elasticsearch with: .. code:: sh curl -u elastic:changeme "http://localhost:9200/fscrawler/_search" .. note:: You will find this example in the ``contrib/docker-compose-example-elasticsearch`` project directory. Running as a Service on Windows ------------------------------- Create a ``fscrawlerRunner.bat`` as: .. code:: sh set JAVA_HOME=c:\Program Files\Java\jdk15.0.1 set FS_JAVA_OPTS=-Xmx2g -Xms2g /Elastic/fscrawler/bin/fscrawler.bat --config_dir /Elastic/fscrawler data >> /Elastic/logs/fscrawler.log 2>&1 Then use ``fscrawlerRunner.bat`` to create your windows service. .. _local-installation: Local installation ------------------ If you prefer to run FSCrawler from a ZIP distribution on your machine instead of Docker: .. ifconfig:: release.endswith('-SNAPSHOT') You can download FSCrawler |version| from |Download_URL|_. The filename ends with ``.zip``. .. warning:: This is a **SNAPSHOT** version. You can also download a **stable** version from |Maven_Central|_. .. note:: There's an issue with the download links for SNAPSHOT versions. .. hint:: Due to a bug with the underlying service we rely on to provide SNAPSHOT hosting, we've had to temporarily remove browse access for SNAPSHOT releases. You should still be able to publish and consume SNAPSHOT releases as usual, but you cannot browse them via the UI. So you must now download the `maven-metadata.xml `__ file. Check the ```` tag to find the latest SNAPSHOT version of the ``zip`` file. .. code:: xml zip 2.10-20250801.161301-75 20250801161301 Note the ``value`` tag which contains the version you need to download. And use that value in the following URL: https://central.sonatype.com/repository/maven-snapshots/fr/pilato/elasticsearch/crawler/fscrawler-distribution/2.10-SNAPSHOT/fscrawler-distribution-2.10-20250801.161301-75.zip .. ifconfig:: release == version You can download FSCrawler |version| from |Download_URL|_. .. tip:: This is a **stable** version. You can choose another version than |version| from |Maven_Central|_. You can also download a **SNAPSHOT** version from |Sonatype|_. After extracting the ZIP, you get a directory with ``bin/`` (run scripts), ``config/`` (logging), ``lib/`` (core and dependencies), ``external/`` (optional JARs), and ``logs/``. See :ref:`layout` for the full directory layout. Optional libraries (external) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ You may need to add JARs to the ``external`` directory for some formats. For example, to support JPEG2000 (JPX/JP2) images in PDFs, add the ``jai-imageio-jpeg2000`` library: download it from `Maven Central `_ and put ``jai-imageio-jpeg2000-1.4.0.jar`` in the ``external`` directory.