Tips and tricks¶
Moving files to a “watched” directory¶
When moving an existing file into the directory FSCrawler is watching, you
need to explicitly touch the files, because moved files keep their
original modification date and will not be detected as changed:
# single file
touch file_you_moved
# all files
find . -type f -exec touch {} +
# all .txt files
find . -type f -name "*.txt" -exec touch {} +
Alternatively, you can restart from the beginning with the --restart
option, which will reindex everything.
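For example, a minimal sketch assuming a local installation and a job named job_name (a placeholder):
# restart from the beginning and reindex everything for this job
bin/fscrawler job_name --restart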
Workaround for huge temporary files¶
FSCrawler uses a media library that currently does not clean up its temporary files. Parsing MP4 files may create very large temporary files in /tmp. The following commands can be useful, e.g. as a cron job, to automatically delete those files once they are old and no longer in use. Adapt the commands as needed.
# Check all files in /tmp
find /tmp \( -name 'apache-tika-*.tmp-*' -o -name 'MediaDataBox*' \) -type f -mmin +15 ! -exec fuser -s {} \; -delete
# When using a systemd service with PrivateTMP enabled
find $(find /tmp -maxdepth 1 -type d -name 'systemd-private-*-fscrawler.service-*') \( -name 'apache-tika-*.tmp-*' -o -name 'MediaDataBox*' \) -type f -mmin +15 ! -exec fuser -s {} \; -delete
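To schedule this, a crontab sketch that runs the first command every 15 minutes (the interval is an assumption; adapt it to your setup):
# clean up stale Tika temporary files every 15 minutes
*/15 * * * * find /tmp \( -name 'apache-tika-*.tmp-*' -o -name 'MediaDataBox*' \) -type f -mmin +15 ! -exec fuser -s {} \; -delete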
Indexing from HDFS drive¶
There is no specific support for HDFS in FSCrawler, but you can mount your HDFS filesystem on your machine and run FSCrawler on that mount point. You can also read details about the HDFS NFS Gateway.
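For example, a sketch assuming an HDFS NFS Gateway running on a host named hdfs-gateway and a local mount point /mnt/hdfs (both placeholders):
# mount HDFS through the NFS Gateway (NFSv3)
sudo mount -t nfs -o vers=3,proto=tcp,nolock hdfs-gateway:/ /mnt/hdfs
The job's fs.url setting can then point at a directory below /mnt/hdfs.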
Using docker with FSCrawler REST¶
To use the REST service, available from FSCrawler 2.2, you can add the --rest
flag to the command the FSCrawler docker container runs.
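For example, a docker-compose sketch, assuming a service named fscrawler and a job named job_name (both placeholders):
services:
  fscrawler:
    image: dadoonet/fscrawler
    # job_name is a placeholder; --rest starts the REST service
    command: fscrawler job_name --rest
    # assumed config location inside the image; adjust to your setup
    volumes:
      - ./config:/root/.fscrawler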
Note that you must expose the same ports that the REST service opens in the docker container. For example, if your REST service starts on 127.0.0.1:8080,
then expose the same ports in your FSCrawler docker-compose image.
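For example, mapping the assumed port 8080 from the container to the host:
    ports:
      # container REST port published on the host
      - "8080:8080"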
Then point the REST URL in your settings.yaml at the container you've created by replacing the IP
with the docker-compose container name.
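For example, a sketch assuming the fscrawler service name from the compose file above (the URL path may differ between FSCrawler versions):
rest:
  # "fscrawler" is the assumed docker-compose service name
  url: "http://fscrawler:8080/fscrawler"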
Pull the Docker image:
docker pull dadoonet/fscrawler
Run it:
docker run dadoonet/fscrawler job