Version 2.5
A bug was causing a lot of data to be sent over the wire each time FSCrawler was running. To fix this issue, we changed the default mapping and we set store: true on the file.filename field. If this field is not stored and remove_deleted is true (default), FSCrawler will fail while crawling your documents. You need to create the new mapping accordingly and reindex your existing data, either by deleting the old index and running FSCrawler again, or by using the reindex API as follows:

# Backup old index data
POST _reindex
{
  "source": {
    "index": "job_name"
  },
  "dest": {
    "index": "job_name_backup"
  }
}

# Remove job_name index
DELETE job_name
Restart FSCrawler with the following command. It will just create the right mapping again.
$ bin/fscrawler job_name --loop 0
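If you want to double-check that the new mapping was applied before restoring your data, you can inspect it with the standard mapping API. In the response, the file.filename field should now have store set to true (on elasticsearch 6.x the fields are nested under the _doc type; the exact structure depends on your version):

# Check that file.filename is now stored
GET job_name/_mapping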
Then restore old data:
POST _reindex
{
  "source": {
    "index": "job_name_backup"
  },
  "dest": {
    "index": "job_name"
  }
}

# Remove backup index
DELETE job_name_backup
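As an optional sanity check, you can verify that your documents are back by counting them with the count API and comparing with the number of documents you had before the upgrade:

# Count restored documents
GET job_name/_count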
The default mapping changed in FSCrawler for meta.raw.* fields. It might be better to reindex your data.
The excludes parameter is now also used for directory names. But this new implementation also brings a breaking change if you were using excludes previously: in the previous implementation, the regular expression was only applied to the filename; it is now applied to the full virtual path name.

For example, if you have a /tmp dir as follows:

/tmp
└── folder
    ├── foo.txt
    └── bar.txt
Previously, excluding foo.txt excluded the virtual file /folder/foo.txt. If you still want to exclude any file named foo.txt whatever its directory, you now need to specify */foo.txt:

{
  "name" : "test",
  "fs": {
    "excludes": [
      "*/foo.txt"
    ]
  }
}
For more information, read Includes and excludes.
For new indices, FSCrawler now uses _doc as the default type name for clusters running elasticsearch 6.x or above.
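In practice, this means that on a 6.x cluster documents indexed by FSCrawler can be retrieved through the _doc type. For example (the document id below is a placeholder):

# Fetch a single indexed document by its id
GET job_name/_doc/<document_id>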