Version 2.5

  • A bug was causing a lot of data going over the wire each time FSCrawler was running. To fix this issue, we changed the default mapping and we set store: true on field file.filename. If this field is not stored and remove_deleted is true (default), FSCrawler will fail while crawling your documents. You need to create the new mapping accordingly and reindex your existing data either by deleting the old index and running again FSCrawler or by using the reindex API as follows:

    # Backup old index data
    POST _reindex
    {
      "source": {
        "index": "job_name"
      },
      "dest": {
        "index": "job_name_backup"
      }
    }
    # Remove job_name index
    DELETE job_name
    

    Restart FSCrawler with the following command. It will just create the right mapping again.

    $ bin/fscrawler job_name --loop 0
    

    Then restore old data:

    POST _reindex
    {
      "source": {
        "index": "job_name_backup"
      },
      "dest": {
        "index": "job_name"
      }
    }
    # Remove backup index
    DELETE job_name_backup
    

    The default mapping changed for FSCrawler for meta.raw.* fields. Might be better to reindex your data.

  • The excludes parameter is also used for directory names. But this new implementation also brings a breaking change if you were using excludes previously. In the previous implementation, the regular expression was only applied to the filename. It’s now applied to the full virtual path name.

    For example if you have a /tmp dir as follows:

    /tmp
    └── folder
        ├── foo.txt
        └── bar.txt
    

    Previously excluding foo.txt was excluding the virtual file /folder/foo.txt. If you still want to exclude any file named foo.txt whatever its directory you now need to specify */foo.txt:

    {
      "name" : "test",
      "fs": {
        "excludes": [
          "*/foo.txt"
        ]
      }
    }
    

    For more information, read Includes and excludes.

  • For new indices, FSCrawler now uses _doc as the default type name for clusters running elasticsearch 6.x or superior.