Open
Description
See NUTCH-2930
To avoid leaking information into a public search index or web archive, it should be possible to configure Nutch so that no content is fetched from localhost, loopback addresses, or private address spaces.
NUTCH-2527 adds the configuration snippets to exclude URLs pointing to private addresses.
However, filtering URLs isn't enough, because the DNS entry of an arbitrary host name may point to a private IP address. Blocking must happen at the protocol level because the IP address is only known in the protocol implementation. I'll add an implementation for protocol-okhttp.
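To illustrate the kind of check meant here, below is a minimal sketch using only the JDK. It is not the planned protocol-okhttp implementation: the class `PrivateAddressCheck` and its methods are hypothetical names, and the set of blocked ranges is an assumption. In OkHttp, a check like this could presumably be applied to the addresses returned by a custom `okhttp3.Dns` lookup.

```java
import java.io.UncheckedIOException;
import java.net.Inet6Address;
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper, not part of Nutch or OkHttp.
final class PrivateAddressCheck {

  private PrivateAddressCheck() {}

  /** True if the address is loopback, wildcard, link-local or private. */
  static boolean isBlocked(InetAddress addr) {
    if (addr.isLoopbackAddress()     // 127.0.0.0/8, ::1
        || addr.isAnyLocalAddress()  // 0.0.0.0, ::
        || addr.isLinkLocalAddress() // 169.254.0.0/16, fe80::/10
        || addr.isSiteLocalAddress()) { // 10/8, 172.16/12, 192.168/16
      return true;
    }
    // isSiteLocalAddress() does not cover IPv6 unique local
    // addresses (fc00::/7), so test the prefix explicitly.
    if (addr instanceof Inet6Address) {
      byte[] raw = addr.getAddress();
      return (raw[0] & 0xfe) == 0xfc;
    }
    return false;
  }

  /**
   * Resolves the host and reports whether any resulting address is blocked.
   * A literal IP address is parsed without a DNS query.
   */
  static boolean isBlockedHost(String host) {
    try {
      for (InetAddress addr : InetAddress.getAllByName(host)) {
        if (isBlocked(addr)) {
          return true;
        }
      }
      return false;
    } catch (UnknownHostException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

The check runs on the resolved addresses rather than on the URL, so a public host name whose DNS entry points to, say, `10.0.0.1` is rejected as well.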