Applications using HDFS will be able to read data up to 59x faster thanks to this new feature.
Centralized cache management has been added to the NameNode. The NameNode instructs specific DataNodes to cache specific data, and each DataNode periodically sends a cache report back to the NameNode.
Caching is user-driven. A user specifies which files or directories to cache, and the NameNode asks DataNodes to cache the data by piggy-backing a cache command on the DataNode heartbeat reply.
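
The exchange described above can be sketched as a toy model. This is an illustration only, not the HDFS API: the class names ToyNameNode and ToyDataNode, the block ids, and the rule of caching on the first replica holder are all assumptions made for the example.

```python
from collections import defaultdict


class ToyNameNode:
    """Illustrative model only; not the real HDFS NameNode."""

    def __init__(self, block_locations):
        # block id -> list of DataNode ids holding a replica on disk
        self.block_locations = block_locations
        # DataNode id -> block ids it still has to be told to cache
        self.pending = defaultdict(list)
        # DataNode id -> block ids it last reported as cached
        self.cached = defaultdict(set)

    def add_cache_directive(self, block_ids):
        """User request: cache each block on one of its replica holders.

        Assumption: the first listed replica holder is chosen.
        """
        for block_id in block_ids:
            target = self.block_locations[block_id][0]
            self.pending[target].append(block_id)

    def heartbeat(self, datanode_id):
        """Reply to a heartbeat, piggy-backing any queued cache commands."""
        commands = self.pending.pop(datanode_id, [])
        return {"cache": commands}

    def cache_report(self, datanode_id, block_ids):
        """Record what a DataNode says it currently holds in memory."""
        self.cached[datanode_id] = set(block_ids)


class ToyDataNode:
    """Illustrative model only; not the real HDFS DataNode."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.cache = set()

    def tick(self, namenode):
        # One heartbeat round: apply cache commands, then send a cache report.
        reply = namenode.heartbeat(self.node_id)
        self.cache.update(reply["cache"])
        namenode.cache_report(self.node_id, sorted(self.cache))


namenode = ToyNameNode({"blk_1": ["dn1", "dn2"], "blk_2": ["dn2", "dn3"]})
namenode.add_cache_directive(["blk_1", "blk_2"])
for datanode in (ToyDataNode("dn1"), ToyDataNode("dn2"), ToyDataNode("dn3")):
    datanode.tick(namenode)
print(dict(namenode.cached))
```

The point of the sketch is that the NameNode never opens a separate connection to push cache commands: they wait in a per-DataNode queue and travel on the next heartbeat reply, and the cache report is how the NameNode learns what actually ended up in memory.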
When scheduling a job, application schedulers give priority to a DataNode that holds the data in cache over DataNodes that hold other replicas, to take advantage of cache locality.
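
A minimal sketch of that preference, assuming the scheduler already knows which nodes hold a replica and which of them report the block as cached. The function name pick_node and the node ids are hypothetical; real schedulers weigh many more factors.

```python
def pick_node(replica_nodes, cached_nodes):
    """Return a node for the task, preferring one with the block cached.

    replica_nodes: ordered list of nodes holding a replica (at least one).
    cached_nodes: set of nodes that report the block as cached in memory.
    """
    for node in replica_nodes:
        if node in cached_nodes:
            return node
    # No cached copy anywhere: fall back to the first on-disk replica.
    return replica_nodes[0]


chosen_with_cache = pick_node(["dn1", "dn2", "dn3"], {"dn2"})
chosen_without_cache = pick_node(["dn1", "dn2", "dn3"], set())
print(chosen_with_cache, chosen_without_cache)
```

With a cached copy on dn2 the task goes there even though dn1 is listed first; with no cached copy the choice falls back to ordinary replica order.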