In a Hadoop setup (or, for that matter, any Big Data setup), memory issues are not unexpected!
Here is an update on a couple of issues we have seen of late:
1. NameNode process gets stuck:
In this case, you will typically see the following symptoms:
a. The DataNode reports the following timeout error:
WARN ipc.Client: Exception encountered while connecting to the server : java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/<datanode-ip>:<datanode-port> remote=/<namenode-ip>:<namenode-port>]
ls: Failed on local exception: java.io.IOException: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/<datanode-ip>:<datanode-port> remote=/<namenode-ip>:<namenode-port>]; Host Details : local host is: "<datanode-ip>"; destination host is: "<namenode-ip>":<namenode-port>;
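=> What this essentially means is that the DataNode process timed out while waiting for the NameNode to respond. Note that the socket is reported as connected, so the connection itself was made; it is the read that timed out. So the obvious next step is to check why the NameNode didn’t respond.
b. On checking the NameNode logs, we observed the following warning: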
WARN org.apache.hadoop.util.JvmPauseMonitor (org.apache.hadoop.util.JvmPauseMonitor$Monitor@18df3f69): Detected pause in JVM or host machine (eg GC): pause of approximately 74135ms No GCs detected
This indicates that the NameNode paused for about 74 seconds, which is longer than the 60000 ms timeout. This also explains why the DataNode did not get a response from the NameNode within the designated 60000 ms.
The warning also indicates that the pause was not due to GC. Typically, GC can cause such ‘stop the world’ pauses, and when that is the case, it calls for memory profiling and GC tuning.
However, in this case, it turned out that CPU activity on the master node was very high because of another cron job. Once we sorted out the cron job, the issue was resolved.
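To make the check in step b repeatable, here is a minimal sketch of a shell function that scans a log for JvmPauseMonitor warnings and reports whether each long pause came with GC activity. The function name find_jvm_pauses and the default threshold of 60000 ms are our own choices for illustration. The sketch also assumes that the GC detail ("No GCs detected" or a "GC pool ..." entry) appears either on the same line as the warning or on the lines right after it.

```shell
# Print every JvmPauseMonitor pause at or above a threshold, tagged as
# "GC", "non-GC" or "unknown" (no GC detail found near the warning).
# Usage: find_jvm_pauses LOG_FILE [THRESHOLD_MS]
find_jvm_pauses() {
  awk -v threshold="${2:-60000}" '
    # Print the pause we are currently tracking, if any, then forget it.
    function emit() {
      if (pending != "") { print pending "ms " cause; pending = "" }
    }
    /Detected pause in JVM or host machine/ {
      emit()
      ms = $0
      sub(/.*pause of approximately /, "", ms)   # drop everything before the number
      sub(/ms.*/, "", ms)                        # drop everything after the number
      if (ms + 0 >= threshold + 0) { pending = ms; cause = "unknown" }
    }
    # The GC detail may sit on the warning line itself or on a following line.
    pending != "" && /No GCs detected/ { cause = "non-GC" }
    pending != "" && /GC pool/         { cause = "GC" }
    END { emit() }
  ' "$1"
}
```

Run against the NameNode log containing the warning above, this should report the 74135 ms pause as non-GC, which is the hint to look at the host (CPU, I/O, other jobs) rather than at heap settings.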
2. DataNode process OOM:
Depending on the size of the data and the amount of data activity, you may occasionally observe an OOM (OutOfMemoryError) in the DataNode process.
A quick fix is to allocate more memory to the DataNode process. Typically, the following configuration change helps:
Update the value of HADOOP_DATANODE_HEAPSIZE in <HADOOP HOME>/conf/hadoop-env.sh
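As a rough illustration, the change could look like the fragment below. The value 2048 is only a placeholder, and we are assuming the variable is read in megabytes (as HADOOP_HEAPSIZE is in stock Hadoop 2.x); please confirm the unit and a sensible size for your distribution and node RAM.

```shell
# <HADOOP HOME>/conf/hadoop-env.sh
# Placeholder value, assumed to be in MB: size it to the RAM actually available on the node.
export HADOOP_DATANODE_HEAPSIZE=2048
```

The DataNode has to be restarted for the new heap size to take effect.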
Also, it is advisable to configure the DataNode to generate a heap dump on an OOM error. That will help with further analysis of the heap if you get the same error again.
(This is applicable to other processes as well: NN/RM/NM etc.)
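One way to enable the heap dump is through the standard HotSpot flags -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath. The sketch below assumes a Hadoop 2.x style hadoop-env.sh, where HADOOP_DATANODE_OPTS carries extra JVM options for the DataNode; the dump directory is a hypothetical path chosen for illustration.

```shell
# <HADOOP HOME>/conf/hadoop-env.sh
# Assumption: HADOOP_DATANODE_OPTS is how this setup passes extra JVM flags to the DataNode.
# /var/log/hadoop/dumps is a hypothetical directory: it must already exist, be writable
# by the DataNode user, and have enough free space for a dump as large as the heap.
export HADOOP_DATANODE_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/hadoop/dumps ${HADOOP_DATANODE_OPTS:-}"
```

When HeapDumpPath points to a directory, the JVM writes a file named java_pid<pid>.hprof there, which can then be opened in a heap analyzer. For the other daemons, the same two flags would go into the corresponding variables (for example HADOOP_NAMENODE_OPTS, or YARN_RESOURCEMANAGER_OPTS and YARN_NODEMANAGER_OPTS in yarn-env.sh), again assuming a Hadoop 2.x layout.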