Sunday, September 30, 2012

When Reduce Tasks Get Assigned

In Hadoop, a reduce task does not wait until every map task has finished before it is assigned.
By default:

JobInProgress.java
...
public synchronized Task obtainNewReduceTask(TaskTrackerStatus tts, int clusterSize,
    int numUniqueHosts) throws IOException {
  ...
  if (!scheduleReduces()) {
    return null;
  }
  ...
}
...
public synchronized boolean scheduleReduces() {
  return finishedMapTasks >= completedMapsForReduceSlowstart;
}

completedMapsForReduceSlowstart
= (default: DEFAULT_COMPLETED_MAPS_PERCENT_FOR_REDUCE_SLOWSTART) * numMapTasks
= 0.05 * the number of map tasks in this job
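
To make the formula concrete, here is a minimal standalone sketch of the threshold and the check that gates reduce scheduling. It does not use any Hadoop classes: SlowstartThreshold and its method signatures are hypothetical names for illustration. Rounding the product up with Math.ceil is an assumption on my part; the formula above only gives the product.

```java
// Sketch only: mirrors the formula above outside of Hadoop.
// SlowstartThreshold and its methods are hypothetical, not Hadoop APIs.
final class SlowstartThreshold {
    // The default fraction quoted in this post (0.05).
    static final float DEFAULT_FRACTION = 0.05f;

    // Assumption: round up, so a positive fraction always requires
    // at least one finished map when the job has any map tasks.
    static int completedMapsForReduceSlowstart(int numMapTasks, float fraction) {
        return (int) Math.ceil(fraction * numMapTasks);
    }

    // Same comparison as scheduleReduces() in the excerpt.
    static boolean scheduleReduces(int finishedMapTasks, int threshold) {
        return finishedMapTasks >= threshold;
    }

    public static void main(String[] args) {
        int threshold = completedMapsForReduceSlowstart(100, DEFAULT_FRACTION);
        System.out.println("threshold = " + threshold);
        System.out.println("one below threshold: " + scheduleReduces(threshold - 1, threshold));
        System.out.println("at threshold: " + scheduleReduces(threshold, threshold));
    }
}
```

Passing the fraction as a parameter keeps the sketch usable for other values than the 0.05 default.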

Purpose: Ensure we have sufficient map outputs ready to shuffle before scheduling reduces

...............................................................................................................
For example, suppose this job has 40 map tasks and 1 reduce task.
Then completedMapsForReduceSlowstart for this job = 0.05 * 40 = 2

In other words, this job must complete at least 2 map tasks before its reduce task can be assigned.
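
The 40-map example can be walked through with a small simulation. This is only a sketch under stated assumptions: SlowstartWalkthrough and firstSchedulableMapCount are hypothetical names, nothing here calls Hadoop, and the threshold is rounded up with Math.ceil, which is my assumption rather than something shown in this post.

```java
// Sketch only: replays the 40-map example without any Hadoop classes.
// SlowstartWalkthrough and firstSchedulableMapCount are hypothetical names.
final class SlowstartWalkthrough {
    // Returns the smallest number of finished map tasks at which the
    // "finishedMapTasks >= completedMapsForReduceSlowstart" check passes.
    static int firstSchedulableMapCount(int numMapTasks, float fraction) {
        // Assumption: the product is rounded up to a whole number of tasks.
        int threshold = (int) Math.ceil(fraction * numMapTasks);
        for (int finished = 0; finished <= numMapTasks; finished++) {
            if (finished >= threshold) {
                return finished;
            }
        }
        return -1; // only reachable if fraction is greater than 1
    }

    public static void main(String[] args) {
        int numMapTasks = 40;   // from the example above
        float fraction = 0.05f; // default quoted in this post
        int first = firstSchedulableMapCount(numMapTasks, fraction);
        for (int finished = 0; finished <= 3; finished++) {
            System.out.println(finished + " maps finished, reduce can be assigned: "
                    + (finished >= first));
        }
    }
}
```

With 40 map tasks the check first passes at 2 finished maps, which matches the hand calculation.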