In Hadoop, reduce tasks do not wait for every map task to finish before they are assigned.
By default:
JobInProgress.java
...
public synchronized Task obtainNewReduceTask(TaskTrackerStatus tts, int clusterSize,
                                             int numUniqueHosts) throws IOException {
  ...
  if (!scheduleReduces()) {
    return null;
  }
  ...
}
...
public synchronized boolean scheduleReduces() {
  return finishedMapTasks >= completedMapsForReduceSlowstart;
}
Here completedMapsForReduceSlowstart
= DEFAULT_COMPLETED_MAPS_PERCENT_FOR_REDUCE_SLOWSTART (the default) * numMapTasks
= 0.05 * the number of map tasks in this job
Purpose: Ensure we have sufficient map outputs ready to shuffle before scheduling reduces
...............................................................................................................
For example, suppose this job has 40 map tasks and 1 reduce task.
Then completedMapsForReduceSlowstart for this job = 0.05 * 40 = 2,
which means the job must complete at least 2 map tasks before a reduce task can be assigned.
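
The arithmetic above can be tried out with a small standalone sketch. This is not the actual JobInProgress code: the class name SlowstartSketch and the helper slowstartThreshold are hypothetical names made up for illustration, and rounding the product up with Math.ceil is an assumption of this sketch (the excerpt above does not show how the value is rounded). Only the comparison in scheduleReduces mirrors the excerpt.

```java
// Illustrative sketch only; names other than scheduleReduces are hypothetical.
class SlowstartSketch {
    // Default fraction quoted in the post
    // (DEFAULT_COMPLETED_MAPS_PERCENT_FOR_REDUCE_SLOWSTART).
    static final float DEFAULT_SLOWSTART_FRACTION = 0.05f;

    // Number of map tasks that must finish before reduces may be scheduled.
    // Assumption: the product is rounded up to a whole task count.
    static int slowstartThreshold(int numMapTasks, float fraction) {
        return (int) Math.ceil(fraction * numMapTasks);
    }

    // Same comparison as scheduleReduces() in the excerpt above.
    static boolean scheduleReduces(int finishedMapTasks, int threshold) {
        return finishedMapTasks >= threshold;
    }

    public static void main(String[] args) {
        // The example from the post: 40 map tasks, default fraction.
        int threshold = slowstartThreshold(40, DEFAULT_SLOWSTART_FRACTION);
        System.out.println("completedMapsForReduceSlowstart = " + threshold);
        for (int finished = 0; finished <= 3; finished++) {
            System.out.println(finished + " maps finished -> reduces schedulable: "
                    + scheduleReduces(finished, threshold));
        }
    }
}
```

With 40 map tasks the sketch yields a threshold of 2, so the check starts returning true once the second map task has finished, which matches the example.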