Understanding How HDFS Writes a File

This post walks through a small experiment that traces how HDFS writes a file.

NN : 192.168.1.1
DN1 : 192.168.1.2
DN2 : 192.168.1.3
DN3 : 192.168.1.4
Client : 192.168.1.1

$ ll read.txt
-rw-rw-r-- 1 hadoop hadoop 12 Apr  3 11:48 read.txt

NameNode analysis


First, look at the NameNode log:
2014-04-03 14:24:50,338 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /user/hadoop/read.txt._COPYING_. BP-398901529-192.168.1.1-1393416650594 blk_3945775701777059462_15982{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[192.168.1.2:50010|RBW], ReplicaUnderConstruction[192.168.1.3:50010|RBW], ReplicaUnderConstruction[192.168.1.4:50010|RBW]]}
2014-04-03 14:24:50,473 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 192.168.1.4:50010 is added to blk_3945775701777059462_15982{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[192.168.1.2:50010|RBW], ReplicaUnderConstruction[192.168.1.3:50010|RBW], ReplicaUnderConstruction[192.168.1.4:50010|RBW]]} size 0
2014-04-03 14:24:50,474 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 192.168.1.3:50010 is added to blk_3945775701777059462_15982{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[192.168.1.2:50010|RBW], ReplicaUnderConstruction[192.168.1.3:50010|RBW], ReplicaUnderConstruction[192.168.1.4:50010|RBW]]} size 0
2014-04-03 14:24:50,476 INFO BlockStateChange: BLOCK* addStoredBlock: blockMap updated: 192.168.1.2:50010 is added to blk_3945775701777059462_15982{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, replicas=[ReplicaUnderConstruction[192.168.1.2:50010|RBW], ReplicaUnderConstruction[192.168.1.3:50010|RBW], ReplicaUnderConstruction[192.168.1.4:50010|RBW]]} size 0
2014-04-03 14:24:50,477 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /user/hadoop/read.txt._COPYING_ is closed by DFSClient_NONMAPREDUCE_1320389024_1

My Hadoop cluster has only 3 DataNodes online, and replication is also set to 3, so each machine stores exactly one replica of the block.

DataNode analysis


Packets were captured separately on 192.168.1.2 (DN1), 192.168.1.3 (DN2), and 192.168.1.4 (DN3).

DN1 : 192.168.1.2

Three-way handshake (the client establishes a connection to DN1)

14:24:50.367036 IP 192.168.1.1.53561 > 192.168.1.2.50010: S 1235675786:1235675786(0) win 14600 
14:24:50.367142 IP 192.168.1.2.50010 > 192.168.1.1.53561: S 3430371344:3430371344(0) ack 1235675787 win 14480 
14:24:50.367183 IP 192.168.1.1.53561 > 192.168.1.2.50010: . ack 1 win 29 

Client ↔ DN1 (the client sends the first packet: 439 bytes, seq 1:440, so the TCP ack for it will be 440)
14:24:50.448286 IP 192.168.1.1.53561 > 192.168.1.2.50010: P 1:440(439) ack 1 win 29 
14:24:50.448336 IP 192.168.1.2.50010 > 192.168.1.1.53561: . ack 440 win 31 

DN1 ↔ DN2 (three-way handshake: DN1 establishes a connection to DN2)
14:24:50.449765 IP 192.168.1.2.60024 > 192.168.1.3.50010: S 753790100:753790100(0) win 14600 
14:24:50.449978 IP 192.168.1.3.50010 > 192.168.1.2.60024: S 839637351:839637351(0) ack 753790101 win 14480 
14:24:50.450051 IP 192.168.1.2.60024 > 192.168.1.3.50010: . ack 1 win 29 

DN1 ↔ DN2 (DN1 forwards the first packet to DN2 and receives a confirmation)
14:24:50.450304 IP 192.168.1.2.60024 > 192.168.1.3.50010: P 1:319(318) ack 1 win 29 
14:24:50.450437 IP 192.168.1.3.50010 > 192.168.1.2.60024: . ack 319 win 31 
14:24:50.455004 IP 192.168.1.3.50010 > 192.168.1.2.60024: P 1:6(5) ack 319 win 31 
14:24:50.455020 IP 192.168.1.2.60024 > 192.168.1.3.50010: . ack 6 win 29 

DN1 ↔ Client (look closely: the ack below is 440, so only here is the first packet acknowledged, meaning all three DNs have finished processing it)
14:24:50.455225 IP 192.168.1.2.50010 > 192.168.1.1.53561: P 1:6(5) ack 440 win 31 
14:24:50.455384 IP 192.168.1.1.53561 > 192.168.1.2.50010: . ack 6 win 29 
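The timestamps let us read the end-to-end pipeline latency for packet 1 straight off the trace: the client pushed the packet at 14:24:50.448286, and DN1's application-level reply (the one carrying ack 440) arrived at 14:24:50.455225. A quick check of that delta, with the two timestamps copied from the capture above:

```python
from datetime import datetime

def delta_ms(t_send, t_ack, fmt="%H:%M:%S.%f"):
    """Millisecond difference between two same-day tcpdump timestamps."""
    a = datetime.strptime(t_send, fmt)
    b = datetime.strptime(t_ack, fmt)
    return (b - a).total_seconds() * 1000.0

# Client sends packet 1 -> DN1 returns the HDFS-level ack.
latency = delta_ms("14:24:50.448286", "14:24:50.455225")
print(f"pipeline ack latency for packet 1: {latency:.3f} ms")  # ~6.939 ms
```

So a full round trip through all three DataNodes (forward twice, write three times, ack three times) cost about 7 ms here.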

The client sends the second packet: 47 bytes, seq 440:487
14:24:50.464315 IP 192.168.1.1.53561 > 192.168.1.2.50010: P 440:487(47) ack 6 win 29 

DN1 ↔ DN2 (DN1 forwards the second packet to DN2)
14:24:50.464508 IP 192.168.1.2.60024 > 192.168.1.3.50010: P 319:366(47) ack 6 win 29 
14:24:50.467019 IP 192.168.1.3.50010 > 192.168.1.2.60024: P 6:17(11) ack 366 win 31 

DN1 ↔ Client (the second packet is acknowledged, meaning all 3 DNs have finished processing it)
14:24:50.467885 IP 192.168.1.2.50010 > 192.168.1.1.53561: P 6:20(14) ack 487 win 31 

The client sends the third packet
14:24:50.471012 IP 192.168.1.1.53561 > 192.168.1.2.50010: P 487:518(31) ack 20 win 29 

DN1 ↔ DN2 (DN1 forwards the third packet to DN2)
14:24:50.471167 IP 192.168.1.2.60024 > 192.168.1.3.50010: P 366:397(31) ack 17 win 29 
14:24:50.474400 IP 192.168.1.3.50010 > 192.168.1.2.60024: P 17:29(12) ack 397 win 31 
14:24:50.474786 IP 192.168.1.3.50010 > 192.168.1.2.60024: F 29:29(0) ack 397 win 31 

DN1 ↔ Client (DN1 tells the client the write has finished)
14:24:50.475349 IP 192.168.1.2.50010 > 192.168.1.1.53561: P 20:34(14) ack 518 win 31 
....

DN1 tears down its connections to DN2 and to the client; note the final timestamp, 14:24:50.476223
14:24:50.475771 IP 192.168.1.1.53561 > 192.168.1.2.50010: F 518:518(0) ack 34 win 29 
14:24:50.476047 IP 192.168.1.2.60024 > 192.168.1.3.50010: F 397:397(0) ack 30 win 29 
14:24:50.476081 IP 192.168.1.2.50010 > 192.168.1.1.53561: F 34:34(0) ack 519 win 31 
14:24:50.476186 IP 192.168.1.1.53561 > 192.168.1.2.50010: . ack 35 win 29 
14:24:50.476223 IP 192.168.1.3.50010 > 192.168.1.2.60024: . ack 398 win 31 

Now look at the log on DN1:
2014-04-03 14:24:50,448 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-398901529-192.168.1.1-1393416650594:blk_3945775701777059462_15982 src: /192.168.1.1:53561 dest: /192.168.1.2:50010
2014-04-03 14:24:50,475 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.1.1:53561, dest: /192.168.1.2:50010, bytes: 12, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_1320389024_1, offset: 0, srvID: DS-1250979778-192.168.1.2-50010-1393417978787, blockid: BP-398901529-192.168.1.1-1393416650594:blk_3945775701777059462_15982, duration: 19058333
2014-04-03 14:24:50,475 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-398901529-192.168.1.1-1393416650594:blk_3945775701777059462_15982, type=HAS_DOWNSTREAM_IN_PIPELINE terminating

DN2 : 192.168.1.3

Three-way handshake, starting at 14:24:50.449860

14:24:50.449860 IP 192.168.1.2.60024 > 192.168.1.3.50010: S 753790100:753790100(0) win 14600 
14:24:50.449953 IP 192.168.1.3.50010 > 192.168.1.2.60024: S 839637351:839637351(0) ack 753790101 win 14480 
14:24:50.449981 IP 192.168.1.2.60024 > 192.168.1.3.50010: . ack 1 win 29                        

Data transfer begins: TCP traffic between 192.168.1.2 and 192.168.1.3 (note that DN2 opens its connection to DN3 right after receiving the first packet)
14:24:50.450292 IP 192.168.1.2.60024 > 192.168.1.3.50010: P 1:319(318) ack 1 win 29 
14:24:50.450308 IP 192.168.1.3.50010 > 192.168.1.2.60024: . ack 319 win 31 
14:24:50.451944 IP 192.168.1.3.36534 > 192.168.1.4.50010: S 2811947842:2811947842(0) win 14600 
...
...

Connection teardown; note the final timestamp, 14:24:50.476039
14:24:50.474584 IP 192.168.1.3.36534 > 192.168.1.4.50010: F 243:243(0) ack 21 win 29 
14:24:50.474631 IP 192.168.1.3.50010 > 192.168.1.2.60024: F 29:29(0) ack 397 win 31 
14:24:50.474798 IP 192.168.1.4.50010 > 192.168.1.3.36534: . ack 244 win 31 
14:24:50.476009 IP 192.168.1.2.60024 > 192.168.1.3.50010: F 397:397(0) ack 30 win 29 
14:24:50.476039 IP 192.168.1.3.50010 > 192.168.1.2.60024: . ack 398 win 31 

Below is the DN2 log:
2014-04-03 14:24:50,451 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-398901529-192.168.1.1-1393416650594:blk_3945775701777059462_15982 src: /192.168.1.2:60024 dest: /192.168.1.3:50010
2014-04-03 14:24:50,474 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.1.2:60024, dest: /192.168.1.3:50010, bytes: 12, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_1320389024_1, offset: 0, srvID: DS-136573777-192.168.1.3-50010-1393417978720, blockid: BP-398901529-192.168.1.1-1393416650594:blk_3945775701777059462_15982, duration: 18129080
2014-04-03 14:24:50,474 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-398901529-192.168.1.1-1393416650594:blk_3945775701777059462_15982, type=HAS_DOWNSTREAM_IN_PIPELINE terminating

DN3 : 192.168.1.4

Three-way handshake, starting at 14:24:50.452130

14:24:50.452130 IP 192.168.1.3.36534 > 192.168.1.4.50010: S 2811947842:2811947842(0) win 14600 
14:24:50.452426 IP 192.168.1.4.50010 > 192.168.1.3.36534: S 2051537423:2051537423(0) ack 2811947843 win 14480 
14:24:50.452224 IP 192.168.1.3.36534 > 192.168.1.4.50010: . ack 1 win 29 

Data transfer begins: TCP traffic between 192.168.1.3 and 192.168.1.4
14:24:50.452552 IP 192.168.1.3.36534 > 192.168.1.4.50010: P 1:165(164) ack 1 win 29 
14:24:50.452575 IP 192.168.1.4.50010 > 192.168.1.3.36534: . ack 165 win 31 

14:24:50.454682 IP 192.168.1.4.50010 > 192.168.1.3.36534: P 1:6(5) ack 165 win 31 
14:24:50.454864 IP 192.168.1.3.36534 > 192.168.1.4.50010: . ack 6 win 29 

14:24:50.464797 IP 192.168.1.3.36534 > 192.168.1.4.50010: P 165:212(47) ack 6 win 29 
14:24:50.466173 IP 192.168.1.4.50010 > 192.168.1.3.36534: P 6:13(7) ack 212 win 31 

14:24:50.471383 IP 192.168.1.3.36534 > 192.168.1.4.50010: P 212:243(31) ack 13 win 29 
14:24:50.473232 IP 192.168.1.4.50010 > 192.168.1.3.36534: P 13:20(7) ack 243 win 31 

Connection teardown; note the final timestamp, 14:24:50.474757
14:24:50.473901 IP 192.168.1.4.50010 > 192.168.1.3.36534: F 20:20(0) ack 243 win 31
14:24:50.474728 IP 192.168.1.3.36534 > 192.168.1.4.50010: F 243:243(0) ack 21 win 29
14:24:50.474757 IP 192.168.1.4.50010 > 192.168.1.3.36534: . ack 244 win 31 

Below is the DN3 log:
2014-04-03 14:24:50,453 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-398901529-192.168.1.1-1393416650594:blk_3945775701777059462_15982 src: /192.168.1.3:36534 dest: /192.168.1.4:50010
2014-04-03 14:24:50,472 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.1.3:36534, dest: /192.168.1.4:50010, bytes: 12, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_1320389024_1, offset: 0, srvID: DS-2002629359-192.168.1.4-50010-1393417979543, blockid: BP-398901529-192.168.1.1-1393416650594:blk_3945775701777059462_15982, duration: 16898199
2014-04-03 14:24:50,473 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-398901529-192.168.1.1-1393416650594:blk_3945775701777059462_15982, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
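One detail worth noting across the three clienttrace lines: the `duration` field appears to be in nanoseconds (the converted values line up with the tcpdump timeline, which spans roughly 27 ms from first SYN to last FIN). Converting the three values copied from the logs above:

```python
# duration values (in nanoseconds) copied from the three clienttrace log lines
durations_ns = {
    "DN1 (192.168.1.2)": 19058333,
    "DN2 (192.168.1.3)": 18129080,
    "DN3 (192.168.1.4)": 16898199,
}

for node, ns in durations_ns.items():
    print(f"{node}: {ns / 1e6:.2f} ms")
# DN1 ~19.06 ms, DN2 ~18.13 ms, DN3 ~16.90 ms: the further down the
# pipeline a node is, the later it starts receiving and the earlier it
# finishes (it only has to ack, not wait for downstream), so its
# measured duration is shortest.
```

This matches the log timestamps: DN3's PacketResponder terminates first (14:24:50,473), then DN2's and DN1's (14:24:50,474 and 14:24:50,475), as the acks propagate back up the pipeline.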

Summary


[Diagram: hadoop_write — the HDFS write pipeline]
The client initiates the file write by first splitting the local file into blocks. Before writing each block it asks the NN, in effect telling it "I have a Block1 to write", and the NN returns a list of DNs the client may write to — three of them here (the exact placement decision is made by the NN). The client then opens a connection to the first DN and starts sending packets (slices of Block1); internally the client maintains a send queue that holds every in-flight packet.

While the client is transmitting packet 1, DN1 opens a TCP connection to DN2 and DN2 opens one to DN3 — the timestamps above show exactly this ordering. When DN1 receives packet 1 it forwards it to DN2, and DN2 forwards it on to DN3. Once DN3 has received it, DN3 replies to DN2, DN2 replies to DN1, and DN1 replies to the client. On receiving this ack, the client removes packet 1 from the send queue and starts sending packet 2. This repeats until the whole block has been transferred, and the next block then goes through the same steps. When every block has been sent, the client tells the NN the upload succeeded, and the NN commits the file's metadata into its namespace — that is the last line of the NN log above.
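The flow above can be sketched as a toy simulation — plain Python objects, not the real HDFS classes: a client keeps each in-flight packet in an ack queue and only drops it when the ack comes back from the whole chain, while each datanode stores the packet, forwards it downstream, and then acks upstream.

```python
from collections import deque

class DataNode:
    """Toy pipeline node: store the packet, forward downstream, ack upstream."""
    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream
        self.received = []

    def write_packet(self, packet):
        self.received.append(packet)          # write to the local "disk"
        if self.downstream:                   # forward down the pipeline...
            self.downstream.write_packet(packet)
        return self.name                      # ...then ack back upstream

class Client:
    """Toy HDFS client: a packet stays in the ack queue while in flight."""
    def __init__(self, pipeline_head):
        self.pipeline = pipeline_head
        self.ack_queue = deque()

    def write_block(self, packets):
        for packet in packets:
            self.ack_queue.append(packet)     # in flight until acked
            self.pipeline.write_packet(packet)
            self.ack_queue.popleft()          # ack arrived: drop the packet

# Build the pipeline DN1 -> DN2 -> DN3, matching the capture above.
dn3 = DataNode("DN3")
dn2 = DataNode("DN2", downstream=dn3)
dn1 = DataNode("DN1", downstream=dn2)
client = Client(dn1)
client.write_block(["packet1", "packet2", "packet3"])

assert dn1.received == dn2.received == dn3.received == ["packet1", "packet2", "packet3"]
```

The real client runs this on two threads (a sender draining the data queue and a response handler draining the ack queue), so several packets can be in flight at once; this synchronous sketch only preserves the ordering of events that the packet captures show.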

Tags: Hadoop