In this post we will learn about DistCp in Hadoop and look at several aspects of using it.
- What is DistCp
DistCp (Distributed Copy) is a tool for copying large data sets within a cluster or between clusters. It uses MapReduce for distribution, error handling, recovery, and reporting. It expands a list of files and directories into input for map tasks, each of which copies a partition of the files from the source to the destination.
- How DistCp Works
By default, DistCp copies files that are missing at the destination and skips files that already exist there. When the -update option is used, it compares each file at the source and destination: if the destination file size differs from the source file size, it copies the file; if the size is the same, it skips the copy.
- Why Prefer DistCp over cp, get and put Commands
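For large data sets, DistCp is usually much faster than the get/put commands because it copies files in parallel using map tasks. Also, get and put move data between the local file system and HDFS, so they cannot copy directly from one HDFS location to another, whether within a cluster or between clusters.
- Where We Can Run DistCp Command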
We can run the DistCp command on both secure (Kerberos-enabled) and insecure clusters. To run DistCp on a secure cluster we first need to run the kinit command, which obtains a Kerberos ticket for the user.
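On a secure cluster, the kinit step and the copy can be combined in a small wrapper. The following is only a sketch: `secure_distcp` is a hypothetical helper, not part of Hadoop, and the principal and URIs passed to it are whatever applies to your environment. It relies on `klist -s`, which prints nothing and exits with a non-zero status when there is no valid ticket in the cache.

```shell
#!/bin/sh
# secure_distcp: hypothetical wrapper, not part of Hadoop.
# Usage: secure_distcp <principal> <source_uri> <destination_uri>
secure_distcp() {
    principal=$1; src=$2; dst=$3
    # klist -s is silent and exits non-zero when no valid ticket is cached
    if ! klist -s; then
        # kinit prompts for the password and obtains a ticket for the principal
        kinit "$principal" || return 1
    fi
    # With a valid ticket in place, run the copy as usual
    hadoop distcp "$src" "$dst"
}
```

A call would look like `secure_distcp etl_user@EXAMPLE.COM hdfs://nn1.example.com:8020/data hdfs://nn2.example.com:8020/data`, where the principal and host names are made up for illustration.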
- How To Run DistCp command
There are different ways to run DistCp for inter-cluster and intra-cluster copying.
Intra Cluster :-
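1) Without Any Option
hadoop distcp hdfs://<Source_Name_Node>:8020/<File_Path>/<File_Name> hdfs://<Destination_Name_Node>:8020/<File_Path>
2) Update Option
hadoop distcp -update hdfs://<Source_Name_Node>:8020/<File_Path>/<File_Name> hdfs://<Destination_Name_Node>:8020/<File_Path>/<File_Name>
3) Overwrite Option
hadoop distcp -overwrite hdfs://<Source_Name_Node>:8020/<File_Path>/<File_Name> hdfs://<Destination_Name_Node>:8020/<File_Path>/<File_Name>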
Source_Name_Node :- Address of the name node of the cluster from which the data is copied.
Destination_Name_Node :- Address of the name node of the cluster to which the data is copied.
File_Path :- Path where the file resides on the source cluster, or where it should be copied to on the destination cluster.
File_Name :- Name of the file to be copied.
We need to pass the full address of the name node. On HA clusters where a nameservice is enabled, using the nameservice in place of the name node address makes DistCp throw an error if the client's HDFS configuration cannot resolve that nameservice.
In method 1, we should not append the file name to the destination path. If we do, DistCp creates a subfolder with the file name and copies all the partitioned files inside that folder.
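To make the placeholders concrete, here is a small shell sketch that assembles the three command forms. The host names `nn1.example.com` and `nn2.example.com`, the path `/data/sales`, and the file name `part-0000` are hypothetical; 8020 is the port used in the commands above. The helper `build_distcp` is an illustration only, not part of Hadoop, and it just prints each command so it can be reviewed before it is run.

```shell
#!/bin/sh
# build_distcp: print a distcp command line for review (hypothetical helper).
# Usage: build_distcp <option-or-empty> <src_nn> <dst_nn> <file_path> <file_name>
build_distcp() {
    opt=$1; src_nn=$2; dst_nn=$3; file_path=$4; file_name=$5
    src="hdfs://${src_nn}:8020${file_path}/${file_name}"
    if [ -z "$opt" ]; then
        # Method 1: the destination is the directory only, without the file name
        dst="hdfs://${dst_nn}:8020${file_path}"
        echo "hadoop distcp ${src} ${dst}"
    else
        # Methods 2 and 3: the option goes after "distcp", before the paths
        dst="hdfs://${dst_nn}:8020${file_path}/${file_name}"
        echo "hadoop distcp ${opt} ${src} ${dst}"
    fi
}

# Hypothetical hosts and paths, used only to show the shape of each command
build_distcp ""         nn1.example.com nn2.example.com /data/sales part-0000
build_distcp -update    nn1.example.com nn2.example.com /data/sales part-0000
build_distcp -overwrite nn1.example.com nn2.example.com /data/sales part-0000
```

Printing instead of executing is a deliberate choice here: it is easy to check that the option and both URIs are in the right order before anything is copied.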
- What Do the Update and Overwrite Options Do
Update :- If the update option is used in the DistCp command, it compares each file at the source with the file of the same name at the destination, checking the file size and the contents (via checksum). Files that are missing at the destination or that differ are copied; files that match are skipped.
Overwrite :- If the overwrite option is used in the DistCp command, files that already exist at the destination are overwritten with the source files, even when the file sizes are the same.
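The per-file decision can be summarised with a small model. This is a deliberately simplified sketch based on my reading of the Hadoop DistCp guide, not DistCp's actual code: `copy_action` is a hypothetical function, the mode names `default`, `update` and `overwrite` are labels chosen for this example, and only file sizes are compared, whereas real DistCp with -update also looks at checksums.

```shell
#!/bin/sh
# copy_action: print "copy" or "skip" for one file (simplified model).
# Usage: copy_action <mode> <dst_exists: yes|no> <src_size> <dst_size>
# mode is one of: default, update, overwrite
copy_action() {
    mode=$1; dst_exists=$2; src_size=$3; dst_size=$4
    # A file that is missing at the destination is always copied
    if [ "$dst_exists" = "no" ]; then
        echo copy
        return
    fi
    case $mode in
        overwrite)
            # -overwrite replaces the destination file unconditionally
            echo copy ;;
        update)
            # -update copies only when the files differ (size only in this model)
            if [ "$src_size" -ne "$dst_size" ]; then echo copy; else echo skip; fi ;;
        *)
            # No option: an existing destination file is left alone
            echo skip ;;
    esac
}

copy_action default   yes 100 200   # prints "skip"
copy_action update    yes 100 200   # prints "copy"
copy_action update    yes 100 100   # prints "skip"
copy_action overwrite yes 100 100   # prints "copy"
```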
