DistCp

In this post, we will look at the DistCp command in Hadoop and its main aspects.

  • What is DistCp
    DistCp (Distributed Copy) is a tool for copying large data sets within a cluster (intra-cluster) or between clusters (inter-cluster). It uses MapReduce for distribution, error handling, recovery, and reporting. It expands the list of files and directories into input for map tasks, each of which copies a partition of the files from the source to the destination.
  • How DistCp Works
    By default, DistCp skips any file that already exists at the destination. With the -update option, it compares the file size at the source and destination: if the sizes differ, the file is copied; if they are the same, the file is skipped.
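  • Why Prefer DistCp over cp, get and put Commands
    DistCp is the fastest way to copy large amounts of data compared with the get/put commands, because it spreads the copy across map tasks that run in parallel. Also, get and put only move files between the local file system and HDFS, so they cannot copy data directly from one cluster to another.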
  • Where We Can Run DistCp Command
    We can run the distcp command on both secure (Kerberos-enabled) and insecure clusters. To run DistCp on a secure cluster, we first need to run the kinit command, which obtains and caches a Kerberos ticket for the user.
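
    The ticket check can be scripted so that kinit only runs when it is needed. The sketch below assumes the MIT Kerberos client tools (klist, kinit) are on the PATH; the helper name ensure_ticket and the principal are hypothetical.

```shell
# Sketch: make sure a valid Kerberos ticket exists before running distcp.
# ensure_ticket is a hypothetical helper name; the principal is a placeholder.
ensure_ticket() {
    local principal="$1"
    # `klist -s` prints nothing and returns non-zero when no valid ticket is cached.
    if klist -s; then
        echo "ticket present"
    else
        # kinit prompts for the password of the given principal.
        kinit "$principal" && echo "ticket acquired"
    fi
}
```

    A typical call would be `ensure_ticket user@EXAMPLE.COM` on the line before the hadoop distcp command.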
  • How To Run DistCp Command
    There are different ways to run DistCp, depending on whether the source and destination clusters run the same version.
    Inter-Cluster (same Hadoop version) :-
    1) Without Any Option

             hadoop distcp hdfs://<Source_Name_Node>:8020/<File_Path> hdfs://<Destination_Name_Node>:8020/<File_Path>

    2) Update Option

             hadoop distcp -update hdfs://<Source_Name_Node>:8020/<File_Path> hdfs://<Destination_Name_Node>:8020/<File_Path>/

    3) Overwrite Option

             hadoop distcp -overwrite hdfs://<Source_Name_Node>:8020/<File_Path> hdfs://<Destination_Name_Node>:8020/<File_Path>/

Source_Name_Node: – Address of the name node of the cluster from which data is copied.
Destination_Name_Node: – Address of the name node of the cluster to which data is copied.
File_Path: – Path where the file resides on the source cluster, or the path to which it should be copied on the destination cluster.
File_Name: – Name of the file to be copied.

We need to pass the full address of the name node. In the case of HA clusters where a nameservice is configured, using the nameservice in place of the name node address makes distcp fail with an error unless that nameservice is also defined in the configuration of the cluster running the command.
In method one (without any option), we do not need to include the file name in the destination path; if we do, DistCp creates a subfolder with that file name and copies all the partitioned files into it.
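
The command layout above can be assembled by a small helper, which makes it harder to drop the space between the source and destination URIs. This is only a sketch: build_distcp_cmd is a hypothetical name, the host names and paths are placeholders, and the helper prints the command rather than running it.

```shell
# Sketch: assemble a distcp command from name node addresses and paths.
# build_distcp_cmd is a hypothetical helper; hosts and paths are placeholders.
# Paths are expected to be absolute (to start with a single "/").
build_distcp_cmd() {
    local option="$1" src_nn="$2" dst_nn="$3" src_path="$4" dst_path="$5"
    local cmd="hadoop distcp"
    # option may be empty (plain copy), -update, or -overwrite.
    if [ -n "$option" ]; then
        cmd="$cmd $option"
    fi
    echo "$cmd hdfs://${src_nn}:8020${src_path} hdfs://${dst_nn}:8020${dst_path}"
}

# Print the command instead of running it (dry run).
build_distcp_cmd "-update" nn1.example.com nn2.example.com /data/sales /data/sales
```

Printing the command first gives a chance to check the destination path before any data is copied.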

  • What Do the Update and Overwrite Options Do

Update: – If the update option is used in the DistCp command, it compares the file name and file size (and, unless the check is skipped, the checksum) at the source and destination. It copies only the files that are missing or different at the destination and skips the rest.

Overwrite: – If the overwrite option is used in the DistCp command, files that already exist at the destination are replaced by the source files, even when their sizes match.
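
The per-file decision can be modelled in a few lines. The sketch below follows the behaviour described in the Apache DistCp guide (existing files skipped by default, compared by size under -update, replaced unconditionally under -overwrite). It is a simplified model and not DistCp's own code: decide_action is a hypothetical name, and the checksum comparison that -update also performs is left out.

```shell
# Simplified model of the per-file decision; not DistCp's actual code.
# decide_action is a hypothetical name. Sizes are in bytes; pass an empty
# dst_size when the file is missing at the destination.
decide_action() {
    local mode="$1" src_size="$2" dst_size="$3"
    if [ -z "$dst_size" ]; then
        echo "copy"                      # missing at destination: always copied
        return
    fi
    case "$mode" in
        overwrite)
            echo "copy"                  # existing file is replaced unconditionally
            ;;
        update)
            if [ "$src_size" -ne "$dst_size" ]; then
                echo "copy"              # sizes differ
            else
                echo "skip"              # treated as unchanged
            fi
            ;;
        *)
            echo "skip"                  # default: existing files are skipped
            ;;
    esac
}

decide_action update 1024 512
decide_action default 1024 512
```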

  • How To Run DistCp Between Different Cluster Versions

Here the source and destination clusters run different Hadoop versions. Consider Cloudera: if data needs to be moved from one CDH version (e.g. CDH4) to another (e.g. CDH5), we need to use the HFTP protocol.

The command must be used in the manner below.

                hadoop distcp hftp://cdh4-<namenode>:50070/ hdfs://CDH5-<namenode>/

                hadoop distcp hftp://cdh4-<namenode>:50070/<File_Path> hdfs://CDH5-<namenode>/<File_Path>

HFTP is a read-only file system that reads data from a remote HDFS cluster over HTTP; despite the name, it is not related to FTP. When copying data with distcp across different versions of CDH, use hftp:// for the source file system and hdfs:// for the destination file system, and run distcp from the destination cluster. The default port for HFTP is 50070 and the default port for HDFS is 8020.

Because HFTP is read-only, it can be used for the source cluster but not for the destination cluster. HFTP cannot be used to copy data from an insecure cluster to a secure cluster.

One disadvantage of DistCp is that it has no option to merge data. The three methods above either copy what is missing at the destination or overwrite the existing data.

Newer versions of distcp add an -append option, which can be used together with -update. It appends new data to the existing destination files where possible, but it still works as part of the update operation.
To skip the checksum comparison between source and destination files, the -skipcrccheck option can be used with hadoop distcp.

       hadoop distcp -Ddfs.checksum.type=CRC32 -skipcrccheck -update hdfs://<Source_Name_Node>:8020/<File_Path> hdfs://<Destination_Name_Node>:8020/<File_Path>

  • Limitations of the Hadoop DistCp Command

When copying data from multiple sources, the DistCp command fails with an error if two sources collide. Collisions at the destination, however, are resolved according to the options specified. By default, files that already exist at the destination are skipped.
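
A source list can be checked for likely collisions before the job is submitted. The sketch below treats two sources as colliding when their last path component is the same, which is a simplifying assumption and not DistCp's exact rule; find_collisions is a hypothetical helper name.

```shell
# Sketch: report last path components that occur more than once in a source list.
# find_collisions is a hypothetical helper; it reads one source URI per line on stdin.
find_collisions() {
    # Strip trailing slashes, keep only the last path component,
    # then print each name that appears more than once.
    sed 's:/*$::; s:.*/::' | sort | uniq -d
}

printf '%s\n' \
    hdfs://nn1:8020/a/logs \
    hdfs://nn1:8020/b/logs \
    hdfs://nn1:8020/c/data | find_collisions
```

Any name printed here points at sources that would map to the same destination path and is worth resolving before starting the copy.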

  • Side Effects of DistCp When a Map Fails

Unless -i is specified, the logs generated by that task attempt will be replaced by the previous attempt.
Unless -overwrite is specified, files successfully copied by a previous map on a re-execution will be marked as “skipped”.
If a map fails mapred.map.max.attempts times, the remaining map tasks will be killed (unless -i is set).
If mapred.speculative.execution is set final and true, the result of the copy is undefined.

  • The different options that can be used with the DistCp command are as follows:

Option | Description | Comments
-p[rbugp] | Preserve status (r: replication number, b: block size, u: user, g: group, p: permission) | Modification time is not preserved. With -update, status is not synchronized unless the file sizes also differ.
-i | Ignore failures | Keeps the logs from failed copies, which helps with debugging. A failing map does not cause the job to fail before all the splits have been attempted.
-log <logpath> | Write logs to <logpath> | DistCp keeps a log of each file it attempts to copy as map output. If a map fails, its log output is not retained when the map is re-executed.
-m <num_maps> | Maximum number of simultaneous copies | Specifies the number of maps used to copy data. More maps do not necessarily improve throughput.
-overwrite | Overwrite destination | If -i is not specified and a map fails, all the files in the split, not only those that failed, are recopied. It also changes the semantics for generating destination paths, so use it carefully. Include the file name in the destination path when overwriting; otherwise the split files are created outside the intended location.
-update | Overwrite if source and destination sizes differ | Copies files that are missing at the destination or whose size differs from the source. As with -overwrite, include the file name in the destination path; otherwise the split files are created outside the intended location.
-f <urilist> | Use the list at <urilist> as the source list | This is equivalent to listing each source on the command line.
-filelimit <n> | Limit the total number of files to <= n |
-sizelimit <n> | Limit the total size to <= n bytes |
-delete | Delete files at the destination that do not exist at the source | Only the destination is affected; the source files are not deleted. Deleted files can be recovered from the trash, if it is enabled.
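
Several of these options are usually combined in one call. The sketch below shows one possible combination; the host names, paths, log directory, and map count are placeholder values, and the script prints the command unless RUN=1 is set.

```shell
# Sketch: combine several options from the table; all values are placeholders.
SRC="hdfs://nn1.example.com:8020/data/events"
DST="hdfs://nn2.example.com:8020/data/events"
LOG_DIR="/tmp/distcp-logs"   # hypothetical log directory
MAPS=20                      # upper bound on simultaneous copies

# -pugp preserves user, group and permission; -i keeps going after failures;
# -log keeps per-file logs; -m caps the number of maps.
CMD="hadoop distcp -update -pugp -i -log ${LOG_DIR} -m ${MAPS} ${SRC} ${DST}"

# Dry run by default: print the command. Set RUN=1 to execute it.
if [ "${RUN:-0}" = "1" ]; then
    $CMD
else
    echo "$CMD"
fi
```

Starting with a modest -m value and raising it only if the copy is too slow is a reasonable approach, since more maps do not always mean more throughput.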

 

 

  • Summary:-

To sum up, the Hadoop DistCp command is a powerful tool for copying data from one HDFS location to another, either between clusters or within the same cluster. DistCp can also copy data from one version of Cloudera CDH to another (e.g. CDH4 to CDH5). Three types of copy can be performed: direct, update, or overwrite. The direct method is used to copy data that is not yet available at the destination. Update and overwrite can be used if data is already present at the destination. Update checks the file size at the source and destination and copies the files that are missing or differ at the destination. Overwrite replaces the files at the destination with the files from the source.

When copying between different versions over HFTP, the distcp command must be run from the destination cluster.
