Tarballs - Compress and De-compress

Tar file archiver and compression utility

In computing, tar (derived from tape archive) is both a file format (in the form of a type of archive bitstream) and the name of the program used to handle such files. Initially developed as a raw format, used for tape backup and other sequential access devices for backup purposes, it is now commonly used to collate collections of files into one larger file, for distribution or archiving, while preserving file system information such as user and group permissions, dates, and directory structures.

Whilst Ubuntu and other distros have their own means of installing suitable software, from time to time you will need to install application software obtained directly from the developer, and such software is often distributed as a tar archive. Tar is simply another packaging and delivery system, albeit a very old one; this archiver, or one very much like it, was used on Unix systems to back up files, predominantly user data, to tape. Such archives were always initiated from the command line or from within scripts and cron jobs. An example of using tar with a backup device attached is displayed below:

Creating a file archive to tape

#> cd /home/user/andrew
#> tar cvf /dev/st0 --label="backup-tape1" *

tar's basic syntax is this

tar option(s) archive_name file_name(s)

Looking at the tape command above, tar is the application name, followed in this case by three options: c (create), v (verbose) and f (file). The f option tells tar that the next argument names the archive to write, here the SCSI tape streamer 0 (/dev/st0). The * is expanded by the shell to every file and folder in the current directory (hidden names beginning with a dot are not matched), and tar archives each folder recursively. This particular method applies no compression, although tar is capable of it; without compression the archive is created faster because no processing power is spent on compressing. Note also that the three options after the word tar are not preceded by a -. This was common practice on Unix systems, and GNU tar on Linux accepts the options with or without the dash. Such options, or switches as they are more commonly known, act as flags that turn on particular functions of the application; there are many such switches, and we will explore some of the more common ones later.

#> tar cpf /dev/st0 --label=" full-backup created on `date '+%d-%B-%Y'`." --directory /home/andrew .

Commonly used Switches and options used above

* c = create (Create archive)
* p = permissions (Retain permissions)
* f = file (The next argument names the archive file or device to use)
* v = verbose (Display progress to Standard Out (Screen))
* --label="Tape Name"
* --directory="path" (tar changes to this directory before archiving the names that follow; the same as -C)
* `date '+%d-%B-%Y'` (This is command substitution: the shell runs date before tar starts and inserts the current date, in the specified format, into the label. The command must be enclosed in backticks, found at the top left of most keyboards, not in ordinary single quotes)
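To try the labelled backup command without a tape drive, the archive can be written to an ordinary file instead of /dev/st0. The following is a minimal sketch assuming GNU tar (the --label option is a GNU extension). The scratch directory, the file names and the $(...) form of command substitution, used here in place of backticks, are illustrative choices and not part of the original command.

```shell
# Work in a scratch directory so nothing real is touched (hypothetical paths).
work=$(mktemp -d)
mkdir -p "$work/home/andrew"
echo "some user data" > "$work/home/andrew/notes.txt"

# c = create, p = keep permissions, f = write to the named archive file.
# $(date ...) is run by the shell first; its output becomes part of the label.
tar cpf "$work/backup.tar" \
    --label="full-backup created on $(date '+%d-%B-%Y')" \
    --directory "$work/home/andrew" .

# t = list the contents; the label is listed as the first entry.
listing=$(tar tf "$work/backup.tar")
echo "$listing"

rm -r "$work"
```

Note that the final full stop names what to archive: everything in the directory tar has just changed to.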

It is nowadays most unlikely that you will use an external SCSI tape drive, but if you do, at least you have the commands to make really slow backups directly. More common is the need to expand data or applications from a developer's tarball, and there are several ways to expand the data contained within.

We will first discuss how these tarballs are distributed. Like most programs, tarballs can be downloaded directly from the internet, usually in source code form unless otherwise stated. We stress that this article covers only the various methods of extracting source code or data from downloadable archives; it does not cover compiling any such source code.

Tarballs are found across the internet with a variety of extensions, and it is common for the file name to include the compression extension if compression was used.

You should be aware of these things:

Compressed and encoded files are usually identified by their extensions, the characters following the last dot in the file name. Depending on how they were compressed, some files won't decompress or decode if the extension is removed. Here are some common extensions for compressed and encoded files:

Compressed files

* filename.tar.Z (denotes a file both tarred and compressed)
* filename.Z (denotes a compressed file)
* filename.tar (denotes a tarred file)
* filename.tar.gz (denotes a file both tarred and GNU compressed)
* filename.gz (denotes a GNU compressed file)
* filename.tgz (abbreviated form of tarred and GNU compressed)
* filename.tar.bz (denotes a file compressed with bzip)
* filename.tar.bz2 (denotes a file compressed with bzip2)
* filename.bz (denotes a file compressed with bzip)
* filename.bz2 (denotes a file compressed with bzip2)
* filename.z (denotes a GNU compressed file -- note lowercase 'z')
* filename.zip (denotes a zipped archive)

Some files can also be obtained as encoded files

* filename.uu (a uuencoded file)
* filename.hqx (a binhexed file)

As you can see, there is an enormous number of file formats available, and we have included only some of the most common, so we will concentrate on a small number of them.

We will concentrate for now on files with the following extensions: .tar, .gz, .bz, .tar.bz, .bz2 and .tar.bz2. With these we can compress and decompress files, so let's look at decompression first.

#> tar xvf filename.tar

The above command, as is, will extract all files and folders into the current directory. Sometimes you will see it written in the following manner.

#> tar xvf filename.tar .

It is not very obvious, but any argument after the archive name is the name of a member to extract, not a destination. The full stop therefore asks tar to extract the member called "." and everything beneath it, which only matches archives whose paths were stored with a leading "./"; with other archives tar reports that the member was not found. This is why the form is seldom used and the shorter command is preferred: tar always extracts into the current directory unless told otherwise. What is required is write permission on the directory being written to, and this is where many novices go wrong, trying to expand files in directories to which they have no write access. As an ordinary logged-in user, you generally have write permission only within your own home directory.
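Because trailing arguments name members of the archive, it helps to list the contents with the t option before extracting. A minimal sketch, using made-up file names in a scratch directory, that lists an archive and then extracts a single member from it:

```shell
work=$(mktemp -d)
cd "$work"
mkdir project
echo "read me" > project/README
echo "licence" > project/LICENCE
tar cf project.tar project

# t lists member names exactly as they are stored in the archive.
tar tf project.tar

# Extract only one member into a fresh directory; the name must match the listing.
mkdir out
cd out
tar xf ../project.tar project/README
extracted=$(find . -type f)
echo "$extracted"

cd /
rm -r "$work"
```

Only project/README is written; project/LICENCE stays inside the archive.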

Generally this is not enough, as quite often we need to place applications in parts of the system to which a normal user has no write access. We can overcome this provided we are allowed to assume the role of superuser with sudo. If you have this access you can type the following.

#> sudo tar xvf filename.tar

This works provided we have first changed the current directory to the one where the files need to be located. Using this method, however, we must first copy filename.tar to the directory we want to expand it in.

#> sudo cp filename.tar /path/new_directory

There is a better way to place the files while keeping the tar file in your own home directory. We assume you know how to locate, and place yourself in, the directory in which the archive resides.

#> sudo tar xvf filename.tar -C /path/new_directory

You will need to use '''"sudo"''' any time you wish to write files outside your own home directory. There is no other acceptable solution; you should never log in as user root to write files elsewhere on the file system. In the above command you will notice a single -C. This option is rather clever: it tells tar to change to the directory you specify before extracting, so the files and any subdirectories stored in the tar file are written beneath that path. You do not need to move the original tarball to the directory you wish to expand the files in, so your tarballs can all stay in one organised folder.
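The effect of -C can be tried safely without sudo by pointing it at a scratch directory. This sketch assumes nothing beyond tar itself; the directory and file names are invented for the illustration.

```shell
work=$(mktemp -d)
mkdir "$work/downloads" "$work/target"
mkdir "$work/downloads/app"
echo "hello" > "$work/downloads/app/hello.txt"

# Build a small tarball; -C makes tar change into downloads first,
# so the member is stored as app/... and not with its full path.
tar cf "$work/downloads/app.tar" -C "$work/downloads" app

# Extract into target without moving the tarball or changing directory.
tar xf "$work/downloads/app.tar" -C "$work/target"

content=$(cat "$work/target/app/hello.txt")
still_there=$(ls "$work/downloads")
echo "$content"
echo "$still_there"
rm -r "$work"
```

The tarball remains in downloads while its contents appear under target, which is the behaviour described above.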

Compressed Archives

Now we need to look at how to extract files from a compressed archive, as the likelihood of finding a tarball that is not compressed is quite small. These tarballs will have extensions such as bz, bz2, gz, tar.gz and tgz. To decompress and extract files from such an archive we need to add some new switches to our command line. Fortunately tar has evolved over the years to handle compression itself, which makes it an invaluable tool in the toolbox.

Perhaps we should look at a real-world example, one that will cause little harm to your desktop computer but should prove the point.

Download the Content Management System archive named below from [http://drupal.org Drupal] themselves. The tarball contains for the most part "php" and "config" files, none of which will cause any harm to your system; however, these files will be written into the system area.

Drupal-6.8

I assume you will download this file to your Desktop; if you do not, you will need to replace the path we have specified with the path to which the file was downloaded.

Create a directory called www in /var and, inside it, a directory called localhost, using the commands below; this is the recommended area in which drupal gets installed. Drupal does require other applications, but for the purpose of this demonstration these are not needed.

Open a new console and type the following commands

#> sudo mkdir /var/www
#> sudo mkdir /var/www/localhost

We need to use sudo here as the directories we wish to write to are parts of the system to which we normally do not have write access. Drupal is a Content Management System and this archive is compressed using gzip (.gz). You could use gunzip drupal-6.8.tar.gz to decompress the archive first, which leaves drupal-6.8.tar to be extracted with tar in a second step, or you can let tar do both in one go with the commands below.

#> cd ~/Desktop
#> tar xvf drupal-6.8.tar.gz

The first command changes the current directory to the user's own Desktop folder. The tilde (~) that precedes it is shorthand for the logged-in user's home directory, wherever in the file system you happen to be. The shell expands ~ to your home directory before running the command, so it works with cd, mv, cp and other file-manipulation commands. The second command expands the drupal archive into your Desktop folder, from where the extracted folder must be moved to its correct position. Remember we are still in the Desktop folder, so the command below will only work from there.

#> sudo mv drupal-6.8 /var/www/localhost

This command moves the folder called drupal-6.8 to the localhost directory. We use sudo again because writing into that directory requires superuser rights.
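The two approaches mentioned earlier, decompressing with gzip first or letting tar handle the compression, produce the same result. A small sketch with a made-up archive (site.tar.gz is a stand-in, not the Drupal download) shows both side by side:

```shell
work=$(mktemp -d)
cd "$work"
mkdir site
echo "<?php echo 'hi'; ?>" > site/index.php
tar czf site.tar.gz site
rm -r site

# Route 1: gzip decompresses to standard output and tar reads it from "-".
mkdir two_step
gzip -dc site.tar.gz | tar xf - -C two_step

# Route 2: tar's z switch runs the decompression itself.
mkdir one_step
tar xzf site.tar.gz -C one_step

# diff -r prints nothing and succeeds when the two trees are identical.
diff -r two_step one_step && same=yes
echo "identical: ${same:-no}"

cd /
rm -r "$work"
```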

Alternate Method

There is another and, in my view, better way. Suppose that, instead of following the method above, drupal-6.8.tar.gz still sits on the Desktop and /var/www/localhost is empty. We can then issue the command displayed below.

#> sudo tar xvf drupal-6.8.tar.gz -C /var/www/localhost

Try the command above for yourself and, when finished, use one of the commands below, depending on how completely you wish to erase the files and folders created.

#> sudo rm -r /var/www

This command will recursively remove /var/www and everything beneath it, so use it only if /var/www did not exist before you began. The command below deletes only the files and folders created from drupal-6.8.

#> sudo rm -r /var/www/localhost/drupal-6.8

Other Compression Archives

Bzip and Bzip2 are other popular compressions used with tar, so how do we cope with archives carrying the .bz and .bz2 extensions? This requires only a slight modification to the switches when compressing or extracting. Before we do, note that the z switch selects gzip, not bzip: it must be given when creating a .tar.gz, and it may be given explicitly when extracting one, although GNU tar will recognise the compression by itself, which is why the plain xvf commands above worked.

#> sudo tar xzvf drupal-6.8.tar.gz -C /var/www/localhost

Example of a Bzip2, download this file to your home directory

wxCommunicator

Bzip2 files will need a different switch which is denoted using the j option thus:

#> tar xvjf wxCommunicator-1.0.5-src.tar.bz2 -C .

This application is for personal use and does not usually get installed in the system area, so sudo is not required. Please read the documentation for archives such as this before installing into the system area, even when you know what you are doing. We have specified -C . here, but strictly speaking this is not necessary when extracting to the current directory.

You can at any time use the file command to determine whether and how a file is compressed; this makes it a lot easier to use the correct switches when decompressing an archive, especially one from an unfamiliar source.
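The file command works by inspecting the first bytes of a file instead of trusting its extension. The sketch below mimics a small part of that check by hand with od, purely to illustrate the idea; the function name kind_of is made up, and in practice running file on the archive is simpler and recognises far more formats.

```shell
# Report the compression of a file from its leading "magic" bytes.
kind_of() {
    magic=$(od -An -tx1 -N3 "$1" | tr -d ' \n')
    case "$magic" in
        1f8b*)  echo gzip ;;     # gzip streams start with the bytes 1f 8b
        425a68) echo bzip2 ;;    # bzip2 streams start with the text "BZh"
        *)      echo unknown ;;  # anything else, including an uncompressed tar
    esac
}

work=$(mktemp -d)
echo "sample text" > "$work/sample.txt"
tar cf  "$work/plain.tar"  -C "$work" sample.txt
tar czf "$work/packed.tgz" -C "$work" sample.txt

plain=$(kind_of "$work/plain.tar")
packed=$(kind_of "$work/packed.tgz")
echo "plain.tar: $plain, packed.tgz: $packed"
rm -r "$work"
```

The result depends only on the file contents, so a wrongly named archive is still identified correctly.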

Creating Compressed Archives

I firmly believe the only way to become more confident with Linux is to delve into the secrets of its command line tools. This is where the true power of Linux lies, not in the various GUIs (Graphical User Interfaces) located on the desktop. With rare exceptions, almost everything you can achieve via a GUI you can also direct from the command line. However, we are digressing somewhat: tar not only opens a variety of archives, it can also create them. We took a very cursory peek at creating a tar file at the beginning of this article. Let's look at some real data examples of the kind we are more likely to use to transport files from one machine to another.

First of all we need some data, and here you can choose your own, which can be mp3s, jpgs or anything else that takes your fancy. Create a new directory on your Desktop and put your data into it. You can get this data from a variety of sources; Google comes prominently to mind, but you may have other ideas. The point here is to obtain a large number of files and look at how much space they use up. Then we use tar with a variety of options, well, three anyway, to see whether tar or one of its options will help us achieve a smaller compressed file.

The best file to compress is one with a lot of redundancy (repeated patterns) in it. Files such as mp3s, mpgs and jpgs contain very little, because they are already heavily compressed. However, this is something you can experiment with at your leisure, with a wide variety of files, once you know how. An example of a file that compresses well is a plain text file, typically one with a .txt extension.
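The difference is easy to measure. This sketch compresses two files of the same size with gzip: one of highly repetitive text, and one of random bytes standing in for already-compressed data such as a jpg. The file names and sizes are arbitrary choices for the illustration.

```shell
work=$(mktemp -d)

# 100000 bytes of repetitive text, and 100000 bytes of random data.
yes "the quick brown fox jumps over the lazy dog" | head -c 100000 > "$work/text.txt"
head -c 100000 /dev/urandom > "$work/random.bin"

# -c writes the compressed stream to standard output, leaving the original alone.
gzip -c "$work/text.txt"   > "$work/text.txt.gz"
gzip -c "$work/random.bin" > "$work/random.bin.gz"

# wc -c counts bytes; reading from standard input avoids printing the file name.
text_gz=$(wc -c < "$work/text.txt.gz")
random_gz=$(wc -c < "$work/random.bin.gz")
echo "text:   100000 -> $text_gz bytes"
echo "random: 100000 -> $random_gz bytes"
rm -r "$work"
```

The text file shrinks to a small fraction of its size, while the random file stays at roughly its original size.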

#> cd Desktop
#> mkdir new_archive

We aim to archive files in the newly created folder, but before we do that it would be good to find out just how much space these files occupy. This gives us a baseline against which to measure how well or how poorly tar has compressed our files. We have a folder sitting on our Desktop called "enterfest", whose size is as yet unknown. Its contents are for the most part irrelevant, as we can accomplish this packaging with any files; however, it will help your assessment of the compression to know that it contains a mixture of documents and images. We now need to know how much space these files occupy. ls will list the contents of the folder but gives no indication of individual or collective file size. Adding the options "al" (a for all, l for long format), i.e. ls -al, lists all files and folders with their details, including each size as a raw count of bytes, which is not easy to read at a glance.

drwx------ 4 delboy users 4096 2008-12-30 17:39 .
drwxr-xr-x 4 delboy users 4096 2009-01-16 06:58 ..
-rwx------ 1 delboy users 2967 2007-03-29 14:44 annettesemail.txt
-rwx------ 1 delboy users 32256 2007-04-24 15:47 BookingForm-Thompson.doc
-rwx------ 1 delboy users 229617 2007-04-11 17:55 Dome-print-1.pdf
drwx------ 2 delboy users 4096 2008-12-30 17:39 dome scans
-rwx------ 1 delboy users 790604 2007-04-03 22:27 DSCF0221.JPG.htm
-rwx------ 1 delboy users 390718 2007-04-20 11:59 Enter_FestivalMap.jpg
-rwx------ 1 delboy users 97792 2007-04-20 20:09 enter_invite1.doc
-rwx------ 1 delboy users 57856 2007-04-23 10:38 enter_invite2.doc
-rwx------ 1 delboy users 34304 2007-03-05 14:32 ENTER_UT_BLURB_050407.doc
-rwx------ 1 delboy users 35328 2007-03-05 11:32 Enter_UTConfTimeli_1D9980-1.doc
-rwx------ 1 delboy users 100333 2007-04-20 12:00 Enter_UTProgramme25April2007.jpg
-rwx------ 1 delboy users 211623 2007-04-20 12:00 Enter_UTProgramme26April2007.jpg
-rwx------ 1 delboy users 217953 2007-04-20 12:01 Enter_UTProgramme27April2007.jpg
-rwx------ 1 delboy users 228348 2007-04-20 12:01 Enter_UTProgramme28April2007.jpg
-rwx------ 1 delboy users 200121 2007-04-20 12:01 Enter_UTProgramme29April2007.jpg
-rwx------ 1 delboy users 56832 2007-04-02 11:45 Enter_UTprogrammeoverviewperday250307.xls
-rwx------ 1 delboy users 52736 2007-04-03 22:35 Gordons_RiskAssessment,MediaShed,CambridgeFestival.doc

We can address this by adding an "h", for 'human readable', to the existing options (ls -alh), which presents us with output that looks like this

drwx------ 4 delboy users 4.0K 2008-12-30 17:39 .
drwxr-xr-x 4 delboy users 4.0K 2009-01-16 06:58 ..
-rwx------ 1 delboy users 2.9K 2007-03-29 14:44 annettesemail.txt
-rwx------ 1 delboy users 32K 2007-04-24 15:47 BookingForm-Thompson.doc
-rwx------ 1 delboy users 225K 2007-04-11 17:55 Dome-print-1.pdf
drwx------ 2 delboy users 4.0K 2008-12-30 17:39 dome scans
-rwx------ 1 delboy users 773K 2007-04-03 22:27 DSCF0221.JPG.htm
-rwx------ 1 delboy users 382K 2007-04-20 11:59 Enter_FestivalMap.jpg
-rwx------ 1 delboy users 96K 2007-04-20 20:09 enter_invite1.doc
-rwx------ 1 delboy users 57K 2007-04-23 10:38 enter_invite2.doc
-rwx------ 1 delboy users 34K 2007-03-05 14:32 ENTER_UT_BLURB_050407.doc
-rwx------ 1 delboy users 35K 2007-03-05 11:32 Enter_UTConfTimeli_1D9980-1.doc
-rwx------ 1 delboy users 98K 2007-04-20 12:00 Enter_UTProgramme25April2007.jpg
-rwx------ 1 delboy users 207K 2007-04-20 12:00 Enter_UTProgramme26April2007.jpg
-rwx------ 1 delboy users 213K 2007-04-20 12:01 Enter_UTProgramme27April2007.jpg
-rwx------ 1 delboy users 223K 2007-04-20 12:01 Enter_UTProgramme28April2007.jpg
-rwx------ 1 delboy users 196K 2007-04-20 12:01 Enter_UTProgramme29April2007.jpg
-rwx------ 1 delboy users 56K 2007-04-02 11:45 Enter_UTprogrammeoverviewperday250307.xls
-rwx------ 1 delboy users 52K 2007-04-03 22:35 Gordons_RiskAssessment,MediaShed,CambridgeFestival.doc

Better, but although useful this would still require some maths on our part to add the sizes together, and that would take time. A better way is to use a command that tells us directly how much space these files occupy. Linux has such a command, called du (disk usage).

delboy@guest-laptop ~/Desktop/enterfest $ du
192 ./internet specs
9764 ./dome scans
13056 .
delboy@guest-laptop ~/Desktop/enterfest $

Again we are presented with numbers that are hard to read (on Linux, du reports sizes in units of 1024 bytes by default), so let's use the -h option with this command too, to get a report of the space used in human-readable units.

delboy@guest-laptop ~/Desktop/enterfest $ du -h
192K ./internet specs
9.6M ./dome scans
13M .
delboy@guest-laptop ~/Desktop/enterfest $

So what does this tell us? The current directory, 'enterfest', holds two other directories, and in total the contents of the current directory, including the two subdirectories, amount to thirteen megabytes of data. Any compressed tar file created from these contents needs to be smaller than 13M.
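When only the grand total matters, du can be asked to summarise. A short sketch, assuming GNU du as found on Linux; the directory created here is a stand-in for enterfest, filled with random bytes.

```shell
work=$(mktemp -d)
mkdir -p "$work/enterfest/dome scans"
head -c 200000 /dev/urandom > "$work/enterfest/dome scans/scan1.img"
head -c 50000  /dev/urandom > "$work/enterfest/notes.txt"

# -s prints one summary line for the directory instead of one per subdirectory;
# -h prints the size in human-readable units (K, M, G).
total=$(du -sh "$work/enterfest")
echo "$total"

# -k forces units of 1024 bytes, handy when a script needs a plain number.
kilobytes=$(du -sk "$work/enterfest" | cut -f1)
echo "$kilobytes"
rm -r "$work"
```

The exact figures vary a little between file systems, because du reports the disk space allocated and not the sum of the file lengths.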

When creating the new 'tar' file we may as well include the directory the files are contained in. To do this we first move up one level using 'cd' thus, remembering we are currently in the 'enterfest' directory.

#> cd ..

This moves us back to the 'Desktop' from 'enterfest'. Although we could specify the actual path, we need only use two dots (two full stops) with the command 'cd'; there is no similar shorthand to move forward one directory. We can now use 'tar' with three different sets of options and measure the resultant file size.

#> tar cvf enterfest.tar enterfest

delboy@guest-laptop ~/Desktop $ ls -alh enterfest.tar
-rw-r--r-- 1 delboy users 13M 2009-01-16 07:52 enterfest.tar
delboy@guest-laptop ~/Desktop $

As we can see, this first 'tar' has packaged the files nicely into one file, but it is still 13M, so there is no improvement in size.

#> tar czvf enterfest.tar.gz enterfest

delboy@guest-laptop ~/Desktop $ ls -alh enterfest.tar.gz
-rw-r--r-- 1 delboy users 3.7M 2009-01-16 07:56 enterfest.tar.gz
delboy@guest-laptop ~/Desktop $

Here we can see a vast improvement in file size: using the z option has resulted in extremely good compression, down to only 3.7M for the same data selection.

#> tar cjvf enterfest.tar.bz2 enterfest

delboy@guest-laptop ~/Desktop $ ls -alh enterfest.tar.bz2
-rw-r--r-- 1 delboy users 3.4M 2009-01-16 07:59 enterfest.tar.bz2
delboy@guest-laptop ~/Desktop $

Again compression has triumphed: with the j option the same data produces a file of 3.4M, as opposed to 3.7M with the 'z' option.
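The same comparison can be scripted so that it is repeatable on any folder. This is a rough sketch using generated text as sample data, so the ratios will be far better than with the mixed documents and images above; the folder name sample is hypothetical, and the bzip2 step is skipped if bzip2 is not installed.

```shell
work=$(mktemp -d)
cd "$work"
mkdir sample
# Generate a few compressible text files as stand-in data.
for n in 1 2 3; do
    seq 1 20000 > "sample/numbers$n.txt"
done

tar cf  sample.tar    sample          # no compression
tar czf sample.tar.gz sample          # z = gzip
if command -v bzip2 >/dev/null 2>&1; then
    tar cjf sample.tar.bz2 sample     # j = bzip2
fi

# Print the size in bytes of each archive that was created.
for f in sample.tar sample.tar.gz sample.tar.bz2; do
    if [ -f "$f" ]; then
        echo "$f: $(wc -c < "$f") bytes"
    fi
done

plain_size=$(wc -c < sample.tar)
gz_size=$(wc -c < sample.tar.gz)
cd /
rm -r "$work"
```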

.tar (just to pack files)

 Pack $ tar -cvf folder.tar folder/
 Unpack $ tar -xvf file.tar
 See content (without extracting) $ tar -tvf file.tar

.tar.gz - .tar.z - .tgz (tar with gzip)

 Pack and compress $ tar -czvf files.tar.gz folder/
 Extract $ tar -xzvf file.tar.gz
 See content (without extracting) $ tar -tzvf file.tar.gz

.gz (gzip)*

 Compress $ gzip -q file
 Extract $ gzip -d file.gz

.bz2 (bzip2)**

 Compress $ bzip2 file
 Extract $ bzip2 -d file.bz2
 Extract $ bunzip2 file.bz2

* gzip is faster but gives less compression

** bzip2 is slower but gives more compression

.tar.bz2 (tar with bzip2)

 Compress $ tar -jcvf file.tar.bz2 folder/
 Extract $ tar -xjvf file.tar.bz2
 See content (without extracting) $ bzip2 -dc file.tar.bz2 | tar -tvf -

.zip (zip)

 Compress $ zip -r file.zip folder/
 Extract $ unzip file.zip
 See content (without extracting) $ unzip -v file.zip

.rar (rar)

 # apt-get install rar (You must install first)
 Compress $ rar a file.rar folder/
 Extract $ rar e file.rar
 See content (without extracting) $ rar v file.rar
 See content (without extracting) $ rar l file.rar
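The extraction lines from the summary above can be gathered into one small shell function that picks the command from the file extension. This is only a sketch: the function name extract is invented here, it covers just the tar-based, gzip, bzip2 and zip formats listed, and it assumes the matching tools (gzip, bzip2, unzip) are installed.

```shell
# Extract an archive into the current directory, choosing switches by extension.
extract() {
    case "$1" in
        *.tar.gz|*.tgz) tar -xzf "$1" ;;
        *.tar.bz2)      tar -xjf "$1" ;;
        *.tar)          tar -xf  "$1" ;;
        *.gz)           gzip -d  "$1" ;;
        *.bz2)          bzip2 -d "$1" ;;
        *.zip)          unzip -q "$1" ;;
        *) echo "extract: unrecognised extension: $1" >&2; return 1 ;;
    esac
}

# Try it on a small archive built in a scratch directory.
work=$(mktemp -d)
cd "$work"
mkdir folder
echo "payload" > folder/data.txt
tar -czf files.tar.gz folder
rm -r folder

extract files.tar.gz
result=$(cat folder/data.txt)
echo "$result"

cd /
rm -r "$work"
```

The order of the patterns matters: the .tar.gz and .tar.bz2 cases must come before the plain .gz and .bz2 cases, otherwise a tarball would only be decompressed and not unpacked.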

We certainly hope you experiment with 'tar' and its various options, and get to know and understand the tools available to you.