jorvis AT tigr.org

beginning perl for bioinformatics

overview

The best way to learn a language is with hands-on practice. I lead a perl class at The Institute for Genomic Research (TIGR) designed for absolute beginners to programming. It covers the basics of the language with practical examples relevant to the field of bioinformatics. This class is offered to all TIGR employees and will be held 3 or 4 times per year. Each offering consists of four 3-hour sessions. Students follow along on computers provided for the course, writing code along with the instructor during each class.

I've written a guidebook to steer students through the language. It is written so that those who cannot attend the class can use it entirely on their own. You can find a PDF version of this guide, along with links to the example scripts, in the "downloads" section on the left. The following sections are taken from the guide's introduction.

You do not have to be a TIGR employee to download or use any of these materials. I've made both the guidebook and the source materials freely available to anyone who wishes to use them (for non-profit purposes). If you want a quick primer for perl within the scope of bioinformatics, I welcome you to try this out. If you find you enjoy programming in perl, you can continue your studies with any of the texts listed on the left.

what is perl?

Perl is a computer programming language developed by a linguist and computer geek, Larry Wall. You should look to external resources for full descriptions of its use and history but, in short, it is used for a remarkably diverse set of applications.

How perl got its name depends on who you ask. Even the author seems to have two explanations for it. On serious occasions it stands for Practical Extraction and Report Language; other times it stands for Pathologically Eclectic Rubbish Lister.

Is perl on my computer?

Perl comes already installed on most unix/linux operating systems and on Mac OS X. It can be installed on other operating systems as well, such as Windows and Mac OS, by downloading the freely available ActivePerl from www.activestate.com.

How is perl different from other programming languages?

It's easy to learn. Anyone who has programmed in languages like C/C++ will be surprised at how quickly one can get started writing perl. Languages such as C/C++ must be compiled into binary before they can be executed. Java must be compiled into bytecode. This means that each time you change your program, you must compile it, then run it to see how it works, and repeat the cycle if something is still wrong. Perl is one of many interpreted languages, which means that an external program, the perl interpreter, reads your text file of code and executes it directly; there is no separate compile step. While this makes the development process a fair bit faster, it also means that perl code executes more slowly than code written in languages like C/C++. For many applications, such as web programming, the speed difference usually isn't noticeable, but for computationally intensive scientific analyses, such as the core of the BLAST algorithm, perl would not be the ideal choice.
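As a small illustration of that edit-and-run cycle, here is a minimal sketch of a complete perl program. The file name hello.pl and the subroutine name greeting are made up for this example and do not come from the guide.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the greeting in a subroutine (hypothetical name) so it is easy
# to change the text and run the script again.
sub greeting {
    my ($name) = @_;
    return "Hello, $name!";
}

# The interpreter reads this file and executes it directly.
print greeting("bioinformatics"), "\n";
```

Saved as hello.pl, the script runs with the command `perl hello.pl`. After any edit, running the same command again shows the change right away, with no compile step in between.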

Perl stands apart from other programming languages most notably in its ability to parse and manipulate text files. Its built-in string functions and powerful regular expressions make it an ideal choice for bioinformatics, since much biological data, such as DNA and protein sequences, is represented as text.
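To give a feel for this kind of text handling, here is a short sketch that treats a DNA sequence as an ordinary string. The sequence is invented for the example, and the subroutine names reverse_complement and count_motif are my own rather than anything built into perl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the reverse complement of a DNA sequence.
sub reverse_complement {
    my ($dna) = @_;
    my $rc = reverse $dna;           # reverse the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;    # swap each base for its complement
    return $rc;
}

# Count non-overlapping occurrences of a motif, ignoring case.
# Assumes the motif contains no capturing parentheses.
sub count_motif {
    my ($dna, $motif) = @_;
    my $count = () = $dna =~ /$motif/gi;
    return $count;
}

my $dna = "ATGGAATTCCGCGAATTCTAA";   # made-up example sequence

print "Length:             ", length($dna), "\n";
print "Reverse complement: ", reverse_complement($dna), "\n";
print "GAATTC sites:       ", count_motif($dna, "GAATTC"), "\n";
print "Starts with ATG:    ", ($dna =~ /^ATG/ ? "yes" : "no"), "\n";
```

The regular expression `/^ATG/` asks whether the string begins with a start codon, and `tr` translates one set of characters into another in a single step. Both are ordinary text operations that happen to map neatly onto sequence data.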

Do I know enough to start programming in perl?

I believe that perl is the best true programming language for beginners because of its extremely flexible syntax and lack of typed variables. To get started with it, one only needs to know how to type in a plain-text editor (such as Notepad on Windows or emacs on *nix) and be comfortable enough with an operating system to understand how to manipulate files and folders. If I ask you to create a folder called "blah" and within it create a file called "script.pl" and you know what to do, you're in good shape.
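Here is a brief sketch of what the lack of typed variables looks like in practice. The subroutine name describe_length is hypothetical; the point is only that the same variable can be handed text or a number and perl converts as needed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# $length may arrive as text ("12") or as a number (12);
# there is no type declaration either way.
sub describe_length {
    my ($length) = @_;
    my $doubled = $length * 2;                      # used as a number here
    return "Doubled length: " . $doubled . " bp";   # and as text here
}

print describe_length("12"), "\n";   # text in
print describe_length(12), "\n";     # number in, same result
```

In a language with typed variables, the text "12" would usually need an explicit conversion before it could be multiplied.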

Because this is a class geared towards perl's use in bioinformatics, I'm going to assume that you know some basic biological concepts. You should, for instance, know the difference between a DNA molecule and a protein, and how they are represented as text (or be able to look it up on your own time).

see the download links on the left for more ...