Data miners scout for more powerful tools

Saturday, January 9th, 2010

Artwork: Shovamayee Maharana

Starting January 18, 2010, NCBS and IISc will co-host the annual Asia Pacific Bioinformatics Conference (APBC). The APBC series, founded in 2003, is an annual international forum for exploring research, development and applications of bioinformatics and computational biology.

Bioinformatics is the application of data mining techniques to detect patterns in molecular biological information. The basic information can be of many types, e.g. DNA sequence data, the amino acid composition of proteins, and the levels of gene and protein expression in the cell. Since this data is now being collected in vast quantities, often using automated techniques, one of the main challenges of bioinformatics is to develop ever-more powerful computational tools.

Data mining methodology will be the focus of the 2010 meeting. As Ramanathan Sowdhamini, a professor at NCBS and co-chair of the conference, says: "Biological data are often fuzzy and it is hard to recognise the patterns that exist within groups. Informaticians are devising ever-more computationally intense algorithms like Hidden Markov Models, Support Vector Machines and, most recently, Random Forest techniques to organize and analyse biological data. Such analyses have high predictive power and can in fact be the prelude to many biological experiments."

Random Forest: small trees speak the truth

Computer chess programs are not big on chess theory: they look at as many sequences of moves as they can (a "decision tree"), and pick the one that leads to some very obvious advantage. Likewise, in molecular biology, the unprejudiced examination of decision trees is at the heart of many data mining approaches. The algorithms test many combinations of the data to see if one feature (e.g. amino acid sequence) predicts another (e.g. protein structure). But just as a chess program can run out of processing power, so can data miners when they search through huge numbers of decision trees. To make life simpler for their computers, bioinformaticians have introduced techniques to "prune" the individual decision trees. These dispense with some of the less valuable branches, but in doing so they can sacrifice some accuracy. One of the highlights of APBC2010 will be a focus on a relatively new technique, Random Forest, that removes the need for pruning and often provides a more accurate analysis.

The Random Forest method does, however, introduce its own shortcuts. Only a few randomly chosen features are considered at any one time, thus yielding smaller trees. Also, instead of relying on any single tree, the method uses the "average value" (or majority vote) of a group of trees (the forest) for its final prediction. To date these simplifications represent one of the best available compromises, delivering high accuracy from the available computing power.
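For readers who like to see the idea in code, here is a minimal Python sketch of a random forest classifier. It is an illustration only, not the software used by any APBC speaker: the function names, the toy numbers, the "A"/"B" labels and the parameters `n_trees`, `n_try` and `max_depth` are all assumptions made up for this example. Each tree is grown on a random resample of the data, each split looks at only `n_try` randomly chosen features, and the forest answers by majority vote.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, feature_ids):
    """Return the (feature, threshold) that lowers impurity most, or None."""
    best, best_score = None, gini(labels)
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # a split must send samples both ways
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def grow_tree(rows, labels, n_try, rng, depth=0, max_depth=5):
    """Grow one unpruned tree; every split sees only n_try random features."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth == max_depth or len(set(labels)) == 1:
        return majority  # leaf: just a class label
    features = rng.sample(range(len(rows[0])), n_try)
    split = best_split(rows, labels, features)
    if split is None:
        return majority
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            grow_tree([r for r, _ in left], [y for _, y in left],
                      n_try, rng, depth + 1, max_depth),
            grow_tree([r for r, _ in right], [y for _, y in right],
                      n_try, rng, depth + 1, max_depth))

def predict_tree(tree, row):
    """Walk from the root to a leaf (internal nodes are 4-tuples)."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def grow_forest(rows, labels, n_trees=25, n_try=1, seed=0):
    """Grow n_trees trees, each on a bootstrap resample of the data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], n_try, rng))
    return forest

def predict_forest(forest, row):
    """Majority vote over all trees in the forest."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]

# Invented toy data: two numeric features per sample, two classes.
rows = [(0.10, 0.20), (0.20, 0.10), (0.15, 0.30), (0.30, 0.25),
        (0.80, 0.90), (0.90, 0.70), (0.85, 0.80), (0.70, 0.95)]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]

forest = grow_forest(rows, labels)
print(predict_forest(forest, (0.05, 0.10)))
print(predict_forest(forest, (0.95, 0.90)))
```

No single tree here is pruned or especially clever; the `max_depth` cap is only a safety limit for the sketch. The accuracy comes from letting many small, slightly different trees vote. Real analyses would normally rely on an established library rather than hand-written code like this.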

A big Indian presence

APBC2010 is the meeting's eighth edition and its first time in India. Responding to the field's rapid growth, the four-day meeting is a large gathering at the J.N. Tata Auditorium, opposite the Indian Institute of Science, with plans for about 70 regular talks, 180 posters and seven keynote lectures. It will be a great opportunity for Indian bioinformaticians to network with colleagues from India and around the world. Already 200 delegates from India have signed up. Speakers from countries including the USA, Australia, the UK, Finland, Germany, China, Bangladesh and India will present their recent work at the conference.

