Data miners scout for more powerful tools
Bioinformatics is the application of data mining techniques to detect patterns in molecular biological information. The basic information can be of many types, e.g. DNA sequence data, the amino acid composition of proteins, and the levels of gene and protein expression in the cell. Since this data is now being collected in vast quantities, often using automated techniques, one of the main challenges of bioinformatics is to develop ever-more powerful computational tools.
And data mining methodology will be the focus of the 2010 meeting. As Ramanathan Sowdhamini, a professor at NCBS, and co-chair of the conference, says: "Biological data are often fuzzy and it is hard to recognise the patterns that exist within groups. Informaticians are devising ever-more computationally intense algorithms like Hidden Markov Models, Support Vector Machines and, most recently, Random Forest techniques to organize and analyse biological data. Such analyses have high predictive power and can in fact be the prelude to many biological experiments."
Random Forest: small trees speak the truth
Computer chess programs are not big on chess theory: they search through a vast number of possible move sequences (a "decision tree") and pick the one that leads to some very obvious advantage. Likewise, in molecular biology, the unprejudiced examination of decision trees is at the heart of many data mining approaches. The algorithms test many combinations of the data to see if one feature (e.g. amino acid sequence) predicts another (e.g. protein structure). But just as a chess program can run out of processing power, so can data miners when they search through huge numbers of decision trees. To make life simpler for their computers, bioinformaticians have introduced techniques to "prune" the individual decision trees. These dispense with some of the less valuable branches, but in doing so they can sacrifice some accuracy. One of the highlights of APBC2010 will be a focus on a relatively new technique, Random Forest, that removes the need for pruning and often provides a more accurate analysis.
The Random Forest method does, however, introduce its own shortcuts. Each tree considers only a few randomly chosen features at any one time, which makes the individual trees cheaper to build. Also, instead of relying on any single tree, the method uses the "average value" (or majority vote) of a group of trees (the forest) for its further calculations. To date, these simplifications have proved a very good compromise, delivering a great deal of accuracy from the available computing power.
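For readers who like to see the idea in code, the two shortcuts described above can be sketched in a few lines of plain Python. This is only an illustrative toy under stated assumptions, not the implementation used by any APBC2010 speaker: the function names (grow_tree, grow_forest, predict_forest), the parameter n_try and the eight data points are all invented for this example. Each tree is grown on a random resample of the data, each split looks at only n_try randomly chosen features, and the forest answers by majority vote.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means perfectly pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, feature_ids):
    """Return (score, feature, threshold) with the lowest weighted impurity."""
    best = None
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # a split must send samples both ways
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def grow_tree(rows, labels, n_try, rng):
    """Grow an unpruned tree; every split sees only n_try random features."""
    if len(set(labels)) == 1:
        return labels[0]  # pure leaf
    feature_ids = rng.sample(range(len(rows[0])), n_try)
    split = best_split(rows, labels, feature_ids)
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # majority leaf
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            grow_tree([r for r, _ in left], [y for _, y in left], n_try, rng),
            grow_tree([r for r, _ in right], [y for _, y in right], n_try, rng))

def predict_tree(tree, row):
    """Walk from the root to a leaf; internal nodes are tuples, leaves are labels."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def grow_forest(rows, labels, n_trees=25, n_try=1, seed=0):
    """Grow n_trees trees, each on a bootstrap resample of the data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        forest.append(grow_tree([rows[i] for i in idx],
                                [labels[i] for i in idx], n_try, rng))
    return forest

def predict_forest(forest, row):
    """Majority vote over all trees in the forest."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]

# Invented toy data: two numeric features per sample, two classes.
rows = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 9), (9, 8), (8, 8), (9, 9)]
labels = ["A", "A", "A", "A", "B", "B", "B", "B"]
forest = grow_forest(rows, labels)
print(predict_forest(forest, (1, 1)), predict_forest(forest, (9, 9)))
```

No single tree here is pruned or especially reliable; the point of the sketch is that the vote across many cheap, randomised trees is what the method trusts. Real analyses would use an established library rather than a toy like this.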
A big Indian presence
APBC2010 is the meeting's eighth edition and its first time in India. Responding to the field's rapid growth, the four-day meeting will be a large gathering at the J.N. Tata Auditorium, opposite the Indian Institute of Science, with plans for about 70 regular talks, 180 posters and seven keynote lectures. It will be a great opportunity for Indian bioinformaticians to network with colleagues from India and around the world. Already 200 delegates from India have signed up. Speakers from countries including the USA, Australia, the UK, Finland, Germany, China, Bangladesh and India will present their recent work at the conference.