Pan African Bioinformatics Network for H3Africa


H3ABioNet is driving the research and development of bioinformatics and genomics tools, standardised workflows and pipelines for H3Africa research projects. H3ABioNet nodes are collectively harnessing their areas of expertise to initiate collaborative projects between the various nodes within Africa. Bioinformatics pipelines, software tools and computing platforms are actively being developed, ranging from patient databases and analysis software to high-performance computational pipelines, workflows and tools that enable data storage, management, accessibility, mining and analysis.

H3ABioNet is also promoting the collaboration of projects between different Node members with projects of various scope and scale underway. To propose a research project or find expertise for an existing research project, please complete the Research Project Proposal.

Development of a generic framework for building a database for clinical data

Dr. Nicki Tiffin, Dr. Junaid Gamieldien, Dr. Adam Dawe, Dr. Jean-Baka Domelevo-Entfellner

The patient databasing SOP is designed to assist researchers by providing effective ways to upload, store, data-mine and back up clinical patient data. The selected system will be made available to all H3Africa partners, who will be able to receive assistance from H3ABioNet in installing, designing and using patient databases for their projects.
Functionality that will be provided includes:

  • Database installation and mirroring
  • Database backup
  • Front end browser-based data entry forms that can be used from sites remote to the database, via the internet
  • Browser-based data-querying interface that can be used from sites remote to the database, via the internet
  • Mechanisms to ensure database fidelity
  • The capacity to link patient clinical data with genomic data through effective database and front-end solutions.

The input data will be clinical and biochemical fields including:

  • Patient demographic data
  • Clinical data, including disease phenotypes and biochemical parameters
  • Environmental data
  • Data collected on patient questionnaires
  • Capacity for upload and storage of scanned forms, questionnaires, etc.

The patient database solutions will be coordinated with solutions to store genomic data generated by the projects.
The output of the patient database will be determined by the researchers on the individual projects, according to their research questions and the type of data captured. The database solutions will be designed to allow effective querying of the stored data by end users, through a user-friendly graphic interface. The database front end will facilitate reporting and statistical analysis of the data.
We are currently exploring existing tools for effective patient databasing. We have identified REDCap as a suitable tool to implement patient databasing and are testing it on a prototype patient database. REDCap integrates a back-end SQL relational database with a front-end browser-based GUI for data entry and mining of clinical data. The REDCap software is free and available under license from the REDCap Consortium administered from Vanderbilt University (http://project-redcap.org/).

We will explore the use of a BioMart browser-based front end (http://www.biomart.org/) to integrate the SQL patient database with genomics databases developed for the project. The REDCap software has been installed on the server infrastructure at SANBI. It is being used to develop a prototype database in conjunction with the Kidney Disease Research Network, and in collaboration with the University of Michigan research partners in the network, who have experience working with REDCap. The database will be mirrored at Noguchi University, the University of Michigan and SANBI, in order to provide database backup and a disaster recovery plan, and to facilitate uploading of patient data to a local server to avoid difficulties in data entry associated with international internet access problems in Africa. We will be working with the Kidney Disease Research Network to develop a genomics database, and will again use this opportunity as a prototype for integrated access to patient and genomics data. The lessons learned from this prototype development will be used to define guidelines and Standard Operating Procedures to assist all researchers in their patient database implementations.
The intended users are the clinical and biomedical researchers in the H3Africa projects, both for data entry and data querying. The database interfaces will ensure that datasets are accessible for statistical analysis. Access will be provided through the internet using a user-friendly graphical user interface. For REDCap, data is entered through custom-designed forms that reflect the questionnaires used in the field for data collection. Data querying and reporting are also performed through a customized graphical user interface based on the data types entered.
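
As a sketch of how such programmatic access typically works, the fragment below builds the kind of POST payload the REDCap API expects for a record export. The endpoint URL, token and field names here are placeholders for illustration, not details of any actual H3Africa project:

```python
# Sketch of exporting records from a REDCap project via its API.
# The token "XXXX" and the field names are placeholders -- each REDCap
# installation issues project-specific API tokens.
def build_export_request(token, fields=None, fmt="json"):
    """Build the POST payload REDCap expects for a record export."""
    payload = {
        "token": token,
        "content": "record",
        "format": fmt,
        "type": "flat",          # one row per record
    }
    if fields:
        payload["fields"] = ",".join(fields)
    return payload

payload = build_export_request("XXXX", fields=["patient_id", "creatinine"])
# The payload would then be POSTed to the project's API endpoint, e.g.:
#   requests.post("https://redcap.example.org/api/", data=payload)
```

The same payload structure, with `content` set to other values, drives imports, metadata queries and file uploads, which is what makes scripted mirroring and backup of project data feasible.
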
Poor databasing and storage of clinical data is a frequent and significant challenge in clinical studies.
The patient databasing tools aim to address this challenge by facilitating:

  • Easy entry of data via graphical forms that can be accessed through a standard internet browser, with checks in place to ensure high fidelity of data entry.
  • Secure and backed up clinical data storage.
  • Easy querying of patient data through a graphical user interface that can be opened in a standard internet browser.

NetCapDB: A central database for H3ABioNet information and online reporting system

Dr. Nicki Tiffin, Dr. Junaid Gamieldien, Dr. Adam Dawe, Dr. Jean-Baka Domelevo-Entfellner

The central database and front end will facilitate storage of a range of metrics for the nodes and individuals involved in H3ABioNet, to reflect bioinformatics capacity within the network. These metrics will be continually updated by members of the network to reflect changes in bioinformatics capacity resulting from H3ABioNet activities. The database will provide information about strengths and weaknesses of the network, in order to harness the strengths and improve the areas with poor capacity. It will also give funders the opportunity to assess areas in which funding has translated into good capacity growth, and will assist accurate reporting and assessment of developments within H3ABioNet by providing up-to-date metrics about outputs and achievements across the network. Finally, the database will give individual nodes the opportunity to assess their own strengths and weaknesses and to set appropriate, realistic goals for improving capacity.
The types of input data required by each node/individual are extensive. The data-fields include the following general areas:

  • Human capacity: staff/students/bioinformaticians/systems administrators
  • Bioinformatics skills
  • Training capacity: workshops/degrees conferred; workshops and courses attended
  • Hardware capacity: Server infrastructure and computing equipment
  • Bandwidth and connectivity
  • Research outputs: Publications, conference outputs, tools and databases developed
  • Professional affiliations and outputs
  • H3A collaborations
  • Other collaborations
  • Travel, country and cross-country interactions

The output of the database will be metrics that reflect the bioinformatics capacity of H3ABioNet at individual, node and network level.
The database is implemented in PostgreSQL and has a browser-based graphical user interface for data entry from sites remote to the server, using the internet. A BioMart graphical front end has been developed for querying the relational database. The database is hosted at SANBI and can be accessed by users (through a secure login system) for data entry and access to their own node data.
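
The actual NetCapDB schema is not published here, but the node/metric design can be sketched minimally as follows. This illustration uses Python's built-in sqlite3 in place of PostgreSQL, and all table and column names are hypothetical:

```python
import sqlite3

# In-memory stand-in for the PostgreSQL database; the schema is
# illustrative only, not the real NetCapDB schema.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE node (
    node_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE capacity_metric (
    metric_id INTEGER PRIMARY KEY,
    node_id   INTEGER REFERENCES node(node_id),
    category  TEXT,     -- e.g. 'human capacity', 'hardware', 'publications'
    value     TEXT,
    recorded  TEXT      -- date of the capacity snapshot
);
""")
db.execute("INSERT INTO node (node_id, name) VALUES (1, 'SANBI')")
db.execute("""INSERT INTO capacity_metric (node_id, category, value, recorded)
              VALUES (1, 'publications', '12', '2013-06-01')""")

# A node-level report: metrics joined back to the node that reported them.
rows = db.execute("""
    SELECT n.name, m.category, m.value
    FROM capacity_metric m JOIN node n ON n.node_id = m.node_id
""").fetchall()
```

Keeping dated snapshots per metric, as sketched above, is what allows change in capacity to be tracked over time rather than only the current state.
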
The database and its interfaces will be tested through a stepwise roll-out of access to the different nodes. Thus, data entry and data-mining will first be implemented at SANBI and tested by several SANBI users. Then access will be shared with the UCT node in order to check data entry and querying from an external node that is close enough to SANBI to enable easy troubleshooting if any issues arise. From there, access will be granted to one West African, one North African and one East African node, to pilot the more remote use of the database. Finally, access will be granted to all nodes for data entry in the first instance. Once data entry is completed for the baseline capacity of all nodes, a similar roll-out will be used for the data-querying functionality.
Access will be provided through the internet using web-browser-based forms and sites for data entry and querying. All members of the H3ABioNet will use the database. Each node will have access to their own data. Additionally, the NIH will have access to reports and data generated from the database to assist with assessing progress within H3ABioNet.  The H3ABioNet P.I. will have access to data from all nodes to assist with reporting and assessment of H3ABioNet capacity development.
The database is intended to assist with assessing bioinformatics capacity in Africa in a systematic and comprehensive way. Assessing capacity is important for understanding the strengths and needs of the bioinformatics groups in Africa, and identifying areas that require development. It is also important for tracking outputs related to funding. Finally, it is a way to identify and utilize bioinformatics skills and infrastructure that exist on the continent.

Development of a Recombination tool to determine ancestral populations

Dr. Darren Martin, Prof. Nicola Mulder

The tool will, without any prior information on the names and numbers of ancestral populations, deconstruct chromosome-scale SNP/sequence datasets into any number of sub-datasets containing groups of aligned SNPs/nucleotides that share common ancestries. It will do this by applying a range of established heuristic recombination event detection and analysis tools that have previously only been usable in the study of virus genome-scale datasets.
The tool will be a PC application that takes aligned nucleotide sequence data in any of the standard alignment formats (e.g. NEXUS, FASTA, ClustalW, PAUP, PHYLIP). For the analysis of SNP data from multiple individuals, the SNPs will need to be arranged in the order in which they occur on their respective chromosomes, and SNPs from different chromosomes should be analysed separately. SNPs must also be phased (i.e. with all the SNPs in each input sequence derived from the same chromosome). Finally, the phased SNPs must be aligned so that missing data/indels in particular individuals are represented by a gap character (“-”). It will then be possible to load these SNP alignments directly into the program for analysis. Alternatively, if full sequences for either individual chromosomes or concatenated exome sequences on individual chromosomes are to be analysed, the sequences will need to be aligned (with a program such as Mauve) and saved in a format such as XMFA before the program will be able to load them. It will be possible to analyse up to 1000 chromosomes at a time with the tool.
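
A pre-flight check of the kind such input handling implies can be sketched as follows. This is an illustrative fragment only (the parser and its rules are not part of the tool's specification): it verifies that all phased SNP sequences in a FASTA-style alignment are the same length and use only nucleotide or gap characters:

```python
def parse_fasta(text):
    """Parse a minimal FASTA alignment into {name: sequence}."""
    seqs, name = {}, None
    for line in text.strip().splitlines():
        if line.startswith(">"):
            name = line[1:].strip()
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line.strip().upper()
    return seqs

def validate_alignment(seqs, alphabet="ACGT-"):
    """All sequences must be equally long; '-' marks missing data/indels."""
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("sequences are not aligned (unequal lengths)")
    for name, s in seqs.items():
        bad = set(s) - set(alphabet)
        if bad:
            raise ValueError(f"{name}: illegal characters {bad}")
    return lengths.pop()

aln = parse_fasta(""">ind1
ACGT-ACG
>ind2
ACGTTACG""")
length = validate_alignment(aln)
```
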
The tool will output result files in three different formats:

  1. The output in “.rdp” file format will allow a user to examine the output in depth using the program RDP4. RDP4 implements a wide range of heuristic and parametric recombination analysis methods, tree-drawing and matrix tools, and also provides a variety of data representation features.
  2. The output in “.csv” file format will allow a user to browse the results in any standard spreadsheet application (such as Microsoft Excel). It will detail the positions of recombination breakpoints, the identity of recombinant sequences, the identity of sequences resembling the parental sequences, and the degree of statistical support, determined with five different recombination detection methods, for the inference that the identified tracts of sequence falling between breakpoints were derived through recombination.
  3. The tool will provide “distributed alignments” of SNPs/genome fragments derived from any number of user-specified ancestral populations, i.e. it will split the component nucleotides of input sequences into different alignments based on the ancestral populations from which they were derived (note that although the recombination analysis carried out by the tool will not require specification of ancestral population numbers, it will be up to the user to specify how many alignments the program should split the data into).

The key methods implemented by the tool are five of the heuristic recombination detection methods implemented in the computer program RDP4: RDP, GENECONV, MAXCHI, CHIMAERA and 3SEQ. The combined results of these methods will then be processed by the tool using the same algorithm implemented in RDP4 for identifying recombinant sequences, sequences resembling parental sequences, and recombination breakpoint locations. Crucially, the RDP4 algorithm does not require any prior information on either the number of underlying populations or the identity of likely admixed individuals. Also, by identifying sequences with shared recombinant histories, the algorithm counts recombination events relatively accurately and can therefore detect evidence of recombination hot- and cold-spots.
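
To illustrate the flavour of one of these methods, the toy fragment below implements the core of a MAXCHI-style scan: for a pair of aligned sequences, it slides a candidate breakpoint along the alignment and computes a 2x2 chi-squared statistic contrasting match/mismatch counts on either side of it. This is a deliberate simplification for illustration, not the RDP4 implementation (which adds windowing, permutation testing and multiple-testing corrections):

```python
def maxchi_scan(seq1, seq2):
    """Return (best_breakpoint, best_chi2) for a pair of aligned sequences."""
    match = [a == b for a, b in zip(seq1, seq2)]
    n = len(match)
    best_k, best_chi2 = None, 0.0
    for k in range(2, n - 1):           # candidate breakpoint positions
        a = sum(match[:k])              # matches left of the breakpoint
        b = k - a                       # mismatches left
        c = sum(match[k:])              # matches right
        d = (n - k) - c                 # mismatches right
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        if denom == 0:
            continue
        chi2 = n * (a * d - b * c) ** 2 / denom
        if chi2 > best_chi2:
            best_k, best_chi2 = k, chi2
    return best_k, best_chi2

# A recombinant whose first 20 sites match one parent and last 20 do not:
parent_a = "A" * 40
recombinant = "A" * 20 + "C" * 20
bp, chi2 = maxchi_scan(recombinant, parent_a)   # breakpoint detected at 20
```
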
The tool will be tested for power and false positive rates of recombination detection using coalescent simulations of unstructured populations with recombination (where the true ancestry of every nucleotide is not known). It will be compared with other currently available ancestry inference tools (such as WINPOP and LAMPLD) for the accuracy of breakpoint, recombinant sequence and parental sequence identification, using simulations with up to 10-way admixture between structured populations (where the ancestry of every analysed nucleotide is known). It will be tested for consistency with other available recombination analysis tools (such as LDHAT and CLONALFRAME) on previously analysed chromosome-scale human and animal datasets, with respect to (1) the overall correlation between recombination event numbers inferred by our tool and recombination rate estimates inferred by established parametric tools (i.e. tools that, while accurate at estimating overall recombination rates, perform poorly at identifying individual recombination events) and (2) the power and accuracy with which recombination hot- and cold-spots can be detected. The tool will be available free for download, both as a command-line version that runs on PCs/PC emulation software (and can therefore be plugged into sequence analysis pipelines) and as a fully integrated component of the RDP4 recombination analysis program. As with the RDP4 tool, the source code will also be available for download.

The tool will provide a completely independent and relatively fast alternative means of determining the likely ancestries of SNPs/nucleotides within admixed populations where there is no prior information available on the likely identities or numbers of ancestral populations. Such a tool would be particularly valuable in GWAS of African populations, where such admixture is bound to obscure subtle genetic associations with phenotypic traits. Besides the tool’s utility in “cleaning up” input data for GWAS, it will also be extremely useful in studies of the genetic exchange process itself. The main advantage of the tool over others currently available is that, without any prior input other than sequences (and potentially geographical sampling locations), it will yield:

  1. Precise chromosome-scale maps of both intra- and inter-population recombination.
  2. Precise networks detailing the polarity of recombinational sequence transfers both between and within populations.
  3. Relatively precise genealogical information on when (and potentially also where, if sampling locations are provided) many of the recombination events likely occurred (it will be able to retrieve clonal phylogenetic trees for any particular genome location from the network graph describing the evolution of each chromosome, use these as reasonably precise proxies for the genealogies of those genomic regions, and map the recombination events to specific branches in these trees).

The tool will also prove invaluable in efforts to use phylogenetics-based molecular clock and phylogeographic approaches to accurately date and place the first occurrence of particular genetic polymorphisms.

HUMA: A platform for the analysis of genetic variation in humans

David Brown, Rowan Hatherley and Özlem Tastan Bishop

The HUMA (HUman Mutation Analysis) web server has been developed to provide a platform for the analysis of genetic variation in humans. It consists of a number of modules, which can function independently, but which become far more powerful when used in conjunction with one another.

The foundation of the HUMA web server is the database. It consists of genes, proteins and protein structures, mutations and mutation scores, and diseases. Data was retrieved from a number of different sources: genes were obtained from the HGNC, proteins and their sequences from UniProt, protein structures from the PDB, diseases from the NCBI databases MedGen and ClinVar, and mutations and mutation scores from dbNSFP. All data in the database is linked, providing the user with powerful querying abilities. For example, a search for a particular disease will return information about that disease as well as associated proteins, genes, and mutations. From there, users can find, for example, which residues in the associated proteins are linked to the disease and where those residues are on the protein structure.

It often happens that there are no experimentally determined structures available for a protein. For these cases we have developed 3DModel, a homology modelling pipeline that allows users to model the structure of a selected protein with the minimum input being a protein sequence. 3DModel is more than just a pipeline though, as it allows users to interact with results throughout the process. For example, users can tell the pipeline to pause after the alignment stage to allow them to manually edit the generated alignment. HUMA also makes use of 3DModel to model proteins with user-selected mutations. In future, tools will be added to analyse these models to determine the structural effects of the mutations.

In order to integrate tools into the web server, we have developed the Job Management System (JMS). Originally developed as part of HUMA, the JMS has become a standalone tool. It is essentially a web-based front end to a cluster that allows users to create and run workflows and integrate those workflows into their own web servers. The JMS also has its own web interface for users who are not interested in using it to integrate tools into their own web servers, but rather want to use it as a front end to their cluster. It allows complex workflows to be designed, in which certain stages depend on one another and other stages may run in parallel. With the JMS, adding tools and scripts and designing workflows is done through an easy-to-use web interface, negating the need for the complicated configuration files that some similar systems use. The JMS also allows batch jobs to be run: users can generate a batch file with hundreds of lines of inputs and submit it rather than submitting individual jobs. The JMS will also provide users with the ability to create input profiles for their workflows, which are used to fill in selected parameters with default values. This is useful when a user is planning to run a number of jobs and wants to keep certain parameters constant. These input profiles can also be used with batch files. On completion, the JMS will be open-sourced and made available for download.
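
The dependency-aware workflow idea can be sketched as a topological execution order. This toy scheduler is illustrative only and does not reflect the JMS's actual internals; stages with no unmet dependencies are exactly the ones a cluster scheduler could dispatch in parallel:

```python
def execution_order(stages):
    """stages: {name: [names it depends on]} -> a valid run order."""
    order, done = [], set()
    while len(done) < len(stages):
        ready = [s for s, deps in stages.items()
                 if s not in done and all(d in done for d in deps)]
        if not ready:
            raise ValueError("cycle in workflow dependencies")
        # All 'ready' stages are independent and could run in parallel.
        for s in sorted(ready):
            order.append(s)
            done.add(s)
    return order

# A hypothetical four-stage workflow: QC and variant calling both
# depend on alignment; the report depends on both of them.
workflow = {
    "align":  [],
    "call":   ["align"],
    "qc":     ["align"],
    "report": ["call", "qc"],
}
order = execution_order(workflow)
```
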

All the functionality and data available from the above-mentioned tools will be accessible via intuitive RESTful web APIs as well as user-friendly web interfaces. In the case of the JMS, the software will be available for download and can be installed on the user's own servers.

Development of a Functional SNP Calling Pipeline

Dr. Nicki Tiffin, Dr. Junaid Gamieldien, Dr. Adam Dawe, Dr. Jean-Baka Domelevo-Entfellner

Several tools exist for predicting the potential effect of a coding variant on a protein’s function. As the different tools have their own strengths and weaknesses, it is preferable to use multiple algorithms rather than relying on a single one. The envisaged pipeline will use the dbNSFP database, which contains prediction scores from a multitude of tools for all possible missense variants in all currently known coding regions of the human genome, to assess the potential impact of lists of coding variants identified through exome sequencing or targeted re-sequencing studies. It will also enable the user to rank variants based on:

  • the number of tools predicting it to be functional
  • being called functional and lying in a region with a high conservation score
  • being called functional and being a rare variant based on MAFs from dbSNP, population-specific subsets of dbSNP, the NHLBI Grand Opportunity Exome Sequencing Project, etc.
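
A minimal sketch of the ranking logic described above might look like this. The field names and the MAF cutoff are hypothetical, not those of the actual pipeline or of dbNSFP:

```python
def rank_variants(variants, maf_cutoff=0.01):
    """Rank variants: rare variants (MAF below cutoff) sort first, then
    more tools calling 'functional', then higher conservation score."""
    def key(v):
        return (
            0 if v["maf"] < maf_cutoff else 1,   # rare variants first
            -v["n_tools_functional"],            # more tools agreeing first
            -v["conservation"],                  # then higher conservation
        )
    return sorted(variants, key=key)

# Hypothetical variant records, as the pipeline might assemble from dbNSFP:
variants = [
    {"id": "rs1",     "n_tools_functional": 2, "conservation": 0.4, "maf": 0.20},
    {"id": "v_novel", "n_tools_functional": 5, "conservation": 0.9, "maf": 0.001},
    {"id": "rs2",     "n_tools_functional": 5, "conservation": 0.7, "maf": 0.05},
]
ranked = [v["id"] for v in rank_variants(variants)]
```
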

The pipeline will require the user to input a set of coding variants in standard VCF format.
The output will be a CSV-formatted text file containing a list of SNPs with potential functional effects, either ranked or filtered according to the criteria set by the user. Other annotations produced include:

  • Prediction scores for each tool
  • Conservation scores, where available
  • rs_ids for known variants
  • Gene annotations (names, symbols, refseq ids, etc)
  • Pathway
  • Known disease involvement and/or trait association, if any
  • Tissue expression

Pre-calculated functional prediction calls from a combination of tools (SIFT, PolyPhen2, MutationTaster, LRT, PhyloP and Mutation Assessor), as well as conservation information about the site of interest, will be used to rank or filter a list of missense variants. Sets of known disease-associated missense variants from OMIM, and common missense variants, will be obtained from the UCSC Genome Browser to test the accuracy of the different filtering criteria and to establish filtering guidelines that produce manageable lists for verification without ‘over-filtering’. The computational pipeline will be hosted on a dedicated H3ABioNet server and access provided via a web interface (a simple form, with an optional input file upload facility). NGS re-sequencing studies produce more variants than can be confirmed or assessed, and filtering of newly discovered variants based on the potential functional effect predicted by multiple tools is essential. While the data to perform such filtering is available, a web service is needed to simplify the process for non-bioinformaticians, who should also be able to decide on the filtering criteria.

Development of an Admixture mapping tool

Dr. Emile Rugamika, Prof. Nicola Mulder

The admixture mapping project comprises a suite of tools for use on multi-way admixed populations, to overcome the limitations of existing tools, which tend to work best with 2- or 3-way admixed populations only. Specifications have been defined for the tools to enable development to begin. The two admixture tools included in this project are:

  1. Tool for selecting the best proxy ancestral populations for an admixed population
  2. Tool for inferring local ancestry in admixed populations

The first tool is an important precursor for the second, as identifying the correct ancestral populations is crucial for accurately inferring local ancestry. A prototype for this first tool, PROXYANC, has already been developed in the group. PROXYANC implements two novel algorithms: one based on the correlation between observed linkage disequilibrium in an admixed population and population genetic differentiation in the ancestral populations, and one based on optimal quadratic programming over a linear combination of population genetic distances (FST). PROXYANC was evaluated against other methods, such as the f3 statistic, using a simulated 5-way admixed population as well as real data for the local South African Coloured (SAC) population, which is also 5-way admixed. The simulation results showed that PROXYANC was a significant improvement on existing methods for multi-way admixed populations.
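
For illustration, a per-site FST between two candidate ancestral populations can be estimated from allele frequencies. The sketch below uses Hudson's estimator to show the kind of population genetic distance involved; it is not PROXYANC's actual code, and the frequencies and sample sizes are made up:

```python
def hudson_fst(p1, n1, p2, n2):
    """Hudson's FST estimator for one biallelic site.
    p1, p2: alternate allele frequencies; n1, n2: haploid sample sizes."""
    num = ((p1 - p2) ** 2
           - p1 * (1 - p1) / (n1 - 1)
           - p2 * (1 - p2) / (n2 - 1))
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num / den

# A strongly differentiated site between two candidate ancestral panels:
fst = hudson_fst(0.9, 50, 0.1, 50)
```

Averaging the numerator and denominator separately across many sites, rather than averaging per-site ratios, is the usual way such estimators are applied genome-wide.
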

For the second tool, we have evaluated some of the existing methods for inferring local (or locus-specific) ancestry and determining the date of admixture in multi-way admixed populations, including the SAC and simulated data. These methods include HapMix, ROLLOFF and a PCA-based method (StepPCO) for dating admixture, and WinPOP and LampLD for local ancestry. All three of the dating tools gave quite different predictions of the date of admixture events, showing the lack of accuracy of existing methods and the need for a better one. For the LampLD and WinPOP evaluations on simulated data, the correlation between the estimated and true ancestry was used. In general, LampLD provided more accurate estimations of local ancestry for the multi-way admixed simulated data, but still demonstrated an appreciable error rate. We have used these results, and our knowledge of the underlying methods, to draw up specifications for the new tool, covering its requirements and how to address the limitations of existing tools and methods.
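
The evaluation metric mentioned above, the correlation between estimated and true local ancestry, can be sketched as a Pearson correlation over per-site ancestry dosages. This is a simplified illustration of the evaluation idea, not the benchmark code itself, and the dosage vectors are invented:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Per-site ancestry dosages (copies of ancestry-1 alleles, 0..2):
true_anc  = [0, 0, 1, 1, 2, 2, 1, 0]
estimated = [0, 0, 1, 2, 2, 2, 1, 0]   # one site mis-assigned
r = pearson(true_anc, estimated)
```
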

Development of a DAS-based visualization tool

Gustavo Salazar, Ayton Meintjes, Dr. Gaston Mazandu, Holifidy Arisoa Rapanoel, Richard Olatokunbo, Prof. Nicola Mulder

With respect to the integration of data, we have collaborated with the European Bioinformatics Institute in the development of a server named MyDas, which uses the Distributed Annotation System (DAS) to publish biological data (mainly genomics and proteomics). We have also proposed an extension to the DAS protocol to allow collaborative annotation of biological resources. The current implementation allows one, for instance, to create annotations on proteins using the web tool Dasty3. This experience has given us a better understanding of the data that is commonly used in the bioinformatics subfields, and with it we have been able to move into the second focus area: visualization. In a subsequent collaboration we participated in the design of BioJS (http://wwwdev.ebi.ac.uk/Tools/biojs/registry/), a library of JavaScript components, and developed some of the current components of the library. Using these BioJS components we have developed Pinv (http://biosual.cbio.uct.ac.za/interactions), a web-based tool to visualize interactions of proteins reported for Mycobacterium tuberculosis (intra-species) and between M. tuberculosis and human (inter-species).

Pinv requires two datasets to fully operate: one that describes the network, and a second that contains the annotations of the proteins to display. The first dataset can be as simple as a CSV file with two columns indicating the two interactors. Each row can also carry extra information about the interaction itself, such as the evidence used to detect it. The second dataset, containing annotations of the proteins to display, is used not only for display purposes in the graphic but also to filter and manipulate the current visualization. The visualization uses a combination of methods to make the relationships between proteins more evident, and the tool provides a flexible way to define filters and visual options for the selected sub-network.
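
A minimal network file of the kind described, and its parse into an adjacency structure, might look like this. The protein identifiers and evidence labels are placeholders, and the optional third column illustrates the per-interaction evidence mentioned above:

```python
import csv, io

# Two-column interaction file; an optional third column carries evidence.
# Identifiers are hypothetical placeholders, not real accessions.
network_csv = """protA,protB,yeast-two-hybrid
protA,protC,co-purification
protB,protC,text-mining
"""

adjacency = {}
for row in csv.reader(io.StringIO(network_csv)):
    a, b = row[0], row[1]
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)

neighbours = sorted(adjacency["protC"])
```
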

The visualization runs completely on the client and was developed using the latest web technologies (HTML5), so the whole network viewer is fully embedded in the browser. A popular visualization library, D3, has been used, which makes extensive use of HTML5 technologies to deliver simple and rich visualizations. The two input files are processed and stored in a Solr server for ease of querying and quick responses. The members involved in the project can be split into two groups: developers and biologists. The latter group selects and curates the datasets and, moreover, constantly tests prototype versions of the application, reporting bugs and suggesting new features.
The current version of Pinv can be accessed at http://biosual.cbio.uct.ac.za/interactions and the code is openly available at http://code.google.com/p/biological-networks/
The current app is limited to interactions from human and Mycobacterium tuberculosis, but different datasets can be loaded into the server and visualized. The target audience of this tool is therefore any researcher of protein networks who wishes to visually navigate the data, or simply to capture a graphic of a region of interest in their network for publication purposes. The user can also map functional genomics and genotyping data onto the interaction network.

Development of a Grid-based tool for data storage and sharing

Dr. Scott Hazelhurst, Dr. Shaun Arron, Dr. Edward Steere

The management of scientific and bioinformatics data is a well-recognised problem in general, and will be a particular issue for projects under the H3Africa umbrella. We wish to develop tools that allow the secure, auditable and reliable storage of such data in a way that allows collaboration between project members across multiple sites. Research questions that this project will tackle include user-interface design, meta-data standards, the tensions between security, accessibility, reliability and backup in low-bandwidth environments, and interfacing with external systems. The system will have a web-based front end and is primarily aimed at wet-lab scientists who need to perform bioinformatics analyses.
The input data types are flexible, although typically sequence, demographic and clinical data, plus metadata, will be used. The tool will provide secure archiving of data. The outputs are:

  • Data previously uploaded can be downloaded
  • Data can be shared
  • Reports of what data is in the system
  • Audit trail of all data uploaded, edited or removed

There are three key parts:

  1. Grid middleware
  2. User interface
  3. Meta-data standards

The underlying technology we wish to use is grid middleware. The key advantages of this are:

  1. Secure certificate-based authentication provided by officially recognised providers.
  2. Infrastructure for transporting and sharing data. This supports sharing data between collaborators at different sites and backups, but in a secure and auditable way.

The other key requirement of the tool is that it be user-friendly through a good user-interface which allows non-expert users (e.g., clinicians) to upload and download data easily and securely and to search for data based on appropriate meta-data standards (e.g., HL7, GO).
The first phase is alpha testing for reliability, ensuring that uploaded data is safely stored and backed up; the reliability of our system is tested at this stage. The next step will be to use the system for student projects at the University of the Witwatersrand. This will allow the basic functionality to be explored and used by a varied group of individuals. The data here is relatively non-critical (provided the system is reliable, the students will have a far better system than they currently have). The third phase would be to use it for the uploading and recording of raw data.

In the longer term, it is possible for the system to be extended with plug-ins to allow processing. However, it must be emphasized that we see this system primarily as a way of storing raw data securely, easily and in an auditable way. Access will initially be provided via the internet. (We see this as a system that will be used to support the H3Africa AWI-GEN Centre at the University of the Witwatersrand.)

If successful, collaborators will be able to use our system, and the code we develop will be available for installation at other sites. The primary users are clinicians and geneticists, although bioinformaticians would also use the system.

H3Africa projects will generate vast amounts of data of different sorts (sequence, demographic, phenotypic). Without proper data management policies and systems, the success of H3Africa projects will be endangered. Key requirements are providing backup and allowing data to be shared effectively, while ensuring the data is kept securely and confidentially. The imperatives of sharing and security are clearly in tension, and we believe that grid middleware technology allows us to support both requirements properly. Searching of data is an additional need, as is simplicity of use, since we expect that the majority of users will not be expert bioinformaticians or computer users. We also need to take into account relatively low-bandwidth environments for data transfer, so providing asynchronous communication should increase usability and reliability.

Investigation into current Cloud, High Performance and Grid Computing resources available

Nodes involved: UCT, SANBI and IPT

The aim of the project is to conduct a pilot survey of the cloud, high performance computing and grid computing resources available for biomedical research in Africa in general, and in South Africa specifically.
The proposed project will seek to document:

  1. The current applications these platforms are used for.
  2. The potential applications these platforms can be used for.
  3. The advantages of these platforms.
  4. The disadvantages of these platforms.
  5. The ease of use of these platforms.
  6. The accessibility of these platforms.
  7. Any potential costs for using these platforms.

Bioinformatics is not the only science in which large-scale computing and huge datasets exist. Other sciences, such as physics and the atmospheric sciences, generate petabyte-scale data for which computational solutions must have been implemented. This project will look at how other scientific fields have dealt with the problem of “large data”, what computing platforms have been implemented, and what challenges those fields and bioinformatics face.

The method will involve a combination of literature mining, identifying current implementations of these platforms within Africa, and contacting projects that implement or have experience with using these resources.

Design and Implementation of a Sickle Cell Disease (SCD) database and Analysis modules

Collaborating Nodes: MUHAS and IPT

Currently, SC disease information for a Tanzanian cohort of patients is contained within a database comprising (some db architecture here e.g MySQL version, flat files etc). The data collected consist of numerous records and metrics, forming a data-rich source for the study of SC within East African populations. Large-scale phenotypic and epidemiological SC studies are hindered by the current poor database implementation, which does not support querying, managing, mining, adding to or integrating new and existing data structures for determining new genotype-to-phenotype associations.

The project proposes to re-design the current SCD database by modifying and creating new data structures; mapping relationships and attributes between these data structures to ensure robust classification; validating the data structures; developing intuitive new end-user queries; and increasing the overall database efficiency and speed. The new database model will be adapted and populated with new data structures, enabling statistical and large-scale / genome-wide association studies to be conducted in a systematic manner.

The method for re-modelling the SCD database will be to design a new database schema; re-map and construct new database tables for the various types of data collected and anticipated to be collected; define logical relationships between these data structures; validate them; and allow flexible queries to be made across all data structures. Mechanisms for the secure and easy updating of data contained within the SCD-db, and the ability to seamlessly integrate new data structures into the re-modelled SCD-db, must also be taken into account in the re-designed database. As the database contains patient data, security and accessibility are strictly controlled, and the database is only accessible via local storage with username/password login. MySQL Workbench will be used to develop the SCD-db schema, with close guidance from the database users in order to ascertain their needs. Data tables will be created using phpMyAdmin, and PHP scripts will be modified or created to fit the new data models. Data will be migrated from the old SCD database to the new one and validated by a team of specialists at the Muhimbili Wellcome Programme.
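
As an illustration of the kind of normalisation the re-design aims for, the following sketch uses SQLite in place of MySQL, with invented table and column names: patient, visit and lab-result data live in separate, linked tables, and a flexible cross-table query joins them.

```python
import sqlite3

# Hypothetical, simplified version of a re-modelled schema: patient,
# visit and lab-result data live in separate tables linked by keys, so
# new data structures can be added without touching existing tables.
# SQLite stands in for MySQL here; all names and values are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (
    patient_id  INTEGER PRIMARY KEY,
    sex         TEXT,
    birth_year  INTEGER
);
CREATE TABLE visit (
    visit_id    INTEGER PRIMARY KEY,
    patient_id  INTEGER NOT NULL REFERENCES patient(patient_id),
    visit_date  TEXT,
    severity    TEXT
);
CREATE TABLE lab_result (
    result_id   INTEGER PRIMARY KEY,
    visit_id    INTEGER NOT NULL REFERENCES visit(visit_id),
    test_name   TEXT,
    value       REAL
);
""")
conn.execute("INSERT INTO patient VALUES (1, 'F', 1998)")
conn.execute("INSERT INTO visit VALUES (10, 1, '2012-03-01', 'severe')")
conn.execute("INSERT INTO lab_result VALUES (100, 10, 'HbF', 6.2)")

# The kind of flexible cross-table query end users should be able to run.
row = conn.execute("""
    SELECT p.patient_id, v.severity, l.test_name, l.value
    FROM patient p
    JOIN visit v ON v.patient_id = p.patient_id
    JOIN lab_result l ON l.visit_id = v.visit_id
    WHERE l.test_name = 'HbF'
""").fetchone()
print(row)  # (1, 'severe', 'HbF', 6.2)
```

Separating entities this way is what makes it possible to bolt on new data structures (e.g. a genotype table keyed on patient_id) without re-working existing tables.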

The expected outcome of this project is a new implementation of the SCD-db that is well structured, easy to update, able to incorporate new data structures, and supports the flexible end-user queries that will form the bedrock of existing and future SCD studies seeking to better understand SCD dynamics within East African populations. This SCD-db will be, to our knowledge, the first of its kind within Africa developed through collaboration amongst African scientists.

Surveillance Programme of In-patients and Epidemiology (SPINE) for Sickle Cell Disease (SCD) of East African Populations

Collaborating Nodes: MUHAS, IPT, UCT and MLWTP

Currently, SC disease information for a Tanzanian cohort of xx patients is contained within a re-designed database implementation that is both robust and user friendly. The data collected consist of records and metrics (e.g. age, disease severity, diagnostic test results), forming a data-rich source for the study of SC within East African populations. As these data are stored in a well-structured manner, custom queries can be designed to generate the data required for a SPINE according to the various layers of data to be integrated. The aim of such a system would be to give clinicians access to the medical histories of patients in terms of their visit types, previous diagnoses and tests (more examples of records to be used?).

  • For a SPINE system to be created, the following types of data (examples) need to be captured and integrated.
  • Storage of the captured data is proposed to be MySQL tables.
  • Access to the data is to be secure, using key word / login and a security protocol, e.g. HTTPS / Kerberos key-chain login.
  • If delivered as a web service, a web portal has to be developed.
  • Methods used: PHP scripting, MySQL.
  • Outcomes: the ability to generate reports based on the data captured, e.g. patient history for vaccinations / diagnostic tests.
  • The intended users are clinicians, who will access the data via the web service.
  • Must be easy for clinicians to use (e.g. would a clinician learn how to make a MySQL query?).
  • Pre-determined reporting templates for generating reports.
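
A pre-determined reporting template of the kind listed above might look like the following sketch, which formats rows (as they might come back from the MySQL backend) into a patient-history report so that clinicians never write MySQL themselves. The field names and values are invented for the example.

```python
# Invented rows, standing in for what a MySQL query would return
# for one patient's visits.
visits = [
    {"date": "2013-01-10", "type": "admission", "diagnosis": "vaso-occlusive crisis"},
    {"date": "2013-02-02", "type": "follow-up", "diagnosis": "stable"},
]

def patient_history_report(patient_id, visits):
    """Fill a fixed patient-history template: one header line,
    then one chronologically ordered line per visit."""
    lines = [f"Patient {patient_id} - visit history"]
    for v in sorted(visits, key=lambda v: v["date"]):
        lines.append(f'{v["date"]}: {v["type"]} - {v["diagnosis"]}')
    return "\n".join(lines)

report = patient_history_report("TZ-0001", visits)
print(report)
```

The clinician only ever selects a patient and a report type; the template and the underlying query stay fixed.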

Immuno-informatics and nanoantibody binders: Sequence and structural analysis of VHH sequences

Collaborating Nodes: RUBi and IPT

Scorpion envenoming and its treatment is a public health problem in several parts of the world, and especially in Africa. This project highlights some of the most pressing tasks that need to be undertaken to confront this frequent medical emergency, responsible for more than 100,000 cases of envenoming and more than one hundred deaths per year. The toxicity of scorpion venoms is essentially due to the presence of small toxins (7 kDa MW) that act on the voltage-gated sodium channels of excitable cells. Polymorphism of the secreted toxins at the individual as well as the species level complicates treatment with polyclonal antibodies. Treatment using Nanobodies (recombinant single-domain antigen-binding fragments) offers special advantages in therapy over classic antibody fragments due to their robustness and smaller size, matching the size of the scorpion toxins. The sequences of Nb fragments correspond to the variable domain VH of camel-specific heavy-chain-only antibodies.

The aim of this project is to analyse two sets of sequences encoding the VH variable domains of Androctonus australis hector (Aah) scorpion toxin-specific antibody fragments. An essential feature of antibodies is an extremely specific combining site that recognizes and binds the target antigen (i.e. the antigenic determinants of scorpion toxin binders). This site is located in the variable region (VH) of the immunoglobulin (Ig). Camelid variable domains of heavy-chain immunoglobulins G (HC-IgGs) differ from human VHs by the conserved hydrophobic framework-2 residues (FR2 Val42, Gly49, Leu50 and Trp52) and a longer CDR3 loop that is often tethered to the CDR1 via an interloop disulfide bond. Peptides encoding VHH domains are commonly called Nanobodies.
Depending on the antigen, the antibody determinant is differentially matured, leading to a huge diversity of antibody sequences with different canonical (3D) structures. In this study we will focus on scorpion toxin binders previously selected by biopanning of a VHH phage display library.
Here we plan to exploit recently reported molecular sequences (Ben Abderrazek et al., 2009; Hmila et al., 2010) using structural bioinformatics methods, with the main objective of more rationally designing the interface of toxin/Nanobody interactions. We will start by studying correlations between the bioinformatics data and the elements crucial to their biological activities.
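
As a small illustration of how the hallmark FR2 positions quoted above could be used in sequence analysis, the sketch below flags a variable-domain sequence as a candidate VHH when it deviates from the conserved hydrophobic residues at positions 42, 49, 50 and 52. The numbering follows the text; the toy sequences and the two-mismatch threshold are invented for the example.

```python
# FR2 hallmark residues of a conventional human VH, at the positions
# quoted in the text (1-based numbering over the aligned domain).
HUMAN_FR2 = {42: "V", 49: "G", 50: "L", 52: "W"}

def looks_like_vhh(seq):
    """Flag a candidate VHH when >= 2 hallmark positions deviate
    from the conserved hydrophobic human-VH residues (toy rule)."""
    mismatches = [p for p, aa in HUMAN_FR2.items() if seq[p - 1] != aa]
    return len(mismatches) >= 2

# Invented toy sequences: 'X' is filler; only the four FR2 positions matter.
human_vh = "X" * 41 + "V" + "X" * 6 + "GL" + "X" + "W" + "X" * 60
camelid  = "X" * 41 + "F" + "X" * 6 + "ER" + "X" + "G" + "X" * 60

print(looks_like_vhh(human_vh), looks_like_vhh(camelid))  # False True
```

A real screen would of course work on numbered alignments of the two sequence sets rather than on fixed string positions.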

Integration of multiple biomedical data sources for the study of phenotypic data and identification of candidate disease genes

Collaborating Nodes: MUHAS and IPT

We propose to begin studying phenotypes by analysing public microarray expression data related to SCD with statistical tools (R packages, for example) in order to identify candidate genes that could affect HbF concentration.
Moreover, combining many data types, such as regulation data, protein-protein interactions and epigenetic data, could be very useful for the identification of candidate genes. GWA studies will also be used to identify associations between SNPs and HbF. We will set up a pipeline integrating all this information and these tools, with the aim of highlighting the disease mechanisms and helping with the diagnosis of SCD. This pipeline can also be used for studying other diseases.
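
A minimal sketch of the expression-based screen, with invented expression values for two patient groups (low vs high HbF): genes are ranked by the absolute Welch t-statistic between the groups, standing in for the R-based analysis described above.

```python
from statistics import mean, stdev

# Invented toy data: gene -> (expression in low-HbF patients,
#                             expression in high-HbF patients).
expression = {
    "BCL11A": ([2.1, 2.3, 2.0], [4.8, 5.1, 4.9]),
    "GAPDH":  ([7.0, 7.1, 6.9], [7.0, 7.2, 6.8]),
}

def welch_t(a, b):
    """Welch t-statistic for two independent samples."""
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (mean(b) - mean(a)) / ((va / len(a) + vb / len(b)) ** 0.5)

# Rank genes by the magnitude of the between-group difference.
ranked = sorted(expression, key=lambda g: abs(welch_t(*expression[g])),
                reverse=True)
print(ranked[0])  # BCL11A
```

In the actual pipeline this per-gene test would be replaced by a proper differential-expression method with multiple-testing correction, and the ranked list intersected with the regulation, interaction and GWAS layers.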

Human metabolic network modelling: contextualization of high-throughput data from H3Africa projects in pathological and drug-treated states, and simulation of these states

Collaborating Nodes: CUBRe and CPGR

Briefly, for P. falciparum, in Fatumo et al. [1-3] a computational method, Choke Point Analysis (CPA), investigating the topology of biochemical metabolic networks was developed to mine new viable enzymatic drug targets in the most deadly malaria parasite, Plasmodium falciparum. Initial drug screening against the predicted drug targets, in in-vitro antiplasmodial assay experiments, has been performed in Prof. Michael Lanzer's laboratory at the University of Heidelberg, Germany. The results have been successful for some of the enzymatic sites computationally determined by Fatumo et al., demonstrating that the computational method works: the predicted sites on the malaria parasite proved to be effective as drug targets. The work is currently accepted with major revision by Infection, Genetics and Evolution (IGE) and has recently been re-submitted after implementing the reviewers' comments: testing of 6-diazo-5-oxonorleucine efficacy and toxicity, also in a wild strain of P. falciparum. These experiments were performed by colleagues in the Biological Sciences Department at Covenant University. This work may produce novel antimalarial drugs whose biological mode of action can be determined accurately, and provides another antimalarial drug target site upon which a viable structural design pipeline is currently being built by a postdoc from CUBRe with co-workers in Prof. Schlitzer's lab at Marburg University, Germany.

One important resource for P. falciparum that has proved useful in our previous work is its biochemical metabolic network. In work undertaken by Prof. Ezekiel Adebiyi and Dr Olubanke Ogunlana of the Department of Biological Sciences at Covenant University, a first version of a biochemical metabolic network for A. gambiae, AnoCyc, has also been developed and deployed under the www.bioCyc.org databases. Based on this network, we are currently pursuing a computational analysis of Anopheles gambiae metabolism to facilitate insecticidal target and resistance mechanism discovery.

Using the experience from the work above, in this project we will conduct computational analyses, using several human metabolic networks (HMNs) for the affected human cells/tissues, for the 8 diseases of interest in H3Africa: type 2 diabetes, kidney disease, tuberculosis, cancer, rheumatoid heart disease, cardiometabolic disease, trypanosomiasis and schizophrenia. Currently, derived HMNs exist for the following tissues/cells: myocyte, hepatocyte, adipocyte, renal cell, alveolar macrophage, cardiomyocyte and the brain. We will produce derived HMNs for the remaining tissues/cells. These networks provide us with platforms to integrate the expected H3Africa high-throughput data, for the contextualization of these data from pathological and drug-treated states and the simulation of these states. We have also gathered previously generated high-throughput data and plan a pilot study of this proposal for a selected disease. Our overall results will provide leads for the development of diagnosis and treatment plans for these 8 diseases. For our anticipated diagnosis plans, we plan to port these to a hand-held device.
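
A minimal sketch of the choke-point idea behind CPA, on an invented toy network: a reaction is flagged when it is the sole consumer or sole producer of some metabolite. This is only the basic topological criterion, not the full published method, and all reaction/metabolite names are made up.

```python
# Toy metabolic network: reaction -> metabolites consumed/produced.
# Names are invented; R2 and R3 are alternative routes from B to C.
reactions = {
    "R1": {"consumes": ["A"], "produces": ["B"]},
    "R2": {"consumes": ["B"], "produces": ["C"]},
    "R3": {"consumes": ["B"], "produces": ["C"]},
    "R4": {"consumes": ["C"], "produces": ["D"]},
}

def choke_points(reactions):
    """Reactions that are the unique consumer or unique producer
    of at least one metabolite (the basic choke-point criterion)."""
    consumers, producers = {}, {}
    for r, spec in reactions.items():
        for m in spec["consumes"]:
            consumers.setdefault(m, []).append(r)
        for m in spec["produces"]:
            producers.setdefault(m, []).append(r)
    return {rs[0] for rs in list(consumers.values()) + list(producers.values())
            if len(rs) == 1}

print(sorted(choke_points(reactions)))  # ['R1', 'R4']
```

R2 and R3 are not choke points because the cell can route B-to-C flux around either of them; inhibiting R1 or R4 would cut the pathway, which is why such reactions are attractive drug-target candidates.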


  1. Fatumo, S., Plaisma, K., Mallm, J-P., Schramm, G., Adebiyi, E., Oswald, M., Eils, R. and Koenig, R. Infection, Genetics and Evolution, 9(3), 351-358, 2009.
  2. Fatumo, S., Kitiporn, P., Adebiyi, E. and Koenig, R. Infection, Genetics and Evolution, 2010 Sep 7 (Epub ahead of print).
  3. Fatumo, S., Adebiyi, E., Schramm, G., Eils, R. and Koenig, R. IACSIT, IEEE Computer Society Press, Vol. 17, 564-569, 2009.