TCGA Data Mining

The goal of this project is to establish new data mining methods for identifying the molecular signatures that drive clinical cancer phenotypes. We will do this using large public molecular profiling cancer datasets (“Big Data”) maintained locally by TRON, data mining techniques developed in the JGU Mainz Computer Science department, and the high-performance computing tools and diagnostics developed and run by the ZDV.

The data mining proposed in this project is divided into two main stages: first, generating candidate genes or sets of candidate genes; and second, querying biological databases automatically and semi-automatically to derive explanations or hypotheses in light of existing biological knowledge. In the first stage, we will fit a large number of Cox regression models, starting with single-variable models and then heuristically combining the most relevant variables into two-variable models, and so forth, in the style of the classical Apriori algorithm by Agrawal and Srikant for frequent itemsets. In the second stage, we will use discriminative machine learning methods to predict the outcome after a fixed period of time.
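The level-wise search over Cox models described above can be illustrated with a minimal sketch. It shows only the candidate generation logic: score all single-variable sets, keep the best ones, and combine survivors into the next larger sets. The function name `levelwise_search`, the parameters `keep` and `max_size`, the gene names, and the toy scoring function are all assumptions made for illustration. In a real run, `score` would presumably wrap a Cox regression fit and return a measure of model relevance; no Cox model is fitted here.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, Iterable


def levelwise_search(
    variables: Iterable[str],
    score: Callable[[FrozenSet[str]], float],
    keep: int = 3,
    max_size: int = 2,
) -> Dict[FrozenSet[str], float]:
    """Score variable sets level by level, in the style of Apriori.

    `score` is a hypothetical callable; higher values mean a more
    relevant model. `keep` is the number of sets retained per level.
    """
    results: Dict[FrozenSet[str], float] = {}
    level = [frozenset([v]) for v in variables]  # level 1: single variables

    for size in range(1, max_size + 1):
        if not level:
            break
        # Score every candidate set at the current level.
        scored = {s: score(s) for s in level}
        results.update(scored)

        # Heuristic selection: keep only the best-scoring sets.
        survivors = sorted(scored, key=scored.get, reverse=True)[:keep]
        survivor_set = set(survivors)

        # Join step: merge pairs of survivors into sets one element larger.
        candidates = set()
        for a, b in combinations(survivors, 2):
            union = a | b
            if len(union) != size + 1:
                continue
            # Prune step: every subset of the current size must have survived.
            if all(frozenset(sub) in survivor_set
                   for sub in combinations(union, size)):
                candidates.add(union)
        level = sorted(candidates, key=sorted)  # deterministic order

    return results


# Toy stand-in for a Cox-based relevance score (assumption, not real data).
TOY_WEIGHTS = {"gene_a": 3.0, "gene_b": 2.0, "gene_c": 1.0, "gene_d": 0.1}


def toy_score(variable_set: FrozenSet[str]) -> float:
    return sum(TOY_WEIGHTS[v] for v in variable_set)


results = levelwise_search(TOY_WEIGHTS, toy_score, keep=3, max_size=2)
```

With these toy weights, `gene_d` is scored as a single variable but is dropped before the two-variable level, so no two-variable model containing it is ever evaluated. This pruning is what keeps the number of fitted models manageable as the set size grows.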