Discovery of stem cell specific genes using GENEVESTIGATOR

Frank Staubli, Gaelle Messerli and Philip Zimmermann
© NEBION AG. Last updated: December 22, 2015


GENEVESTIGATOR is a search engine for gene expression. It allows mining thousands of experiments simultaneously to identify genes having a very specific profile (e.g. targets, biomarkers). In this example, we searched for genes most specifically expressed in stem cells as compared to over 1400 other tissues, cell types and cell lines. We identified 42 distinct transcripts, of which several are well known in stem cell research but many others are unknown or have never been associated with stem cells. Interestingly, the list of transcripts not only contains protein-coding genes, but also other transcripts such as lncRNAs. We show how GENEVESTIGATOR allows researchers to effectively identify genes highly specific for a cell type and to study the regulation of these genes in response to perturbations. This study also illustrates how to identify other genes co-regulated with selected targets of interest across a set of relevant conditions.


To achieve this, we used GENEVESTIGATOR (Hruz et al., 2008) and selected a compendium of 26,982 samples profiled on the Affymetrix Human 133 Plus 2 platform. Then we used the Cell Lines tool from the GENE SEARCH toolset and chose all embryonic stem cell lines as target, and all other cell lines and normal tissues (including primary cells) as base. We then searched for the top 50 genes (see figure below).

Figure 1. Identification of the top 50 Affymetrix probesets showing most specific expression in embryonic stem cells. Although only embryonic stem cells were chosen as target, the pathological stem cell lines appeared as the top co-expressing categories. The image was stretched horizontally to better visualize the heatmap.
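The target-versus-base search described above can be illustrated with a minimal Python sketch. The scoring that GENEVESTIGATOR uses internally is not described in this text, so the sketch assumes a simple stand-in score: the mean expression in the target categories minus the highest mean found in any base category. The function names, probe names and expression values are all hypothetical.

```python
from statistics import mean

def specificity_scores(expression, target, base):
    """Score each probe by how far its target expression exceeds every base category.

    expression: {probe: {category: [expression values]}}
    Assumed score (not GENEVESTIGATOR's own): mean over all target samples
    minus the highest per-category mean among the base categories.
    """
    scores = {}
    for probe, by_category in expression.items():
        target_mean = mean(v for cat in target for v in by_category[cat])
        highest_base = max(mean(by_category[cat]) for cat in base)
        scores[probe] = target_mean - highest_base
    return scores

def top_specific(expression, target, base, n):
    """Return the n probes with the highest specificity score."""
    scores = specificity_scores(expression, target, base)
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Hypothetical log2 expression values for three probes in three categories
toy = {
    "probe_A": {"ES cell line": [12.1, 11.8], "liver": [3.0, 3.2], "HeLa": [2.9, 3.1]},
    "probe_B": {"ES cell line": [10.0, 10.2], "liver": [9.8, 9.9], "HeLa": [4.0, 4.1]},
    "probe_C": {"ES cell line": [5.0, 5.1], "liver": [5.2, 5.0], "HeLa": [5.1, 4.9]},
}
ranked = top_specific(toy, target=["ES cell line"], base=["liver", "HeLa"], n=2)
print(ranked)
```

Comparing against the highest base category, rather than the average of all base samples, penalizes a probe such as probe_B that is also strongly expressed in a single non-target tissue.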

Among the top 50 genes identified, several have previously been specifically associated with stem cells, for example:
  • DPPA4: developmental pluripotency associated 4 (rank 1) (Maldonado-Saldivia et al., 2007)
  • ESRG: embryonic stem cell related (rank 2) (Li et al., 2013)
  • L1TD1: LINE-1 type transposase domain containing 1 (rank 3) (Wong et al., 2011)
  • TDGF1: teratocarcinoma-derived growth factor 1 (rank 5) (Baldassarre et al., 1997)
  • LIN28A: lin-28 homolog A (rank 7) (Zhu et al., 2010)
About half of the genes identified have not yet been associated with stem cell development or pluripotency in the literature, and about one fourth have not yet been characterized at all. Removing redundant transcripts yielded a list of 42 unique transcripts. This list (download Excel file) therefore provides novel candidates for stem cell research.


Interestingly, the list of top 42 transcripts is quite heterogeneous in terms of regulation and is composed of various clusters. Nevertheless, four conditions were identified that consistently regulate all top 25 stem cell specific genes. Each of these conditions is related to stem cell differentiation.

Figure 2. Selection of perturbations significantly regulating gene DPPA4 (screened from 3032 experimental conditions using fold-change > 2 and p < 0.01, resulting in 4 conditions). Representation of expression of the top 42 stem cell specific genes across this set of conditions. Color scale: Red represents up-regulated, green represents down-regulated.
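The perturbation filter described in the caption (fold-change > 2 and p < 0.01) can be sketched as follows. This is a minimal illustration, assuming that each condition is summarized by a log2 ratio and a p-value and that the fold-change threshold applies in both directions; the function name, condition names and values are hypothetical.

```python
import math

def significant_conditions(results, min_fold_change=2.0, max_p=0.01):
    """Return the conditions in which a gene changes by more than min_fold_change
    (up or down) with a p-value below max_p.

    results: {condition: (log2_ratio, p_value)} for a single gene.
    """
    # A fold-change of 2 corresponds to |log2 ratio| > 1
    min_abs_log2 = math.log2(min_fold_change)
    return [
        name
        for name, (log2_ratio, p_value) in results.items()
        if abs(log2_ratio) > min_abs_log2 and p_value < max_p
    ]

# Hypothetical results for one gene across four conditions
dppa4_toy = {
    "differentiation study 1": (-3.2, 0.0004),
    "differentiation study 2": (-1.6, 0.008),
    "drug treatment": (0.4, 0.30),   # small, non-significant change
    "hypoxia": (1.5, 0.04),          # large change, but p is above 0.01
}
selected = significant_conditions(dppa4_toy)
print(selected)
```

Relaxing the thresholds, for example `significant_conditions(dppa4_toy, 1.5, 0.05)`, widens the set of conditions in the same way.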


From the 3032 experimental conditions tested, fewer than 100 caused significant changes of expression in any of the stem cell specific genes. To identify patterns of co-regulation, we created a new data matrix containing only conditions that significantly regulate the 42 genes and ran a biclustering analysis. Selected biclusters are shown below (Figure 3) and represent groups of genes that are locally co-regulated. In other words, they represent groups of genes that are co-responsive to a subset of conditions, irrespective of how they respond to other conditions. Earlier studies have shown that genes from a bicluster are often co-regulated by at least one common regulator.

Figure 3. Example biclusters of the top 42 genes against relevant conditions. Each bicluster is a set of genes potentially co-regulated across the conditions selected by that bicluster.
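The idea of local co-regulation can be made concrete with a small Python sketch. It does not implement the biclustering algorithm used by the tool, which is not named in this text; it only checks the definition given above for one candidate subset of conditions. The threshold, function name, gene names and values are assumptions chosen for illustration.

```python
def genes_in_bicluster(matrix, conditions, min_abs_log2=1.0):
    """Group genes that respond in one direction across every listed condition.

    matrix: {gene: {condition: log2_ratio}}
    A gene is 'down' (or 'up') if its log2 ratio is at most -min_abs_log2
    (or at least +min_abs_log2) in all `conditions`. Values under any other
    condition are deliberately ignored.
    """
    members = {"up": [], "down": []}
    for gene, row in matrix.items():
        values = [row[c] for c in conditions]
        if all(v >= min_abs_log2 for v in values):
            members["up"].append(gene)
        elif all(v <= -min_abs_log2 for v in values):
            members["down"].append(gene)
    return members

# Hypothetical log2 ratios for three genes under three conditions
toy_matrix = {
    "gene_1": {"diff_A": -3.0, "diff_B": -2.5, "stress": 0.1},
    "gene_2": {"diff_A": -2.2, "diff_B": -1.8, "stress": 2.4},
    "gene_3": {"diff_A": -2.0, "diff_B": 0.2, "stress": 0.0},
}
bicluster = genes_in_bicluster(toy_matrix, ["diff_A", "diff_B"])
print(bicluster)
```

Here gene_1 and gene_2 fall into the same group even though they differ under the stress condition, which is what distinguishes a bicluster from a cluster computed over all conditions.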


Biclusters may overlap across both genes and conditions and cannot all be represented at once. To see overall patterns of expression we ran a two-way hierarchical clustering (Figure 4) of the same matrix as used for the biclustering analysis. The two main determinants of this clustering are stem cell differentiation studies. The cluster tree marked in red indicates genes that were non-responsive to almost all 3032 conditions, except for the stem cell differentiation studies where they are strongly down-regulated. These genes/probes are: 216319_at, LECT1, ZSCAN10, 240987_at, 237275_at (lncRNA, LINC00458), 1569023_a_at, 237911_at, 237193_s_at, 237192_at, ESRG. Most of the transcripts from this cluster are uncharacterized, with only three of them representing known protein-coding genes and one of them being a known lncRNA transcript.

Figure 4. Hierarchical clustering of stem cell specific genes across relevant conditions.
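A two-way hierarchical clustering of this kind can be sketched in plain Python by clustering the rows (genes) and the columns (conditions) of the same matrix separately. The distance measure and linkage used for Figure 4 are not stated in this text; the sketch assumes Euclidean distance with average linkage, and the gene names, condition names and values are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def leaf_order(labels, vectors):
    """Average-linkage agglomerative clustering; returns labels in merge (leaf) order."""
    # Each cluster is a list of member indices, kept in leaf order
    clusters = [[i] for i in range(len(labels))]

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters
        total = sum(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them
        a, b = min(
            ((p, q) for p in range(len(clusters)) for q in range(p + 1, len(clusters))),
            key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return [labels[i] for i in clusters[0]]

def two_way_order(genes, conditions, matrix):
    """matrix[g][c] is the log2 ratio of gene g under condition c."""
    columns = [list(col) for col in zip(*matrix)]  # transpose to cluster conditions
    return leaf_order(genes, matrix), leaf_order(conditions, columns)

# Hypothetical log2 ratios: rows are genes, columns are conditions
genes = ["g1", "g2", "g3", "g4"]
conditions = ["diff_A", "diff_B", "stress"]
matrix = [
    [-3.0, -2.8, 0.1],
    [0.2, 0.1, 2.0],
    [-2.9, -3.1, 0.0],
    [0.0, 0.3, 2.2],
]
gene_order, condition_order = two_way_order(genes, conditions, matrix)
print(gene_order, condition_order)
```

In the toy data, the genes that are down-regulated only under the two differentiation conditions end up next to each other, and the two differentiation conditions are grouped apart from the stress condition.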


The biclusters described above were obtained only from the list of 42 candidates across the significant set of conditions. To explore the space of transcripts beyond this list, we performed a co-expression analysis of one example gene, ESRG, across conditions relevant for this gene (i.e. conditions causing a change of expression with fold-change > 1.5 and p < 0.05). The figure below shows the top 100 transcripts identified and reveals two major clusters of genes associated with our target ESRG (marked in red and in yellow). The group of genes marked in red overlaps largely with the cluster marked in red in Figure 4, whereas the group marked in yellow contains immune-responsive and other genes. Between the two (in blue), and connected to both the red and yellow groups, are four POU class homeobox genes. Similar investigations can be done for each of the 42 candidate genes (not shown).

Figure 5. Co-expression analysis of gene ESRG across perturbations that were significant for this gene (41 perturbations, selected from 3032 perturbations).
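The co-expression step can be sketched as a correlation ranking: each transcript's response profile across the selected perturbations is compared with that of the target gene. The similarity measure used by GENEVESTIGATOR is not stated in this text; the sketch assumes the Pearson correlation coefficient, and the function names, probe names and values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant profile: correlation is undefined, treated as 0 here
    return cov / (sx * sy)

def top_coexpressed(profiles, target, n):
    """Rank all other transcripts by correlation with the target profile."""
    ref = profiles[target]
    scores = {g: pearson(ref, p) for g, p in profiles.items() if g != target}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Hypothetical log2 ratios across four perturbations significant for the target
profiles = {
    "ESRG": [-3.0, -2.5, 0.2, 0.1],
    "probe_X": [-2.8, -2.6, 0.0, 0.3],   # follows the target closely
    "probe_Y": [0.5, 0.4, -1.9, -2.2],   # responds in the opposite direction
    "probe_Z": [1.0, 1.0, 1.0, 1.0],     # does not respond at all
}
ranking = top_coexpressed(profiles, "ESRG", n=3)
print(ranking)
```

Restricting the profiles to perturbations that are significant for the target keeps the ranking from being dominated by the many conditions in which the target does not respond.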


All results and figures shown in the described analysis were obtained using the GENEVESTIGATOR interface, without requiring specific knowledge of bioinformatics or command-line experience. By screening across 1655 different tissues, primary cells and cell lines, we could identify genes or transcripts specifically associated with stem cells. While several genes known to be involved in stem cell differentiation appeared in these results, the results also revealed uncharacterized transcripts. Moreover, we were able to identify clusters of potentially co-regulated genes. In fact, while all stem cell specific genes possessed the same profile by tissue/cell type, subsets of genes had very distinct patterns of regulation across perturbations. Working with GENEVESTIGATOR allowed us to exploit over 25,000 profiled samples, to export figures and data, and also to store the analysis workspace for later re-analysis. Similar types of analyses are possible to identify genes specifically regulated by chosen conditions, diseases or tissue types. While this analysis was carried out on datasets from the Affymetrix Human 133 Plus 2 platform, similar types of queries can be done on other microarray or RNA-seq platforms, including other data types such as miRNA. To try it out yourself, create a personal user account here.


Hruz T, Laule O, Szabo G, Wessendorp F, Bleuler S, Oertle L, Widmayer P, Gruissem W, Zimmermann P (2008) Genevestigator V3: a reference expression database for the meta-analysis of transcriptomes. Advances in Bioinformatics 2008, 420747.

Maldonado-Saldivia J, van den Bergen J, Krouskos M, Gilchrist M, Lee C, Li R, Sinclair AH, Surani MA, Western PS (2007) Dppa2 and Dppa4 are closely linked SAP motif genes restricted to pluripotent cells and the germ line. Stem Cells. 2007 Jan;25(1):19-28.

Li G, Ren C, Shi J, Huang W, Liu H, Feng X, Liu W, Zhu B, Zhang C, Wang L, Yao K, Jiang X (2013) Identification, expression and subcellular localization of ESRG. Biochem Biophys Res Commun. 2013 May 24;435(1):160-4.

Wong RC, Ibrahim A, Fong H, Thompson N, Lock LF, Donovan PJ (2011) L1TD1 is a marker for undifferentiated human embryonic stem cells. PLoS One. 2011 Apr 29;6(4):e19355.

Baldassarre G, Romano A, Armenante F, Rambaldi M, Paoletti I, Sandomenico C, Pepe S, Staibano S, Salvatore G, De Rosa G, Persico MG, Viglietto G (1997) Expression of teratocarcinoma-derived growth factor-1 (TDGF-1) in testis germ cell tumors and its effects on growth and differentiation of embryonal carcinoma cell line NTERA2/D1. Oncogene. 1997 Aug 18;15(8):927-36.

Zhu H, Shah S, Shyh-Chang N, Shinoda G, Einhorn WS, Viswanathan SR, Takeuchi A, Grasemann C, Rinn JL, Lopez MF, Hirschhorn JN, Palmert MR, Daley GQ (2010) Lin28a transgenic mice manifest size and puberty phenotypes identified in human genetic association studies. Nat Genet. 2010 Jul;42(7):626-30.