Integrated Center for Oncology

Breast Cancer Gene-Expression Miner v4.5
(bc-GenExMiner v4.5)



Glossary




Published annotated data:

The following inclusion criteria for selection of transcriptomic data were used:
- invasive carcinomas,
- tumour macrodissection (no microdissection, no biopsy),
- no neoadjuvant therapy before tumour collection,
- minimum number of patients: 35,
- no duplicate sample inside and between datasets, filtering:
. by sample ID and,
. by a threshold of Pearson correlation < 0.99 which is used to avoid duplicate data,
- female breast cancer.
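
The two duplicate filters listed above (sample ID, then pairwise Pearson correlation) can be illustrated with a short Python sketch. This is not the authors' code: the function names `pearson` and `find_duplicates`, the dictionary layout, and the reading that a pair of samples with r at or above 0.99 is flagged as a probable duplicate are all assumptions made for illustration.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def find_duplicates(samples, threshold=0.99):
    """Return pairs of sample IDs whose profiles correlate at or above threshold.

    `samples` maps a unique sample ID to its expression profile
    (same genes, same order). Using a dict already enforces unique IDs.
    """
    flagged = []
    for (id_a, prof_a), (id_b, prof_b) in combinations(samples.items(), 2):
        if pearson(prof_a, prof_b) >= threshold:
            flagged.append((id_a, id_b))
    return flagged

# Made-up profiles: S2 is a near-copy of S1, S3 is unrelated.
samples = {
    "S1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "S2": [1.1, 2.0, 3.1, 4.0, 5.1],
    "S3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
duplicates = find_duplicates(samples)
print(duplicates)
```

Comparing every pair is quadratic in the number of samples, which is acceptable for a sketch but would probably need a vectorised implementation at the scale of the pooled cohorts.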


bc-GenExMiner version: v4.5
Data type shown: Microarray

Columns: Ver. (1) | Reference | No. patients | ER (2) | PR (2) | HER2 (2) | Nodal status | Histo. type (3) | SBR | NPI | AOL | Age diagn. (4) | P53: IHC, seq, GES | SSPs | SCMs | Event status: DMFS, OS, DFS
Healthy
--0-----------------
Total for healthy: 0
Tumour-adjacent
--0-----------------
Total for tumour-adjacent: 0
Tumour
1.0Van de Vijver et al.
2002
29529541-295-4140b-41---29529510179122
1.0Sotiriou et al.
2003
9999--999999999999---9099304553
1.0Ma et al.
2004
59595955525959525259---595927
1.0Minn et al.
2005
8282827682----82---82822727
1.0Pawitan et al.
2005
159159a----147-----159159159404050
1.0Wang et al.
2005
286286--286--------286286107107
1.0Weigelt et al.
2005
5050--50-5021b-50---5050131013
1.0Bild et al.
2006
158158a-----------1581585050
1.0Chin et al.
2006
11211211278112-10746b-11280--99112213542
1.0Ivshina et al.
2006
249245--240-249159b-249247-24924924989
1.0Desmedt et al.
2007
198198--198184196196196198---198198625691
1.0Loi et al.
2007
26726187-261-208123b-267--2672662676688
1.0Minn et al.
2007
5858-----------58581111
1.0Naderi et al.
2007
135133--129-134129128135---1271354765
1.0Anders et al.
2008
757170-757464646175---75751414
1.0Chanrion et al.
2008
151139139-146139144134124151---139151464155
1.0Loi et al.
2008
777777-77-5830b-77--7777771013
1.0Calabrò et al.
2009
13913613649103----139---1161396396
1.0Jézéquel et al.
2009
252239236203252-252252252252-----654768
1.1Schmidt et al.
2008
200200a--200-200200-----2002004646
1.1Zhang et al.
2009
136136136-136--------1361362020
3.1Chin et al.
2007
171170--170-170170-171---152171385756
3.1Zhou et al.
2007
5454--54----54---545499
3.1Desmedt et al.
2009
5555554555-55--55--55555555
3.1Jönsson et al.
2010
346335332---226------346-151151
3.1Li et al.
2010
11511511511511510311564b-115--1151151151414
3.1Sircoulomb et al.
2010
55474737453347--4929-5555551717
3.1Buffa et al.
2011
216216--216-191191-216---2162168282
3.1Dedeurwaerder et al.
2011
8584-8585858529b-85--85858536
3.1Filipits et al.
2011
277277-277---------2762775858
3.1Hatzis et al.
2011
309304303309309-286--309---3093096565
3.1Kao et al.
2011
296296a--296----296--296296296636273
3.1Sabatier et al.
2011
239237237224233211233--238175-23923923974
3.1Wang et al.
2011
149149149149148147149148-149---14914910
3.1Kuo et al.
2012
51515151515147--51---51511212
3.1Nagalla et al.
2013
4140383941-39393641---4141141014
4.3expO et al.
2005
29821020919825728925239b-298--298298298
4.3Yau et al.
2007
474747-43----47---4747
4.3Parris et al.
2010
9494949494807575-93---93944445
4.3Symmans et al.
2010
4343--42--------43437171
4.3Heikkinen et al.
2011
174------------172174342734
4.3Sabatier et al.
2011
7171711926----44--717171
4.3Curtis et al.
2012
1 9801 9371 9801 9801 9801 8301 8921 875-1 980-1 980-1 9781 9806021 1431 235
4.3Guedj et al.
2012
536515514390438427517--523239-536536536119119
4.3Servant et al.
2012
343---337318339--34397--343343119119
4.3Clarke et al.
2013
104101--104-10445b-104--104104104483548
4.3Larsen et al.
2013
183183183183-169157--183---182183
4.3Castagnoli et al.
2014
5353535353-53--53---53532323
4.3Fumagalli et al.
2014
5656565654525654-55--565656
4.3Merdad et al.
2014
45383838-4038--45---4545
4.3Terunuma et al.
2014
5555121255-4824b-5555--55551919
4.3Burstein et al.
2015
66666649-6447--63--666666
4.3Michaut et al.
2016
104889289104959683-103---104104262032
4.3Biermann et al.
2017
5352525342----53---5353
4.5Bos et al.
2009
204565656-----------
4.5Silver et al.
2010
75353535-----------
4.5Burstein et al.
2015
198198198198-----------
4.5Jézéquel et al.
2015
107107107107-----------
4.5Jézéquel et al.
2019
131131131131-----------
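Total for tumour: 10 716
Totals by column: No. patients 10 716 | ER 9 759 | PR 6 496 | HER2 5 533 | Nodal status 8 240 | Histo. type 4 549 | SBR 7 325 | NPI 4 381 | AOL 948 | Age diagn. 7 857 | P53 IHC 922 | P53 seq 1 980 | P53 GES 2 728 | SSPs 9 657 | SCMs 9 403 | DMFS 2 093 | OS 2 081 | DFS 3 618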

  • 1 Version of bc-GenExMiner webtool
  • 2 ER, PR and HER2 status determined by immunohistochemistry (IHC)
  • 3 Histological types
  • 4 Age at diagnosis
  • a ER status determined by means of genomics data (Affymetrix™ probe: 205225_at) in case of a lack of IHC data. See Kenn et al.
  • b NPI score could be computed only for node negative patients

Legend

 AOL:Adjuvant! Online
 ER:oestrogen receptor by IHC
 HER2:HER2 receptor by IHC
 IHC:ImmunoHistoChemistry
 MR:metastatic relapse
 No.:number of
 NPI:Nottingham prognostic index
 OS:overall survival (any pejorative event: local relapse, metastatic relapse or death.)
 PR:progesterone receptor by IHC
 SBR:Scarff Bloom and Richardson grade
 SCMs:Subtype Clustering Models (SCMOD1, SCMOD2, SCMGENE)
 seq:status sequence-based
 SSPs:Single Sample Predictors (Sorlie, Hu and PAM50)





Published genomic data:

The following inclusion criteria for selection of transcriptomic data were used:
- invasive carcinomas,
- tumour macrodissection (no microdissection, no biopsy),
- no neoadjuvant therapy before tumour collection,
- minimum number of patients: 35,
- no duplicate sample inside and between datasets, filtering:
. by sample ID and,
. by a threshold of Pearson correlation < 0.99 which is used to avoid duplicate data,
- female breast cancer.


Data type shown: Microarray

bc-GenExMiner version | # | First author | Year | No. patients | Study code | Platform origin | Platform code | DNA chip | No. unique genes (2019) | Processing *
1.0 | 1 | Van de Vijver et al. | 2002 | 295 | Rosetta2002 | Agilent | (no code) | 25k oligo custom | 14 853 | log2 ratio
1.0 | 2 | Sotiriou et al. | 2003 | 99 | PNAS1732912100 | NCI | (no code) | 8k cDNA custom | 4 345 | log2 ratio
1.0 | 3 | Ma et al. | 2004 | 59 | GSE1378 | Arcturus | GPL1223 | 22k oligo custom | 14 839 | log2 ratio
1.0 | 4 | Minn et al. | 2005 | 82 | GSE2603 | Affymetrix™ | GPL96 | HG-U133A | 12 262 | MAS5 and log2
1.0 | 5 | Pawitan et al. | 2005 | 159 | GSE1456 | Affymetrix™ | GPL96 - GPL97 | HG-U133A + B | 18 430 | MAS5 and log2
1.0 | 6 | Wang et al. | 2005 | 286 | GSE2034 | Affymetrix™ | GPL96 | HG-U133A | 12 262 | MAS5 and log2
1.0 | 7 | Weigelt et al. | 2005 | 50 | GSE2741 | Agilent | GPL1390 | Human 1A oligo UNC custom | 13 980 | log2 ratio
1.0 | 8 | Bild et al. | 2006 | 158 | GSE3143 | Affymetrix™ | GPL91 | HG-U95A v2 | 8 767 | MAS5 and log2
1.0 | 9 | Chin et al. | 2006 | 112 | E_TABM_158 | Affymetrix™ | A-AFFY-76 | HG-U133A v2 | 12 262 | MAS5 and log2
1.0 | 10 | Ivshina et al. | 2006 | 249 | GSE4922 | Affymetrix™ | GPL96 - GPL97 | HG-U133A + B | 18 430 | MAS5 and log2
1.0 | 11 | Desmedt et al. | 2007 | 198 | GSE7390 | Affymetrix™ | GPL96 | HG-U133A | 12 262 | MAS5 and log2
1.0 | 12 | Loi et al. | 2007 | 267 | GSE6532 | Affymetrix™ | GPL96 - GPL97 - GPL570 | HG U133A + B + P2 | 20 542 | MAS5 and log2
1.0 | 13 | Minn et al. | 2007 | 58 | GSE5327 | Affymetrix™ | GPL96 | HG-U133A | 12 262 | MAS5 and log2
1.0 | 14 | Naderi et al. | 2007 | 135 | E_UCON_1 | Agilent | A-AGIL-14 | Human 1A oligo G4110A | 14 258 | log2 ratio
1.0 | 15 | Anders et al. | 2008 | 75 | GSE7849 | Affymetrix™ | GPL91 | HG-U95A v2 | 8 767 | MAS5 and log2
1.0 | 16 | Chanrion et al. | 2008 | 151 | GSE9893 | MLRG | GPL5049 | Human 21k v12.0 | 15 014 | MAS5 and log2
1.0 | 17 | Loi et al. | 2008 | 77 | GSE9195 | Affymetrix™ | GPL570 | HG-U133P2 | 20 542 | MAS5 and log2
1.0 | 18 | Calabrò et al. | 2009 | 139 | GSE10510 | DKFZ | GPL6486 | 35k oligo | 17 807 | log2 ratio
1.0 | 19 | Jézéquel et al. | 2009 | 252 | GSE11264 | UMGC-IRCNA | GPL4819 | 9k cDNA custom | 1 808 | log2 ratio
1.1 | 20 | Schmidt et al. | 2008 | 200 | GSE11121 | Affymetrix™ | GPL96 | HG-U133A | 12 262 | MAS5 and log2
1.1 | 21 | Zhang et al. | 2009 | 136 | GSE12093 | Affymetrix™ | GPL96 | HG-U133A | 12 262 | MAS5 and log2
3.1 | 22 | Chin et al. | 2007 | 171 | GSE8757 | VUMC Microarray | GPL5737 | Human 30K 60-mer oligo array | 17 782 | log2 ratio
3.1 | 23 | Zhou et al. | 2007 | 54 | GSE7378 | Affymetrix™ | GPL96 | HG-U133A | 12 262 | MAS5 and log2
3.1 | 24 | Desmedt et al. | 2009 | 55 | GSE16391 | Affymetrix™ | GPL570 | HG-U133P2 | 20 542 | MAS5 and log2
3.1 | 25 | Jönsson et al. | 2010 | 346 | GSE22133 | SweGene | GPL5345 | H_v2.1.1 55K | 9 236 | log2 ratio
3.1 | 26 | Li et al. | 2010 | 115 | GSE19615 | Affymetrix™ | GPL570 | HG-U133P2 | 20 542 | MAS5 and log2
3.1 | 27 | Sircoulomb et al. | 2010 | 55 | GSE17907 | Affymetrix™ | GPL570 | HG-U133P2 | 20 542 | MAS5 and log2
3.1 | 28 | Buffa et al. | 2011 | 216 | GSE22219 | Illumina | GPL6098 | HumanRef-8 v1.0 expr-bc | 15 757 | log2 ratio
3.1 | 29 | Dedeurwaerder et al. | 2011 | 85 | GSE20711 | Affymetrix™ | GPL570 | HG-U133P2 | 20 542 | MAS5 and log2
3.1 | 30 | Filipits et al. | 2011 | 277 | GSE26971 | Affymetrix™ | GPL96 | HG-U133A | 12 262 | MAS5 and log2
3.1 | 31 | Hatzis et al. | 2011 | 309 | GSE25055 | Affymetrix™ | GPL96 | HG-U133A | 12 262 | MAS5 and log2
3.1 | 32 | Kao et al. | 2011 | 296 | GSE20685 | Affymetrix™ | GPL570 | HG-U133P2 | 20 542 | MAS5 and log2
3.1 | 33 | Sabatier et al. | 2011 | 239 | GSE21653 | Affymetrix™ | GPL570 | HG-U133P2 | 20 542 | MAS5 and log2
3.1 | 34 | Wang et al. | 2011 | 149 | GSE16987 | Illumina | GPL6104 | HumanRef-8 v2.0 expr-bc | 17 132 | log2 ratio
3.1 | 35 | Kuo et al. | 2012 | 51 | GSE33926 | Agilent | GPL7264 | Human 1A Microarray (V2) G4110B | 16 641 | log2 ratio
3.1 | 36 | Nagalla et al. | 2013 | 41 | GSE45255 | Affymetrix™ | GPL96 | HG-U133A | 12 262 | MAS5 and log2
4.3 | 37 | expO et al. | 2005 | 298 | GSE2109 | Affymetrix™ | GPL570 | HG-U133P2 | 20 542 | MAS5 and log2
4.3 | 38 | Yau et al. | 2007 | 47 | GSE8193 | Affymetrix™ | GPL96 | HG-U133A | 12 262 | MAS5 and log2
4.3 | 39 | Parris et al. | 2010 | 94 | GSE20462 | Illumina | GPL6947 | HumanHT-12 V3.0 | 19 016 | Quantile norm. and log2
4.3 | 40 | Symmans et al. | 2010 | 43 | GSE17705 | Affymetrix™ | GPL96 | HG-U133A | 12 262 | MAS5 and log2
4.3 | 41 | Heikkinen et al. | 2011 | 174 | GSE24450 | Illumina | GPL6947 | HumanHT-12 V3.0 | 19 016 | Quantile norm. and log2
4.3 | 42 | Sabatier et al. | 2011 | 71 | GSE31448 | Affymetrix™ | GPL570 | HG-U133P2 | 20 542 | MAS5 and log2
4.3 | 43 | Curtis et al. | 2012 | 1 980 | METABRIC | Illumina | GPL6947 | HumanHT-12 V3.0 | 18 025 | Quantile norm. and log2
4.3 | 44 | Guedj et al. | 2012 | 536 | E_MTAB_365 | Affymetrix™ | GPL570 | HG-U133P2 | 20 542 | MAS5 and log2
4.3 | 45 | Servant et al. | 2012 | 343 | GSE30682 | Illumina | GPL6884 | HumanWG-6 v3.0 | 19 016 | Quantile norm. and log2
4.3 | 46 | Clarke et al. | 2013 | 104 | GSE42568 | Affymetrix™ | GPL570 | HG-U133P2 | 20 542 | MAS5 and log2
4.3 | 47 | Larsen et al. | 2013 | 183 | GSE40115 | Agilent | GPL15931 | SurePrint G3 Human GE 8x60K | 20 118 | log2 ratio
4.3 | 48 | Castagnoli et al. | 2014 | 53 | GSE55348 | Illumina | GPL14951 | HumanHT-12 WG-DASL V4.0 R2 | 19 459 | Quantile norm. and log2
4.3 | 49 | Fumagalli et al. | 2014 | 56 | GSE43358 | Affymetrix™ | GPL570 | HG-U133P2 | 20 542 | MAS5 and log2
4.3 | 50 | Merdad et al. | 2014 | 45 | GSE36295 | Affymetrix™ | GPL6244 | Gene 1.0 ST | 20 251 | rma-gene-level
4.3 | 51 | Terunuma et al. | 2014 | 55 | GSE37751 | Affymetrix™ | GPL6244 | Gene 1.0 ST | 20 251 | rma-gene-level
4.3 | 52 | Burstein et al. | 2015 | 66 | GSE76274 | Affymetrix™ | GPL570 | HG-U133P2 | 20 542 | MAS5 and log2
4.3 | 53 | Michaut et al. | 2016 | 104 | GSE68057 | Agilent | GPL20078 | Agendia32627 DPv1.14 SCFGplus | 20 209 | Quantile norm. and log2
4.3 | 54 | Biermann et al. | 2017 | 53 | GSE97177 | Illumina | GPL6947 | HumanHT-12 V3.0 | 19 016 | Quantile norm. and log2
4.5 | 55 | Bos et al. | 2009 | 204 | GSE12276 | Affymetrix™ | GPL570 | HG-U133P2 | 20 542 | MAS5 and log2
4.5 | 56 | Silver et al. | 2010 | 75 | GSE18864 | Affymetrix™ | GPL570 | HG-U133P2 | 20 542 | MAS5 and log2
4.5 | 57 | Burstein et al. | 2015 | 198 | GSE76124 | Affymetrix™ | GPL570 | HG-U133P2 | 20 542 | MAS5 and log2
4.5 | 58 | Jézéquel et al. | 2015 | 107 | GSE58812 | Affymetrix™ | GPL570 | HG-U133P2 | 20 542 | MAS5 and log2
4.5 | 59 | Jézéquel et al. | 2019 | 131 | GSE83937 | Affymetrix™ | GPL570 | HG-U133P2 | 20 542 | MAS5 and log2
Total | # 59 | No. patients: 10 716

* Data have been converted to a common scale (median equal to 0 and standard deviation equal to 1).



Intrinsic molecular subtypes classification:


Table 1: Intrinsic molecular subtyping methods

Single sample predictors (SSPs)
Molecular subtypes predictor (MSP) | No. genes in MSP | Reference
Sorlie's SSP | 500 | Sorlie et al, 2003
Hu's SSP | 306 | Hu et al, 2006
PAM50 SSP | 50 | Parker et al, 2009
Platform correspondence: gene symbols; probes median (if multiple probes for a same gene)
R script reference: Weigelt et al, 2010
Statistics: nearest centroid classifier; highest correlation coefficient between patient profile and the 5 centroids
Subtypes: Basal-like, HER2-E, Luminal A, Luminal B, Normal breast-like

Subtype clustering models (SCMs)
Molecular subtypes predictor (MSP) | No. genes in MSP
SCMOD1 | 726
SCMOD2 | 663
SCMGENE | 3
Reference: Desmedt et al, 2008; Wirapati et al, 2008
R script reference: subtype.cluster function, R package genefu
Statistics: mixture of three Gaussians; use of ESR1, ERBB2 and AURKA modules
Subtypes: ER-/HER2-, HER2-E, ER+/HER2- low proliferation, ER+/HER2- high proliferation
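
As an illustration of the "nearest centroid classifier" described in Table 1 for the SSPs, the following Python sketch assigns a patient to the subtype whose centroid has the highest correlation with the patient profile. It is a simplified sketch, not the published R scripts: the function names, the use of Pearson correlation, the two toy centroids and all numeric values are assumptions, and any rule for leaving a patient unclassified is not modelled.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def nearest_centroid(profile, centroids):
    """Return (subtype, r) for the centroid most correlated with the profile."""
    scores = {name: pearson(profile, c) for name, c in centroids.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy centroids over four hypothetical genes (values are made up).
centroids = {
    "Basal-like": [2.0, -1.0, -1.5, 0.5],
    "Luminal A": [-1.5, 1.0, 2.0, -0.5],
}
patient = [1.8, -0.7, -1.2, 0.1]
subtype, r = nearest_centroid(patient, centroids)
print(subtype, round(r, 3))
```

In practice the centroids would cover all five subtypes and the genes of the chosen predictor (500, 306 or 50 genes according to Table 1).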




Table 2: Intrinsic molecular subtyping of 14 713 breast cancer patients included in bc-GenExMiner v4.4 according to 6 molecular subtype predictors. A DNA microarrays (n = 10 001). B RNA-seq (n = 4 712). (RSSPC: robust SSP classification based on patients classified in the same subtype with the three SSPs; RSCMC: robust SCM classification based on patients classified in the same subtype with the three SCMs; RIMSPC: robust intrinsic molecular subtype predictors classification based on patients classified in the same subtype with the six MSPs)

A (cells give No (%); robust classifications give No only)
MSP | Basal-like | HER2-E | Luminal A | Luminal B | Normal breast-like | unclassified
Sorlie's SSP | 1 453 (14.5) | 1 182 (11.8) | 2 934 (29.3) | 1 142 (11.4) | 1 313 (13.1) | 1 977 (19.8)
Hu's SSP | 2 277 (22.8) | 879 (8.8) | 2 422 (24.2) | 1 840 (18.4) | 1 500 (15) | 1 083 (10.8)
PAM50 SSP | 1 954 (19.5) | 1 477 (14.8) | 2 811 (28.1) | 1 944 (19.4) | 1 257 (12.6) | 558 (5.6)
RSSPC | 1 306 | 388 | 1 439 | 366 | 651 | -

MSP | ER-/HER2- | HER2-E | ER+/HER2- low proliferation | ER+/HER2- high proliferation | unclassified
SCMOD1 | 1 867 (18.7) | 1 156 (11.6) | 3 104 (31) | 2 809 (28.1) | 1 065 (10.6)
SCMOD2 | 1 965 (19.6) | 1 117 (11.2) | 2 966 (29.7) | 2 682 (26.8) | 1 271 (12.7)
SCMGENE | 2 790 (27.9) | 1 400 (14) | 2 586 (25.9) | 2 202 (22) | 1 023 (10.2)
RSCMC | 1 288 | 690 | 1 827 | 1 490 | -
RIMSPC | 1 055 | 231 | 828 | 242 | -

B (cells give No (%); robust classifications give No only)
MSP | Basal-like | HER2-E | Luminal A | Luminal B | Normal breast-like | unclassified
Sorlie's SSP | 625 (13.3) | 641 (13.6) | 1 595 (33.8) | 667 (14.2) | 839 (17.8) | 345 (7.3)
Hu's SSP | 1 022 (21.7) | 421 (8.9) | 1 200 (25.5) | 1 001 (21.2) | 923 (19.6) | 145 (3.1)
PAM50 SSP | 832 (17.7) | 736 (15.6) | 1 433 (30.4) | 1 029 (21.8) | 639 (13.6) | 43 (0.9)
RSSPC | 583 | 208 | 748 | 226 | 435 | -

MSP | ER-/HER2- | HER2-E | ER+/HER2- low proliferation | ER+/HER2- high proliferation | unclassified
SCMOD1 | 630 (13.4) | 365 (7.7) | 1 986 (42.2) | 1 731 (36.7) | 0 (0.0)
SCMOD2 | 667 (14.2) | 416 (8.8) | 1 913 (40.6) | 1 716 (36.4) | 0 (0.0)
SCMGENE | 781 (16.6) | 2 360 (50.1) | 838 (17.7) | 733 (15.6) | 0 (0.0)
RSCMC | 551 | 150 | 661 | 465 | -
RIMSPC | 513 | 72 | 192 | 66 | -

Figure 1: Intrinsic molecular subtyping of 14 713 breast cancer patients included in bc-GenExMiner v4.4 according to 6 intrinsic molecular subtype predictors by comparison of source of data: DNA microarrays (outer circles) vs. RNA-seq (inner circles). A 3 single sample predictors and the robust SSP classification (intersection). B 3 subtype clustering models and the robust SCM classification (intersection). C Robust RIMSPC classification (robust intrinsic molecular subtype predictors classification based on patients classified in the same subtype with the six MSPs).

[Figure 1 charts not reproduced. Panel A: Sorlie's SSP, Hu's SSP, PAM50 SSP and RSSPC charts (legend: Basal-like, HER2-E, Luminal A, Luminal B, Normal breast-like, unclassified). Panel B: SCMOD1, SCMOD2, SCMGENE and RSCMC charts (legend: ER-/HER2-, HER2-E, ER+/HER2- low prolif., ER+/HER2- high prolif.). Panel C: RIMSPC chart (legend: Basal-like, HER2-E, Luminal A, Luminal B).]


Legend

 MSP:molecular subtype predictor (SSPs + SCMs)
 No:number of patients
 RIMSPC:robust intrinsic molecular subtype predictors classification
 RSCMC:robust SCM classification based on patients classified in the same subtype with the three SCMs
 RSSPC:robust SSP classification based on patients classified in the same subtype with the three SSPs
 SCM:subtype clustering model
 SSP:single sample predictor






Data pre-processing:


1 DNA microarrays data

1.1 Affymetrix® pre-processing:

Before being log2-transformed, Affymetrix™ raw CEL data were MAS5.0-normalised (Affymetrix™ Microarray Suite 5.0) using the Affymetrix Expression Console™. The exception is Affymetrix™ Gene 1.0 ST data, which were pre-processed using the robust multiarray analysis (RMA) algorithm from the Affy Bioconductor package [a].

1.2 Non-Affymetrix pre-processing:

Data have been downloaded as they were deposited in the public databases. When patient to reference ratio and its log2-transformation were not already calculated, we performed the complete process.

1.3 Merging data:

Finally, in order to merge the data of all studies and create pooled cohorts, we converted the data of all studies, except triple-negative breast cancer (TNBC) cohorts, to a common scale (median equal to 0 and standard deviation equal to 1 [b]). For TNBC cohorts, the ComBat [c] method was used.
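
The conversion to a common scale (median equal to 0, standard deviation equal to 1) can be sketched in Python as follows. This is a minimal illustration under assumptions: the function name `to_common_scale` is hypothetical, the sample standard deviation is used, and the transformation is applied here to one vector of values (for instance one gene within one cohort), which may differ from the granularity actually used by the authors.

```python
import statistics

def to_common_scale(values):
    """Shift values to median 0 and scale them to standard deviation 1."""
    med = statistics.median(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumption)
    return [(v - med) / sd for v in values]

cohort = [5.2, 6.1, 7.4, 6.8, 9.0]  # made-up log2 expression values
scaled = to_common_scale(cohort)
print(statistics.median(scaled), statistics.stdev(scaled))
```

Because the data are centred on the median rather than the mean, the mean of the scaled values is generally not exactly 0.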

2 RNA-seq data

2.1 TCGA pre-processing:

2.1.1 All analyses except nature of tissues:

RNA-seq datasets were downloaded from the TCGA database (Genomic Data Commons Data Portal). We used the RNA-seq expression-level read counts produced by HTSeq and normalized using the FPKM normalization method [d]. FPKM values were log2-transformed using an offset of 0.1 in order to avoid undefined values.

2.1.2 Nature of the tissue:

To carry out analyses according to the nature of the tissue, we used RNA-seq data collected by the TCGA, processed and normalized using the Rsubread package [e]. TPM values were downloaded from GEO via accession numbers GSM1536837 (tumour) and GSM1697009 (tumour-adjacent). All gene expression datasets were log2-transformed using an offset of 1.

2.2 GTEx pre-processing:

We used a dataset that contains gene expression values for healthy tissues (no history of cancer, i.e. reduction mammoplasty) from the GTEx project. FPKM values available from GEO (accession number GSE86354) were processed and normalized using the Rsubread package [e]. We converted all FPKM gene expression data to TPM data using the formula below:
TPM_i = (FPKM_i / Σ_j FPKM_j) × 10^6, where i is a gene and the sum runs over all genes j of the sample.
An offset of 1 was added to the TPM values prior to log2 transformation.
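
The FPKM-to-TPM conversion and the offset log2 transformation can be sketched in Python as follows. The sketch assumes the standard conversion, in which each FPKM value is divided by the sum of FPKM values of the sample and multiplied by one million; the function names and the toy values are illustrative only.

```python
import math

def fpkm_to_tpm(fpkm):
    """Convert the FPKM values of one sample to TPM (values sum to 1e6)."""
    total = sum(fpkm)
    return [v / total * 1e6 for v in fpkm]

def log2_offset(values, offset):
    """log2-transform after adding an offset, so that zeros stay defined."""
    return [math.log2(v + offset) for v in values]

fpkm = [10.0, 0.0, 30.0, 60.0]          # made-up values for four genes
tpm = fpkm_to_tpm(fpkm)
log_tpm = log2_offset(tpm, offset=1)    # offset of 1, as for the TPM data
print(tpm, log_tpm)
```

A gene with no expression (TPM of 0) maps to log2(0 + 1) = 0, which is the reason for adding the offset before the transformation.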

2.3 SCAN-B (GSE81540) pre-processing:

We used the Sweden Cancerome Analysis Network Breast (SCAN-B) [f] database. RNA-seq reads were mapped to the hg19 human genome with tophat2 and normalized to FPKM with the cufflinks2 pipeline. FPKM values were then log2-transformed with an offset of 0.1.

2.4 Merging data:

Finally, in order to merge the data of all studies and create pooled cohorts, we converted the studies' data to a common scale (median equal to 0 and standard deviation equal to 1 [b]).
For the analysis of the nature of the tissue, standardization is not required, since the RNA-seq raw read files from the different data sources were processed and normalized with the Rsubread package [e] and aligned to the same reference genome (UCSC hg19) with the same pipeline.




Statistical analyses:


Several types of analyses are available: correlation analyses, expression analyses and prognostic analyses, all of which have different subtypes.

  Correlation analyses
Gene correlation targeted analysis:

Pearson's correlation coefficient is computed with associated p-value for each pair of genes based on eight different populations:
  • all patients pooled together,
  • patients with positive or negative oestrogen receptor (ER) status,
  • patients with positive or negative progesterone receptor (PR) status,
  • patients with combined ER and PR statuses,
  • PAM50 molecular subtyped patients,
  • RIMSPC molecular subtyped patients,
  • basal-like (as defined by PAM50) and triple-negative (as defined by immunohistochemistry [IHC]) patients and the intersection of the 2 latter populations,
  • and finally patients grouped by triple-negative breast cancer subtype.

Results are displayed in a correlation map, where each cell corresponds to a pairwise correlation and is coloured according to the correlation coefficient value, from dark blue (coefficient = -1) to dark red (coefficient = 1).
Pearson's pairwise correlation plots are also computed to illustrate each pairwise correlation.

Gene correlation exhaustive analysis:

Pearson's correlation coefficient is computed, with associated p-value, between the chosen gene and all other genes that are present in the database, based on eight different populations: see list in "Gene correlation targeted analysis" section.
Genes with a correlation coefficient above 0.40 in absolute value and an associated p-value less than 0.05 are retained, and the genes with the best correlation coefficients are displayed in two tables: one for the first 50 (or fewer) positive correlations, one for the first 50 (or fewer) negative ones.
The lists with all genes fulfilling criteria of correlation coefficient above 0.40 in absolute value and associated p-value less than 0.05 can be downloaded from the results page.
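
The selection rule described above can be sketched in Python as follows. The sketch assumes that correlation coefficients and p-values have already been computed; the function name `select_correlated`, the tuple layout and the gene names are hypothetical.

```python
def select_correlated(results, r_min=0.40, p_max=0.05, top=50):
    """Filter (gene, r, p) tuples and return the best positive and negative hits.

    A gene is kept when |r| > r_min and p < p_max; each table holds at most
    `top` genes, ordered from the strongest correlation downwards.
    """
    kept = [(g, r, p) for g, r, p in results if abs(r) > r_min and p < p_max]
    positive = sorted((x for x in kept if x[1] > 0), key=lambda x: -x[1])[:top]
    negative = sorted((x for x in kept if x[1] < 0), key=lambda x: x[1])[:top]
    return positive, negative

# Made-up correlation results against a chosen gene.
results = [
    ("GENE_A", 0.82, 1e-30),
    ("GENE_B", 0.41, 0.20),    # rejected: p-value too high
    ("GENE_C", -0.55, 1e-9),
    ("GENE_D", 0.12, 1e-4),    # rejected: correlation too weak
    ("GENE_E", 0.47, 0.003),
]
positive, negative = select_correlated(results)
print(positive)
print(negative)
```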

Gene Ontology analysis:

As a complement to this "screening" analysis, an analysis is performed to find Gene Ontology enrichment terms. This analysis focuses on significantly under- or over-represented terms present in the list of genes most positively correlated with the chosen gene, including itself, in the list of genes most negatively correlated with the chosen gene and in the union of these two lists.

For each term of each of the Gene Ontology trees (biological process, molecular function and cellular component), comparison is done between the number of occurrences of this term in the "target list", i.e. the number of times this term is directly linked to a gene, and the number of occurrences of this term in the "gene universe" (all of the genes that are expressed in the database) by means of Fisher's exact test. Terms with associated p-values less than 0.01 are kept.
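
For illustration, a two-sided Fisher's exact test on a 2x2 table can be written in a few lines of Python. This is a sketch, not the tool's implementation: the function name is hypothetical, and the table layout (genes of the target list with or without the term, versus the remaining genes of the universe with or without the term) is an assumption about how the counts are arranged.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance so that ties with the observed table are included.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Hypothetical counts: 8 of 50 target genes carry the term,
# against 40 of the 950 other genes of the universe.
p_term = fisher_exact_two_sided(8, 42, 40, 910)
print(p_term, p_term < 0.01)
```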

Gene correlation analysis by chromosomal location:

Pearson's correlation coefficient is computed, with associated p-value, between the chosen gene and the genes located around it on the same chromosome (up to 15 upstream and 15 downstream), based on eight different populations: see list in "Gene correlation targeted analysis" section. Pearson's pairwise correlation plots are also produced to illustrate the correlation of each gene with the chosen one.

Targeted correlation analysis (TCA):

As a complement, results of gene correlation analysis for genes selected via the "TCA" column can be displayed.
Targeted correlation analysis ("TCA" button), which aims at evaluating the robustness of clusters, is proposed: correlation analyses are automatically computed between all possible pairs of genes that compose a selected cluster.

  Expression analyses
Targeted expression analysis:

Once the analysis criteria have been chosen (data source, gene / Probe Set to be tested, clinical criterion or criteria to test the gene against), the distribution of the gene's expression in the available population (all cohorts with the required information, pooled together) according to the population splitting criterion or criteria is illustrated by box and whisker, beeswarm, violin and raincloud plots. To assess the significance of the difference in gene expression distributions between the groups, Welch's test is performed, as well as Dunnett-Tukey-Kramer's tests when appropriate.

Exhaustive expression analysis:

Box and whisker, beeswarm, violin and raincloud plots are displayed, along with Welch's (and Dunnett-Tukey-Kramer's) tests, for every possible population splitting criterion for a single gene.

Customised expression analysis:

Similarly to targeted analysis, the distribution of a chosen gene is compared between groups, but here the groups are defined based on another gene: the population (all cohorts with both gene values available, pooled together) is split according to the expression level(s) of the latter gene.


  Prognostic analyses
Time-to-event endpoints or event:

The Time-to-event endpoints (or event) used for survival analyses are:
  • "distant metastasis-free survival" (DMFS): first pejorative event represented by distant relapse,
  • "overall survival" (OS): first pejorative event represented by death,
  • "disease-free survival" (DFS): first pejorative event represented by any relapse or death.


Targeted prognostic analysis:

Once the analysis criteria have been chosen (data source, gene / Probe Set to be tested, nodal, oestrogen receptor and progesterone receptor statuses of the cohorts to be explored, event, on which survival analysis will be based, and splitting criterion for the gene), the prognostic impact of the gene is evaluated on all cohorts pooled by means of univariate Cox proportional hazards model, stratified by cohort, and illustrated with a Kaplan-Meier curve.
Cox results are displayed on the curve. In case of more than 2 groups, detailed Cox results (pairwise comparisons) are given in a separate table.
In order to minimize unreliability at the end of the curve, the 15% of patients with the longest follow-up are not plotted [a].
To evaluate the independent prognostic impact of gene(s) relative to the well-established clinical markers NPI [b] and AOL [c] (10-year overall survival) and to the proliferation score [d], adjusted Cox proportional hazards models are fitted on the pooled patients with available data.

Exhaustive prognostic analysis:

Univariate Cox proportional hazards models and Kaplan-Meier curves are computed on each of the 27 possible pools, corresponding to every combination of population (nodal, oestrogen receptor and progesterone receptor status), for each event criterion (DMFS, OS and DFS), to assess the prognostic impact of the chosen gene / Probe Set, discretised according to the splitting criterion selected. Results are displayed by event criterion and population, and are ordered by p-value (smallest to largest).

Molecular subtype prognostic analysis:

Patients are pooled according to their molecular subtypes, based on three single sample predictors (SSPs) and three subtype clustering models (SCMs), and on three supplementary robust molecular subtype classifications consisting of the intersections of the 3 SSP and/or the 3 SCM classifications: only patients with concordant molecular subtype assignment for the 3 SSPs (RSSPC), for the 3 SCMs (RSCMC), or for all predictors (RIMSPC) are kept. Univariate Cox proportional hazards analyses and Kaplan-Meier curves are computed after choosing the data source, the gene / Probe Set, the molecular subtype populations and the kind of event, with the gene discretised according to the splitting criterion selected.
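
The "robust" classifications are intersections: a patient is kept only when every predictor of the group returns the same subtype. A minimal Python sketch of this rule for the three SSPs (RSSPC) is given below; the function name, the patient identifiers and the calls are hypothetical. Extending it to RIMSPC would additionally require a mapping between SSP and SCM subtype names, which is not covered here.

```python
def robust_subtype(calls):
    """Return the shared subtype if all predictors agree, otherwise None."""
    unique = set(calls)
    return unique.pop() if len(unique) == 1 else None

# Hypothetical subtype calls from Sorlie's, Hu's and PAM50 SSPs.
ssp_calls = {
    "patient_1": ["Basal-like", "Basal-like", "Basal-like"],
    "patient_2": ["Luminal A", "Luminal B", "Luminal A"],
    "patient_3": ["HER2-E", "HER2-E", "HER2-E"],
}
rsspc = {pid: robust_subtype(calls) for pid, calls in ssp_calls.items()}
kept = {pid: s for pid, s in rsspc.items() if s is not None}
print(kept)
```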

Basal-like/TNBC prognostic analysis:

Univariate Cox proportional hazards analyses and Kaplan-Meier curves are performed, for the chosen gene / Probe Set, discretised according to the splitting criterion selected for all event criteria (DMFS, OS and DFS), on Basal-like (BL) patients (PAM50), on Triple-Negative breast cancer (TNBC) patients (IHC) and on patients both BL and TNBC.

Nota bene:
  • When working with gene symbols, if there are multiple probesets for the same gene, the median of the probeset values is taken as the unique value for the gene.
  • Kaplan-Meier curves will not be computed in populations with fewer than 5 patients.
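
The probeset rule of the first note can be sketched in Python as follows. The function name, the probeset identifiers and the gene names are hypothetical and only serve the illustration.

```python
import statistics
from collections import defaultdict

def collapse_probesets(probe_values, probe_to_gene):
    """Collapse probeset values to one value per gene using the median."""
    by_gene = defaultdict(list)
    for probe, value in probe_values.items():
        by_gene[probe_to_gene[probe]].append(value)
    return {gene: statistics.median(vals) for gene, vals in by_gene.items()}

# One patient, four hypothetical probesets mapping to two genes.
probe_values = {"probe_1": 2.0, "probe_2": 4.0, "probe_3": 9.0, "probe_4": 1.5}
probe_to_gene = {"probe_1": "GENE_X", "probe_2": "GENE_X",
                 "probe_3": "GENE_X", "probe_4": "GENE_Y"}
gene_values = collapse_probesets(probe_values, probe_to_gene)
print(gene_values)
```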





Statistical tests:


  Correlation statistical tests
Pearson correlation

  - The coefficient:
Pearson correlation coefficient, also known as the Pearson's product moment correlation coefficient and denoted by r, measures the linear dependence (correlation) between two variables (e.g. genes).
It is obtained by the formula r = cov(G1,G2) / (std(G1)*std(G2)), where cov(G1,G2) is the covariance between the variables G1 and G2 and std denotes the standard deviation of each variable.
r values can vary from -1 to 1. A negative r means that when the first variable increases, the second one decreases; a positive r means that both variables increase or decrease together. The greater the r in absolute value, the stronger the linear dependence between the two variables, with the extreme values of -1 or 1 meaning a perfect linear dependence between the two variables, in which case, if the two variables are plotted, all data points lie on a line.


  - The associated p-value:
Along with the Pearson correlation coefficient, one can test whether this coefficient is different from 0, knowing that the statistic
t = r*√(n-2)/√(1-r²) follows a Student distribution with (n-2) degrees of freedom, n being the number of values.
The p-value associated with the Pearson correlation coefficient thus indicates whether a linear dependence exists between the two variables.
Note that one has to be careful when interpreting the p-value associated with the Pearson correlation coefficient: a significant p-value means that a linear dependence exists between the two variables but does not mean that this linear dependence is strong; for example, a coefficient of 0.05 with 1600 data points is associated with a significant p-value (p = 0.046), but one certainly cannot conclude that there is a strong linear dependence between the two variables.
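
The example above (r = 0.05, n = 1600) can be checked with a short Python sketch. The t statistic follows the formula given in this section; the p-value, however, is computed here with a normal approximation instead of the Student distribution, which is a simplification that is only reasonable because n is large. Function names are illustrative.

```python
import math

def pearson_t(r, n):
    """t statistic associated with a Pearson coefficient r on n values."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

def two_sided_p_normal(t):
    """Two-sided p-value under a normal approximation (adequate for large n)."""
    return math.erfc(abs(t) / math.sqrt(2))

t = pearson_t(0.05, 1600)
p = two_sided_p_normal(t)
print(round(t, 3), round(p, 4))
```

The approximation gives a value slightly below the Student-based p-value quoted in the text, but the conclusion is the same: a very weak correlation is declared significant simply because of the large sample size.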


  Expression statistical tests
Gene-expression comparisons

To evaluate the difference in a gene's expression among the different population groups, Welch's test is used between the groups. Moreover, when there are at least three groups and Welch's p-value is significant (indicating that the gene's expression differs between at least two subpopulations), Dunnett-Tukey-Kramer's test is used for pairwise comparisons (this test gives the significance level but not a precise p-value).
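
For the two-group case, Welch's test statistic and its approximate degrees of freedom (Welch-Satterthwaite) can be sketched in Python as follows. This only illustrates the principle of a test that does not assume equal variances; it does not cover comparisons of three or more groups nor the Dunnett-Tukey-Kramer procedure, and the function name and data are hypothetical.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    na, nb = len(group_a), len(group_b)
    va, vb = statistics.variance(group_a), statistics.variance(group_b)
    se2 = va / na + vb / nb  # squared standard error of the mean difference
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Made-up expression values of one gene in two patient groups.
er_negative = [1.0, 2.0, 3.0, 4.0, 5.0]
er_positive = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(er_negative, er_positive)
print(round(t_stat, 4), round(dof, 3))
```

The p-value is then read from a Student distribution with `dof` degrees of freedom, which is left out of this sketch.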

Optimal discretisation

In customised analyses, when choosing "optimal" as the splitting criterion for discretisation, gene / Probe Set is split according to all percentiles from the 20th to the 80th, with a step of 5, and the cutoff giving the best p-value (Welch's test) is kept.
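
The "optimal" splitting criterion amounts to a scan over candidate cutoffs. The Python sketch below illustrates it under assumptions: the statistical test is abstracted as a user-supplied callback `p_value_fn(cutoff)` (hypothetical, standing in for Welch's test or a Cox model), percentiles are obtained by linear interpolation, and the toy p-value function is made up for the demonstration.

```python
def percentile(sorted_values, q):
    """Percentile q (0 to 100) of a sorted list, by linear interpolation."""
    pos = (len(sorted_values) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)

def candidate_cutoffs(expression):
    """(percentile, cutoff) pairs from the 20th to the 80th, with a step of 5."""
    ordered = sorted(expression)
    return [(q, percentile(ordered, q)) for q in range(20, 85, 5)]

def optimal_cutoff(expression, p_value_fn):
    """Return (percentile, cutoff, p) for the cutoff with the smallest p-value."""
    scored = ((q, c, p_value_fn(c)) for q, c in candidate_cutoffs(expression))
    return min(scored, key=lambda item: item[2])

# Toy data and toy p-value function whose minimum is at a cutoff of 41.
expression = list(range(1, 102))
best = optimal_cutoff(expression, lambda c: abs(c - 41) / 100 + 0.001)
print(best)
```

Because the best of 13 candidate cutoffs is retained, the resulting p-value is optimistic by construction and should be interpreted with care.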


  Prognostic statistical tests
Optimal discretisation

In prognostic analyses, when choosing "optimal" as the splitting criterion for discretisation, gene / Probe Set is split according to all percentiles from the 20th to the 80th, with a step of 5, and the cutoff giving the best p-value (Cox model) is kept.




Cox model

  - Aim of the Cox model:
Cox model is a regression model to express the relation between a covariate, either continuous (e.g. G gene) or ordered discrete (e.g. SBR grade), and the risk of occurrence of a certain event (e.g. metastatic relapse).
Its simplified formula for G gene can be written as follows:
h(t,g) = h0(t)*exp(β.g), where h is the hazard function of the event occurrence at time t, dependent on the value g of G, and h0(t) is the positive baseline hazard function, shared by all patients.
β is the regression coefficient associated with G, the parameter one wants to evaluate.

  - Interpretation of Cox model results:
There are two particularly interesting results when building a Cox model: the p-value associated with β, which tells us whether the covariate (e.g. gene) has a significant impact on event-free survival (if the p-value is less than a certain threshold, usually 5%), and the hazard ratio (HR) (equal to exp(β)), sometimes summed up by its direction (the sign of β).


The HR, which is of interest mainly when the p-value is significant, is a risk ratio of event occurrence between patients with regard to their relative measurements for the gene under study. More specifically, the HR corresponds to the factor by which the risk of occurrence of the event is multiplied when the risk factor increases by one unit: h(t,g+1) = h(t,g)*exp(β).
The direction of this HR therefore indicates how the gene will generally affect the patients' event-free survival.
For example, saying that the parameter β associated with the gene G under study is negative (thus exp(β) < 1) means that the greater the value of G, the lower the risk of event: if A and B are two patients such that A's G value gA is greater than B's G value gB, then one can say that patient A has a lower risk of metastatic relapse than patient B:
    gA > gB, β < 0
 ⇒ β.gA < β.gB
 ⇒ exp(β.gA) < exp(β.gB)
 ⇒ h0(t)*exp(β.gA) < h0(t)*exp(β.gB), that is, h(t, gA) < h(t, gB).
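
The reasoning above can be illustrated numerically in Python. Writing β for the regression coefficient of the gene, the ratio of the hazards of two patients is exp(β.(gA - gB)), because the baseline hazard h0(t) cancels out. The coefficient and the expression values below are made up for the illustration.

```python
import math

def hazard_ratio(beta, g_a, g_b):
    """Ratio h(t, g_a) / h(t, g_b); the baseline hazard h0(t) cancels out."""
    return math.exp(beta * (g_a - g_b))

beta = -0.5                                    # hypothetical negative coefficient
hr_per_unit = hazard_ratio(beta, 1.0, 0.0)     # equals exp(beta)
hr_a_vs_b = hazard_ratio(beta, 2.0, 0.5)       # patient A has the higher G value
print(round(hr_per_unit, 3), round(hr_a_vs_b, 3))
```

Both ratios are below 1: with a negative coefficient, the patient with the higher expression value has the lower risk of event.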



Kaplan-Meier curves

  - The Kaplan-Meier estimator:
The Kaplan-Meier method, also known as the product-limit method, is a non-parametric method to estimate the survival function S(t) (= Pr(T > t): probability of having a survival time T longer than time t) of a given population. It is based on the idea that being alive at time t means being alive just before t and staying alive at t.
Suppose we have a population of n patients, among whom k patients have experienced an event (metastatic relapse or death, for instance) at distinct times t1 < t2 < ... < tm (m = k if all events occurred at different times). For each time ti, let ni denote the number of patients still at risk just before ti, that is, patients who have not yet experienced the event and are not censored, and let ei denote the number of events that occurred at ti. The event-free survival probability at time ti, S(ti), is then the probability S(ti-1) of not experiencing the event before time ti (that is, up to time ti-1) multiplied by the probability (ni-ei)/ni of not experiencing the event at time ti (which, by definition of ti, corresponds to the probability of not experiencing the event during the interval between ti-1 and ti): S(ti) = S(ti-1) x (ni-ei)/ni.
The Kaplan-Meier estimator of the survival function S(t) is thus the cumulative product:

S(t) = Π (ni-ei)/ni, the product being taken over all event times ti ≤ t (with S(t) = 1 before the first event).
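The recursion S(ti) = S(ti-1) x (ni-ei)/ni described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used by bc-GenExMiner: the function name kaplan_meier and the follow-up data are hypothetical, and patients censored exactly at ti are counted as still at risk at ti, which is the usual convention.

```python
def kaplan_meier(times, events):
    """Return [(t_i, S(t_i))] for each distinct event time t_i.

    times  : follow-up time of each patient
    events : 1 if the event was observed at that time, 0 if censored
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    survival = 1.0
    curve = []
    for t_i in event_times:
        # n_i: patients still at risk just before t_i (follow-up >= t_i)
        n_i = sum(1 for t in times if t >= t_i)
        # e_i: number of events occurring exactly at t_i
        e_i = sum(1 for t, e in zip(times, events) if t == t_i and e == 1)
        # Cumulative product: S(t_i) = S(t_{i-1}) * (n_i - e_i) / n_i
        survival *= (n_i - e_i) / n_i
        curve.append((t_i, survival))
    return curve

# Hypothetical follow-up data in months; 0 marks a censored patient.
times = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]

km_curve = kaplan_meier(times, events)
for t_i, s in km_curve:
    print(f"S({t_i}) = {s:.3f}")
```

Each censored patient leaves the risk set without producing a step, which is why the denominators ni shrink faster than the number of events alone would suggest.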




  - The curve:
The Kaplan-Meier survival curve, i.e. the plot of the survival function, makes it possible to visualise how the estimated survival function evolves over time. The curve is shaped like a staircase, with a step corresponding to the events at the end of each [ti-1, ti) interval. Tick marks on each curve indicate censored observations.
Illustrating the Kaplan-Meier survival estimator with a survival curve becomes especially interesting when there are different groups of patients (e.g. according to different treatments or different values of biological markers) and one wants to compare their relative event-free survival. The different survival curves are then plotted together and can be visually compared.
The colour palette used for the curves is from the R package viridis [a]: it preserves colour differences when converted to a black and white scale and is designed to be perceived by readers with the most common form of colour blindness.

  - Reliability of the estimation:
Caution must be taken when interpreting the survival curve, especially at its end: censored patients induce a loss of information and reduce the sample size, making the survival curve less reliable, and the end of the curve is particularly affected. For our analyses, in order to minimise unreliability at the end of the curve, the 15% of patients with the longest event-free survival or follow-up are not plotted [a].
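One possible reading of this 15% rule is sketched below in Python. The exact procedure used by bc-GenExMiner is not described here, so the function plotting_cutoff, the rounding choice and the follow-up times are all assumptions made for illustration.

```python
def plotting_cutoff(times, dropped_percent=15):
    """Last follow-up time kept for plotting, assuming the dropped_percent
    of patients with the longest follow-up are simply left out of the plot."""
    ordered = sorted(times)
    # Integer arithmetic avoids floating-point rounding on the patient count.
    n_dropped = len(ordered) * dropped_percent // 100
    n_kept = len(ordered) - n_dropped
    return ordered[n_kept - 1]

# Hypothetical follow-up times in months for 20 patients.
follow_up = [3, 7, 9, 12, 14, 18, 21, 24, 30, 33,
             36, 41, 47, 52, 60, 66, 75, 90, 110, 140]

cutoff = plotting_cutoff(follow_up)
print(f"Curve drawn up to t = {cutoff} months")
```

In this example the three patients with the longest follow-up (90, 110 and 140 months) fall beyond the cutoff, so the part of the curve resting on the fewest patients is not displayed.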






Graphic illustrations:


  Correlation graphic illustrations
Correlation map

A correlation map illustrates pairwise correlations among a given group of genes.
A correlation map is a square table in which each row and each column represents a gene. Each cell represents the relation between two genes and is coloured according to the value of the Pearson correlation coefficient between these two genes, from dark blue (coefficient = -1) to dark red (coefficient = 1).
Cells on the diagonal of the correlation map represent the "interaction" of a gene with itself and are coloured in black.
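The values behind such a map can be sketched as follows. This Python example only computes the table of pairwise Pearson coefficients; the gene names and expression values are hypothetical, and the colouring step is left out.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def correlation_map(expression):
    """Square table (dict of dicts) of pairwise Pearson coefficients."""
    genes = list(expression)
    return {g1: {g2: pearson(expression[g1], expression[g2]) for g2 in genes}
            for g1 in genes}

# Hypothetical expression values of three genes in five patients.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [2.1, 3.9, 6.2, 8.0, 9.8],
    "GENE_C": [5.0, 4.0, 3.0, 2.0, 1.0],
}

corr = correlation_map(expression)
for gene, row in corr.items():
    print(gene, " ".join(f"{value:+.2f}" for value in row.values()))
```

The table is symmetric and its diagonal is always 1, which is why the diagonal cells carry no information and can be drawn in a neutral colour.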

Pairwise correlation plot

On a correlation plot, the least-squares regression line is plotted along with the data points to illustrate the correlation between two given genes.
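As an illustration, the slope and intercept of a least-squares regression line can be computed as in the Python sketch below. The function name and the expression values are hypothetical and only serve the example.

```python
def least_squares_line(x, y):
    """Slope and intercept of the least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical expression values of two genes in four patients.
gene_x = [1.0, 2.0, 3.0, 4.0]
gene_y = [3.0, 5.0, 7.0, 9.0]

slope, intercept = least_squares_line(gene_x, gene_y)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```

The sign of the slope always matches the sign of the Pearson correlation coefficient between the two genes.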



  Expression graphic illustrations
Box and whisker, beeswarm, violin and raincloud plots

Box and whisker plots graphically represent descriptive statistics of a continuous variable (e.g. gene): the box goes from the lower quartile (Q1) to the upper quartile (Q3), with a horizontal line marking the median. Below and above the box, the whiskers extend from Q1 and Q3, respectively, to a distance of 1.5 times the interquartile range, that is, to Q1-1.5*(Q3-Q1) and Q3+1.5*(Q3-Q1).
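The quantities described above can be computed as in the following Python sketch. The data are hypothetical, and the "inclusive" quartile method is an assumption: several quartile conventions exist and the one used by bc-GenExMiner is not stated here.

```python
import statistics

def box_plot_stats(values):
    """Quartiles and whisker limits of a box and whisker plot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    # Values beyond the whisker limits are usually drawn as separate points.
    outside = [v for v in values if v < lower_limit or v > upper_limit]
    return {
        "q1": q1, "median": median, "q3": q3,
        "lower_limit": lower_limit, "upper_limit": upper_limit,
        "outside": outside,
    }

# Hypothetical expression values of one gene in nine patients.
values = [1, 2, 3, 4, 5, 6, 7, 8, 30]

stats = box_plot_stats(values)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Here the interquartile range is 4, so the limits are -3 and 13 and the value 30 falls outside the upper whisker.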

A beeswarm plot is a one-dimensional scatter plot similar to a stripchart, except that would-be overlapping points are separated so that each is visible (package beeswarm [a]).

A violin plot combines a kernel probability density plot and a box and whisker plot. Density curves are plotted symmetrically on both sides of the box and whisker plot.
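The density part of a violin plot can be sketched with a simple Gaussian kernel density estimate, mirrored on both sides of a central axis. This Python example is illustrative only: the kernel, the hand-picked bandwidth and the expression values are assumptions, not the settings used by bc-GenExMiner.

```python
import math

def gaussian_kde(values, bandwidth):
    """Return a function estimating the density of values with Gaussian kernels."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2)
                          for v in values)

    return density

# Hypothetical expression values and a hand-picked bandwidth (both assumptions).
expression = [4.1, 4.8, 5.0, 5.3, 6.2, 6.9, 7.0]
bandwidth = 0.5
density = gaussian_kde(expression, bandwidth)

# Grid covering the data plus four bandwidths on each side.
step = 0.01
low = min(expression) - 4 * bandwidth
n_points = int(round((max(expression) + 4 * bandwidth - low) / step)) + 1
grid = [low + i * step for i in range(n_points)]

# Violin outline: the same density mirrored left and right of the axis.
outline = [(-density(y), density(y), y) for y in grid]

# Trapezoidal check that the estimated density integrates to about 1.
area = sum((outline[i][1] + outline[i + 1][1]) / 2 * step
           for i in range(n_points - 1))
print(f"area under the density = {area:.4f}")
```

Each tuple in outline gives the left edge, the right edge and the expression value of one horizontal slice of the violin.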




A raincloud plot is a combination of a split-half violin, raw jittered data points, and a box and whisker plot [b].

Box and whisker, beeswarm, violin and raincloud plots make it possible to visually compare the distributions of a gene among the different population groups.










© 2010 bc-GenExMiner team    Last update June 22, 2020