ExpressionSet and SummarizedExperiment

Introduction
ExpressionSet
Summarized Experiment
Licence
References

This analysis was performed using R (ver. 3.1.0).

Introduction

The ExpressionSet is generally used for array-based experiments, where the rows are features, and the SummarizedExperiment is generally used for sequencing-based experiments, where the rows are GenomicRanges. ExpressionSet is in the Biobase library.

There’s a library GEOquery which lets you pull down ExpressionSets by identifier.

library(Biobase) #Download data from GEO library(GEOquery) geoq <- getGEO("GSE9514") #The list has a single element #Save it to letter e for simplicity names(geoq)

## [1] "GSE9514_series_matrix.txt.gz"

e <- geoq[[1]]

ExpressionSet

ExpressionSets are basically matrices with a lot of metadata around them. Here, we have a matrix which is 9,000 by 8. It has phenotypic data, feature and annotation information. You can use the functions dim, ncol, nrow to have the dimension, the number of columns and rows respectively.

The matrix of the expression data, is stored in the exprs slot and you can access that with the exprs function. The phenotypic data, can be accessed using pData and this gives us a data frame, which is information about the samples, including the accession number, the submission date, etc). The feature data is accessible with fData function and this is information about genes or probe sets. The names of the data in the feature data are, for example, the gene title, the gene symbol, the ENREZ_Gene_ID or Gene Ontology information, which might be useful for downstream analysis.

#Number of dimension dim(e)

## Features Samples ## 9335 8

#The expression matrix exprs(e)[1:3,1:3]

## GSM241146 GSM241147 GSM241148 ## 10000_at 15.33 9.459 7.985 ## 10001_at 283.47 300.729 270.016 ## 10002_i_at 2569.45 2382.815 2711.814

#Phenotypic data: information about the samples pData(e)[1:3,1:6]

## title ## GSM241146 hem1 strain grown in YPD with 250 uM ALA (08-15-06_Philpott_YG_S98_1) ## GSM241147 WT strain grown in YPD under Hypoxia (08-15-06_Philpott_YG_S98_10) ## GSM241148 WT strain grown in YPD under Hypoxia (08-15-06_Philpott_YG_S98_11) ## geo_accession status submission_date ## GSM241146 GSM241146 Public on Nov 06 2007 Nov 02 2007 ## GSM241147 GSM241147 Public on Nov 06 2007 Nov 02 2007 ## GSM241148 GSM241148 Public on Nov 06 2007 Nov 02 2007 ## last_update_date type ## GSM241146 Aug 14 2011 RNA ## GSM241147 Aug 14 2011 RNA ## GSM241148 Aug 14 2011 RNA

dim(pData(e))

## [1] 8 31

#Column names of the phenotypic data names(pData(e))

## [1] "title" "geo_accession" ## [3] "status" "submission_date" ## [5] "last_update_date" "type" ## [7] "channel_count" "source_name_ch1" ## [9] "organism_ch1" "characteristics_ch1" ## [11] "molecule_ch1" "extract_protocol_ch1" ## [13] "label_ch1" "label_protocol_ch1" ## [15] "taxid_ch1" "hyb_protocol" ## [17] "scan_protocol" "description" ## [19] "data_processing" "platform_id" ## [21] "contact_name" "contact_email" ## [23] "contact_department" "contact_institute" ## [25] "contact_address" "contact_city" ## [27] "contact_state" "contact_zip/postal_code" ## [29] "contact_country" "supplementary_file" ## [31] "data_row_count"

#Feature data : information about genes or probe sets fData(e)[1:3,1:3]

## ID ORF SPOT_ID ## 10000_at 10000_at YLR331C  ## 10001_at 10001_at YLR332W  ## 10002_i_at 10002_i_at YLR333C

dim(fData(e))

## [1] 9335 17

names(fData(e))

## [1] "ID" "ORF" ## [3] "SPOT_ID" "Species Scientific Name" ## [5] "Annotation Date" "Sequence Type" ## [7] "Sequence Source" "Target Description" ## [9] "Representative Public ID" "Gene Title" ## [11] "Gene Symbol" "ENTREZ_GENE_ID" ## [13] "RefSeq Transcript ID" "SGD accession number" ## [15] "Gene Ontology Biological Process" "Gene Ontology Cellular Component" ## [17] "Gene Ontology Molecular Function"

head(fData(e)$"Gene Symbol")

## [1] JIP3 MID2 RPS25B NUP2 ## 4869 Levels: ACO1 ARV1 ATP14 BOP2 CDA1 CDA2 CDC25 CDC3 CDD1 CTS1 ... Il4

head(rownames(e))

## [1] "10000_at" "10001_at" "10002_i_at" "10003_f_at" "10004_at" ## [6] "10005_at"

#experiment data: experimenter name, laboratory, contact, abstract. #It's empty sometimes experimentData(e)

## Experiment data ## Experimenter name: ## Laboratory: ## Contact information: ## Title: ## URL: ## PMIDs: ## No abstract available.

#Annotation plateform annotation(e)

## [1] "GPL90"

Summarized Experiment

I’m going to load a bioconductor annotation package, which is the parathyroid SummarizedExperiment library. The loaded data is a SummarizedExperiment, which summarizes counts of RNA sequencing reads in genes for an experiment on human cell culture. The SummarizedExperiment object has 63,000 rows, which are genes, and 27 columns, which are samples, and the matrix, in this case, is called counts. And we have the row names, which are ensemble genes, and metadata about the row data, and metadata about the column data.

library(parathyroidSE) #RNA sequencing reads data(parathyroidGenesSE) se <- parathyroidGenesSE se

## class: SummarizedExperiment ## dim: 63193 27 ## exptData(1): MIAME ## assays(1): counts ## rownames(63193): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99 ## rowData metadata column names(0): ## colnames: NULL ## colData names(8): run experiment ... study sample

assay function can be used get access to the counts of RNA sequencing reads. colData function , the column data, is equivalent to the pData on the ExpressionSet. Each row in this data frame corresponds to a column in the SummarizedExperiment. We can see that there are indeed 27 rows here, which give information about the columns. Each sample in this case is treated with two treatments or control and we can see the number of replicates for each, using the as.numeric function again.

#Dimension of the SummarizedExperiment dim(se)

## [1] 63193 27

#Get access to the counts of RNA sequencing reads, using assay function. assay(se)[1:3,1:3]

## [,1] [,2] [,3] ## ENSG00000000003 792 1064 444 ## ENSG00000000005 4 1 2 ## ENSG00000000419 294 282 164

#Dimensions of this assay is a matrix, which has the same dimensions as the SummarizedExperiment. dim(assay(se))

## [1] 63193 27

#Get information about samples colData(se)[1:3,1:6]

## DataFrame with 3 rows and 6 columns ## run experiment patient treatment time submission ##       ## 1 SRR479052 SRX140503 1 Control 24h SRA051611 ## 2 SRR479053 SRX140504 1 Control 48h SRA051611 ## 3 SRR479054 SRX140505 1 DPN 24h SRA051611

#dimension of column data dim(colData(se))

## [1] 27 8

#characteristics of the samples names(colData(se))

## [1] "run" "experiment" "patient" "treatment" "time" ## [6] "submission" "study" "sample"

#Get access to treatment column of sample characteristics colData(se)$treatment

## [1] Control Control DPN DPN OHT OHT Control Control ## [9] DPN DPN DPN OHT OHT OHT Control Control ## [17] DPN DPN OHT OHT Control DPN DPN DPN ## [25] OHT OHT OHT ## Levels: Control DPN OHT

The rows in this case correspond to genes. Genes are collections of exons. The rows of the SummarizedExperiment is a GRangesList where each row corresponds to a GRanges which contains the exons, which were used to count the RNA sequencing reads. Some metadata are included in the row data and is accessible with the metadata function. This information tells us, how this GRangesList was constructed. if it was constructed from the genomic features package using a transcript database. Homo sapiens was the organism, and the database was ENSEMBL GENES number 72, and etc. In addition, there’s some more information under experiment data, using exptData and then specifying the MIAME, which is minimal information about a microarray experiment, Although we’re not using microarrays, we’ve still used the same slots to describe extra information about this object.

#Extract out a single GRanges object: 17 ranges and 2 metada columns which is #the ensembl id for the exon. rowData(se)[1]

## GRangesList of length 1: ## $ENSG00000000003 ## GRanges with 17 ranges and 2 metadata columns: ## seqnames ranges strand | exon_id exon_name ##    |   ## [1] X [99883667, 99884983] - | 664095 ENSE00001459322 ## [2] X [99885756, 99885863] - | 664096 ENSE00000868868 ## [3] X [99887482, 99887565] - | 664097 ENSE00000401072 ## [4] X [99887538, 99887565] - | 664098 ENSE00001849132 ## [5] X [99888402, 99888536] - | 664099 ENSE00003554016 ## ... ... ... ... ... ... ... ## [13] X [99890555, 99890743] - | 664106 ENSE00003512331 ## [14] X [99891188, 99891686] - | 664108 ENSE00001886883 ## [15] X [99891605, 99891803] - | 664109 ENSE00001855382 ## [16] X [99891790, 99892101] - | 664110 ENSE00001863395 ## [17] X [99894942, 99894988] - | 664111 ENSE00001828996 ## ## --- ## seqlengths: ## 1 2 ... LRG_99 ## 249250621 243199373 ... 13294

#rowData is indeed GRangesList class(rowData(se))

## [1] "GRangesList" ## attr(,"package") ## [1] "GenomicRanges"

#length gives the number of genes length(rowData(se))

## [1] 63193

#length of the first GRanges : gives the number of exons for the first gene. length(rowData(se)[[1]])

## [1] 17

head(rownames(se))

## [1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457" ## [5] "ENSG00000000460" "ENSG00000000938"

metadata(rowData(se))

## $genomeInfo ## $genomeInfo$`Db type` ## [1] "TranscriptDb" ## ## $genomeInfo$`Supporting package` ## [1] "GenomicFeatures" ## ## $genomeInfo$`Data source` ## [1] "BioMart" ## ## $genomeInfo$Organism ## [1] "Homo sapiens" ## ## $genomeInfo$`Resource URL` ## [1] "www.biomart.org:80" ## ## $genomeInfo$`BioMart database` ## [1] "ensembl" ## ## $genomeInfo$`BioMart database version` ## [1] "ENSEMBL GENES 72 (SANGER UK)" ## ## $genomeInfo$`BioMart dataset` ## [1] "hsapiens_gene_ensembl" ## ## $genomeInfo$`BioMart dataset description` ## [1] "Homo sapiens genes (GRCh37.p11)" ## ## $genomeInfo$`BioMart dataset version` ## [1] "GRCh37.p11" ## ## $genomeInfo$`Full dataset` ## [1] "yes" ## ## $genomeInfo$`miRBase build ID` ## [1] NA ## ## $genomeInfo$transcript_nrow ## [1] "213140" ## ## $genomeInfo$exon_nrow ## [1] "737783" ## ## $genomeInfo$cds_nrow ## [1] "531154" ## ## $genomeInfo$`Db created by` ## [1] "GenomicFeatures package from Bioconductor" ## ## $genomeInfo$`Creation time` ## [1] "2013-07-30 17:30:25 +0200 (Tue, 30 Jul 2013)" ## ## $genomeInfo$`GenomicFeatures version at creation time` ## [1] "1.13.21" ## ## $genomeInfo$`RSQLite version at creation time` ## [1] "0.11.4" ## ## $genomeInfo$DBSCHEMAVERSION ## [1] "1.0"

exptData(se)$MIAME

## Experiment data ## Experimenter name: Felix Haglund ## Laboratory: Science for Life Laboratory Stockholm ## Contact information: Mikael Huss ## Title: DPN and Tamoxifen treatments of parathyroid adenoma cells ## URL: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37211 ## PMIDs: 23024189 ## ## Abstract: A 251 word abstract is available. Use 'abstract' method.

abstract(exptData(se)$MIAME)

## [1] "Primary hyperparathyroidism (PHPT) is most frequently present in postmenopausal women. Although the involvement of estrogen has been suggested, current literature indicates that parathyroid tumors are estrogen receptor (ER) alpha negative. Objective: The aim of the study was to evaluate the expression of ERs and their putative function in parathyroid tumors. Design: A panel of 37 parathyroid tumors was analyzed for expression and promoter methylation of the ESR1 and ESR2 genes as well as expression of the ERalpha and ERbeta1/ERbeta2 proteins. Transcriptome changes in primary cultures of parathyroid adenoma cells after treatment with the selective ERbeta1 agonist diarylpropionitrile (DPN) and 4-hydroxytamoxifen were identified using next-generation RNA sequencing. Results: Immunohistochemistry revealed very low expression of ERalpha, whereas all informative tumors expressed ERbeta1 (n = 35) and ERbeta2 (n = 34). Decreased nuclear staining intensity and mosaic pattern of positive and negative nuclei of ERbeta1 were significantly associated with larger tumor size. Tumor ESR2 levels were significantly higher in female vs. male cases. In cultured cells, significantly increased numbers of genes with modified expression were detected after 48 h, compared to 24-h treatments with DPN or 4-hydroxytamoxifen, including the parathyroid-related genes CASR, VDR, JUN, CALR, and ORAI2. Bioinformatic analysis of transcriptome changes after DPN treatment revealed significant enrichment in gene sets coupled to ER activation, and a highly significant similarity to tumor cells undergoing apoptosis. Conclusions: Parathyroid tumors express ERbeta1 and ERbeta2. Transcriptional changes after ERbeta1 activation and correlation to clinical features point to a role of estrogen signaling in parathyroid function and disease."

Licence

References

https://github.com/genomicsclass

Enjoyed this article? I’d be very grateful if you’d help it spread by emailing it to a friend, or sharing it on Twitter, Facebook or Linked In.

Show me some love with the like buttons below... Thank you and please don't forget to share and comment below!!

Avez vous aimé cet article? Je vous serais très reconnaissant si vous aidiez à sa diffusion en l'envoyant par courriel à un ami ou en le partageant sur Twitter, Facebook ou Linked In.

Montrez-moi un peu d'amour avec les like ci-dessous ... Merci et n'oubliez pas, s'il vous plaît, de partager et de commenter ci-dessous!

Recommended for You!

Machine Learning Essentials: Practical Guide in R

Practical Guide to Cluster Analysis in R

Practical Guide to Principal Component Methods in R

R Graphics Essentials for Great Data Visualization

Network Analysis and Visualization in R

More books on R and data science

Recommended for you

This section contains best data science and self-development resources to help you on your path.

Coursera - Online Courses and Specialization

Books - Data Science

Our Books

Practical Guide to Cluster Analysis in R by A. Kassambara (Datanovia)
Practical Guide To Principal Component Methods in R by A. Kassambara (Datanovia)
Machine Learning Essentials: Practical Guide in R by A. Kassambara (Datanovia)
R Graphics Essentials for Great Data Visualization by A. Kassambara (Datanovia)
GGPlot2 Essentials for Great Data Visualization in R by A. Kassambara (Datanovia)
Network Analysis and Visualization in R by A. Kassambara (Datanovia)
Practical Statistics in R for Comparing Groups: Numerical Variables by A. Kassambara (Datanovia)
Inter-Rater Reliability Essentials: Practical Guide in R by A. Kassambara (Datanovia)

Others

R for Data Science: Import, Tidy, Transform, Visualize, and Model Data by Hadley Wickham & Garrett Grolemund
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems by Aurelien Géron
Practical Statistics for Data Scientists: 50 Essential Concepts by Peter Bruce & Andrew Bruce
Hands-On Programming with R: Write Your Own Functions And Simulations by Garrett Grolemund & Hadley Wickham
An Introduction to Statistical Learning: with Applications in R by Gareth James et al.
Deep Learning with R by François Chollet & J.J. Allaire
Deep Learning with Python by François Chollet

Want to Learn More on R Programming and Data Science?

Follow us by Email On Social Networks:

Get involved :
Click to follow us on Facebook and Google+ :
Comment this article by clicking on "Discussion" button (top-right position of this page)