TrackSOM-workflow
Givanna Putri
TrackSOM-workflow.Rmd
In this tutorial, we outline a step by step instruction on how to use
TrackSOM to cluster and track time-series data. For simplicity, we will
use the synthetic dataset outlined in our manuscript uploaded to bioArxiv.
The dataset files are provided in the inst
directory: link.
Importing TrackSOM package
If you have not installed the TrackSOM package, please use
devtools
to install the package from the following TrackSOM github repo.
For devtools, the repo parameter will be:
ghar1821/TrackSOM
.
Importing dataset
TrackSOM supports dataset stored as either CSV or FCS files or as
data.table
object (enhanced version of R’s native
data.frame. See data.table
vignette for more details). The following sections shall show you
how to read in those files and pass them to the TrackSOM function.
Dataset as CSV files
Here, we assume that each file contains data belonging to one time-point. Hence you should have more than 1 CSV files. If this is not the case, please reformat your data files.
To import the dataset stored as CSV files, TrackSOM needs to know where the files are stored. These files’ location must be stored within a vector which get passed on to the TrackSOM function.
Important: the vector must be organised such as the first element is the data for the very first time-point, the 2nd element for the 2nd time-point, and so on.
First, we start with specifying the CSV files are. In this example, the dataset files are already stored within the package, so all you need to do is load it up:
data.files.fullpath <- c(
system.file("extdata", "synthetic_d0.csv", package = "TrackSOM"),
system.file("extdata", "synthetic_d1.csv", package = "TrackSOM"),
system.file("extdata", "synthetic_d2.csv", package = "TrackSOM"),
system.file("extdata", "synthetic_d3.csv", package = "TrackSOM"),
system.file("extdata", "synthetic_d4.csv", package = "TrackSOM")
)
For your dataset, please replace the content of the vector with the absolute path for the dataset files!
Let’s inspect the content of data.files.fullpath
:
print(data.files.fullpath)
## [1] "/home/runner/work/_temp/Library/TrackSOM/extdata/synthetic_d0.csv"
## [2] "/home/runner/work/_temp/Library/TrackSOM/extdata/synthetic_d1.csv"
## [3] "/home/runner/work/_temp/Library/TrackSOM/extdata/synthetic_d2.csv"
## [4] "/home/runner/work/_temp/Library/TrackSOM/extdata/synthetic_d3.csv"
## [5] "/home/runner/work/_temp/Library/TrackSOM/extdata/synthetic_d4.csv"
You can see it contains a list of absolute path of multiple CSV files, each belonging to a time-point.
Dataset as FCS files
The procedure is very similar to reading CSV files above, except we’re going to be storing FCS files’ path rather than CSV.
Again, here, we assume that each file contains data belonging to one time-point. Hence you should have more than 1 FCS files. If this is not the case, please reformat your data files.
To import the dataset stored as FCS files, TrackSOM needs to know where the files are stored. These files’ location must be stored within a vector which get passed on to the TrackSOM function. Important: the vector must be organised such as the first element is the data for the very first time-point, the 2nd element for the 2nd time-point, and so on.
First, we start with specifying the FCS files are. In this example, the dataset files are already stored within the package, so all you need to do is load it up:
data.files.fullpath.fcs <- c(
system.file("extdata", "synthetic_d0.fcs", package = "TrackSOM"),
system.file("extdata", "synthetic_d1.fcs", package = "TrackSOM"),
system.file("extdata", "synthetic_d2.fcs", package = "TrackSOM"),
system.file("extdata", "synthetic_d3.fcs", package = "TrackSOM"),
system.file("extdata", "synthetic_d4.fcs", package = "TrackSOM")
)
For your dataset, please replace the content of the vector with the absolute path for the dataset files!
Let’s inspect the content of
data.files.fullpath.fcs
:
print(data.files.fullpath.fcs)
## [1] "/home/runner/work/_temp/Library/TrackSOM/extdata/synthetic_d0.fcs"
## [2] "/home/runner/work/_temp/Library/TrackSOM/extdata/synthetic_d1.fcs"
## [3] "/home/runner/work/_temp/Library/TrackSOM/extdata/synthetic_d2.fcs"
## [4] "/home/runner/work/_temp/Library/TrackSOM/extdata/synthetic_d3.fcs"
## [5] "/home/runner/work/_temp/Library/TrackSOM/extdata/synthetic_d4.fcs"
You can see it contains a list of absolute path of multiple FCS files, each belonging to a time-point.
Dataset as a list of data.table
objects
Sometimes, it is convenient to have the dataset stored as the
data.table
object files, e.g. when you need to run some
code to preprocess your data using R. As an example, supposed the
synthetic dataset CSV files were already read in as
data.table
object prior to running TrackSOM (say you did
some preliminary clean up or filtering).
What you need to do is organise them in a list such that each element
is a data.table
object for the dataset in a time-point.
Important: the list must be organised such as the first element
is the data for the very first time-point, the 2nd element for the 2nd
time-point, and so on.
## [[1]]
## x y z timepoint
## <num> <num> <num> <char>
## 1: 9.598217 11.728198 11.382064 Mock
## 2: 8.937549 8.653750 11.107314 Mock
## 3: 9.632932 9.657278 9.115804 Mock
## 4: 9.770393 12.478038 11.451082 Mock
## 5: 30.426699 31.290929 28.901204 Mock
## ---
## 7096: 10.533174 9.462175 9.867354 Mock
## 7097: 10.544833 10.426697 9.571881 Mock
## 7098: 10.140938 8.727880 10.247086 Mock
## 7099: 30.967131 30.261241 30.469085 Mock
## 7100: 10.063984 10.289172 10.382876 Mock
##
## [[2]]
## x y z timepoint
## <num> <num> <num> <char>
## 1: 33.850857 30.653891 29.257884 SYN-1
## 2: 10.429818 13.880273 10.878985 SYN-1
## 3: 10.009247 10.327127 10.994237 SYN-1
## 4: 9.751578 8.555277 8.914872 SYN-1
## 5: 10.456213 10.972776 10.011441 SYN-1
## ---
## 7096: 9.237145 10.203747 12.475563 SYN-1
## 7097: 12.957000 9.147637 8.807634 SYN-1
## 7098: 12.297580 11.034141 9.828642 SYN-1
## 7099: 9.393480 11.766797 11.137156 SYN-1
## 7100: 11.684874 8.864774 11.771160 SYN-1
##
## [[3]]
## x y z timepoint
## <num> <num> <num> <char>
## 1: 10.107081 10.371958 8.862552 SYN-2
## 2: 12.271113 28.710774 13.945666 SYN-2
## 3: 10.248482 10.249500 7.918076 SYN-2
## 4: 9.341533 10.228882 12.209994 SYN-2
## 5: 10.116334 9.805188 9.742291 SYN-2
## ---
## 7196: 10.682697 13.643521 10.637965 SYN-2
## 7197: 27.598520 30.275747 29.691414 SYN-2
## 7198: 10.728449 10.020652 9.517445 SYN-2
## 7199: 10.167862 9.754739 9.416342 SYN-2
## 7200: 9.554067 11.296156 12.535182 SYN-2
##
## [[4]]
## x y z timepoint
## <num> <num> <num> <char>
## 1: 13.864150 8.884197 8.742413 SYN-3
## 2: 25.420014 40.961726 26.453685 SYN-3
## 3: 23.968729 31.313815 31.314857 SYN-3
## 4: 10.538730 9.855319 9.061674 SYN-3
## 5: 8.814686 12.068781 10.729973 SYN-3
## ---
## 7196: 10.358157 8.856995 10.440222 SYN-3
## 7197: 10.093926 11.400187 10.429873 SYN-3
## 7198: 14.720125 10.599644 9.599185 SYN-3
## 7199: 24.545254 29.734070 29.035897 SYN-3
## 7200: 9.967504 14.351193 11.619231 SYN-3
##
## [[5]]
## x y z timepoint
## <num> <num> <num> <char>
## 1: 10.088854 10.171541 10.122280 SYN-4
## 2: 14.776449 10.189092 10.621575 SYN-4
## 3: 9.652116 12.567544 9.282672 SYN-4
## 4: 10.436809 10.725664 10.016877 SYN-4
## 5: 10.235657 10.966951 9.371363 SYN-4
## ---
## 7096: 21.234807 30.522073 30.590687 SYN-4
## 7097: 9.703690 8.319128 9.579323 SYN-4
## 7098: 17.654003 30.891041 29.390515 SYN-4
## 7099: 9.089960 9.744563 12.483508 SYN-4
## 7100: 36.910728 29.551236 29.926378 SYN-4
As you can see, there are 5 elements in the list, each containing a dataset belonging to a time-point.
Dataset as a data.table
object
You can also represent your dataset as just one
data.table
object containing a column denoting which
time-point each cell comes from.
timepoints <- seq(0, 4)
dat <- lapply(seq(length(data.files.fullpath)), function(data_file_i) {
dt <- fread(data.files.fullpath[[data_file_i]])
dt[['timepoint']] <- timepoints[data_file_i]
return(dt)
})
dat <- rbindlist(dat)
head(dat)
## x y z timepoint
## <num> <num> <num> <int>
## 1: 9.598217 11.728198 11.382064 0
## 2: 8.937549 8.653750 11.107314 0
## 3: 9.632932 9.657278 9.115804 0
## 4: 9.770393 12.478038 11.451082 0
## 5: 30.426699 31.290929 28.901204 0
## 6: 30.789497 30.069909 30.767145 0
tail(dat)
## x y z timepoint
## <num> <num> <num> <int>
## 1: 7.954781 9.241713 9.429919 4
## 2: 21.234807 30.522073 30.590687 4
## 3: 9.703690 8.319128 9.579323 4
## 4: 17.654003 30.891041 29.390515 4
## 5: 9.089960 9.744563 12.483508 4
## 6: 36.910728 29.551236 29.926378 4
Here, the timepoint column denotes which timepoint the cell comes from.
Running TrackSOM
Depending on how your dataset is stored, the parameter
inputFiles
is either:
- a vector of absolute path of your CSV or FCS files
- a list of
data.table
object. - A
data.table
object
If inputFiles
is option 3 above, make sure you specify
the following parameters:
-
timepointCol
: the name of the column denoting the timepoint of your cells. -
timepoints
: a vector of timepoints in order. Make sure these values exist intimepointCol
column.
Other parameters
Note: examples here assume datasets are stored as CSV files.
The TrackSOM function have various parameters. Of them, only
inputFiles
and colsToUse
have no default
values. The inputFiles
parameter has been explained in the
previous section.
The colsToUse
parameter specify the columns in your
dataset to be used for clustering and tracking. This must be a vector.
For the synthetic dataset, the columns are denoted as x
,
y
, and z
. For cytometry data, this should be a
vector of markers.
The TrackSOM functions have the following parameters which are pre-filled with default values:
- tracking: TRUE (default) or FALSE, whether to perform tracking.
- noMerge: TRUE or FALSE (default), whether to allow merging of meta-clusters (FALSE) or not (TRUE). If noMerge is set to TRUE, meta-clusters are only allowed to split, but not merge.
- maxMeta: NULL (default) or round number. Maximum number of meta clusters for all time point. Only used for Autonomous Adaptive operation.
- nClus: NULL (default) or numeric or a vector of numbers. If single number, TrackSOM will run Prescribed Invariant operation. Otherwise, it will run Prescribed Variant operation.
- seed: 42 (default). Random seed number.
- xdim: 10 (default). SOM grid size.
- ydim: 10 (default). SOM grid size.
The function also accept parameters that are built into FlowSOM
ReadInput
, BuildSOM
and BuildMST
functions. See FlowSOM’s vignette for specific parameter
information.
Let’s run TrackSOM with the following settings:
- No merging of meta-clusters allowed
(
noMerge = TRUE
) - Enable tracking (
tracking = TRUE
) - Prescribed Variant producing 3, 3, 9, 7, 15 meta-clusters for each time-point.
The remaining parameters will be set to the default values.
tracksom.result <- TrackSOM(inputFiles = data.files.fullpath,
colsToUse = c('x', 'y', 'z'),
tracking = TRUE,
noMerge = TRUE,
nClus = c(3,3,9,7,15),
dataFileType = ".csv"
)
## Building SOM
## Mapping data to SOM
## Building MST
## Extracting SOM nodes for each time point
## Running meta clustering
## Meta clustering time point 1 with 82 SOM nodes
## Meta clustering time point 2 with 93 SOM nodes
## Meta clustering time point 3 with 90 SOM nodes
## Meta clustering time point 4 with 95 SOM nodes
## Meta clustering time point 5 with 95 SOM nodes
Extracting meta-clusters ID and SOM nodes
TrackSOM result is stored in an object. To facilitate the extraction of meta-clusters ID and SOM nodes for each cell, we provide the following functions:
-
ConcatenateClusteringDetails
: assuming your dataset is stored as adata.table
object, the function attaches the meta-cluster ID and SOM nodes as separate columns. -
ExportClusteringDetailsOnly
: this function simply extract the meta-cluster ID and SOM nodes for each cell as adata.table
object. The cells are ordered based on the ordering of your dataset.
To use ConcatenateClusteringDetails
, you need to first
read in all your datasets as one giant data.table
.
data.files <- c(
system.file("extdata", "synthetic_d0.csv", package = "TrackSOM"),
system.file("extdata", "synthetic_d1.csv", package = "TrackSOM"),
system.file("extdata", "synthetic_d2.csv", package = "TrackSOM"),
system.file("extdata", "synthetic_d3.csv", package = "TrackSOM"),
system.file("extdata", "synthetic_d4.csv", package = "TrackSOM")
)
dat <- lapply(data.files, function(f) fread(f))
dat <- rbindlist(dat)
Double check the data is read in properly:
head(dat)
## x y z timepoint
## <num> <num> <num> <char>
## 1: 9.598217 11.728198 11.382064 Mock
## 2: 8.937549 8.653750 11.107314 Mock
## 3: 9.632932 9.657278 9.115804 Mock
## 4: 9.770393 12.478038 11.451082 Mock
## 5: 30.426699 31.290929 28.901204 Mock
## 6: 30.789497 30.069909 30.767145 Mock
tail(dat)
## x y z timepoint
## <num> <num> <num> <char>
## 1: 7.954781 9.241713 9.429919 SYN-4
## 2: 21.234807 30.522073 30.590687 SYN-4
## 3: 9.703690 8.319128 9.579323 SYN-4
## 4: 17.654003 30.891041 29.390515 SYN-4
## 5: 9.089960 9.744563 12.483508 SYN-4
## 6: 36.910728 29.551236 29.926378 SYN-4
To run ConcatenateClusteringDetails
, you need to pass
the following parameters:
-
timepoint.col
: which column in your data define the time-point. -
timepoints
: what are the time-points (in order). This must be a vector.
Now let’s attach the meta-cluster ID and the SOM nodes:
dat.clust <- ConcatenateClusteringDetails(
tracksom.result = tracksom.result,
dat = dat,
timepoint.col = "timepoint",
timepoints = c('Mock', 'SYN-1', 'SYN-2', 'SYN-3', 'SYN-4')
)
Inspect the content:
head(dat.clust)
## x y z timepoint TrackSOM_cluster
## <num> <num> <num> <char> <num>
## 1: 9.598217 11.728198 11.382064 Mock 62
## 2: 8.937549 8.653750 11.107314 Mock 53
## 3: 9.632932 9.657278 9.115804 Mock 86
## 4: 9.770393 12.478038 11.451082 Mock 61
## 5: 30.426699 31.290929 28.901204 Mock 16
## 6: 30.789497 30.069909 30.767145 Mock 6
## TrackSOM_metacluster TrackSOM_metacluster_lineage_tracking
## <fctr> <fctr>
## 1: 2 B
## 2: 2 B
## 3: 2 B
## 4: 2 B
## 5: 1 A
## 6: 1 A
The function attaches extra 3 columns:
-
TrackSOM_cluster
: The SOM node of each cell. -
TrackSOM_metacluster
: The meta-cluster assignment produced by FlowSOM’s meta-clustering. -
TrackSOM_metacluster_lineage_tracking
: The tracking of meta-clusters’ evolution produced by TrackSOM. This gives you the changes undergone by the meta-clusters over time.
The function gives back a data.table
object which you
can export as CSV file using data.table
’s
fwrite
function.
Visualisations
TrackSOM offers 2 visualisation mediums:
- Network plot.
- Timeseries heatmap.
Network plot
To draw network plots, we need to call the
DrawNetworkPlot
function. Using the clustered and tracked
data from previous sections, the following is an example on how to use
the function:
DrawNetworkPlot(dat = dat.clust,
timepoint.col = "timepoint",
timepoints = c('Mock', 'SYN-1', 'SYN-2', 'SYN-3', 'SYN-4'),
cluster.col = 'TrackSOM_metacluster_lineage_tracking',
marker.cols = c('x', 'y', 'z'))
## Calculating edges
## Computing node details
## Calculating marker's average per node
## Saving node and edge details
## Start drawing plots
## Warning: Existing variables `x` and `y` overwritten by layout variables
## Drawing plots coloured by time point
## Drawing plots coloured by origin
## Drawing plots coloured by x
## Drawing plots coloured by y
## Drawing plots coloured by z
The marker.cols
will determine the markers
which mean/median expression will be drawn on the network plots. There
is no need to specify all the markers in the dataset, just the ones that
you want the network plots to be coloured on.
The function won’t preview any plots, but it will instead store the plots as image files and some extra information (median/mean expression of markers) as CSV files:
## [1] "network_colBy_origin.pdf" "network_colBy_timepoints.pdf"
## [3] "network_colBy_x.pdf" "network_colBy_y.pdf"
## [5] "network_colBy_z.pdf" "network_plot_edge_details.csv"
## [7] "network_plot_node_details.csv" "TrackSOM-workflow.Rmd"
In this example, the plots are saved as PDF files. This can be
changed, e.g. to save the plots as PNG files, by specifying the desired
file format as the file.format
parameter.
Timeseries heatmap
To draw a timeseries heatmap, we need to call the
DrawTimeseriesHeatmap
function. Using the clustered and
tracked data from previous sections, the following is an example on how
to use the function:
DrawTimeseriesHeatmap(dat = dat.clust,
timepoint.col = "timepoint",
timepoints = c('Mock', 'SYN-1', 'SYN-2', 'SYN-3', 'SYN-4'),
cluster.col = 'TrackSOM_metacluster_lineage_tracking',
marker.cols = c('x', 'y', 'z'))
## Computing node details
## Computing edge details
## Saving node and edge details
## Drawing timeseries heatmap coloured by x
## Drawing timeseries heatmap coloured by y
## Drawing timeseries heatmap coloured by z
The marker.cols
will determine the markers
which mean/median expression will be drawn on the network plots. There
is no need to specify all the markers in the dataset, just the ones that
you want the network plots to be coloured on.
The function won’t preview any plots, but it will instead store the plots as image files and some extra information (median/mean expression of markers) as CSV files:
## [1] "network_colBy_origin.pdf"
## [2] "network_colBy_timepoints.pdf"
## [3] "network_colBy_x.pdf"
## [4] "network_colBy_y.pdf"
## [5] "network_colBy_z.pdf"
## [6] "network_plot_edge_details.csv"
## [7] "network_plot_node_details.csv"
## [8] "Timeseries_heatmap_by_x.pdf"
## [9] "Timeseries_heatmap_by_y.pdf"
## [10] "Timeseries_heatmap_by_z.pdf"
## [11] "timeseries_heatmap_edges_details.csv"
## [12] "timeseries_heatmap_node_details.csv"
## [13] "TrackSOM-workflow.Rmd"
In this example, the plots are saved as PDF files. This can be
changed, e.g. to save the plots as PNG files, by specifying the desired
file format as the file.format
parameter.