G4X-output structure

Sample directory tree

Directory structure depends on run type.

Transcript & Protein

<sample_root>
│
├── diagnostics
│   └── transcript_table.parquet
│
├── g4x_viewer 
│   ├── <sample_1>.bin
│   ├── <sample_1>.ome.tiff              
│   ├── <sample_1>.tar
│   ├── <sample_1>_HE.ome.tiff
│   ├── <sample_1>_nuclear.ome.tiff
│   └── <sample_1>_run_metadata.json
│
├── h_and_e 
│   ├── eosin.jp2
│   ├── eosin_thumbnail.png
│   ├── h_and_e.jp2
│   ├── h_and_e_thumbnail.jpg
│   ├── nuclear.jp2
│   └── nuclear_thumbnail.png
│
├── metrics 
│   ├── core_metrics.csv
│   ├── protein_core_metrics.csv
│   └── per_area_metrics.csv
│
├── protein                             
│   ├── <protein_1>.jp2
│   ├── <protein_1>_thumbnail.png
│   ├── <protein_2>.jp2
│   ├── <protein_2>_thumbnail.png
│   └── …
│
├── protein_panel.csv                   
│
├── rna
│   └── transcript_table.csv.gz
│
├── run_meta.json
│
├── samplesheet.csv
│
├── segmentation
│   └── segmentation_mask.npz
│
├── single_cell_data
│   ├── cell_by_protein.csv.gz          
│   ├── cell_by_transcript.csv.gz
│   ├── cell_metadata.csv.gz
│   ├── clustering_umap.csv.gz
│   ├── dgex.csv.gz
│   └── feature_matrix.h5
│
├── summary_<sample_1>.html
└── transcript_panel.csv

Transcript

<sample_root>
│
├── diagnostics
│   └── transcript_table.parquet
│
├── g4x_viewer
│   ├── <sample_1>.bin
│   ├── <sample_1>.tar
│   ├── <sample_1>_HE.ome.tiff
│   ├── <sample_1>_nuclear.ome.tiff
│   └── <sample_1>_run_metadata.json
│
├── h_and_e
│   ├── eosin.jp2
│   ├── eosin_thumbnail.png
│   ├── h_and_e.jp2
│   ├── h_and_e_thumbnail.jpg
│   ├── nuclear.jp2
│   └── nuclear_thumbnail.png
│
├── metrics
│   ├── core_metrics.csv
│   └── per_area_metrics.csv
│
├── rna
│   └── transcript_table.csv.gz
│
├── run_meta.json
│
├── samplesheet.csv
│
├── segmentation
│   └── segmentation_mask.npz
│
├── single_cell_data
│   ├── cell_by_transcript.csv.gz
│   ├── cell_metadata.csv.gz
│   ├── clustering_umap.csv.gz
│   ├── dgex.csv.gz
│   └── feature_matrix.h5
│
├── summary_<sample_1>.html
└── transcript_panel.csv
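
The two trees differ mainly in the protein outputs, so a script can infer the run type from what is present on disk. The following is a minimal sketch using only the Python standard library. The function names `detect_run_type` and `missing_outputs` and the returned labels are hypothetical, and the check assumes the `protein` folder or `protein_panel.csv` appears only in Transcript & Protein runs, as shown in the trees above.

```python
from pathlib import Path

# Sub-directories that appear in both trees above.
COMMON_DIRS = [
    "diagnostics", "g4x_viewer", "h_and_e", "metrics",
    "rna", "segmentation", "single_cell_data",
]


def detect_run_type(sample_root):
    """Guess the run type from the presence of protein outputs (hypothetical helper)."""
    root = Path(sample_root)
    has_protein = (root / "protein").is_dir() or (root / "protein_panel.csv").is_file()
    return "transcript_and_protein" if has_protein else "transcript"


def missing_outputs(sample_root):
    """Return the expected sub-directories that are absent from sample_root."""
    root = Path(sample_root)
    expected = list(COMMON_DIRS)
    if detect_run_type(root) == "transcript_and_protein":
        expected.append("protein")
    return [name for name in expected if not (root / name).is_dir()]
```

Calling `missing_outputs("/path/to/sample_root")` on a complete output folder should return an empty list.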

Sample sub-directory reference

root of sample_folder

run_meta.json: JSON file containing versioning information for the panels used and analysis pipelines as well as the ID for the sequencer on which the experiment was run.

samplesheet.csv: CSV file containing detailed run information. Details the experimental design, flow cell layout, tissue type, panel utilized, etc. This file is useful for analysis, as it designates where each tissue section is positioned on the flow cell.

summary_<sample_id>.html: HTML file which gives a high level overview of the experiment outputs, data quality, and performance for the selected tissue block.

transcript_panel.csv: CSV file containing a full list of all targeted genes in this experiment and the panel(s) which they originated from.

protein_panel.csv: CSV file containing a full list of all targeted proteins in this experiment and the panel(s) which they originated from. Multiomics runs only.

diagnostics/

transcript_table.parquet: Parquet file containing all decoded and non-decoded transcripts and associated metadata (e.g. spatial coordinate, gene identity, cell identity (if assigned to a cell), quality score, sequence). Parquet files can be loaded in Python using the polars, fastparquet, pandas, and pyarrow packages.

Column descriptions:

column	type	description
`x_coord_shift`	float	The x coordinate for the transcript (shifted to global coordinates)
`y_coord_shift`	float	The y coordinate for the transcript (shifted to global coordinates)
`z`	int	z layer of identified transcript
`demuxed`	bool	Whether or not the transcript was demultiplexed
`transcript_condensed`	str	Shortened name of transcript
`meanQS`	float	Mean quality score for the transcript
`cell_id`	uint	Cell ID
`sequence_to_demux`	str	Sequence identified that will be demultiplexed
`transcript`	str	Long form transcript name (specific to single probe)
`TXUID`	str	Unique identifier for the transcript
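
As an illustration of working with these columns, the sketch below loads the Parquet file with pandas and keeps only demultiplexed transcripts above a quality cutoff. The helper names and the default cutoff of 20.0 are hypothetical and are not recommended values. `pandas.read_parquet` needs either pyarrow or fastparquet installed.

```python
import pandas as pd


def load_transcript_table(path):
    """Read diagnostics/transcript_table.parquet into a DataFrame."""
    # pandas delegates to pyarrow or fastparquet, whichever is installed.
    return pd.read_parquet(path)


def high_quality_decoded(df, min_mean_qs=20.0):
    """Keep demultiplexed rows whose meanQS is at or above min_mean_qs.

    The 20.0 default is an arbitrary placeholder, not a documented cutoff.
    """
    keep = df["demuxed"] & (df["meanQS"] >= min_mean_qs)
    return df.loc[keep].reset_index(drop=True)
```

A typical call would be `high_quality_decoded(load_transcript_table("diagnostics/transcript_table.parquet"))`.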

g4x_viewer/

The G4X-viewer is a web-based tool for visualizing and exploring G4X data. All files in this directory are designed to be loaded into and explored with the G4X-viewer. For more information on how to use the G4X-viewer, see G4X-viewer.

<sample_id>.bin: Binary file containing the segmentation mask for the stitched image. Can be easily read in Python with numpy.

<sample_id>.ome.tiff: Multidimensional OME-TIFF image file. On Windows, this may appear as <sample_id>.ome. This image contains aggregated images for all protein targets as well as the nuclear stain. Can be loaded into any standard OME-TIFF reader, including our G4X-viewer and napari. Multiomics runs only.

<sample_id>_HE.ome.tiff: OME-TIFF image file containing the fH&E stain images. Can be loaded into any standard OME-TIFF reader, including our G4X-viewer and napari.

<sample_id>_nuclear.ome.tiff: OME-TIFF image file containing the nuclear stain images. Can be loaded into any standard OME-TIFF reader, including our G4X-viewer and napari.

<sample_id>_run_metadata.json: JSON file containing much of the same information as run_meta.json, along with extra core metrics (such as tissue area and total transcripts).

<sample_id>.tar: Tarball containing all other files from this directory bundled into one file. This can be loaded into the G4X-viewer directly with the “single file upload” option to avoid dragging each file individually. The tarball may take longer to load than the individual files because its components must be extracted before they are displayed in the viewer.
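
To check what a tarball bundles without unpacking it, the standard-library `tarfile` module is enough. This is a minimal sketch, and `list_viewer_bundle` is a hypothetical helper name.

```python
import tarfile


def list_viewer_bundle(tar_path):
    """Return the sorted names of regular files inside the tarball."""
    # Mode "r" lets tarfile detect compression, if any, on its own.
    with tarfile.open(tar_path, "r") as tar:
        return sorted(member.name for member in tar.getmembers() if member.isfile())
```

The returned names can be compared against the other files in the g4x_viewer folder to confirm the bundle is complete.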

metrics/

core_metrics.csv: CSV file containing a set of core metrics for the tissue block including total transcripts, total area, number of cells and more.

protein_core_metrics.csv: CSV file containing a set of core protein metrics for the tissue block including SNR, background intensity, and Fisher's exact scores for the co-occurrence of the protein signal with its associated transcript signal (<protein>_fisher_score) and a random background (<protein>_fisher_score_background). These scores indicate the likelihood of the signal being true signal compared to the measured background. Multiomics runs only.

per_area_metrics.csv: CSV file containing a set of per-area metrics for the tissue block (coordinate location, number of transcripts, and number of cells), reported separately for each image captured before the images were stitched together into one whole block.

h_and_e/

Tip

The .jp2 images in this folder and the /protein/ folder are suitable to use for both nuclear and cytoplasmic segmentation. For more information on how you might do this, see segment data.

eosin.jp2: Full-sized eosin-stained JPEG 2000 image used for analysis purposes for the selected tissue block.

eosin_thumbnail.png: Downsampled PNG image from the .jp2 file for easier viewing of the eosin stain for the selected tissue block.

h_and_e.jp2: Full-sized fH&E JPEG 2000 image used for analysis purposes for the selected tissue block.

h_and_e_thumbnail.jpg: Downsampled JPEG image from the .jp2 file for easier viewing of the fH&E stain for the selected tissue block.

nuclear.jp2: Full-sized nuclear-stained JPEG 2000 image used for analysis purposes for the selected tissue block.

nuclear_thumbnail.png: Downsampled PNG image from the .jp2 file for easier viewing of the nuclear stain for the selected tissue block.

protein/ (protein runs only)

<protein_name>.jp2: Full-sized JPEG 2000 image used for analysis purposes. Shows the <protein_name> stain for the selected tissue block.

<protein_name>_thumbnail.png: Downsampled PNG image of the .jp2 file for easier viewing. Shows the <protein_name> stain for selected tissue block.

rna/

transcript_table.csv.gz: CSV file containing a transcript table showing all demuxed transcripts on the whole tissue block. Contains coordinate information, z-layer, gene identity, and cell_id fields. All transcripts here are high confidence transcripts post-filtering and processing.
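
The gzipped CSV can be streamed row by row without decompressing it to disk first. The sketch below tallies transcripts per gene using only the standard library. The column headers of this file are not listed in this section, so the gene column name is passed in by the caller. Both `count_transcripts_per_gene` and its `gene_column` argument are hypothetical.

```python
import csv
import gzip
from collections import Counter


def count_transcripts_per_gene(path, gene_column):
    """Count rows per gene in a gzipped transcript table.

    gene_column must be the header that holds gene identity in your file;
    check the header row first, since it is not specified here.
    """
    counts = Counter()
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle):
            counts[row[gene_column]] += 1
    return counts
```

`counts.most_common(10)` then gives the ten most abundant genes in the tissue block.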

segmentation/

segmentation_mask.npz: Compressed numpy array file containing the segmentation mask. This can be easily read with the numpy.load() function.
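
A minimal sketch of reading the archive with `numpy.load()` follows. The array names stored inside the .npz file are not listed in this section, so the code discovers them from the archive rather than hard-coding a key. The helper names are hypothetical, and `count_labels` assumes that 0 marks background, which should be verified against your data.

```python
import numpy as np


def load_npz_arrays(path):
    """Return a dict mapping each array name in the archive to its array."""
    with np.load(path) as archive:
        return {name: archive[name] for name in archive.files}


def count_labels(mask):
    """Count distinct non-zero labels, assuming 0 is background (an assumption)."""
    labels = np.unique(mask)
    return int((labels != 0).sum())
```

Printing `list(load_npz_arrays("segmentation/segmentation_mask.npz"))` shows which arrays the file actually contains.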

single_cell_data/

cell_by_protein.csv.gz: Gzipped CSV file in a cell x protein intensity format. Each entry in the table is the average protein intensity for a given protein in a given cell. Multiomics runs only.

cell_by_transcript.csv.gz: Gzipped CSV file in a cell x transcript format. Each entry in the table is the counts for a given transcript in a given cell.

cell_metadata.csv.gz: Gzipped CSV file containing the metadata associated with each cell, including cell_id, protein mean intensity, and transcript counts per cell, per transcript species. This is needed to create a Seurat object and perform downstream analyses. For more information, see data import.

Column descriptions:

column	type	description
`label`	str	Cell ID
`<protein>_intensity_mean`	float	Fluorescence intensity mean for a given protein
`cell_id`	str	Cell ID for a given cell
`cell_x/y`	float	Spatial X/Y coordinate for the nuclear segmentation centroid
`expanded_cell_x/y`	float	Spatial X/Y coordinate for the expanded nuclear segmentation centroid
`log1p_n_genes_by_counts`	float	Log number of unique genes detected
`log1p_total_counts`	float	Log number of total transcripts
`n_genes_by_counts`	int	Number of unique genes detected
`nuclei_area`	int	Area of the nuclear segmentation
`nuclei_expanded_area`	int	Area of the expanded nuclear segmentation
`total_counts`	int	Total transcript counts

clustering_umap.csv.gz: Gzipped CSV file containing the matrix of cell cluster annotations and UMAP coordinates for each cell. This is used to visualize the clustering of cells in 2D for a given Leiden resolution or UMAP embedding setting.

Column descriptions:

column	type	description
`label`	str	Cell ID
`leiden_<resolution>`	int	Leiden cluster identity for the cell at the specified resolution (0.2-1.0)
`X_umap_<min_dist>_<spread>_<axis>`	float	UMAP coordinate for the given cell with the given min_dist and spread for a given axis (1 is typically x/UMAP1, 2 is typically y/UMAP2)
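
Because the column names encode the clustering and embedding parameters, they can be parsed to find which settings a file contains. The sketch below is one way to do that with regular expressions. It assumes the parameter values themselves contain no underscores (for example `X_umap_0.1_1.0_1`), which is a guess about the exact formatting. The helper name `classify_columns` is hypothetical.

```python
import re

# Patterns follow the naming templates in the table above.
LEIDEN_RE = re.compile(r"^leiden_(?P<resolution>.+)$")
UMAP_RE = re.compile(
    r"^X_umap_(?P<min_dist>[^_]+)_(?P<spread>[^_]+)_(?P<axis>[^_]+)$"
)


def classify_columns(columns):
    """Split column names into Leiden and UMAP groups with their parameters.

    Values are kept as strings because their exact formatting is assumed.
    """
    leiden, umap = {}, {}
    for name in columns:
        match = LEIDEN_RE.match(name)
        if match:
            leiden[name] = match.group("resolution")
            continue
        match = UMAP_RE.match(name)
        if match:
            umap[name] = (
                match.group("min_dist"),
                match.group("spread"),
                match.group("axis"),
            )
    return leiden, umap
```

Passing the header row of clustering_umap.csv.gz to `classify_columns` gives the available resolutions and embeddings without reading the full table.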

dgex.csv.gz: Gzipped CSV file containing the differential gene expression (DGEx) results for the selected tissue block. Columns are detailed below.

Column descriptions:

column	type	description
`names`	str	Gene symbol
`scores`	float	Z-score from Wilcoxon rank-sum test
`logfoldchanges`	float	LogFoldChange for the given cluster compared to all other clusters combined
`pvals`	float	P-value from Wilcoxon rank-sum test
`pvals_adj`	float	Adjusted P-value
`pct_nz_group`	float	Percentage of non-zero values in the given cluster.
`pct_nz_reference`	float	Percentage of non-zero values in all cells outside the given cluster.
`group`	int	Leiden cluster identity
`leiden_res`	str	Leiden clustering resolution that this entry is derived from (1 per gene per cluster)
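
A common use of this table is to pull the top marker genes for each cluster. The sketch below filters rows by `leiden_res` and adjusted P-value, then ranks genes by `logfoldchanges` within each `group`. The function name, the 0.05 cutoff, and the default of five genes are hypothetical choices. The `leiden_res` string must match the values in your file exactly, since their formatting is not specified here.

```python
import csv
import gzip
from collections import defaultdict


def top_markers(path, leiden_res, max_padj=0.05, n_top=5):
    """Return {cluster: [gene, ...]} ranked by descending log fold change."""
    by_group = defaultdict(list)
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.DictReader(handle):
            if row["leiden_res"] != leiden_res:
                continue  # entry belongs to a different clustering resolution
            if float(row["pvals_adj"]) > max_padj:
                continue  # not significant at the chosen cutoff
            by_group[row["group"]].append(
                (float(row["logfoldchanges"]), row["names"])
            )
    return {
        group: [gene for _, gene in sorted(rows, reverse=True)[:n_top]]
        for group, rows in by_group.items()
    }
```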

feature_matrix.h5: H5ad file containing the full cell by gene matrix as well as a wide array of metadata and annotations you might want to use for downstream analysis. This file can be loaded into Python and run through scanpy or a number of other pipelines. See data import. The annotations in this file are detailed below.

Annotation layer descriptions:

name	type	layer	dimensions	description
`<protein>_intensity_mean`	float	obs	ncell x 1	Fluorescence intensity mean for a given protein per cell
`cell_id`	str	obs	ncell x 1	Cell ID for a given cell
`cell_x/y`	float	obs	ncell x 1	Spatial X/Y coordinate for the nuclear segmentation centroid
`expanded_cell_x/y`	float	obs	ncell x 1	Spatial X/Y coordinate for the expanded nuclear segmentation centroid per cell
`log1p_n_genes_by_counts`	float	obs	ncell x 1	Log number of unique genes detected per cell
`log1p_total_counts`	float	obs	ncell x 1	Log number of total transcripts per cell
`n_genes_by_counts`	int	obs	ncell x 1	Number of unique genes detected per cell
`nuclei_area`	float	obs	ncell x 1	Area of the nuclear segmentation per cell
`nuclei_expanded_area`	float	obs	ncell x 1	Area of the expanded nuclear segmentation per cell
`total_counts`	int	obs	ncell x 1	Total transcript counts per cell
`gene_id`	str	var	ngenes x 1	Gene symbol
`log1p_mean_counts`	float	var	ngenes x 1	Log mean transcript counts across all cells
`log1p_total_counts`	float	var	ngenes x 1	Log total transcript counts across all cells
`mean_counts`	float	var	ngenes x 1	Mean counts of each transcript across all cells
`modality`	str	var	ngenes x 1	G4X modality (transcript or protein)
`n_cells_by_counts`	int	var	ngenes x 1	Number of cells with counts of each transcript
`pct_dropout_by_counts`	float	var	ngenes x 1	Percentage of zero-count cells for each gene
`probe_type`	str	var	ngenes x 1	Type of probe: Negative control probe/sequence (NCP/NCS) or transcript targeting (targeting)
`total_counts`	int	var	ngenes x 1	Total transcript counts per gene

⸻