Pathogenic human variant that dislocates GATA2 zinc fingers disrupts hematopoietic gene expression and signaling networks

Although certain human genetic variants are conspicuously loss of function, decoding the impact of many variants is challenging. Previously, we described a patient with leukemia predisposition syndrome (GATA2 deficiency) with a germline GATA2 variant that inserts 9 amino acids between the 2 zinc fingers (9aa-Ins). Here, we conducted mechanistic analyses using genomic technologies and a genetic rescue system with Gata2 enhancer–mutant hematopoietic progenitor cells to compare how GATA2 and 9aa-Ins function genome-wide. Despite nuclear localization, 9aa-Ins was severely defective in occupying and remodeling chromatin and regulating transcription. Variation of the inter–zinc finger spacer length revealed that insertions were more deleterious to activation than repression. GATA2 deficiency generated a lineage-diverting gene expression program and a hematopoiesis-disrupting signaling network in progenitors with reduced granulocyte-macrophage colony-stimulating factor (GM-CSF) and elevated IL-6 signaling. As insufficient GM-CSF signaling causes pulmonary alveolar proteinosis and excessive IL-6 signaling promotes bone marrow failure and GATA2 deficiency patient phenotypes, these results provide insight into mechanisms underlying GATA2-linked pathologies.

hi-77 +/+ cells infected with empty vector and hi-77 −/− cells infected with empty vector or GATA2 were collected on poly-L-lysine-coated slides (Electron Microscopy Sciences) and fixed with 3.7% paraformaldehyde in PBS for 10 min at room temperature. Slides were washed with PBS and permeabilized with 0.2% Triton X-100 for 10 min at room temperature. Washed slides were blocked with 3% BSA in PBS with 0.1% Tween 20 for 1 h at room temperature and incubated with rabbit anti-HA antibody (Cell Signaling Technology, 3724) in 3% BSA at 4 °C overnight. After washing, slides were incubated with Alexa 594 secondary antibody for 1 h at room temperature, washed, and mounted using Vectashield mounting medium with DAPI (Vector Laboratories). Images were acquired with a Nikon A1R-S confocal microscope (Nikon).

Gene expression analysis.
Total RNA was purified from 1-5 × 10⁵ cells with TRIzol (Invitrogen) and 2 µg RNA was [...] Table S5 for primers. [...] have at least two-fold changes and an adjusted p-value < 0.05. Gene ontology analysis was performed with DAVID (version 6.8) (6). Heatmaps of gene expression levels were prepared using ComplexHeatmap (7). A value of 10⁻³ was added to the fragments per kilobase of transcript per million mapped reads (FPKM) values to avoid taking the logarithm of zero.
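The thresholds and the pseudocount described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names are hypothetical, the log base (base 2) is an assumption because the text does not state it, and "at least two-fold" is read as |log2 fold change| ≥ 1.

```python
import math

PSEUDOCOUNT = 1e-3      # value added to FPKM, as stated in the text
MIN_ABS_LOG2FC = 1.0    # "at least two-fold" read as |log2FC| >= 1 (assumption)
MAX_PADJ = 0.05         # adjusted p-value cutoff, as stated in the text


def log_fpkm(fpkm):
    """Return log2(FPKM + pseudocount); base 2 is an assumption."""
    return math.log2(fpkm + PSEUDOCOUNT)


def is_deg(log2_fold_change, padj):
    """True when the change is at least two-fold and adjusted p < 0.05."""
    return abs(log2_fold_change) >= MIN_ABS_LOG2FC and padj < MAX_PADJ
```

With the pseudocount, a gene with FPKM = 0 maps to a finite value instead of raising a math domain error, which is what makes a log-scaled heatmap possible for unexpressed genes.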
Reads were shifted +4 bp or −5 bp for the positive and negative strand (11), respectively, using the "awk" command. ATAC-seq peaks were identified from shifted-read BED files using MACS v2.0 (12) with the parameter "--nomodel". Normalized wiggle files were generated from shifted-read BED files using a custom Python script. BigWig files were then generated using the wigToBigWig command with the "-clip" parameter (13).
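As an illustration of the strand-dependent read shift, the following Python sketch applies the +4 bp and −5 bp offsets to a single BED record. It is a stand-in for the awk command, which is not given in the text. The function name is hypothetical, and shifting both the start and end coordinates (rather than only the 5′ end) and clamping at position 0 are assumptions.

```python
def shift_read(chrom, start, end, strand):
    """Shift a BED interval by +4 bp (plus strand) or -5 bp (minus strand).

    Both coordinates are moved by the same offset (an assumption), and
    coordinates are clamped so they never fall below 0.
    """
    if strand == "+":
        offset = 4
    elif strand == "-":
        offset = -5
    else:
        raise ValueError(f"unexpected strand: {strand!r}")
    new_start = max(0, start + offset)
    new_end = max(new_start, end + offset)
    return (chrom, new_start, new_end, strand)
```

For example, `shift_read("chr1", 100, 150, "+")` moves the interval to 104-154, and the same interval on the minus strand moves to 95-145.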
Peaks were called with MACS2 using the parameter "--nolambda" for four biological replicates each of hi-77 −/− cells infected with empty vector, hi-77 −/− cells infected with GATA2-expressing retrovirus, and hi-77 −/− cells infected with 9aa-Ins-expressing retrovirus. Other parameters were set as default. To ensure that peaks were consistent within each condition, we used IDR (Irreproducible Discovery Rate) to compare peaks from pairs of replicates with "--idr-threshold 0.05". Peaks were filtered to retain those narrower than 1 kb. Using these peaks, a master peak list was created by merging all overlapping ATAC-seq peaks. Read counts within each master peak region were retrieved from the aligned BAM files with BEDTools. A count matrix was built to summarize the ATAC-seq read count for each biological sample in each master peak region across all experimental conditions.
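The width filter and the construction of the master peak list can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: the function names are hypothetical, peaks are assumed to be half-open (chrom, start, end) tuples, and intervals that merely touch are assumed not to count as overlapping.

```python
from collections import defaultdict

MAX_PEAK_WIDTH = 1000  # peaks must be narrower than 1 kb, as stated in the text


def filter_by_width(peaks, max_width=MAX_PEAK_WIDTH):
    """Keep peaks whose width (end - start) is below max_width."""
    return [p for p in peaks if p[2] - p[1] < max_width]


def merge_overlapping(peaks):
    """Merge overlapping (chrom, start, end) peaks into a master list."""
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        intervals = sorted(by_chrom[chrom])
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start < cur_end:           # strict overlap (assumption)
                cur_end = max(cur_end, end)
            else:
                merged.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        merged.append((chrom, cur_start, cur_end))
    return merged
```

Pooling the filtered peaks from all conditions before calling `merge_overlapping` yields one shared set of regions, so every sample is counted over identical coordinates in the count matrix.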

ATAC-seq peak annotation.
The R package ChIPseeker (15) was used to annotate the genomic features of differentially accessible peaks, with the maximum range from the promoter to the transcription start site (TSS) set at 3 kb. To build associations between ATAC-seq peaks and genes, the differentially accessible peaks were assigned to the nearest genes, based on the distance of the peak region to the TSS, or to the genes that the peak overlapped. We considered only protein-coding genes. Distal intergenic peaks more than 100 kb away from the TSS were removed. This annotation was used in the bar plot, violin plot, and heatmap. For the peaks used in the violin plot, we filtered out only regions lacking peaks and categorized the remaining peaks as gain, loss, or no change based on differential accessibility (Table S6).
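The two distance criteria above can be expressed as a short Python sketch operating on values that ChIPseeker reports for each peak (an annotation label and a signed distance to the TSS). The function names are hypothetical, and treating the 3 kb promoter range as symmetric around the TSS is an assumption.

```python
PROMOTER_RANGE_BP = 3_000     # promoter range stated in the text
DISTAL_CUTOFF_BP = 100_000    # distal intergenic cutoff stated in the text


def is_promoter(distance_to_tss):
    """True when the peak lies within 3 kb of the TSS (symmetric; assumption)."""
    return abs(distance_to_tss) <= PROMOTER_RANGE_BP


def keep_peak(annotation, distance_to_tss):
    """Drop distal intergenic peaks located more than 100 kb from the TSS."""
    is_distal = annotation == "Distal Intergenic"
    return not (is_distal and abs(distance_to_tss) > DISTAL_CUTOFF_BP)
```

Only peaks labeled distal intergenic are subject to the 100 kb cutoff in this reading; a peak inside a long intron would be kept regardless of its distance to the TSS.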
After obtaining the peak categorization for each gene, we merged the peaks in each gene by the following rules. If there was only "gain", we classified the gene as "gain". If there was only "loss", we classified the gene as "loss". If both "gain" and "loss" were applicable, we classified the gene as "both". If neither was present, the gene was classified as either "open to open" or "closed to closed", based on whether peaks existed in the gene.
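The gene-level rules above can be written as a small Python function. This is a sketch of one reading of the rules, not the authors' code: the function name and the label "no change" for unchanged peaks are assumptions, and a gene with peaks but no gain or loss is taken to be "open to open", whereas a gene with no peaks is taken to be "closed to closed".

```python
def classify_gene(peak_categories):
    """Collapse the per-peak categories of one gene into a single label.

    peak_categories: iterable of "gain", "loss", or "no change"
    (one entry per peak assigned to the gene; may be empty).
    """
    categories = set(peak_categories)
    has_gain = "gain" in categories
    has_loss = "loss" in categories

    if has_gain and has_loss:
        return "both"
    if has_gain:
        return "gain"
    if has_loss:
        return "loss"
    # No gain or loss: the label depends on whether any peak exists (assumption).
    return "open to open" if categories else "closed to closed"
```

For instance, a gene with one "gain" peak and two "no change" peaks is classified as "gain", because unchanged peaks do not affect the outcome once a gain or loss is present.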

ATAC-seq motif analyses.
Peaks categorized as gain, loss, or no change based on differential accessibility (Table S6) were used as inputs for the motif analysis. We associated gain and loss peaks with the nearest RNA-seq DEG with the same direction of regulation (activation or repression).
Instead of merging the peaks associated with each gene, we used all peaks as the input. HOMER software (version 4.11) (16) was used for motif-based sequence analysis, and the findMotifsGenome.pl function was used to identify known motifs and de novo motifs in the different conditions. All parameters in the function were set as default. Motifs in the enrichment analysis (Figure 5D) were chosen such that q-values were smaller than 0.05 and the percentage of motifs in target sequences was larger than 30%. The top 5 motifs from each comparison were chosen for the discriminative analysis (Figure S4C). The heatmaps depicting motif enrichment were generated using the R package pheatmap (https://www.rdocumentation.org/packages/pheatmap/versions/0.2/topics/pheatmap).
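The motif selection criteria can be illustrated with the Python sketch below, which works on already-parsed motif records rather than on HOMER output files. The dictionary keys and function names are hypothetical. Ranking the top 5 motifs by q-value is also an assumption, since the text does not state the ranking criterion.

```python
def select_enriched(motifs, max_q=0.05, min_target_pct=30.0):
    """Keep motifs with q-value < 0.05 that occur in > 30% of target sequences."""
    return [
        m for m in motifs
        if m["q_value"] < max_q and m["target_pct"] > min_target_pct
    ]


def top_motifs(motifs, n=5):
    """Return the n motifs with the smallest q-values (ranking is an assumption)."""
    return sorted(motifs, key=lambda m: m["q_value"])[:n]


# Illustrative records only; these are not results from the study.
example = [
    {"name": "motif_A", "q_value": 0.001, "target_pct": 45.0},
    {"name": "motif_B", "q_value": 0.20, "target_pct": 60.0},
    {"name": "motif_C", "q_value": 0.01, "target_pct": 12.0},
]
enriched_names = [m["name"] for m in select_enriched(example)]
```

In this toy input only `motif_A` passes both filters: `motif_B` fails the q-value cutoff and `motif_C` fails the target-percentage cutoff.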
CUT&Tag was conducted as described (20). hi-77 −/− cells infected with HA-GATA2 or 9aa-Ins were cultured for 3 days and sorted for live, GFP⁺ cells. Collected cells were pooled, lightly fixed with 0.1% formaldehyde at room temperature for 2 min, and split into 2 replicates per condition (1.25 × 10⁵ cells per replicate). Antibodies used were rabbit polyclonal anti-GATA2 (1, 2) and rabbit monoclonal anti-HA-tag (Cell Signaling Technology, 3724).

CUT&Tag peak annotation.
The CUT&Tag data analysis pipeline was implemented as described (21). Raw reads were aligned to the mouse reference genome (mm10) using Bowtie2 (9). Reads that were duplicated or mapped to blacklist regions (22) were removed from the analysis.
The resulting BAM files were sorted and subjected to MACS3 (12) for peak calling. For visualization of the binding profiles, deepTools (19) bamCoverage was used to generate coverage tracks (bigWig). Peaks called by MACS3 at a q-value cutoff of 1e-6 in the 2 replicates were merged with HOMER (16) mergePeaks. Differential peak analysis was performed in R using the Bioconductor package DiffBind (23, 24). Peaks with an FDR < 0.05 were considered significantly differentially enriched. DiffBind results were used to merge peaks from prior analysis.

Quantitative chromatin immunoprecipitation (ChIP-qPCR).
ChIP analysis was conducted as described (25). hi-77 −/− cells infected with HA-GATA2, 9aa-Ins, or control vector were cultured for 3 days with 2 µg/ml puromycin. Samples containing 3 × 10⁶ cells were crosslinked with 1% formaldehyde for 10 min. Lysates were immunoprecipitated with rabbit anti-HA antibody (Cell Signaling Technology, 3724) using rabbit normal IgG (Cell Signaling Technology, 2729) as a control. DNA was quantified by real-time PCR (Applied Biosystems ViiA 7 instrument) with SYBR green fluorescence, and product was quantified relative to a standard curve created from serial dilution of input chromatin.
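Quantification against a standard curve of serially diluted input can be sketched in Python as follows. This is a generic illustration under stated assumptions, not the instrument software or the authors' calculation: the function names are hypothetical, and the curve is assumed to be a least-squares line relating threshold cycle (Ct) to log10 of the input fraction.

```python
import math


def fit_standard_curve(input_fractions, ct_values):
    """Fit Ct = slope * log10(input fraction) + intercept by least squares."""
    xs = [math.log10(f) for f in input_fractions]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ct_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def relative_quantity(ct, slope, intercept):
    """Convert a sample Ct to a fraction of input using the fitted curve."""
    return 10 ** ((ct - intercept) / slope)


# Illustrative dilution series only; these are not measurements from the study.
slope, intercept = fit_standard_curve([1.0, 0.1, 0.01], [20.0, 23.32, 26.64])
sample_fraction = relative_quantity(23.32, slope, intercept)
```

In the illustrative series each 10-fold dilution raises Ct by 3.32 cycles, so a sample with the same Ct as the middle standard is assigned the same fraction of input as that standard.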