Documentation

  • Running GRNmap from Code:
    • The GRNmap code was developed and tested with MATLAB R2014b; it may not function properly with other versions of MATLAB.
    • GRNmap is only compatible with the Windows operating system because of the function it uses to read and write Microsoft Excel spreadsheets. It has only been tested on Windows 7.
    • We recommend running GRNmap with a minimum of 8.00 GB of RAM and a 2.40 GHz processor. GRNmap may be compatible with slower systems; the amount of RAM and processor speed will affect the speed with which GRNmap completes the estimation.

  • Installing and running the GRNmap stand-alone executable:
    • You must have administrator rights or have the MATLAB Compiler Runtime (MCR) already installed on your machine to use the stand-alone executable.
    • GRNmap is only compatible with the Windows operating system because of the function it uses to read and write Microsoft Excel spreadsheets. It has only been tested on Windows 7.
    • We recommend running GRNmap with a minimum of 8.00 GB of RAM and a 2.40 GHz processor. GRNmap may be compatible with slower systems; the amount of RAM and processor speed will affect the speed with which GRNmap completes the estimation.

production_rates sheet

This sheet contains initial guesses for the production rate parameters, P, for all genes in the network. Assuming that the system is in steady state with the relative expression of all genes equal to 1, production and degradation balance, (P/2) - lambda = 0, where lambda is the degradation rate, so P = 2*lambda is a reasonable initial guess. The sheet should contain two columns (from left to right) entitled "id" and "production_rate". The id is an identifier that the user will use to identify a particular gene. The "production_rate" column should then contain the initial guesses for the P parameter as described above, rounded to four decimal places. The genes should be listed in the same order in all the sheets in the Excel workbook.
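
As a minimal sketch of the steady-state guess described above, the following Python snippet derives P = 2*lambda from a table of degradation rates and rounds to four decimal places. The function name and the gene identifiers and values in the demo are hypothetical, chosen only for illustration; GRNmap itself does not provide this helper.

```python
def initial_production_rates(degradation_rates):
    """Return {gene_id: P}, using the steady-state guess P = 2 * lambda.

    degradation_rates maps each gene id to its (positive) degradation rate.
    Values are rounded to four decimal places, as the sheet expects.
    """
    return {gene: round(2.0 * lam, 4) for gene, lam in degradation_rates.items()}


if __name__ == "__main__":
    # Hypothetical degradation rates, for illustration only.
    rates = {"GENE_A": 0.0251, "GENE_B": 0.1003}
    for gene, p in initial_production_rates(rates).items():
        print(gene, p)
```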

degradation_rates sheet

This sheet contains degradation rates for all genes in the network, which are provided by the user. Currently, the Dahlquist Lab is using data based on published protein half-life data from Belle et al. (2006). We converted the half-life values to degradation rates by dividing the natural log of 2 by the half-life (degradation rate = ln(2) / half-life). The sheet should contain two columns (from left to right) entitled "id" and "degradation_rate". The id is an identifier that the user will use to identify a particular gene. The "degradation_rate" column should then contain the absolute value of the degradation rate for the corresponding gene as described above, rounded to four decimal places. The genes should be listed in the same order in all the sheets in the Excel workbook.
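
The conversion from half-life to degradation rate can be sketched as follows, assuming first-order decay, for which the rate equals ln(2) divided by the half-life. The function name and the half-life values in the demo are hypothetical; the resulting rate is in the reciprocal of whatever time unit the half-life uses.

```python
import math


def degradation_rate(half_life):
    """Convert a half-life to a first-order degradation rate, ln(2) / half-life.

    The result is rounded to four decimal places to match the sheet format.
    """
    if half_life <= 0:
        raise ValueError("half-life must be positive")
    return round(math.log(2) / half_life, 4)


if __name__ == "__main__":
    # Hypothetical half-lives (e.g. in minutes), for illustration only.
    for gene, t_half in {"GENE_A": 10.0, "GENE_B": 43.0}.items():
        print(gene, degradation_rate(t_half))
```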

Expression Data Sheets for Individual Yeast Strains

Expression data can be provided for either a single strain or multiple strains of yeast (for example, the wild type strain and a transcription factor deletion strain). Each strain will have its own sheet in the workbook. Each sheet should be given a unique name that follows the convention "STRAIN_log2_expression", where the word "STRAIN" is replaced by the strain designation, which will appear in the optimization_diagnostics sheet. The sheet should have the following columns in this order:

  1. "id": list of all genes. The genes should be listed in the same order in all the sheets in the Excel workbook.
  2. The next series of columns should contain the expression data for each gene at a given timepoint given as log2 ratios (log2 fold changes). The column header should be the time at which the data were collected, without any units. For example, the 15 minute timepoint would have a column header "15" and the 30 minute timepoint would have the column header "30". GRNmap supports replicate data for each of the timepoints. Replicate data for the same timepoint should be in columns immediately next to each other and have the same column headers. For example, three replicates of the 15 minute timepoint would have "15", "15", "15" as the column headers.
  3. If data are provided for multiple strains, each strain should have data for the same timepoints.
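
The column-header convention for replicates can be illustrated with a short sketch. It builds the header row for a STRAIN_log2_expression sheet and then groups column positions by timepoint, which is one way a reader of the sheet could recover the replicates. Both function names are hypothetical, and the sketch assumes the same number of replicates at every timepoint, which the format itself does not appear to require.

```python
def expression_headers(timepoints, replicates):
    """Build the header row: "id", then each timepoint repeated once per replicate."""
    headers = ["id"]
    for t in timepoints:
        headers.extend([str(t)] * replicates)
    return headers


def replicate_columns(headers):
    """Map each timepoint header to the column indices that hold its replicates."""
    groups = {}
    for idx, header in enumerate(headers[1:], start=1):  # skip the "id" column
        groups.setdefault(header, []).append(idx)
    return groups


if __name__ == "__main__":
    row = expression_headers([15, 30, 60], replicates=3)
    print(row)
    print(replicate_columns(row))
```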

For example, the 21-genes_50-edges_Dahlquist-data_MM_estimation.xls and 21-genes_50-edges_Dahlquist-data_Sigmoid_estimation.xls input workbooks found in the test_files > data_samples directory on GitHub have the following STRAIN_log2_expression sheets:

wt_log2_expression sheet

Log2 fold change expression data from the BY4741 wild type strain at 15, 30, and 60 minutes of cold shock.

dcin5_log2_expression sheet

Log2 fold change expression data from the BY4741 CIN5 deletion strain at 15, 30, and 60 minutes of cold shock.

dgln3_log2_expression sheet

Log2 fold change expression data from the BY4741 GLN3 deletion strain at 15, 30, and 60 minutes of cold shock.

dhmo1_log2_expression sheet

Log2 fold change expression data from the BY4741 HMO1 deletion strain at 15, 30, and 60 minutes of cold shock.

dzap1_log2_expression sheet

Log2 fold change expression data from the BY4741 ZAP1 deletion strain at 15, 30, and 60 minutes of cold shock.

network sheet

Adjacency matrix representation of the gene regulatory network. The columns correspond to the transcription factors and the rows correspond to the target genes controlled by those transcription factors. A “1” means there is an edge connecting them and a “0” means that there is no edge connecting them.

  • The upper-left cell (A1) should contain the text “cols regulators/rows targets”. This text is there as a reminder of the direction of the regulatory relationships specified by the adjacency matrix.
  • The rest of row 1 should contain the names of the transcription factors that are controlling the other genes in the network, one transcription factor name per column.
  • The rest of column A should contain the names of the target genes that are being controlled by the transcription factors heading each of the columns in the matrix, one target gene name per row.
  • The transcription factor names should correspond to the "id" in the other sheets in the workbook. They should be capitalized the same way and occur in the same order along the top and side of the matrix. The matrix needs to be square, i.e., the same transcription factors should appear along the top and left side of the matrix; the 0/1 entries themselves do not have to be symmetric, because regulation is directional. The genes should be listed in the same order in all the sheets in the Excel workbook.
  • Each cell in the matrix should then contain a zero (0) if there is no regulatory relationship between those two transcription factors, or a one (1) if there is a regulatory relationship between them. Again, the columns correspond to the transcription factors and the rows correspond to the target genes controlled by those transcription factors.
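
The layout rules above can be sketched as a small helper that turns an ordered gene list and a list of (regulator, target) pairs into the rows of a network sheet. The function name is hypothetical and the edges in the demo are invented for illustration; they are not taken from the sample workbooks.

```python
def build_network_sheet(genes, edges):
    """Return the network sheet as a list of rows.

    genes: gene ids in the order used by every other sheet.
    edges: iterable of (regulator, target) pairs; both must be in genes.
    Columns are regulators and rows are targets, so an edge sets
    the cell at row=target, column=regulator to 1.
    """
    index = {gene: i for i, gene in enumerate(genes)}
    matrix = [[0] * len(genes) for _ in genes]
    for regulator, target in edges:
        matrix[index[target]][index[regulator]] = 1

    rows = [["cols regulators/rows targets"] + list(genes)]
    for gene, row in zip(genes, matrix):
        rows.append([gene] + row)
    return rows


if __name__ == "__main__":
    # Hypothetical edges: CIN5 regulates GLN3, and HMO1 regulates itself.
    sheet = build_network_sheet(
        ["CIN5", "GLN3", "HMO1"],
        [("CIN5", "GLN3"), ("HMO1", "HMO1")],
    )
    for row in sheet:
        print(row)
```

Because the regulator indexes the column and the target indexes the row, swapping the two in an edge produces the transposed matrix, which describes a different network.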

network_weights sheet

The same format as the “network” sheet above. These are the initial guesses for the estimation of the weight parameters, w. Since these weights are initial guesses which will be optimized by GRNmap, the content of this sheet can be identical to the network sheet.

optimization_parameters sheet

The optimization_parameters sheet should have two columns (from left to right) entitled "optimization_parameter" and "value". The "optimization_parameter" column should contain the following:

  • alpha: Penalty term weighting (from the L-curve analysis)
  • kk_max: Number of times to re-run the optimization loop. In some cases re-starting the optimization loop can improve performance of the estimation.
  • MaxIter: Number of times MATLAB iterates through the optimization scheme. If this is set too low, MATLAB will stop before the parameters are optimized.
  • TolFun: How small the change between two successive least squares cost evaluations should be before the program determines that it is not making any improvement.
  • MaxFunEval: Maximum number of times the program will evaluate the least squares cost.
  • TolX: How small the change in the estimated parameter values between successive iterations should be before the program determines that it is not making any improvement.
  • production_function: Sigmoid for the sigmoidal model or MM for the Michaelis-Menten model (both case-insensitive).
  • L_curve: 0 if an L-curve analysis should NOT be run or 1 if an L-curve analysis SHOULD be run. The L-curve analysis will automatically run sequential rounds of estimation for an array of fixed alpha values (0.8, 0.5, 0.2, 0.1, 0.08, 0.05, 0.02, 0.01, 0.008, 0.005, 0.002, 0.001, 0.0008, 0.0005, 0.0002, and 0.0001). GRNmap makes a copy of the user's selected input workbook and changes alpha to the first alpha in the list. The estimation runs and the resulting parameter values are used as the initial guesses for the next round of estimation with the next alpha value. This process repeats until all alpha values have been run. New input and output workbooks are generated for each alpha value, although currently, the graphs are only saved for the last run.
  • estimate_params: 1 to estimate parameters; 0 to do just one forward run.
  • make_graphs: 1 to output graphs; 0 to not output graphs.
  • fix_P: 1 to keep the production rate parameter, P, fixed at its initial guess; 0 to estimate it.
  • fix_b: 1 to keep the threshold parameter, b, fixed at its initial guess; 0 to estimate it.
  • expression_timepoints: A row containing a list of the time points when the data were collected experimentally. Should correspond to the timepoint column headers in the STRAIN_log2_expression sheets.
  • Strain: A row containing a list of all of the strains for which there is expression data in the workbook. Should correspond to the "STRAIN" portion of the names of the STRAIN_log2_expression sheets for each strain. Note that GRNmap will run the model for the wild type network (all genes present in the network) and for networks where the gene deleted from the designated STRAIN has been deleted from the network.
  • simulation_timepoints: A row containing a list of the time points at which to evaluate the differential equations to generate the simulated data. This does not need to correspond to the actual measurement times, but should be in the same units (e.g. minutes).
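
The sequential procedure described in the L_curve entry above, where each round of estimation starts from the previous round's result, can be sketched as a simple loop. This is only an outline of the control flow: `estimate` is a hypothetical stand-in for one round of estimation at a fixed alpha, not a GRNmap function, and the workbook copying and graph output are omitted.

```python
# Alpha values listed in the L_curve entry, in the order they are run.
ALPHAS = [0.8, 0.5, 0.2, 0.1, 0.08, 0.05, 0.02, 0.01,
          0.008, 0.005, 0.002, 0.001, 0.0008, 0.0005, 0.0002, 0.0001]


def run_l_curve(initial_params, estimate, alphas=ALPHAS):
    """Run one estimation per alpha, warm-starting each from the previous result.

    initial_params: dict of starting parameter values.
    estimate: hypothetical callable (params, alpha) -> new params dict.
    Returns a list of (alpha, params) pairs, one per round.
    """
    results = []
    params = dict(initial_params)
    for alpha in alphas:
        params = estimate(params, alpha)      # previous output is the next input
        results.append((alpha, dict(params)))
    return results


if __name__ == "__main__":
    # Toy stand-in: shrink every parameter in proportion to alpha.
    def toy_estimate(params, alpha):
        return {name: value / (1.0 + alpha) for name, value in params.items()}

    for alpha, params in run_l_curve({"w": 1.0}, toy_estimate):
        print(alpha, params)
```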

threshold_b sheet

These are the initial guesses for the estimation of the threshold_b parameters. There should be two columns. The left-most column should contain the header "id" and list the standard names for the genes in the model in the same order as in the other sheets. The second column should have the header "threshold_b" and should contain the initial guesses, typically all 0.

STRAIN_log2_optimized_expression sheet

One worksheet is created for each strain specified in the optimization_parameters sheet in the input workbook. The word "STRAIN" is replaced by the actual strain designation, e.g., wt_log2_optimized_expression. This worksheet contains the "SystematicName" and the "StandardName" in the first two columns in accordance with the "STRAIN_log2_expression" sheets from the input file. The rest of row 1 contains the time points specified by the "simulation_timepoints" parameter in the optimization_parameters worksheet in the input workbook. The values in each cell for the remainder of the sheet correspond to the simulated log2 fold change expression values for each gene for each timepoint. The program evaluates the differential equation for each gene using the w, P, and b parameters. If the user has chosen to run a forward simulation only, these parameters are taken from the network_weights, production_rates, and threshold_b worksheets, respectively. If the user has chosen to estimate parameters, the w parameter is taken from the network_optimized_weights sheet. If P has been estimated, its value is taken from the optimized_production_rates sheet; if b has been estimated, its value is taken from the optimized_threshold_b sheet.

STRAIN_sigmas sheet

One worksheet is created for each strain specified in the optimization_parameters sheet in the input workbook. The word "STRAIN" is replaced by the actual strain designation, e.g., wt_sigmas. This worksheet contains the "SystematicName" and the "StandardName" in the first two columns in accordance with the "STRAIN_log2_expression" sheets from the input file. There is then one column for each timepoint specified by the "expression_timepoints" parameter in the optimization_parameters sheet in the input workbook. The values are the standard deviations of the log2 fold changes of expression for each gene and timepoint computed from the data contained in the corresponding STRAIN_log2_expression sheet.
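
A per-timepoint standard deviation of this kind can be sketched as follows. The sketch assumes the sample standard deviation (N-1 denominator); the text above does not state which convention GRNmap uses, so treat that choice, the function name, and the demo values as assumptions.

```python
from statistics import stdev


def timepoint_sigmas(headers, values):
    """Return {timepoint: standard deviation} for one gene.

    headers: timepoint column headers (without the id column), with one
             repeated header per replicate, e.g. ["15", "15", "15", "30", ...].
    values:  that gene's log2 fold changes, in the same column order.
    Assumes at least two replicates per timepoint (stdev needs two points).
    """
    groups = {}
    for header, value in zip(headers, values):
        groups.setdefault(header, []).append(value)
    return {header: stdev(reps) for header, reps in groups.items()}


if __name__ == "__main__":
    # Hypothetical replicate data for a single gene.
    headers = ["15", "15", "15", "30", "30", "30"]
    values = [1.0, 2.0, 3.0, 0.5, 0.5, 0.5]
    print(timepoint_sigmas(headers, values))
```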

optimized_production_rates sheet

If fix_P = 0 in the optimization_parameters sheet in the input workbook, the production rates are estimated and this sheet is created. This sheet contains the "SystematicName" and "StandardName" just like in the production_rates sheet in the input workbook. The "prorate" column contains the optimized production rates from the estimation procedure.

optimized_threshold_b sheet

If fix_b = 0 in the optimization_parameters sheet in the input workbook, the threshold (b) parameters are estimated and this sheet is created. This sheet contains the "SystematicName" and "StandardName" just like in the worksheet described above. The third column, entitled "b", contains the optimized threshold b parameters for each gene in the network.

network_optimized_weights sheet

If estimate_params = 1 in the optimization_parameters sheet in the input workbook, the weight parameters, w, are estimated and this sheet is created. The format of this sheet is the same as the "network" sheet in the input workbook. However, the 1s of the adjacency matrix are replaced by the estimated w parameters. These values represent the magnitude and sign of the regulatory relationship between the transcription factors (genes located in the very first row) and the target genes (genes in the leftmost column). Cell A1 contains the text "rows genes affected/cols genes controlling" as a reminder of the direction of the regulatory relationships specified by the adjacency matrix. If the value in the cell is negative, the target gene is repressed. If the value is positive, the target gene is activated. A value of 0 means that there is no regulatory relationship between those two transcription factors.
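
Reading the sign convention off such a sheet can be sketched as below. The function name is hypothetical and the weights in the demo are invented for illustration, not output from GRNmap.

```python
def classify_edges(sheet_rows):
    """List the regulatory edges in a weights sheet with their interpretation.

    sheet_rows: rows of the sheet; row 0 holds the A1 text followed by the
    regulator names, and each later row holds a target name followed by weights.
    Returns (regulator, target, kind, weight) tuples; zero weights are skipped
    because they mean there is no regulatory relationship.
    """
    regulators = sheet_rows[0][1:]
    edges = []
    for row in sheet_rows[1:]:
        target, weights = row[0], row[1:]
        for regulator, w in zip(regulators, weights):
            if w > 0:
                edges.append((regulator, target, "activation", w))
            elif w < 0:
                edges.append((regulator, target, "repression", w))
    return edges


if __name__ == "__main__":
    # Hypothetical optimized weights.
    sheet = [
        ["rows genes affected/cols genes controlling", "CIN5", "GLN3"],
        ["CIN5", 0, -0.5],
        ["GLN3", 1.2, 0],
    ]
    for edge in classify_edges(sheet):
        print(edge)
```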

optimization_diagnostics sheet

This worksheet contains some diagnostic information about the estimation that can be used to evaluate the performance of the model. The following information is presented.

  • Top block
    • LSE: the value of the overall least squares error for the estimation
    • Penalty: the value of the sum of squares of all parameters being estimated
    • min LSE: the value of the minimum least squares error possible for the estimation that could be achieved given this particular set of expression data. This value is obtained from the variance of the flask data for each time and for each gene
    • iteration count: the count of the total number of times the least squares function is evaluated by the optimization algorithm at the termination of the program
  • The bottom block of data has gene-specific information.
    • The left column, entitled "Gene", lists the IDs of the genes in the regulatory network, in the order they appear in the other worksheets.
    • Each column to the right is entitled "STRAIN MSE", where the word STRAIN is replaced by the strain designation from the optimization_parameters sheet in the input workbook, for example "wt MSE". The MSE value is the mean squared error comparing the simulated expression data found in that particular strain's STRAIN_log2_optimized_expression sheet to the experimental data found in the corresponding STRAIN_log2_expression sheet. The MSE gives an indication of how well the model fits each individual gene's expression data, whereas the LSE value in the top block indicates the overall performance of the model across all the genes.
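
The per-gene error and the penalty term can be sketched as follows. The sketch assumes the simulated values are available at the same timepoints as the measurements and that the mean is taken over every replicate measurement; the exact normalization GRNmap uses is not stated here, and the function names and demo values are hypothetical.

```python
def gene_mse(observed, simulated):
    """Mean squared error between one gene's data and its simulated values.

    observed:  {timepoint: [replicate log2 fold changes]}
    simulated: {timepoint: simulated log2 fold change}
    Every replicate contributes one squared difference (an assumption).
    """
    squared = [
        (value - simulated[t]) ** 2
        for t, replicates in observed.items()
        for value in replicates
    ]
    return sum(squared) / len(squared)


def penalty(parameters):
    """Sum of squares of all estimated parameters, as in the Penalty entry."""
    return sum(p * p for p in parameters)


if __name__ == "__main__":
    # Hypothetical data for a single gene in a single strain.
    observed = {15: [1.0, 3.0], 30: [0.5, 0.5]}
    simulated = {15: 2.0, 30: 0.5}
    print(gene_mse(observed, simulated))
    print(penalty([1.0, -2.0]))
```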