5.4 Software Design
1. Learning From the Past
Speed, speed, speed! That was at the core of my design philosophy for this system. One of the main reasons that each of my two previous attempts to build this system failed was lack of speed. I started off trying to use a standard SQL database to store and analyze the data. This was horrendously slow. For the next attempt I tried an in-memory database. I was using SQL database structures so that I could easily and flexibly query the data to get an idea of what was going on. For example: what quantity of a transcription factor with a degrade rate between 0.1 and 0.4 was created in cells (0, 0, 0) to (5, 5, 1) at time slice 50? By using standard SQL and having the data in a database I could easily answer questions like this. However, these databases also had a massive amount of overhead. SQL databases are by their very nature general purpose; they knew nothing about my problem domain. Consequently, even the completely in-memory SQL system was still too slow. I became convinced that I needed to use all the knowledge I had about the problem domain to build a custom application and no longer rely on general SQL. So in this third attempt I set out to do just that.
The other major reason why my previous attempts failed is that at the time I simply did not have a good enough grasp of how cellular processes like gene transcription, receptor/ligand interaction, and regulation occurred. I was trying to model a system when I still did not adequately understand how that system functioned in the first place. I know, it sounds pretty stupid looking back on it. But I will admit that I do occasionally do stupid things. lol. It was only after I had finished several important biology classes like molecular cell biology, developmental neurobiology, and fundamentals of neuroscience that I again attempted to build this system.
2. Loading and File Types
All of the configuration files for this system are saved in XML format. Even the binary version of a chromosome is saved in XML as a hex string. While XML is bulky compared to a binary format, it has the advantage of being human readable: it can be easily viewed, understood, and modified in a standard text editor. Several different files are required, and they are listed in table 1.
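The hex-string idea mentioned above can be sketched in a few lines. This is an illustrative example only; the function names and the choice of a byte vector for the chromosome are my assumptions, not the actual API of the system.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch: encode a binary chromosome as a hex string suitable
// for embedding in an XML text node, and decode it back on load.
std::string ChromosomeToHex(const std::vector<uint8_t>& bits)
{
    static const char* digits = "0123456789ABCDEF";
    std::string hex;
    hex.reserve(bits.size() * 2);
    for (uint8_t b : bits)
    {
        hex.push_back(digits[b >> 4]);   // high nibble
        hex.push_back(digits[b & 0x0F]); // low nibble
    }
    return hex;
}

std::vector<uint8_t> HexToChromosome(const std::string& hex)
{
    auto nibble = [](char c) -> uint8_t {
        return (c >= 'A') ? uint8_t(c - 'A' + 10) : uint8_t(c - '0');
    };
    std::vector<uint8_t> bits;
    bits.reserve(hex.size() / 2);
    for (size_t i = 0; i + 1 < hex.size(); i += 2)
        bits.push_back(uint8_t((nibble(hex[i]) << 4) | nibble(hex[i + 1])));
    return bits;
}
```

The round trip is lossless, so the human-readable XML file still carries exactly the same bits as a raw binary dump would.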
3. Defining the Chromosome
When you load a gene or protein from a string of bits, how do you know how many bits to use for each parameter? What values does the binary value map to? Does 1021 map to 0.3 or to 3000? It is possible to hard code this kind of information into the protein and gene objects themselves so that they know how to translate these values. But if you ever want to change one of these values, say to increase the number of bits that define a degrade rate, then it is a real pain in the butt. This is what the chromosome configuration file is for. Each protein and gene class has a corresponding definition class. It is this definition class that knows how to load and save that object, not the object itself. When the simulation configuration is loaded it points to a chromosome definition that is then loaded, and each time a chromosome is loaded it is the definition that does the loading. The definition knows exactly how many bits each parameter of every class uses and how to translate that binary value into a real-world value.

However, there is also another, more subtle use for this chromosome definition file. What would happen if this type of information was stored directly in the objects themselves, some important genetic algorithm runs stored a lot of binary data in a database, and an additional parameter was later added to one of the proteins? All of that previous data would now essentially be garbage. It would not be possible to load it, because it was generated from the old configuration; when the binary data was loaded back in, the new parameter would be read even though it was never saved. This would throw everything off, and a completely different chromosome would be loaded than the one that was originally saved. Chromosome definition files avoid this problem because each run of a genetic algorithm is associated with a definition that is stored in the database along with the binary data.
Later on, if a new parameter is added, then when the old definition is loaded that parameter will not be found, and the definition object knows that it should not attempt to load that parameter's value from the binary data. So the old binary chromosome is loaded correctly even though a new parameter has been added. The same holds if, say, the mapping range of an old parameter has been changed: the old configuration is still set up with the values used when the chromosome data was created, and the data loads correctly. So the chromosome definition gives us a way to extend the system while still retaining old data.
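The definition-driven loading described above can be sketched as follows. This is a minimal illustration, not the system's actual classes: the structure names, the linear bit-to-value mapping, and the parameter list are all assumptions made for the example.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of a chromosome definition entry: each parameter is
// described by a bit count and a real-world mapping range.
struct ParamDef
{
    int    bitCount;  // bits this parameter consumes from the chromosome
    double minValue;  // mapped value of the all-zero bit pattern
    double maxValue;  // mapped value of the all-one bit pattern
};

class ProteinDef
{
public:
    // Parameters are read in the order they appear in the definition, so an
    // old definition simply never reads parameters it does not contain.
    std::vector<std::pair<std::string, ParamDef>> m_params;

    // Reads successive bit fields from 'bits' starting at 'pos', mapping each
    // raw integer linearly into [minValue, maxValue].
    std::map<std::string, double> Load(const std::vector<int>& bits,
                                       size_t& pos) const
    {
        std::map<std::string, double> values;
        for (const auto& p : m_params)
        {
            const ParamDef& def = p.second;
            uint32_t raw = 0;
            for (int i = 0; i < def.bitCount; ++i)
                raw = (raw << 1) | uint32_t(bits[pos++]);
            double maxRaw = double((1u << def.bitCount) - 1);
            values[p.first] = def.minValue +
                (raw / maxRaw) * (def.maxValue - def.minValue);
        }
        return values;
    }
};
```

Because the bit layout lives in the definition rather than in the protein object, loading an old chromosome with its old definition reproduces exactly the values that were originally saved.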
4. Arrays and Constant Time Lookups
The primary strategy employed to increase processing speed is to do as much as possible through preprocessing. The idea is to put as much as possible into lookup arrays so that the main processing loop has a minimal amount of work to do. This is critical because the main processing loop is going to be called over and over again for millions of iterations. Every nanosecond of wasted time in the main loop is amplified into huge differences in the overall time required to perform the simulation. During the preprocessing stage, all of the links and lists that will be repeatedly used during the main processing are built.

A good example of this is the binding lists shown in figure 1. When the chromosome is first loaded we generate a list of genes and another of proteins. We also have additional lists that separate the proteins by type, so we have a list of all of the membrane proteins, and so on. But during the main processing loop one of the primary things we need to do is match up a specific protein with all of the other proteins or genes that bind to it. Take membrane receptors and ligands, for example. We loop through the membrane receptors, and for each one we need to loop through all of the membrane ligands that bind to that receptor. We do not want to have to search for these proteins, because it will always be the same list and we will know it beforehand. This is where binding lists come into play. A binding list is a jagged three-dimensional array. The first dimension is indexed by the protein type. The second dimension spans all of the possible binding ID values; if the binding ID for membrane receptors is 10 bits, then there are 1024 possible values and that is the size of the second dimension. If there are no proteins with a given binding ID value then a null is stored at that spot in the array.
But if one or more proteins have that binding ID, then a list is stored that keeps a pointer to each of the associated proteins. This list is built during the preprocessing stage and then used during the processing loop. It is true that the array of binding IDs wastes memory. However, it is a small amount relative to the total available, and direct access to these lists greatly increases the speed of the algorithm. So even though the memory is "wasted", it is still very beneficial: if we want to loop through all of the ligands that bind to a specific receptor, we go straight to that list.
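A minimal sketch of the binding-list structure might look like this. The class and member names are illustrative (chosen to echo the Hungarian style used elsewhere on this site), not the actual implementation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the real protein class: just the two fields the
// binding lists care about.
struct Protein
{
    int m_iType;      // e.g. 0 = membrane receptor, 1 = membrane ligand
    int m_iBindingID; // value decoded from the chromosome's binding-ID bits
};

class BindingLists
{
public:
    // One row per protein type, one slot per possible binding ID value
    // (e.g. 1024 slots for a 10-bit binding ID). Empty slots hold null.
    BindingLists(int typeCount, int bindingIDCount)
        : m_lists(typeCount,
                  std::vector<std::vector<Protein*>*>(bindingIDCount, nullptr)) {}

    ~BindingLists()
    {
        for (auto& row : m_lists)
            for (auto* list : row)
                delete list;
    }

    // Preprocessing stage: file each protein under its type and binding ID.
    void Add(Protein* p)
    {
        auto*& slot = m_lists[p->m_iType][p->m_iBindingID];
        if (!slot) slot = new std::vector<Protein*>();
        slot->push_back(p);
    }

    // Main processing loop: constant-time lookup, no searching.
    const std::vector<Protein*>* Find(int type, int bindingID) const
    {
        return m_lists[type][bindingID];
    }

private:
    std::vector<std::vector<std::vector<Protein*>*>> m_lists;
};
```

The memory cost is one pointer per (type, binding ID) slot, but every lookup in the main loop is two array indexes, which is the whole point.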
5. Protein and Gene Quantities
All of these lists of genes and proteins are stored in the chromosome and are used for the organism as a whole. They are not duplicated for each cell in the body; in other words, there is not a list of protein objects for each cell. There is only one list of protein and gene objects, and that is in the chromosome. At this point you might be thinking, "Then how can you have different quantities of proteins in different cells? Or how can you turn a gene off in one cell and not in another?" Something we need to remember is that the properties and functionality of proteins and genes, and the linkages between them, do not change from one cell to the next. The only things that change are things like the quantity present in the cell, the amount of gene expressed, the activity state of the genes, etc. So instead of each cell keeping a bunch of protein and gene objects and wasting memory, a massive pool of memory is allocated at chromosome initialization for each of these different cell states and then divided amongst the cells. For the quantity of proteins, a pool is allocated with a size of (Cell Count * Protein Count * Size of float). Each cell then gets a piece of this pie the size of the protein count, and the quantity of each protein in each cell is stored as a float value in this array. But how do you associate a given protein with an element in the array? When the lists of genes and proteins are created they are sorted by things like protein type, degrade rate, etc. Each protein has a variable called ProteinID that is assigned based on its index in the list of proteins. So later, during the processing loop, if we are in a cell and want to know the quantity of a protein in that cell, it is retrieved by accessing the array like m_aryProteinQtys[lpProtein->m_iProteinID]. And again, this is a constant-time array lookup, so there is no searching during the processing loop.
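The pool-and-slice scheme can be sketched as below. This is a simplified illustration under my own naming assumptions; the real system presumably hands each cell its slice pointer during preprocessing rather than recomputing it.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of the per-cell quantity pool: one flat allocation of
// (cell count * protein count) floats, sliced so that each cell owns a
// contiguous run indexed by ProteinID.
class QuantityPool
{
public:
    QuantityPool(int cellCount, int proteinCount)
        : m_iProteinCount(proteinCount),
          m_aryQtys(size_t(cellCount) * size_t(proteinCount), 0.0f) {}

    // Each cell's slice starts at (cell index * protein count); indexing the
    // returned pointer by ProteinID gives that protein's quantity in the cell.
    float* CellSlice(int cellIndex)
    {
        return &m_aryQtys[size_t(cellIndex) * size_t(m_iProteinCount)];
    }

private:
    int m_iProteinCount;
    std::vector<float> m_aryQtys;
};
```

One allocation up front, no per-cell objects, and every quantity access in the main loop is a constant-time index into the slice, exactly like `m_aryProteinQtys[lpProtein->m_iProteinID]` above.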
6. Genome Compression
The more proteins and genes there are in the system, the slower it will run. For example, if we have 50 proteins, we have to loop through that list 4 times per cell, and we have 5,000 cells, then that is 50 * 4 * 5,000 = 1,000,000 steps for each time slice. Whereas if we only have 10 proteins, this number falls to 200,000 steps per time slice. So it is very beneficial to keep the number of proteins and genes to a minimum. However, when a chromosome is generated, some of the proteins and genes that are created randomly may not have any effect on the system. For instance, if you have a membrane receptor but there is no corresponding membrane ligand with the same binding ID, then that receptor is completely useless. It will never be able to bind to anything and will never be able to produce its expressed protein. In this case there is no reason to keep that protein, and removing it also eliminates its expressed protein. In fact, eliminating the expressed protein may make another protein unusable and require it to be eliminated, and so on. So this elimination is an iterative process where you are essentially compressing the genome down to its most compact form. Once the compression is completed, only those proteins and genes that will definitely be used in the simulation remain. You might be asking, "Why not just get rid of any redundant proteins directly in the chromosome instead of doing this each time a chromosome is preprocessed?" There are two problems with this. The first is that even if this is done with the initial generation of chromosomes, future mutations can change these linkages and cause proteins to no longer link up, or to link up with different items. So this service will essentially always have to be performed for the chromosome. The second reason is that we do not want to destroy the variability within the chromosome.
If we eliminate anything that is currently unusable, then evolution has nothing to play with. At some point later on, a fortuitous mutation may occur to that protein that turns out to be the key to solving the problem. If we had eliminated it at the start, this could never occur.
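The iterative compression pass can be sketched like this. The data structures are simplified stand-ins of my own invention: a receptor here only records the ligand binding ID it requires and the protein it expresses, which is just enough to show the cascading elimination.

```cpp
#include <set>
#include <vector>

// Hypothetical simplified protein record for the compression sketch.
struct Prot
{
    bool isReceptor;
    int  bindingID;   // for receptors: the ligand binding ID they require;
                      // for ligands: the binding ID they carry
    int  expressedID; // for receptors: index of their expressed protein, or -1
};

// Returns the indices of proteins that survive compression: receptors with
// no matching ligand are dropped along with their expressed proteins, and
// the sweep repeats until a pass removes nothing (the cascade).
std::set<int> CompressGenome(const std::vector<Prot>& prots)
{
    std::set<int> alive;
    for (int i = 0; i < int(prots.size()); ++i) alive.insert(i);

    bool removed = true;
    while (removed)
    {
        removed = false;
        // Binding IDs carried by the ligands that are still alive.
        std::set<int> ligandIDs;
        for (int i : alive)
            if (!prots[i].isReceptor) ligandIDs.insert(prots[i].bindingID);

        // Iterate over a snapshot so erasing while sweeping is safe.
        for (int i : std::set<int>(alive))
        {
            const Prot& p = prots[i];
            if (p.isReceptor && !ligandIDs.count(p.bindingID))
            {
                alive.erase(i);                  // useless receptor
                if (p.expressedID >= 0)
                    alive.erase(p.expressedID);  // and its expressed protein
                removed = true;                  // may cascade: sweep again
            }
        }
    }
    return alive;
}
```

Removing one receptor can strand the ligand it would have expressed, which in turn strands another receptor, so the loop runs until a full sweep changes nothing.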
7. What's Going On Inside?
To analyze what is happening, and to produce any kind of fitness value for the chromosome, it is necessary to find out what is going on inside the simulation. What is the quantity of protein A at time X? How much of gene Y is being expressed at time Z? Questions like these have to be answerable. As mentioned above, this is one of the main reasons I originally wanted to use SQL: by putting all of the data into a database, even an in-memory one, it would be trivial to write queries to get back data like this. However, it was just too slow, so instead I came up with a custom solution. Each main class, like a protein type or gene, has an associated search class that knows everything needed to search for that object. So it is possible to write a query for any of the objects that is as vague or as specific as the user wants. A few items are always required, like the type of search (protein or gene) and the type of protein. Other than those, though, you can specify or leave out any defined parameter for that object. Let's take an example: it is possible to create a query to find all transcription factors with a binding ID of 212, a degrade rate between 0.2 and 0.4, and a graph type of Linear. Or you could find diffusible ligands with a diffusion rate between 0.01 and 0.2. When this was attempted in SQL, all of the searches happened during the main processing loop and had to be executed over and over again. That is not how this custom system works. During initialization of the chromosome, once the genome is stabilized, every search looks for matching items and generates a list of pointers to those items. Then during the main processing loop the searches simply walk through these lists and do whatever is required. Again, it is all just a matter of directly accessing arrays during the critical processing period.
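A search class along these lines might be sketched as follows. The field names, the optional-criterion convention (-1 or an unset range means "match anything"), and the protein record are all assumptions for illustration.

```cpp
#include <vector>

// Hypothetical simplified protein record for the search sketch.
struct ProteinRec
{
    int    type;        // e.g. transcription factor, diffusible ligand
    int    bindingID;
    double degradeRate;
};

// Optional range criterion: when 'set' is false it matches anything,
// mirroring parameters the user chose to leave out of the query.
struct RangeCrit
{
    bool   set = false;
    double lo = 0.0, hi = 0.0;
    bool Matches(double v) const { return !set || (v >= lo && v <= hi); }
};

class ProteinSearch
{
public:
    int       m_type;            // required: the type of protein
    int       m_bindingID = -1;  // optional: -1 means any binding ID
    RangeCrit m_degradeRate;     // optional: degrade-rate range

    // Run once during preprocessing, after the genome has stabilized;
    // the main loop only walks m_matches, never searches.
    void Resolve(std::vector<ProteinRec>& all)
    {
        m_matches.clear();
        for (auto& p : all)
            if (p.type == m_type &&
                (m_bindingID < 0 || p.bindingID == m_bindingID) &&
                m_degradeRate.Matches(p.degradeRate))
                m_matches.push_back(&p);
    }

    std::vector<ProteinRec*> m_matches;
};
```

The key design point is when `Resolve` runs: once, during initialization, so the SQL-style flexibility of the query costs nothing in the main processing loop.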
8. Injections and Evaluations
It is possible to inject quantities of proteins directly into a range of cells. The injection itself can either define the protein to inject or specify a search to find the proteins that will be injected. The user can also define the time at which each injection occurs. This is a very important feature to have during testing and is often used in the examples seen throughout this site. It is also used to simulate oogenesis. Currently it is only possible to inject proteins into cells. However, once it is possible to differentiate neurons, it will also be possible to directly inject currents into cells.
No genetic algorithm can work if you cannot assign a fitness to each chromosome. This system allows the user to define a number of fitness tests that are run at predefined times throughout the processing of the simulation. Some of the most common tests currently defined do things like compare the quantities of a protein over a range of cells to defined target values for that range. Another does the same but looks instead for a change in the values. By chaining these tests together it is possible to come up with a fitness that looks for a protein quantity to rise over a certain number of milliseconds and then fall in a succeeding range of time. All fitness values are scaled so that the higher the value, the worse the fitness, and the fitness values from each test are accumulated into a total. This makes it possible to weight one test so that it is more important than the others. While building the genetic algorithms for the examples listed here, and some that did not make it onto this site, I found that building a set of fitness tests that drives the evolutionary process is very difficult. If you leave any path uncovered, evolution will exploit it and do something you did not expect.
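The chained, accumulated fitness tests can be sketched like this. It is a bare-bones illustration under my own assumptions: each test watches one measured quantity at one scheduled time slice, and weighted absolute error stands in for whatever error measure the real tests use.

```cpp
#include <cmath>
#include <vector>

// Hypothetical fitness test: compare the measured protein quantity at a
// scheduled time slice against a target, scaled by a per-test weight so
// one test can matter more than another.
struct FitnessTest
{
    int    timeSlice;  // when the test runs
    double target;     // expected quantity in the watched cells
    double weight;     // relative importance of this test
};

class FitnessEvaluator
{
public:
    std::vector<FitnessTest> m_tests;
    double m_total = 0.0;  // higher = worse, per the convention above

    // Called each time slice with the measured quantity; every test
    // scheduled for that slice adds its weighted error to the total.
    void Evaluate(int timeSlice, double measured)
    {
        for (const auto& t : m_tests)
            if (t.timeSlice == timeSlice)
                m_total += t.weight * std::fabs(measured - t.target);
    }
};
```

Chaining a "rise by slice 50" test with a "fall by slice 100" test is then just two entries in `m_tests`, with the weights deciding which phase the evolution is punished harder for missing.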
9. Algorithm Overview
10. Design Overview
A considerable amount of work has been put into making this system lightning fast and very flexible. I wanted to make something that would be a solid core for each of the future steps. One of the primary ways of gaining speed here is to do everything possible in the preprocessing stage so that it does not have to be done in the costly main processing loop. Generating all of the lists and searches up front and keeping them in arrays costs more memory, but that is a negligible price to pay for the massive increase in speed it achieves. In fact, the bulk of the coding for this system was related to the preprocessing of the chromosomes; it was far more complicated than the relatively simple main processing loop. But hopefully all of that work will pay off in the ability to simulate very large quantities of cells.