Better Genetics Handling Part 2

We left off last time with explaining why changing our genetics handling was necessary and also set off on a programmatic approach to solving the issue for our most basic genetics strings where there was only a mother, father and an x or a / to denote what kind of relationship we knew about. However, as we all know from cannabis genetics, things aren’t always this easy.

Many times, our genetics strings are far more complicated. Here are a few examples:

(({221} x {669}) x {669}) x (({221} x {669}) x {669})
{329} / {114} / ({1934} / {2045})
({1927} / {2093}) / {137} / ???

In these cases, we have some heavier lifting to do. I’m taking them a chunk at a time. We’re doing the same thing we did for the simple examples when it comes to grabbing our data. First, I wanted to handle strains where we’ve got definite parent relationships and there was a parent listed that had more than one part. For example:

({207} x {246}) x {1799}
({1229} x {115}) x {115}

Here is what the code looked like for ensuring these were the types of records we were getting:

$values = array_count_values($strainParts);
if ($values['x'] > 1 && strpos($geneticsString, '(') !== false && strpos($geneticsString, '/') === false && strpos($geneticsString, '?') === false) {
    $matches = explode(')', $geneticsString);
    if (count($matches) == 3) {
        // Complicated
        continue;
    }

Here we can see there are no /’s which means we know what parents went where. In addition, we’ve got strains that have parents that are made of more than one strain. These are denoted by the parenthesis. It was important to me to find strains listed like that so that I could start to insert some new crosses as new strains in the database. So, for the first example listed above, here’s what I wanted the order of operations to be:

  1. Figure out whether the parens were on the mother side (left) or father side (right).
  2. Determine whether the parent made of two strains already existed in the database or not.
  3. Insert the parent made of two strains as a new strain in the database.
  4. Insert the correct genetics information for that new parent strain.
  5. Create our genetics record and use our newly created or already existing strain for the parent appropriately.

For step one, I sorta did a trick to determine whether it was a mother or a father with the ()’s. When exploding on the ) character, we can only get one of two kinds of results back:

var_dump(explode(')', '({strain1 x strain2}) x {strain3}'));
/*
array(2) {
  [0]=>
  string(20) "({strain1 x strain2}"
  [1]=>
  string(12) " x {strain3}"
}
*/
 
var_dump(explode(')', '{strain1} x ({strain2} x {strain3})'));
/*
array(2) {
  [0]=>
  string(34) "{strain1} x ({strain2} x {strain3}"
  [1]=>
  string(0) ""
}
*/

As you may or not have caught, if we don’t have a string length for array[1] we know that it was the father which was the parent with the parens. So, we’ve got code for this that we’ll use for variable variables. Here’s what the code looks like for that:

if (strlen($matches[1]) == 0) {
    $varName = 'fatherId';
    $otherVarName = 'motherId';
} else {
    $varName = 'motherId';
    $otherVarName = 'fatherId';
}

From there, I’ll let the comments in the code serve as a description for what happened to satisfy the rest of the steps we previously mentioned:

// Grab our genetics to check
$geneticsToCheck = substr($matches[0], strpos($matches[0], '('));
 
// Grab the other strain number that is in the genetics string
preg_match('/[0-9]+/', str_replace($geneticsToCheck, '', $geneticsString), $matches);
$$otherVarName = $matches[0];
 
// Explode our strains making up our paren'd strain
$parts = explode('x', str_replace(array('{', '(', '}', ' '), '', $geneticsToCheck));
if (count($parts) == 2) {
    // Check some stuffs!
    if (is_numeric($parts[0]) && is_numeric($parts[1])) {
        // See if this is the only strain where this strain exists
        $sql = 'SELECT strain_id 
               FROM strain_reports 
               WHERE LOWER(genetics_string) LIKE \'{' . $parts[0] . '} x {' . $parts[1] . '}\'';
        $result = $db2->query($sql);
 
        /**
          * If we only get one row back, we know that this combination of genetics already exists in the
          * as a strain and we want to use that strain id as the parent ID.
          */
        if (mysqli_num_rows($result) == 1) {
            $row = mysqli_fetch_assoc($result);
            $$varName = $row['strain_id'];
            $result->free();
        } else if (mysqli_num_rows($result) > 1) {
            // We've got two strains with the same genetics, not good. Need to check it manually later.
            $result->free();
            continue;
        } else {
            // This is a new strain for the database
            $result->free();
 
            // Grab the names of the strains that make up this cross to name our newly created strain
            $sql = "SELECT strain_id, name, display_name 
                   FROM strain_reports WHERE strain_id IN (" . implode(',', $parts) . ")";
            $result = $db2->query($sql);
            $names = array();
            while ($row = $result->fetch_assoc()) {
                $names[] = strlen($row['display_name'])?$row['display_name']:ucwords($row['name']);
            }
            $result->free();
 
            $name = $display_name = str_replace($parts, $names, str_replace(array('(', '{', '}'), '', $geneticsToCheck));
 
            // Insert our new strain into the DB
            $sql = "INSERT INTO strain_reports (name, display_name, genetics_string, active)
                   VALUES (?, ?, ?, 1)";
            $stmt = $db2->prepare($sql);
            $stmt->bind_param('sss', $name, $display_name, str_replace('(', '', $geneticsToCheck));
            $stmt->execute();
            $stmt->close();
 
            // Set our appropriate father or mother ID as our newly inserted strain
            $$varName = $db2->insert_id;
 
            // Insert the genetics information for this new strain
            insertMotherFather($db2->insert_id, $parts[0], $parts[1]);
            $newStrains++;
        }
 
        // Now we can insert our genetics information for our strain with our new or existing strain id
        insertMotherFather($strainId, $motherId, $fatherId);
    }
    $numUpdated++;
}

Checking our $numUpdated and $newStrains vars at the end told us that we processed another 54 strains and added another 40 which provided an additional 94 rows of genetics information in our DB as well.

So, we’ve added another 40 strains to what is already the largest cannabis strain database on the planet and provided genetic information for another 10% of the remaining strains that weren’t covered. We’ve only got another 400 strains to cover and we’ve got a great base of code to accomplish this. With a few modifications we’ll not only be able to handle our genetics strings with ???s in them, but also start handling the larger and more complicated ones.

There will definitely be some manual data entry and massaging as we get towards the end and are presented with only the more complicated strains, but until then, stay tuned for part 3 where we will cover as many strains as possible in a programmatic way.

If you liked this post or just want to keep up with the latest developments concerning Smokereports.com, please consider subscribing to my feed or following smokereports on Twitter.

May 12, 2011 · admin · No Comments
Posted in: development, genetics, list, mysql, php

Better Genetics Handling Part 1

When I first committed to creating this strain database, an absolute must for me was a decent way to link all the strains together via genetics. I didn’t think a great deal about the best way to do this so I came up with a rather rudimentary methodology. I stored the information in a couple of simple ways.

First, I stored a genetics string with each strain. It looked something like this:

({24} x {56}) x {48}

From there I also stored the links in a strain_links table in the database. Basically a strain_id to strain_id relationship for all of the strains listed in the genetics string. So, for the example above I’d have 3 rows, each with the strain in question as the parent_id and the other strains as the related id. Crude and simple but it worked for what I was trying to achieve at a time. This provided a simple 1 to 1 relationship between the strains and gave me a parsable genetics string that I could replace numbers with to provide the output and links that are seen for strains where genetics is available. While this worked, it made it difficult to create the genetics trees that I so desired. I wanted to be able to easily show a tree of genetics going as far back as possible (to the landraces) or, until there wasn’t genetics information available for the parents.

I was stuck with a couple problems now. So, I did a bit of thinking on the best way to handle the genetics in a better fashion. Because I stored the relationships in strings, I knew I would have to do some string parsing to get things right. There were a few considerations for this that were important:

  1. All plants have a mother and a father. The relationships should be as simple as that.
  2. Some plants have information about genetics but not what the mothers and fathers are. This is denoted by a / instead of an x in the genetics.
  3. Due to some not so consistent data entry by myself during many of the late nights putting this together, there are inconsistencies such as X where there should be a / or putting errant spaces or different cases for the X or x.
  4. Some genetics strings contained many strings such as Ed Rosenthal’s Super Bud ({989} x {2093} x {245} x {1934} x {1945} – also a good example of where / should have been used).
  5. Since there was a need to switch to just a mother and a father in most cases, genetics strings like ({strain1} x {strain2}) x {strain3} needed to have a strain in the database for {strain1} x {strain2}

Because of all of these considerations, I knew I could not write a simple regex or loop to get this information. I began by creating my table I would be using for genetics. It looks like this:

+--------------------+------------+------+-----+---------+----------------+
| Field              | Type       | Null | Key | Default | Extra          |
+--------------------+------------+------+-----+---------+----------------+
| strain_genetics_id | int(11)    | NO   | PRI | NULL    | auto_increment |
| child_id           | int(11)    | NO   | MUL | NULL    |                |
| mother_id          | int(11)    | NO   |     | 0       |                |
| father_id          | int(11)    | NO   |     | 0       |                |
| active             | tinyint(4) | NO   |     | 1       |                |
| updated            | datetime   | NO   |     | NULL    |                |
| parent_id          | int(11)    | YES  |     | 0       |                |
+--------------------+------------+------+-----+---------+----------------+

Straightforward child to parents table.

After this, I wanted to examine the data.

SELECT COUNT(strain_id) FROM strain_reports WHERE genetics_string IS NOT NULL AND genetics_string NOT IN ('', '???');

Out of the 1,491 strains that I had genetics for, I wanted to find out how many of those were simple mother and father relationships. I knew that I could treat all of these the same. As it turned out, roughly 2/3 of them were. I discovered this by grabbing all of my strains and exploding on the genetics string to see how many pieces there were.

$strainParts = explode(' ', str_replace(array('{', '}'), '', $geneticsString));

In the process, I got rid of the {}’s that I was using to help with parsing the strings. From there, it was time to start inserting the relationships. I created a couple of functions to assist with this. Before anyone rags on the code, I know, a global $db isn’t the best and I also know I could’ve written them a bit better. However, this code is for me, running locally, and is for a script. There’s some copy-pasta and some not-so-elegant stuff but I’m okay with that as I’m trying to get work done at this point.

function insertMotherFather ($childId, $motherId = false, $fatherId = false) {
    global $db2;
 
    if ($motherId && $fatherId) {
        $sql = "INSERT INTO strain_genetics (child_id, mother_id, father_id) VALUES ($childId, $motherId, $fatherId)";
    } else if ($motherId) {
        $sql = "INSERT INTO strain_genetics (child_id, mother_id) VALUES ($childId, $motherId)";
    } else if ($fatherId) {
        $sql = "INSERT INTO strain_genetics (child_id, father_id) VALUES ($childId, $fatherId)";
    }
 
    var_dump($sql);
 
    $result = $db2->query($sql);
}

In some cases, my genetics strain looked like this:

{345} x ???

In these cases, it was known what the mother was, but not the father. I needed my function to handle those cases so I ensured my mother_id and father_id had appropriate defaults and did my inserting.

For the others where a / was provided instead of an x, I could only provide parent records. This meant I would insert two records in the strain_genetics table. One for each parent. The function looked similar:

function insertHybrid ($childId, $parentId) {
    global $db2;
 
    $sql = "INSERT INTO strain_genetics (child_id, parent_id) VALUES ($childId, $parentId)";
    $result = $db2->query($sql);
}

With my functions written, it was time to loop through and get my genetics relationships in there:

// Check to be sure this is formatted how we expect with x or / in the middle
$vars = array ('x', '/');
if (in_array($strainParts[1], $vars)) {
    // If it's a / in the middle, we don't know which is the mother or father
    if ($strainParts[1] == '/') {
        // In some cases, the second part of the relationship is unknown like: {234} / ??? so we check each
        if (is_numeric($strainParts[0])) {
            insertHybrid($strainId, $strainParts[0]);
        }
        if (is_numeric($strainParts[2])) {
            insertHybrid($strainId, $strainParts[2]);
        }
    } else if ($strainParts[1] == 'x') {
        /**
           * This is a mother and father relationship. Because of the aforementioned case where the mother or 
           * father is unknown, we have to check
           */
        if (is_numeric($strainParts[0]) && is_numeric($strainParts[2])) {
            insertMotherFather($strainId, $strainParts[0], $strainParts[2]);
        } else {
            if (is_numeric($strainParts[0])) {
                insertMotherFather($strainId, $strainParts[0]);
            }
            if (is_numeric($strainParts[2])) {
                insertMotherFather($strainId, false, $strainParts[2]);
            }
        }
    }
}

Badda-bing, badda-boom, 1,037 strains handled and 1,204 rows of strain information in my new, usable format. Phew. This left me with roughly 400 strains to parse and alter. At this point I could probably start to do things a bit more manually, but decided I would continue writing code to handle another big chunk.

I will discuss the unique challenges that arose from more complicated genetics strings in part two of “Better Genetics Handling” which you can expect tonight or tomorrow which you can find here.

If you liked this post or just want to keep up with the latest developments concerning Smokereports.com, please consider subscribing to my feed or following smokereports on Twitter.

May 12, 2011 · admin · No Comments
Posted in: development, genetics, list, mysql, php

A Chance to Work and Welcome to the Journey

Unfortunately, I had a mishap that allowed me to take some time off of work. Rather than sit on my butt and feel sorry for myself, I decided it would be a good time to try and get this site off the ground like it deserves. So, I decided to go ahead and get a development blog going because I knew that would not only allow me to track this crazy endeavor and talk a bit about it but it would keep the drive going to develop this thing and make it as successful as it deserves.

That said, allow me to say:

“Hello there! Welcome to the journey of getting Smokereports.com ready for the big times. I hope you enjoy the show.”

Now then, let’s get to it. I’ve had this site for 3 years now and haven’t done much but give it a slight redesign and a tinkering since it’s initial months of research to bring the database together. I started getting a few thousand hits a month, then a couple hundred thousand and now I’m getting closer to 1 million hits in a month with each month that passes. Something’s telling me I’m on to something…

There are a few things that I need to get in place before I will feel comfortable sending out invites to the various cannabis powers that be to see what kinds of reactions I can get for this thing. I decided to break it down into the 10 things I want most so that’ll give me 10 blog posts and 10 things I get to finish. Usually for lists I enjoy Remember the Milk, but this blog should be just as fun.

  1. Better genetics handling
  2. Easier smoke reports
  3. Genetics trees
  4. Country mappings
  5. Data vis
  6. Logo
  7. Redesign
  8. Comments
  9. Mobile App
  10. Make it known

I’ve already begun on the better genetics handling and will write about the details tomorrow. Expect some snippets of code and brain dumps as I get all of the genetics remapped into a usable format. Much of the list has been brainstormed or mapped out or in some cases already started.

This will be a lot of fun and I hope you enjoy watching this come together as much as I will.

Stay tuned!

May 11, 2011 · admin · No Comments
Posted in: development, list, stats