Regular Expressions

class: center, middle

# Regular Expressions: Pattern matching

---
      # Pattern matching

Sometime we want to match not a single equality but a
      pattern. Mostly this is used for text processing.

https://docs.python.org/3/library/re.html

Regular expressions (RE) are used to match a string. It is a
      test to see if a string matches a pattern.
      
      ---
      #Simple usage

```python
      import re
      RESULT = re.search(PATTERN,QUERYSTRING)
      if RESULT:
          # WE HAD A MATCH
      else:
          # WE DID NOT HAVE A MATCH
      ```

```python
      import re
      m = re.search("bow","elbow")
      if m:
          print("matched bow")
      else:
          print("did not match bow")
      ```
      
      ---
      #Regular expressions and matching

Matches pattern to string
      There are several components to the match.
      * All the alpha numeric characters match themselves
      * Upper and lowercase are respected
      * Special characters to match extra patterns
          * \d matches numeric (0-9)
          * \D matches NOT numeric not(0-9)
          * \s matches white space 
          * \S matches NOT white space
          * [A-Z] - ranges, all letters A-Z
          * . - matches anything

```python
      re.search('\d bird', '8 birds') # true
      re.search('\d bird', '1 bird') # true
      re.search('\d bird', 'A bird') # false

re.search('[123] bird', '1 bird')  # true
      re.search('[0-3] bird', '4 birds') # false

re.search('\d bird', '4 Birds') # false
      re.search('\d [Bb]ird', '4 Birds') # true
      ```
     
      ---
      #Modifiers

Additionally the RE grammar allows repetitions

* + - match one or more times
      * * - match zero or more times
      * ? - match 0 or 1 time

```python
      re.search('\d birds?','8 birds') # true
      re.search('\d birds?','1 bird') # true

re.search('A+B','AAAAAAB') # true
      re.search('A+B','AB') # true
      re.search('A+B','B') # false

re.search('A*B','AAAAAAB') # true
      re.search('A*B','AB') # true
      re.search('A*B','B') # true      
      ```

---
      #Grouping patterns and Capture

Use Parentheses to group patterns and further repeat.
      Items in the parentheses that are captured can be
      retrieved and used.

```python
      import re
      m = re.search("((AB)+)C","ABABABCDED")
      if m:
          print("Group 0",m.group(0))
          print("Group 1",m.group(1))
          print("Group 2",m.group(2))
      ```
      
      ---
      #Context of pattern

* ^ - matches beginning of string
      * $ - matches end of string

```python
      re.search('\d bird', '8 birds') # true
      re.search('\d bird$', '8 birds') # false
      re.search('^\d bird', '8 birds') # true
      re.search('^\d bird', '10 birds') # false
      ```

---
      # pattern searching

If you want to find more than one occurance, or
      count the number occurance you can use `search` or `findall`
      options

```python
      start =0
      m = re.search(pattern, string, start)
      while( m ):
           #  process this match
           start = m.end()+1
           m = re.search(pattern,string,start)      
      ```
      ---
      #Speeding up

Python REs have an option called `compile` which will
      (potentially) improve speed of pattern matching

```
      pattern = re.compile("AACA")
      matches = pattern.search(DNA)
      if match:
          print(match.group(0))
      ````      
      ---
      # Practical example

Restriction Enzymes
      ```
      EcoRI   = "GAATTC"
      EcoRII  = "CC[AT]GG"

RestrictionEnzymes = [EcoRI, EcoRII]
      DNA = "ACAGACGAGAGAATTCGGTAGAT"
      for RE in RestrictionEnzymes:
         pattern = re.compile(RE)
         match = pattern.search(DNA)
         count = pattern.findall(DNA)
         print(RE,"matches", len(count), "sites")

print("//")
      ```
      ---
      #More examples

See
      https://github.com/biodataprog/code_templates/tree/master/Regexp