<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.9.5">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2024-07-02T09:45:55+00:00</updated><id>/feed.xml</id><title type="html">The Twenty Percent</title><subtitle>Personal Blog about electrical engineering, IT and finishing projects.</subtitle><author><name>Cyrill Künzi</name></author><entry><title type="html">Building a cute CO2 gauge</title><link href="/co2/" rel="alternate" type="text/html" title="Building a cute CO2 gauge" /><published>2024-03-15T07:17:21+00:00</published><updated>2024-03-15T07:17:21+00:00</updated><id>/co2</id><content type="html" xml:base="/co2/"><![CDATA[<p>A good smart home is one where I don’t have to fiddle around with my phone all the time.
This philosophy has manifested as several physical switches spread across the apartment, controlling most essential functions.
But especially in the winter when all windows stay shut, I found myself monitoring the room’s CO2 levels quite frequently. 
The quest to free myself from this unnecessary phone checking resulted in the following specimen:</p>

<p><img src="/assets/images/co2/gaugy.jpg" alt="" class="align-center" width="100%" /></p>

<h2 id="the-hardware">The Hardware</h2>
<p>The core of the setup consists of a DS3225 servo powered by a <a href="https://www.wemos.cc/en/latest/s2/s2_mini.html">Wemos S2 mini</a>.
Both components were chosen by the very scientific process of already being in my drawer.</p>

<p><img src="/assets/images/co2/guts.jpg" alt="" class="align-center" width="100%" /></p>

<p>The 25 kg servo is admittedly a bit oversized for turning a tiny plastic pointer. But it is also much quieter than one of these ultra-cheap AliExpress alternatives, which is a plus when hanging in the living room.</p>

<p><img src="/assets/images/co2/size_matters_not.jpg" alt="" class="align-center" width="100%" /></p>

<p>Everything is held together with either press-fits or threaded inserts, while a print-out of the gauge is sandwiched between the ‘face’ and outer casing. Crafting the artistic design was probably the toughest challenge of this project, especially for a total Inkscape noob like myself.</p>

<figure class="third ">
  
    
      <a href="/assets/images/co2/case.png" title="Printed case cross section">
          <img src="/assets/images/co2/case.png" alt="Printed case cross section" />
      </a>
    
  
    
      <a href="/assets/images/co2/full_plate.png" title="Ready to print">
          <img src="/assets/images/co2/full_plate.png" alt="Ready to print" />
      </a>
    
  
    
      <a href="/assets/images/co2/blank_face.jpg" title="Blank face">
          <img src="/assets/images/co2/blank_face.jpg" alt="Blank face" />
      </a>
    
  
  
    <figcaption>3D printed parts
</figcaption>
  
</figure>

<p>To be able to directly plug the servo into the board, I had to swap the servo’s power and ground connections inside the dupont connector. So if you choose to recreate this design, make sure to double-check your connections to avoid any accidents.</p>

<h2 id="the-software">The Software</h2>
<h3 id="setup">Setup</h3>
<p>Using <a href="https://esphome.io/">ESPHome</a> made it super easy to integrate this piece of hardware into my <a href="https://www.home-assistant.io/">Home Assistant</a> instance.
But unfortunately, their convenient <a href="https://web.esphome.io/">web programmer</a> doesn’t yet support the ESP32-S2 I’m using. After some digging, I found the following workaround to get everything up and running:</p>
<ol>
  <li>Install the ESPHome Dashboard following one of the <a href="https://esphome.io/">ESPHome getting started guides</a></li>
  <li>Click on ‘New Device’</li>
  <li>Choose a name for the board</li>
  <li>Select ESP32-S2 then cancel</li>
  <li>In the newly created config file, tweak the top to match the following:</li>
</ol>

<figure class="highlight"><pre><code class="language-yaml" data-lang="yaml"><span class="na">esp32</span><span class="pi">:</span>
<span class="na">  board</span><span class="pi">:</span> <span class="s">lolin_s2_mini</span>
<span class="na">  variant</span><span class="pi">:</span> <span class="s">ESP32S2</span>
<span class="na">  framework</span><span class="pi">:</span>
<span class="na">    type</span><span class="pi">:</span> <span class="s">arduino</span>
<span class="na">    version</span><span class="pi">:</span> <span class="s">2.0.3</span>
<span class="na">    platform_version</span><span class="pi">:</span> <span class="s">5.0.0</span></code></pre></figure>

<ol start="6">
  <li>Click Install -&gt; Manual Download -&gt; Modern Format</li>
  <li>Flash the resulting file with the <a href="https://adafruit.github.io/Adafruit_WebSerial_ESPTool/">Adafruit ESPTool</a></li>
</ol>

<p>At this point, the board can be wirelessly updated like any other ESPHome device. There are probably other ways of doing this, like flashing directly with <a href="https://github.com/espressif/esptool">esptool</a>, but this was the most convenient for me.</p>

<h3 id="board-config">Board config</h3>
<p>As you might have guessed, this device doesn’t come with its own CO2 sensor onboard. Instead, the values are relayed from another ESP32-based sensor through Home Assistant. But it would, of course, also be possible to integrate such a sensor into the same device. Just make sure to include some extra holes in the case. 
The servo is configured like this:</p>

<figure class="highlight"><pre><code class="language-yaml" data-lang="yaml"><span class="na">servo</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">id</span><span class="pi">:</span> <span class="s">pointer</span>
    <span class="na">output</span><span class="pi">:</span> <span class="s">pwm_output</span>
    <span class="na">transition_length</span><span class="pi">:</span> <span class="s">10s</span> <span class="c1"># make pointer nove slowly to reduce noise</span>
    <span class="na">auto_detach_time</span><span class="pi">:</span> <span class="s">3s</span> 
    <span class="na">min_level</span><span class="pi">:</span> <span class="s">2.8%</span> <span class="c1"># calibrate to move in a 180° arc</span>
    <span class="na">max_level</span><span class="pi">:</span> <span class="s">12.7%</span> <span class="c1"># calibrate to move in a 180° arc</span>
    
<span class="na">output</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">platform</span><span class="pi">:</span> <span class="s">ledc</span>
    <span class="na">id</span><span class="pi">:</span> <span class="s">pwm_output</span>
    <span class="na">pin</span><span class="pi">:</span> <span class="s">GPIO16</span> <span class="c1"># conveniently located next to 5V/GND</span>
    <span class="na">frequency</span><span class="pi">:</span> <span class="s">50Hz</span>

<span class="na">number</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">platform</span><span class="pi">:</span> <span class="s">template</span>
    <span class="na">name</span><span class="pi">:</span> <span class="s">Servo Control</span>
    <span class="na">min_value</span><span class="pi">:</span> <span class="m">0</span>
    <span class="na">initial_value</span><span class="pi">:</span> <span class="m">50</span>
    <span class="na">max_value</span><span class="pi">:</span> <span class="m">100</span>
    <span class="na">step</span><span class="pi">:</span> <span class="m">0.1</span>
    <span class="na">optimistic</span><span class="pi">:</span> <span class="no">true</span>
    <span class="na">set_action</span><span class="pi">:</span>
      <span class="na">then</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="na">servo.write</span><span class="pi">:</span>
            <span class="na">id</span><span class="pi">:</span> <span class="s">pointer</span>
            <span class="na">level</span><span class="pi">:</span> <span class="kt">!lambda</span> <span class="s1">'</span><span class="s">return</span><span class="nv"> </span><span class="s">(x*2-100)</span><span class="nv"> </span><span class="s">/</span><span class="nv"> </span><span class="s">-100.0;'</span> <span class="c1"># map value to servo position</span></code></pre></figure>

<p>You might have to tinker with the min_level, max_level, or value mapping in order to get the pointer to line up perfectly with the printout.</p>
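<p>For reference, here is that lambda written out in plain Python, just to make the endpoints visible (same formula as above):</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"># the ESPHome lambda maps the 0-100 input onto the servo's -1..1 level range
def to_level(x):
    return (x * 2 - 100) / -100.0

print(to_level(0), to_level(50), to_level(100))  # 1.0 0.0 -1.0</code></pre></figure>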

<h3 id="home-assistant-config">Home Assistant config</h3>
<p>All Home Assistant has to do is monitor changes in CO2 concentration and map/transmit the updated values to the gauge. The corresponding automation could look like this:</p>

<figure class="highlight"><pre><code class="language-yaml" data-lang="yaml"><span class="na">alias</span><span class="pi">:</span> <span class="s">CO2 gauge control</span>
<span class="na">description</span><span class="pi">:</span> <span class="s2">"</span><span class="s">"</span>
<span class="na">trigger</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">platform</span><span class="pi">:</span> <span class="s">state</span>
    <span class="na">entity_id</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">sensor.living_room_co2</span>
<span class="na">condition</span><span class="pi">:</span> <span class="pi">[]</span>
<span class="na">action</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">service</span><span class="pi">:</span> <span class="s">number.set_value</span>
    <span class="na">data</span><span class="pi">:</span>
      <span class="na">value</span><span class="pi">:</span> <span class="s2">"</span><span class="s">"</span>
    <span class="na">target</span><span class="pi">:</span>
      <span class="na">entity_id</span><span class="pi">:</span> <span class="s">number.co2_gauge_servo_control</span>
<span class="na">mode</span><span class="pi">:</span> <span class="s">single</span></code></pre></figure>

<p>At this point, everything should be up and running smoothly, with the pointer responding promptly to new readings from the CO2 sensor.</p>

<h2 id="conclusion">Conclusion</h2>
<p><img src="/assets/images/co2/just_hanging.jpg" alt="" class="align-center" width="100%" /></p>

<p>This was a fun little project, and I like looking at the little guy nudging me to open the windows more often.</p>

<p>The design can also easily be tweaked to display any kind of sensor value.</p>

<p>If you are interested in building your own, you can find the relevant files on my <a href="https://github.com/ckuenzi/co2-gauge">GitHub</a>.</p>

]]></content><author><name>Cyrill Künzi</name></author><summary type="html"><![CDATA[A good smart home is one where I don’t have to fiddle around with my phone all the time. This philosophy has manifested as several physical switches spread across the apartment, controlling most essential functions.
]]></summary></entry><entry><title type="html">Hacking my “smart” toothbrush</title><link href="/toothbrush/" rel="alternate" type="text/html" title="Hacking my “smart” toothbrush" /><published>2023-05-24T09:17:21+00:00</published><updated>2023-05-24T09:17:21+00:00</updated><id>/toothbrush</id><content type="html" xml:base="/toothbrush/"><![CDATA[<p>After buying a new <a href="https://www.philips.ch/c-p/HX6851_53/sonicare-protectiveclean-5100-elektrische-schallzahnbuerste">Philips Sonicare</a> toothbrush, I was surprised to see that it reacts to the insertion of a brush head by blinking an LED.
A quick online search reveals that the head communicates with the toothbrush handle to remind you when it’s time to buy a new one.</p>
<p style="text-align: center;"><img src="/assets/images/toothbrush/smart.png" alt="0" /><br />
<em>From the Philips product page: seems to be REALLY smart!</em></p>

<h2 id="reverse-engineering">Reverse Engineering</h2>

<p>A look at the base of the head shows that it contains an antenna and a tiny black box, presumably an IC.
The next hint can be found in the manual, which states: “Radio Equipment in this product operates at 13.56 MHz”, indicating that it is an <a href="https://en.wikipedia.org/wiki/Near-field_communication">NFC tag</a>. 
And indeed, holding the brush head to my phone opens a link to a product page: <a href="https://www.usa.philips.com/c-m-pe/toothbrush-heads">https://www.usa.philips.com/c-m-pe/toothbrush-heads</a>.</p>

<figure class="half ">
  
    
      <a href="/assets/images/toothbrush/brush_head.jpg" title="Brush head">
          <img src="/assets/images/toothbrush/brush_head.jpg" alt="Brush head" />
      </a>
    
  
    
      <a href="/assets/images/toothbrush/nfc_chip.jpg" title="Antenna">
          <img src="/assets/images/toothbrush/nfc_chip.jpg" alt="Antenna" />
      </a>
    
  
  
    <figcaption>
</figcaption>
  
</figure>

<p>Using the <a href="https://play.google.com/store/apps/details?id=com.wakdev.nfctools.pro">NFC Tools</a> app we can learn a lot about this tag:</p>

<p><img src="/assets/images/toothbrush/nfc_info.png" alt="" class="align-center" width="50%" /></p>

<ul>
  <li>It is an <a href="https://www.nxp.com/products/rfid-nfc/nfc-hf/ntag-for-tags-labels/ntag-213-215-216-nfc-forum-type-2-tag-compliant-ic-with-144-504-888-bytes-user-memory:NTAG213_215_216">NTAG213</a></li>
  <li>It uses NfcA</li>
  <li>It is password protected</li>
  <li>We can see the link to the Philips webpage</li>
</ul>

<p>Also using NFC Tools, the memory and memory access conditions can be read:</p>

<table>
  <thead>
    <tr>
      <th>Address</th>
      <th>Data</th>
      <th>Type</th>
      <th>Access</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>0x00</td>
      <td>04:EC:FC:9C</td>
      <td>UID0-UID2/BCC0</td>
      <td>Read-Only</td>
    </tr>
    <tr>
      <td>0x01</td>
      <td>A2:94:10:90</td>
      <td>UID3-UID6</td>
      <td>Read-Only</td>
    </tr>
    <tr>
      <td>0x02</td>
      <td>B6:48:FF:FF</td>
      <td>BCC1/INT./LOCK0-LOCK1</td>
      <td>Read-Only</td>
    </tr>
    <tr>
      <td>0x03</td>
      <td>E1:10:12:00</td>
      <td>OTP0-OTP3</td>
      <td>Read-Only</td>
    </tr>
    <tr>
      <td>0x04</td>
      <td>03:20:D1:01</td>
      <td>DATA</td>
      <td>Read-Only</td>
    </tr>
    <tr>
      <td>0x05</td>
      <td>1C:55:02:70</td>
      <td>DATA</td>
      <td>Read-Only</td>
    </tr>
    <tr>
      <td>0x06</td>
      <td>68:69:6C:69</td>
      <td>DATA</td>
      <td>Read-Only</td>
    </tr>
    <tr>
      <td>0x07</td>
      <td>70:73:2E:63</td>
      <td>DATA</td>
      <td>Read-Only</td>
    </tr>
    <tr>
      <td>0x08</td>
      <td>6F:6D:2F:6E</td>
      <td>DATA</td>
      <td>Read-Only</td>
    </tr>
    <tr>
      <td>0x09</td>
      <td>66:63:62:72</td>
      <td>DATA</td>
      <td>Read-Only</td>
    </tr>
    <tr>
      <td>0x0A</td>
      <td>75:73:68:68</td>
      <td>DATA</td>
      <td>Read-Only</td>
    </tr>
    <tr>
      <td>0x0B</td>
      <td>65:61:64:74</td>
      <td>DATA</td>
      <td>Read-Only</td>
    </tr>
    <tr>
      <td>0x0C</td>
      <td>61:70:FE:00</td>
      <td>DATA</td>
      <td>Read-Only</td>
    </tr>
    <tr>
      <td>0x0D…</td>
      <td>00:00:00:00</td>
      <td>DATA</td>
      <td>Read-Only</td>
    </tr>
    <tr>
      <td>0x1F</td>
      <td>00:01:07:00</td>
      <td>DATA</td>
      <td>Readable, write protected by PW</td>
    </tr>
    <tr>
      <td>0x20</td>
      <td>00:00:00:02</td>
      <td>DATA</td>
      <td>Read-Only</td>
    </tr>
    <tr>
      <td>0x21</td>
      <td>60:54:32:32</td>
      <td>DATA</td>
      <td>Read-Only</td>
    </tr>
    <tr>
      <td>0x22</td>
      <td>31:32:31:34</td>
      <td>DATA</td>
      <td>Read-Only</td>
    </tr>
    <tr>
      <td>0x23</td>
      <td>20:31:32:4B</td>
      <td>DATA</td>
      <td>Read-Only</td>
    </tr>
    <tr>
      <td>0x24</td>
      <td>B3:02:02:00</td>
      <td>DATA</td>
      <td>Readable, write protected by PW</td>
    </tr>
    <tr>
      <td>0x25</td>
      <td>00:00:00:00</td>
      <td>DATA</td>
      <td>Readable, write protected by PW</td>
    </tr>
    <tr>
      <td>0x26</td>
      <td>00:00:00:00</td>
      <td>DATA</td>
      <td>Readable, write protected by PW</td>
    </tr>
    <tr>
      <td>0x27</td>
      <td>00:00:00:01</td>
      <td>DATA</td>
      <td>Readable, write protected by PW</td>
    </tr>
    <tr>
      <td>0x28</td>
      <td>00:03:30:BD</td>
      <td>LOCK2 - LOCK4</td>
      <td>Readable, write protected by PW</td>
    </tr>
    <tr>
      <td>0x29</td>
      <td>04:00:00:10</td>
      <td>CFG 0</td>
      <td>Read-Only</td>
    </tr>
    <tr>
      <td>0x2A</td>
      <td>43:00:00:00</td>
      <td>CFG 1</td>
      <td>Read-Only</td>
    </tr>
    <tr>
      <td>0x2B</td>
      <td>00:00:00:00</td>
      <td>PWD0-PWD3</td>
      <td>Write-Only</td>
    </tr>
    <tr>
      <td>0x2C</td>
      <td>00:00:00:00</td>
      <td>PACK0-PACK1</td>
      <td>Write-Only</td>
    </tr>
  </tbody>
</table>

<p>I repeated this process for one black and two white <a href="https://www.usa.philips.com/c-p/HX6062_65/sonicare-w-diamondclean-standard-sonic-toothbrush-heads">W DiamondClean</a> brush heads and learned the following:</p>
<ul>
  <li>Address 0x00-0x02 contains a unique ID and its checksum</li>
  <li>Address 0x04-0x0C contains the link to the Philips store (decoded in the sketch after this list)</li>
  <li>Address 0x22 is 31:32:31:34 for black and 31:31:31:31 for white heads respectively</li>
  <li>Address 0x24 contains the <strong>total brush time</strong></li>
  <li>All other readable data is identical between all heads</li>
</ul>
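<p>As a sanity check, the DATA pages from the memory dump can be decoded as an NDEF URI record in a few lines of Python (the page values are copied from the table above):</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"># pages 0x04-0x0C from the memory table above
pages = [
    "03:20:D1:01", "1C:55:02:70", "68:69:6C:69", "70:73:2E:63",
    "6F:6D:2F:6E", "66:63:62:72", "75:73:68:68", "65:61:64:74", "61:70:FE:00",
]
data = bytes(int(b, 16) for p in pages for b in p.split(":"))

# 03 = NDEF TLV, 20 = length, D1 01 1C 55 = well-known URI record header,
# 02 = URI prefix code for "https://www." (NFC Forum URI record spec)
url = "https://www." + data[7:data.index(0xFE)].decode("ascii")
print(url)  # https://www.philips.com/nfcbrushheadtap</code></pre></figure>

<p>That short link presumably just redirects to the product page mentioned above.</p>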

<h3 id="decoding-the-stored-time">Decoding the stored time</h3>
<p>Let’s do an experiment to see what changes happen to the tag when using the toothbrush:</p>
<ol>
  <li>Read the tag
    <ul>
      <li>On a new brush head that has never been in contact with the handle, the data at addr. 0x24 is 00:00:02:00.</li>
      <li>Simply attaching it to the handle (without brushing) changes nothing</li>
    </ul>
  </li>
  <li>Brush for some time
    <ul>
      <li>In this case, I let the toothbrush run for 5s</li>
    </ul>
  </li>
  <li>Read the tag again
    <ul>
      <li>The data at addr. 0x24 is now 05:00:02:00</li>
    </ul>
  </li>
  <li>Observe the difference
    <ul>
      <li>Looks like addr. 0x24 saves the number of seconds that the brush head was in use</li>
    </ul>
  </li>
</ol>

<p>When the brush is used for more than 255s, this timer rolls over into the second byte (02:01:02:00 -&gt; 258s).</p>
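<p>In other words, the first two bytes at 0x24 form a little-endian seconds counter. A quick check in Python, using the value read above:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">raw = bytes.fromhex("02010200")           # contents of page 0x24
print(int.from_bytes(raw[:2], "little"))  # 258 seconds</code></pre></figure>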

<p>Trying to overwrite the stored time is unfortunately unsuccessful, as this memory address is password protected.</p>

<h2 id="sniffing-the-password">Sniffing the password</h2>
<p>Luckily, it turns out that the required password is sent in plain text! So all I need to do is sniff the communication between the toothbrush and the head.
After digging out my <a href="https://greatscottgadgets.com/hackrf/">HackRF</a> <a href="https://en.wikipedia.org/wiki/Software-defined_radio">software defined radio</a> and some trial and error, I came up with the following workflow.</p>

<h3 id="record-rf-signal">Record RF signal</h3>
<p><img src="/assets/images/toothbrush/sniffing_in_progress.jpg" alt="" class="align-center" width="100%" /></p>

<p>When opening <a href="https://gqrx.dk/">gqrx</a> and tuning it to 13.736 MHz while holding the toothbrush close to the antenna, it is visible that the head gets polled multiple times a second. It is a welcome surprise that my simple monopole antenna gets a signal that is strong enough for this purpose. You can download the relevant gqrx configuration file <a href="/assets/files/gqrx.conf">here</a>.</p>

<p><img src="/assets/images/toothbrush/gqrx.png" alt="" class="align-center" width="100%" /></p>

<p>While brushing, the NFC polling takes a brief pause, and the first burst of packets that follows updates the time counter. 
Since gqrx can make I/Q recordings, we can capture the password exchange like this:</p>
<ol>
  <li>Turn on the toothbrush</li>
  <li>Start recording</li>
  <li>Turn off the toothbrush</li>
  <li>Stop the recording</li>
</ol>

<p>The first packets in the file should now contain the password in plain text.</p>

<h3 id="convert-recording">Convert recording</h3>
<p><img src="/assets/images/toothbrush/gnuradio.png" alt="" class="align-center" width="100%" /></p>

<p>Before this raw I/Q file can be decoded, it needs to be converted into a slightly different format that the decoding program can read.<br />
I created a small <a href="https://www.gnuradio.org/">GNU Radio</a> Companion flowgraph that applies a lowpass filter and converts the data into a WAV file with two channels containing the real and imaginary components of the complex signal.<br />
Make sure to substitute the correct paths in the source/sink blocks and check the sampling frequency (I used 2MHz).<br />
You can download the script <a href="/assets/files/sniff_NFC.grc">here</a>.</p>
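<p>If you prefer to skip GNU Radio, the same container conversion (minus the lowpass filter) can be sketched with numpy/scipy; the file names are placeholders, and I’m assuming the raw recording is gqrx’s interleaved complex float32 format:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np
from scipy.io import wavfile

SAMPLE_RATE = 2_000_000  # must match the gqrx recording (I used 2MHz)

# gqrx raw I/Q recordings are interleaved 32-bit float complex samples
iq = np.fromfile("gqrx_recording.raw", dtype=np.complex64)

# channel 0: real part (I), channel 1: imaginary part (Q)
stereo = np.stack([iq.real, iq.imag], axis=1)
wavfile.write("nfc_capture.wav", SAMPLE_RATE, stereo)</code></pre></figure>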

<h3 id="decode-recording">Decode recording</h3>

<figure class=" ">
  
    
      <a href="/assets/images/toothbrush/nfc_lab.png" title="Decoded traffic">
          <img src="/assets/images/toothbrush/nfc_lab.png" alt="Decoded traffic" />
      </a>
    
  
  
</figure>

<p>I found the perfect tool for this task called <a href="https://github.com/josevcm/nfc-laboratory">NFC-laboratory</a>. 
After opening the newly created WAV file, it should look something like the picture above. In this case, the recording is only good enough to see the communication from host to tag (green arrow), but for sniffing the password this is perfect.<br />
When looking at the <a href="https://www.nxp.com/docs/en/data-sheet/NTAG213_215_216.pdf#page=32">datasheet</a> for the NTAG213, we can see what is happening:</p>
<ul>
  <li>Line #0-#6: communication is established using the tag’s unique ID</li>
  <li>Line #7: The toothbrush sends the <strong>password</strong> (command 0x1B = PWD_AUTH)</li>
  <li>Line #9: The time counter is updated to the new value (command 0xA2 = WRITE)</li>
  <li>All lines below are repeated polling without password authentication or writing anything</li>
</ul>

<p>So the password for this brush head is <strong>67:B3:8B:98</strong> (underlined in the picture).</p>

<h2 id="writing-to-the-brush">Writing to the brush</h2>
<p>With the password successfully acquired, it’s now possible to set the counter on the brush head to anything we want by sending the relevant bytes over NFC.<br />
NFC Tools comes to the rescue again:</p>
<ol>
  <li>Go to Other -&gt; Advanced NFC commands</li>
  <li>Set the I/O Class to NfcA</li>
  <li>Set the data to 1B:67:B3:8B:98,A2:24:00:00:02:00</li>
  <li>Enjoy a factory-new brush head (at least as far as the time counter is concerned)</li>
</ol>

<p>Here is the breakdown of the command in step 3:</p>

<table>
  <thead>
    <tr>
      <th>Command</th>
      <th>Explanation</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1B</td>
      <td>PWD_AUTH</td>
    </tr>
    <tr>
      <td>67:B3:8B:98</td>
      <td>The password</td>
    </tr>
    <tr>
      <td>,</td>
      <td>Package delimiter</td>
    </tr>
    <tr>
      <td>A2</td>
      <td>WRITE</td>
    </tr>
    <tr>
      <td>24</td>
      <td>To address 0x24</td>
    </tr>
    <tr>
      <td>00:00:02:00</td>
      <td>Timer set to 0s</td>
    </tr>
  </tbody>
</table>
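<p>In principle, the same two frames could also be sent from a computer with an NFC reader instead of a phone. A rough, untested sketch using <a href="https://nfcpy.readthedocs.io/">nfcpy</a> (this assumes a libnfc-compatible reader and nfcpy’s Type2Tag transceive method; I only ever used the app):</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">import nfc  # nfcpy

PASSWORD = bytes.fromhex("67B38B98")  # the password sniffed for this head

def reset_timer(tag):
    tag.transceive(b"\x1b" + PASSWORD)                       # 0x1B = PWD_AUTH
    tag.transceive(b"\xa2\x24" + bytes.fromhex("00000200"))  # 0xA2 = WRITE page 0x24
    return False  # return immediately instead of waiting for tag removal

with nfc.ContactlessFrontend("usb") as clf:
    clf.connect(rdwr={"on-connect": reset_timer})</code></pre></figure>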

<p>Below you can see the memory of the brush head before and after the custom NFC commands:</p>

<figure class="third ">
  
    
      <a href="/assets/images/toothbrush/nfc_before.png" title="Before: 10s on timer">
          <img src="/assets/images/toothbrush/nfc_before.png" alt="" />
      </a>
    
  
    
      <a href="/assets/images/toothbrush/nfc_during.png" title="Applying the update">
          <img src="/assets/images/toothbrush/nfc_during.png" alt="" />
      </a>
    
  
    
      <a href="/assets/images/toothbrush/nfc_after.png" title="After: 0s on timer">
          <img src="/assets/images/toothbrush/nfc_after.png" alt="" />
      </a>
    
  
  
    <figcaption>Observe how the timer at address 0x24 changes
</figcaption>
  
</figure>

<p>With this, the toothbrush is now <strong>successfully hacked</strong> and we can play around with the timer as we wish.</p>

<p>Here are some interesting observations:</p>
<ul>
  <li>Only the first two bytes at address 0x24 are used for timekeeping. Once the counter reaches FF:FF:02:00 it stops going up (18 hours of continuous brushing).</li>
  <li>When the stored time is greater than 0x5460 the toothbrush blinks the LED to notify you to change heads. This corresponds to 21’600s -&gt; 180 x 2min -&gt; 3 months of brushing twice a day, which is exactly in line with Philips’ recommendation to change heads every 3 months (the arithmetic is spelled out below).</li>
</ul>
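<p>Spelling out that last bullet’s arithmetic:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">limit = 0x5460            # write-protection threshold for the counter at 0x24
print(limit)              # 21600 seconds
print(limit // 120)       # 180 two-minute brushing sessions
print(limit // 120 // 2)  # 90 days, i.e. about 3 months at two sessions a day</code></pre></figure>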

<h2 id="final-remarks">Final Remarks</h2>

<h3 id="password-verification-protection">Password verification protection</h3>
<p>You might have noticed the color of the brush head changing throughout this post. This is because I had to run out and buy a new one after getting locked out of the first one.<br />
A close look at the contents of address 0x2A (43:00:00:00) together with <a href="https://www.nxp.com/docs/en/data-sheet/NTAG213_215_216.pdf#page=18">page 18</a> of the datasheet shows that the tag is configured to permanently disable all write access after three wrong password attempts (which I promptly exceeded while playing around). This means that not even the toothbrush handle itself can write to this head again.</p>

<h3 id="password-generation">Password generation</h3>
<p>Unfortunately, the password of every brush head is unique, and extracting it with an SDR is quite involved and requires special hardware. 
At the bottom of page 30 in the datasheet, NXP recommends generating the password from the 7-byte UID. Below are all the UID - password pairs I obtained from my 3 heads:</p>

<table>
  <thead>
    <tr>
      <th>UID</th>
      <th>Password</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>04:79:CF:7A:89:10:90</td>
      <td>FF:34:CE:4C</td>
    </tr>
    <tr>
      <td>04:EC:FC:A2:94:10:90</td>
      <td>61:F0:A5:0F</td>
    </tr>
    <tr>
      <td>04:D7:29:0A:94:10:90</td>
      <td>67:B3:8B:98</td>
    </tr>
  </tbody>
</table>

<p>All my attempts to guess the one-way function for generating the passwords failed. Depending on the care the Philips engineers took, guessing this function could be almost impossible. 
But if you manage to solve this puzzle, feel free to hit me up with an e-mail.</p>
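<p>If you want to take a crack at it, a small harness like this makes it easy to test candidate functions against the three known pairs (plain CRC32 of the UID is shown as just one example of a guess):</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">import zlib

# the three known UID / password pairs from the table above
pairs = [
    ("0479CF7A891090", "FF34CE4C"),
    ("04ECFCA2941090", "61F0A50F"),
    ("04D7290A941090", "67B38B98"),
]

def guess(uid):
    # candidate one-way function goes here; swap in your own ideas
    return zlib.crc32(uid).to_bytes(4, "big").hex().upper()

for uid_hex, pwd_hex in pairs:
    print(uid_hex, guess(bytes.fromhex(uid_hex)) == pwd_hex)</code></pre></figure>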

<h2 id="update-august-16-2023">Update (August 16, 2023)</h2>
<p>After publishing this article, I was pleasantly surprised to see it picked up by some big news sites such as <a href="https://news.ycombinator.com/item?id=36128617">Hacker News</a> and <a href="https://hackaday.com/2023/05/27/hacking-a-smart-electric-toothbrush-to-reset-its-usage-counter/">Hackaday</a>. The resulting discussions and comments proved to be both enlightening and entertaining. Thanks to everyone who dropped positive comments and messages!<br />
A special shoutout has to go to <a href="https://www.youtube.com/@atc1441">Aaron Christophel</a> who got inspired by this post to:</p>
<ul>
  <li>Dump and reverse engineer the Philips Sonicare firmware to extract the password generation algorithm: <a href="https://www.youtube.com/watch?v=EPytrn8i8sc">Video</a></li>
  <li>Write a password generator: <a href="https://gist.github.com/atc1441/41af75048e4c22af1f5f0d4c1d94bb56">GitHub</a></li>
  <li>And, just for fun, make the toothbrush bust out a <a href="https://www.youtube.com/watch?v=OkfS_z0FrlE">Rick Roll</a></li>
</ul>

<p>Please go check out his content if you are interested in the solution to the puzzle.</p>

]]></content><author><name>Cyrill Künzi</name></author><summary type="html"><![CDATA[After buying a new Philips Sonicare toothbrush, I was surprised to see that it reacts to the insertion of a brush head by blinking an LED. A quick online search reveals that the head communicates with the toothbrush handle to remind you when it’s time to buy a new one.]]></summary></entry><entry><title type="html">Language modeling journey: From bigram prediction and DIY transformers to LLaMA 65B</title><link href="/nlp/" rel="alternate" type="text/html" title="Language modeling journey: From bigram prediction and DIY transformers to LLaMA 65B" /><published>2023-03-15T14:17:21+00:00</published><updated>2023-03-15T14:17:21+00:00</updated><id>/llm</id><content type="html" xml:base="/nlp/"><![CDATA[<!-- Mathjax Support -->
<script type="text/javascript" async="" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-MML-AM_CHTML">
</script>

<p>With all the hype surrounding ChatGPT (and now GPT-4), it really bothered me that I didn’t have the faintest idea of how language models or transformers work.
Fortunately, the <a href="https://github.com/karpathy/nn-zero-to-hero">Neural Networks: Zero to Hero</a> lecture series that helped me understand backpropagation in my <a href="/nanograd/">previous post</a>, also covers multiple language modeling techniques.
I found that I spent too much time in my last post explaining things that were already covered much better in the lecture. So I’ll try to keep this one shorter.</p>

<h2 id="goal">Goal</h2>
<p>The project has the very open-ended goal of “Learning about language models”. I don’t have millions to spend on GPU time to train the next ChatGPT, so I’m going to be happy with understanding the underlying theory and training some toy models.
These are the rough goals I want to achieve:</p>

<ul class="task-list">
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Understand how language models are trained</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Write my own transformer</li>
  <li class="task-list-item"><input type="checkbox" class="task-list-item-checkbox" disabled="disabled" />Train a <a href="https://en.wikipedia.org/wiki/Generative_pre-trained_transformer">GPT</a> language model</li>
</ul>

<h2 id="documentation">Documentation</h2>
<p>In this post, I’m going to follow lectures 2 through 7 pretty closely. They start with creating a very simple bigram prediction model and then introduce incrementally more complex models.</p>

<h3 id="bigram-model-729-parameters">Bigram model (729 parameters)</h3>
<p>The first step is creating a character-level bigram model: It takes a training text and counts how many times a character follows another. These counts are then normalized and converted to probability distributions.
After converting the ASCII characters to integers, all bigrams are counted and the results are stored in a 2D array. In this case, each word is a name and the model will try to predict a new name.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">bigram_counts</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="mi">27</span><span class="p">,</span> <span class="mi">27</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">int32</span><span class="p">)</span>

<span class="k">for</span> <span class="n">word</span> <span class="ow">in</span> <span class="n">words</span><span class="p">:</span>
    <span class="n">word</span> <span class="o">=</span> <span class="s">'.'</span> <span class="o">+</span> <span class="n">word</span> <span class="o">+</span> <span class="s">'.'</span>
    <span class="k">for</span> <span class="n">c1</span><span class="p">,</span> <span class="n">c2</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">word</span><span class="p">,</span> <span class="n">word</span><span class="p">[</span><span class="mi">1</span><span class="p">:]):</span>
        <span class="n">bigram_counts</span><span class="p">[</span><span class="n">stoi</span><span class="p">[</span><span class="n">c1</span><span class="p">],</span> <span class="n">stoi</span><span class="p">[</span><span class="n">c2</span><span class="p">]]</span> <span class="o">+=</span> <span class="mi">1</span>

<span class="c1"># Get probability distribution
</span><span class="n">P_bigram</span> <span class="o">=</span> <span class="n">bigram_counts</span> <span class="o">/</span> <span class="n">bigram_counts</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span></code></pre></figure>

<p>Something <strong>interesting</strong> to note is that the words are surrounded by a special character ‘.’ to indicate the start and stop of each sequence. This is useful because generation can be started with a ‘.’ and the model itself can then decide when to stop by generating another ‘.’</p>

<p>After feeding roughly <a href="https://github.com/karpathy/makemore/blob/master/names.txt">30’000</a> names into the model, it produces the following probability distribution:</p>

<p style="text-align: center;"><img src="/assets/images/NLP/bigram_probs.png" alt="0" /><br />
<em>Figure 1: Bigram probability distribution generated by simple counting</em></p>

<p>As we can see, the model has learned that a lot of names start with ‘a’ and that ‘q’ is almost always followed by ‘u’.</p>

<p>As expected, when we use these probabilities to generate random new names, the results aren’t very good.</p>

<ul>
  <li>.myliena.</li>
  <li>.r.</li>
  <li>.a.</li>
  <li>.ahi.</li>
  <li>.grammian.</li>
  <li>.n.</li>
  <li>.xxonh.</li>
  <li>.chaldeiniy.</li>
  <li>.bler.</li>
  <li>.jaranige.</li>
</ul>

<p>Because the model only sees the previous character in the sequence, it doesn’t know the length of the generated name. So ‘.a.’ is a perfectly fine choice, as ‘a’ is a common letter at the start and the end of names (see figure 1).</p>
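<p>For completeness, a sampling loop along these lines generates such names (a sketch: stoi and P_bigram come from the counting snippet above, and itos is my name for the inverse of stoi):</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">import torch

g = torch.Generator().manual_seed(2147483647)  # arbitrary fixed seed

def sample_name():
    out, ix = [], stoi['.']
    while True:
        # draw the next character from the current character's row
        ix = torch.multinomial(P_bigram[ix], num_samples=1, generator=g).item()
        out.append(itos[ix])
        if itos[ix] == '.':
            return '.' + ''.join(out)</code></pre></figure>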

<h3 id="now-with-gradient-descent-729-parameters">Now with gradient descent (729 parameters)</h3>
<p>Instead of explicitly calculating the matrix by counting, the probabilities can be learned with backpropagation and gradient descent.
In the following snippet, X is the <a href="https://en.wikipedia.org/wiki/One-hot">one-hot</a> encoded input character and y is the next character in the sequence. W takes the place of <strong>P_bigram</strong> from the last section.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">W</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">27</span><span class="p">,</span> <span class="mi">27</span><span class="p">,</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1000</span><span class="p">):</span>
    <span class="n">logits</span> <span class="o">=</span> <span class="n">X</span> <span class="o">@</span> <span class="n">W</span> 
    <span class="n">loss</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">cross_entropy</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>

    <span class="n">W</span><span class="p">.</span><span class="n">grad</span> <span class="o">=</span> <span class="bp">None</span>
    <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
    <span class="n">W</span><span class="p">.</span><span class="n">data</span> <span class="o">-=</span> <span class="mi">50</span> <span class="o">*</span> <span class="n">W</span><span class="p">.</span><span class="n">grad</span></code></pre></figure>

<p>After training for a bit, W converges to the following distribution, which looks the same as the one in Figure 1:</p>

<p style="text-align: center;"><img src="/assets/images/NLP/backprop_probs.png" alt="0" /><br />
<em>Figure 2: Bigram probability distribution generated by backpropagation</em></p>

<p>This is confirmed when generating random names with the same seed, as the results are exactly the same.</p>

<ul>
  <li>.myliena.</li>
  <li>.r.</li>
  <li>.a.</li>
  <li>.ahi.</li>
  <li>.grammian.</li>
  <li>.n.</li>
  <li>.xxonh.</li>
  <li>.chaldeiniy.</li>
  <li>.bler.</li>
  <li>.jaranige.</li>
</ul>

<h3 id="multi-layer-perceptron-8762-parameters">Multi-layer perceptron (8’762 parameters)</h3>
<p>One straightforward way to improve the model is to increase its context length. In this case, instead of only seeing the previous character, the model is going to see the previous 3 characters.
If we just used the previous technique, the resulting 4-dimensional probability matrix would have \(27^4=531'441\) parameters, which is starting to get pretty big.
So the purely statistical approach is ditched in favor of a small feed-forward neural network defined like this:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">N_EMBEDDINGS</span> <span class="o">=</span> <span class="mi">5</span>
<span class="n">HIDDEN_NEURONS</span> <span class="o">=</span> <span class="mi">200</span>
<span class="n">C</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">((</span><span class="mi">27</span><span class="p">,</span> <span class="n">N_EMBEDDINGS</span><span class="p">),</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">W1</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">((</span><span class="n">N_EMBEDDINGS</span> <span class="o">*</span> <span class="n">BLOCK_SIZE</span><span class="p">,</span> <span class="n">HIDDEN_NEURONS</span><span class="p">),</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">b1</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">HIDDEN_NEURONS</span><span class="p">,</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">W2</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">((</span><span class="n">HIDDEN_NEURONS</span><span class="p">,</span> <span class="mi">27</span><span class="p">),</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">b2</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">27</span><span class="p">,</span> <span class="n">requires_grad</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">parameters</span> <span class="o">=</span> <span class="p">[</span><span class="n">C</span><span class="p">,</span> <span class="n">W1</span><span class="p">,</span> <span class="n">b1</span><span class="p">,</span> <span class="n">W2</span><span class="p">,</span> <span class="n">b2</span><span class="p">]</span>

<span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="n">X</span><span class="p">):</span>
    <span class="n">embeddings</span> <span class="o">=</span> <span class="n">C</span><span class="p">[</span><span class="n">X</span><span class="p">]</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">embeddings</span><span class="p">.</span><span class="n">view</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">N_EMBEDDINGS</span> <span class="o">*</span> <span class="n">BLOCK_SIZE</span><span class="p">)</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">x</span> <span class="o">@</span> <span class="n">W1</span> <span class="o">+</span> <span class="n">b1</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tanh</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">x</span> <span class="o">@</span> <span class="n">W2</span> <span class="o">+</span> <span class="n">b2</span>
    <span class="k">return</span> <span class="n">x</span></code></pre></figure>

<p>This introduces the concept of an <strong>embedding</strong>, which maps each 27-dimensional one-hot encoded input character to a 5-dimensional embedding that is learned together with the rest of the network.<br />
We can take a look at the cosine similarities between all embeddings after training:</p>

<p style="text-align: center;"><img src="/assets/images/NLP/mlp_embedding_similarity.png" alt="0" /><br />
<em>Figure 3: ReLU of the cosine similarities between learned embeddings</em></p>

<p>It looks like the embeddings of the vowels are often similar to each other and that ‘.’, ‘n’, and ‘q’ have pretty distinct embeddings.</p>
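<p>A similarity matrix like the one in Figure 3 can be computed directly from the embedding matrix (a sketch; C is taken from the snippet above):</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">import torch

emb = C.detach()                           # the learned 27x5 embedding matrix
emb = emb / emb.norm(dim=1, keepdim=True)  # normalize each row to unit length
similarity = torch.relu(emb @ emb.T)       # ReLU of all pairwise cosine similarities</code></pre></figure>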

<p>The random names generated with this model are already way better than before:</p>
<ul>
  <li>.mylieke.</li>
  <li>.rada.</li>
  <li>.erackanolando.</li>
  <li>.yusamailem.</li>
  <li>.karlymandrah.</li>
  <li>.parlif.</li>
  <li>.meilae.</li>
  <li>.ayy.</li>
  <li>.saiora.</li>
  <li>.jaylianyyah.</li>
</ul>

<h3 id="wavenet-176k-parameters">WaveNet (176k parameters)</h3>
<p>The next step is extending the context length even further and using another network topology with many more parameters.
This topology is a WaveNet introduced in 2016 by Google DeepMind <a href="https://arxiv.org/abs/1609.03499">https://arxiv.org/abs/1609.03499</a>.
As seen in Figure 4, it uses dilated causal convolutions and was originally created for speech synthesis. 
In this case, the convolutions are replaced with carefully arranged linear layers and the network is of course used for character prediction.</p>

<p style="text-align: center;"><img src="/assets/images/NLP/WaveNet_animation.gif" alt="0" /><br />
<em>Figure 4: WaveNet topology <a href="https://www.deepmind.com/blog/high-fidelity-speech-synthesis-with-wavenet">source</a></em></p>

<p>This particular implementation is very inefficient because the data is occasionally reshaped to create the correct tensor shapes.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">WaveNet</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">n_embeddings</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">hidden_neurons</span><span class="o">=</span><span class="mi">200</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">(</span><span class="n">WaveNet</span><span class="p">,</span> <span class="bp">self</span><span class="p">).</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">n_embeddings</span> <span class="o">=</span> <span class="n">n_embeddings</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">hidden_neurons</span> <span class="o">=</span> <span class="n">hidden_neurons</span>
        
        <span class="bp">self</span><span class="p">.</span><span class="n">embedding</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Embedding</span><span class="p">(</span><span class="mi">27</span><span class="p">,</span> <span class="n">n_embeddings</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fc1</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">n_embeddings</span> <span class="o">*</span> <span class="mi">2</span><span class="p">,</span> <span class="n">hidden_neurons</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">bn1</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">BatchNorm1d</span><span class="p">(</span><span class="n">hidden_neurons</span><span class="p">)</span>
        
        <span class="bp">self</span><span class="p">.</span><span class="n">fc2</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">hidden_neurons</span> <span class="o">*</span> <span class="mi">2</span><span class="p">,</span> <span class="n">hidden_neurons</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">bn2</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">BatchNorm1d</span><span class="p">(</span><span class="n">hidden_neurons</span><span class="p">)</span>
        
        <span class="bp">self</span><span class="p">.</span><span class="n">fc3</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">hidden_neurons</span> <span class="o">*</span> <span class="mi">2</span><span class="p">,</span> <span class="n">hidden_neurons</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">bn3</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">BatchNorm1d</span><span class="p">(</span><span class="n">hidden_neurons</span><span class="p">)</span>
        
        <span class="bp">self</span><span class="p">.</span><span class="n">fc4</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">hidden_neurons</span><span class="p">,</span> <span class="mi">27</span><span class="p">)</span>
        
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">embedding</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        
        <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">view</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">n_embeddings</span> <span class="o">*</span> <span class="mi">2</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">tmp</span> <span class="o">=</span> <span class="n">x</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc1</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">swapaxes</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">bn1</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">swapaxes</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tanh</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        
        <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">hidden_neurons</span> <span class="o">*</span> <span class="mi">2</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc2</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">swapaxes</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">bn2</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">swapaxes</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tanh</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        
        <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">reshape</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">hidden_neurons</span> <span class="o">*</span> <span class="mi">2</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc3</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">bn3</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">tanh</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc4</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
       
        <span class="k">return</span> <span class="n">x</span>
    
<span class="n">model</span> <span class="o">=</span> <span class="n">WaveNet</span><span class="p">(</span><span class="n">n_embeddings</span> <span class="o">=</span> <span class="mi">20</span><span class="p">,</span> <span class="n">hidden_neurons</span><span class="o">=</span><span class="mi">200</span><span class="p">)</span></code></pre></figure>

<p>After training this network for a couple of epochs with a decreasing learning rate, the generated names start to sound pretty name-like:</p>
<ul>
  <li>.kynnale.</li>
  <li>.obagann.</li>
  <li>.evanne.</li>
  <li>.jatetton.</li>
  <li>.adiliah.</li>
  <li>.kiyaen.</li>
  <li>.nalar.</li>
  <li>.khfi.</li>
  <li>.keilei.</li>
  <li>.awyana.</li>
</ul>
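
<p>The training loop itself is nothing special. A minimal sketch with a stepwise learning-rate decay (assuming <code>X</code> and <code>y</code> are index tensors built with a context length of 8; the exact optimizer and schedule here are illustrative):</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">import torch
import torch.nn.functional as F

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Reduce the learning rate by 10x every three epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)

for epoch in range(9):
    for i in range(0, len(X), 32):  # mini-batches of 32
        logits = model(X[i:i+32])
        loss = F.cross_entropy(logits, y[i:i+32])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()</code></pre></figure>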

<p>Looking at the embedding similarities again, it seems like they are more distinct than before. This makes sense considering the dimensionality of the embedding vectors has gone up from 5 to 20.</p>

<p style="text-align: center;"><img src="/assets/images/NLP/waveNet_embeddings.png" alt="0" /><br />
<em>Figure 5: ReLU of the cosine similarities between learned WaveNet embeddings</em></p>

<h3 id="gpt-from-scratch-107m-parameters">GPT from scratch (10.7M parameters)</h3>
<p>Learning how transformers work was definitely the most exciting part of this project for me!
Explaining transformers is no easy task, so I’m not even going to try. If you are interested, I highly recommend watching the <a href="https://youtu.be/kCc8FmEb1nY">lecture</a>.</p>

<p>But I want to quickly show what lies at the heart of a transformer: the <strong>self-attention</strong> mechanism.<br />
For each input token, it generates a query, a key, and a value vector. The queries and keys determine how strongly each token attends to the others, and the values carry the information that is actually exchanged. This lets the tokens communicate in a very elegant way.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">Head</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">head_size</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">key</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">n_embeddings</span><span class="p">,</span> <span class="n">head_size</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">query</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">n_embeddings</span><span class="p">,</span> <span class="n">head_size</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">value</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="n">n_embeddings</span><span class="p">,</span> <span class="n">head_size</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">register_buffer</span><span class="p">(</span><span class="s">'tril'</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">tril</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">ones</span><span class="p">(</span><span class="n">block_size</span><span class="p">,</span> <span class="n">block_size</span><span class="p">)))</span>
        
        <span class="bp">self</span><span class="p">.</span><span class="n">dropout</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Dropout</span><span class="p">(</span><span class="n">dropout</span><span class="p">)</span>
        
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="n">B</span><span class="p">,</span> <span class="n">T</span><span class="p">,</span> <span class="n">C</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">shape</span>
        <span class="n">k</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">key</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">q</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">query</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">v</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">value</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        
        <span class="n">wei</span> <span class="o">=</span> <span class="n">q</span> <span class="o">@</span> <span class="n">k</span><span class="p">.</span><span class="n">transpose</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="n">k</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">**-</span><span class="mf">0.5</span>
        <span class="n">wei</span> <span class="o">=</span> <span class="n">wei</span><span class="p">.</span><span class="n">masked_fill</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">tril</span><span class="p">[:</span><span class="n">T</span><span class="p">,</span> <span class="p">:</span><span class="n">T</span><span class="p">]</span> <span class="o">==</span> <span class="mi">0</span><span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="s">'-inf'</span><span class="p">))</span>
        <span class="n">wei</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">softmax</span><span class="p">(</span><span class="n">wei</span><span class="p">,</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">wei</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">dropout</span><span class="p">(</span><span class="n">wei</span><span class="p">)</span>
        
        <span class="n">x</span> <span class="o">=</span> <span class="n">wei</span> <span class="o">@</span> <span class="n">v</span>
        
        <span class="k">return</span> <span class="n">x</span></code></pre></figure>

<p>The final model uses 384-dimensional embeddings, a context length of 256 characters, and consists of 6 transformer blocks with 6 self-attention heads each. 
After being trained on the works of Shakespeare for a couple of thousand epochs (on the GPU this time), it produces text like this:</p>
<blockquote>
  <p>And come to Coriolanus.</p>

  <p>CORIOLANUS:
I will be so, show; I hope this service
As it gives me a word which to practise my lave,
When I had rather cry ‘‘twas but a bulk.</p>

  <p>MENENIUS:
Let’s not pray you.</p>

  <p>AUTOLYCUS:
I know ‘tis not without the shepherd, not a monster, man.</p>

  <p>CORIOLANUS:
The poor, Pampey, sir.</p>

  <p>CORIOLANUS:
Nor thou hast, my lord; there I know the taple
I was violently?</p>

  <p>CORIOLANUS:
Why, that that he were recounted nose,
A heart of your lord’st traded, whose eservice we remain
With this chance to you did
The deputy of such pure complices them
Redeem with our love they call and new good to-night.</p>

  <p>BRUTUS:
I dare now in the voices: here are
Acrown the state news.</p>

  <p>SICINIUS:
You have been in crutching of them, cry on their this?</p>

  <p>MENENIUS:
Proclam, sir, I shall.</p>
</blockquote>

<h3 id="playing-with-llama-65b-parameters">Playing with LLaMA (65B parameters)</h3>
<p>While I was working on this project, the weights of Facebook’s LLaMA model leaked online in a <a href="https://www.vice.com/en/article/xgwqgw/facebooks-powerful-large-language-model-leaks-online-4chan-llama">hilarious way</a>.
Looking at the publicly available <a href="https://github.com/facebookresearch/llama/blob/main/llama/model.py">model code</a>, most things seem very similar to the DIY version.</p>

<p>Language models might be having their StableDiffusion moment right now. Seemingly every day there is a new innovation, like reducing the model size with quantization or fine-tuning it to act more like OpenAI’s Davinci model.
After increasing my <a href="https://clay-atlas.com/us/blog/2021/08/31/windows-en-wsl-2-memory/">WSL’s RAM budget</a> to 50GB, downloading the LLaMA weights, and quantizing them to 4-bit, I was able to use <a href="https://github.com/ggerganov/llama.cpp">llama.cpp</a> to run the 65B model on a CPU.<br />
Like with diffusion models, it seems futile to go into much detail here, as the whole landscape will probably look completely different in only a couple of weeks.</p>

<h2 id="conclusion">Conclusion</h2>
<p>Going from having absolutely no clue about LLMs and transformers to understanding the current cutting-edge research was definitely a fun project. 
Major props have to go to Andrej Karpathy for his amazing lectures that perfectly guided me through this process.
I will probably return to this topic soon when the available models and tooling have advanced enough to do DIY fine-tuning.<br />
All code can be found on my <a href="https://github.com/ckuenzi/moremakemore">GitHub</a>.</p>

]]></content><author><name>Cyrill Künzi</name></author><summary type="html"><![CDATA[With all the hype surrounding chatGPT (and now GPT-4), it really bothered me that I don’t have the faintest idea of how language models or transformers work.]]></summary></entry><entry><title type="html">Writing an autograd engine and creating images with backpropagation</title><link href="/nanograd/" rel="alternate" type="text/html" title="Writing an autograd engine and creating images with backpropagation" /><published>2023-02-23T15:54:21+00:00</published><updated>2023-02-23T15:54:21+00:00</updated><id>/nanograd</id><content type="html" xml:base="/nanograd/"><![CDATA[<p>As part of my master thesis, I implemented a <a href="https://en.wikipedia.org/wiki/Spiking_neural_network">spiking neural network</a> from scratch on a microcontroller. 
Because training was done separately on a PC, I only had to write the forward pass. This was a big relief because PyTorch’s autograd always seemed like black magic to me (even after learning the theoretical background).
So seeing a post on HackerNews about <a href="https://en.wikipedia.org/wiki/Andrej_Karpathy">Andrej Karpathy’s</a> amazing video series on building neural networks from scratch seemed like the perfect opportunity to fix this blind spot of mine. That means this project will probably be <strong>heavily</strong> inspired by his <a href="https://github.com/karpathy/micrograd">micrograd</a>.</p>

<h2 id="goal">Goal</h2>
<p>The goal is to create the whole machine-learning pipeline from scratch and successfully use it to learn a task.</p>

<ul>
  <li>Implement a neural network pipeline from scratch (forward and backward pass, <a href="https://en.wikipedia.org/wiki/Stochastic_gradient_descent">SGD</a>)</li>
  <li>Don’t use any machine learning libraries (including NumPy)</li>
  <li>Get decent accuracy on <a href="https://en.wikipedia.org/wiki/MNIST_database">MNIST</a>, let’s say 80%</li>
</ul>

<h2 id="documentation">Documentation</h2>

<h3 id="watching-the-lecture">Watching the lecture</h3>
<p>Over a couple of weeks, I watched the first <a href="https://www.youtube.com/watch?v=VMj-3S1tku0">video</a> in Andrej Karpathy’s <a href="https://karpathy.ai/zero-to-hero.html">Neural Networks: Zero to Hero
</a> series. With the help of a Jupyter notebook, he builds a version of micrograd that covers everything from backpropagation to learning a simple binary classification task with a <a href="https://en.wikipedia.org/wiki/Multilayer_perceptron">multilayer perceptron</a>. In my opinion, the explanations are done <strong>exceptionally</strong> well, relying mostly on code to explain the concepts (as opposed to equations, as many university courses would).<br />
But sometimes that feeling of understanding can be deceitful: even when something is well understood in theory, actually using it in an exercise or a real-world application can often reveal blind spots or unforeseen difficulties.</p>

<h3 id="computational-graph-and-backpropagation">Computational graph and backpropagation</h3>
<p>So let’s finally start writing code! I use a Jupyter notebook for this, as it is very convenient for quick prototyping.<br />
The first task is implementing the forward pass that builds a <a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph">directed acyclic graph</a> which remembers the order of all executed operations. Every operation takes some input vertices (each representing a number) and produces one output vertex (the result of the operation). Addition would look something like: a(child) + b(child) = c(parent).
The harder part, which lies at the heart of backpropagation, is the backward() function that propagates a gradient from the parent to its children. This backward function explicitly spells out the partial derivative for each operation. For example, an addition during the forward pass distributes the gradient of the parent to all children during the backward pass.<br />
One interesting aspect of doing backpropagation on the whole graph is that it needs to be done in the correct order, so that each gradient is fully computed before being propagated further. This order can be computed by using something like <a href="https://en.wikipedia.org/wiki/Topological_sorting">topological sorting</a>. This step cost me a lot of time because of an undiscovered bug in my topological sort that put a parent vertex in the wrong position with respect to its descendants.</p>
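
<p>To make this concrete, here is a minimal sketch of such a vertex class, loosely following micrograd (only addition and multiplication are shown, and the exact structure of my own implementation differs in places):</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">class Value:
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._backward = lambda: None  # propagates this vertex's gradient to its children

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # Addition simply distributes the parent's gradient to both children
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # d(a*b)/da = b and d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topological sort so every gradient is complete before it is propagated further
        order, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build(child)
                order.append(v)
        build(self)

        self.grad = 1.0
        for v in reversed(order):
            v._backward()</code></pre></figure>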

<h3 id="multi-layer-perceptron">Multi-layer perceptron</h3>
<p>With the hard part out of the way, the previously defined operations (such as add and multiply) can be composed together into larger functions:</p>
<ul>
  <li>A fully connected linear layer that does a matrix multiplication and adds a bias (this also stores the weights of the network)</li>
  <li><a href="https://en.wikipedia.org/wiki/Rectifier_(neural_networks)">ReLU</a> as the nonlinearity</li>
  <li><a href="https://en.wikipedia.org/wiki/Root-mean-square_deviation">RMSE</a> as the loss function</li>
</ul>

<h3 id="learning-with-sgd">Learning with SGD</h3>
<p>At this point the neural network is complete and can be trained by repeating the following steps:</p>
<ol>
  <li>Feed data into the network</li>
  <li>Predict an outcome from this data</li>
  <li>Compare with the true outcome to calculate the loss</li>
  <li>Backpropagate the loss throughout the network to calculate the gradients</li>
  <li>Subtract the gradients from their weights/biases (with an appropriate learning rate)</li>
</ol>
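
<p>Put together, one training run in this DIY framework looks roughly like the following sketch (the <code>mlp</code> model with its <code>parameters()</code> helper and the squared-error loss are assumptions based on the steps above):</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">learning_rate = 0.05

for epoch in range(100):
    # Zero the gradients left over from the previous step
    for p in mlp.parameters():
        p.grad = 0.0

    # Steps 1-3: predict each sample and accumulate a squared-error loss
    loss = Value(0.0)
    for x, target in dataset:
        err = mlp(x) + Value(-target)  # prediction minus true outcome
        loss = loss + err * err

    # Step 4: backpropagate through the whole computational graph
    loss.backward()

    # Step 5: nudge every weight against its gradient
    for p in mlp.parameters():
        p.data -= learning_rate * p.grad</code></pre></figure>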

<p>With these steps implemented, the network can successfully learn a simple classification task such as the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_moons.html">moons dataset</a>.</p>

<h3 id="training-larger-models">Training larger models</h3>
<p>At this point, it seemed trivial to just do the same thing for MNIST. Unfortunately, I was confronted with the fact that this implementation of backpropagation is <strong>very very slow</strong>! Even Andrej himself said “It’s the slowest autograd engine imaginable” on <a href="https://twitter.com/karpathy/status/1249561661855297536">Twitter</a>. Training a model on only a handful of the <strong>60’000</strong> 28x28 images of MNIST already took several minutes. Even after moving the goalpost to a smaller <a href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html">dataset</a> with 8x8 images, training was prohibitively slow.</p>

<p>At this point, I decided that I had learned everything I wanted to learn from this project and threw in the towel regarding the goal of learning MNIST.</p>

<h2 id="going-on-a-tangent">Going on a tangent</h2>
<p>During the lecture, there was an offhand comment on how the gradients are also unnecessarily computed for all input values (the training images). This made me want to see if it’s possible to generate images with backpropagation.</p>

<h2 id="training-8x8-digits-with-pytorch">Training 8x8 digits with PyTorch</h2>
<p>First I had to train a decent classifier on the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html">digits dataset</a>. For this, I ditched my own implementation and used PyTorch, which sped up development and computation by several orders of magnitude.</p>

<p>The model is a very straightforward three-layer MLP with a bit of dropout for regularization:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">class</span> <span class="nc">FF</span><span class="p">(</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">(</span><span class="n">FF</span><span class="p">,</span> <span class="bp">self</span><span class="p">).</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fc1</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">64</span><span class="p">,</span> <span class="mi">32</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fc2</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">32</span><span class="p">,</span><span class="mi">16</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">fc3</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Linear</span><span class="p">(</span><span class="mi">16</span><span class="p">,</span><span class="mi">10</span><span class="p">)</span>
        
        <span class="bp">self</span><span class="p">.</span><span class="n">dropout1</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Dropout</span><span class="p">(</span><span class="mf">0.2</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">dropout2</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="n">Dropout</span><span class="p">(</span><span class="mf">0.2</span><span class="p">)</span>
        
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">fc1</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">dropout1</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">relu</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">fc2</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">dropout2</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="n">x</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">fc3</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">x</span></code></pre></figure>

<p>The data is split 80%/20% into a train and a test set. The training goes well, with the model achieving a final 93% accuracy on the test set.
<img src="/assets/images/nanograd/88mnist_training.png" alt="Train/test accuracy training curves" /></p>

<h2 id="generating-images-with-backprop">Generating images with backprop</h2>
<p>The goal is to generate images that look as much like a digit to the network as possible.
The procedure is very similar to training, but instead of the network parameters, the input image is modified with gradient descent.
In this case, the individual pixels are also clamped between 0 and 16 to match the dataset.</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># seed: Array of 64 initial pixels
# target: number between 0-9
# output: Array of 64 modified pixels
</span><span class="k">def</span> <span class="nf">generate_digit</span><span class="p">(</span><span class="n">seed</span><span class="p">,</span> <span class="n">target</span><span class="p">):</span>
    <span class="n">image</span> <span class="o">=</span> <span class="n">seed</span><span class="p">.</span><span class="n">clone</span><span class="p">()</span> 
    <span class="n">image</span><span class="p">.</span><span class="n">requires_grad</span> <span class="o">=</span> <span class="bp">True</span> <span class="c1"># This is the important part
</span>    <span class="n">image_optimizer</span> <span class="o">=</span> <span class="n">optim</span><span class="p">.</span><span class="n">Adam</span><span class="p">([</span><span class="n">image</span><span class="p">],</span> <span class="n">lr</span><span class="o">=</span><span class="mf">0.1</span><span class="p">)</span> 
    
    <span class="c1">#This is like doing epochs during training
</span>    <span class="k">for</span> <span class="n">step</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">):</span>
        <span class="n">image_optimizer</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
        <span class="n">pred</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">clamp</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">16</span><span class="p">))</span> 
        <span class="c1"># Same loss as in training
</span>        <span class="n">loss</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">cross_entropy</span><span class="p">(</span><span class="n">pred</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">target</span><span class="p">))</span> 
        <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
        <span class="n">image_optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
        
    <span class="c1"># Clamp s.t. the domain stays the same
</span>    <span class="k">return</span> <span class="n">torch</span><span class="p">.</span><span class="n">clamp</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">16</span><span class="p">)</span></code></pre></figure>

<p>With this, we can convert a 0 into an image that the network recognizes as an 8:</p>
<p style="text-align: center;"><img src="/assets/images/nanograd/0-8b.gif" alt="Generateing an 8 from a 0" /><br />
<em>Generating an 8 from zero</em></p>

<p>But usually, the network’s perception does not align very well with that of us humans! Here are the optimal digits generated from a black square:</p>

<p style="text-align: center;"><img src="/assets/images/nanograd/0.png" alt="0" /> <img src="/assets/images/nanograd/1.png" alt="1" /> <img src="/assets/images/nanograd/2.png" alt="2" /> <img src="/assets/images/nanograd/3.png" alt="3" /> <img src="/assets/images/nanograd/4.png" alt="4" /><br />
0 1 2 3 4<br />
<img src="/assets/images/nanograd/5.png" alt="5" /> <img src="/assets/images/nanograd/6.png" alt="6" /> <img src="/assets/images/nanograd/7.png" alt="7" /> <img src="/assets/images/nanograd/8.png" alt="8" /> <img src="/assets/images/nanograd/9.png" alt="9" /><br />
5 6 7 8 9<br />
<em>Digits created from black squares</em></p>
<p>If you squint your eyes hard enough you might see some rough shapes for 0, 2, 3, and 5.</p>

<p>Another thing we can do is generate a bunch of images from noise and average them into a single composite image. In this case, 1000 versions of each digit are generated from uniform noise and then averaged:</p>
<p style="text-align: center;"><img src="/assets/images/nanograd/avg_0.png" alt="0" /> <img src="/assets/images/nanograd/avg_1.png" alt="1" /> <img src="/assets/images/nanograd/avg_2.png" alt="2" /> <img src="/assets/images/nanograd/avg_3.png" alt="3" /> <img src="/assets/images/nanograd/avg_4.png" alt="4" /><br />
0 1 2 3 4<br />
<img src="/assets/images/nanograd/avg_5.png" alt="5" /> <img src="/assets/images/nanograd/avg_6.png" alt="6" /> <img src="/assets/images/nanograd/avg_7.png" alt="7" /> <img src="/assets/images/nanograd/avg_8.png" alt="8" /> <img src="/assets/images/nanograd/avg_9.png" alt="9" /><br />
5 6 7 8 9<br />
<em>Average digits created from noise</em></p>
<p>With some more eye squinting, you might even see the 0, 2, 3, 4, 5, 7, and 8!</p>
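
<p>For reference, the averaging experiment boils down to just a few lines, reusing the <code>generate_digit()</code> function from above:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python">import torch

def average_digit(target, n=1000):
    # Optimize n uniform-noise seeds toward the target digit and average the results
    samples = [generate_digit(torch.rand(64) * 16, target).detach() for _ in range(n)]
    return torch.stack(samples).mean(dim=0)</code></pre></figure>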

<p>It is no StableDiffusion, but nonetheless a fun exercise in using backpropagation in an unusual way.</p>

<h3 id="generating-adversarial-images">Generating adversarial images</h3>
<p>We can also modify the loss function to simultaneously optimize toward a target digit while staying as close to the starting point as possible:</p>

<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">for</span> <span class="n">step</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">):</span>
    <span class="n">image_optimizer</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
    <span class="n">pred</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">clamp</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">16</span><span class="p">))</span>

    <span class="c1"># This is the same as before
</span>    <span class="n">loss</span> <span class="o">=</span> <span class="n">F</span><span class="p">.</span><span class="n">cross_entropy</span><span class="p">(</span><span class="n">pred</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">tensor</span><span class="p">(</span><span class="n">target</span><span class="p">))</span>

    <span class="c1"># Additional loss from difference to the original image
</span>    <span class="n">loss</span> <span class="o">+=</span> <span class="n">MSE</span><span class="p">(</span><span class="n">image</span><span class="p">,</span> <span class="n">seed</span><span class="p">)</span> <span class="o">*</span> <span class="mi">2</span>

    <span class="n">loss</span><span class="p">.</span><span class="n">backward</span><span class="p">()</span>
    <span class="n">image_optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">()</span></code></pre></figure>

<p>For example, the image on the left is correctly read as an 8. But the one on the right, after being altered by the code above, is incorrectly predicted to be a 9.</p>

<p style="text-align: center;"><img src="/assets/images/nanograd/original_8.png" alt="8" /> <img src="/assets/images/nanograd/predicted_9.png" alt="9?" /><br />
<em>Left: Original, read as 8 <br /> Right: Adversarial 9 with just a bit of noise</em></p>

<h2 id="conclusion">Conclusion</h2>
<p>Unfortunately, I was not able to achieve all the initially stated goals, falling short of the 80% accuracy target on MNIST with my own backpropagation. I knew this was a pretty ambitious task, but did not expect it to be <em>this</em> hard. Maybe I was a bit spoiled by using well-optimized machine-learning libraries for years, never appreciating how much slower things would be without them.<br />
But I’m still happy with the outcome of this project, having learned a lot about backpropagation and gradient descent! Also, the tangent about generating images with backprop was quite fun, mostly because of the visual aspect and not knowing what outcomes to expect.<br />
I think for future learning projects like this it would be a good idea to state the initial goal as something more like “Understand backprop” instead of technical milestones.</p>

<p>The code for this project can be found on <a href="https://github.com/ckuenzi/nanograd">GitHub</a>.</p>

<!-- Google tag (gtag.js) -->
<script async="" src="https://www.googletagmanager.com/gtag/js?id=G-JCFYDY59EG"></script>

<script>
  window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());

  gtag('config', 'G-JCFYDY59EG');
</script>]]></content><author><name>Cyrill Künzi</name></author><summary type="html"><![CDATA[As part of my master thesis, I implemented a spiking neural network from scratch on a microcontroller. Because training was done separately on a PC, I only had to write the forward pass. This was a big relief because PyTorches autograd always seemed like black magic to me (even after learning the theoretical background). So seeing a post on HackerNews about Andrej Karpathys amazing video series on building neural networks from scratch, seemed like the perfect opportunity to fix this blind spot of mine. That means this project will probably be heavily inspired by his micrograd. Goal The goal is to create the whole machine-learning pipeline from scratch and successfully use it to learn a task. Implement a neural network pipeline from scratch (forward and backward pass, SGD) Don’t use any machine learning libraries (including NumPy) Get decent accuracy on MNIST, let’s say 80% Documentation Watching the lecture For a couple of weeks, I watched the first video in Andrej Karpathys Neural Networks: Zero to Hero series. With the help of a Jupyter notebook, he implements a version of micrograd that implements everything from backpropagation to the ability to learn a simple binary classification task with a multilayer perceptron. In my opinion, the explanations are done exceptionally well, relying mostly on code to explain the concepts (as opposed to equations, as many university courses would). But sometimes that feeling of understanding can be deceitful: Even when something is well understood in theory, actually using it in an exercise or a real word application can often reveal some blind spots or unforeseen difficulties. Computational graph and backpropagation So let’s finally start writing code! I use a Jupyter notebook for this, as it is very convenient for quick prototyping. The first task is implementing the forward pass that builds a directed acyclic graph which remembers the order of all executed operations. Every operation takes some input vertices (each representing a number) and produces one output vertex (the result of the operation). Addition would look something like: a(child) + b(child) = c(parent) The harder part that lies at the heart of backpropagation is the backward() function that propagates a gradient from the parent to its children. This backward function explicitly spells out the partial derivative for each operation. For example, an addition during the forward pass distributes the gradient of the parent to all children during the backward pass. One interesting aspect of doing backpropagation on the whole graph is that it needs to be done in the correct order so that each gradient is fully computed before propagating it further. This order can be computed by using something like topological sorting. This is an item that cost me a lot of time because I had an undiscovered bug in my topo sort that would put a parent vertex in the wrong position with respect to its descendants. 
Multi-layer perceptron With the hard part out of the way, the previously defined operations (such as add and multiply) can be composed together into larger functions: A fully connected linear layer that does matrix multiplication and adds a bias (This also stores the weights of the network) RelU as the nonlinearity RMSE as the loss function Learning with SGD At this point the neural network is complete and can be trained by repeating the following steps: Feed data into the network Predict an outcome from this data Compare with the true outcome to calculate the loss Backpropagate the loss throughout the network to calculate the gradients Subtract the gradients from their weights/biases (with an appropriate learning rate) With these steps implemented, the network can successfully learn a simple classification task such as the moons dataset Training larger models At this point, it seemed trivial to just do the same thing for MNIST. Unfortunately, I was confronted with the fact that this implementation of backpropagation is very very slow! Even Andrej himself said “It’s the slowest autograd engine imaginable” on Twitter. Training a model on only a handful of the 60’000 28x28 images of MNIST already took several minutes. Even after moving the goalpost to a smaller dataset with 8x8 images, training was prohibitively slow. At this point, I decided that I learned everything I wanted to learn from this project and throw in the towel regarding the goal of learning MNIST. Going on a tangent During the lecture, there was an offhand comment on how the gradients are also unnecessarily computed for all input values (the training images). This made me want to see if it’s possible to generate images with backpropagation. Training 8x8 digits with PyTorch First I had to train a decent classifier on the digits dataset. For this, I ditched my own implementation and used PyTorch which will speed up development and computation time by several orders of magnitude. The model is a very straightforward three-layer MLP with a bit of dropout for regularization class FF(nn.Module): def __init__(self): super(FF, self).__init__() self.fc1 = nn.Linear(64, 32) self.fc2 = nn.Linear(32,16) self.fc3 = nn.Linear(16,10) self.dropout1 = nn.Dropout(0.2) self.dropout2 = nn.Dropout(0.2) def forward(self, x): x = F.relu(self.fc1(x)) x = self.dropout1(x) x = F.relu(self.fc2(x)) x = self.dropout2(x) x = self.fc3(x) return x The data is split 80%/20% into a train and a test set. The training goes well, with the model achieving a final 93% accuracy on the test set. Generating images with backprop The goal is to generate images that look as much like a digit to the network as possible. This is done by doing something very similar to training, but instead of the network parameters, the input image is modified with gradient descent. In this case, the individual pixels are also clamped between 0 and 16 to match the dataset. # seed: Array of 64 initial pixels # target: number between 0-9 # output: Array of 64 modified pixels def generate_digit(seed, target): image = seed.clone() image.requires_grad = True # This is the important part image_optimizer = optim.Adam([image], lr=0.1) #This is like doing epochs during training for step in range(100): image_optimizer.zero_grad() pred = model(torch.clamp(image, 0, 16)) # Same loss as in training loss = F.cross_entropy(pred, torch.tensor(target)) loss.backward() image_optimizer.step() # Clamp s.t. 
With this, we can convert a 0 to be recognized as an 8 by the network.

[Image: generating an 8 from a zero]

But usually, the perception of the network does not align very well with our own! Here are the optimal digits generated from a black square:

[Image: digits 0-9 created from black squares]

If you squint your eyes hard enough, you might see some rough shapes for 0, 2, 3, and 5. Another thing we can do is generate a bunch of images from noise and average them into a single composite image. In this case, for each digit 1000 versions are generated from uniform noise and then averaged:

[Image: average digits 0-9 created from noise]

With some more eye squinting, you might even see the 0, 2, 3, 4, 5, 7, and 8! It is no Stable Diffusion, but nonetheless a fun exercise in using backpropagation in an unusual way.

Generating adversarial images

We can also modify the loss function to simultaneously optimize toward a target digit while staying as close to the starting point as possible:

    for step in range(100):
        image_optimizer.zero_grad()
        pred = model(torch.clamp(image, 0, 16))
        # This is the same as before
        loss = F.cross_entropy(pred, torch.tensor(target))
        # Additional loss from the difference to the original image
        loss += MSE(image, seed) * 2
        loss.backward()
        image_optimizer.step()

For example, the image on the left is correctly read as an 8. But the one on the right, after being altered by the code above, is incorrectly predicted to be a 9.

[Image — left: original, read as 8; right: adversarial 9 with just a bit of noise]

Conclusion

Unfortunately, I was not able to achieve all the initially stated goals, because I didn’t reach 80% accuracy on MNIST with my own backpropagation. I knew this was a pretty ambitious task, but did not expect it to be this hard. Maybe I was a bit spoiled by using well-optimized machine-learning libraries for years, never appreciating how much slower a from-scratch implementation could be. But I’m still happy with the outcome of this project, having learned a lot about backpropagation and gradient descent! Also, the tangent about generating images with backprop was quite fun, mostly because of the visual aspect and not knowing what outcomes to expect. I think for future learning projects like this, it would be a good idea to state the initial goal as something more like “understand backprop” instead of technical milestones.

The code for this project can be found on GitHub.]]></summary></entry><entry><title type="html">Hello World!</title><link href="/hello-world/" rel="alternate" type="text/html" title="Hello World!" /><published>2023-02-23T11:54:21+00:00</published><updated>2023-02-23T11:54:21+00:00</updated><id>/hello-world</id><content type="html" xml:base="/hello-world/"><![CDATA[<p><img src="/assets/images/hello-world.jpg" alt="" /></p>

<p>Ever since I first discovered how to print my own name with the Windows command line as a child, I’ve been drawn to all kinds of programming projects. Many years later, I blinked my first LED with an Arduino, and the scope of my personal projects quickly expanded into the physical realm. So over time and many projects, I’ve accumulated an impressive pile of demos, proofs of concept, and mostly finished PCBs. Unfortunately, I’ve come to the realization that the number of projects I’ve actually completed to a satisfactory degree is rather small.</p>

<h4 id="why-projects-fail">Why projects fail</h4>
<p>The cycle of starting 100 projects and finishing none is probably something a lot of people are familiar with. There are many reasons why the initial enthusiasm often fizzles out quickly: lack of time or motivation, unforeseen obstacles, unrealistic expectations, fear of failure, or crippling perfectionism.
I’ve fallen victim to all of these, and they generally strike at the point where the basic mechanism works but the final touches are still missing. In these situations, the 80/20 rule often comes to mind: 80% of the work can be done in 20% of the time. But this also means that the final 20% (the finishing touches) takes 80% of the time. This is the reason for this blog currently being called “<strong>The Twenty Percent</strong>”.</p>

<p><img src="/assets/images/box_of_learning_opportunities.PNG" alt="One box of mostly half-finished projects" />
<em>Box of mostly half-finished projects</em></p>

<h4 id="trying-to-be-better">Trying to be better</h4>
<p>Having just completed my master’s degree, I’m in a comfortable position with lots of free time and motivation to do cool things. I would like to use this opportunity to improve my ability to see personal projects through to completion. This especially means focusing on the elusive last 20% of a project, which takes so much effort to finish.</p>

<p>For that reason, I’m creating this personal blog to document my progress and get some writing practice. In the beginning, the goal is to complete a few simpler projects to home in on the process, before advancing to more complex topics later on.</p>

<p>The post for each project will have three sections:</p>
<ul>
  <li><strong>Goal</strong>: Specification of the things I want to achieve with the project. When these things are done, the project is considered “complete”. But some moving of the goalposts will probably be inevitable…</li>
  <li><strong>Documentation</strong>: Description of the project itself, including technical details, design decisions, and challenges.</li>
  <li><strong>Takeaway</strong>: Analysis of the process itself, including what went well, what could be improved, and any lessons learned.</li>
</ul>

<p><img src="/assets/images/box_of_prints.png" alt="One box of mostly half finished projects" />
<em>Box of 3D printed learning opportunities</em></p>

<h2 id="first-project-creating-this-blog">First Project: Creating this Blog</h2>
<p>Setting up this site is a perfect opportunity for a first project.</p>

<h3 id="goal">Goal</h3>
<p>I don’t have a lot of experience with blogs or writeups, so the exact specifications of this blog will probably only be clear once I start using it. But there are some rough goals that I want to achieve:</p>
<ol>
  <li>Host the site on GitHub (or some other free alternative)</li>
  <li>Access the site with my own domain <a href="https://kuenzi.dev/">https://kuenzi.dev/</a></li>
  <li>The ability to write posts in markdown</li>
  <li>Make it look decent</li>
  <li>Write the first post (this one)</li>
</ol>

<h3 id="documentation">Documentation</h3>
<p>With a relatively good picture of the final product in mind, I can start working on the project.</p>

<h4 id="choosing-a-site-generator">Choosing a site generator</h4>
<p>Going with <strong>Jekyll</strong> was the obvious choice, as it is well supported and endorsed by <a href="https://pages.github.com/">GitHub Pages</a> itself and enables <a href="https://jekyllrb.com/docs/posts/">post</a> creation with <a href="https://www.markdowntutorial.com/">markdown</a>. It also strikes a nice balance between ease of use and customizability.
I chose to go with a local Jekyll install because the ability to immediately see any changes locally significantly speeds up the process of tinkering around. Even though I don’t have any experience with Ruby, the initial <a href="https://jekyllrb.com/docs/step-by-step/01-setup/">setup</a> went very smoothly and I had the example blog running in no time.</p>
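<p>For reference, the whole local workflow boils down to a handful of commands (a rough sketch assuming a working Ruby installation; the linked guide has the details):</p>

<pre><code># install Jekyll and Bundler once
gem install bundler jekyll

# scaffold an example blog and preview it locally
jekyll new my-blog
cd my-blog
bundle exec jekyll serve   # serves http://localhost:4000 and rebuilds on save
</code></pre>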

<h4 id="hosting-on-github">Hosting on GitHub</h4>
<p>GitHub even has its own documentation on how to <a href="https://docs.github.com/en/pages/setting-up-a-github-pages-site-with-jekyll/about-github-pages-and-jekyll">set up a Jekyll blog</a>, which, to my surprise, worked without a hitch. I chose to host it as my user page on <a href="https://ckuenzi.github.io">ckuenzi.github.io</a>, as it will serve as a landing page for all things concerning my projects.</p>

<h4 id="using-a-custom-domain">Using a custom domain</h4>
<p>Free hosting on GitHub is nice and all, but using a custom domain elevates it to the next level. Luckily, this can be done rather easily by <a href="https://gist.github.com/plembo/84f80c920bb5ac6f19e53fe6f8db1ff7">changing some DNS records</a>. The hardest part was being patient for an hour until the DNS changes became active and the certificate was generated.</p>
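<p>For the curious, the records in question look roughly like this (the A records are the GitHub Pages IPs documented by GitHub; substitute your own domain and username):</p>

<pre><code>kuenzi.dev.      A      185.199.108.153
kuenzi.dev.      A      185.199.109.153
kuenzi.dev.      A      185.199.110.153
kuenzi.dev.      A      185.199.111.153
www.kuenzi.dev.  CNAME  ckuenzi.github.io.
</code></pre>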

<h4 id="making-it-look-pretty">Making it look pretty</h4>
<p>The default <a href="https://github.com/jekyll/minima">minima</a> Jekyll theme already looks nice, but I decided to go with the dark version of the very popular <a href="https://github.com/mmistakes/minimal-mistakes">minimal-mistakes</a> theme because it looks a bit more modern and has a lot of nifty features. The nice thing about a program like Jekyll is that the actual posts are written completely independently of the theme, which can easily be swapped out later (see the snippet below).<br />
But even with a complete theme, there are a lot of knobs to turn and styles to try. This is the part of the project I spent a lot of time on, especially because of the sophistication of this particular theme. There are still some things I would like to improve, such as header images and a sidebar.<br />
But no matter the theme, this site will look a bit barren without actual content.</p>
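<p>Picking the dark skin, for example, is a one-line setting in <code>_config.yml</code> (a sketch of the remote-theme setup GitHub Pages needs; the version pin is illustrative):</p>

<pre><code># _config.yml
remote_theme: "mmistakes/minimal-mistakes@4.24.0"
minimal_mistakes_skin: "dark"
plugins:
  - jekyll-include-cache   # required by minimal-mistakes
</code></pre>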

<h4 id="writing-the-first-post">Writing the first post</h4>
<p>With all the technical aspects working, this is the point where I reached my <strong>last 20%</strong> for this project. Writing this post took way longer than expected, requiring a lot of perseverance and motivation. And with the hype surrounding ChatGPT at the moment, it took a lot of restraint not to outsource all the work to an LLM.
There are many things I’m not yet sure of, such as the writing style/tone and the actual content of this blog. But these things will hopefully become clearer with time and additional posts.</p>

<h3 id="takeaway">Takeaway</h3>
<p>As this subsection is being written, I have successfully completed my first project. All the technical stuff was pleasantly straightforward, without any unforeseen difficulties. It was also interesting to actively observe the dip in motivation once I had to start writing, which easily took 80% of my effort. It will be interesting to see how that plays out in future projects that are not <em>this</em> meta and need writing on top of the rest of the challenges.<br />
For the future, it would be lovely not to procrastinate as much as I did once the hard part inevitably showed up. But having carefully observed the dip in motivation, and having pushed through it once, will hopefully help with that.
In the end, I’m satisfied with the outcome of this project: all of the goals were met, and I now have a platform that can easily be used for future posts.</p>

]]></content><author><name>Cyrill Künzi</name></author></entry></feed>