David Vedvick

Notes

May 20th, 2021

These are notes I took from the 2021 Open-Source North conference, which takes place annually in Minneapolis, Minnesota.

Deep Learning for Natural Language Processing

  • Natural Language Understanding (NLU) is the focus of this talk

Natural Language Generation

  • Mapping from computer representation space to language space
  • Opposite direction of NLU

Deep Learning

  • Subfield of machine learning
  • Algorithms, called artificial neural networks, inspired by the structure and function of the brain
  • Advantage over traditional machine learning: features are extracted automatically

Text is Messy

  • Punctuation, typos, unknown words, etc.

Preprocessing Techniques

  1. Turn the text into a meaningful format for analysis (tokenization)
  2. Clean the data
    • Remove: upper case letters (lowercase everything), punctuation, numbers, stop words
    • Stemming
    • Part-of-speech tagging
    • Correct misspelled words
    • Chunking (named entity recognition, compound term extraction)
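
A minimal sketch of the tokenize-and-clean steps above, using only the Rust standard library. The function name `preprocess` and the tiny stop-word list are assumptions for illustration; a real pipeline would likely use a proper tokenizer and a much fuller stop-word list.

```rust
use std::collections::HashSet;

/// Lowercases the text, splits it on whitespace, strips punctuation and
/// digits from each token, and drops stop words.
/// The stop-word list here is a small illustrative assumption.
fn preprocess(text: &str) -> Vec<String> {
    let stop_words: HashSet<&str> = ["the", "a", "an", "is", "of", "and", "to"]
        .iter()
        .copied()
        .collect();

    text.to_lowercase()
        .split_whitespace()
        // Keeping only alphabetic characters removes punctuation and numbers.
        .map(|token| token.chars().filter(|c| c.is_alphabetic()).collect::<String>())
        // Tokens that were pure punctuation or digits are now empty.
        .filter(|token| !token.is_empty() && !stop_words.contains(token.as_str()))
        .collect()
}
```

Calling `preprocess("The cat, the hat & 2 dogs!")` yields the tokens `cat`, `hat`, `dogs`.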

Preprocessing: stemming

Stemming and Lemmatization = cut word down to base form

  • Stemming: uses rough heuristics to reduce words to base
  • Lemmatization: uses vocabulary and morphological analysis
  • Makes the meaning of run, runs, running, ran all the same
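
To make the "rough heuristics" point concrete, here is a deliberately crude suffix-stripping stemmer. It is not the Porter algorithm or any library's implementation; the name `naive_stem` and the suffix list are assumptions chosen only to show the idea.

```rust
/// A crude heuristic stemmer: strips a few common English suffixes.
/// This is an illustrative sketch, not a real stemming algorithm.
fn naive_stem(word: &str) -> String {
    let lower = word.to_lowercase();
    // Longest suffixes first, so "running" loses "ning" rather than just "ing".
    for suffix in ["ning", "ing", "s"].iter() {
        if let Some(stem) = lower.strip_suffix(*suffix) {
            // Guard against reducing short words to almost nothing.
            if stem.len() >= 3 {
                return stem.to_string();
            }
        }
    }
    lower
}
```

With these rules `run`, `runs`, and `running` all reduce to `run`, but `ran` stays `ran`. Mapping an irregular form like that back to `run` needs the vocabulary and morphological analysis that lemmatization brings.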

Bag of words

Way of representing text data when modeling text with machine learning or deep learning algorithms
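
A small sketch of the idea: each document becomes a vector of word counts over a shared vocabulary. The helper names `build_vocabulary` and `bag_of_words` are assumptions for illustration, and tokenization is just a whitespace split.

```rust
use std::collections::{BTreeSet, HashMap};

/// Collects every distinct lowercased word across all documents,
/// sorted alphabetically so each word has a stable position.
fn build_vocabulary(docs: &[&str]) -> Vec<String> {
    let mut words = BTreeSet::new();
    for doc in docs {
        for token in doc.split_whitespace() {
            words.insert(token.to_lowercase());
        }
    }
    words.into_iter().collect()
}

/// Represents one document as a count per vocabulary word.
/// Word order inside the document is lost.
fn bag_of_words(doc: &str, vocabulary: &[String]) -> Vec<usize> {
    let mut counts: HashMap<String, usize> = HashMap::new();
    for token in doc.split_whitespace() {
        *counts.entry(token.to_lowercase()).or_insert(0) += 1;
    }
    vocabulary
        .iter()
        .map(|word| counts.get(word).copied().unwrap_or(0))
        .collect()
}
```

For the documents "the cat sat" and "the dog sat on the cat", the vocabulary is `cat, dog, on, sat, the`, and the second document becomes `[1, 1, 1, 1, 2]`. The vector length grows with the vocabulary.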

Word embeddings

Type of word representation that allows words with similar meaning to have a similar representation

  • They are a distributed representation of text
  • Word embedding methods learn a real-valued vector representation for a predefined, fixed-size vocabulary from a corpus of text
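
"Similar representation" is usually measured with cosine similarity between the word vectors. The sketch below assumes embeddings are plain `f64` slices; any vectors fed to it in an example would be made-up toy values, not output from a trained model.

```rust
/// Cosine similarity between two vectors: 1.0 means same direction,
/// 0.0 means unrelated (orthogonal). Returns 0.0 for a zero vector.
fn cosine_similarity(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}
```

With a good embedding, the vectors for two related words should score closer to 1.0 than the vectors for two unrelated words.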

Word2Vec

  • Problem: count vectors far too large for many documents
    • Solution: Word2Vec reduces number of dimensions (configurable, e.g. 300)
  • Problem: bag of words neglects word order
  • Skip-gram: a neural network architecture that uses a word to predict the words in the surrounding context, defined by the window size
  • Continuous Bag of Words (CBOW): uses the surrounding context to predict the target word
  • What happens? Learns words likely to appear near each word
  • Vectors can be combined to create features for documents
  • Use Document Vectors for ML/DL on documents (classification, etc.)
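
Two pieces of the list above are easy to sketch without a neural network: how skip-gram turns text into (word, context word) training pairs for a given window size, and how word vectors can be averaged into a document vector. The function names are assumptions, and averaging is only one simple way to combine vectors.

```rust
/// Generates (center, context) training pairs the way skip-gram does:
/// every word is paired with each neighbor within `window` positions.
fn skip_gram_pairs<'a>(tokens: &[&'a str], window: usize) -> Vec<(&'a str, &'a str)> {
    let mut pairs = Vec::new();
    for (i, center) in tokens.iter().enumerate() {
        let start = i.saturating_sub(window);
        let end = (i + window + 1).min(tokens.len());
        for j in start..end {
            if j != i {
                pairs.push((*center, tokens[j]));
            }
        }
    }
    pairs
}

/// Combines word vectors into one document vector by averaging them.
/// Assumes every word vector has the same number of dimensions.
fn document_vector(word_vectors: &[Vec<f64>]) -> Vec<f64> {
    if word_vectors.is_empty() {
        return Vec::new();
    }
    let mut sum = vec![0.0_f64; word_vectors[0].len()];
    for vector in word_vectors {
        for (total, value) in sum.iter_mut().zip(vector) {
            *total += value;
        }
    }
    let count = word_vectors.len() as f64;
    sum.iter().map(|total| total / count).collect()
}
```

For the tokens `the quick brown fox` with a window of 1, this produces six pairs, starting with `(the, quick)`. CBOW would use the same windows in the opposite direction, predicting the center word from its neighbors.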

Feature Selection

Manual process in traditional machine learning techniques, which happens automatically in deep learning

Embeddings + CNN

Using word embeddings to represent words and a convolutional neural network for the classification task. The architecture has 3 pieces:

  1. Word embedding model: generate word vectors
  2. Convolutional model: extracts salient features from documents
  3. Fully connected model: interpretation of extracted features in terms of a predictive model
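
A toy sketch of pieces 2 and 3, assuming the word vectors from piece 1 already exist. One convolution filter slides over consecutive word vectors, ReLU and max-over-time pooling pick its strongest response, and a single sigmoid unit stands in for the fully connected model. The function names and the single-filter setup are assumptions; nothing here is trained.

```rust
/// Slides one filter across `filter.len()` consecutive word vectors,
/// applies ReLU, and keeps the strongest response (max-over-time pooling).
fn conv_max_pool(embeddings: &[Vec<f64>], filter: &[Vec<f64>]) -> f64 {
    let width = filter.len();
    if width == 0 || embeddings.len() < width {
        return 0.0;
    }
    let mut best = 0.0_f64;
    for window in embeddings.windows(width) {
        let mut activation = 0.0_f64;
        for (word, weights) in window.iter().zip(filter) {
            for (x, w) in word.iter().zip(weights) {
                activation += x * w;
            }
        }
        // ReLU clamps negatives to zero; max keeps the best window.
        best = best.max(activation.max(0.0));
    }
    best
}

/// A one-unit fully connected layer: weighted sum of the pooled
/// features plus a bias, squashed to a probability with a sigmoid.
fn dense_sigmoid(features: &[f64], weights: &[f64], bias: f64) -> f64 {
    let z = features.iter().zip(weights).map(|(f, w)| f * w).sum::<f64>() + bias;
    1.0 / (1.0 + (-z).exp())
}
```

A real model would run many filters of several widths and feed all pooled values into the final layer.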

Recurrent Neural Networks

  • Networks with loops
  • Allows information to persist
  • Enables connecting previous information to present task
  • Context preserved
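
The "loop" can be sketched with a single scalar hidden state that is fed back in at every step. This is a minimal illustration with made-up weights passed in as parameters, not a trained network; the names `rnn_step` and `run_rnn` are assumptions.

```rust
/// One step of a minimal scalar RNN:
/// h_t = tanh(w_x * x_t + w_h * h_prev + b)
fn rnn_step(x: f64, h_prev: f64, w_x: f64, w_h: f64, b: f64) -> f64 {
    (w_x * x + w_h * h_prev + b).tanh()
}

/// Runs the RNN over a whole sequence, starting from a zero hidden state.
/// The hidden state carries information from earlier inputs forward.
fn run_rnn(inputs: &[f64], w_x: f64, w_h: f64, b: f64) -> f64 {
    inputs
        .iter()
        .fold(0.0, |h, &x| rnn_step(x, h, w_x, w_h, b))
}
```

Because the previous state feeds into each step, the sequences `[1.0, 0.0]` and `[0.0, 1.0]` end in different hidden states, which is how order and context are preserved.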

Vanishing Gradients with RNNs

  • In their simplest form, RNNs don't work as well as wanted
  • Gradients shrink as they are propagated back through many time steps, so early inputs barely influence learning
  • Long Short-Term Memory (LSTM) units help combat the vanishing gradient problem by introducing an "error carousel".
    • Allows learning sequences, keeping track of the order without a vanishing gradient
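
A toy calculation shows why this matters. During backpropagation through time, the gradient is multiplied by a factor at every step. The sketch below assumes that factor is the same constant at each step, which is a simplification; the function name is hypothetical.

```rust
/// Multiplies a starting gradient of 1.0 by the same factor once per
/// time step, mimicking backpropagation through `steps` steps.
fn gradient_after_steps(per_step_factor: f64, steps: u32) -> f64 {
    (0..steps).fold(1.0_f64, |gradient, _| gradient * per_step_factor)
}
```

With a factor of 0.5, the gradient after 20 steps is below one millionth of its starting value. With a factor of exactly 1.0, which is roughly what the error carousel aims for, it stays at 1.0 no matter how long the sequence is.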

Major challenges with DL for NLP

  • Data size: RNNs/LSTMs don't generalize well on small datasets
  • Relevant Corpus: required to create domain specific word embedding
  • Deeper Networks: empirically deeper networks have better accuracy
  • Training Time: RNNs take a long time to learn

https://github.com/hardlyhuman

Robot Rock - AI and Music Composition

  • Previous Examples of Music Generation:
    • Mozart?! - developed game for auto-generating music
    • ILLIAC I
    • Neural Networks

Machine Learning used for Music Generation

  • Standard feed forward networks aren't a good fit for predicting sequential events (e.g. music, text)
    • Limitation: fixed number of inputs/outputs

Recurrent Neural Networks (RNN)

  • Better for text/music
  • LSTM is key to improving results

Music Encoding Options

  • MIDI
  • Waveform
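
The difference between the two encodings can be sketched in a few lines: MIDI stores symbolic note numbers, while a waveform stores raw audio samples. The conversion below uses the standard equal-temperament tuning (note 69 = A4 = 440 Hz); the function names are assumptions.

```rust
/// Equal-temperament frequency in Hz for a MIDI note number,
/// where note 69 is A4 at 440 Hz and each semitone is a factor of 2^(1/12).
fn midi_to_frequency(note: u8) -> f64 {
    440.0 * 2.0_f64.powf((note as f64 - 69.0) / 12.0)
}

/// Renders a pure tone as waveform samples in the range [-1.0, 1.0].
fn sine_wave(frequency: f64, sample_rate: u32, num_samples: usize) -> Vec<f64> {
    (0..num_samples)
        .map(|n| {
            let t = n as f64 / sample_rate as f64;
            (2.0 * std::f64::consts::PI * frequency * t).sin()
        })
        .collect()
}
```

One second of a single note is one small MIDI event but tens of thousands of waveform samples, which is part of why the choice of encoding matters for a model.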

Programs that Do Music Generation

  • Amper
  • AIVA
    • Generational soundtracks for video games?!
  • LANDR - AI based mastering of music
  • Magenta (Google)
  • OpenAI: MuseNet (which MuseTree is built on), Jukebox
  • PopGun
  • Live performance tools: Tidal Cycles, Orca

Artists That Use AI

  • Taryn Southern - I Am AI (2018)
  • Yacht - Chain Tripping (2019)
    • Transcribed entire backlog to MIDI to train Magenta
    • Treated ML as a collaborator
  • Holly Herndon - Proto (2019)
    • Created "Spawn", which performed music
    • She earned her PhD based on this album

Lyrics Generation

  • GPT-2 has a model to generate lyrics

Empowering Streams through KSQL

  • Querying Kafka streams through KSQL
  • Custom Data Integration is hard: ephemeral isn't useful, stateful is hard
  • Kafka: A-B integration allows loose coupling, with Kafka as the middle layer
  • Kafka can handle load with a very predictable, linearly scaling model
  • Kafka organizes data into "Topics", which are split into partitions

Kafka Data Transformation

  • Single Message Transforms (SMT)
    • Transformations configured via JSON
  • KStreams: advanced message transforms in Java
  • KStream - unending list of messages arriving
  • KTable - a projection of the most recent value in a KStream
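
The KStream/KTable relationship can be modeled without Kafka at all: the stream is the full list of keyed messages, and the table is what you get by keeping only the latest value per key. This is a conceptual sketch in plain Rust, not the Kafka Streams API; `materialize_table` is a hypothetical name.

```rust
use std::collections::HashMap;

/// Folds a stream of (key, value) messages into a table that holds
/// only the most recent value seen for each key.
fn materialize_table(stream: &[(&str, i64)]) -> HashMap<String, i64> {
    let mut table = HashMap::new();
    for (key, value) in stream {
        // A later message for the same key overwrites the earlier one.
        table.insert(key.to_string(), *value);
    }
    table
}
```

For the stream `("alice", 10), ("bob", 5), ("alice", 25)`, the table ends up with two rows: `alice = 25` and `bob = 5`.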

ksqlDB

Uses a SQL interface to work against KTables/KStreams

  • The EMIT CHANGES clause makes a query run continuously, pushing new results as they arrive
  • Has basic querying capabilities, and other functions, that work against Kafka streams

Links

  • ksqldb.io
  • Confluent open source
  • Confluent runs cloud native Kafka distribution

Lessons on Chaos Engineering

Chaos engineering is experimentation: you build an experiment around a steady-state hypothesis.
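
One way to picture such an experiment in code: check the steady state, inject a failure, then check the steady state again. This is a hypothetical sketch of the shape of an experiment, not any chaos tool's API; `run_experiment` and `Outcome` are made-up names, and the checks are passed in as closures.

```rust
/// Possible results of one experiment run.
#[derive(Debug, PartialEq)]
enum Outcome {
    /// The system was already unhealthy, so no failure was injected.
    Aborted,
    /// The steady state survived the injected failure.
    HypothesisHeld,
    /// The steady state broke: a weakness was found.
    HypothesisFailed,
}

/// Verifies the steady state, injects a failure, then verifies again.
fn run_experiment<S, I>(mut steady_state: S, inject_failure: I) -> Outcome
where
    S: FnMut() -> bool,
    I: FnOnce(),
{
    // Never experiment on a system that is not healthy to begin with.
    if !steady_state() {
        return Outcome::Aborted;
    }
    inject_failure();
    if steady_state() {
        Outcome::HypothesisHeld
    } else {
        Outcome::HypothesisFailed
    }
}
```

Either outcome is useful: a held hypothesis builds confidence, and a failed one is an insight gained before a real incident.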

  • Not all signs are useful.
  • "The future seems implausible, the past incredible"
  • Weak signals are the signals we get before something goes wrong, and they are an important source of insight
  • Search for how close we are to failure
  • Past signals may not be future signals, future signals may come from areas that were not signalling before

Insights that Come From Weak Signals

  • On-Call shifts should end on Fridays!
    • Engineers are tired
    • On-call shifts ended on Fridays and began for the next person on Friday
  • A designated "ops-support" person
  • "I don't know anything about this, we'll need to talk to Emma.": signalling the system is approaching a boundary - what happens if Emma decides to pursue other opportunities?
  • Value proposition of chaos engineering is the insights you gain
  • Rare that a single signal is strong enough
  • Having a multi-functional product team is the best way to make products

Technical Excellence through Mob Programming

  • Retrospectives: tie together learning time
  • 1 year of no bugs! Organization chose to scale mob programming.

How to Mob Program

The Mob Programming RPG

https://github.com/willemlarsen/mobprogrammingrpg

  • Driver: drives the PC
  • Navigator: gives the directions on what to program
  • Mobber: yields to less privileged voices, contributes ideas
  • Researcher: breaks off on tangents to look into different ideas
  • Sponsor: speaks up for others
  • Navigator of the Navigator: navigates the navigator!
  • Automationist: notices when a developer does the same thing over and over, and looks for ways to automate it
  • The Nose: calls out code smells
  • Traffic Cop: keeps everyone in line

Other mob role taxonomies exist.

Goals

  • Treat everyone with kindness, consideration, and respect
  • No one between code and production
  • Clean Code - code expressed cleanly within the domain
  • Zero Bugs!
  • Deliver Working Software to Production Consistently
  • Anyone can take a vacation (zero silos)
  • Effective interdepartmental ownership
  • Continuously develop lofty goals and practices
  • Experiment Frequently with small changes

Benefits

  • High Bandwidth Learning
  • Quality and Technical Debt
  • Group Conscientiousness
  • Flow is easier in a mob vs pairing
  • No more bugs!

Law of Personal Mobility

  • If you are not contributing or learning, go to a different mob

There and Back Again: Our Rust Adoption Journey

  • Async implies IO

  • & means the value is passed into a function as a shared (immutable) reference: the function borrows it and cannot mutate it

    • async fn verify_signature(token: &Jwt)
  • State Machines - enabled by enum types having fields

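    // Each variant carries only the data that is valid in that state.
    // Email is an application-defined type; DateTime<Utc> is assumed to
    // come from the chrono crate.
    enum User {
      Pending {
        email: Email,
      },
      Active {
        email: Email,
        confirmation_timestamp: DateTime<Utc>,
      },
    }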
    
  • Future looking - new states can be added, and compiler will tell you when a state isn't covered

  • Predictable performance - Rust is fast, but more importantly, its performance isn't affected by things such as garbage collectors

  • The Rust book is a great place to start

  • Rustlings
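
A self-contained sketch tying together the borrowing and state machine points above. To keep it free of external crates, `Email` is a made-up wrapper type and the timestamp is a plain `u64` standing in for `DateTime<Utc>`; `describe` and `confirm` are hypothetical functions, not code from the talk.

```rust
/// Stand-in for an application-defined email type.
#[derive(Debug, Clone, PartialEq)]
struct Email(String);

/// The user state machine: each variant holds only the fields valid for it.
#[derive(Debug, PartialEq)]
enum User {
    Pending { email: Email },
    // u64 seconds since the Unix epoch stands in for DateTime<Utc>.
    Active { email: Email, confirmation_timestamp: u64 },
}

/// Takes a shared reference: the caller keeps ownership and this
/// function cannot mutate the user.
fn describe(user: &User) -> String {
    // The match must cover every variant, or the code will not compile.
    match user {
        User::Pending { email } => format!("{} is awaiting confirmation", email.0),
        User::Active { email, confirmation_timestamp } => {
            format!("{} confirmed at {}", email.0, confirmation_timestamp)
        }
    }
}

/// Consumes a user and returns the next state. Already active users
/// are returned unchanged.
fn confirm(user: User, now: u64) -> User {
    match user {
        User::Pending { email } => User::Active { email, confirmation_timestamp: now },
        active @ User::Active { .. } => active,
    }
}
```

If a new variant such as `Suspended` were added to `User`, both `match` expressions would stop compiling until they handled it, which is the compiler check the notes describe.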

Note posted on Friday, April 30, 2021 7:00 PM CDT - link