ONNX for ML Interoperability

Having been a Keras user since I read the seminal Deep Learning with Python, I’ve been experimenting with exporting models to different frameworks and formats to be more framework-agnostic.

ONNX (Open Neural Network Exchange) is an open format for representing traditional and deep learning ML models. Its key goal is to promote interoperability between a variety of frameworks and target environments. ONNX lets you export a fully trained model into its format and target diverse environments without manual optimization and painful rewrites of the model to accommodate each environment.
It defines an extensible computation graph model along with built-in operators and standard data types, allowing a compact, cross-platform representation for serialization. A typical use case is transfer learning, where you want to reuse the weights of a model possibly built in another framework: if you build a model in Tensorflow, you get a protobuf (PB) file as output, and it would be great if there were one universal format you could convert to the PT format to load and reuse in Pytorch, or run directly on a hardware-agnostic runtime.

For high-performance inference requirements across varied frameworks this is great, with platforms like NVIDIA’s TensorRT supporting ONNX with optimizations aimed at the accelerators present on their devices, like the Tesla GPUs or the Jetson embedded devices.


The ONNX file is a protobuf-encoded tensor graph. The list of supported operators is documented here, and operations are versioned together as “opsets” i.e. operation sets. Opsets are defined for different runtimes in order to enable interoperability. The operations are a growing list of widely used linear operations, functions and other primitives used to deal with tensors.

The operations include most of the typical deep learning primitives: linear operations, convolutions and activation functions. A model is mapped to the ONNX format by executing it, often with just random input data, and tracing the execution. The operations executed are mapped to ONNX operations, so the entire model graph is mapped into the ONNX format. The ONNX model is then saved as a .onnx protobuf file, which can be read and executed by a wide and growing range of ONNX runtimes.

Note – Opsets evolve fast, and with the rapid release cycles of competing frameworks it may not always be easy to upgrade to the latest ONNX version if it breaks compatibility with other frameworks. The file format consists of the following:

  • Model: Top level construct
    • Associates version Info and Metadata with a graph
  • Graph: describes a function
    • Set of metadata fields
    • List of model parameters
    • List of computation nodes – Each node has zero or more inputs and one or more outputs.
  • Nodes: used for computation
    • Name of the node
    • Name of the operator it invokes
    • List of named inputs
    • List of named outputs
    • List of attributes

More details here.


The ONNX model can be inferenced with the ONNX runtime, which uses a variety of hardware accelerators for optimal performance. The promise of the ONNX runtime is that it abstracts the underlying hardware, enabling developers to use a single set of APIs for multiple deployment targets. Note – the ONNX runtime is a separate project and aims to perform inference for any prediction function converted to the ONNX format.

This has advantages over the dockerized pickle models that are the usual approach in a lot of production deployments, where there are runtime restrictions (i.e. can run only in .NET or the JVM), memory and storage overhead, version dependencies, and batch prediction requirements.

The ONNX runtime has been integrated into WinML and Azure ML, with MSFT as its primary backer. Some of the new enhancements include INT8 quantization, which replaces floating point numbers with 8-bit integers to reduce model size and memory footprint and to increase efficiency, benchmarked here.

The usual path to proceed:

  • Train models with frameworks
  • Convert into ONNX with ONNX converters
  • Use onnx-runtime to verify correctness and inspect the network structure using netron (https://netron.app/)
  • Use hardware-accelerated inference with the ONNX runtime (CPU/GPU/ASIC/FPGAs)


To convert Tensorflow models, the easiest way is to use the tf2onnx tool from the command line. This converts the saved model to a model representation that includes the inference graph.

Here is an end-to-end example of saving a simple Tensorflow model, converting it to ONNX, running the predictions using the ONNX model and verifying that the predictions match.


However, one thing to consider while using this format is the lack of “official support” from frameworks like Tensorflow. For example, Pytorch does provide functionality to export models to ONNX (torch.onnx), however I could not find any function to import an ONNX model to output a Pytorch model. Considering that Caffe2, which is part of PyTorch, fully supports ONNX import/export, it may not be totally unreasonable to expect an official conversion importer (there is a proposal already documented here).

The Tensorflow converters are part of the ONNX project, i.e. not an official/out-of-the-box Tensorflow implementation. The list of supported Tensorflow ops is documented here. The github repo is a treasure trove of information on the computation graph model and the operators/data types that power the format. However, as indicated earlier, depending on the complexity of the model (especially in transfer learning scenarios), you are likely to encounter conversion issues during function calls that may cause the ONNX converter to fail. In that case, you may need to modify the graph in order to fit the format. I’ve had a few issues running into StatefulPartitionedCall ops, especially in transfer learning situations with larger encoders in language models.

I have also had to convert Tensorflow to PyTorch by first converting Tensorflow to ONNX, then the ONNX models to Keras using onnx2keras, and then converting to Pytorch using MMdnn, with mixed results, a lot of debugging and many abandoned attempts. However, I think ONNX runtime for inference, rather than framework-to-framework conversions, will be a better use of ONNX.

Though well intentioned and highly sought, a universal format like ONNX may never fully come to fruition with so many divergent interests and priorities amongst the major contributors, though its need cannot be disputed.


Replay is a collaboration track and part of an evolving experiment with multi-tracked guitars revolving around cyclic patterns. More collaborations and sounds to follow.

Arjun on Bass

Sunder – Drums (Instagram and Facebook – @onlysunder)

Mini-glossary of terms in audio production

I’ve used a lot of audio engineering terms over the years and realized that a lot of them were not exactly what I was referring to or meant. While talking to experienced audio engineers, I’ve always found the below glossary useful to convey my objectives effectively. Hopefully this serves as starter boilerplate for more research, with more terms to be added on. A lot of these and more are covered in Coursera’s excellent course on the Technology of Music Production.

  • Nature of sound
  • Digital Audio Workstation (DAW)
  • Tracks, Files and Editing
  • Dynamic Effects
  • Filter and Delay Effects

Nature of sound

Amplitude: Size of the vibration of sound; larger vibrations produce louder sounds. Measured in decibels. There are multiple places in the signal flow where we measure amplitude:

  • In the air: dBSPL or decibels of sound pressure level
  • In the digital domain: dBFS or decibels full scale

Compression: Compression is one of the most commonly used types of dynamic processing. It is used to control uneven dynamics in individual tracks of a multitrack mix, and also in creative ways, like shaping the decays of notes and producing fatter sounds. Compressors provide gain reduction, which is governed by controls like the Ratio.

  • For example, a ratio of 4:1 means that audio that goes 4 dB above the threshold will be reduced so it only goes 1 dB above it.
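The ratio arithmetic above can be sketched in a few lines (a hypothetical helper for illustration, not part of any DAW API):

```python
def compress(level_db, threshold_db, ratio):
    """Output level in dB after simple downward compression."""
    if level_db <= threshold_db:
        return level_db  # below threshold: signal passes unchanged
    # above threshold: every `ratio` dB of input becomes 1 dB of output
    return threshold_db + (level_db - threshold_db) / ratio

# 4:1 ratio, threshold at -10 dB: input 4 dB above threshold exits 1 dB above it
print(compress(-6, -10, 4))   # -> -9.0
print(compress(-20, -10, 4))  # -> -20 (below threshold, untouched)
```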

Decibel: The words bel and decibel are units of measurement of sound intensity. “Bel” is a shortening of the name of inventor Alexander Graham Bell (1847–1922).

  • A bel is equivalent to ten decibels and is used to compare two levels of power in an electrical circuit.
  • The normal speaking range of the human voice is about 20–50 decibels.
  • Noise becomes painful at 120 dB. Sounds above 132 dB lead to permanent hearing damage and eardrum rupture.
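As a quick illustration of the decibel math behind these figures (power ratios use 10·log10, amplitude/pressure ratios use 20·log10):

```python
import math

def power_db(p, p_ref):
    # decibels from a power ratio
    return 10 * math.log10(p / p_ref)

def amplitude_db(a, a_ref):
    # decibels from an amplitude (e.g. sound pressure) ratio
    return 20 * math.log10(a / a_ref)

print(round(power_db(2, 1), 1))      # doubling power     -> 3.0 dB
print(round(amplitude_db(2, 1), 1))  # doubling amplitude -> 6.0 dB
print(power_db(10, 1))               # 10x power -> 10.0 dB, i.e. one bel
```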

Frequency: Speed of the vibration which determines the pitch of the sound. Measured as the number of wave cycles that occur in one second.

Propagation: Sequence of waves of pressure (sound) moving through a medium such as water, solids or air.

Timbre: Term used to indicate the distinguishing characteristics of a sound. For example, a falsetto versus a vibrato.

Transducer: A device that converts one energy type into another; in audio, usually a microphone, which converts sound pressure variations in the air into voltage variations in a wire.

Digital Audio Workstation (DAW)

Bit Rate: Product of sampling rate and sampling depth, measured in bits per second. Higher bit rates indicate higher quality. Compressed audio formats (mp3) have lower bit rates than uncompressed ones (wave).

Buffer Size: Amount of time allocated to the DAW for processing audio. Used to balance the delay between the audio input (say, a guitar plugged in) and the sound playback, and to minimize latency. It usually works best to set the buffer size to a lower amount to reduce latency for more accurate monitoring. However, this puts more load on the computer’s processor and could cause crashes or interruptions.

Sampling Rate: Rate at which samples of an analog signal are taken to be converted into digital form, expressed in samples per second (hertz). Higher sampling rates indicate better sound, as they capture more samples per second; an analogy is FPS, i.e. frames per second in video. Some of the values we come across are 8 kHz, 44.1 kHz and 48 kHz; 44.1 kHz is the most common sampling rate for audio CDs.

Sampling Depth: Measured in bits per sample, this indicates the resolution of each audio sample. An 8-bit sample depth allows 2^8 = 256 distinct amplitudes for each audio sample; the higher the sample depth, the better the quality. This is analogous to image processing, where a higher number of bits indicates higher quality.
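The arithmetic behind sample depth and bit rate can be checked directly, using CD-quality stereo audio as an example:

```python
sample_rate = 44_100    # samples per second (44.1 kHz, CD audio)
bit_depth = 16          # bits per sample
channels = 2            # stereo

levels = 2 ** bit_depth                        # distinct amplitudes per sample
bit_rate = sample_rate * bit_depth * channels  # bits per second

print(2 ** 8)      # -> 256 distinct amplitudes at 8-bit depth
print(levels)      # -> 65536 distinct amplitudes at 16-bit depth
print(bit_rate)    # -> 1411200 bits per second for CD audio
```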

Sine wave: Curve representing periodic oscillation of constant amplitude, considered the most fundamental sound. A sine wave can be easily recognized by the ear, and since sine waves consist of a single frequency, they are used to depict and test audio.

In 1822, French mathematician Joseph Fourier discovered that sinusoidal waves can be used as simple building blocks to describe and approximate any periodic waveform, including square waves. Fourier used it as an analytical tool in the study of waves and heat flow. It is frequently used in signal processing and the statistical analysis of time series.
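Fourier’s idea can be demonstrated in a few lines by summing odd sine harmonics to approximate a square wave (an illustrative sketch, not from the course material):

```python
import math

def square_approx(t, harmonics=50):
    """Fourier-series approximation of a unit square wave at time t (period 1)."""
    return (4 / math.pi) * sum(
        math.sin(2 * math.pi * (2 * k - 1) * t) / (2 * k - 1)
        for k in range(1, harmonics + 1))

# The partial sum hugs +1 on the top half of the cycle and -1 on the bottom half
print(square_approx(0.25))  # close to +1
print(square_approx(0.75))  # close to -1
```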


  • Wave: Uncompressed at the chosen bit depth and sampling rate. Takes up a lot of memory and space.
  • AIFF: Audio Interchange File Format: Uncompressed file format (originally from Apple). High level of device compatibility; used when mastering files for audio captured live digitally.
  • MP3: Compressed audio layer of the larger MPEG video file format. Smaller sizes and poorer quality than the formats above. At a 128 kbit/s setting, compression results in a file about 1/11th the size of the original data.
  • MIDI: Musical Instrument Digital Interface – commonly defined as a set of instructions telling the computer’s sound card how to create music. Small in size; controls the notes of each instrument, loudness, scale, pitch etc.

Tracks, Files and Editing

  • Cycling: Usually refers to looping a musical section repeatedly. Useful for arrangements and re-arrangements.
  • Comping: Process where you use the best parts of multiple takes and piece them together into one take. DAWs such as ProTools allow multiple takes that are stacked in a folder in a single track.
  • Destructive editing: Editing in which changes are permanently written to the audio file, though these can usually be undone via the DAW undo history in reverse order. Helps when you have less processing power and need changes applied immediately, and when you know you won’t want to revert the change. Non-destructive editing, by contrast, uses computer processing power to make changes on the fly.
  • Fades: Progressive increases (fade-in) or decreases (fade-out) of audio signals, most commonly used when a song has no obvious ending. Crossfades are transitional regions that bridge two regions so the ending of one fades into the other.


  • Controllers: Hardware or software that generates and transmits MIDI data to MIDI-enabled devices, typically to trigger sounds and control parameters of an electronic music performance.
  • Quantization: One of the more important concepts. Quantization has many meanings depending on the task, but in this context it is about making music with precise note timing. To compensate for human error in precision, quantization can help nail the right note at the mathematically perfect time. While great for MIDI note data, quantizing recorded audio tracks is more challenging but worthwhile. Most DAWs have this built in, but it is not a magic wand to blow away all your problems; quantization in my experience works best when I’ve performed a track with an acceptable level of timing.
  • Velocity: Force with which a note is played, used to make MIDI sounds more human (or more mechanical, if that’s the intent). This typically controls the volume of the note and can be used to control dynamics, filters, multiple samples and other functions.


  • Automation: Process where we program the arrangement, levels, EQ etc. to change based on a pre-determined pattern. For example, automation to increase the reverb just before the chorus or to add delays to a particular part of the mix.
  • Auxiliary sends: Type of output used in mixers while recording. Allows the producer to create an “auxiliary” mix in which each input channel on the mixer can be controlled, routing multiple input channels to a single output send. The mixer can choose how much of a signal is sent to the aux channel. In Ableton, two aux channels (titled A and B) are created by default. Aux channels are great for feeding in effects such as reverb and delay.
  • Channel strip: Type of preamp with additional signal processing units, similar to an entire channel in a mixing console (example).
  • Bus: Related to aux sends above, a bus is a point in the signal flow where multiple channels are routed into the same output. In Ableton, this is the Master channel – where all the tracks merge together before being exported.
  • Unbalanced cables: Pick up noise (from electrical, radio and power interference from nearby cables) and are best used for short distances, for example a short cable connecting analog pedals to each other. Quarter-inch TS (tip, sleeve) connectors are used for unbalanced cables.
  • Balanced cables: Have a ground wire and carry two copies of the same signal with reversed polarities down the cable. At the receiving end, the polarity of the inverted signal is flipped back so both signals are in sync. Any noise picked up along the way is identical on both conductors, so when one copy is flipped back the noise cancels out, effectively eliminating it.
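The polarity trick behind balanced cables can be illustrated numerically (toy values chosen so the arithmetic is exact):

```python
# Toy sketch of balanced-cable noise rejection
signal = [0.5, -0.25, 0.75]
noise  = [0.125, 0.0625, -0.25]   # interference hits both conductors equally

hot  = [s + n for s, n in zip(signal, noise)]   # original polarity + noise
cold = [-s + n for s, n in zip(signal, noise)]  # inverted copy + same noise

# At the receiving end the cold leg is flipped back and the legs are summed:
recovered = [(h - c) / 2 for h, c in zip(hot, cold)]
print(recovered)  # -> [0.5, -0.25, 0.75]  (the noise cancels exactly)
```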

Dynamic Effects

  • Downward compressor: Same as a compressor, reducing the level of louder material; when explicitly called out, “upward compressors” instead bring up the volume of quiet material. One of the most important effects in audio engineering: compressors reduce dynamic range by compressing the signal.
  • Expander: Expands dynamic range – louder parts become louder, quieter parts become quieter. Making it louder means amplifying the signal that passes the threshold; it is the opposite of a compressor.
  • Gate: Provides a floor level for the signal to cross to get through – if the signal is below the gate level it will be treated as silence. Used to cut out the audio when it’s quiet.
  • Limiter: Serves as a ceiling above which the signal cannot pass. It’s essentially a compressor with a very high ratio – as the ratio increases, a compressor behaves more and more like a limiter.

Filter and Delay Effects

  • Convolution reverb: Digitally simulates the reverberation of a physical or virtual space. It is based on the mathematical convolution operation and uses a pre-recorded impulse response (IR) of the space being modeled – a recording of how the real space responds to a signal. The advantage of a convolution reverb is its ability to accurately simulate reverb for natural-sounding effects; the disadvantage is that it can be computationally expensive. Most convolution plugins ship with a wide variety of impulse responses representing real spaces, so DAWs can simulate different places, say a small club versus a stadium.
  • Algorithmic reverb: Simulates impulse responses based on the settings we choose in our DAW, using delay lines, loops and filters to create the echoes that occur in a reverberant environment. All non-convolution reverbs can be considered algorithmic. Algorithmic reverbs are kind of like synthesizers, since we create the impression of a space with some mathematical representation. The tradeoff is that they may sound less natural than convolution reverbs.
  • Comb filtering: Occurs when two copies of the same signal arrive at the listener’s ears at different times due to a delay. The resulting frequency response looks like a comb when graphed.
  • Dry/wet: Dry sound has no effects or modifications of any kind – raw, unprocessed sound. Wet sounds are processed with effects added while recording or after mixing.
  • Low shelf filter: Low shelf filters cut or boost signals below a threshold frequency, usually via a “cutoff frequency”, mostly to ensure instruments don’t interfere with each other. Used a lot when EQing guitars and vocals in a mix.
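The core of a convolution reverb described above is just the discrete convolution of the dry signal with the impulse response; a toy sketch with made-up sample values:

```python
def convolve(dry, ir):
    """Discrete convolution of a dry signal with an impulse response."""
    out = [0.0] * (len(dry) + len(ir) - 1)
    for i, d in enumerate(dry):
        for j, h in enumerate(ir):
            out[i + j] += d * h  # each dry sample triggers a scaled copy of the IR
    return out

dry = [1.0, 0.0, 0.5]    # dry signal: two plucks
ir  = [1.0, 0.5, 0.25]   # impulse response of a (tiny) room: direct sound + decaying echoes
print(convolve(dry, ir)) # -> [1.0, 0.5, 0.75, 0.25, 0.125]
```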

Deep Learned Shred Solo

Music generation with recurrent neural nets has been of great interest to me, with projects like Magenta displaying amazing feats of ML-driven creativity. AI is increasingly being used to augment human creativity, and this trend may lay creativity blocks to rest in the future. As someone who is usually stuck in a musical rut, this is great for spurring creativity.

With a few covid-induced reconnects with old friends (some of whom are professional musicians) and some inspired late-night MIDI programming in Ableton, I decided to modify some scripts and tutorials that have been lying around on my computer to blend deep learning into composing music, as I research the most optimal ways to integrate deep learning into original guitar compositions.

There are plenty of excellent blogs and code samples on the web about LSTMs, including this one and this one on generating music using Keras, and plenty of boilerplate on github demonstrating LSTMs and GRUs for creating music. For this project, I was going for recording a guitar solo based on artists I like, and setting up a template for future experimentation for research purposes. A few mashed-up solos of Yngwie served as the source data, but the source could have been pretty much anything in MIDI format; it helps to know how to manipulate these files in the DAW, which in my case was Ableton. Most examples on the web generate music in isolation from piano MIDI files. However, I wanted to combine the generated music with minimal accompaniment so as to make it “real”.

With the track being trained on being in the key of F minor, I also needed to make sure I had some accompaniment in F minor, for which I recorded a canned guitar part with some useful drum programming thanks to EZDrummer.

Tracks in Ableton

Note: this was for research purposes only and a starting point for further research into composing pieces that actually make sense based on the key being fed into the model.

Music21 is invaluable for manipulating MIDI via code: it lets us manipulate starts, durations and pitch. I used Ableton to plug an instrument into the generated MIDI notes, along with programmed drums and rhythm guitars.

Step 1:

Find the MIDI file(s) you want to base your ML solo on. In this case, I’m generating a guitar solo to layer over a backing track. This could be pretty much anything, as long as it’s MIDI that can be processed by Music21.

Step 2:

Preprocessing the MIDI file(s): The original MIDI file had guitars over drums, bass and keyboards, so the goal was to extract the list of notes first and save them. The instrument.partitionByInstrument() function separates the stream into different parts according to the instrument. If we have multiple files, we can loop over them and partition each by instrument. This returns a list of notes and chords in the file.

from glob import glob
from tqdm import tqdm
from music21 import converter, instrument, note, chord

songs = glob(' /ml/vish/audio_lstm/YJM.mid') # this could be any midi file to be trained
notes = []
for file in tqdm(songs):
    midi = converter.parse(file) # convert all supported data formats to music21 objects
    parts = None
    try:
        # partition parts for each unique instrument
        parts = instrument.partitionByInstrument(midi)
    except Exception:
        print("No uniques")

    if parts:
        notes_parser = parts.parts[0].recurse()
    else:
        notes_parser = midi.flat.notes # flatten notes to get all the notes in the stream
        print("parts == None")

    for element in notes_parser:
        if isinstance(element, note.Note): # check if element is in the note class
            notes.append(str(element.pitch)) # Pitch objects collected as a Python list
        elif isinstance(element, chord.Chord):
            notes.append('.'.join(str(n) for n in element.normalOrder))
print("notes:", notes)

Step 3:

Creating the model inputs: Convert the items in the notes list to integers so they can serve as model inputs, then create arrays for the network input and output to train the model. We have 5741 notes in our input data and have defined a sequence length of 50 notes: each input sequence is 50 notes, and the output array stores the 51st note for every input sequence that we enter. Then we reshape and normalize the input vector sequence. We also one-hot encode the output integers, so that the number of columns equals the number of categories, giving a network output shape of (5691, 92). I’ve commented out some of the output so the results are easier to follow.

pitch_names = sorted(set(item for item in notes))   # ['0', '0.3.7', '0.4.7', '0.5', '1', '1.4.7', '1.5.8', '1.6', '10', '10.1.5',..]
note_to_int = dict((note, number) for number, note in enumerate(pitch_names))  #{'0': 0,'0.3.7': 1, '0.4.7': 2,'0.5': 3, '1': 4,'1.4.7': 5,..]
sequence_length = 50
len(pitch_names) # 92
range(0, len(notes) - sequence_length, 1) #range(0, 5691)
# Define input and output sequences
network_input = []
network_output = []
for i in range(0, len(notes) - sequence_length, 1):
    sequence_in = notes[i: i + sequence_length]
    sequence_out = notes[i + sequence_length]
    network_input.append([note_to_int[char] for char in sequence_in])
    network_output.append(note_to_int[sequence_out])
print("network_input shape (list):", (len(network_input), len(network_input[0]))) #network_input shape (list): (5691, 50)
print("network_output:", len(network_output)) #network_output: 5691
patterns = len(network_input)
print("patterns , sequence_length", patterns, sequence_length) #patterns , sequence_length 5691 50
network_input = np.reshape(network_input, (patterns, sequence_length, 1)) # reshape to array of (5691, 50, 1)
print("network input", network_input.shape) #network input (5691, 50, 1)
n_vocab = len(set(notes))
print('unique notes length:', n_vocab) #unique notes length: 92
network_input = network_input / float(n_vocab)
# one hot encode the output vectors with to_categorical(y, num_classes=None)
network_output = to_categorical(network_output)
network_output.shape #(5691, 92)

Step 4:

Model: We invoke Keras to build out the model architecture using LSTMs. Each input sequence is used to predict the next note. The code below uses a standard model architecture from tutorials without too many tweaks. There are plenty of tutorials online that explain the model far better than I can, such as this one: http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Training on the MIDI input can be expensive and time consuming, so I suggest setting a high epoch number with callbacks defined based on the metric to monitor. In this case I used loss, and also created checkpoints for recovery, saving the model as ‘weights.musicout.hdf5’. Also note, I trained this on community edition Databricks for convenience.

def create_model():
  from tensorflow.keras.models import Sequential
  from tensorflow.keras.layers import Activation, Dense, LSTM, Dropout, Flatten

  model = Sequential()
  model.add(LSTM(128, input_shape=network_input.shape[1:], return_sequences=True))
  model.add(Dropout(0.2))
  model.add(LSTM(128, return_sequences=True))
  model.add(Flatten())
  model.add(Dense(n_vocab)) # one output unit per unique note
  model.add(Activation('softmax')) # probabilities over the 92 note classes
  model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=["accuracy"])
  return model

from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
model = create_model()

save_early_callback = EarlyStopping(monitor='loss', min_delta=0,
                                    patience=3, verbose=1)
epochs = 5000
filepath = 'weights.musicout.hdf5'
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=0, save_best_only=True)
model.fit(network_input, network_output, epochs=epochs, batch_size=32, callbacks=[checkpoint,save_early_callback])

Step 5:

Predict: Once the model is trained, we can start generating notes based on the trained model weights by feeding the model a sequence of notes. We pick a random integer, and hence a random sequence from the input sequences, as a starting point. In my case, it involved calling the model.predict function for 1000 notes that can be converted to a MIDI file. The results might vary at this stage; for some reason I saw some degradation after 700 notes, so some tuning is required here.

start = np.random.randint(0, len(network_input)-1)  # randomly pick an integer from input sequence as starting point
print("start:", start)
int_to_note = dict((number, note) for number, note in enumerate(pitch_names))
pattern = network_input[start]
prediction_output = [] # store the generated notes
print("pattern.shape:", pattern.shape)
pattern[:10] # check shape

# generating 1000 notes

for note_index in range(1000):
    prediction_input = np.reshape(pattern, (1, len(pattern), 1))
    prediction_input = prediction_input / float(n_vocab)

    prediction = model.predict(prediction_input, verbose=0) # call the model predict function to predict a vector of probabilities
    predict_index = np.argmax(prediction)  # argmax finds the index of the array with the largest predicted value
    #print("Prediction in progress..", predict_index, prediction)
    result = int_to_note[predict_index]
    prediction_output.append(result)

    pattern = np.append(pattern, predict_index)
    # Next input to the model: drop the oldest note to keep the window at 50
    pattern = pattern[1:len(pattern)]

print('Notes generated by model...')
prediction_output[:25] # Out[30]: ['G#5', 'G#5', 'G#5', 'G5', 'G#5', 'G#5', 'G#5',...

Step 6:

Convert to Music21: Now that we have our prediction_output array with the predicted notes, it’s time to convert it back into a format Music21 can recognize, with the objective of writing it back out as a MIDI file.

offset = 0
output_notes = []

# create note and chord objects based on the values generated by the model
# convert to Note objects for music21
for pattern in prediction_output:
    if ('.' in pattern) or pattern.isdigit():  # pattern is a chord
        notes_in_chord = pattern.split('.')
        chord_notes = []
        for current_note in notes_in_chord:
            new_note = note.Note(int(current_note))
            new_note.storedInstrument = instrument.Piano()
            chord_notes.append(new_note)
        new_chord = chord.Chord(chord_notes)
        new_chord.offset = offset
        output_notes.append(new_chord)
    else:  # pattern is a single note
        new_note = note.Note(pattern)
        new_note.offset = offset
        new_note.storedInstrument = instrument.Piano()
        output_notes.append(new_note)

    # increase offset each iteration so that notes do not stack
    offset += 0.5

#Convert to midi
midi_output = music21.stream.Stream(output_notes)
print('Saving Output file as midi....')
midi_output.write('midi', fp=' /ml/vish/audio_lstm/yjmout.midi')

Step 7:

Once we have the MIDI file with the generated notes, the next step was to load the MIDI track into Ableton. From there it was the standard recording process one would follow to record a track in the DAW.

a) Compose and Record the Rhythm guitars, drums and Keyboards.

Instruments/software I used:


b) Insert the MIDI track into the DAW, then quantize and sequence accordingly. This can take significant time depending on the precision wanted. In my case, this was just a quick fun project not really destined for the charts, so a quick rough mix and master sufficed.

The track is on Soundcloud here; the solo kicks in around the 16-second mark. Note I did have to adjust the pitch to C to blend in with the rhythm track, though it was originally trained on a track in F minor.

There are other ways of dealing with more sophisticated training, like using different activation functions or normalizing inputs. GRUs are another way to get past this problem, and I plan to iterate on more complex pieces blending deep learning with my compositions. This paper gives a great primer on the difference between LSTMs and GRUs: https://www.scihive.org/paper/1412.355

Book Review – The Great Mughals & their India

As someone who lived through learning (and forgetting) Mughal history consumed via dull and biased textbooks in middle school growing up in India, Dirk Collier’s The Great Mughals and their India is a captivating look at the lives of the kings of the Mughal dynasty, which rose from the ashes of the Delhi Sultanate and disintegrated spectacularly through a thousand cuts gradually inflicted by its own regional enemies and the British.


An added benefit of visiting Delhi to see family is always the getaways to tourist attractions, with the captivating tombs, forts and monuments that remain as remnants of the Mughals. This book has been on my wish-list for a while, and it’s been a wonderful ride through Mughal influence on Indian architecture, politics, philosophy, culture and outlook, albeit through the eyes of a Belgian.

The book primarily covers the chaotic, brilliant and pathetic conquests and defeats of the Mughal rulers in the order of their appearance, starting with the descendant of Timur and Genghis Khan – Babur.

Babur, on the run from Uzbeks, sought refuge in India and destiny ensured that his descendants never left.

The author’s unbiased commentary on the subject is a refreshing change from the usual divisive literature I have come across. In the current political climate of rising nationalism and divisive politics, the Mughal era, in certain periods, seems to be the epitome of tolerant views and harmony amongst people of different religions, at least as described here.

The book takes you on a galloping ride through the highs and glory of Akbar to the lows and bigotry of Aurangzeb, from the magnificence of Shah Jahan’s imagination to the pitiful incompetence of Shah Alam. 

The earlier Hindu dynasties (Mauryas, Guptas etc.), including Ashoka's, predominantly covered northern India and cannot really be seen as ruling the entire subcontinent. This was also largely the trend in the Mughal empire, except under Akbar, whose rule over 100M inhabitants (1/5th of the world's population) covered vast territories across India.

I found myself reflecting on the remarkable fact that the Mughal dynasty was all but a blip in the annals of Indian history, which kicked off in the Indus Valley around 3300 BCE. The book does an amazing job of putting these 331 years into context while being cognizant of their impact on future generations.

  • Homo sapiens in India: Around 75,000 years ago.
  • Indus Valley (Harappa) Civilization: c. 3300–1300 BCE
  • Vedic civilization: c. 1500–500 BCE
  • Spread of Buddhism and Jainism: 500–200 BCE
  • Maurya Empire: 297–250 BCE
  • Ashoka the Great: 304–232 BCE
  • Hindu revival and classical Hindu civilization: 200 BCE–CE 1100
  • Gupta Empire/golden age of Hinduism: CE 320–550
  • Late classical civilization: CE 650–1100
  • The Hindu-Islamic period (early sultanates plus trading colonies): 1100–1857
  • Mughal Empire: 
    • Babur: 1526–1530
    • Humayun: 1530–1556
    • Akbar the Great: 1556–1605
    • Jahangir: 1605–1627
    • Shah Jahan: 1627–1658
    • Aurangzeb: 1658–1707
    • The ‘Lesser Mughals’: 1707–1857
  • Their rivals and successors – the Maratha Empire: 1713–1818
  • Sikh Empire: 1799–1849
  • Afghan Empire: 1747–1862
  • British East India Company: 1757–1858
  • British Raj: 1858–1947
  • Independence, partition and beyond: August 1947 to the present

The last book I read on this subject, years ago, was William Dalrymple's seminal study "The Last Mughal", which seemed to reach the heights of authoritative study on this subject. However, Dirk Collier's easy style of writing and his genuine reflections on the state of affairs through every stage of the empire make this a much more endearing read for me.

Babur :

The founder of the dynasty was a stranger in a strange land, torn between grandiose ambition to rule large swathes of territory and nostalgia for his Central Asian home. Excerpts from his memoir are filled with longing for Central Asia and Kabul. Considering he was a forced immigrant fleeing Central Asia, he had no special affection for the climate, food or people. India was more of a consolation prize when faced with the reality of Uzbeks occupying his beloved Samarqand.


Humayun :

Humayun, who was born in Kabul, had a life full of contradictions: he spent years of incompetence losing his inheritance, wandering about in exile with warriors of questionable quality, and then regaining his throne with Persian help. A life spent in harems and opium addiction. Strangely, he was also a voracious reader, a builder of contraptions, a patron of scholars and artists, and highly knowledgeable in arcane matters like plants, herbs and metals. It was his misfortune that his reign coincided with the rise of Sher Shah Suri, whose competence in government and military matters eclipsed anything Humayun could throw at him. His innocuous death, tripping on a library staircase, epitomizes his life. This is well described in the book with a quote from the British orientalist and historian Stanley Edward Lane-Poole (1854–1931):

‘his end was of a piece with his character. If there was a possibility of falling, Humayun was not the man to miss it. He tumbled through life, and he tumbled out of it.’


Akbar :

The first of the Mughal emperors born in India, Akbar built an empire that eventually stretched from the heartland of northern India down to central India. Akbar lies at the center of Mughal achievement in India (barring the Taj Mahal) due to his impact on military, cultural, political and economic development on a scale that had never been seen before.

A micro-manager with a real interest in his royal duties; the book mentions charming stories of him wandering in disguise through the streets to gauge the efficacy of his rule. His universal tolerance of different forms of worship led to the concept of 'Din-e-Ilahi', emphasizing one god without divisive religion and combining the best of Hinduism and Islam. In some ways this was a failure that did not outlast him, but it is reflective of his forward-thinking and rationalist views, driven by an obsession to find the truth. Interestingly, his interest in organized monetization led to developments in the design of the royal coin.

His cultural impact was astutely planned by forming alliances with non-Mughals, and his shrewd military acumen inspired his forces with his own daring on the battlefield. The book also has interesting anecdotes of his encounters with the Portuguese, who landed on Indian shores ostensibly to set up trade and to evangelize. The accolades go on, and the book offers many details and insights into this glorious period while also inspecting the motivations behind the actions that cemented his place.


Jahangir :

Known for drunken depravity, cruelty and excesses, Akbar's successor was at the opposite end of the spectrum. The book treats this reign almost as a placeholder between Akbar's and Shah Jahan's, with no notable achievements apart from the constant struggle to keep the inherited empire intact. The reign was characterized by the kindling of religious difference, orthodoxy and the wanton destruction of non-Islamic religious places. This notably set the stage for the absolute division of the empire that would, decades later, help the British pick apart the fragmented remains. There are also some contradictions, and some historians disagree on his achievements and contributions, which would be an interesting read in itself. Some great quotes here:

 ‘I never saw any man keep so constant a gravity,’ affirms Sir Thomas Roe, the first English ambassador to the Mughal court.

“the only emotions apparent on his stone-cold face were extreme pride and utter contempt for others. “


Shah Jahan :

References to Shah Jahan are always accompanied by the Taj Mahal, and his reign rightfully gets credit for the finest example of Mughal architecture and a symbol of India's rich history. However, he is also reported to have been another self-centered, humorless fundamentalist, though to a lesser degree than his predecessor. After the death of Mumtaz Mahal, he seems to have delved deeper into orthodoxy and bigotry. He abolished Akbar's solar Din-e-Ilahi calendar, replacing it with the conventional lunar (Hijri) calendar. From a civil and military administration point of view, most of the empire seems to have been squandered. Myths around the blinding of the builders of the Taj Mahal are debunked by the author and shredded for their lack of authenticity.


Aurangzeb :

His reign was characterized by some expansionism through the subcontinent, but the hold was precarious at best, with revolts all across the empire. Another religious bigot, his life was consumed warring against his brothers (the notable one being Dara Shikoh). Another walking contradiction, he expressed genuine interest in other religions while doing nothing to unite his own empire. He is universally reviled by his detractors while being depicted as a pious servant of god by his apologists. His acts against non-Muslims were yet another assault on the unified fabric his great-grandfather had worked so diligently for. In many parts of India, he is known for his role in destroying religious structures that had stood for centuries before his reign. Some great observations made by the author for this period include the rise of militant Sikhism, thanks to his hounding of Sikh religious leaders. The rise of Shivaji, the great Maratha, is well documented here, with the various stories of legend picked apart and debated. Another descriptive anecdote that does not disguise the author's contempt for this period:

“In 1666, it was proudly announced that the emperor’s invincible armies had conquered ‘Tibet’; in actual fact, it merely meant that a petty local chief in the stony wastelands of Ladakh had been bullied into building a mosque and minting coin with Aurangzeb’s name – hardly worthy of a ‘universe-conquering’ monarch.”

Post-Aurangzeb :

This period saw a succession of Mughal kings characterized by mismanagement, corruption and the squandering of their empire. Notable incidents include the ruler Farrukhsiyar's imperial firman of 1717, granting duty-free trading and territorial rights to the British East India Company, which opened the gates for what was to follow. Barely thirty years later, the Persians under Nadir Shah would sack Delhi, plundering and murdering its citizens and carting away the Kohinoor diamond to Persia. This was essentially the death knell of the empire and further worsened the anarchy.

Overall, this was a great, impartial summary of the Mughal chronology and the impact it had on future generations. The author's credentials reinforce an objective view of the entire period without it being unnecessarily romanticized by the majesty of certain phases and cultural folklore. Interestingly, the Mughal empire was at its zenith when its most tolerant and just rulers were in power, which is a lesson to be learnt even in modern times of regionalism and divisive politics that break apart societies.

TFDV for Data validation

Working with my teams to build out a robust feature store these days, it's becoming ever more imperative to ensure feature engineering data quality. The models that gain efficiency from a performant feature store are only as good as the underlying data.

Tensorflow Data Validation (TFDV) is a Python package from the TF Extended (TFX) ecosystem. The package has been around for a while but has now evolved to the point of being extremely useful for machine learning pipelines, as part of feature engineering and for detecting data drift scenarios. Its main functionality is to compute descriptive statistics, infer a schema, and detect data anomalies. It's well integrated with the Google Cloud Platform and Apache Beam; the core API uses Apache Beam transforms to compute statistics over input data.

I end up using it in cases where I need quick checks on data to validate and identify drift scenarios before starting expensive training workflows. This post is a summary of some of my notes on the usage of the package. Code is here.

Data Load

TFDV accepts CSV, DataFrames or TFRecords as input.

The CSV integration and the built-in visualization function make it relatively easy to use within Jupyter notebooks. The library takes input feature data and analyzes it feature by feature to visualize it. This makes it easy to get a quick understanding of the distribution of values, and helps identify anomalies and training/test/validation skew. It is also a great way to discover bias in the data, since you can infer aggregates of values skewed towards certain features.

As is evident, with a trivial amount of code you can spot issues immediately: missing columns, inconsistent distributions, and data drift scenarios where a newer dataset has different statistics compared to earlier training data.

I used a dataset from Kaggle to quickly illustrate the concept:

import tensorflow_data_validation as tfdv

# Generate statistics from the CSV file
TRAIN = tfdv.generate_statistics_from_csv(data_location='Data/Musical_instruments_reviews.csv', delimiter=',')
# Infer schema
schema = tfdv.infer_schema(statistics=TRAIN)

This generates a data structure that stores summary statistics for each feature.
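Conceptually, the statistics proto boils down to per-feature summaries. Here is a hedged, pure-Python sketch of the same idea; the toy rows and field names are illustrative only, not TFDV's actual implementation (which computes far richer statistics via Apache Beam):

```python
from collections import Counter

# A few toy review rows standing in for the CSV data.
rows = [
    {"reviewerID": "A2EZWZ8MBEDOLN", "overall": 5.0},
    {"reviewerID": "A2EZWZ8MBEDOLN", "overall": 4.0},
    {"reviewerID": "A14VAT5EAX3D9S", "overall": None},
]

def summarize(rows, feature):
    """Compute TFDV-style per-feature summary statistics."""
    values = [r.get(feature) for r in rows]
    present = [v for v in values if v is not None]
    counts = Counter(present)
    top_value, top_freq = counts.most_common(1)[0]
    return {
        "num_examples": len(values),       # total examples seen
        "num_non_missing": len(present),   # examples where the feature exists
        "num_unique": len(counts),         # distinct values
        "top_value": top_value,            # most frequent value
        "top_frequency": top_freq,
    }

print(summarize(rows, "reviewerID"))
```

TFDV stores these summaries (plus histograms, means and more) per feature inside the statistics protocol buffer shown below.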

TFDV Schema

Schema Inference

The schema properties describe every feature present in the 10261 reviews. Examples:

  • Their type (STRING)
  • The uniqueness of features – for example, 1429 unique reviewer IDs
  • The expected domains of features
  • The min/max number of values for a feature in each example – for example, A2EZWZ8MBEDOLN is a reviewerID with 36 occurrences:
datasets {
  num_examples: 10261
  features {
    type: STRING
    string_stats {
      common_stats {
        num_non_missing: 10261
        min_num_values: 1
        max_num_values: 1
        avg_num_values: 1.0
        num_values_histogram {
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 1026.1
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 1026.1
          }
          ...
        }
      }
      top_values {
        value: "A2EZWZ8MBEDOLN"
        frequency: 36.0
      }
      ...

Schema inference is usually tedious, but becomes a breeze with TFDV. The schema is stored as a protocol buffer:

schema = tfdv.infer_schema(train)

The schema also generates definitions like "Valency" and "Presence". I could not find much detail in the documentation, but I found this useful paper that describes them well.

  • Presence: The expected presence of each feature, in terms of a minimum count and fraction of examples that must contain the feature.
  • Valency: The expected valency of the feature in each example, i.e., minimum and maximum number of values.

TFDV has inferred reviewerName as a STRING, with the universe of values around it termed the Domain. Note – TFDV can also encode your fields as BYTES. I'm not seeing any function call in the API to update the column type as such, but you could easily update it externally if you want to explicitly specify a string. The documentation explicitly advises reviewing the inferred schema and refining it per the requirements, so as to embellish this auto-inference with domain knowledge of the data. You can also update a feature's data type to BYTES, INT, FLOAT or STRUCT.

# Convert to BYTES
tfdv.get_feature(schema, 'helpful').type = 1

Once loaded, you can generate the statistics from the CSV file.
For comparison, and to simulate a dataset validation scenario, I cut down Musical_instruments_reviews.csv to 100 rows to compare with the original, and also added an extra feature called 'Internal' with the values A, B, C randomly interspersed across the rows.

Visualize Statistics

After this, you can use the visualize_statistics call to visualize the two datasets based on the schema of the first dataset (TRAIN in the code). Even though this is limited to two datasets, it is a powerful way to identify issues immediately. For example, it can right off the bat identify "missing features", such as the feature "reviewerName" with values present in over 99.6% of examples, and it splits the visualization into numerical and categorical features based on its inference of the data types.

# Load test data to compare
TEST = tfdv.generate_statistics_from_csv(data_location='Data/Musical_instruments_reviews_100.csv', delimiter=',')
# Visualize both datasets
tfdv.visualize_statistics(lhs_statistics=TRAIN, rhs_statistics=TEST, rhs_name="TEST_DATASET",lhs_name="TRAIN_DATASET")

A particularly nice option is the ability to choose a log scale for validating categorical features. The ‘Percentages’ option can show quartile percentages.


Anomalies can be detected using the display_anomalies call. The long and short descriptions allow easy visual inspection of the issues in the data. However, for large-scale validation this may not be enough, and you will need tooling that handles a stream of defects being presented.

# Validate and display anomalies
anomalies = tfdv.validate_statistics(statistics=TEST, schema=schema)
tfdv.display_anomalies(anomalies)

The various kinds of anomalies that can be detected, and their invocations, are described here.


Schema Updates

Another useful feature is the ability to update the schema and values to make corrections. For example, to insert a particular value:

# Insert values
names = tfdv.get_domain(schema, 'reviewerName').value
names.insert(6, "Vish")  # inserts "Vish" at index 6 of the reviewerName domain

You can also adjust the minimum fraction of values that must come from the domain, and relax the constraint if it falls below a certain threshold.

# Relax the minimum fraction of values that must come from the domain for feature reviewerName
name = tfdv.get_feature(schema, 'reviewerName')
name.distribution_constraints.min_domain_mass = 0.9


The ability to split data into 'Environments' helps indicate features that are not necessary in certain environments – for example, if we want the 'Internal' column to be in the TEST data but not the TRAIN data. Features in the schema can be associated with a set of environments using:

  •  default_environment
  •  in_environment
  •  not_in_environment
# All features are by default in both TRAINING and SERVING environments.
# schema2 here is a copy of the inferred schema with environments added.

# Specify that the 'Internal' feature is not in the TESTING environment.
tfdv.get_feature(schema2, 'Internal').not_in_environment.append('TESTING')

tfdv.validate_statistics(TEST, schema2, environment='TESTING')

Sample anomaly output:

string_domain {
  name: "Internal"
  value: "A"
  value: "B"
  value: "C"
}
default_environment: "TESTING"
...
anomaly_info {
  key: "Internal"
  value {
    description: "New column Internal found in data but not in the environment TESTING in the schema."
    severity: ERROR
    short_description: "Column missing in environment"
    reason {
      short_description: "Column missing in environment"
      description: "New column Internal found in data but not in the environment TESTING in the schema."
    }
    path {
      step: "Internal"
    }
  }
}
anomaly_name_format: SERIALIZED_PATH

Skews & Drifts

The ability to detect data skews and drifts is invaluable. However, "drift" here does not indicate a divergence from the mean, but refers to the L-infinity norm of the difference between the summary statistics of the two datasets. We can specify a threshold which, if exceeded for the given feature, flags the drift.

Let's say we have two vectors [2,3,4] and [-4,-7,8]. The L-infinity norm is the maximum absolute value of the difference between the two vectors, so in this case the maximum absolute value of [6,10,-4], which is 10.
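Checking that arithmetic with a quick numpy snippet:

```python
import numpy as np

a = np.array([2, 3, 4])
b = np.array([-4, -7, 8])

# Difference between the two vectors: [6, 10, -4]
diff = a - b
# L-infinity norm: the largest absolute component of the difference
linf = np.max(np.abs(diff))
print(linf)  # 10
```

TFDV applies the same norm to the summary statistics of a feature across the two datasets, flagging the feature when the value exceeds the configured threshold.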

# Skew comparison
tfdv.get_feature(schema, 'helpful').skew_comparator.infinity_norm.threshold = 0.01
skew_anomalies = tfdv.validate_statistics(statistics=TRAIN, schema=schema, serving_statistics=TEST)

Sample Output:

anomaly_info {
  key: "helpful"
  value {
    description: "The Linfty distance between training and serving is 0.187686 (up to six significant digits), above the threshold 0.01. The feature value with maximum difference is: [0, 0]"
    severity: ERROR
    short_description: "High Linfty distance between training and serving"
    reason {
      short_description: "High Linfty distance between training and serving"
      description: "The Linfty distance between training and serving is 0.187686 (up to six significant digits), above the threshold 0.01. The feature value with maximum difference is: [0, 0]"
    }
    path {
      step: "helpful"
    }
  }
}
anomaly_name_format: SERIALIZED_PATH

The drift comparator is useful in cases where you have the same data being loaded on a frequent basis and need to watch for anomalies to re-engineer features. The validate_statistics call combined with the drift_comparator threshold can be used to monitor for any changes that you need to act on.

# Drift comparator
tfdv.get_feature(schema, 'helpful').drift_comparator.infinity_norm.threshold = 0.01
drift_anomalies = tfdv.validate_statistics(statistics=TEST, schema=schema, previous_statistics=TRAIN)

Sample output:

anomaly_info {
  key: "reviewerName"
  value {
    description: "The feature was present in fewer examples than expected."
    severity: ERROR
    short_description: "Column dropped"
    reason {
      short_description: "Column dropped"
      description: "The feature was present in fewer examples than expected."
    }
    path {
      step: "reviewerName"
    }
  }
}

You can easily save the updated schema in the format you want for further processing.

Overall, this has been useful to me mainly for models within the TensorFlow ecosystem, and the documentation indicates that using components like StatisticsGen with TFX makes this a breeze to use in pipelines, with out-of-the-box integration on a platform like GCP.

The use case of avoiding time-consuming preprocessing/training steps by using TFDV to identify anomalies for feature drift and inference decay is a no-brainer; however, defect handling is up to the developer to incorporate. It's also important to consider that one's domain knowledge of the data plays a huge role in these scenarios, so an auto-fix of all data anomalies may not work in cases where a careful review is unavoidable.

This can also be extended to general data quality, by applying it to any validation case where you are constantly receiving updated data for the features. TFDV could even be applied post-training, for any data input/output scenario, to ensure that values are as expected.

Official documentation is here.

Autoencoders for Data Anomalies

With more and more emphasis on data anomaly detection and the proliferation of build/buy options, I've been exploring autoencoders for a few projects. In a nutshell, autoencoders are a type of neural network that take an input (an image, data), minimize it down to core features, and then reverse the process to recreate the input. A key aspect is that the encoding is actually done in an unsupervised manner, hence the 'auto'.

For example: dismantling a picture of an automobile, taking out every part and representing (encoding) the chassis, wheels and so on as representative components, and then reassembling them (decoding) from the encoding, minimizing some amount of expected reconstruction error.

Autoencoders use an encoder that learns a concise representation of the input data, while the decoder reconstructs the input from that compressed representation. A lot of the literature online calls this compressed vector the "latent space representation".

The seminal paper on the subject, which shows the benefits of autoencoders, has been dissected many times. It demonstrates the use of Restricted Boltzmann Machines (a 2-layer autoencoder consisting of a visible and a hidden layer) that learn the difference between the hidden and visible layers using a metric called K-L divergence, and it provides greater dimensionality reduction than Principal Component Analysis. Thankfully, the implementation is much more approachable than some of the background math used to prove the model!

These are feedforward, non-recurrent neural networks with an input layer, an output layer and one or more hidden layers, the count of output nodes matching the input nodes, minimizing "noise" instead of predicting a target variable as we do in supervised learning. Hence, they don't require labels, which qualifies them as unsupervised.

In a market rife with products offering "data quality" solutions, using autoencoders to detect anomalies has the potential to be a low-cost, easy-to-use solution built in-house to add to existing options.

My focus has been more on exploring this for analyzing data anomalies in structured data. In terms of cost/benefit, one could argue this might be overkill: using a neural network instead of more rule-based checks on the data, which are entirely valid and extensively used in large enterprises in place of neural net deployments. However, the benefit of squashing the input data into a smaller representative vector helps in cases where we deliberately need dimensionality reduction and to recognize outliers at scale. There is a ton of material on the web on image processing with autoencoders, for use cases such as image compression, image denoising and medical imaging. For example, fascinating results by converting THIS to THIS make colorizing an engrossing endeavor. There are also tons of applications in the Natural Language Processing field for understanding text, word embeddings and learning the semantic meaning of words.

Autoencoders – unlike GANs – can't generate newer datapoints, since their core goal is to determine an identity function using compression. Also, if the goal is just compression, they are poor general-purpose image compressors.

There are a few types of autoencoder, well described here:

  • Denoising autoencoder
  • Sparse Autoencoder
  • Deep Autoencoder
  • Contractive Autoencoder
  • Undercomplete Autoencoder
  • Convolutional Autoencoder
  • Variational Autoencoder

Most of the examples I found online dealt with images, so for exploration I used Faker to generate a million records to simulate a data scenario of regular versus non-regular coffee consumers. The irregulars were determined by a simple rule, say those who spent less than a specified threshold.

The objective was to have the autoencoder learn from the fabricated examples what the values for the "regular" customers look like, test against a holdout dataset from the "regular" group, and then use the model to identify anomalies post-reconstruction to flag cases of irregularity. Essentially: have the autoencoder achieve reasonable compression on the data, then identify anomalous inputs by reading out data with irregular values that do not match the learned representation.
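That flow can be sketched with a tiny linear autoencoder in plain numpy. To be clear, this is a minimal stand-in, not the gist from the post: the feature distributions, network size and training settings below are my own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "regular" customers: spend and visit count are correlated.
spend = rng.normal(50, 5, size=1000)
visits = 0.2 * spend + rng.normal(0, 0.5, size=1000)
X = np.column_stack([spend, visits])

# Standardize using regular-customer statistics only.
mu, sigma = X.mean(axis=0), X.std(axis=0)
Xn = (X - mu) / sigma

# "Irregular" customers break the spend/visits correlation.
spend_i = rng.normal(50, 5, size=200)
visits_i = rng.normal(2, 0.5, size=200)  # visits unrelated to spend
Xi = (np.column_stack([spend_i, visits_i]) - mu) / sigma

# Minimal linear autoencoder: 2 -> 1 -> 2, trained with full-batch
# gradient descent on the mean squared reconstruction error.
W1 = rng.normal(0, 0.1, size=(2, 1))  # encoder weights
W2 = rng.normal(0, 0.1, size=(1, 2))  # decoder weights
lr = 0.05
for _ in range(2000):
    Z = Xn @ W1              # encode to the 1-d latent space
    Xhat = Z @ W2            # decode back to 2 features
    err = Xhat - Xn
    gW2 = Z.T @ err / len(Xn)
    gW1 = Xn.T @ (err @ W2.T) / len(Xn)
    W1 -= lr * gW1
    W2 -= lr * gW2

def recon_error(A):
    """Per-example mean squared reconstruction error."""
    return np.mean((A @ W1 @ W2 - A) ** 2, axis=1)

print(recon_error(Xn).mean())  # regular holdout: low
print(recon_error(Xi).mean())  # irregular customers: markedly higher
```

Thresholding the reconstruction error then separates the two groups, which is the same mechanism behind the validation scores reported below.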

Below is a simple gist I created for a walkthrough of the process for a possible implementation with comments inline that should be self-explanatory.

  • Customer Test Score : 0.014549643226719535
  • Customer Validation Score : 0.014477944187138688
  • Irregular Customer Validation Score : 3.541257450764963

The scores reflect the anomaly for a synthetic dataset of a million records, and I was able to use Spark to scale this to well over 10 million records. Essentially, as you can tell, the Irregular Customer validation score (≈3.54) is orders of magnitude higher than the Customer validation score (≈0.014) over the regular data. The next step is to try some of these approaches against more "production"-type data at scale and implement some alerting against this data to make it more actionable.

There are tons of considerations that make a quality data-anomaly solution work for particular use cases, not limited to: statistical analysis needs, storage considerations, UI/UX for test case development, the right orchestration tools, database/data lake interoperability, scaling and development costs, and security audit requirements. Hence, the detection methodology is just one piece of a much larger puzzle.

Some interesting reads/videos:

Laplace and the law of small data

One of my recent favorite reads is Algorithms to Live By: The Computer Science of Human Decisions by Brian Christian and Tom Griffiths. In the age of "big" data, encountering uncertainty and little data is also the norm in our daily lives, and Laplace offers us a rule of thumb to make an optimized decision even with one observation.

Pierre-Simon, marquis de Laplace is a ubiquitous character in the annals of science history of the 1700s and of my undergraduate mathematics years. The term "inverse Laplace transform" would be met with an uncontrollable shudder during finals, especially if you had an allergy to solving differential equations using integral transforms.

The Marquis de Laplace

However, a few decades later, with the prevalence of matrix operations and linear algebra in deep learning and by extension my overall appreciation for advanced mathematics, I've been fascinated by some of his work. Laplace was a mathematician and physicist and appears all over the place in the field. Known as the "French Newton", he was a bona fide virtuoso with contributions like the Laplace transform, the Laplace equation and the Laplace operator, amongst other things. If that weren't enough, he also enlightened the world with theories on black holes and gravitational collapse. He was also a marquis in the French court after the Bourbon Restoration (which, as much as it wants to, does not refer to the weekend festivities in my backyard; it refers to a period in French history following Napoleon's downfall).

Laplace essentially wrote the first handbook on probability with "A Philosophical Essay on Probabilities" – a magnificent treatise that reflects the author's depth of knowledge and curiosity. A bit dense in parts, but a fascinating look at 18th-century French life through the eyes of a polymath. Unless you are a deep researcher of probabilistic history, the material is organized well enough to comb through points of interest. Part 1 is the "philosophical essay on probabilities", while part 2 is an "application of the calculus of probabilities".

Laplace's Rule of Succession, originally devised to solve the sunrise problem, is extremely useful for computing probabilities when the originating events are equally likely.

Given that the sun has come up n days in a row, what is the probability it will rise tomorrow? One can imagine he got ridiculed for the question, since we have never known or experienced a day the sun did not rise, and it would be the end of the world if it were not going to rise the next day. More specifically, the problem does not seem realistic considering it assumes every day is an independent event, i.e., a random variable for the sun rising on each day.

We have evidence that the sun has risen n times in a row, but we don't know the value of p, the probability. Treating p as unknown brings to the fore a long-standing debate in statistics between frequentists and Bayesians. From the Bayesian point of view, since p is unknown, we treat it as a random variable with a distribution. As with Bayes' theorem, we start with prior beliefs about p before we have any data. Once we collect data, we use Bayes' rule to update the prior based on the evidence.

The integral calculus leading to this rule is masterfully explained in this lecture on moment generating functions (MGFs) by Joe Blitzstein. Amazing explanations, if you can sit through the detailed derivations.

The probability of the sun rising tomorrow is (n+1)/(n+2), or as Wikipedia puts it:

"if one has observed the sun rising 10000 times previously, the probability it rises the next day is 10001/10002 ≈ 0.99990002. Expressed as a percentage, this is approximately a 99.990002% chance."

Pretty good odds it seems.

Essentially per Laplace, for any possible drawing of w winning tickets in n attempts, the expectation is the number of wins + 1, divided by the number of attempts + 2.

Said differently, if we have n experiments, each resulting in success (s of them) or failure (n − s), the probability that the next repetition will succeed is (s+1)/(n+2).

If I make 10 attempts at playing a musical piece and 8 of them succeed, per Laplace my overall chance at this endeavor is 9/12, or 75%. If I play it once and succeed, the probability is 2/3 (66.6%), which is intuitively more reliable than assuming I have a 100% chance of nailing it the next time.
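The rule is a one-liner in code. Here's a quick sketch using exact fractions (the helper name is mine):

```python
from fractions import Fraction

def rule_of_succession(successes: int, trials: int) -> Fraction:
    """Laplace's estimate of the probability that the next trial succeeds."""
    return Fraction(successes + 1, trials + 2)

print(rule_of_succession(8, 10))         # 3/4, the 75% from the example above
print(rule_of_succession(1, 1))          # 2/3
print(rule_of_succession(10000, 10000))  # 10001/10002, the sunrise problem
```

Note how the estimate never reaches 0 or 1: even a perfect record leaves room for a future failure.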

Some fascinating quotes –

“Man, made for the temperature which he enjoys, and for the element which he breathes, would not be able, according to all appearance, to live upon the other planets. But ought there not to be an infinity of organization relative to the various constitutions of the globes of this universe? If the single difference of the elements and of the climates make so much variety in terrestrial productions, how much greater the difference ought to be among those of the various planets and of their satellites! The most active imagination can form no idea of it; but their existence is very probable.”

(Pg. 181)

“the transcendent results of calculus are, like all the abstractions of the understanding, general signs whose true meaning may be ascertained only by repassing by metaphysical analysis to the elementary ideas which have led to them; this often presents great difficulties, for the human mind tries still less to transport itself into the future than to retire within itself. The comparison of infinitely small differences with finite differences is able similarly to shed great light upon the metaphysics of infinitesimal calculus.”

(Pg. 44)

“The day will come, when, by study pursued through several ages, the things now concealed will appear with evidence; and posterity will be astonished that truths so clear had escaped us”

Laplace quoting Seneca

“Probability is relative, in part to this ignorance, in part to our knowledge. We know that of three or a greater number of events a single one ought to occur; but nothing induces us to believe that one of them will occur rather than the others.”

Laplace, Concerning Probability

The Rule of Succession is essentially the world’s first simple algorithm for reasoning about problems with small data. It holds well when all possible outcomes are known before observing the data. If we apply it to problems where the prior state of knowledge is not well established, the results may not be useful, since the question being asked is then of a different nature, based on different prior information.

ToneWood amp review

I rarely break out my fleet of electric guitars anymore, so my usual go-to is my trusty old Cordoba Iberia that’s usually within reach. Having instruments lying around the house is a huge aspect of getting to practice more. The ToneWood amp caught my eye immediately as I’ve been looking for simple amplification while playing outdoors or in places with absolutely poor acoustics, where a little echo/reverb or delay can go a long way in justifying the piece I’m trying to play and even serve to feed some creativity.

3 essential knobs

It’s essentially a lightweight effects unit that can be mounted on the back of an acoustic guitar to give you amplification and a few effects. Magnets installed on the guitar hook the ToneWood onto its back; the amp picks up sound from the pickup on the acoustic and sends it back via a “vibrating driver”, so the sound becomes augmented with the effects. It essentially blends the natural guitar sound with the effects, which come out of the sound hole as a unified sonic experience. The patent explains the concept well and is pretty ingenious.

Magnetic attachment to the back

The natural sound of the unamplified guitar coming from the soundboard seamlessly blends with the effects radiating outward via the sound hole, creating a larger-than-life soundscape. All that’s required is some type of pickup installed in the guitar to provide signal to the device. You can connect the ToneWood to an external amp/PA via the 1/4″ output port, and its iDevice interface is great if you are in the Apple ecosystem. It also has the standard 1/4″ guitar input and a 1/8″ TRRS I/O for iDevices. The unit runs on 3 AA batteries for an average of 8 hours.

The installation took me about 10 minutes. It required me to slacken the strings and place an X-brace unit inside the guitar, pointing the magnets so that the ToneWood amp could attach itself to the outside back of the guitar. This took some adjusting and I’m not sure I’ve dialed in the most optimal position, but it’s close enough. Once you stick the batteries in, it’s showtime. The display screen and knobs are intuitive and the barrier to entry here is phenomenally low.

From an effects perspective, it’s really everything you need considering you are playing an acoustic guitar. All the effects come with Gain and Volume settings.

  • Hall Reverb with Decay, Pre-delay and Hi-cut settings. These settings are accessed by pressing on the knobs on the ToneWood
  • Room Reverb with Decay, Pre-delay and Hi-cut settings
  • Plate Reverb with Decay, Pre-delay and Hi-cut settings
  • Delay with Speed, Feedback and Reverb. (Note: you are not going to sound like The Edge on the Skrydstrup switching system anytime soon with this)
  • Tremolo with Rate, Depth and Delay
  • Leslie style tremolo with rate, depth and reverb
  • Auto-Wah with Sensitivity, Envelope , Reverb
  • Overdrive with Drive, Filter and Reverb
  • DSP Bypass to mute the processor
  • Notch filters (Notch Low and Notch High) to filter based on frequency

There is also the ability to save effect settings based on the tweaks you make which seems useful though I’ve not really played around with it.

I’ve largely played around with the Hall and Room effects for my purposes. You can tweak this plenty, but I’d like to make sure I’m not sounding “wall of sound Spector-mode” on my Cordoba for every track.

All in all, a great addition to enhance the acoustic and, more than anything else, the convenience factor is amazing. It’s much easier to optimize practice time now without switching guitars or hooking up effects racks to my Ibanez for a 10-minute session. If you want more control over ambience and soundscapes with minimal setup or complexity, this is it.

I recorded a quick demo with the Hall Reverb (Decay and Hi-cut set to default) and no audio edits, straight off the iPhone camera. The audio needs to be enhanced and doesn’t fully do justice to the ToneWood sound. The jam is me noodling on S&G’s cover of Anji by Davey Graham. The nylon strings don’t lend themselves to much bending at all, but the point was to capture a small moment of a few hours testing this wonderful amp.

Note – I don’t have any affiliation with ToneWood.

Spark AI Summit 2020 Notes


Spark AI Summit just concluded this week and as always, plenty of great announcements. (Note: I was one of the speakers at the event but this post is more about the announcements and areas of my personal interest in Spark. The whole art of virtual public speaking is another topic). The ML enhancements and impact is a bigger topic probably for another day as I catch up with all the relevant conference talks and try out the new features.

Firstly, I think the online format worked for this instance. This summit (and I’ve been to it every year since its inception) was way more relaxing and didn’t leave me exhausted physically and mentally with information overload. Usually held at the Moscone in San Francisco, the event is a great opportunity to network with former colleagues, friends and industry experts, which is the most enjoyable part yet taxing in many ways with limited time to manage. The virtual interface was way better than most of the online events I’ve been to before – engaging and convenient. The biggest drawback was the networking aspect; the online networking options just don’t cut it. The video conferencing fatigue probably didn’t hit since it was 3 days and the videos were available instantly online, so plenty of them are in my “Watch Later” list. (Note: the talks I refer to below are only the few I watched; there are plenty more interesting ones.)

The big announcement was the release of Spark 3.0 – hard to believe, but it’s been 10 years of evolution. I remember 2013 as the year I was adapting to the Hadoop ecosystem, writing map-reduce using Java/Pig/Hive for large-scale data pipelines, when Spark started emerging as a fledgling competitor with an interesting distributed computational engine built on Resilient Distributed Datasets (RDDs). Fast-forward to 2020 and Spark is the foundation of large-scale data implementations across the industry, and its ecosystem has evolved to include frameworks and engines like Delta and MLflow, which are also gaining a foothold as foundational to the enterprise across cloud providers. More importantly, smart investment in its DataFrames API has reduced the barrier to entry with SQL access patterns.

There were tons of new features introduced, but I’m focusing on the ones I paid attention to. There has not been a major release of Spark in years, so this is pretty significant (2.0 was in 2016).

Spark 3.0

  • Adaptive Query Execution: At the core, this helps change the number of reducers at runtime. It divides the SQL execution plan into stages earlier, instead of the usual RDD graph. New stages allow injecting optimizations before queries execute, since later stages have the full picture of the entire query plan and a global view of all shuffle dependencies. Execution plans can be auto-optimized at runtime, for example changing a SortMergeJoin to a BroadcastJoin where applicable. This is huge in large-scale implementations, where I see tons of poorly formed queries eating a lot of compute thanks to skewed joins. More specifically, settings like the number of shuffle partitions set using spark.sql.shuffle.partitions, which has defaulted to 200 since inception, can now be automatically tuned based on the reducers required for the map stage output – i.e. set high for larger data and lower for smaller data.

  • Dynamic partition pruning: Enables the ability to perform filter pushdowns versus table scans by adding a partition pruning filter. At the core if you consider a broadcast hash join between a fact and dimension table, the enhancement intercepts the result of the broadcast and plugs them as a filter on top of the dynamic filter on the fact table as opposed to the earlier approach of pushing out the broadcast hash table derived from the dimension table to every worker to determine the value of the join with the fact. This is huge to avoid scanning irrelevant data. This session explains it well.

  • Accelerator-aware scheduler: Traditionally, the bottlenecks have been small data in partitions that GPUs find hard to handle, cache-processing inefficiencies, slow I/O on disk, UDFs that need CPU processing, and more. But GPUs are massively useful for high-cardinality datasets, matrix operations, window operations and transcoding situations. Originally termed Project Hydrogen, this feature helps Spark be GPU-aware. The cluster managers now have GPU support that schedulers can request from. The schedulers can now understand GPU allocations to executors and assign GPUs appropriately to tasks. The GPU resources still need to be configured using the configs to assign the appropriate resources. We can request resources at the executor, driver and task level. This also allows resources on the nodes to be discovered, along with their assignments. This is supported in YARN, Kubernetes and Standalone modes.
  • Pandas UDF overhaul: Extensive use of Python type annotations – this becomes more and more imperative as codebases scale and newer engineers take longer to understand and maintain the code effectively. Type hints surface problems early, instead of writing hundreds of test cases or, worse, finding out about issues from irate users. Great documentation and examples here.

  • PySpark UDF: Another feature I’ve looked forward to is enabling PySpark to handle Pandas vectorized UDFs as an array. In the past, we needed to jump through god-awful hoops like writing Scala helper functions and then switching over to Python in order to read these as arrays. ML engineers will welcome this.

  • Structured Streaming UI: Great to see more focus on the UI and additional features appearing in the workspace interface, which frankly had gotten pretty stale over the last few years. The new tab shows more statistics for running and completed queries and, more importantly, will help developers debug exceptions quickly rather than poring through log files.

  • Proleptic Gregorian calendar: Spark switched to this from the previous hybrid (Julian + Gregorian) calendar. It uses Java 8 API classes from the java.time packages that are based on ISO chronology. The “proleptic” part comes from extending the Gregorian calendar backward to dates before 1582, when it was officially introduced.

    Fascinating segue here –

The Gregorian calendar (named after Pope Gregory XIII – not the guy who gave us the awesome Gregorian chants, that was Gregory I) is what we use today as part of ISO 8601:2004. The Gregorian calendar replaced the Julian calendar due to its inaccuracies in determining the actual year, plus its inability to handle the complexities of adding a leap year almost every 4 years. Catholics adopted it readily, while Protestants held out with suspicion for 200 years (!) before England and the colonies switched over, advancing the date from September 2 to September 14, 1752! Would you hand over 11 days of your life as a write-off? In any case, you can thank Gregory XIII for playing a part in this enhancement.

  • Also a whole lot of talk on better ANSI SQL compatibility that I need to look at more closely. Working with a large base of SQL users, this can only be good news.

  • A few smaller super useful enhancements:
    • “Show Views” command
    • “Explain” output formatted for better usability instead of a jungle of text
    • Better documentation for SQL
    • Enhancements on MLlib, GraphX
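Several of the Spark 3.0 features above are driven by plain configuration. A minimal sketch of turning them on when building a session (the config keys are from the Spark 3.0 documentation; the session itself and the GPU amounts are illustrative, and the GPU settings additionally need a discovery script on YARN/Standalone):

```python
from pyspark.sql import SparkSession

# Configuration sketch only -- requires a Spark 3.0+ environment to run.
spark = (
    SparkSession.builder
    .appName("spark3-feature-sketch")
    # Adaptive Query Execution: re-optimize plans at runtime
    # (e.g. coalesce shuffle partitions, switch join strategies)
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # Dynamic partition pruning: push a runtime filter derived from the
    # dimension side of a join onto the partitioned fact-table scan
    .config("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")
    # Accelerator-aware scheduling: one GPU per executor, one per task
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "1")
    .getOrCreate()
)
```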

Useful talks:

Delta Engine/Photon/Koalas

Being a big Delta proponent, this was important to me, especially as adoption grows and large-scale implementations need continuous improvements in the product to justify rising storage costs on cloud providers as the scale grows.

The Delta Engine now has an improved query optimizer and a native vectorized execution engine written in C++. This builds on the optimized reads and writes of today’s NVMe SSDs, which eclipse the SATA SSDs of previous generations and offer faster seek times. Gaining these efficiencies out of the CPU at the bare-metal level is significant, especially as data teams deal with more and more unstructured, high-velocity data. The C++ implementation helps exploit data-level and instruction-level parallelism, as explained in detail in the keynote by Reynold Xin. There were some interesting benchmarks on strings using regex to demonstrate faster processing. Looking forward to more details on how the optimization works under the hood, along with implementation guidelines.

Koalas 1.0 now implements 80% of the Pandas APIs. We can invoke accessors to use the Pyspark APIs from Koalas. Better type hinting and a ton of enhancements on DataFrames, Series and Indexes with support for Python 3.8 make this another value proposition on Spark.

A lot of focus on Lakehouse in ancillary meetings was encouraging and augurs well for data democratization on a single linear stack, versus fragmenting data across data warehouses and data lakes. The Redash acquisition will provide another option for large-scale enterprises for easy-to-use dashboarding and visualization capabilities on these curated data lakes. I hope to see more public announcements on that topic.


More announcements around the MLflow model serving aspects with Model Registry (announced in April), which lets data scientists track the model lifecycle across stages such as Staging, Production, or Archived. With MLflow in the Linux Foundation, it will be easier to evangelize it to a larger audience, with a vendor-independent non-profit managing the project.

  • Autologging: Enables automatic logging of Spark datasource information at read time, without the need for explicit log statements. mlflow.spark.autolog() will enable autologging for Spark datasources if you provide the relevant data and versions using Delta Lake, so the managed Databricks implementation definitely looks slicker with the UI. Implementation is as easy as attaching the mlflow-spark JAR and then calling mlflow.spark.autolog(). More significantly, this enables the cloning of models.
  • On Azure – the updated mlflow.azureml.deploy API for deploying MLflow models to AzureML. This now uses the up-to-date Model.package() and Model.deploy() APIs.
  • Model schemas for input and output, plus custom metadata tags for tracking: more metadata to track, which is great.
  • Model Serving: Ability to deploy models via a REST endpoint on hosted MLflow, which is great. I would have loved to see more turnkey methods for deploying to an agnostic endpoint, say a managed Kubernetes service – the current implementation targets Databricks clusters from what I noticed.

  • Lots of cool UI fixes including highlighting different parameter values when comparing runs, UI plot updates with scaling to thousands of points.

Useful talks:

Looking forward to trying out the new MLflow features which will go on public preview later in July.