Book Review – The Great Mughals & their India

As someone who lived through learning (and forgetting) Mughal history via droll, biased textbooks in middle school while growing up in India, I found Dirk Collier's The Great Mughals and their India a captivating look at the lives of the kings of the Mughal dynasty, which rose from the ashes of the Delhi Sultanate and then disintegrated spectacularly through a thousand cuts gradually inflicted by its own regional enemies and the British.

An added benefit of visiting Delhi to see family has always been the getaways to tourist attractions: the captivating tombs, forts and monuments that remain as remnants of the Mughals. This book had been on my wish-list for a while, and it's been a wonderful ride through Mughal influence on Indian architecture, politics, philosophy, culture and outlook, albeit through the eyes of a Belgian.

The book primarily covers the chaotic, brilliant and sometimes pathetic conquests and defeats of the Mughal rulers in the order of their appearance, starting with the descendant of Timur and Genghis Khan – Babur.

Babur, on the run from the Uzbeks, sought refuge in India, and destiny ensured that his descendants never left.

The author's unbiased commentary on the subject is a refreshing change from the usual divisive literature I have come across. In the current political climate of rising nationalism and divisive politics, the Mughal era, at least during certain reigns, seems to be the epitome of tolerance and harmony amongst people of different religions, at least as described here.

The book takes you on a galloping ride, from the highs and glory of Akbar to the lows and bigotry of Aurangzeb, and from the magnificence of Shah Jahan's imagination to the pitiful incompetence of Shah Alam.

The earlier Hindu dynasties (the Mauryas, the Guptas, etc.), including Ashoka's, predominantly covered northern India and cannot really be seen as ruling the entire subcontinent. This was largely the trend in the Mughal empire too, except under Akbar, whose rule over more than 100 million inhabitants (a fifth of the world's population) covered vast territories across India.

I found myself reflecting on the remarkable fact that the Mughal dynasty was but a blip in the annals of Indian history, which kicked off in the Indus Valley around 3300 BCE. The book does an amazing job of putting these 331 years into context while being cognizant of their impact on future generations.

  • Homo sapiens in India: Around 75,000 years ago.
  • Indus Valley (Harappan) civilization: c. 3300–1300 BCE
  • Vedic civilization: c. 1500–500 BCE
  • Spread of Buddhism and Jainism: 500–200 BCE
  • Maurya Empire: 297–250 BCE
  • Ashoka the Great: 304–232 BCE
  • Hindu revival and classical Hindu civilization: 200 BCE–CE 1100
  • Gupta Empire/golden age of Hinduism: CE 320–550
  • Late classical civilization: CE 650–1100
  • The Hindu-Islamic period (early sultanates plus trading colonies): 1100–1857
  • Mughal Empire: 
    • Babur: 1526–1530
    • Humayun: 1530–1556
    • Akbar the Great: 1556–1605
    • Jahangir: 1605–1627
    • Shah Jahan: 1627–1658
    • Aurangzeb: 1658–1707
    • The ‘Lesser Mughals’: 1707–1857
  • Their rivals and successors – Maratha Empire: 1713–1818
  • Sikh Empire: 1799–1849
  • Afghan Empire: 1747–1862
  • British East India Company: 1757–1858
  • British Raj: 1858–1947
  • Independence, partition and beyond: August 1947 to the present

The last book I read on this subject, years ago, was William Dalrymple's seminal study "The Last Mughal", which seemed to reach the heights of authoritative scholarship on the era. However, Dirk Collier's easy style of writing and his genuine reflections on the state of affairs through every stage of the empire make this a much more endearing read for me.

Babur:

The founder of the dynasty was a stranger in a strange land, torn between a grandiose ambition to rule large swathes of territory and nostalgia for his Central Asian home. Excerpts from his memoir are filled with longing for Central Asia and Kabul. As a forced immigrant fleeing Central Asia, he had no special affection for India's climate, food or people; India was more of a consolation prize when faced with the reality of the Uzbeks occupying his beloved Samarqand.

Humayun:

Humayun, who was born in Kabul, led a life full of contradictions: years of incompetence losing his inheritance, wandering in exile with warriors of questionable quality, and then regaining his throne with Persian help. It was a life spent in harems and opium addiction. Strangely, he was also a voracious reader, a builder of contraptions, a patron of scholars and artists, and highly knowledgeable in arcane matters like plants, herbs and metals. It was his misfortune that his reign coincided with the rise of Sher Shah Suri, whose competence in government and military matters eclipsed anything that Humayun could throw at him. His innocuous death, tripping on a library staircase, epitomizes his life. This is well described in the book with a quote from the British orientalist and historian Stanley Edward Lane-Poole (1854–1931):

‘his end was of a piece with his character. If there was a possibility of falling, Humayun was not the man to miss it. He tumbled through life, and he tumbled out of it.’

Akbar:

Akbar was the first of the Mughal emperors born in India, and his eventual empire stretched from the northern heartland down to central India. Akbar lies at the center of Mughal achievement in India (barring the Taj Mahal) due to his impact on military, cultural, political and economic development on a scale that had never been seen before.

Akbar was a micromanager with a real interest in his royal duties, and the book mentions charming stories of him wandering the streets in disguise to gauge the efficacy of his rule. His universal tolerance of different forms of worship led to the concept of 'Din-e-Ilahi', emphasizing one god without divisive religion and combining the best of Hinduism and Islam. In some ways it was a failure that did not outlast him, but it reflects his forward-thinking, rationalist views, driven by an obsession to find the truth. Interestingly, his interest in organized monetization also led to developments in the design of the royal coinage.

His cultural impact was astutely planned through alliances with non-Mughals, and his shrewd military acumen inspired his forces through his own daring on the battlefield. The book also has interesting anecdotes of his encounters with the Portuguese, who landed on Indian shores ostensibly to set up trade and to evangelize. The accolades go on, and the book offers many details and insights into this glorious period while also inspecting the motivations behind the actions that cemented his place.


Jahangir:

Known for drunken depravity, cruelty and excesses, Akbar's successor was at the opposite end of the spectrum. The book treats this reign almost as a placeholder between Akbar's and Shah Jahan's, with no notable achievements apart from the constant struggle to keep the inherited empire intact. The reign was characterized by the kindling of religious difference, orthodoxy and the wanton destruction of non-Islamic religious places. This notably set the stage for the division of the empire that would, decades later, help the British pick apart its fragments. There are contradictions here too, and some historians disagree on his achievements and contributions, which makes for an interesting read. Some great quotes:

 ‘I never saw any man keep so constant a gravity,’ affirms Sir Thomas Roe, the first English ambassador to the Mughal court.

"the only emotions apparent on his stone-cold face were extreme pride and utter contempt for others."

Shah Jahan:

References to Shah Jahan are always accompanied by the Taj Mahal, and his reign rightfully gets credit for the finest example of Mughal architecture and a symbol of India's rich history. However, he is also reported to have been another self-centered, humorless fundamentalist, though to a lesser degree than his predecessor. After the death of Mumtaz Mahal, he seems to have delved deeper into orthodoxy and bigotry. He abolished Akbar's solar Din-e-Ilahi calendar, replacing it with the conventional lunar (Hijri) calendar. From a civil and military administration point of view, much of the empire seems to have been squandered. Myths around the blinding of the builders of the Taj Mahal are debunked by the author and shredded for their lack of authenticity.

Aurangzeb:

His reign was characterized by some expansionism across the subcontinent, but the hold was precarious at best, with revolts all across the empire. He was another religious bigot, whose life was consumed warring against his brothers (the notable one being Dara Shikoh). Another walking contradiction, he also expressed genuine interest in other religions while doing nothing to unite his own empire. He is universally reviled by non-Muslims while being depicted as a pious servant of God by his apologists. His acts against non-Muslims were yet another assault on the unified fabric that his great-grandfather had worked so diligently for. In many parts of India, he is known for his role in destroying religious structures that had stood for centuries before his reign. Some great observations made by the author for this period include the rise of militant Sikhism, thanks to his hounding of Sikh religious leaders. The rise of Shivaji, the great Maratha, is well documented here, with the various stories of legend picked apart and debated. Another descriptive anecdote that does not disguise the author's contempt for this period:

“In 1666, it was proudly announced that the emperor’s invincible armies had conquered ‘Tibet’; in actual fact, it merely meant that a petty local chief in the stony wastelands of Ladakh had been bullied into building a mosque and minting coin with Aurangzeb’s name – hardly worthy of a ‘universe-conquering’ monarch.”

Post-Aurangzeb:

This period saw a succession of Mughal kings characterized by mismanagement, corruption and the squandering of their empire. Notable incidents include the ruler Farrukhsiyar's imperial firman of 1717, granting duty-free trading and territorial rights to the British East India Company, which opened the gates for what was to follow. Barely thirty years later, the Persians under Nadir Shah would sack Delhi, plundering and murdering its citizens and carting away the Koh-i-Noor diamond to Persia. This was essentially the death knell of the empire and further worsened the anarchy.

Overall, this was a great, impartial summary of the Mughal chronology and the impact it had on future generations. The author's credentials reinforce an objective view of the entire period, never unnecessarily romanticized by the majesty of certain phases or by cultural folklore. Interestingly, the Mughal empire was at its zenith when its most tolerant and just emperor was in power – a lesson to be learnt even in modern times of regionalism and divisive politics that break societies apart.

TFDV for Data Validation

Working with my teams to build out a robust feature store these days, it's becoming ever more imperative to ensure data quality during feature engineering. The models that gain efficiency from a performant feature store are only as good as the underlying data.

TensorFlow Data Validation (TFDV) is a Python package from the TensorFlow Extended (TFX) ecosystem. The package has been around for a while but has now evolved to a point of being extremely useful in machine learning pipelines, both for feature engineering and for detecting data drift. Its main functionality is to compute descriptive statistics, infer a schema, and detect data anomalies. It's well integrated with Google Cloud Platform and Apache Beam; the core API uses Apache Beam transforms to compute statistics over input data.

I end up using it in cases where I need quick checks on data to validate and identify drift scenarios before starting expensive training workflows. This post is a summary of some of my notes on the usage of the package. Code is here.

Data Load

TFDV accepts CSV files, pandas DataFrames or TFRecords as input.
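
The CSV route is shown just below; the other two entry points are sketched here (the TFRecord path is a hypothetical placeholder, not from the original dataset):

import pandas as pd
import tensorflow_data_validation as tfdv

# From a pandas DataFrame
df = pd.read_csv('Data/Musical_instruments_reviews.csv')
df_stats = tfdv.generate_statistics_from_dataframe(df)

# From TFRecords of serialized tf.train.Example protos (hypothetical path)
tfrecord_stats = tfdv.generate_statistics_from_tfrecord(data_location='Data/reviews.tfrecord')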

The CSV integration and the built-in visualization function make it relatively easy to use within Jupyter notebooks. The library takes the input data and analyzes it feature by feature to visualize it. This makes it easy to get a quick understanding of the distribution of values, and helps identify anomalies and training/test/validation skew. It is also a great way to discover bias in the data, since you can see aggregates of values skewed towards certain features.
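
For a single dataset, the notebook visualization is one call. A minimal sketch, where TRAIN is the statistics object generated in the snippet further below:

# Interactive, facets-style overview of one dataset's statistics
tfdv.visualize_statistics(TRAIN)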

As is evident, with a trivial amount of code you can spot issues immediately – missing columns, inconsistent distributions and data drift scenarios where a newer dataset has different statistics compared to earlier training data.

I used a dataset from Kaggle to quickly illustrate the concept:

import tensorflow_data_validation as tfdv

# Generate summary statistics for the training CSV
TRAIN = tfdv.generate_statistics_from_csv(data_location='Data/Musical_instruments_reviews.csv', delimiter=',')
# Infer schema
schema = tfdv.infer_schema(TRAIN)
tfdv.display_schema(schema)

This generates a data structure that stores summary statistics for each feature.


Schema Inference

The schema properties describe every feature present in the 10,261 reviews. For example:

  • Their type (STRING).
  • Uniqueness of features – for example, 1429 unique reviewer IDs.
  • The expected domains of features.
  • The min/max number of values for a feature in each example. For example, the reviewerID A2EZWZ8MBEDOLN has 36 occurrences:
top_values {
        value: "A2EZWZ8MBEDOLN"
        frequency: 36.0
      }
datasets {
  num_examples: 10261
  features {
    type: STRING
    string_stats {
      common_stats {
        num_non_missing: 10261
        min_num_values: 1
        max_num_values: 1
        avg_num_values: 1.0
        num_values_histogram {
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 1026.1
          }
          buckets {
            low_value: 1.0
            high_value: 1.0
            sample_count: 1026.1
          }

Schema inference is usually tedious but becomes a breeze with TFDV. The schema is stored as a protocol buffer:

schema = tfdv.infer_schema(TRAIN)
tfdv.display_schema(schema)

The schema also includes definitions like "Valency" and "Presence". I could not find much detail in the documentation, but I found this useful paper that describes them well.

  • Presence: The expected presence of each feature, in terms of a minimum count and fraction of examples that must contain the feature.
  • Valency: The expected valency of the feature in each example, i.e., minimum and maximum number of values.
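
Both constraints can be tightened by hand on the inferred schema. A minimal sketch, with illustrative thresholds (not recommendations):

# Require reviewerName to be present in at least 95% of examples
name = tfdv.get_feature(schema, 'reviewerName')
name.presence.min_fraction = 0.95

# Expect exactly one value per example (a valency of 1)
name.value_count.min = 1
name.value_count.max = 1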

TFDV has inferred reviewerName as STRING, with the universe of values around it termed the Domain. Note – TFDV can also encode your fields as BYTES. I'm not seeing any function call in the API to update the column type as-is, but you can easily update it by setting the field on the schema directly if you want to explicitly specify a type. The documentation explicitly advises reviewing the inferred schema and refining it per your requirements, so as to embellish the auto-inference with domain knowledge of the data. You can also update a Feature's data type to BYTES, INT, FLOAT or STRUCT.

# Convert the feature type to BYTES (type enum value 1)
tfdv.get_feature(schema, 'helpful').type = 1


Once loaded, you can generate statistics from the CSV file.
For comparison, and to simulate a dataset validation scenario, I cut Musical_instruments_reviews.csv down to 100 rows to compare with the original, and also added an extra feature called 'Internal' with the values A, B and C randomly interspersed across the rows.

Visualize Statistics

After this, you can use the 'visualize_statistics' call to visualize the two datasets side by side, based on the schema of the first dataset (TRAIN in the code). Even though this is limited to two datasets, it is a powerful way to identify issues immediately. For example, it can right off the bat flag "missing features" – such as the feature "reviewerName" having values for just over 99.6% of examples – and it splits the visualization into numerical and categorical features based on its inference of the data types.

# Load test data to compare
TEST = tfdv.generate_statistics_from_csv(data_location='Data/Musical_instruments_reviews_100.csv', delimiter=',')
# Visualize both datasets
tfdv.visualize_statistics(lhs_statistics=TRAIN, rhs_statistics=TEST, rhs_name="TEST_DATASET", lhs_name="TRAIN_DATASET")


A particularly nice option is the ability to choose a log scale for validating categorical features. The ‘Percentages’ option can show quartile percentages.

Anomalies

Anomalies can be detected using the display_anomalies call. The long and short descriptions allow easy visual inspection of the issues in the data. However, for large-scale validation this may not be enough, and you will need tooling that handles a stream of defects being presented.

# Display anomalies
anomalies = tfdv.validate_statistics(statistics=TEST, schema=schema)
tfdv.display_anomalies(anomalies)


The various kinds of anomalies that can be detected and their invocation are described here. Some especially useful ones are:

  • SCHEMA_MISSING_COLUMN
  • SCHEMA_NEW_COLUMN
  • SCHEMA_TRAINING_SERVING_SKEW
  • COMPARATOR_CONTROL_DATA_MISSING
  • COMPARATOR_TREATMENT_DATA_MISSING
  • DATASET_HIGH_NUM_EXAMPLES
  • UNKNOWN_TYPE

Schema Updates

Another useful feature here is the ability to update the schema and its values to make corrections. For example, to insert a particular value into a feature's domain:

# Insert Values
names = tfdv.get_domain(schema, 'reviewerName').value
names.insert(6, "Vish")  # insert "Vish" at index 6 of the reviewerName domain values

You can also adjust the minimum fraction of values that must come from the feature's domain, relaxing the constraint so the feature is not flagged when it falls below a certain threshold.

# Relax the minimum fraction of values that must come from the domain for feature reviewerName
name = tfdv.get_feature(schema, 'reviewerName')
name.distribution_constraints.min_domain_mass = 0.9

Environments

The ability to split data into 'Environments' helps indicate features that are not expected in certain environments. For example, we may want the 'Internal' column to be in the TEST data but not in the TRAIN data. Features in the schema can be associated with a set of environments using:

  •  default_environment
  •  in_environment
  •  not_in_environment
# Infer a second schema from the TEST statistics, which include 'Internal'
# (assumed here; the original post does not show where schema2 comes from)
schema2 = tfdv.infer_schema(TEST)

# All features are by default in every environment; add 'TESTING' as the default.
schema2.default_environment.append('TESTING')

# Specify that the 'Internal' feature is not in the TESTING environment.
tfdv.get_feature(schema2, 'Internal').not_in_environment.append('TESTING')

serving_anomalies_with_env = tfdv.validate_statistics(TEST, schema2, environment='TESTING')

Sample anomaly output:

string_domain {
    name: "Internal"
    value: "A"
    value: "B"
    value: "C"
  }
  default_environment: "TESTING"
}
anomaly_info {
  key: "Internal"
  value {
    description: "New column Internal found in data but not in the environment TESTING in the schema."
    severity: ERROR
    short_description: "Column missing in environment"
    reason {
      type: SCHEMA_NEW_COLUMN
      short_description: "Column missing in environment"
      description: "New column Internal found in data but not in the environment TESTING in the schema."
    }
    path {
      step: "Internal"
    }
  }
}
anomaly_name_format: SERIALIZED_PATH

Skews & Drifts

The ability to detect data skew and drift is invaluable. However, drift here does not indicate a divergence from the mean; it refers to the "L-infinity" norm of the difference between the summary statistics of the two datasets. We can specify a threshold which, if exceeded for the given feature, flags the drift.

Let's say we have two vectors [2,3,4] and [-4,-7,8]. The L-infinity norm is the maximum absolute value of the difference between the two vectors, so in this case it is the maximum absolute value of [6,10,-4], which is 10.
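
A quick sanity check of that arithmetic with NumPy:

import numpy as np

a = np.array([2, 3, 4])
b = np.array([-4, -7, 8])

# max(|a - b|) = max(|[6, 10, -4]|) = 10
print(np.max(np.abs(a - b)))  # 10

In TFDV's case, for a categorical feature the vectors being compared are the normalized value frequencies of that feature in the two datasets.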

#Skew comparison
tfdv.get_feature(schema,
                 'helpful').skew_comparator.infinity_norm.threshold = 0.01
skew_anomalies = tfdv.validate_statistics(statistics=TRAIN,
                                          schema=schema,
                                          serving_statistics=TEST)
skew_anomalies

Sample Output:

anomaly_info {
  key: "helpful"
  value {
    description: "The Linfty distance between training and serving is 0.187686 (up to six significant digits), above the threshold 0.01. The feature value with maximum difference is: [0, 0]"
    severity: ERROR
    short_description: "High Linfty distance between training and serving"
    reason {
      type: COMPARATOR_L_INFTY_HIGH
      short_description: "High Linfty distance between training and serving"
      description: "The Linfty distance between training and serving is 0.187686 (up to six significant digits), above the threshold 0.01. The feature value with maximum difference is: [0, 0]"
    }
    path {
      step: "helpful"
    }
  }
}
anomaly_name_format: SERIALIZED_PATH

The drift comparator is useful in cases where the same data is being loaded on a frequent basis and you need to watch for anomalies in order to re-engineer features. The validate_statistics call, combined with the drift_comparator threshold, can be used to monitor for any changes that you need to act on.

# Drift comparator
tfdv.get_feature(schema, 'helpful').drift_comparator.infinity_norm.threshold = 0.01
# Compare the latest statistics (TEST) against the previous span (TRAIN)
drift_anomalies = tfdv.validate_statistics(statistics=TEST, schema=schema, previous_statistics=TRAIN)
drift_anomalies

Sample output:

anomaly_info {
  key: "reviewerName"
  value {
    description: "The feature was present in fewer examples than expected."
    severity: ERROR
    short_description: "Column dropped"
    reason {
      type: FEATURE_TYPE_LOW_FRACTION_PRESENT
      short_description: "Column dropped"
      description: "The feature was present in fewer examples than expected."
    }
    path {
      step: "reviewerName"
    }
  }
}

You can easily save the updated schema in the format you want for further processing.
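
For example, a round trip via the text-format protobuf helpers (the output path here is arbitrary):

# Persist the curated schema as a text-format protobuf
tfdv.write_schema_text(schema, 'Data/schema.pbtxt')

# Reload it later to validate a fresh batch of data
schema = tfdv.load_schema_text('Data/schema.pbtxt')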

Overall, this has been useful to me mainly for models within the TensorFlow ecosystem, and the documentation indicates that using components like StatisticsGen with TFX makes this a breeze to use in pipelines, with out-of-the-box integration on a platform like GCP.
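
As a sketch of what that looks like in recent TFX versions (the component names are real, but this is a simplified fragment rather than a full pipeline definition):

from tfx.components import CsvExampleGen, StatisticsGen, SchemaGen, ExampleValidator

# TFDV powers these components under the hood
example_gen = CsvExampleGen(input_base='Data/')
statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])
schema_gen = SchemaGen(statistics=statistics_gen.outputs['statistics'])
example_validator = ExampleValidator(
    statistics=statistics_gen.outputs['statistics'],
    schema=schema_gen.outputs['schema'])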

Using TFDV to identify anomalies from feature drift and inference decay, and thereby avoid time-consuming preprocessing/training steps, is a no-brainer; however, defect handling is up to the developer to incorporate. It's also important to consider that one's domain knowledge of the data plays a huge role in these scenarios, so an auto-fix of all data anomalies may not really work in cases where a careful review is unavoidable.

This can also be extended to general data quality by applying it to any validation case where you are constantly receiving updated data for the features. TFDV can even be applied post-training, in any data input/output scenario, to ensure that values are as expected.


Official documentation is here.