DQ Framework

I came across the excellent draft of the new EU data quality framework that aims to align stakeholders. While it focuses on medicines regulation and procedures that apply across the European Medicines Regulatory Network, it definitely applies to larger data governance policies and serves as a great template for setting your own strategy.

Data quality is one of the biggest asks from consumers of data in large-scale implementations. As the scale of the data grows, so do the complexity and the plethora of use cases that depend on this data. In a large organization, there are typically multiple degrees of separation, i.e. silos, between producers of the data and its consumers. While the technology to detect data anomalies and provide feedback loops exists, the fundamental problem persists: the value of clean data at the source is minimized.

A downstream data engineering team stitching together various sources of data is caught between a rock and a hard place: ever-ravenous consumers of data/dashboards/models on one side, and producers who throw application data over the fence on the other, ensuring operational needs are met while downplaying analytics needs. This also leads to minimal understanding of the data by the downstream teams building and using these data products, since resourcing pushes the focus toward shoveling data into the data lake and hoarding it. The lack of understanding is further compounded by an ever-increasing backlog and the tightrope walk between new data products and tech debt.

This is where organizational structures play a huge part in determining the right business outcome and usage of the data. The right org structure would enforce data products conceptualized and built at the source, with outcomes and KPIs planned at the outset and room left for evolution.

Some of the better implementations I have been fortunate to be involved in treated data as a product, where data consumption evolved through deliberation of use cases, with KPIs defined at the outset in terms of the value of the data, as opposed to 'shovel TBs into the lake and figure out what to do with it later'.

Call it what you want (data mesh, etc.), but fundamentally a data-driven approach to planning data usage and governance, with the right executive sponsorship, holds a ton of value, especially in this age of massive data generation across connected systems. The advantage of being KPI-driven at the outset is that you can set the taxonomy at the beginning; the taxonomy feeds the domain-based use cases, which has implications for how the data is consumed and sets the foundation for a holistic permissioning matrix, least privilege clearly implemented, and policy-based data access. More to come on this subject, but it is great to see a well-defined manifesto from policy-makers at the highest levels.

Tenacity for retries

Most times when I have taken over new roles or added more portfolios to my current workstreams, it has usually involved deciding whether to build on or completely overhaul legacy code. I’ve truly encountered some horrific code bases that are no longer understandable, secure or scalable. Often, if there is no ownership of the code or attrition has rendered it an orphan, it sits unnoticed (like an argiope spider waiting for its prey) until it breaks (usually on a Friday afternoon), when someone has to take ownership of it and bid their weekend plans goodbye. Rarely do teams have a singular coding style with patterns and practices clearly defined, repeatable, and able to withstand changes in people or technology – if you are in such a utopian situation, consider yourself truly lucky!

A lot of times, while you have to understand and reengineer the existing codebase, it is imperative to keep the lights on while you figure out the right approach – you have to keep the car moving while changing its tires. I discovered Tenacity while untangling a bunch of gobbledygook shell and Python scripts crunched together that ran a critical data pipeline; I had to figure out quick retry logic to keep the job running while it randomly failed due to environment issues (exceptions, network timeouts, missing files, low memory and every other entropy-inducing situation on the platform). The library can handle different scenarios based on your use case.

  • Continuous retrying of your commands. This usually works if it is a trivial call where you have minimal concern about overwhelming target systems. Here, the function is retried until no exception is raised.
  • Stop after a certain number of tries. For example, if you have a shell script (call it shellex, as in the sketch after this list) that needs to execute and you anticipate delays or issues with the target URL, or if a command you run raises an exception, you can have Tenacity retry based on your requirement.
  • Exponential backoff patterns – to handle cases where the delays may be too long or too short for a fixed wait time, Tenacity has a wait_exponential algorithm that ensures that if a request can't succeed shortly after a retry, the application waits longer and longer with each retry, relieving the target system of repetitive fixed-interval retries.
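As a quick illustration (a sketch rather than the pipeline code from the post), the three patterns look roughly like this with Tenacity's decorators; the shellex.sh script, the signal file and the URL fetch are hypothetical stand-ins:

import subprocess

import requests
from tenacity import retry, stop_after_attempt, wait_exponential


# 1) Retry forever until the call stops raising – fine for trivial calls where
#    overwhelming the target system is not a concern.
@retry
def read_signal_file() -> str:
    with open("/tmp/signal_file") as f:  # hypothetical file that may not exist yet
        return f.read()


# 2) Stop after a fixed number of attempts, e.g. wrapping a shell script ("shellex").
@retry(stop=stop_after_attempt(3))
def run_shellex() -> None:
    # check=True raises CalledProcessError on a non-zero exit code, which triggers a retry
    subprocess.run(["bash", "shellex.sh"], check=True)


# 3) Exponential backoff: wait roughly 2s, 4s, 8s ... (capped at 60s) between attempts.
@retry(wait=wait_exponential(multiplier=1, min=2, max=60), stop=stop_after_attempt(5))
def fetch(url: str) -> str:
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.text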

The library handles plenty of other use cases like error handling, custom callbacks and tons more.

Overall, this has been a great find: using a functional library like Tenacity for these use cases instead of writing custom retry logic or standing up a new orchestration engine just to handle retries.

Book Review – Machine Learning with PyTorch and Scikit-Learn

by Sebastian Raschka & Vahid Mirjalili

I primarily wanted to read this book for the PyTorch section and pretty much flipped through the scikit-learn section while absorbing and practicing with the PyTorch material. So my review is largely based on chapter 13 and beyond. Apart from the official PyTorch documentation, there are not too many comprehensive sources that serve as a good reference with practical examples for PyTorch, in my view. This book aims to do that and pretty much hits the mark.

Ch 1-11 are a good refresher on scikit-learn and set up all the foundational knowledge you need for the more advanced concepts in PyTorch. I do wish the authors had chosen different examples rather than resorting to ubiquitous ones like MNIST (as in chapters 11 and 13) for explaining neural network concepts. While these are good for understanding foundational concepts, I find the really good books usually veer away from the standard examples found online and get creative.

Chapter 12 provides an excellent foundation in PyTorch and a primer for building a neural network model in PyTorch. The code examples are precise, with the data sources clearly defined, so I could follow along without any issues. I did not need a GPU/Colab to run the examples. Good to see the section on writing custom loss functions, as that is a useful practical skill to have.
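As an illustration of that skill (a sketch, not the book's example), a custom loss in PyTorch can be written as an nn.Module so it plugs into a training loop like any built-in loss:

import torch
import torch.nn as nn

# Mean squared error with a small L1 penalty on the predictions.
class MSEWithL1(nn.Module):
    def __init__(self, l1_weight: float = 0.01):
        super().__init__()
        self.l1_weight = l1_weight

    def forward(self, preds: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        mse = torch.mean((preds - targets) ** 2)
        return mse + self.l1_weight * preds.abs().mean()

loss_fn = MSEWithL1()
preds = torch.randn(8, 1, requires_grad=True)   # stand-in for model outputs
loss = loss_fn(preds, torch.randn(8, 1))
loss.backward()  # gradients flow through the custom loss like any built-in one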

Ch-14, which has us training a smile classifier to explain convolutional neural networks, is a useful example, especially for tricks like data augmentation that can be applied to other use cases.

I skipped through the chapter on RNNs since transformers are all the rage now (Ch-16) and PyTorch already has structures like LSTMs implemented in its wrapper functions. Still, there is a lot of dense and useful material explaining the core concepts behind RNNs, and some interesting text generation models that use PyTorch's Categorical class to draw random samples.

The chapter on Transformers is a must-read and will clear up a lot of foundational concepts. Another thing to mention is that the book has well-depicted color figures that make some of the dense material more understandable. The contrast between the transformer approach and RNNs, using concepts like attention mechanisms, is clearly explained. More interestingly, the book delves into building larger language models with unlabeled data, such as BERT and BART. I plan to re-read this chapter to enhance my understanding of transformers and the modern libraries, such as Hugging Face, that they power.

The chapter on GANs was laborious, with more MNIST examples, and could have had a lot more practical examples.

Ch-18 on Graph Neural Networks is a standout section in the book and provides useful code examples for building PyTorch graphs to treat as datasets, defining edges and nodes. For example, libraries like TorchDrug are mentioned, which use the PyTorch Geometric framework for drug discovery. Spectral graph convolution layers, graph pooling layers, and normalization layers for graphs are explained, and I found this chapter to be a comprehensive summary that would save one hours of searching online for the fundamental concepts. GNNs definitely have a ton of interesting applications, and links to a lot of recent papers are provided.

Ch-19 on reinforcement learning adds another dimension to the book, which is largely focused on supervised and unsupervised learning in the prior chapters. Dynamic programming, Monte Carlo, and Temporal Difference methods are clearly articulated. The standard OpenAI Gym examples are prescribed for implementing grids that specify actions and rewards. I thought this chapter was great at explaining the theoretical concepts, but the examples were all the standard Q-learning fare you would find online. I would have loved to see a more realistic example or pointers for applying it to your own use cases.

All in all, I enjoyed the PyTorch examples and clearly explained concepts in this book, and it would be a good PyTorch reference to add to your library.

Asynchronicity with asyncio and aiohttp

The usual synchronous versus asynchronous versus concurrent versus parallel question is a topic in technical interviews that usually leads to expanded conversations about the candidate's overall competency on scaling, along with interesting rabbit holes and examples. While keeping the conversation open-ended, I've noticed that when candidates incorporate techniques to speed up parallelism, enhance concurrency, or mention other ways of speeding up processing, it's a great sign.

It's also important to distinguish between CPU-bound and IO-bound tasks in such situations, since parallelism is effective for CPU-bound tasks (e.g., preprocessing data, running an ensemble of models) while concurrency works best for IO-bound tasks (web scraping, database calls). CPU-bound tasks are great for parallelization across multiple CPUs, where a task can be split into multiple subtasks, while IO-bound tasks are not CPU dependent but spend their time waiting on reads/writes to disk or the network.

Key standard libraries in Python for concurrency include:

  • asyncio – for concurrency with coroutines and event loops. This is most similar to a pub-sub model.
  • concurrent.futures – for concurrency via thread pools (it also offers process pools for parallelism)

Both are limited by the global interpreter lock (GIL) and run as a single process, multi-threaded.
Note that for parallelism, the library I've usually used is multiprocessing, which is a topic for another post.

asyncio is a great tool to execute tasks concurrently and a great way to add asynchronous calls to your program. However, the right use case matters, since it can also lead to unpredictability: tasks can start, run and complete at overlapping times as the event loop switches context between them. A task can be suspended at an await point while the next available task in the queue runs until it completes or is itself suspended. The key here is that there is no manual checking of whether a task is unblocked; asyncio resumes it when its awaited result is actually available.

This post has a couple of quick examples of asyncio using the async/await syntax for event-loop management. There are plenty of other libraries available, but asyncio is usually sufficient for most workloads, and a couple of simple examples go a long way in explaining the concept.

The key calls here are –

  • async – tells the Python interpreter to run the coroutine asynchronously with an event loop
  • await – while waiting for the result to be returned, this passes control back to the event loop, suspending the execution of the current coroutine to let the event loop run other things until the await has a result returned
  • run – schedule and run a coroutine as the entry point (Python 3.7+). In earlier versions (3.5/3.6) you can get an event loop and call run_until_complete instead
  • sleep – suspends execution of a task to switch to the other task
  • gather – Execute multiple coroutines that need to finish before resuming the current context with the list of responses from each coroutine

Example of using async for multiple tasks:
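A minimal sketch of what such an example could look like (the fetch_data coroutine and its delays are made up for illustration): asyncio.sleep stands in for slow IO, and asyncio.gather runs the tasks concurrently so the total time is roughly the longest delay rather than the sum.

import asyncio
import time

async def fetch_data(task_id: int, delay: float) -> str:
    print(f"task {task_id} started")
    await asyncio.sleep(delay)  # control passes back to the event loop here
    print(f"task {task_id} finished after {delay}s")
    return f"result-{task_id}"

async def main() -> None:
    start = time.perf_counter()
    # gather schedules all three coroutines concurrently and waits for all of them
    results = await asyncio.gather(
        fetch_data(1, 2),
        fetch_data(2, 1),
        fetch_data(3, 3),
    )
    print(results)
    print(f"elapsed: {time.perf_counter() - start:.1f}s")  # ~3s instead of ~6s

if __name__ == "__main__":
    asyncio.run(main())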

A trivial example like the one above goes a long way in explaining the core usage of the asyncio library, especially when bolting it onto long-running Python processes that are primarily slow due to IO.

aiohttp is another great resource for asynchronous HTTP requests, which by nature are a great use case for asynchronicity: while requests wait for servers to respond, other tasks can run. It works by creating a client session that can be reused across multiple individual requests and can make connections to up to 100 different servers at the same time.

Non async example

A quick example handles requests to a website (https://api.covid19api.com/total/country/{country}/status/confirmed) that returns a JSON string based on the specific request. The specific endpoint is not important here and is used only to demonstrate the async functionality.
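The synchronous version is sketched below, assuming the endpoint above (which may no longer be live) and a handful of countries chosen only for illustration; each request blocks before the next one starts:

import time

import requests

COUNTRIES = ["germany", "france", "italy", "spain", "canada"]
URL = "https://api.covid19api.com/total/country/{country}/status/confirmed"

def fetch_all_sync() -> None:
    start = time.perf_counter()
    for country in COUNTRIES:
        # each request blocks until the server responds before the next one starts
        resp = requests.get(URL.format(country=country), timeout=10)
        print(country, len(resp.json()))
    print(f"sync elapsed: {time.perf_counter() - start:.1f}s")

if __name__ == "__main__":
    fetch_all_sync()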

Async example using aiohttp, which is needed in order to asynchronously call the same endpoint.
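Again as a sketch rather than the post's exact code, the same calls (same COUNTRIES and URL as above) share one aiohttp ClientSession and are scheduled concurrently with asyncio.gather:

import asyncio
import time

import aiohttp

COUNTRIES = ["germany", "france", "italy", "spain", "canada"]
URL = "https://api.covid19api.com/total/country/{country}/status/confirmed"

async def fetch(session: aiohttp.ClientSession, country: str) -> None:
    # awaiting the response hands control back to the event loop while the server works
    async with session.get(URL.format(country=country)) as resp:
        data = await resp.json()
        print(country, len(data))

async def main() -> None:
    start = time.perf_counter()
    # one client session is reused for all requests
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(fetch(session, c) for c in COUNTRIES))
    print(f"async elapsed: {time.perf_counter() - start:.1f}s")

if __name__ == "__main__":
    asyncio.run(main())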

The example clearly shows the time difference: the async calls roughly halve the time taken for the same requests. Granted, it's a trivial example, but it shows the benefit of the non-blocking async calls and can be applied to any situation that deals with multiple requests and calls to different servers.

Key Points

  • asyncio is usually a great fit for IO-bound problems
  • Putting async before every function will not be beneficial, as blocking calls can still slow down the code, so validate the use case first
  • async/await supports a specific set of methods only, so for specific calls (say to databases), the Python wrapper library you use will need to support async/await
  • https://github.com/timofurrer/awesome-asyncio is the go-to place for higher level async APIs along with https://docs.python.org/3/library/asyncio-task.html

Genesis at MSG


I’ve enjoyed the different eras of Genesis, whether the Peter Gabriel-era prog-rock or the later, more commercial Collins-era hits. Regardless of the timeline, the Hackett/Rutherford guitar soundscapes made me love the catalog, whether it was the theatrics of Nursery Cryme, the sonic bliss of Selling England by the Pound, the massive hooks of Invisible Touch, or the 90s heyday of We Can't Dance (a personal guilty pleasure).


On the spur of the moment, a couple of high school buddies and I decided to fly to NYC to see Genesis on 'The Last Domino' tour at Madison Square Garden. Being progheads, our motivating factor was to catch Genesis on what could be their final tour. As someone who spent his teens in the 90s in a cloud of personal angst, the guitar was an escape, and I have many fond memories of spending many a gray, bleak day muddling my way through a Genesis guitar tab to manifest some poorly played yet soul-satisfying covers. With so many of my musical heroes passing over the last few years (Neil Peart, Eddie Van Halen, Allan Holdsworth... the list sadly goes on), it's imperative now to see the acts I have never had the chance to see before or want to see again.

The Last Domino


The setlist was average and largely meant to accommodate Phil Collins' current state of performance prowess. I had no expectations of them belting out "Get 'Em Out By Friday" or "Can-Utility and the Coastliners". However, with Nic Collins being stellar on drums and Phil still belting out the songs in top form, backed by the genius of Rutherford/Banks, I was glad I made the trek. It was a show to remember, not merely for nostalgia but because we could actually make it out to a legit concert in 2021 after a year-plus of pandemic-induced seclusion.



Zone of Proximal Development

The best times of productivity often happen in a state of "flow" – say, a sustained state of seamless productivity while coding, a burst of self-expression while composing music, a sublime feeling of exhilaration while working out, and more. However, as great as that state is, it has become increasingly hard for me to achieve over the years, with the dopamine rollercoaster of distractions and the overwhelming demands on my time. I came across constructivist methods while researching active learning, and Vygotsky is a name that is prominent in this subject.

ZPD (the zone of proximal development) was a theory proposed by Lev Vygotsky in the 1930s. The crux of his proposal was that giving children experiences that better supported their learning enhanced their development. The 1930s were an interesting time, with much debate about the best methods of education and development for children. Vygotsky introduced the concept of ZPD to criticize the psychometric-based testing in Russian schools. He argued that academic tests on which students displayed the same academic level were not the best way to determine intelligence. Instead, he focused on the augmented ability of children to solve problems through social interaction with more knowledgeable persons. He also inferred that while self-directed, curiosity-based approaches worked for some subjects like languages, learning in other subjects like mathematics benefited from interacting with more "knowledgeable" folks.

Essentially, ZPD is the place where one cannot progress in their learning without the interaction of a person who is highly skilled at that learning objective. In a group setting, if some individuals grasp the concept while other individuals are still in the ZPD, the peer interaction between them may create the most conducive environment for learning. While it makes sense at first glance, the nuance here is the ability to recognize and set learning objectives that are within the range of the ZPD, and to ensure the right subject matter experts are around to help complete that objective. ZPD can be applied to various situations, everything from the learning patterns of my 8-year-old to setting appropriate goals for someone on my team for their advancement.

More interestingly, reflecting upon my learning blocks over the years, I have noticed multiple instances where serendipitous encounters or conversations have unblocked what seemed like an impasse. It may not be too much of a stretch to apply this to religious philosophies like, say, Buddhism, where the role of a teacher is important for attaining one's own "inner guru".

ZPD and leading teams

One of the key areas for being a successful leader is goal setting for your teams. Analyzing the tasks that are in the ZPD for your team, with a view to their future aspirations, is invaluable.

Before goal setting, it's worth the time to analyze the tasks that are within the ZPD. Then validate whether the team has the variety of skill sets and expertise to mentor team members who are within the ZPD but need that extra expertise to help them through it. For example, if a team member is tasked with an important large-scale organizational objective, pair them with a more experienced partner to help them accomplish that task.

Delegate your own tasks to folks who are interested in being managers, and be the subject matter expert who helps them navigate the ZPD. Challenging oneself to think about the types of learning we might or might not be doing on a daily basis, and about the effect of the ZPD, can help enhance that learning process.

Contemplating my own learning as it currently stands – a daily dance around various topics driven by time, interest and compulsion – the patchwork quilt of material consumed (tech, news, etc.) could use a recipe, with the ZPD in mind, for the topics that need to be reinforced or are important to advance.

Interesting reads:

Learning Computer Science in the “Comfort Zone of Proximal Development”

Great paper on Vygotsky’s Zone of Proximal Development: Instructional Implications and Teachers’ Professional Development

Merkle Trees for Comparisons – Example

Merkle trees (named after Ralph Merkle, one of the fathers of modern cryptography) are fascinating hash-based data structures used to verify the integrity of data in peer-to-peer systems. A Merkle tree is essentially a binary tree of hashes, and systems like Dynamo use it to compare replicas and detect inconsistencies. For example, in a distributed system, if a replica node falls considerably behind its peers, techniques like vector clocks might take an unacceptably long time to resolve the difference; a hash-based comparison approach like a Merkle tree helps quickly compare two copies of a range of data on different replicas. This is also a core part of blockchains like Ethereum, which uses a non-binary variant, but the binary ones are the most common, easy to understand, and fun to implement.

Conceptually this involves:

  1. Compare the root hashes of both trees; if they are equal, the data is in sync.
  2. If they differ, recurse into the left and right children, comparing their hashes, until the differing leaves are found.

The “Merkle root” summarizes all the transactions in a single hash value.

Simple Example

For example, if TA, TB, TC, TD are transactions (they could be files, keys, etc.) and H is a hash function, you can construct a tree by hashing the transactions, then hashing the concatenated values to generate parents, until everything is reduced to a single root. In my scrawl above, this means hashing TA and TB, and TC and TD, then hashing their concatenations H(AB) and H(CD) to land at H(ABCD). Essentially, keep hashing until all the transactions meet at a single hash.

Example

Here’s an example that uses this technique to compare two files by generating their Merkle roots to validate whether they are equal or not (comments inline).

Invoke the script by calling python merkle_sample.py "<file1>.csv" "<file2>.csv" to compare the two Merkle trees. Code below:
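The original merkle_sample.py is not reproduced here; the following is a minimal sketch of the same idea, assuming each line of the CSV becomes a leaf hash and the last hash is duplicated when a level has an odd count:

import hashlib
import sys

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def merkle_root(leaves: list) -> str:
    """Reduce a list of leaf hashes to a single Merkle root."""
    if not leaves:
        return sha256(b"")
    level = list(leaves)
    while len(level) > 1:
        # duplicate the last hash if the level has an odd number of nodes
        if len(level) % 2 == 1:
            level.append(level[-1])
        # hash each concatenated pair to build the next level up
        level = [sha256((level[i] + level[i + 1]).encode())
                 for i in range(0, len(level), 2)]
    return level[0]

def file_merkle_root(path: str) -> str:
    # each line (e.g. a CSV row) becomes a leaf hash
    with open(path, "rb") as f:
        return merkle_root([sha256(line) for line in f])

if __name__ == "__main__":
    root1 = file_merkle_root(sys.argv[1])
    root2 = file_merkle_root(sys.argv[2])
    print("Merkle roots match" if root1 == root2 else "Merkle roots differ")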

Key advantage here is that each branch of the tree can be checked independently without downloading the whole dataset to compare.

This translates to fewer disk reads for synchronization, though that efficiency needs to be balanced against the recalculation of the tree when nodes leave or go down. This is fundamental to cryptocurrencies, where transactions need to be validated by nodes and there is an enormous time and space cost to validating every transaction, which Merkle trees mitigate by allowing verification in logarithmic rather than linear time. The Merkle root gets put into the block header that is hashed in the process of mining, and comparisons are made via the Merkle root rather than submitting all the transactions over the network. Ethereum uses a more complex variant, namely the Merkle Patricia tree.

The applications of this range beyond blockchains to Torrents, Git, Certificates and more.

Prisoners of Geography – Book Review

Prisoners of Geography: Ten Maps That Explain Everything About the World by Tim Marshall is an enjoyable read that traces the world's geography and its impact on today's geopolitics. I picked this up on a whim, as the description seemed to indicate an enjoyable refresher on the state of the world as it came to be in terms of geopolitics. The maps were pretty awful to study on my Kindle Paperwhite, and I had to resort to getting the paperback to navigate them better.


A lot of the world's problems today remain as gridlocked as they were at their origin despite decades of evolution and political talks (think border and territory conflicts across the globe – Israel/Palestine, the South China Sea, South America, and the list goes on). This book is not a comprehensive treatise on the evolution of those problems but a great overview. It does a good job of identifying the factors that drive national interest and conflict in these areas, and the impact on countries dealing with the limitations and opportunities their geography has bestowed upon them.

I never did realize the importance of navigable and intersecting rivers or natural harbors in shaping the destinies of countries. Here are some of my observations and notes from the book.

  • Russia – The book kicks off with the author highlighting the 100-year forward thinking of the Russians and the obsession with "warm water" ports with direct access to the ocean, unlike ports on the Arctic, such as Murmansk, that freeze for several months. This limits the Russian fleet and its aspiration to be a bigger global power. While oil and gas, and being the 2nd biggest supplier of natural gas in the world, bring their own geographical advantages and prop the country up, its aspirations remain for fast maneuvering out of areas like the Black Sea or even the Baltic Sea to counter a feared NATO strike. The author describes moves like the annexation of Crimea as moves to construct more naval ports to boost its fleets. Countries like Moldova and Georgia (and their propensity toward the West) have a huge bearing on foreign policy and military planning. Interestingly, 'bear' is a Russian word, but per the author the Russians are also wary of calling this animal by its name, fearful of conjuring up its darker side. They call it medved, "the one who likes honey."

It doesn’t matter if the ideology of those in control is czarist, Communist, or crony capitalist—the ports still freeze, and the North European Plain is still flat.

  • China – The Chinese civilization, over 4,000 years old and originating around the Yangtze river, today comprises 90% Han people united by ethnicity and politics. This sense of identity pervades all aspects of modern Chinese life and powers its ascent as a global power. The massive Chinese border touches Mongolia in the north, Russia, Vietnam and Laos in the east, and India, Pakistan, Afghanistan and Tajikistan in the west, with various levels of protracted conflicts and disagreements. For example, the India/China border is perceived by China as the Tibetan-Indian border and integral to protecting the Tibetan plateau, which could open a route for an Indian military push into the Chinese heartland, never mind the low probability of that ever happening. The book provides a brief overview of the origin of the Tibet occupation and the world's attention to it. The author says that if the population were given a free vote, the unity of the Han would crack and weaken the hold of the communist party. The need for China to extend its borders and grab land it perceives as its own also extends to the seas. The growing naval fleet and its need to assert supremacy in the South China Sea also fuels conflict with Japan and its other neighbors. Scouring the length and breadth of Africa for minerals and precious metals in return for cheap capital, and a modern form of debt slavery, is another strategy to dominate the world. This part of the book did not offer any new insight; however, a society that holds unity and economic progress as its highest priorities is definitely admirable considering the "developing" status it once had.
  • United States – The geographical position of invulnerability, fertile land, navigable river systems and the unification of the states ensure prosperity and greatness for the U.S. The author goes into the evolution of the states as they came together after the Revolutionary War, such as the Louisiana Purchase, the ceding of Florida by the Spanish, the Mexican-American War that brought in territory from Mexico, and the purchase of Alaska. Post World War 2 and the Marshall Plan, the formation of NATO assured the US of having the greatest firepower across the world. The author deems the Russian threat largely seen off and insists China is the rising power the US is concerned about (as the current geopolitical climate in 2021 validates). The domination of the sea lanes will occupy its attention, with numerous potential flashpoints. Self-sufficiency in energy will continue to underpin America's position as the preeminent economic power. Overall, this section summarized the progress of American domination well, despite hiccups over the centuries like the Great Depression. I still think the author painted a rosier picture than the current situation suggests. Post-pandemic, it remains to be seen whether these assertions still hold, given the internal struggles in American society with respect to race relations, inclusivity and a wide variety of social issues.

The California gold rush of 1848–49 helped, but the immigrants were heading west anyway; after all, there was a continental empire to build, and as it developed, more immigrants followed. The Homestead Act of 1862 awarded 160 acres of federally owned land to anyone who farmed it for five years and paid a small fee. If you were a poor man from Germany, Scandinavia, or Italy, why go to Latin America and be a serf, when you could go to the United States and be a free land-owning man?

  • Western Europe – Again, the geographical blessings in this case ensured a mostly agreeable climate for cultivating the right crops at large scale, the right minerals to power the industrial revolution, and abundant natural harbors. This also led to industrial-scale wars, and Europe remains an amalgam of linguistically and culturally disparate countries, yet an industrial power. The contrast between northern and southern Europe in terms of prosperity is attributed to industrialization, the domination of Catholicism, and the availability of coastal plains. Spain, Greece, the U.K., Germany, Poland, Denmark and Sweden and the contrasts in their economic status are discussed and attributed to geographical limitations. I was hoping the author would provide more than a passing nod to the concerns around immigration and prejudice. Prejudice against immigrants and the rise of nationalism are on the rise across the world, and it's troubling to see this rise of hate groups, holocaust deniers and all the other abhorrent tribes that debase basic human ideals of equality, peace and harmony. The demographic change, with an inverted pyramid of older people at the top and fewer people paying taxes to support them in the future, needs to be reversed, and the benefits of legal immigration need to be given greater attention rather than being buried under misdirected xenophobic fears.
  • Africa – This was an enlightening section on the lack of utility of African rivers for transportation due to waterfalls and natural obstacles. Africa developed in isolation from the Eurasian landmass, and the author asserts that the lack of idea exchange played a huge part in its underdevelopment. Sub-Saharan exposure to virulent diseases, crowded living conditions and poor health-care infrastructure has also impeded growth. The great rivers of Africa – the Niger, the Congo, the Zambezi, the Nile and others – don't connect, to the continent's detriment. The roughly 56 countries have had relatively unchanged borders over the years, along with the legacy of colonialism, which, like in most parts of the world, divided societies on the basis of ethnicity. The rise of radical Islamist groups is attributed to the sense of underdevelopment and overall malcontent. On a more positive note, every year roads and railroads fuel an infrastructure boom and greater connectivity, with rising education and healthcare.

“You could fit the United States, Greenland, India, China, Spain, France, Germany, and the UK into Africa and still have room for most of Eastern Europe. We know Africa is a massive landmass, but the maps rarely tell us how massive.”

  • Middle East – Another witness to ancient civilization, which rose from the fertile plains of Mesopotamia. The carving up of the largest continuous sand desert by British and French colonists as part of the Sykes-Picot agreement is reflected in some of the unrest and extremism today. It's interesting that prior to Sykes-Picot there was no Syria, Lebanon, Jordan, Iraq, Saudi Arabia, Kuwait, Israel, or Palestine. These are all modern entities with a short history, unified by versions of the same religion. Conflict and chaos have ruled supreme in some of these countries (Iraq, Lebanon, etc.) while prosperity from the oil fields has propelled others to the world stage (UAE). There is a lot of detail on Iran-Iraq history, Palestine, the failed promise of the Arab Spring, Turkey and others. The complexity of the demographics and religious idealism compounds an already volatile region.

Sykes-Picot is breaking; putting it back together, even in a different shape, will be a long and bloody affair.

  • India & Pakistan – A population of 1.4 billion pitted against another of 182 million, with impoverishment, volatility and mistrust at both ends. Post-pandemic, this section is dated, as the imminent Indian economic emergence described by the author is no longer a reality, at least in the near term. I had to skip over this section as there wasn't much I didn't already know.
  • Korea and Japan – The tension between the Koreas is well known to the world, and the author describes the origins of the Hermit Kingdom and the lack of strategy from the USA in dealing with the problem. The 38th parallel was yet another hasty line of division and an uninformed repetition of the line drawn in the aftermath of the Russo-Japanese war of 1904. I have fond memories of visiting Seoul years ago, and it was interesting how the concept of unification was welcomed by some of the South Koreans I had the opportunity to interact with (peering through the binoculars in the DMZ to the North Korean side was a thrilling experience and emphasized the proximity of the two sides). I am not sure if that is the general sentiment, but there is enough justification for it considering the nuclear power to the north is in the control of a dictator. The Japanese post-war stance is described in detail, and the author contends the increasing Japanese defense budget displays an intent of resolve against Chinese threats.
  • Latin America – The limitations of Latin America originate from historical inequality, the reluctance of the original settlers to move away from the coasts, and the lack of subsequent infrastructure in the interior. Geographical limitations plague Mexico, Brazil, Chile and Argentina despite their natural resources. The civil wars of the 19th century broke the region apart into independent countries with border disputes that persist, naval arms races between countries like Brazil, Argentina and Chile held back the development of all three, and drug cartels have devastated societies. The Panama Canal's newer rival, the Nicaragua Grand Canal backed by huge Chinese investment across the continent, seems questionable in terms of value to Latin America.
  • The Arctic – The effects of global warming are showing alarmingly in the Arctic, coinciding with the discovery of energy deposits. The complex land ownership includes parts of Canada, Finland, Greenland, Iceland, Norway, Russia, Sweden, and the United States (Alaska). The melting ice has far-flung ramifications globally in terms of projected flooding in countries as far away as the Maldives. It has also opened up a new transportation corridor that hugs the Siberian coastline and more access to energy reserves, much to the interest of multiple countries now jostling for superiority, including Russia, which is building an Arctic army.

The word arctic comes from the Greek arktikos, which means “near the bear,” and is a reference to the Ursa Major constellation, whose last two stars point toward the North Star.

There is a lot of rich detail in the book on the various nuances of these geographies, and it was an enjoyable read. However, it did make me pessimistic, as the status quo, or a deterioration of the situation, has been the norm in a lot of these geographies. As the 21st century progresses, there is not much indication that change is afoot unless a planet-threatening situation like climate change becomes a forcing function to set aside petty squabbles and focus on larger resolutions. Being idealistic or moralistic will not jibe well with the ideas in this book; the way forward is to think of creative new ways to resolve a lot of these global problems. Great ideas and great leaders need to arise to challenge these realities and put humanity first.

As the author ruefully writes:

"A human being first burst through the top layer of the stratosphere in 1961 when twenty-seven-year-old Soviet cosmonaut Yuri Gagarin made it into space aboard Vostok 1. It is a sad reflection on humanity that the name of a fellow Russian named Kalashnikov is far better known."


A couple of other reads recommended to me are Peter Zeihan’s “The Accidental Superpower” and Robert D Kaplan’s “The Revenge of Geography”. I look forward to reading them as well.

Text Summarizer on Hugging Face with mlflow


Hugging Face is the go-to resource for open source natural language processing these days. The Hugging Face hubs are an amazing collection of models, datasets and metrics for getting NLP workflows going. It is relatively easy to incorporate this into an mlflow paradigm if you use mlflow for your model management lifecycle. mlflow makes it trivial to track the model lifecycle, including experimentation, reproducibility, and deployment. mlflow's open format makes it my go-to framework for tracking models in an array of personal projects, and it also has an impressive enterprise implementation that my teams at work enable for large enterprise use cases. For smaller projects, it's great to use mlflow locally for any project that requires model management, as this example illustrates.

The beauty of Hugging Face (HF) is the ability to use their pipelines to run models for inference. The models are the products of massive training workflows performed by big tech and made available to ordinary users who can use them for inference. The HF pipelines offer a simple API dedicated to performing inference with these models, sparing the ordinary user the complexity and the compute/storage requirements of training such large models.

The goal was to put some sort of tracking around all my experiments with the Hugging Face summarizer that I've been using to summarize text, and then use mlflow serving via REST as well as run predictions on the inferred model by passing in a text file. The code repository is here, with snippets below.

Running the Text Summarizer and calling it via curl

Text summarization comes in Extractive and Abstractive flavors: Extractive selects the sentences that carry the most valuable context, while Abstractive models are trained to generate new summary text.

Considering I was running on a CPU, I picked a small model: the T5-small model trained on the WikiHow All dataset, which has been trained to write summaries. The boilerplate code on the Hugging Face website gives you all you need to get started. Note that this model's input length is capped at 512 tokens, which may not be optimal for use cases with larger text.
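For reference, a minimal sketch of that boilerplate using the transformers pipeline API (the input text is a stand-in; the model name is the same checkpoint used in the wrapper below):

from transformers import pipeline

# Summarization pipeline around the same WikiHow-trained T5-small checkpoint
summarizer = pipeline("summarization",
                      model="deep-learning-analytics/wikihow-t5-small")

text = "Long article text goes here ..."
print(summarizer(text, max_length=150, min_length=30, do_sample=False))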

a) The first step is to define a wrapper around the model code so it can be called easily later on, by subclassing mlflow.pyfunc.PythonModel to use custom logic and artifacts.

class Summarizer(mlflow.pyfunc.PythonModel):
    '''
    Any MLflow Python model is expected to be loadable as a python_function model.
    '''

    def __init__(self):
        from transformers import pipeline, AutoTokenizer, AutoModelWithLMHead

        self.tokenizer = AutoTokenizer.from_pretrained(
            "deep-learning-analytics/wikihow-t5-small")

        self.summarize = AutoModelWithLMHead.from_pretrained(
            "deep-learning-analytics/wikihow-t5-small")

    def summarize_article(self, row):
        tokenized_text = self.tokenizer.encode(row[0], return_tensors="pt")

        # T5-small model trained on Wikihow All data set.
        # model was trained for 3 epochs using a batch size of 16 and learning rate of 3e-4.
        # Max_input_lngth is set as 512 and max_output_length is 150.
        s = self.summarize.generate(
            tokenized_text,
            max_length=150,
            num_beams=2,
            repetition_penalty=2.5,
            length_penalty=1.0,
            early_stopping=True)

        s = self.tokenizer.decode(s[0], skip_special_tokens=True)
        return [s]

    def predict(self, context, model_input):
        # Apply the summarizer to the incoming dataframe and return it with the summary column
        model_input[['name']] = model_input.apply(
            self.summarize_article)

        return model_input

b) We define the tokenizer to prepare the inputs of the model, and the model itself, using the Hugging Face specifications. This is a smaller model trained on the WikiHow All dataset. From the documentation: the model was trained for 3 epochs using a batch size of 16 and a learning rate of 3e-4. Max_input_length is set to 512 and max_output_length to 150.

c) Then define the model specifics of the T5-small model by calling the summarize_article function with the tokenized text; it will be called for every row in the dataframe input and eventually return the prediction.

d) The prediction function calls summarize_article with the model input and returns the prediction. This is also where we plug into mlflow to infer the predictions.

The input and output schemas are defined in the ModelSignature class as follows:

import json

from mlflow.models.signature import ModelSignature

# Input and Output formats
input = json.dumps([{'name': 'text', 'type': 'string'}])
output = json.dumps([{'name': 'text', 'type': 'string'}])

# Build the model signature from the schema spec
signature = ModelSignature.from_dict({'inputs': input, 'outputs': output})


e) We can set up mlflow operations by setting the tracking URI, which was "" in this case since it's running locally. It's trivial on a platform like Azure to spin up a Databricks workspace and get a tracking server automatically, so you can persist all artifacts at cloud scale.
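A small sketch of that local setup (the experiment name is just an illustrative choice):

import mlflow

# An empty tracking URI keeps runs and artifacts in the local ./mlruns directory
mlflow.set_tracking_uri("")
mlflow.set_experiment("hf_summarizer")  # hypothetical experiment name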


Start tracking the runs by wrapping the work in an mlflow.start_run invocation. The key here is to log the model using mlflow.pyfunc so that the Python code loads into mlflow. In this case, the dependencies of the model are all stored directly with the model. Plenty of the parameters here can be tweaked, as described here.

# Start tracking
with mlflow.start_run(run_name="hf_summarizer") as run:
    print(run.info.run_id)
    runner = run.info.run_id
    print("mlflow models serve -m runs:/" +
          run.info.run_id + "/model --no-conda")
    mlflow.pyfunc.log_model('model', loader_module=None, data_path=None, code_path=None,
                            conda_env=None, python_model=Summarizer(),
                            artifacts=None, registered_model_name=None, signature=signature,
                            input_example=None, await_registration_for=0)


f) Check the runs via the mlflow UI using the "mlflow ui" command, or serve the model by invoking the command mlflow models serve -m runs:/<run_id>/model


g) That's it – call the curl command using the sample text below:

curl -X POST -H "Content-Type:application/json; format=pandas-split" --data '{"columns":["text"],"data":[["Howard Phillips Lovecraft August 20, 1890 – March 15, 1937) was an American writer of weird and horror fiction, who is known for his creation of what became the Cthulhu Mythos.Born in Providence, Rhode Island, Lovecraft spent most of his life in New England. He was born into affluence, but his familys wealth dissipated soon after the death of his grandfather. In 1913, he wrote a critical letter to a pulp magazine that ultimately led to his involvement in pulp fiction.H.P.Lovecraft wrote his best books in Masachusettes."]]}' http://127.0.0.1:5000/invocations

Output:

"name": "Know that Howard Phillips Lovecraft (H.P.Lovecraft was born in New England."}]%

Running the Text Summarizer and calling it via a text file

For larger text, it's more convenient to read the text from a file, format it and run the summarizer on it. The predict_text.py script does exactly that.


a) Clean up the text in article.txt and load the text into a dictionary.

b) Load the model using pyfunc.load_model and then run the model.predict on the dictionary.

# Load model as a PyFuncModel.
# logged_model points at the tracked run, e.g. "runs:/<run_id>/model"
loaded_model = mlflow.pyfunc.load_model(logged_model)


# Predict on a Pandas DataFrame.
summary = loaded_model.predict(pd.DataFrame(dict1, index=[0]))

print(summary['name'][0])

Code here

In summary, this makes for a useful way to track models and outcomes from readily available transformer pipelines and pick the best ones for the task.