Many Worlds Architecture

Data Labs

Data Labs, provided by Siva, offers cleaning and processing services for the data fed to the various algorithms.

Data cleaning is one of the most important parts of machine learning and plays a significant part in building a model. With a well-cleaned dataset, we can often achieve good results even with simple algorithms, which can be very beneficial in terms of computation when the dataset is large.

Different types of data will require different types of cleaning. However, the systematic approach described here can always serve as a good starting point.

Steps involved in Data Cleaning: 

  1. Removal of unwanted observations 

This includes deleting duplicate, redundant, or irrelevant values from your dataset. Duplicate observations most frequently arise during data collection, and irrelevant observations are those that don’t actually fit the specific problem that you’re trying to solve.

  • Redundant observations greatly reduce efficiency because the data repeats, and the repeated records may skew the outcome toward either the correct or the incorrect side, thereby producing unreliable results.

  • Irrelevant observations are any type of data that is of no use to us and can be removed directly.

  2. Fixing Structural errors

The errors that arise during measurement, transfer of data, or other similar situations are called structural errors. Structural errors include typos in the name of features, the same attribute with a different name, mislabeled classes, i.e. separate classes that should really be the same, or inconsistent capitalization. 

  3. Managing Unwanted outliers

Outliers can cause problems with certain types of models. For example, linear regression models are less robust to outliers than decision tree models. Generally, we should not remove outliers until we have a legitimate reason to remove them. Sometimes, removing them improves performance, sometimes not. So, one must have a good reason to remove the outlier, such as suspicious measurements that are unlikely to be part of real data.

  4. Handling missing data

Missing data is a deceptively tricky issue in machine learning. We cannot simply ignore or remove missing observations; they must be handled carefully because they can be an indication of something important. The two most common ways to deal with missing data are listed below, and each has a drawback:

  • Dropping observations with missing values.

Dropping is suboptimal because the fact that the value was missing may be informative in itself. In addition, in the real world you often need to make predictions on new data even if some of the features are missing.

  • Imputing the missing values from past observations.

Imputing is also suboptimal. Again, “missingness” is almost always informative in itself, and you should tell your algorithm if a value was missing. Even if you build a model to impute your values, you’re not adding any real information; you’re just reinforcing the patterns already provided by other features.
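The cleaning steps can be illustrated with a minimal sketch in plain Python. It covers steps 1, 2, and 4 only: it normalizes inconsistent capitalization, drops exact duplicates, and handles missing values by both flagging them and filling them with the median. The function name clean, the field names, and the sample records are hypothetical and chosen only for illustration; median imputation is one possible choice, not a prescribed one.

```python
from statistics import median

def clean(rows, numeric_field, label_field):
    """Normalize labels, drop duplicates, then flag and impute missing numbers."""
    deduped, seen = [], set()
    for row in rows:
        row = dict(row)  # work on a copy so the input is left untouched
        # Structural errors: stray whitespace and inconsistent capitalization.
        row[label_field] = row[label_field].strip().lower()
        # Unwanted observations: skip records identical to one already kept.
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            deduped.append(row)

    # Missing data: record the "missingness" in a flag column, then impute.
    observed = [r[numeric_field] for r in deduped if r[numeric_field] is not None]
    fill_value = median(observed)
    for row in deduped:
        row[numeric_field + "_missing"] = row[numeric_field] is None
        if row[numeric_field] is None:
            row[numeric_field] = fill_value
    return deduped

# Hypothetical sample records.
rows = [
    {"category": "Shirt ", "price": 499.0},
    {"category": "shirt", "price": 499.0},   # duplicate after normalization
    {"category": "SHOES", "price": None},    # missing price
    {"category": "shoes", "price": 1299.0},
]
cleaned = clean(rows, numeric_field="price", label_field="category")
```

Keeping the price_missing flag alongside the imputed value lets a downstream algorithm still see that the value was originally absent.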

Forecasting Data Algorithms

Forecasting demand involves measuring pricing, promotion, seasonality, and holiday effects to estimate future sales. In addition to helping with inventory planning and negotiations, it measures how customers respond to your pricing and promotion efforts, which helps optimize marketing. Understanding one’s holiday effects, for example, helps prevent stockouts that leave money on the table. The same approach helps forecast not only demand but also sales and returns.

Forecasting is a technique that uses historical data as inputs to make informed estimates of the direction of future trends. Many quantities can be forecast, such as demand, sales, upcoming purchases, and returns. Demand forecasting is the process of predicting the future demand for products, whether they are new products or ones that have been sold for years. The advantages of demand forecasting are:

  • Helps reduce financial risk

  • Provides customers with products when they want them

  • Decreases inventory expenses

  • Helps create a pricing strategy that reflects the demand

Most sellers (especially small business owners) apply the wrong methodology (like an expert opinion or collective knowledge) or rely on incomplete data to make decisions. This gives them incorrect information and puts them at a disadvantage compared with large businesses, which can spend large sums of money to buy market research or build their own AI tool to predict future demand.

Siva’s forecasting tool brings the power available to large businesses and hands it to smaller sellers.
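As a generic illustration of using historical data to estimate a future value, the sketch below applies simple exponential smoothing. It is not a description of Siva's forecasting tool: the function name, the smoothing factor alpha, and the sample sales figures are assumptions, and the method ignores the pricing, promotion, seasonality, and holiday effects discussed above.

```python
def exponential_smoothing_forecast(history, alpha=0.5):
    """One-step-ahead forecast using simple exponential smoothing.

    Each new observation pulls the running level toward itself:
    level = alpha * observed + (1 - alpha) * level
    """
    if not history:
        raise ValueError("history must contain at least one observation")
    level = history[0]
    for observed in history[1:]:
        level = alpha * observed + (1 - alpha) * level
    return level

# Hypothetical weekly unit sales for one product.
weekly_units_sold = [100, 120, 110, 130]
forecast = exponential_smoothing_forecast(weekly_units_sold, alpha=0.5)
```

A larger alpha makes the forecast react faster to recent sales; a smaller alpha gives a smoother estimate that is less sensitive to one unusual week.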

DFD for forecasting

UCD Price, quantity forecasting

Solution  Macro Economic measures

The question of mass-scale deployment of the open network platform app comes up. How do you convince millions of people to install a new application on their phones? Many buyers and sellers from rural areas might not even know how to install an application. During the COVID-19 pandemic, it took the government months to get people to download and install the CoWIN application on their phones.

We therefore build on the existing ecosystem, which already has enough penetration. We developed a chatbot that can sit on top of any messenger application and help the end user interact with the platform. There are about 49 crore WhatsApp users in India. In doing so, we have solved the economic case of finding the cheapest way to mass-deploy the open network application. This saves the government crores of rupees in rolling out the network and provides better adherence and lower training costs. Can we help users (like the local mom-and-pop store or the costermonger) upload prices and catalogues of their products via WhatsApp messages, and receive and place orders via WhatsApp?

There are about 40 crore Instagram users. The majority of them create reels that showcase a product or service. How can we leverage those reels to create product catalogs? People are using Instagram to sell their products as part of social commerce. How can we onboard them to an open network?

If we make onboarding easy, if we make the network participants' work easy via our iPaaS, if we make our development community's work easy, and if we utilize the extremely wide-scale ecosystem that is currently available, we can create a massive impact in the online commerce space, not just in India but around the world.

Can we one day take a picture of a product or service and upload it to the open network, or convert a video of someone buying a shirt and push it onto the open network? What will be the overall network economics? Who will do what activity? Who can do what activity? These are some of the questions that we need to answer.

Siva Many Worlds


Onboarding comprises multiple subsystems working together seamlessly. The first is the chatbot, which allows communication and transfer of data from the end user (the seller, in the onboarding scenario) to the Siva platform. Next comes cataloging, which comprises category mapping, template mapping, field prediction, image optimization, keyword generation, and description generation.

DFD Category new template mapping -a

DFD Category new template mapping -b

UCD for cataloging, field prediction, image optimization, keyword, description gen

Virtual Try On (VTO)

A virtual try-on falls under the category of augmented reality. It enables buyers/customers to try on a product on themselves without actually touching the said product. All this is done virtually with the help of a phone, laptop, or other devices that capture the image in real-time. Virtually trying on items allows customers to know exactly how something looks on them. Whether it is shoes or a piece of jewelry, virtually trying on products enables customers and retailers to connect without the barrier of customers coming to a physical store for shopping.

Virtual try-on increases customer satisfaction and customer engagement. If customers have already tried on a product virtually, the chances of a return are reduced. Also, by offering VTO, sellers can keep their prospective customers engaged and thus have a higher chance of selling to them.


VTO can be used in multiple places, and the use cases become endless once this feature becomes mainstream. Currently, it can be used for makeup products such as lipsticks, as well as for wristwatches, eyewear, and, last but not least, clothing.


Transformer Model

A transformer is a deep learning model that adopts the self-attention mechanism, differentially weighing the significance of each part of the input data. Like recurrent neural networks (RNNs), transformers are designed to process sequential input data, such as natural language, with applications for tasks such as translation and text summarization. However, unlike RNNs, transformers process the entire input all at once. The attention mechanism provides context for any position in the input sequence. For example, if the input data is a natural language sentence, the transformer does not have to process one word at a time. This allows for more parallelization than RNNs and therefore reduces training times.


The input text is parsed into tokens by a byte pair encoding tokenizer, and each token is converted via a word embedding into a vector. Then, positional information of the token is added to the word embedding.

Encoder–decoder architecture

Like earlier seq2seq models, the original Transformer model used an encoder-decoder architecture. The encoder consists of encoding layers that process the input iteratively, one layer after another. In contrast, the decoder consists of decoding layers that do the same thing to the encoder's output.

The function of each encoder layer is to generate encodings that contain information about which parts of the inputs are relevant to each other. It passes its encodings to the next encoder layer as inputs. Each decoder layer does the opposite, taking all the encodings and using their incorporated contextual information to generate an output sequence. To achieve this, each encoder and decoder layer uses an attention mechanism.

For each input, attention weighs the relevance of every other input and draws from them to produce the output. Each decoder layer has an additional attention mechanism that draws information from the outputs of previous decoders before the decoder layer draws information from the encodings.

Both the encoder and decoder layers have a feed-forward neural network for additional processing of the outputs, and both contain residual connections and layer normalization steps.

Scaled dot-product attention

The transformer building blocks are scaled dot-product attention units. When a sentence is passed into a transformer model, attention weights are calculated between every token simultaneously. The attention unit produces embeddings for every token in the context that contains information about the token and a weighted combination of other relevant tokens, each weighted by its attention weight.

The attention calculation for all tokens can be expressed as one large matrix calculation using the softmax function, which is useful for training because optimized matrix routines compute it quickly:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

Here Q, K, and V are the matrices whose rows are the vectors q_i, k_i, and v_i, respectively, and d_k is the dimension of the key vectors.
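A minimal sketch of a single scaled dot-product attention unit is given here in plain Python, with lists of lists standing in for matrices. It computes softmax(Q K^T / sqrt(d_k)) V row by row, where d_k is the key dimension. The function names and the toy Q, K, and V values are illustrative assumptions; a real model would use an optimized tensor library and learned projections.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of numbers."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Return (outputs, weights) for query rows Q, key rows K, value rows V."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted combination of the value rows.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights

# Toy example: both keys are identical, so each query attends to them equally.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [1.0, 0.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
outputs, weights = scaled_dot_product_attention(Q, K, V)
```

Because the two keys are equal in this toy input, every attention weight is 0.5 and each output row is the average of the two value rows.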

Image processing

Image processing identifies specific features of particular objects in an image. AI-based image recognition often uses such techniques as object detection, object recognition, and segmentation. Image classification helps to identify the type of product present in the image and also find the characteristics of the product. So if one provides an image of a red shirt, it will not only identify that the image is of a shirt but also identify the color.

An Image Classification problem involves labeling a set of images with a single category, predicting these categories for a new set of test images, and measuring the accuracy of the predictions. There are a variety of challenges associated with this task, including viewpoint variation, scale variation, intra-class variation, image deformation, image occlusion, illumination conditions, background clutter, etc.

Instead of trying to specify what every one of the image categories of interest looks like directly in code, we provide the computer with many examples of each image class and then develop learning algorithms that look at these examples and learn about the visual appearance of each class. In other words, we first accumulate a training dataset of labeled images and then feed it to the computer to get familiar with the data.

Given that fact, the complete image classification pipeline can be formalized as follows:

  • Our input is a training dataset that consists of N images, each labeled with one of K different classes.

  • Then, we use this training set to train a classifier to learn what every one of the classes looks like.

  • In the end, we evaluate the classifier's quality by asking it to predict labels for a new set of images that it has never seen before. We will then compare the true labels of these images to the ones predicted by the classifier.
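The three pipeline stages above can be sketched with a deliberately simple classifier. The example uses a nearest-neighbour rule with L1 pixel distance rather than a neural network, purely to keep the sketch short; the function names and the tiny four-pixel "images" are hypothetical.

```python
def l1_distance(image_a, image_b):
    """Sum of absolute pixel differences between two flattened images."""
    return sum(abs(a - b) for a, b in zip(image_a, image_b))

def predict(train_images, train_labels, image):
    """Label a new image with the label of its closest training image."""
    nearest = min(range(len(train_images)),
                  key=lambda i: l1_distance(train_images[i], image))
    return train_labels[nearest]

def accuracy(train_images, train_labels, test_images, test_labels):
    """Fraction of held-out images whose predicted label matches the true one."""
    correct = sum(predict(train_images, train_labels, image) == label
                  for image, label in zip(test_images, test_labels))
    return correct / len(test_labels)

# Input: N = 2 training images, K = 2 classes (toy data).
train_images = [[0, 10, 5, 0], [250, 240, 255, 245]]
train_labels = ["dark", "bright"]

# Evaluation: images the classifier has never seen before.
test_images = [[20, 0, 10, 5], [200, 220, 210, 230]]
test_labels = ["dark", "bright"]
score = accuracy(train_images, train_labels, test_images, test_labels)
```

For a nearest-neighbour rule the "training" stage is just storing the labeled examples; learned models such as the CNNs described next replace that step with parameter fitting.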

Convolutional Neural Networks (CNN)

Convolutional Neural Networks (CNNs) are the most popular neural network model for image classification problems. The idea behind CNNs is that a local understanding of an image is good enough. The practical benefit is that having fewer parameters greatly improves the time it takes to learn and reduces the amount of data required to train the model. Instead of a fully connected network of weights from each pixel, a CNN has just enough weights to look at a small image patch. It's like reading a book using a magnifying glass; eventually, you read the whole page, but you look at only a small patch of the page at any given time.

Consider a 256 × 256 image. A CNN can efficiently scan it chunk by chunk, say, with a 5 × 5 window. The 5 × 5 window slides along the image (usually left to right and top to bottom). How "quickly" it slides is called its stride length. For example, a stride length of 2 means the 5 × 5 sliding window moves by 2 pixels at a time until it spans the entire image. A convolution is a weighted sum of the pixel values of the image as the window slides across the whole image.

The sliding-window magic happens in the convolution layer of the neural network. A typical CNN has multiple convolution layers. Each convolutional layer typically generates many alternate convolutions, so the weight matrix is a tensor of 5 × 5 × n, where n is the number of convolutions.

A useful property of a CNN is that the number of parameters in a convolution layer is independent of the size of the original image. You can run the same CNN on a 300 × 300 image, and the number of parameters in the convolution layer won't change.
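The sliding window and stride can be made concrete with a small sketch in plain Python. It assumes a single-channel image, a square kernel, and no padding; the function names are illustrative and not taken from any library.

```python
def conv2d(image, kernel, stride=1):
    """Slide a square kernel over a 2-D image and return the weighted sums."""
    k = len(kernel)
    height, width = len(image), len(image[0])
    output = []
    for top in range(0, height - k + 1, stride):
        row = []
        for left in range(0, width - k + 1, stride):
            # Weighted sum of the pixels currently under the window.
            row.append(sum(kernel[i][j] * image[top + i][left + j]
                           for i in range(k) for j in range(k)))
        output.append(row)
    return output

def output_size(image_size, window, stride):
    """Number of window positions along one axis (no padding)."""
    return (image_size - window) // stride + 1

# A 4 x 4 image of ones, a 2 x 2 kernel of ones, stride 2.
small = conv2d([[1] * 4 for _ in range(4)], [[1, 1], [1, 1]], stride=2)

# The 5 x 5 window with stride 2 from the text, on two image sizes.
positions_256 = output_size(256, 5, 2)
positions_300 = output_size(300, 5, 2)
weights_per_convolution = 5 * 5  # the same for both image sizes
```

The number of window positions grows with the image, but the 25 weights of each 5 × 5 convolution are reused at every position, which is why the parameter count does not depend on image size.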

Tesseract (OCR)

Tesseract is an open-source optical character recognition (OCR) engine. It is one of the most popular and highest-quality OCR libraries.

OCR is used to identify text within images and extract it. Its main function here is to identify text within product images and convert it into a machine-readable text format so that the characteristics and features of the product can be pulled out.

Tesseract finds patterns in pixels, letters, words, and sentences. It uses a two-pass approach called adaptive recognition. The first pass performs character recognition; the second pass fills in any letters the first pass was not sure of, using letters that match the word or sentence context.



Hidden Markov Model

A hidden Markov model (HMM) is a statistical model that can describe the evolution of observable events that depend on internal factors which are not directly observable. We call the observed event a 'symbol' and the invisible factor underlying the observation a 'state'. Markov and hidden Markov models are engineered to handle data that can be represented as a 'sequence' of observations over time. Hidden Markov models are probabilistic frameworks where the observed data are modeled as a series of outputs generated by one of several (hidden) internal states.

Markov Model: a series of (hidden) states z = {z_1, z_2, ...} drawn from a state alphabet S = {s_1, s_2, ..., s_|S|}, where each z_i belongs to S.

Hidden Markov Model: a series of observed outputs x = {x_1, x_2, ...} drawn from an output alphabet V = {v_1, v_2, ..., v_|V|}, where each x_i belongs to V.
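To show how hidden states and observed symbols interact, the sketch below computes the probability of an observed sequence with the forward algorithm, which sums over all possible hidden state paths. The two-state, two-symbol model and every probability in it are made-up values for illustration; the start, transition, and emission tables are assumptions that the definitions above do not specify.

```python
from itertools import product

def forward_likelihood(observations, states, start_p, trans_p, emit_p):
    """P(x_1, ..., x_T), summed over all hidden state sequences."""
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for symbol in observations[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[prev] * trans_p[prev][s]
                                       for prev in states)
            for s in states
        }
    return sum(alpha.values())

# Hypothetical model: state alphabet S and output alphabet V.
states = ["s_1", "s_2"]
symbols = ["v_1", "v_2"]
start_p = {"s_1": 0.6, "s_2": 0.4}
trans_p = {"s_1": {"s_1": 0.7, "s_2": 0.3},
           "s_2": {"s_1": 0.4, "s_2": 0.6}}
emit_p = {"s_1": {"v_1": 0.9, "v_2": 0.1},
          "s_2": {"v_1": 0.2, "v_2": 0.8}}

single = forward_likelihood(["v_1"], states, start_p, trans_p, emit_p)

# Sanity check: the likelihoods of all length-3 sequences should sum to 1.
total = sum(forward_likelihood(list(seq), states, start_p, trans_p, emit_p)
            for seq in product(symbols, repeat=3))
```

The forward recursion costs time proportional to T·|S|^2 for a sequence of length T, instead of enumerating all |S|^T hidden paths.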

Fast Fourier Transform

Fast Fourier Transform (FFT) is an algorithm that determines the Discrete Fourier Transform of an input significantly faster than computing it directly. In computer science lingo, the FFT reduces the number of computations needed for a problem of size N from O(N^2) to O(N log N).

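The Fourier Transform can speed up convolutions by taking advantage of the following property, known as the convolution theorem:

F{f * g} = F{f} · F{g}

where F denotes the Fourier transform, * denotes convolution, and · denotes element-wise multiplication.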

The above equation states that the convolution of two signals is equivalent to the multiplication of their Fourier transforms. Therefore, by transforming the input into frequency space, a convolution becomes a single element-wise multiplication. In other words, the input to a convolutional layer and kernel can be converted into frequencies using the Fourier Transform, multiplied once, and then converted back using the inverse Fourier Transform. There is an overhead associated with transforming the inputs into the Fourier domain and the inverse Fourier Transform to get responses back to the spatial domain. However, this is offset by the speed obtained from performing a single multiplication instead of having to multiply the kernel with different sections of the image.
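The transform, multiply, and inverse-transform procedure can be sketched in one dimension with plain Python. The radix-2 FFT below is a textbook recursive version that requires a power-of-two length, so the inputs are zero-padded; the function names and sample values are illustrative assumptions, and a production system would call an optimized FFT library instead.

```python
import cmath

def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two.

    With inverse=True the result is unnormalized (divide by n afterwards).
    """
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def fft_convolve(signal, kernel):
    """Convolve via the frequency domain: transform, multiply, invert."""
    size = len(signal) + len(kernel) - 1
    n = 1
    while n < size:
        n *= 2  # zero-pad up to the next power of two
    a = fft(list(signal) + [0] * (n - len(signal)))
    b = fft(list(kernel) + [0] * (n - len(kernel)))
    spectrum = [x * y for x, y in zip(a, b)]  # single element-wise product
    result = fft(spectrum, inverse=True)
    return [(value / n).real for value in result[:size]]

def direct_convolve(signal, kernel):
    """Reference implementation: multiply the kernel with every shift."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out

via_fft = fft_convolve([1, 2, 3], [0, 1, 0.5])
via_direct = direct_convolve([1, 2, 3], [0, 1, 0.5])
```

Both routes give the same result up to floating-point rounding; the frequency-domain route pays the transform overhead once and tends to win as the kernel and input grow.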

Single Shot Detector (SSD)

A single-shot detector takes a single shot to detect multiple objects within the image. To achieve high detection accuracy, the SSD model produces predictions at different scales from the feature maps of different scales and explicitly separates predictions by aspect ratio.

These techniques result in simple end-to-end training and high accuracy, even on input images of low resolutions. SSD has two components: a backbone model and an SSD head. The backbone model usually is a pre-trained image classification network as a feature extractor. This is typically a network like ResNet trained on ImageNet from which the final fully connected classification layer has been removed. We are thus left with a deep neural network that can extract semantic meaning from the input image while preserving the spatial structure of the image, albeit at a lower resolution. For ResNet34, the backbone results in 256 7x7 feature maps for an input image. The SSD head is just one or more convolutional layers added to this backbone. The outputs are interpreted as the bounding boxes and classes of objects in the spatial location of the activations of the final layers.

Grid Cell

Instead of using a sliding window, SSD divides the image using a grid and has each grid cell responsible for detecting objects in that region of the image. Detecting an object simply means predicting the class and location of an object within that region. If no object is present, we consider it the background class, and the location is ignored. For instance, we could use a 4x4 grid, in which each grid cell can output the position and shape of the object it contains.

Anchor Box

Each grid cell in SSD can be assigned multiple anchor boxes (also called prior boxes). These anchor boxes are pre-defined, and each one is responsible for a particular size and shape within a grid cell. SSD uses a matching phase during training to match the appropriate anchor box with the bounding box of each ground truth object within an image. Essentially, the anchor box with the highest degree of overlap with an object is responsible for predicting that object's class and its location. This property is used for training the network and for predicting the detected objects and their locations once the network has been trained. In practice, each anchor box is specified by an aspect ratio and a zoom level.

Aspect ratio

Not all objects are square. Some are longer, and some are wider by varying degrees. The SSD architecture allows pre-defined aspect ratios of the anchor boxes to account for this. The ratios parameter can specify the aspect ratios of the anchor boxes associated with each grid cell at each zoom/scale level.

Zoom level

The anchor boxes don't need to have the same size as the grid cell. We might be interested in finding smaller or larger objects within a grid cell. The zooms parameter is used to specify how much the anchor boxes need to be scaled up or down relative to each grid cell.
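The interplay of grid cells, aspect ratios, zoom levels, and overlap-based matching can be sketched as follows. The parameter names ratios and zooms follow the text, but the functions themselves are hypothetical, and the rule that an anchor's size is cell size × zoom × ratio is an assumption made for this sketch. Boxes are (center x, center y, width, height) in image-relative units from 0 to 1.

```python
def anchor_boxes(grid_size, ratios, zooms):
    """Generate one anchor per (grid cell, zoom, aspect ratio) combination."""
    cell = 1.0 / grid_size
    boxes = []
    for row in range(grid_size):
        for col in range(grid_size):
            cx, cy = (col + 0.5) * cell, (row + 0.5) * cell  # cell center
            for zoom in zooms:
                for ratio_w, ratio_h in ratios:
                    boxes.append((cx, cy,
                                  cell * zoom * ratio_w,
                                  cell * zoom * ratio_h))
    return boxes

def iou(box_a, box_b):
    """Intersection over union: the 'degree of overlap' between two boxes."""
    ax0, ay0 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax1, ay1 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx0, by0 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx1, by1 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    intersection = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - intersection
    return intersection / union if union > 0 else 0.0

def best_anchor(anchors, ground_truth):
    """Index of the anchor that overlaps the ground truth box the most."""
    return max(range(len(anchors)), key=lambda i: iou(anchors[i], ground_truth))

# A 4x4 grid with three aspect ratios and three zoom levels per cell.
anchors = anchor_boxes(4, ratios=[(1.0, 1.0), (1.0, 0.5), (0.5, 1.0)],
                       zooms=[0.7, 1.0, 1.3])

# Hypothetical wide object centered in the cell at row 1, column 1.
ground_truth = (0.375, 0.375, 0.25, 0.125)
matched = anchors[best_anchor(anchors, ground_truth)]
```

In this toy case the wide ground truth box is matched to the wide anchor (ratio 1.0 : 0.5, zoom 1.0) of the cell that contains its center.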

Hopfield Network

A Hopfield neural network consists of one layer of 'n' fully connected recurrent neurons. It is generally used for auto-association and optimization tasks. Its output is calculated using a converging iterative process, so it generates a different response than conventional neural networks.

Discrete Hopfield Network: It is a fully interconnected neural network where each unit is connected to every other unit. It behaves discretely, i.e., it gives finite distinct output, generally of two types: 

  • Binary (0/1)

  • Bipolar (-1/1)

Structure & Architecture  

Each neuron has an inverting and non-inverting output.

Because the network is fully connected, the output of each neuron is an input to all other neurons, but not to itself.
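Auto-association with a discrete bipolar Hopfield network can be sketched in a few lines. The example stores patterns with the Hebbian outer-product rule and recalls them by repeated asynchronous updates until nothing changes; the storage rule, the function names, and the two sample patterns are assumptions chosen for illustration.

```python
def train(patterns):
    """Hebbian weights: sum of outer products, with no self-connections."""
    n = len(patterns[0])
    weights = [[0] * n for _ in range(n)]
    for pattern in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # a neuron's output is not fed back to itself
                    weights[i][j] += pattern[i] * pattern[j]
    return weights

def recall(weights, state, max_sweeps=10):
    """Update neurons one at a time until the state stops changing."""
    state = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(state)):
            field = sum(weights[i][j] * state[j] for j in range(len(state)))
            # Bipolar threshold; keep the current value when the field is zero.
            new_value = state[i] if field == 0 else (1 if field > 0 else -1)
            if new_value != state[i]:
                state[i] = new_value
                changed = True
        if not changed:
            break
    return state

# Two bipolar (-1/1) patterns to memorize.
pattern_a = [1, 1, 1, 1, -1, -1, -1, -1]
pattern_b = [1, -1, 1, -1, 1, -1, 1, -1]
weights = train([pattern_a, pattern_b])

# Corrupt the first bit of pattern_a and let the network repair it.
noisy = [-1, 1, 1, 1, -1, -1, -1, -1]
restored = recall(weights, noisy)
```

Starting from the corrupted input, the iterative updates converge back to the stored pattern, which is the auto-association behaviour described above.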

Natural language generation (NLG)

Natural language generation (NLG) is a software process that produces natural language output. In one of the most widely cited surveys of NLG methods, NLG is characterized as "the subfield of artificial intelligence and computational linguistics that is concerned with the construction of computer systems that can produce understandable texts in English or other human languages from some underlying non-linguistic representation of information."

NLG may be viewed as complementary to natural-language understanding (NLU): whereas in natural-language understanding, the system needs to disambiguate the input sentence to produce the machine representation language, in NLG, the system needs to make decisions about how to put a representation into words. The practical considerations in building NLU vs. NLG systems are not symmetrical. NLU needs to deal with ambiguous or erroneous user input, whereas the ideas the system wants to express through NLG are generally known precisely. NLG needs to choose a specific, self-consistent textual representation from many potential representations, whereas NLU generally tries to produce a single, normalized representation of the idea expressed.


The process of generating text can be as simple as keeping a list of canned text that is copied and pasted, possibly linked with some glue text. The results may be satisfactory in simple domains such as horoscope machines or generators of personalized business letters. However, a sophisticated NLG system needs to include stages of planning and merging of information to enable the generation of text that looks natural and does not become repetitive. The typical stages of natural-language generation are:

Content determination: Deciding what information to mention in the text.

Document structuring: Overall organization of the information to convey.

Aggregation: Merging of similar sentences to improve readability and naturalness.

Lexical choice: Putting words to the concepts.

Referring expression generation: Creating referring expressions that identify objects and regions.

Realization: Creating the actual text, which should be correct according to the rules of syntax, morphology, and orthography.
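A toy, template-based sketch of some of these stages is shown below for a product description. It is far simpler than a real NLG system: the product record, the attribute names, and the sentence template are all hypothetical, and document structuring and referring expression generation are omitted.

```python
def determine_content(product):
    """Content determination: keep only attributes that have a value."""
    return {key: value for key, value in product.items() if value}

def build_noun_phrase(facts):
    """Aggregation and lexical choice: fold color and material into one phrase."""
    modifiers = [facts[key] for key in ("color", "material") if key in facts]
    return " ".join(modifiers + [facts["type"]])

def realize(facts):
    """Realization: produce a capitalized, punctuated sentence."""
    phrase = build_noun_phrase(facts)
    article = "an" if phrase[0].lower() in "aeiou" else "a"
    sentence = f"this is {article} {phrase}"
    if "fit" in facts:
        sentence += f" with a {facts['fit']} fit"
    return sentence[0].upper() + sentence[1:] + "."

def describe(product):
    """Run the stages in order on one product record."""
    return realize(determine_content(product))

# Hypothetical structured (non-linguistic) product records.
shirt = {"type": "shirt", "color": "red", "material": "cotton",
         "fit": "slim", "pattern": None}
description = describe(shirt)
short_description = describe({"type": "shirt", "color": "orange"})
```

Attributes without a value, such as the missing pattern, are dropped during content determination, and the realization step picks "a" or "an" so that the sentence stays grammatical.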