Thursday, March 29, 2018

Designing Chatbot for E-Commerce Customer Care

I wrote this post because I could not find good examples online of how to design a Messenger bot for customer care.
These are the steps I followed to design a Messenger bot for customer care.
Step 1: Identify goal: Messenger bot for customer care.
Step 2: Identify non-goal: Messenger bot for marketing updates.
Step 3: Identify Impact:
  • Impact on the customer: how quickly the customer can resolve the need that made them contact customer care.
  • Impact on the business: how many agent hours are saved when the bot can satisfy the customer's need.
Step 4: Identify Use Cases: 
  • Cancel Order
  • Update Billing Address
  • Update Shipping Address
  • Update Payment Method
  • Track Order
  • Get Invoice
  • Return Device
  • Track Return Processing
  • Help Setup
  • Debug Device

Step 5: Detailed Flows:

Cancel Order Flows (a minimal routing sketch follows the list)
Flow 1: Customer is identified by matching their Facebook login to the purchase.
Flow 2: Order cannot be cancelled because it has already shipped.
Flow 3: Order id is not available and the identifying information does not match any purchase order; ask the customer to talk to an agent.
Flow 4: Customer is identified by order id.
Flow 5: Order id is not available, but the identifying information matches a purchase order.
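Below is a minimal, hypothetical sketch of how these flows could be routed in a webhook handler. The in-memory order store, field names, and reply strings are placeholders for illustration, not the Messenger Platform API.

```python
# Hypothetical sketch of routing the cancel-order flows above.
ORDERS = {
    "1001": {"id": "1001", "fb_user_id": "fb-42", "status": "processing"},
    "1002": {"id": "1002", "fb_user_id": "fb-43", "status": "shipped"},
}

def lookup_order(order_id=None, fb_user_id=None):
    """Find an order by id, or fall back to the customer's Facebook login."""
    if order_id in ORDERS:
        return ORDERS[order_id]
    for order in ORDERS.values():
        if fb_user_id and order["fb_user_id"] == fb_user_id:
            return order
    return None

def handle_cancel_order(message):
    order = lookup_order(message.get("order_id"), message.get("fb_user_id"))
    if order is None:
        # Flow 3: customer cannot be matched to a purchase -> hand off to an agent
        return "I couldn't verify your order. Let me connect you to an agent."
    if order["status"] == "shipped":
        # Flow 2: order already shipped, cannot cancel
        return "Sorry, that order has already shipped and can't be cancelled."
    # Flows 1, 4, 5: customer identified (Facebook login, order id, or other info)
    order["status"] = "cancelled"
    return f"Order {order['id']} has been cancelled."

print(handle_cancel_order({"order_id": "1001"}))
```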

Friday, July 17, 2015

Marketing Concepts Technical Overview


Introduction

This document is a 20,000 ft overview of the marketing ecosystem. It shows the main marketing functions and illustrates, at a high level, how core advertising concepts work from a technical perspective.

Marketing Needs

General Marketing Functions & Measures:




 

Important Concepts & Flows

First Party Cookie:

A first-party cookie is a small piece of data sent from a website and stored in the user's web browser while the user is browsing that website. Every time the user loads the website, the browser sends the cookie back to the server to notify the website of the user's previous activity [1]. A first-party cookie's domain name will match the domain name that is shown in the web browser's address bar. [2]

Third Party Cookie:

Third-party cookies belong to domains different from the one shown in the address bar. These sorts of cookies typically appear when web pages feature content, such as banner advertisements, from external websites. This opens up the potential for tracking the user's browsing history, and is often used by advertisers in an effort to serve relevant advertisements to each user. As an example, suppose a user visits www.example.org. This website contains an advertisement from ad.foxytracking.com, which, when downloaded, sets a cookie belonging to the advertisement's domain (ad.foxytracking.com). Then, the user visits another website, www.foo.com, which also contains an advertisement from ad.foxytracking.com, and which also sets a cookie belonging to that domain (ad.foxytracking.com). Eventually, both of these cookies will be sent to the advertiser when loading their advertisements or visiting their website. The advertiser can then use these cookies to build up a browsing history of the user across all the websites that have ads from this advertiser. [2]

Cookie Syncing:
Cookie syncing involves mapping the user ID on one system to the user ID for the same user on another system.


The process begins when a user visits a site (say example.com), which includes A.com as an embedded third-party tracker. (1) The browser makes a request to A.com, and included in this request is the tracking cookie set by A.com. (2) A.com retrieves its tracking ID from the cookie, and redirects the browser to B.com, encoding the tracking ID into the URL. (3) The browser then makes a request to B.com, which includes the full URL A.com redirected to as well as B.com's tracking cookie. (4) B.com can then link its ID for the user to A.com's ID for the user. All of this is invisible to the user. [14]
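To make steps (3) and (4) concrete, here is a minimal sketch of the receiving side (B.com) written as a Flask handler. The endpoint path, the a_uid query parameter, and the cookie names are assumptions for illustration, not any real ad platform's API.

```python
# Illustrative sketch of the receiving side (B.com) of a cookie sync.
# Endpoint, parameter, and cookie names are made up.
from flask import Flask, request

app = Flask(__name__)
ID_MAP = {}  # b_uid -> a_uid: B.com's table linking its ID to A.com's ID

@app.route("/sync")
def sync():
    # Step (3): the browser arrives via A.com's redirect, carrying A.com's
    # tracking ID in the URL and B.com's own cookie in the request headers.
    a_uid = request.args.get("a_uid")
    b_uid = request.cookies.get("b_uid")
    if a_uid and b_uid:
        # Step (4): B.com links its ID for the user to A.com's ID.
        ID_MAP[b_uid] = a_uid
    # A real endpoint would typically return a 1x1 transparent pixel.
    return ("", 204)
```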

Tags

Tags are incorporated into the HTML/JavaScript code delivered to a web browser or app when a web page loads. Here’s a sample of what you might see if you look at the source code on a typical website using third-party tags.

[15]

How do tags work?


[15]
Tags power online marketing and analytics. Specifically, they can:
  • Instruct web browsers to collect data;
  • Set cookies;
  • Extend audiences between multiple websites;
  • Integrate third-party content into a website (e.g. social media widgets, video players, ads, etc.).
Everything from ad campaigns to Google Analytics runs on tags. [15]

Tag Container

A container tag is a code snippet used in web development that removes the need for multiple data tracking codes to be placed directly on the site. Instead, one code snippet is placed on every page of the site. This code acts as a container: all tracking codes can be managed off site in the Google Tag Manager interface and fired from the one code on the site. This means that you can update, add, or remove your tracking codes through the Google Tag Manager interface without waiting on website development cycles. [16]
How does it work?
When a site loads or a specific action is completed on the site, the tag will be activated. It will search the Google Tag Manager database located in the cloud. It will then fire any tracking codes that are held in the account and that match a set of specific rules that you have defined. [16]

Figure 1: A container tag. Google Tag Manager fires, checks the cloud, pulls the tags that fit the criteria, and fires them. [16]
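The sketch below illustrates the rule-matching idea only; it is not how Google Tag Manager is implemented. A single container evaluates which tags' rules match the current page event and fires just those. The tag names and rules are made up.

```python
# Illustrative container-tag rule matching: one container, many tags,
# each tag fires only when its rule matches the page event.
TAGS = [
    {"name": "analytics_pageview",
     "rule": lambda e: e["type"] == "pageview"},
    {"name": "checkout_conversion",
     "rule": lambda e: e["type"] == "pageview" and e["path"] == "/thank-you"},
]

def fire_matching_tags(event):
    for tag in TAGS:
        if tag["rule"](event):
            print(f"firing tag: {tag['name']}")

fire_matching_tags({"type": "pageview", "path": "/thank-you"})
# fires both tags; a plain pageview on another path fires only the first
```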

Contextual Targeting:

A contextual advertising system scans the text of a website for keywords and returns advertisements to the webpage based on those keywords [3]. While setting up a campaign, the advertiser specifies the keywords she is targeting. When a web page is loaded, the bid request contains contextual information based on a historical scan of the page. The tags on the publisher side then need to be mapped to the keywords on the advertiser side; a third-party provider can perform this mapping. A minimal sketch of the keyword matching follows.
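Here is a small sketch of the matching idea, with made-up campaigns and a naive tokenizer; real systems use much richer taxonomies and third-party mappings.

```python
# Illustrative contextual matching between a page's scanned keywords
# and an advertiser's campaign keywords.
CAMPAIGNS = {
    "running-shoes": {"run", "marathon", "sneakers", "training"},
    "car-insurance": {"car", "insurance", "premium", "accident"},
}

def match_campaigns(page_text, min_overlap=2):
    page_keywords = set(page_text.lower().split())
    return [name for name, keywords in CAMPAIGNS.items()
            if len(keywords & page_keywords) >= min_overlap]

print(match_campaigns("Training plans for your first marathon run"))
# -> ['running-shoes']
```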

Behavioral Targeting:

Behavioral targeting uses information collected from an individual’s web-browsing behavior (e.g., the pages that they have visited or the searches they have conducted) to select advertisements to display[4]

The first step in serving behavioral ads is tracking who the users are and what they do online. Tracking refers to the process of collecting and processing data, while targeting refers to using that data to offer personalized solutions or advertising[5]

While setting up a campaign, the advertiser specifies the behaviors they wish to target, e.g., outdoor lifestyle or in-market for a car. When a bid request arrives for a user exhibiting those behaviors, the advertiser responds to it; the audience behaviors are included in the bid request by a third-party vendor. Third-party vendors can use third-party cookies, IP addresses, or geo-based tracking to track user behavior across websites.

Re-targeting:

Retargeting is a form of online targeted advertising by which online advertising is targeted to consumers based on their previous Internet actions, in situations where these actions did not result in a sale or conversion.[6]

Technically, all that is necessary is to place a JavaScript tag in the footer of your website. [7] Every time a new visitor comes to your site, the code drops a third-party cookie in the user's browser. Later, when your cookied visitors browse the web, the cookie lets the DSP/exchange/SSP know when to serve ads, ensuring that your ads are served only to people who have previously visited your site. [8]

Demographic Targeting:

Advertisers would like to reach people who are likely to be of a specific gender, age, ethnicity, income group, education level, home ownership status, and number of children.

The first step is to identify the user using a match service (e.g., Google has an email address and some demographic information). Third-party vendors can supplement the demographic information by using their own third-party cookies or by matching characteristics against census data, IP data, geo data, or behavioral data.

Geo-Fence based Targeting:

This is especially useful for mobile advertising. Advertisers want to send promotions to people in a specific region, e.g., anyone within 5 miles of a mall.

Geo-fence based targeting involves identifying the geo-location of the device and targeting an ad via an app that collects location information. These apps ask the user whether location tracking is allowed. The device ID is key to identifying the user if the advertiser wants to combine behavioral targeting with geo-fence based targeting. [9] A sketch of the distance check itself follows.
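The geo-fence check itself can be done with the haversine distance; the 5-mile radius matches the example above and the coordinates below are made up.

```python
# Illustrative geo-fence check: is the device within `radius_miles` of a point?
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in miles."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 3958.8 * asin(sqrt(a))  # Earth radius ~3958.8 miles

def in_geofence(device_lat, device_lon, fence_lat, fence_lon, radius_miles=5):
    return haversine_miles(device_lat, device_lon, fence_lat, fence_lon) <= radius_miles

# Hypothetical device location vs. a hypothetical mall location
print(in_geofence(41.8820, -87.6230, 41.8781, -87.6298))  # -> True
```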

Geographic region Targeting:

Advertising can be run on national sites (wsj.com) or local sites (chicagotribune.com). It is important for advertisers to identify the key areas where their customers concentrate in order to use local sites effectively.

This targeting generally does not work with mobile apps that are national.

References:


Tuesday, July 14, 2015

Distributed Data and Query Processing

This post covers SQL optimization techniques in a distributed environment and then surveys general data processing frameworks in the Hadoop world.
SQL Optimization
Four stages exist in query execution:
Query decomposition
Data Localization
Global Optimization
Distributed execution
Query Decomposition: the query is validated, redundant predicates are eliminated, and the calculus query is restructured as an algebraic query over logical operators.
Data Localization: This layer determines which fragments are involved in the query and transforms the distributed query into a query on fragments.
Global Query Optimization:
In both centralized and distributed systems, a common heuristic is to minimize the size of intermediate relations. This can be done by performing unary operators first, and ordering the binary operators by the increasing sizes of their intermediate relations. An important heuristic in distributed systems is to replace join operators by combinations of semijoins to minimize data communication. [1]
Optimization can be done statically before executing the query or dynamically as the query is executed. Static query optimization is done at query compilation time. Thus the cost of optimization may be amortized over multiple query executions. Therefore, this timing is appropriate for use with the exhaustive search method. Since the sizes of the intermediate relations of a strategy are not known until run time, they must be estimated using database statistics. [1]
Hybrid query optimization attempts to provide the advantages of static query optimization while avoiding the issues generated by inaccurate estimates. The approach is basically static, but dynamic query optimization may take place at run time when a high difference between predicted sizes and actual size of intermediate relations is detected. [1]
Query optimization refers to the process of producing a query execution plan which represents an execution strategy for the query. This QEP minimizes an objective cost function. A query optimizer, the software module that performs query optimization is usually seen as consisting of three components: a search space, a cost model and a search strategy. The search space is the set of alternative execution plans that represent input query. These plans differ in the execution order of operations and the way these operations are implemented and therefore in their performance. The cost model predicts the cost of a given execution plan. The search strategy explores the search space and selects the best plan using the cost model.[1]
An execution strategy for a distributed query can be described with relational algebra operators and communication primitives for transferring data between sites. Query optimization consists of finding the “best” ordering of operators in the query, including communication operators that minimize a cost function. The cost function, often defined in terms of time units, refers to computing resources such as disk space, disk I/Os, buffer space, CPU cost, communication cost, and so on. Generally, it is a weighted combination of I/O, CPU, and communication costs.[1]
To select the ordering of operators it is necessary to predict execution costs of alternative candidate orderings. Determining execution costs before query execution (i.e., static optimization) is based on fragment statistics and the formulas for estimating the cardinalities of results of relational operators. Thus the optimization decisions depend on the allocation of fragments and the available statistics on fragments, which are recorded in the allocation schema. [1]
An important aspect of query optimization is join ordering, since permutations of the joins within the query may lead to improvements of orders of magnitude. One basic technique for optimizing a sequence of distributed join operators is through the semijoin operator. The main value of the semijoin in a distributed system is to reduce the size of the join operands and then the communication cost.[1]
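Here is a toy sketch of the semijoin idea with two in-memory lists standing in for relations at two sites: only the join-key values of S are "shipped" to R's site, R is reduced to the matching tuples, and only the reduced relation travels back.

```python
# Illustrative semijoin: reduce R before shipping it to S's site.
R = [  # at site 1: (emp_id, name)
    (1, "alice"), (2, "bob"), (3, "carol"), (4, "dave"),
]
S = [  # at site 2: (emp_id, project)
    (2, "ads"), (3, "search"),
]

# Step 1: ship only the join-key values of S to site 1 (small message).
s_keys = {emp_id for emp_id, _ in S}

# Step 2: semijoin R ⋉ S at site 1 keeps only the tuples that will join.
r_reduced = [t for t in R if t[0] in s_keys]

# Step 3: ship the reduced R (2 tuples instead of 4) to site 2 and join.
result = [(rid, name, proj) for rid, name in r_reduced
          for sid, proj in S if rid == sid]
print(result)  # [(2, 'bob', 'ads'), (3, 'carol', 'search')]
```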

Distributed Query Execution: Each subquery executing at one site, called a local query, is then optimized using the local schema of the site and executed. At this time, the algorithms to perform the relational operators may be chosen.
The general formula for response time, counting only the operations that must execute sequentially, is:
Response time = T_CPU * (seq #insts) + T_I/O * (seq #I/Os) + T_MSG * (seq #msgs) + T_TR * (seq #bytes)
where T_CPU is the time of a CPU instruction, T_I/O the time of a disk I/O, T_MSG the fixed time to initiate and receive a message, and T_TR the time to transmit one unit of data between sites. [1]
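A tiny worked example of the formula, with made-up unit costs and counts:

```python
# Hypothetical unit costs (seconds) and sequential operation counts.
T_CPU, T_IO, T_MSG, T_TR = 1e-8, 1e-3, 1e-1, 1e-6
seq_insts, seq_ios, seq_msgs, seq_bytes = 1_000_000, 50, 4, 200_000

response_time = (T_CPU * seq_insts + T_IO * seq_ios
                 + T_MSG * seq_msgs + T_TR * seq_bytes)
print(f"{response_time:.2f} s")  # 0.01 + 0.05 + 0.4 + 0.2 = 0.66 s
```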
The number of join trees that can be produced by applying the commutativity and associativity rules is O(N!) for N relations. Investigating a large search space may make optimization time prohibitive, sometimes much more expensive than the actual execution time. Therefore query optimizers typically restrict the size of the search space they consider. The first restriction is to use heuristics. The most common heuristic is to perform selection and projection when accessing base relations. Another common heuristic is to avoid Cartesian products that are not required by the query. Another important restriction is with respect to the shape of the join tree. Two kinds of join trees are usually distinguished: linear vs bushy trees. By considering only linear trees, the size of the search space is reduced to O(2^N). However, in a distributed environment, bushy trees are useful in exhibiting parallelism. [1]



Broadcast join: used when one side of the join is a small table; the small table is replicated to all nodes so the join can run locally against each partition of the large table.
Shuffle join: used when joining two large tables; both tables are repartitioned (bucketed) on the join key so that matching keys are co-located on the same node before joining. These strategies apply to relational-style processing.
Hive implements several variants of these: common join, map join, auto map join, bucket map join, sort-merge bucket map join, and skew join. [3]
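As a concrete illustration, Spark SQL exposes a broadcast hint; the sketch below assumes a local Spark installation, and the table contents and column names are made up.

```python
# Sketch of a broadcast join in PySpark; the data is made up.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.master("local[*]").appName("join-demo").getOrCreate()

orders = spark.createDataFrame(
    [(1, "US", 20.0), (2, "DE", 35.0), (3, "US", 12.5)],
    ["order_id", "country_code", "amount"])
countries = spark.createDataFrame(
    [("US", "United States"), ("DE", "Germany")],
    ["country_code", "country_name"])

# The small dimension table is replicated to every executor, so the large
# fact table does not need to be shuffled on the join key.
joined = orders.join(broadcast(countries), "country_code")
joined.show()
```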

It is possible to run complex processing, such as R-like functions, in an MPP environment; however, MPP architectures cannot store data in an unstructured format, which makes those databases less flexible.

Hadoop:
Schema on read, not schema on write.
This changed the industry by allowing users to store data first and worry about how to analyze it later.

Map Reduce
MapReduce is a general-purpose processing framework, originally proposed in a paper from Google. A very wide range of processing is possible with the map/reduce paradigm.
MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
A MapReduce program is composed of a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The "MapReduce System" (also called "infrastructure" or "framework") orchestrates the processing by marshaling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance
[4]           
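The canonical example is word count. Here is a minimal local simulation of the map and reduce phases (not a real MapReduce framework):

```python
# Word-count sketch simulating the map and reduce phases locally.
from collections import defaultdict

def map_phase(document):
    # Emit (word, 1) for every word, like a Map() task.
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(pairs):
    # Group by key and sum the counts, like a Reduce() task.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["the quick brown fox", "the lazy dog", "the fox"]
all_pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(all_pairs))   # {'the': 3, 'quick': 1, 'fox': 2, ...}
```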

Dryad 
This is a general-purpose processing framework as well.
Dryad is an execution engine that is optimized for certain types of distributed computation. It relies on the fact that its inputs are immutable and finite, and the outputs of a job are not generally available until the entire computation completes  (i.e. it doesn’t support “infinite” streaming computations) [5]
Dryad is optimized for medium to large clusters (say, between 100 computers and 10,000 computers), and for long-running batch jobs rather than sub-second interactive lookups.
It has mostly been used for data-intensive computations which can benefit from streaming data to or from many disks at once, and which can be broken into parallel stages that do not require very frequent communications. If a program can be broken into pieces that can do more than 20 or 30 seconds of IO or computation before needing to communicate then it can probably be run efficiently on Dryad. The basic computational model that Dryad uses is a directed-acyclic graph (DAG)
[5] 
The approach is to create a DAG of operations with data flowing between them.


A Dryad programmer writes several sequential programs and connects them using one-way channels. The computation is structured as a directed graph: programs are graph vertices, while the channels are graph edges. A Dryad job is a graph generator, which can synthesize any directed acyclic graph. These graphs can even change during execution, in response to important events in the computation.[5]

Dryad is notable for allowing graph vertices and computations in general to use an arbitrary number of inputs and outputs. MapReduce restricts all computations to take a single input set and generate a single output set. SQL and shader languages allow multiple inputs but generate a single output from the user’s perspective, though SQL query plans internally use multiple-output vertices. [6]

Different implementations of the Dryad model exist, with Tez being the prominent one in the Hadoop world.

Resilient Distributed Data Sets: This is a general purpose processing framework as well.
Spark is a product that implements resilient distributed datasets (RDDs). Spark is able to keep intermediate results in memory and share data across DAGs, and it handles iterative computation well, so ML algorithms are a natural fit for the underlying architecture.
Instead of materializing every RDD, RDDs record the delta of operations (the lineage) that leads from one RDD to another. This facilitates iterative machine learning algorithms on top of Spark as well as interactive operations, and the capabilities are further enhanced by effective use of memory to store intermediate results.
The difference between Spark and other data processing engines is its use of memory and its ability to recover intermediate transformations by replaying the lineage between RDDs. Hence there is no need to save intermediate state and checkpoint in order to recover, whereas with Dryad or MapReduce you would need to reprocess the entire operation. [7]
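A small PySpark sketch of the lineage idea: each RDD records the transformation that produces it from its parent, and cache() keeps an intermediate result in memory for reuse across actions. The input file name is hypothetical.

```python
# Sketch of RDD lineage and in-memory reuse in PySpark.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

lines = sc.textFile("access.log")                 # hypothetical input file
errors = lines.filter(lambda l: "ERROR" in l)     # lineage: filter over `lines`
errors.cache()                                    # keep the intermediate result in memory

# Both actions reuse the cached `errors` RDD instead of re-reading the file;
# if a partition is lost, Spark recomputes it from the recorded lineage.
print(errors.count())
print(errors.filter(lambda l: "timeout" in l).count())
```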


1 Principles of Distributed Database Systems by M. Tamer Özsu, Patrick Valduriez
2 http://cs.nyu.edu/courses/Fall12/CSCI-GA.2433-001/lecture4.pdf
3 https://cwiki.apache.org/confluence/download/attachments/27362054/Hive%2BSummit%2B2011-join.pdf
4 https://en.wikipedia.org/wiki/MapReduce 
5 http://blogs.msdn.com/b/dryad/archive/2009/11/24/what-is-dryad.aspx
6 http://research.microsoft.com/en-us/projects/dryad/eurosys07.pdf
7 https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

Monday, May 11, 2015

NoSQL DB - Managing Consistency, Reliability, Isolation, Concurrency & Replication in Distributed Systems

Consistency, reliability, isolation, concurrency, and replication are very closely related concepts when designing a distributed system. I wanted to give a 1,000 ft overview of the considerations in distributed database design.
Consistency: Consistency of a transaction is its correctness. A transaction is consistent if it maps one consistent database state to another.[1] Simple ;-)
The following are three important factors that interplay with consistency; when you talk about consistency in distributed systems, the impact on reliability, replication, and concurrency needs to be discussed.

Consistency Buddies:[2][1]
Reliability
Concurrency
Replication


Isolation: Isolation specifies the level of correctness/ consistency.

SQL-92 has defined a set of isolation levels:
Read Uncommitted (dirty reads) – allows a transaction to read data that another transaction has written but not yet committed. [1]
Read Committed (non-repeatable reads) – allows another transaction to update and commit a value between two reads by an open transaction, so when a transaction reads the same data twice it may get different results. [1]
Repeatable Reads (phantom reads) – allows another transaction to add or delete records in a dataset/table that an open transaction is querying, so if a transaction runs the same search twice the result sets may differ. [1]
Serializable – all transactions execute as if one after another. [1]

Other Isolation Levels:
Snapshot Isolation – provides repeatable reads but not serializability. Each transaction sees a snapshot of the database as of when it starts, and its reads and writes are performed on this snapshot. [1]
Read Atomic – new! – provides repeatable reads but not serializability; all or none of a transaction's updates are visible to other transactions. [3]

Reliability:
Ability to maintain consistency in the event of failures defines reliability of the system.
Types of failures: machine failure, network failure, runtime failure.
CAP theorem:
C= Consistency
A = Availability
P = Partition tolerance

Partition tolerance refers to the system continuing to operate when the cluster is split by failures. In a distributed system, it is assumed that some servers or network equipment will fail, so reliability is an issue and partitions must be assumed. In case of a partition, the user has to choose between consistency and availability.
Consistency refers to the data that is read by the client. If the user chooses consistency, they give up some availability: the user may get a message that the system is unavailable to process a request. For example, in RTB, the amount spent on a campaign would definitely prefer consistency over availability.
If the user chooses availability, they do not care about the exactness of the data. An online reporting use case would prefer availability to consistency, since showing data that is an hour old is fine.

Non-blocking termination protocol -> a protocol is non-blocking if it permits a transaction to terminate at the operational sites without waiting for the recovery of the failed site.[1]

2-phase commit – It is a simple and elegant protocol to establish reliability in distributed systems. It extends the effects of local atomic commit actions to distributed transactions by insisting that all sites involved in the execution of a distributed transaction agree to commit the transaction before its effects are made permanent[1]
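Here is a minimal coordinator-side sketch of 2-phase commit, with participants simulated as in-process objects; real implementations add write-ahead logging and timeout/recovery handling.

```python
# Illustrative two-phase commit coordinator; participants are simulated in-process.
class Participant:
    def __init__(self, name, will_commit=True):
        self.name, self.will_commit = name, will_commit
    def prepare(self):      # phase 1: vote yes/no
        return self.will_commit
    def commit(self):       # phase 2: make effects permanent
        print(f"{self.name}: committed")
    def abort(self):        # phase 2: undo effects
        print(f"{self.name}: aborted")

def two_phase_commit(participants):
    # Phase 1: ask every site to prepare; all sites must vote "yes".
    if all(p.prepare() for p in participants):
        for p in participants:       # Phase 2: commit at every site
            p.commit()
        return True
    for p in participants:           # Any "no" vote aborts everywhere
        p.abort()
    return False

two_phase_commit([Participant("site-1"), Participant("site-2", will_commit=False)])
```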


Concurrency:
The main concern in designing a concurrency control algorithm is to correctly process transactions that are in conflict. [3]
 The following shows classification of concurrency control approaches [4]



The most common classification criterion, however, is the synchronization primitive. The corresponding breakdown of the concurrency control algorithms results in two classes [Bernstein and Goodman, 1981]: those algorithms that are based on mutually exclusive access to shared data (locking), and those that attempt to order the execution of the transactions according to a set of rules (protocols). However, these primitives may be used in algorithms with two different viewpoints: the pessimistic view that many transactions will conflict with each other, or the optimistic view that not too many transactions will conflict with one another. [1]
Locking based concurrency: One common implementation to enforce lock-based consistency is using 2-phase locking. [1]
Timestamp based concurrency: This is implemented by assigning a unique timestamp to each transaction. [1]
The basic method requires the timestamps generated at different sites to be kept roughly synchronized, e.g., via message passing.
Multiversion concurrency control allows a read to proceed against the latest version of the record visible at the reader's timestamp. Transactions involving write operations are ordered based on their timestamps, but reads are never blocked. This concurrency mechanism is very useful for read-heavy workloads. [1]
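A minimal sketch of the multiversion read rule: a read at timestamp ts returns the version with the largest write timestamp not exceeding ts, and writes append new versions rather than overwriting. The key and values are made up.

```python
# Illustrative multiversion store: reads pick the latest version visible
# at the reader's timestamp and are never blocked by writers.
versions = {}   # key -> sorted list of (write_timestamp, value)

def write(key, ts, value):
    versions.setdefault(key, []).append((ts, value))
    versions[key].sort()

def read(key, ts):
    visible = [(wts, v) for wts, v in versions.get(key, []) if wts <= ts]
    return visible[-1][1] if visible else None

write("balance", ts=10, value=100)
write("balance", ts=20, value=80)
print(read("balance", ts=15))   # -> 100 (snapshot as of ts 15)
print(read("balance", ts=25))   # -> 80
```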

Timestamp based optimistic concurrency control delays the validation of the transaction until the final commit. [1]

A disadvantage of timestamp-based validation, again, is that more storage is required to keep track of all the terminated transactions that were in progress when the particular transaction started. [1]

Replication:[1]
Replication offers significant benefits, but it takes effort to keep all the copies in sync.
Mutual Consistency: It is consistency with respect to replication that can be weak or strong.

Where updates are performed: centralized if update is done on the master copy first or distributed if they allow updates over any replica. Centralized techniques can be further identified as single master or primary copy where the master copy of each data item may be different [1]
Update Propagation: once updates are performed on a replica, the next decision is how updates are propagated to the others. The alternatives are identified as eager or lazy. Eager technique updates all replicas before the transaction commits. Lazy propagation updates replicas after transaction commits. Eager techniques are further identified according to when they push each write to the other replicas. [1]
Four combinations are possible eager centralized, eager distributed, lazy centralized, lazy distributed. [1]         
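A small sketch of one of the four combinations, eager centralized (primary copy): the write is applied to the master first and pushed to every replica before the transaction is acknowledged. The class and replica names are placeholders.

```python
# Illustrative eager centralized replication: the primary applies the write
# and synchronously pushes it to all replicas before acknowledging.
class Replica:
    def __init__(self, name):
        self.name, self.data = name, {}
    def apply(self, key, value):
        self.data[key] = value

class Primary(Replica):
    def __init__(self, name, replicas):
        super().__init__(name)
        self.replicas = replicas
    def write(self, key, value):
        self.apply(key, value)
        for r in self.replicas:      # eager: push before the commit is acknowledged
            r.apply(key, value)
        return "committed"

replicas = [Replica("r1"), Replica("r2")]
primary = Primary("master", replicas)
print(primary.write("x", 42), [r.data for r in replicas])
```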

Mutual Consistency vs Transactional Consistency:
Both are related but different. Mutual consistency refers to replicas converging to the same value while transaction consistency requires that the global execution history be serializable. [1]

1. Principles of Distributed Database Systems by M. Tamer Özsu, Patrick Valduriez