Friday, July 17, 2015

Marketing Concepts Technical Overview


Introduction

This document is a 20,000-foot overview of the marketing ecosystem. It covers the main marketing functions and illustrates, at a high level, how core advertising topics work from a technical perspective.

Marketing Needs

General Marketing Functions & Measures:

[Figure: overview of general marketing functions and measures]

Important Concepts & Flows

First Party Cookie:

A first-party cookie is a small piece of data sent from a website and stored in a user's web browser while the user is browsing that website. Every time the user loads the website, the browser sends the cookie back to the server to notify the website of the user's previous activity [1]. A first-party cookie's domain name will match the domain name that is shown in the web browser's address bar. [2]

Third Party Cookie:

Third-party cookies belong to domains different from the one shown in the address bar. These sorts of cookies typically appear when web pages feature content, such as banner advertisements, from external websites. This opens up the potential for tracking the user's browsing history, and it is often used by advertisers in an effort to serve relevant advertisements to each user. As an example, suppose a user visits www.example.org. This website contains an advertisement from ad.foxytracking.com which, when downloaded, sets a cookie belonging to the advertisement's domain (ad.foxytracking.com). Then the user visits another website, www.foo.com, which also contains an advertisement from ad.foxytracking.com and which also sets a cookie belonging to that domain (ad.foxytracking.com). Eventually, both of these cookies will be sent to the advertiser when loading its advertisements or visiting its website. The advertiser can then use these cookies to build up a browsing history of the user across all the websites that carry ads from this advertiser. [2]
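To make the distinction concrete, here is a minimal Python sketch (not from the source) that classifies a cookie as first- or third-party by comparing the cookie's domain against the domain shown in the address bar:

```python
from urllib.parse import urlparse

def classify_cookie(page_url: str, cookie_domain: str) -> str:
    """Classify a cookie relative to the page in the address bar.

    First-party: the cookie's domain matches (or is a parent of) the
    page's host. Third-party: any other domain.
    """
    page_host = urlparse(page_url).hostname or ""
    cookie_domain = cookie_domain.lstrip(".")
    if page_host == cookie_domain or page_host.endswith("." + cookie_domain):
        return "first-party"
    return "third-party"

print(classify_cookie("http://www.example.org/news", "example.org"))         # first-party
print(classify_cookie("http://www.example.org/news", "ad.foxytracking.com")) # third-party
```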

Cookie Syncing:
Cookie syncing involves mapping a user ID on one system to the corresponding user ID on another system.


The process begins when a user visits a site (say example.com), which includes A.com as an embedded third-party tracker. (1) The browser makes a request to A.com, and included in this request is the tracking cookie set by A.com. (2) A.com retrieves its tracking ID from the cookie and redirects the browser to B.com, encoding the tracking ID into the URL. (3) The browser then makes a request to B.com, which includes the full URL A.com redirected to as well as B.com's tracking cookie. (4) B.com can then link its ID for the user to A.com's ID for the user. All of this is invisible to the user. [14]
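The same four steps can be simulated in a few lines of Python. Everything here is illustrative: the trackers A.com and B.com, the /sync endpoint, and the ID formats are invented for the sketch.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical in-memory state for the two trackers.
a_com_cookies = {"browser-42": "A-1111"}   # A.com's tracking cookies, keyed by browser
b_com_cookies = {"browser-42": "B-9999"}   # B.com's tracking cookies, keyed by browser
b_com_match_table = {}                     # B.com's mapping: its own ID -> A.com's ID

def a_com_request(browser_id: str) -> str:
    """Steps 1-2: A.com reads its cookie and redirects to B.com, ID in the URL."""
    a_id = a_com_cookies[browser_id]
    return "https://B.com/sync?" + urlencode({"partner": "A.com", "partner_uid": a_id})

def b_com_request(browser_id: str, redirect_url: str) -> None:
    """Steps 3-4: B.com sees A.com's ID in the URL plus its own cookie, and links them."""
    b_id = b_com_cookies[browser_id]
    params = parse_qs(urlparse(redirect_url).query)
    b_com_match_table[b_id] = params["partner_uid"][0]

redirect = a_com_request("browser-42")
b_com_request("browser-42", redirect)
print(b_com_match_table)   # {'B-9999': 'A-1111'}
```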

Tags

Tags are incorporated into the HTML/JavaScript code delivered to a web browser or app when a web page loads. Below is a sample of what you might see if you look at the source code of a typical website using third-party tags.

[Figure: sample third-party tag snippets from a typical page's source] [15]

How do tags work?

[Figure: how tags work] [15]
Tags power online marketing and analytics. Specifically, they can:
  • Instruct web browsers to collect data;
  • Set cookies;
  • Extend audiences between multiple websites;
  • Integrate third-party content into a website (e.g. social media widgets, video players, ads, etc.).
Everything from ad campaigns to Google Analytics runs on tags. [15]

Tag Container

A container tag is a code snippet used in web development that removes the need to place multiple tracking codes directly on the site. Instead, one code snippet is placed on every page of the site. This code acts as a container: all tracking codes live off-site in the Google Tag Manager interface and are fired from the one code on the site. This means that you can update, add, or remove your tracking codes through the Google Tag Manager interface, avoiding website development cycles. [16]
How does it work?
When a site loads or a specific action is completed on the site, the container tag is activated. It queries the Google Tag Manager database in the cloud and then fires any tracking codes held in the account that match the set of rules you have defined. [16]

Figure 1. A container tag: Google Tag Manager fires, checks the cloud, pulls the tags that fit the criteria, and fires them. [16]
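The container idea can be sketched in Python, assuming a hypothetical rule/tag configuration format (not Google Tag Manager's actual one): the container receives an event and fires whichever tags' rules match.

```python
# Hypothetical container configuration: each entry pairs a tag name with a rule.
container_config = [
    {"tag": "analytics_pageview", "rule": lambda e: e["type"] == "pageview"},
    {"tag": "retargeting_pixel",  "rule": lambda e: e["type"] == "pageview"
                                                    and e["path"].startswith("/product")},
    {"tag": "conversion_pixel",   "rule": lambda e: e["type"] == "purchase"},
]

def fire_container(event: dict) -> list:
    """Return the tags whose rules match this event (the 'container' firing)."""
    return [entry["tag"] for entry in container_config if entry["rule"](event)]

print(fire_container({"type": "pageview", "path": "/product/123"}))
# ['analytics_pageview', 'retargeting_pixel']
```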

Contextual Targeting:

A contextual advertising system scans the text of a website for keywords and returns advertisements to the web page based on those keywords [3]. While setting up a campaign, the advertiser sets the keywords she is looking for. When a web page is loaded, the bid request contains contextual information based on a historical scan of the page. The tags on the publisher side must be mapped to the keywords on the advertiser side; a third-party provider can perform this mapping.
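A minimal sketch of the matching step, with invented campaign structures: scan the page text for each campaign's keywords and return the campaigns eligible to serve.

```python
def contextual_ads(page_text: str, campaigns: dict) -> list:
    """Return campaigns whose keywords appear in the page text."""
    words = set(page_text.lower().split())
    return [name for name, keywords in campaigns.items() if words & keywords]

# Hypothetical campaigns and page content.
campaigns = {
    "hiking-boots": {"hiking", "trail", "outdoors"},
    "car-loans":    {"car", "financing", "loan"},
}
page = "ten scenic trail hikes for outdoors lovers"
print(contextual_ads(page, campaigns))   # ['hiking-boots']
```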

Behavioral Targeting:

Behavioral targeting uses information collected from an individual's web-browsing behavior (e.g., the pages they have visited or the searches they have conducted) to select advertisements to display. [4]

The first step in serving behavioral ads is tracking who the users are and what they do online. Tracking refers to the process of collecting and processing data, while targeting refers to using that data to offer personalized solutions or advertising. [5]

While setting up a campaign, the advertiser specifies the behaviors they wish to target, e.g., outdoor lifestyle or in-market for a car. When a bid request arrives for a user in such an audience, the advertiser responds to it; the audience behaviors are placed in the bid request by a third-party vendor. Third-party vendors can use third-party cookies, IP addresses, or geo-based tracking to follow user behavior across websites.
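A hedged sketch of the bidding decision, assuming a simplified bid-request dictionary (not a real OpenRTB schema) in which a third-party vendor has already placed the user's segments:

```python
# Hypothetical campaign: the behaviors the advertiser wishes to target.
campaign_targets = {"outdoor-lifestyle", "in-market-auto"}

def should_bid(bid_request: dict) -> bool:
    """Bid only when the request's audience segments overlap the campaign targets."""
    segments = set(bid_request.get("user_segments", []))
    return bool(segments & campaign_targets)

request = {"user_segments": ["in-market-auto", "sports-fan"], "site": "news.example"}
print(should_bid(request))   # True
```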

Re-targeting:

Retargeting is a form of online advertising in which ads are targeted to consumers based on their previous Internet actions, in situations where those actions did not result in a sale or conversion. [6]

Technically, all that is necessary is to place a JavaScript tag in the footer of your website. [7] Every time a new visitor comes to your site, the code drops a third-party cookie on the user's browser. Later, when your cookied visitors browse the web, the cookie lets the DSP/exchange/SSP know when to serve ads, ensuring that your ads are served only to people who have previously visited your site. [8]

Demographic Targeting:

Advertisers would like to reach people who are likely to be of a specific gender, age, ethnicity, income group, education level, home-ownership status, and number of children.

The first step is to identify the user using a match service (e.g., Google has an email ID and some demographic information). Third-party vendors can supplement the demographic information by using their own third-party cookies or by matching user characteristics against census data, IP data, geo data, or behavioral data.

Geo-Fence based Targeting:

This is especially useful for mobile advertising. Advertisers want to send promotions to people in a specific region, e.g., to target anyone within 5 miles of a mall.

Geo-fence-based targeting involves identifying the geo-location of the device and targeting an ad via an app that collects location information. These apps ask the user whether location tracking is allowed through the app. The device ID is key to identifying the user in case the advertiser wants to combine behavioral targeting with geo-fence-based targeting. [9]
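A minimal sketch of the fence check itself, using the haversine great-circle distance; the coordinates below are invented:

```python
import math

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in miles."""
    r = 3958.8  # mean Earth radius in miles
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def in_geofence(device, fence_center, radius_miles=5.0):
    """True when the device's reported location falls inside the fence."""
    return haversine_miles(*device, *fence_center) <= radius_miles

mall = (41.8827, -87.6233)        # hypothetical fence center
device = (41.9000, -87.6500)      # location reported by the app
print(in_geofence(device, mall))  # True (roughly 1.8 miles away)
```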

Geographic region Targeting:

Advertising can be run on national sites (wsj.com) or local sites (chicagotribune.com). It will be important for advertisers to identify the key areas where their customers concentrate in order to use local sites effectively.

This targeting generally does not work with mobile apps that have a national audience.

References:


Tuesday, July 14, 2015

Distributed Data and Query Processing

This post covers SQL optimization techniques in a distributed environment and then surveys general data processing products in the Hadoop world.
SQL Optimization
Four stages exist in query execution:
  • Query decomposition
  • Data localization
  • Global optimization
  • Distributed execution
Query decomposition: The query is decomposed into logical operators and validated, redundant predicates are eliminated, and the calculus query is restructured as an algebraic query.
Data Localization: This layer determines which fragments are involved in the query and transforms the distributed query into a query on fragments.
Global query optimization:
In both centralized and distributed systems, a common heuristic is to minimize the size of intermediate relations. This can be done by performing unary operators first, and ordering the binary operators by the increasing sizes of their intermediate relations. An important heuristic in distributed systems is to replace join operators by combinations of semijoins to minimize data communication. [1]
Optimization can be done statically before executing the query or dynamically as the query is executed. Static query optimization is done at query compilation time. Thus the cost of optimization may be amortized over multiple query executions. Therefore, this timing is appropriate for use with the exhaustive search method. Since the sizes of the intermediate relations of a strategy are not known until run time, they must be estimated using database statistics. [1]
Hybrid query optimization attempts to provide the advantages of static query optimization while avoiding the issues generated by inaccurate estimates. The approach is basically static, but dynamic query optimization may take place at run time when a high difference between predicted sizes and actual size of intermediate relations is detected. [1]
Query optimization refers to the process of producing a query execution plan (QEP) which represents an execution strategy for the query. This QEP minimizes an objective cost function. A query optimizer, the software module that performs query optimization, is usually seen as consisting of three components: a search space, a cost model, and a search strategy. The search space is the set of alternative execution plans that represent the input query. These plans differ in the execution order of operations and the way these operations are implemented, and therefore in their performance. The cost model predicts the cost of a given execution plan. The search strategy explores the search space and selects the best plan using the cost model. [1]
An execution strategy for a distributed query can be described with relational algebra operators and communication primitives for transferring data between sites. Query optimization consists of finding the “best” ordering of operators in the query, including communication operators that minimize a cost function. The cost function, often defined in terms of time units, refers to computing resources such as disk space, disk I/Os, buffer space, CPU cost, communication cost, and so on. Generally, it is a weighted combination of I/O, CPU, and communication costs.[1]
To select the ordering of operators it is necessary to predict execution costs of alternative candidate orderings. Determining execution costs before query execution (i.e., static optimization) is based on fragment statistics and the formulas for estimating the cardinalities of results of relational operators. Thus the optimization decisions depend on the allocation of fragments and the available statistics on fragments, which are recorded in the allocation schema. [1]
An important aspect of query optimization is join ordering, since permutations of the joins within the query may lead to improvements of orders of magnitude. One basic technique for optimizing a sequence of distributed join operators is through the semijoin operator. The main value of the semijoin in a distributed system is to reduce the size of the join operands and then the communication cost.[1]
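A toy illustration of the semijoin reduction with made-up data: instead of shipping all of R to S's site, only S's join-key values travel to R's site, and the reduced R travels back.

```python
# R lives at site 1, S at site 2; tuples are (key, payload). Data is invented.
R = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
S = [(2, "x"), (4, "y")]

s_keys = {k for k, _ in S}                    # step 1: ship only S's join keys to site 1
r_reduced = [t for t in R if t[0] in s_keys]  # step 2: compute the semijoin R ⋉ S at site 1
# Step 3: ship r_reduced (2 tuples instead of 4) back to site 2 and finish the join.
result = [(rk, rv, sv) for rk, rv in r_reduced for sk, sv in S if rk == sk]
print(result)   # [(2, 'b', 'x'), (4, 'd', 'y')]
```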

Distributed Query Execution: Each subquery executing at one site, called a local query, is then optimized using the local schema of the site and executed. At this time, the algorithms to perform the relational operators may be chosen.
The general formula for response time is:

Response time = T_CPU * seq#insts + T_I/O * seq#I/Os + T_MSG * seq#msgs + T_TR * seq#bytes

where T_CPU, T_I/O, T_MSG, and T_TR are the unit costs of a CPU instruction, a disk I/O, initiating and receiving a message, and transmitting one byte, respectively, and seq#x is the maximum number of x that must be done sequentially when executing the query. [1]
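As a back-of-the-envelope illustration, the formula can be evaluated directly; all unit costs and counts below are invented for the example:

```python
# Assumed unit costs in seconds (per instruction, I/O, message, byte) -- illustrative only.
T_CPU, T_IO, T_MSG, T_TR = 1e-8, 1e-4, 1e-3, 1e-7

def response_time(seq_insts, seq_ios, seq_msgs, seq_bytes):
    """Sum the sequential (critical-path) components of the plan."""
    return T_CPU * seq_insts + T_IO * seq_ios + T_MSG * seq_msgs + T_TR * seq_bytes

# E.g. 1M instructions, 100 I/Os, 20 messages, and 500 KB transferred sequentially:
print(response_time(1_000_000, 100, 20, 500_000))   # 0.09 seconds
```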
The number of join trees that can be produced by applying the commutativity and associativity rules is O(N!) for N relations. Investigating a large search space may make optimization time prohibitive, sometimes much more expensive than the actual execution time. Therefore query optimizers typically restrict the size of the search space they consider. The first restriction is to use heuristics. The most common heuristic is to perform selection and projection when accessing base relations. Another common heuristic is to avoid Cartesian products that are not required by the query. Another important restriction concerns the shape of the join tree: two kinds of join trees are usually distinguished, linear versus bushy trees. By considering only linear trees, the size of the search space is reduced to O(2^N). However, in a distributed environment, bushy trees are useful for exhibiting parallelism. [1]
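The effect of pruning Cartesian products can be seen with a small enumeration sketch; the four-relation join graph below is hypothetical:

```python
from itertools import permutations

# Hypothetical join graph: edges mark relations that share a join predicate (R-S-T-U path).
relations = ["R", "S", "T", "U"]
join_edges = {frozenset(p) for p in [("R", "S"), ("S", "T"), ("T", "U")]}

def connected(order):
    """True if every relation joins with some already-joined relation (no Cartesian product)."""
    return all(
        any(frozenset((order[i], prev)) in join_edges for prev in order[:i])
        for i in range(1, len(order))
    )

all_orders = list(permutations(relations))            # 4! = 24 linear join orders
good_orders = [o for o in all_orders if connected(o)]
print(len(all_orders), len(good_orders))              # 24 8 -- the Cartesian-free subset
```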



Broadcast join: if one side of the join is a small table, a broadcast join is used. This means that the small table is replicated to all nodes.

Shuffle join: used when joining two large tables, as sketched below. Bucketing is involved to co-locate matching foreign keys on the same node. (This does not apply to non-relational systems.) Hive, for example, supports several join variants: common join, map join, auto map join, bucket map join, sort-merge bucket map join, and skew join. [3]
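A toy sketch of the shuffle-join idea: hash-partition both inputs on the join key so matching keys land on the same node, then join each bucket locally. The data and the three-node "cluster" are invented.

```python
from collections import defaultdict

NODES = 3  # hypothetical cluster size

def shuffle(rows):
    """Partition (key, value) rows across nodes by hashing the join key."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[hash(key) % NODES].append((key, value))
    return buckets

orders = [(101, "order-a"), (202, "order-b"), (101, "order-c")]
users = [(101, "alice"), (202, "bob"), (303, "carol")]

left, right = shuffle(orders), shuffle(users)
result = []
for node in range(NODES):                     # each node joins only its own bucket
    lookup = defaultdict(list)
    for key, value in right.get(node, []):
        lookup[key].append(value)
    for key, value in left.get(node, []):
        result.extend((key, value, u) for u in lookup[key])
print(sorted(result))   # [(101, 'order-a', 'alice'), (101, 'order-c', 'alice'), (202, 'order-b', 'bob')]
```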

It is possible to run complex processing, such as R-like functions, in an MPP environment; however, MPP architectures cannot store data in unstructured formats, which makes those databases less flexible.

Hadoop:
Hadoop uses schema-on-read rather than schema-on-write. This changed the industry by allowing users to store data first and worry about how to analyze it later.

MapReduce
MapReduce is a general-purpose processing framework, proposed in a paper from Google. A very wide range of processing is possible with the map/reduce paradigm.
MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
A MapReduce program is composed of a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The "MapReduce System" (also called "infrastructure" or "framework") orchestrates the processing by marshaling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance. [4]
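The canonical word-count example, run here as a single-process Python sketch of the model: map emits (key, value) pairs, the framework groups them by key, and reduce summarizes each group.

```python
from itertools import groupby

def map_fn(document):
    """Map phase: emit (word, 1) for every word in the document."""
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    """Reduce phase: sum the counts for one word."""
    yield (word, sum(counts))

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# "Shuffle" phase: collect all mapped pairs and group them by key.
pairs = sorted(pair for doc in documents for pair in map_fn(doc))
results = [out
           for word, group in groupby(pairs, key=lambda kv: kv[0])
           for out in reduce_fn(word, (v for _, v in group))]
print(results)   # [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```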

Dryad 
This is a general-purpose processing framework as well.
Dryad is an execution engine that is optimized for certain types of distributed computation. It relies on the fact that its inputs are immutable and finite, and the outputs of a job are not generally available until the entire computation completes (i.e. it doesn't support "infinite" streaming computations). [5]
Dryad is optimized for medium to large clusters (say, between 100 and 10,000 computers), and for long-running batch jobs rather than sub-second interactive lookups.
It has mostly been used for data-intensive computations which can benefit from streaming data to or from many disks at once, and which can be broken into parallel stages that do not require very frequent communication. If a program can be broken into pieces that can do more than 20 or 30 seconds of I/O or computation before needing to communicate, then it can probably be run efficiently on Dryad. The basic computational model that Dryad uses is a directed acyclic graph (DAG). [5]
The approach is to create a DAG of operations with data flowing between them.


A Dryad programmer writes several sequential programs and connects them using one-way channels. The computation is structured as a directed graph: programs are graph vertices, while the channels are graph edges. A Dryad job is a graph generator, which can synthesize any directed acyclic graph. These graphs can even change during execution, in response to important events in the computation.[5]

Dryad is notable for allowing graph vertices, and computations in general, to use an arbitrary number of inputs and outputs. MapReduce restricts all computations to take a single input set and generate a single output set. SQL and shader languages allow multiple inputs but generate a single output from the user's perspective, though SQL query plans internally use multiple-output vertices. [6]
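A minimal Python sketch of the Dryad-style model: vertices are sequential functions, edges are one-way channels, and execution proceeds in topological order. The graph and the vertex functions are invented; note that the merge vertex consumes two inputs, which MapReduce would not allow.

```python
# Each vertex is a function from a list of input channels to one output.
vertices = {
    "read":   lambda inputs: [1, 2, 3, 4, 5],
    "square": lambda inputs: [x * x for x in inputs[0]],
    "double": lambda inputs: [x * 2 for x in inputs[0]],
    "merge":  lambda inputs: sorted(inputs[0] + inputs[1]),  # two inputs, unlike MapReduce
}
edges = [("read", "square"), ("read", "double"), ("square", "merge"), ("double", "merge")]

outputs = {}
for name in ["read", "square", "double", "merge"]:   # a valid topological order
    ins = [outputs[src] for src, dst in edges if dst == name]
    outputs[name] = vertices[name](ins)
print(outputs["merge"])   # [1, 2, 4, 4, 6, 8, 9, 10, 16, 25]
```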

Different implementations of this model exist, with Tez being the prominent one in the Hadoop world.

Resilient Distributed Datasets: This is a general-purpose processing framework as well.
Spark is a product that implements resilient distributed datasets (RDDs). Spark is able to use memory for intermediate results and to share data across DAGs, and it supports iterative (effectively cyclic) data flows. Iteration is a specialty of Spark, so ML algorithms are a perfect fit for the underlying architecture.
Instead of materializing each RDD, RDDs record the operations (the lineage) that lead from one RDD to the next. This facilitates iterative machine learning algorithms on top of Spark as well as interactive operations, and the capabilities are enhanced by effective use of memory to store intermediate results.
What differentiates Spark from other data processing engines is this use of memory together with the ability to recover intermediate transformations by replaying the lineage between RDDs. Hence there is no need to save intermediate state and checkpoint in order to recover; with Dryad or MapReduce, by contrast, you need to reprocess the entire operation. [7]
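A short PySpark sketch (assuming a local Spark installation) showing RDD transformations, caching of an intermediate result, and reuse across actions; the input file name is hypothetical:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

lines = sc.textFile("events.log")                      # hypothetical input file
errors = lines.filter(lambda l: "ERROR" in l).cache()  # keep this RDD in memory

# Both actions below reuse the cached 'errors' RDD; if a partition is lost,
# Spark recomputes it from the recorded lineage (textFile -> filter).
print(errors.count())
print(errors.filter(lambda l: "disk" in l).take(5))

sc.stop()
```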


References:
1. M. Tamer Özsu and Patrick Valduriez, Principles of Distributed Database Systems
2. http://cs.nyu.edu/courses/Fall12/CSCI-GA.2433-001/lecture4.pdf
3. https://cwiki.apache.org/confluence/download/attachments/27362054/Hive%2BSummit%2B2011-join.pdf
4. https://en.wikipedia.org/wiki/MapReduce
5. http://blogs.msdn.com/b/dryad/archive/2009/11/24/what-is-dryad.aspx
6. http://research.microsoft.com/en-us/projects/dryad/eurosys07.pdf
7. https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf