Specs - PYSCO

Blog Post

Company background and project objective

Skout is a fast-growing mobile dating network. With Skout, users can see nearby singles looking to get together for a date. Over an 8-week project for the Social Data Revolution class, a group of 6 Stanford students worked closely with Skout to analyze user dating behaviors. The goal of the project is twofold:

Identify interesting patterns in people's dating behaviors through data mining. For example, how age differences affect people's dating behaviors and what kind of role ethnicity plays in the process.

Improve Skout's person recommendation system: The basic goal of the recommendation engine is to maximize the utility of the user using the application, while also keeping in mind the utility of the people that are recommended. We will be constructing and training a 'Like' function for all pairs of people. This will facilitate an improved retrieval list for a query user, based upon maximizing the happiness of the query user, while also providing some semblance of fairness for the system overall.
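One way to picture the idea of a 'Like' function over pairs is the sketch below. It is only an illustration under our own assumptions: the function name rank_candidates, the weight alpha, and the toy scores are hypothetical, and blending the two directions of 'Like' is just one simple way to balance the query user's happiness against that of the people being recommended.

```python
def rank_candidates(query, candidates, like, alpha=0.7):
    """Rank candidates for `query` by a blend of both directions of `like`.

    like(u, v) is assumed to return a score in [0, 1] for how much u likes v.
    alpha weights the query user's own interest; (1 - alpha) weights the
    candidate's interest in the query user. Both are illustrative choices.
    """
    def score(candidate):
        return alpha * like(query, candidate) + (1 - alpha) * like(candidate, query)

    return sorted(candidates, key=score, reverse=True)


# Toy scores, invented purely for illustration.
toy_scores = {
    ("q", "a"): 0.9, ("a", "q"): 0.1,
    ("q", "b"): 0.7, ("b", "q"): 0.8,
    ("q", "c"): 0.2, ("c", "q"): 0.9,
}


def toy_like(u, v):
    return toy_scores.get((u, v), 0.0)


blended = rank_candidates("q", ["a", "b", "c"], toy_like, alpha=0.7)
selfish = rank_candidates("q", ["a", "b", "c"], toy_like, alpha=1.0)
```

With alpha set to 1.0 the ranking only reflects what the query user wants; lowering alpha moves candidates who are likely to reciprocate further up the list.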

The overall goal is to better understand user behavior and provide an improved recommendation system. Both of these tasks will improve user satisfaction with the Skout product.

Our approach and preliminary findings
Feature extraction

We first transformed individual attributes into formats that are easy to process. Dates and times are stored as offsets from the earliest timestamp in the data set. Since the relative distance between two users matters more than their absolute geo-locations, we also applied a projection transformation to compute the distance between each pair of users.
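The two transformations above can be sketched as follows. This is a minimal sketch, not the code we ran: the function names are hypothetical, and the equirectangular projection is an assumption, chosen here because it is a simple projection that works well for nearby users.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def offset_seconds(timestamps):
    """Express each timestamp as seconds since the earliest one."""
    earliest = min(timestamps)
    return [(t - earliest).total_seconds() for t in timestamps]


def projected_distance_km(lat1, lon1, lat2, lon2):
    """Approximate distance using an equirectangular projection.

    Coordinates are in degrees. The approximation is reasonable for
    short distances, which is the case that matters for nearby singles.
    """
    mean_lat = math.radians((lat1 + lat2) / 2.0)
    x = math.radians(lon2 - lon1) * math.cos(mean_lat)
    y = math.radians(lat2 - lat1)
    return EARTH_RADIUS_KM * math.hypot(x, y)


# Toy values, invented for illustration.
times = [
    datetime(2011, 1, 1, 12, 0),
    datetime(2011, 1, 1, 10, 0),
    datetime(2011, 1, 2, 10, 0),
]
offsets = offset_seconds(times)
one_degree_at_equator = projected_distance_km(0.0, 0.0, 0.0, 1.0)
```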

To better organize the data, we loaded the raw data table into a network structure. The attributes of each individual are stored on the nodes, while the edges record the interactions between users. With this network structure, we can easily extract and compare variables to explore the data further.
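A network structure of this kind could look roughly like the sketch below. The class name InteractionGraph, its methods, and the interaction kinds are hypothetical names chosen for illustration; the actual data schema is not public.

```python
from collections import defaultdict


class InteractionGraph:
    """Directed graph: user attributes on nodes, interaction counts on edges."""

    def __init__(self):
        self.nodes = {}  # user_id -> dict of profile attributes
        # (source, destination) -> {interaction kind -> count}
        self.edges = defaultdict(lambda: defaultdict(int))

    def add_user(self, user_id, **attributes):
        self.nodes[user_id] = attributes

    def add_interaction(self, source, destination, kind):
        self.edges[(source, destination)][kind] += 1

    def count(self, source, destination, kind):
        # .get avoids creating empty entries for pairs that never interacted
        return self.edges.get((source, destination), {}).get(kind, 0)


# Toy records, invented for illustration.
graph = InteractionGraph()
graph.add_user("u1", age=25, height_cm=180)
graph.add_user("u2", age=27, height_cm=165)
graph.add_interaction("u1", "u2", "profile_check")
graph.add_interaction("u1", "u2", "profile_check")
graph.add_interaction("u1", "u2", "chat")
```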

Basic exploration

Before building the recommendation engine, we conducted a series of basic explorations of the data to find meaningful attributes and significant correlations among variables.

We found that the most strongly correlated variables are the counts of actual user interactions: for example, the number of times user A checks the profile of user B is strongly associated with the number of times they chat or hot-list each other. This is quite intuitive, especially when compared with the correlations among static attributes such as ethnicity or age. Just as Andreas explained to us, people's behavior online says more than their stated profile information. This finding also influenced our later work on the recommendation engine.
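As a small illustration of this kind of check, the sketch below computes a Pearson correlation between two per-pair interaction counts. Using the Pearson coefficient is an assumption made for this example, and the numbers are invented, not taken from the Skout data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(variance_x * variance_y)


# One entry per (source, destination) pair; toy values only.
profile_checks = [1, 3, 2, 5, 4]
chats = [0, 2, 1, 4, 4]

r = pearson(profile_checks, chats)
```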

To better understand the data, we graphed many of the variables, for example pairs of ethnicities against their number of chats, and differences in height or age against the amount of interaction. We selected the most interesting graphs and presented them in class; since the data is not public, we do not show them here.

Recommendation engine

Based on this preliminary exploration, we started our brave foray into a recommendation engine: a system that takes the individual attributes and the interaction history of a pair of users as input and predicts selected attributes as output, so that we can see how well the model captures the inherent nature of the problem.

For our experiment, we chose to model how well all the other information predicts the ethnicity of the target user in a particular interaction session. We compared three models: a majority-based baseline reached 40% accuracy, decision trees about 60%, and a fine-tuned Tree Augmented Naive Bayes (TAN) model around 75%.

The baseline for this prediction was computed as follows: the majority ethnicity class for each user is taken as the predicted ethnicity. We chose this baseline based on the hypothesis that user preferences for ethnicity are stable over time.
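The baseline can be sketched as below. We assume here that "majority class of each user" means the ethnicity that appears most often among the people a source user has interacted with; the function names and the placeholder labels "A", "B" and "C" are hypothetical, and the toy sessions are invented.

```python
from collections import Counter, defaultdict


def fit_majority_baseline(sessions):
    """Map each source user to the most frequent destination ethnicity.

    sessions: iterable of (source_id, destination_ethnicity) pairs.
    """
    counts = defaultdict(Counter)
    for source, ethnicity in sessions:
        counts[source][ethnicity] += 1
    return {source: c.most_common(1)[0][0] for source, c in counts.items()}


def accuracy(model, sessions):
    """Fraction of sessions whose ethnicity matches the stored majority.

    Source users never seen during fitting count as misses.
    """
    sessions = list(sessions)
    hits = sum(1 for source, ethnicity in sessions if model.get(source) == ethnicity)
    return hits / len(sessions)


# Toy sessions with placeholder labels.
train = [("u1", "A"), ("u1", "A"), ("u1", "B"), ("u2", "C"), ("u2", "C")]
held_out = [("u1", "A"), ("u1", "B"), ("u2", "C"), ("u3", "A")]

baseline = fit_majority_baseline(train)
baseline_accuracy = accuracy(baseline, held_out)
```

If preferences really are stable over time, the majority learned from earlier sessions should keep matching later ones, which is what the accuracy on held-out sessions measures.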

Some of the user and interaction features that proved to be most useful are as follows:

(Destination = user returned as search result)
(Source = user searching for results)

Destination node relative date of join

Destination age
Distance between nodes (edge property)
Source node ethnicity
Destination node height
Source out-degree (distinct neighbors) of profile checks
Source out-number of messages
Source out-number of profile checks

Source out-degree (distinct nodes) of messages
Destination node gender interest
Destination node gender
Source node age
Source node gender interest
Edge profile checks
Source relative date of joining
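The list above distinguishes a source's out-degree (distinct neighbors) from its out-number (total events). A minimal sketch of that distinction, using a hypothetical list of profile-check events, could look like this:

```python
from collections import defaultdict


def source_out_features(events):
    """Compute per-source out-number and out-degree.

    events: iterable of (source, destination) pairs, one per event.
    out_number counts every event; out_degree counts distinct destinations.
    """
    out_number = defaultdict(int)
    neighbors = defaultdict(set)
    for source, destination in events:
        out_number[source] += 1
        neighbors[source].add(destination)
    return {
        source: {"out_number": out_number[source], "out_degree": len(neighbors[source])}
        for source in out_number
    }


# Toy profile-check events, invented for illustration.
profile_check_events = [("u1", "u2"), ("u1", "u2"), ("u1", "u3"), ("u2", "u1")]
features = source_out_features(profile_check_events)
```

A user who checks the same profile many times has a high out-number but a low out-degree, so the two features carry different signals.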

Several of the features that seem most intuitive do appear among the most informative: the target user's age, the distance between the source and destination users, and destination user properties such as ethnicity, age and height.

Related Work
Andrew Fiore, a Ph.D. candidate at the School of Information at UC Berkeley, has conducted a series of studies on online dating sites. Even though his studies examine different aspects from ours and are difficult to compare directly, we can borrow his results to stimulate ideas about what to test given that we have more data. For example, in the paper "Assessing Attractiveness in Online Dating Profiles", Fiore and others analyzed how the photos and text of a profile influence how attractive others perceive it to be. They argued that "photos of men appeared attractive when they looked genuine and trustworthy, extraverted, feminine, and not too warm and kind; photos of women were rated as attractive when they appeared more feminine, less masculine, higher in self-esteem, and lower in self-centeredness." When users of online dating sites prepare their profiles, these observations can serve as guidelines to increase their chances of interaction, which in turn benefits the dating sites.

Another useful resource is OkTrends, the official blog of OkCupid.com. Several of their posts motivated us to play with the data from Skout. For instance, they analyzed how race affects the messages users receive, and these findings can be compared with Fiore's paper "Homophily in Online Dating: When Do You Like Someone Like Yourself", which also considers attributes like marital status, drinking/smoking habits, and pets. Another interesting post is on older women. Since dating sites aim to maximize the number of matched couples, these analyses provide valuable insights into how to design a recommendation system.

Impact

We expect this work to have a positive impact on the user experience of Skout. Our work has two parts: (a) finding interesting characteristics of the social dating network and (b) developing a recommendation system that draws on these characteristics. The company could also use these characteristics indirectly to improve other parts of its product. For example, by learning what kinds of virtual gifts people of a certain age or ethnicity like, Skout could suggest those gifts when a person is considering buying one. Similarly, the website can help its users by suggesting the kinds of people with whom they should try to initiate a conversation, thereby increasing their chances of hooking up.

With the recommendation engine, we expect the user experience to improve vastly. Instead of browsing through tens of profiles to find someone interesting, users would spend less time finding the right match and have a better chance of hooking up. By recommending a mix of people a user might be interested in and people who are interested in that user, we expect a win-win situation for both parties and a healthy dating environment for Skout's users.

What else have we learnt?

Yi (MBA2): One interesting lesson for me is how we can leverage "bad data" to derive insights and make money out of it. We all know there is a lot of bad data on people's profiles. For example, many people change their age, height, even gender all the time to make themselves appear more attractive online. Say you find a hot girl online and you want to know how credible her profile is. If you could buy a premium function that lets you see the girl's "profile evolution history", would you buy it? I would. To me this is a good example of how to leverage "bad information" and make money out of it.

One of the most important lessons from this class is the basic idea that the "social graph" is so much more than just nodes and edges. It holds a huge amount of information that could be leveraged in multiple ways to create an effective use case.