A Random Forest


Data. Decisions. Results.

Dennis M. Ritchie (1941 – 2011) – Father of C and Unix has passed away.

October 13th, 2011

Dennis M. Ritchie was one of the creators of the C programming language and the Unix operating system. These two contributions to society are the underpinnings of modern computers. Consider this: nearly every microchip, phone or computer uses code written in C or was developed using C. Nearly every operating system used today is either (loosely) a derivative of Unix, e.g. OS X, Linux, Android, or uses ideas originally used in Unix.

Like any human being, he may have been quirky or flawed. But Dennis M. Ritchie was definitely enormously successful.

“C is quirky, flawed, and an enormous success.” – dmr

UBC Sauder: Masters of Management in Operations Research

September 25th, 2011

It’s been a while since I’ve posted, but now that I’m in the Masters of Management in Operations Research program (MMOR) at UBC Sauder I figure I’ll have more reasons to post. After living through the first few weeks of the MMOR program, I think now is a good time to reflect on my decision to join the program.

Why Operations Research?
After graduating with a B.Sc. in computer science and microbiology I could have conceivably become a software developer, a bioinformatics researcher, or even a lab technician. Like many university students I didn’t discover what I feel to be my true calling until I took my second and third courses in statistics. It was in those courses that I learned of the general inference problem: given data, what do we know, what should we do, and what makes it the right choice? From this you can generalize to everyday situations and problems. Every day, individuals and businesses are faced with decisions, and we all want to make the right decision. This is the essence of operations research.

I realized I enjoyed applying what I had learned as a student to improving everyday processes, so I believe operations research, as an applied field, was the right one for me. That is also why the Masters of Management in Operations Research at UBC was the right fit.

Next time I’ll talk about why I chose the MMOR program at UBC.

Thanks for reading!

Watson: IBM’s Next Deep Blue

June 16th, 2010

Article in the New York Times: What is IBM’s Watson?

Watson is IBM’s next grand challenge. In 1997 IBM built Deep Blue, which beat the world champion chess player Kasparov. IBM is hoping to do it again by beating previous champions of Jeopardy, and has signed up with the TV show for an actual competition.

Watson is the latest in a string of trends in the computer science world. In the late 80′s and early 90′s artificial intelligence research entered the AI winter, where funding slowed down due to a lack of results. Projects like MYCIN led to a great deal of hype, and people believed computers were on the verge of true intelligence. Unfortunately the breakthroughs failed to appear – and research slowed down.

In the 90′s the predominant idea in artificial intelligence was creating a set of rules to guide the computer: building models of language, grammar, speech, and intelligence. However, the models being built were never powerful enough to compete with human intelligence, and they were very complicated to build.

Recent breakthroughs in machine learning, however, are changing the landscape. Whereas before researchers would painstakingly build a model by hand, the current trend is to build a simple but flexible model. The computer is then fed a staggering amount of information, essentially teaching the computer. These breakthroughs are what led to speech and handwriting recognition. They are also the key idea behind modern natural language processing, which is what allows Watson to actually compete against humans.

These advances have been enabled by faster and cheaper computers, as well as the enormous growth in machine-readable information. Every time someone adds information to Wikipedia, or to a blog post, it allows computers to learn from it.

However it isn’t clear whether these advances will be enough. Already speech recognition advances have plateaued, and it is an open question whether computing power and massive amounts of information will be enough to make true artificial intelligences.

Most likely there is still room to grow. Knowledge engines will supplant search engines when we just need answers to questions. Whether or not we’ll have computers acting as personal assistants is unknown. Kurzweil certainly thinks so; I myself am not so sure.

Open Catalogues of City Data

June 11th, 2010

This San Francisco Smart Parking reminded me about how far we’ve come with open access to the data that impacts our daily lives, and how much more we can do.

In Vancouver, there is the Vancouver Open Data Catalogue which consists mainly of Map/GIS data. Some applications I can think of using the existing data:

  • Take out garbage reminder email – signup with an email and an address and the website sends you an email the day before collection.
  • Web application to mark areas with graffiti, potholes, broken street lights for City of Vancouver to fix.

Some ideas that there is no data for:

  • Map of parking zones (e.g. no signage, 2 hour parking, residential only etc.)
  • Dynamic map showing the amount of traffic throughout the city in the course of a week.

A couple of other cities also have open data policies: Edmonton, Toronto, New York, Washington, and San Francisco.

This brings up the question of data integration. Let’s say I make a great web application that allows people in Vancouver to mark locations with graffiti, potholes, or other minor maintenance issues, and the city of Vancouver also uses the application – everyone is happy. Why should someone have to develop an entirely new application to deal with the same issues in New York? Or in Toronto?

I can only hope for the day that question comes up, though, because right now most cities do not have open data catalogues.

OpenPCR – Develop a cheap, open design, DIY PCR Machine

June 8th, 2010

Found an interesting project looking for donations: OpenPCR.

They aim to create a PCR machine out of open components (e.g. Arduino), and then release all the designs. In theory this means someone can hack together a PCR machine out of base components. This would in turn commoditize PCR machines, allowing generic manufacturers to produce them on the cheap.

An interesting fact: one of the designers/hackers was a judge for the 2009 MIT iGEM competition, which Eric was part of!

A Tour of the Visualization Zoo

June 5th, 2010

Found an awesome article by the ACM, A Tour through the Visualization Zoo.

One visualization you cannot miss is the following flow map, a recreation based on the 1861 Minard visualization of Napoleon’s 1812 Russian campaign:

Napoleon’s March to Moscow

Edward Tufte, author of the classic book on visualizations, The Visual Display of Quantitative Information, called the original chart possibly the best graphical visualization of all time. It is easy to see why: the chart combines 6 dimensions (geographical position, which counts as two, plus time, the size of the army, the direction of the army’s movement, and temperature) with incredible clarity, and Minard did it in 1861.

Here is how the visualization works:

  • Width of the band indicates the size of the army.
  • Red band indicates movement towards Moscow, black the return march.
  • At the bottom is a time line as well as temperature during the return.

I find this visualization inspiring because it is a reminder to everyone who works with data that all you need to create a compelling story with data is diligent and careful thought. The original Minard chart can be found here.

Another visualization that was not included in the tour is John Snow’s cholera outbreak map:

Snow's Cholera Map

The black line marks indicate deaths due to cholera, and the dots represent water pumps. Snow used his analysis of deaths to determine that the water pump on Broad Street was linked with the cholera outbreak. Some claim that Snow’s analysis was the birth of epidemiology – the study of the factors affecting the health and illnesses of populations.

Both of these visualizations demonstrate that sometimes the most convincing analysis is simply a visualization.

Resources for Graduate School

May 24th, 2010

Found this article recently: The Secret Lives of Professors, which got me thinking about graduate school again. So for those of you thinking about applying, or going to grad school in September, I’ve found some interesting resources that I thought I would share.

If you have any good links feel free to leave a comment or email it and I’ll add it to this post.

Prospective Graduate Students

Graduate Students

How to Measure Anything: Mathless Confidence Intervals

May 23rd, 2010

I recently picked up a book by Douglas W. Hubbard, How to Measure Anything, which offers this table that pays for the book by itself.

The “Mathless” 90% CI (p139, How to Measure Anything)

Lower bound: __th smallest

Upper bound: __th largest

Sample Size    nth largest and smallest sample value    Actual Confidence
5              1st                                      93.8%
8              2nd                                      93.0%
11             3rd                                      93.5%
13             4th                                      90.8%
16             5th                                      92.3%
18             6th                                      90.4%
21             7th                                      92.2%
23             8th                                      90.7%
26             9th                                      92.4%
28             10th                                     91.3%
30             11th                                     90.1%

So what does the table mean and how can it be used?

Suppose you are a drug dealer and you’ve received 100 packages of 10g marijuana, ready to sell. The suppliers may have tried to rip you off so you need to check. You decide you want to be more than 90% certain that the average weight of each package is actually 10g.

You don’t have the time to hire people on your end to weigh every package, and there are no friendly statisticians willing to calculate sample statistics for you. So what can you do?

With the table above you decide how many packages you are willing to weigh. Suppose you have time to weigh 11 packages and find that they weigh 8, 8.9, 9, 9.5, 9.7, 9.9, 10, 10, 10.5, 11, 12g. With a sample size of 11 you only need to look at the 3rd smallest value (9g) and the 3rd largest value (10.5g) to construct a 90% confidence interval (actually 93.5%). Hence a 90% CI of the average weight of the packages is between 9g – 10.5g. You may or may not decide to accept the deal.

What is a 90% confidence interval (CI)? It means that, using the above table, 9 times out of 10 the actual average will be between the ‘calculated’ values.
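The lookup described above can be sketched in a few lines of Python. This is only an illustration of the procedure: the dictionary is copied from the table, and the function name `mathless_ci` is my own invention, not something from the book.

```python
# Rank to use for each sample size, copied from the "Mathless" 90% CI table.
MATHLESS_RANK = {5: 1, 8: 2, 11: 3, 13: 4, 16: 5, 18: 6,
                 21: 7, 23: 8, 26: 9, 28: 10, 30: 11}


def mathless_ci(samples):
    """Return (lower, upper): the nth smallest and nth largest sample values."""
    n = len(samples)
    if n not in MATHLESS_RANK:
        raise ValueError(f"sample size {n} is not in the table")
    rank = MATHLESS_RANK[n]
    ordered = sorted(samples)
    # rank is 1-based, so the nth smallest is at index rank - 1
    # and the nth largest is at index -rank.
    return ordered[rank - 1], ordered[-rank]


# The 11 package weights (in grams) from the example.
weights = [8, 8.9, 9, 9.5, 9.7, 9.9, 10, 10, 10.5, 11, 12]
lower, upper = mathless_ci(weights)
print(f"90% CI: {lower}g to {upper}g")
```

No averages or standard deviations are computed anywhere, which is the whole appeal of the table.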

So what if you’re not a drug dealer? The example in the book is used to measure the average amount of time a group of managers spend on under-performing sales reps. Other examples I can think of include measuring the average amount of time developers spend on bug fixes, and the amount of time employees spend working at home.

The table can construct a 90% confidence interval, with some caveats. It gives a 90% CI of the median for any distribution. However, to use the table to calculate 90% CIs for averages, the distribution has to be symmetric. In the drug dealer’s case, that means your suppliers are equally likely to give you lighter packages as heavier packages (not super-realistic), but many other things in life are symmetric.
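The reason the table works for the median of any distribution can be checked with a short calculation. The sketch below is my own reconstruction, not taken from the book, and it assumes a continuous distribution (no ties). Each sample falls below the median with probability 1/2, so the interval between the k-th smallest and k-th largest values misses the median only when fewer than k samples land on one side of it.

```python
from math import comb


def actual_confidence(n, k):
    """Chance that the median lies between the k-th smallest and k-th largest of n samples."""
    # Probability that fewer than k of the n samples fall below the median.
    one_tail = sum(comb(n, i) for i in range(k)) / 2 ** n
    # The same probability applies to the other side, hence the factor of 2.
    return 1 - 2 * one_tail


# A few (sample size, rank) pairs taken from the table.
for n, k in [(5, 1), (8, 2), (11, 3), (30, 11)]:
    print(n, k, f"{actual_confidence(n, k):.1%}")
```

Under these assumptions the calculation agrees with the "Actual Confidence" column. It also shows why the sample sizes jump around: each row is a sample size where the coverage sits just above 90% for that rank.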

Sports Statistics: The No-Stats All-Star

May 14th, 2010

Found an older article from the New York Times magazine: The No-Stats All-Star.

Shane Battier is a small forward for the Houston Rockets (NBA), who was originally drafted by the Vancouver (now Memphis) Grizzlies. What makes him so interesting from a statistics point of view? Here is a quote from the article,

[Shane Battier's] conventional statistics are unremarkable: he doesn’t score many points, snag many rebounds, block many shots, steal many balls or dish out many assists…When he is on the court, his teammates get better, often a lot better, and his opponents get worse – often a lot worse. He may not grab huge numbers of rebounds, but he has an uncanny ability to improve his teammates’ rebounding. He doesn’t shoot much, but when he does, he takes only the most efficient shots. He also has a knack for getting the ball to teammates who are in a position to do the same, and he commits few turnovers. On defense, although he routinely guards the N.B.A.’s most prolific scorers, he significantly reduces their shooting percentages. At the same time he somehow improves the defensive efficiency of his teammates – probably, Morey surmises, by helping them out in all sorts of subtle ways. “I call him Lego,” Morey says. “When he’s on the court, all the pieces start to fit together. And everything that leads to winning that you can get to through intellect instead of innate ability, Shane excels in. I’ll bet he’s in the hundredth percentile of every category.”

Since Bill James discovered the sports Pythagorean theorem and founded Sabermetrics, statistics has taken the world of sports by storm. Baseball pitchers are given minute details about every opposing player and their weaknesses. In basketball, Battier gets a report detailing how well his check, Kobe Bryant, plays in every part of the court. All of these statistics are created by people carefully watching videos of the game from all angles, and detailing every move.

Knowing that statistics can dramatically improve everyday decision making is important, but so is realizing the weaknesses inherent in statistics. Shane Battier’s stats are a case in point. From the box score he comes off as a mediocre player, but that’s because we are looking at the wrong things. This is one of the largest weaknesses in statistics. Any statistical analysis is only as good as the data, and if we become myopic and look only at the obvious data, we miss seeing magnificent opportunities like Shane Battier.

Statistics is part of the future in terms of decision making, but don’t forget to look for hidden opportunities and ways to improve your analysis. One thing that can be done is to set aside some time every year to review your decisions and any obvious missed opportunities or odd recurrences, then fix the way you make decisions to catch these in the future.

Pythagorean Part 3: R Analysis of Hockey

May 11th, 2010

With the data files extracted in Pythagorean Part 2 we can go on to the statistical analysis with R.

Recall, the proportion of games won by a team is predicted by the formula:

\text{Proportion of Wins} = \frac{\text{Runs Scored}^{2}}{\text{Runs Scored}^{2} + \text{Runs Allowed}^{2}}

Our goal is to calculate the optimum exponent that best describes the actual proportion of wins for each season. The complete script can be found on github. The script generates the following time series,

Exponent time series

The time series shows the best fitting exponent value for each season (measured in years). The size of the point indicates the number of teams playing in each season. The color represents the error as measured by the sum of squared residuals (SSE).

As you can see the error is relatively small for all the seasons (SSE less than or equal to 0.20). So the formula fits fairly well for all seasons since 1920.

From the graph it is clear that an exponent value of 2 is accurate enough for all seasons. An interesting trend is that the exponent values cluster more tightly as the number of teams increases.
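To make the fitting step concrete, here is a rough Python sketch of finding the exponent that minimizes the SSE for one season. It is not the R script from github: the function names, the simple grid search, and the season data are all made up for illustration. The win proportions are generated from an exponent of 2 so that the right answer is known in advance.

```python
def predicted_win_proportion(scored, allowed, exponent=2.0):
    """Pythagorean formula with an adjustable exponent."""
    s, a = scored ** exponent, allowed ** exponent
    return s / (s + a)


def sse(exponent, teams):
    """Sum of squared residuals between actual and predicted win proportions."""
    return sum((wins - predicted_win_proportion(gf, ga, exponent)) ** 2
               for gf, ga, wins in teams)


def best_exponent(teams, lo=0.5, hi=4.0, step=0.01):
    """Grid search for the exponent with the smallest SSE (an assumed method)."""
    n_steps = int(round((hi - lo) / step))
    candidates = [lo + i * step for i in range(n_steps + 1)]
    return min(candidates, key=lambda e: sse(e, teams))


# Hypothetical season: (goals scored, goals allowed) per team. The "actual" win
# proportions are synthetic, generated with exponent 2 as a sanity check.
goals = [(250, 200), (220, 230), (180, 260), (300, 210)]
teams = [(gf, ga, predicted_win_proportion(gf, ga, 2.0)) for gf, ga in goals]

print(round(best_exponent(teams), 2))
```

Running the same search once per season, with real goals and win proportions, would give one point of the time series per year.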

So what does this mean?

We can make a fairly simple prediction for any team given the number of goals scored and the number of goals let in. In general, if there is a large discrepancy between the actual and predicted proportions of games won, we can expect the team to win more (or fewer) of its future games as it regresses back to the predicted proportion of wins.
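As a small illustration of that last point, the Python sketch below compares a team's actual win proportion with the proportion the formula predicts. The team and its numbers are hypothetical, the function names are mine, and for simplicity it ignores ties and overtime losses.

```python
def pythagorean_expectation(goals_for, goals_against, exponent=2.0):
    """Predicted proportion of wins from goals scored and goals let in."""
    gf, ga = goals_for ** exponent, goals_against ** exponent
    return gf / (gf + ga)


def win_gap(wins, games_played, goals_for, goals_against):
    """Actual minus predicted win proportion; positive means winning more than predicted."""
    actual = wins / games_played
    return actual - pythagorean_expectation(goals_for, goals_against)


# Hypothetical team: 55 wins in 82 games, 250 goals scored, 200 let in.
gap = win_gap(55, 82, 250, 200)
print(f"predicted: {pythagorean_expectation(250, 200):.3f}, gap: {gap:+.3f}")
```

A large positive gap suggests the team has been winning more than its goals justify and may cool off. A large negative gap suggests the opposite.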