Analysis of Chess Game Openings

Austin Nguyen and Elena Yang

Link to our Github Page

Project Goals

The goal of the project is to understand how chess openings relate to variables such as the outcome of the match, the players' ratings, and even the choice of openings a specific player makes. Which openings are optimal has long been a matter of controversy, and we believe the differences in the data across openings will provide insight. The question we will attempt to answer is how much of a factor the opening is in winning, taking into account the other variables present, such as rating and time control.

Project Dataset

The dataset we plan to work with is taken from Lichess. Each observation is a game played by users from "Lichess Teams", who are uniquely identified by their usernames. Each record includes data such as the time the game was played, the users that participated, and the exact moves that were played; this means the opening can be directly inferred and placed within the data.

In [ ]:
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/elenajy/elenajy.github.io/master/datascience/games.csv')
In [ ]:
# Average rating: take the mean of the white and black rating columns, then average the two means
(df["black_rating"].mean() + df["white_rating"].mean()) / 2
Out[ ]:
1592.7319274105096

The average rating in the dataset is around 1600. Although the games are not exactly grandmaster level, this average reflects a slightly above-average player, which makes the information more useful to an average player. This is important because advice about a "useful" opening often comes from scenarios where a grandmaster uses the opening, which may or may not be as useful to the average player, who belongs to the larger demographic.

In addition, although the average is around 1600, the dataset covers a variety of ratings, which means we can also look at the openings used across the range of ratings provided.
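
To check how widely the ratings are spread, one option is to bin each game's average rating into bands. The sketch below uses a few hypothetical toy rows rather than the real table, and the band edges and the names game_rating and rating_band are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real games table.
games = pd.DataFrame({
    "white_rating": [1500, 1322, 1496, 2100, 900],
    "black_rating": [1191, 1261, 1500, 2050, 1100],
})

# One rating per game: the mean of the two players' ratings.
games["game_rating"] = (games["white_rating"] + games["black_rating"]) / 2

# Assumed band edges (left-inclusive); not taken from the original analysis.
bins = [0, 1200, 1600, 2000, 3000]
labels = ["<1200", "1200-1599", "1600-1999", "2000+"]
games["rating_band"] = pd.cut(games["game_rating"], bins=bins, labels=labels, right=False)

# Number of games in each band, listed in band order.
band_counts = games["rating_band"].value_counts().sort_index()
print(band_counts)
```

Running the same steps on df would show how many games fall into each band and whether the higher bands hold enough games to compare openings.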

While this information is useful, it could be beneficial to increase the sample size, as this is a relatively small sample of about 20,000 observations. It would also be interesting to include some higher-rated games to see how trends differ depending on the ratings of the players.

Collaboration Plan

We plan on working on this project as a pair at least once a week over Discord, where we have held and will continue to hold most of our project conversations. We will also meet in person or over Zoom to discuss the project as we achieve the goals we set for ourselves and to evaluate our progress and timeline. A Google Colab notebook has been set up so that we can work on the project individually, and a GitHub repository has been set up with our files, which we both can access and push changes to.

ETL (Extraction, Transform, and Load)

We have loaded our dataset, found on Lichess, to GitHub. To make the table slightly easier to read, we will remove the "moves" column, which records the moves made in the match, including the later ones; the first moves are already known through the opening, which is what we are mainly concerned with.

In [ ]:
# The table was loaded as df above (df_moves was never defined), so drop the column from df
df = df.drop("moves", axis=1)

Looking at the dataset, there is a rated column that is False if the game was not rated and True otherwise. For the purpose of this analysis we will focus on rated games only. About 83% of the games in this dataset are rated, so this won't affect the results too much.
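
A minimal sketch of how the rated share could be computed and the unrated games filtered out is shown here. The toy rows and the names rated_share and rated_games are hypothetical; on the real table the same two lines would be applied to df.

```python
import pandas as pd

# Hypothetical toy rows; the real table has the same boolean "rated" column.
games = pd.DataFrame({
    "rated": [False, True, True, True, True],
    "winner": ["white", "black", "white", "white", "draw"],
})

# The mean of a boolean column is the fraction of True values.
rated_share = games["rated"].mean()

# Keep only rated games and renumber the rows.
rated_games = games[games["rated"]].reset_index(drop=True)

print(f"{rated_share:.1%} of games are rated")
print(len(rated_games), "rated games kept")
```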

Other columns that we will drop because they don't affect our study of opening moves and game results are created_at, which indicates when the game started, and last_move_at, which indicates when the game ended. At this point in our analysis we choose to keep white_id and black_id, the usernames of the respective players, because there is some analysis to be done on how often each player appears in this dataset; for example, just because player_a uses the French opening 1000 times does not mean the French is widely used. Finally, we also chose to get rid of the id column, since each observation is already uniquely identified by its row number, which makes the id column redundant.

In [ ]:
# Drop the two timestamp columns and the redundant id column in one call
df.drop(columns=['created_at', 'last_move_at', 'id'], inplace=True)
display(df)
       rated  turns  victory_status  winner  increment_code  white_id       white_rating  black_id            black_rating  opening_eco  opening_name                            opening_ply
0      False  13     outoftime       white   15+2            bourgris       1500          a-00                1191          D10          Slav Defense: Exchange Variation        5
1      True   16     resign          black   5+10            a-00           1322          skinnerua           1261          B00          Nimzowitsch Defense: Kennedy Variation  4
2      True   61     mate            white   5+10            ischia         1496          a-00                1500          C20          King's Pawn Game: Leonardis Variation   3
3      True   61     mate            white   20+0            daniamurashov  1439          adivanov2009        1454          D02          Queen's Pawn Game: Zukertort Variation  3
4      True   95     mate            white   30+3            nik221107      1523          adivanov2009        1469          C41          Philidor Defense                        5
...    ...    ...    ...             ...     ...             ...            ...           ...                 ...           ...          ...                                     ...
20053  True   24     resign          white   10+10           belcolt        1691          jamboger            1220          A80          Dutch Defense                           2
20054  True   82     mate            black   10+0            jamboger       1233          farrukhasomiddinov  1196          A41          Queen's Pawn                            2
20055  True   35     mate            white   10+0            jamboger       1219          schaaksmurf3        1286          D00          Queen's Pawn Game: Mason Attack         3
20056  True   109    resign          white   10+0            marcodisogno   1360          jamboger            1227          B07          Pirc Defense                            4
20057  True   78     mate            black   10+0            jamboger       1235          ffbob               1339          D00          Queen's Pawn Game: Mason Attack         3

20058 rows × 12 columns
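
Because a single frequent player can inflate an opening's game count, it may help to compare the number of games per opening with the number of distinct players involved. The sketch below is one possible way to do that on hypothetical toy rows; the player names and the summary columns (games, distinct_players, top_player_games, top_player_share) are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Hypothetical toy rows: player_a appears in every French Defense game.
games = pd.DataFrame({
    "white_id": ["player_a", "player_a", "player_a", "player_b", "player_d"],
    "black_id": ["player_b", "player_c", "player_d", "player_c", "player_e"],
    "opening_name": ["French Defense", "French Defense", "French Defense",
                     "Sicilian Defense", "Sicilian Defense"],
})

# One row per (game, player) so that both colors count toward an opening.
long = games.melt(id_vars=["opening_name"], value_vars=["white_id", "black_id"],
                  value_name="player")

# Number of games each player has in each opening.
per_player = long.groupby(["opening_name", "player"]).size()

summary = pd.DataFrame({
    "games": games.groupby("opening_name").size(),
    "distinct_players": long.groupby("opening_name")["player"].nunique(),
    "top_player_games": per_player.groupby(level="opening_name").max(),
})

# Share of an opening's games that involve its single most frequent player.
summary["top_player_share"] = summary["top_player_games"] / summary["games"]
print(summary)
```

An opening with a high top_player_share owes much of its count to one player, which is the player_a situation described above.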

Exploratory Data Analysis

One commonly discussed topic is the win rate of White versus Black, and we will see whether that difference is visible in a small sample of average-rated games.

In [ ]:
df.winner.value_counts().plot.bar()
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f26fa57cc10>
In [ ]:
(df[(df.winner == "white")].winner.value_counts())/(len(df.winner)) #Finding the % of games won by white
Out[ ]:
white    0.498604
Name: winner, dtype: float64
In [ ]:
(df[(df.winner == "black")].winner.value_counts())/(len(df.winner)) #Finding the % of games won by black
Out[ ]:
black    0.454033
Name: winner, dtype: float64
In [ ]:
(df[(df.winner == "draw")].winner.value_counts())/(len(df.winner)) #Finding the % of draws
Out[ ]:
draw    0.047363
Name: winner, dtype: float64

We notice that White wins about 50% of the games and Black about 45%, with draws making up the remaining 5%, so White wins noticeably more often than Black. In chess, White moves first, so this suggests that if each player plays optimally (a large assumption to make, of course) White has the advantage, since White dictates the direction the opening goes.

Another interesting observation is the range of time controls that were played. When looking at the number of times each time control was played, we notice that 10+0 (a 10-minute timer with no increment) was played the most. This could affect the data, as this is a mid-range speed, which may allow for more comfortable openings.
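
The three separate share calculations can also be obtained in one step with value_counts(normalize=True). The sketch below uses a hypothetical toy winner column; the names shares, decisive, and white_share_decisive are illustrative.

```python
import pandas as pd

# Hypothetical toy outcomes: 5 white wins, 4 black wins, 1 draw.
winners = pd.Series(
    ["white", "black", "white", "draw", "white",
     "black", "white", "black", "white", "black"],
    name="winner",
)

# Share of each outcome in a single call.
shares = winners.value_counts(normalize=True)

# White's share among decisive games only (draws excluded).
decisive = winners[winners != "draw"]
white_share_decisive = (decisive == "white").mean()

print(shares)
print(round(white_share_decisive, 3))
```

With the shares reported for this dataset, White's share of decisive games would be 0.498604 / (0.498604 + 0.454033), roughly 52.3%.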

In [ ]:
df.increment_code.value_counts()
Out[ ]:
10+0     7721
15+0     1311
15+15     850
5+5       738
5+8       697
         ... 
14+9        1
0+20        1
0+40        1
13+20       1
14+15       1
Name: increment_code, Length: 400, dtype: int64
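
With 400 distinct codes, it may be easier to compare time controls after splitting each code into its base time and increment. The sketch below does this on a few sample codes. The column names base_minutes, increment_seconds, and estimated_seconds are hypothetical, and the 40-move estimate of game length is an assumed heuristic, not something taken from the dataset.

```python
import pandas as pd

# A few sample codes in the "minutes+seconds" format used by increment_code.
codes = pd.Series(["10+0", "15+15", "5+8", "0+20"], name="increment_code")

# Split "base+increment" into two integer columns.
parts = codes.str.split("+", expand=True).astype(int)
parts.columns = ["base_minutes", "increment_seconds"]

# Assumption: estimate game length as base time plus 40 moves' worth of increment.
parts["estimated_seconds"] = parts["base_minutes"] * 60 + 40 * parts["increment_seconds"]

print(parts)
```

Grouping by estimated_seconds (or by bands of it) would let fast and slow games be compared without treating each of the 400 codes separately.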

Finally, we previously mentioned that we want to make sure that this dataset has a variety of players, and not just one or two players dominating the types of openings reflected in it. Before we change anything to reflect a unique set of players, we will first find the number of players that are in the data.

In [ ]:
# we have 20058 observations, so if all the players are unique we would have 40116 unique players
df_playerIds = df[['white_id','black_id']]
# Series.append was removed in pandas 2.0, so stack the two ID columns with pd.concat
df_uniquePlayers = pd.concat([df_playerIds['white_id'], df_playerIds['black_id']]).unique()
print(len(df_uniquePlayers))
15635

With only 15635 unique players against a possible 40116, a portion of the games in the dataset are played by repeat players. While it is important to look at the opening data from unique players, it would also be interesting to look at the different openings that a single player plays, which we could find using this dataset.

The dtypes are properly inferred: bool for the rated column, int64 for our quantitative variables, and object for the rest.
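
To see how concentrated the repeat appearances are, one could count how many games each username appears in, across both colors. The sketch below uses hypothetical toy usernames; appearances, n_unique, and repeat_share are illustrative names.

```python
import pandas as pd

# Hypothetical toy rows: player_a appears in every game.
games = pd.DataFrame({
    "white_id": ["player_a", "player_b", "player_a", "player_c"],
    "black_id": ["player_d", "player_a", "player_e", "player_a"],
})

# Stack both ID columns so each appearance is one row.
players = pd.concat([games["white_id"], games["black_id"]], ignore_index=True)

# Games per player, most frequent first.
appearances = players.value_counts()

n_unique = players.nunique()
# Fraction of players who appear in more than one game.
repeat_share = (appearances > 1).mean()

print(appearances.head())
print(n_unique, repeat_share)
```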

In [ ]:
display(df.dtypes)
rated               bool
turns              int64
victory_status    object
winner            object
increment_code    object
white_id          object
white_rating       int64
black_id          object
black_rating       int64
opening_eco       object
opening_name      object
opening_ply        int64
dtype: object