Analysis of Chess Game Openings

Austin Nguyen and Elena Yang

Project Goals

The goal of this project is to understand how chess openings relate to variables such as the outcome of the match, the players' ratings, and the openings a specific player chooses. Which openings are optimal has long been a subject of debate, and we believe that differences in the data across openings can provide insight. The question we will attempt to answer is how much the opening contributes to winning once other variables, such as rating and time control, are taken into account.

Project Dataset

The dataset we plan to work with is taken from Lichess. Each observation is a game played by users from "Lichess Teams", who are uniquely identified by their usernames. Each record includes the time the game was played, the users that participated, and the exact moves that were played; this means the opening can be directly inferred and placed within the data.

import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/elenajy/elenajy.github.io/master/datascience/games.csv')

# Average rating: take the mean of the white and black rating columns, then average those two means
(df["black_rating"].mean() + df["white_rating"].mean()) / 2

1592.7319274105096

The average rating in the dataset is around 1600. Although the games are not at grandmaster level, this average reflects a slightly above-average player, which makes the findings more useful to an average player. This matters because advice on a "useful" opening typically comes from grandmaster play, and such an opening may or may not be as useful to average players, who make up the larger demographic.

In addition, although the average is around 1600, the dataset covers a variety of ratings, so we can also look at which openings are used across the range of ratings provided.
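One way to look at the spread of ratings is to group games into rating bands. The following is a minimal sketch, not part of our analysis so far: the helper name `rating_band_counts`, the 200-point band width, and the toy rows are all assumptions made for illustration; only the column names `white_rating` and `black_rating` come from the dataset.

```python
import pandas as pd

def rating_band_counts(games: pd.DataFrame, width: int = 200) -> pd.Series:
    """Count games per band of mean player rating (hypothetical helper)."""
    # Mean of the two players' ratings for each game.
    mean_rating = (games["white_rating"] + games["black_rating"]) / 2
    # Floor each mean to the lower edge of its band, e.g. 1592.5 -> 1400 for width=200.
    band_start = (mean_rating // width * width).astype(int)
    return band_start.value_counts().sort_index()

# Toy rows standing in for the real data (values are made up for illustration).
toy = pd.DataFrame({
    "white_rating": [1500, 1322, 1496, 2100],
    "black_rating": [1191, 1261, 1500, 1900],
})
band_counts = rating_band_counts(toy)
print(band_counts)
```

On the real data, `rating_band_counts(df)` would show how many games fall into each band, which could help decide which bands have enough games to compare openings.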

While this information is useful, it could be beneficial to increase the sample size, as this is a relatively small sample of about 20,000 games. It would also be interesting to include some higher-rated games to see how trends differ with the ratings of the players.

Collaboration Plan

We plan on working on this project as a pair at least once a week over Discord, where we have held and will continue to hold most of our project conversations. We will also meet in person or over Zoom to discuss the project as we achieve the goals we set for ourselves and to evaluate our progress and timeline. A Google Colab notebook has been set up so that we can work on the project individually, and a GitHub repository has been set up with our files; we both have access to it and can commit and push changes.

ETL (Extract, Transform, and Load)

We have loaded our dataset, found on Lichess, to GitHub. To make the table slightly easier to read, we will remove the "moves" column, which also records the later moves made in the match; the first moves are already captured by the opening, which is what we are mainly concerned with.

# Drop the full move list; the opening columns already describe the first moves
df = df.drop("moves", axis=1)

Looking at the dataset, there is a rated column that is True if the game was rated and False otherwise. For the purpose of this analysis we will focus on rated games only. About 83% of the games in this dataset are rated, so restricting to them should not affect the results much.
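Restricting to rated games can be sketched as a boolean filter on the rated column. This is an illustrative sketch on made-up rows, not a step that has been applied to the table shown in this notebook; the helper name `keep_rated` is an assumption.

```python
import pandas as pd

def keep_rated(games: pd.DataFrame) -> pd.DataFrame:
    """Return only the rows whose boolean `rated` column is True (hypothetical helper)."""
    # reset_index(drop=True) renumbers the rows, discarding the original row labels.
    return games[games["rated"]].reset_index(drop=True)

# Toy rows for illustration only.
toy = pd.DataFrame({
    "rated": [False, True, True, True],
    "winner": ["white", "black", "white", "white"],
})
rated_share = toy["rated"].mean()  # the mean of a boolean column is the fraction of True values
rated_games = keep_rated(toy)
print(rated_share, len(rated_games))
```

Note that renumbering the rows changes the row labels, so if the original row number is meant to identify a game, dropping the `reset_index` call may be preferable.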

We will also drop other columns that do not affect our study of openings and game results: created_at, which indicates when the game started, and last_move_at, which indicates when the game ended. At this point in our analysis we choose to keep white_id and black_id, the usernames of the respective players, because there is some analysis to be done regarding how often each player appears in this dataset; for example, just because player_a uses the French opening 1000 times does not mean the French is widely used. Finally, we also drop the id column, since each observation is already uniquely identified by its row number and the id column is therefore redundant.

# Drop the timestamp columns and the redundant game id in one call
df = df.drop(columns=['created_at', 'last_move_at', 'id'])
display(df)
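The concern about one player inflating an opening's popularity can be checked by comparing the number of games per opening with the number of distinct players who appear in those games. The sketch below assumes the column names `opening_name`, `white_id`, and `black_id` from the dataset; the helper name `opening_popularity` and the toy rows are hypothetical.

```python
import pandas as pd

def opening_popularity(games: pd.DataFrame) -> pd.DataFrame:
    """Games and distinct players per opening (hypothetical helper)."""
    # Stack white and black appearances into one (opening, player) table.
    appearances = pd.concat([
        games[["opening_name", "white_id"]].rename(columns={"white_id": "player"}),
        games[["opening_name", "black_id"]].rename(columns={"black_id": "player"}),
    ])
    n_games = games.groupby("opening_name").size().rename("games")
    n_players = appearances.groupby("opening_name")["player"].nunique().rename("players")
    return pd.concat([n_games, n_players], axis=1).sort_values("games", ascending=False)

# Toy rows: the French is played often, but always by the same two players.
toy = pd.DataFrame({
    "opening_name": ["French Defense"] * 3 + ["Sicilian Defense"] * 2,
    "white_id": ["player_a", "player_a", "player_a", "p1", "p3"],
    "black_id": ["player_b", "player_b", "player_b", "p2", "p4"],
})
popularity = opening_popularity(toy)
print(popularity)
```

In the toy data the French has more games but fewer distinct players than the Sicilian, which is the situation the paragraph above warns about.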

Exploratory Data Analysis

One commonly discussed topic is the win rate of white versus black; we will check whether that imbalance is visible in a small sample of average-rated games.

df.winner.value_counts().plot.bar()


(df[(df.winner == "white")].winner.value_counts())/(len(df.winner)) #Finding the % of games won by white

white    0.498604
Name: winner, dtype: float64

(df[(df.winner == "black")].winner.value_counts())/(len(df.winner)) #Finding the % of games won by black

black    0.454033
Name: winner, dtype: float64

(df[(df.winner == "draw")].winner.value_counts())/(len(df.winner)) #Finding the % of draws

draw    0.047363
Name: winner, dtype: float64
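The three proportions above can also be obtained in a single call with `value_counts(normalize=True)`, which divides each count by the total number of rows. A minimal sketch on made-up rows:

```python
import pandas as pd

# Toy outcomes for illustration only.
toy = pd.DataFrame({"winner": ["white", "black", "white", "draw"]})

# normalize=True returns fractions of the total instead of raw counts.
shares = toy["winner"].value_counts(normalize=True)
print(shares)
```

On the real data, `df.winner.value_counts(normalize=True)` should give the same three fractions as the separate cells, since each of them divides by `len(df.winner)`.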

We notice that, with a draw rate of around 5%, white wins about 50% of games while black wins about 45%, so white wins noticeably more often than black. In chess, white moves first, so this suggests that if each player plays optimally (a large assumption to make, of course), white has the advantage since they dictate the direction the opening goes.

Another interesting observation is the different time controls that were played. When looking at the number of times each time control was played, we notice that 10+0 (a 10-minute timer with no increment) was played the most. This could affect the data, as 10+0 is a mid-range speed, which may allow for more comfortable openings.

df.increment_code.value_counts()

10+0     7721
15+0     1311
15+15     850
5+5       738
5+8       697
         ... 
14+9        1
0+20        1
0+40        1
13+20       1
14+15       1
Name: increment_code, Length: 400, dtype: int64
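With 400 distinct time controls, it may be easier to work with the two numbers inside each code. The sketch below splits increment_code into a base time and an increment, assuming every code has the form "minutes+seconds" as in 10+0; the helper name `split_time_control` and the new column names are hypothetical.

```python
import pandas as pd

def split_time_control(games: pd.DataFrame) -> pd.DataFrame:
    """Add base_minutes and increment_seconds columns (hypothetical helper).

    Assumes every increment_code looks like "<minutes>+<seconds>".
    """
    # Split on the literal "+" and convert both pieces to integers.
    parts = games["increment_code"].str.split("+", expand=True).astype(int)
    parts.columns = ["base_minutes", "increment_seconds"]
    return games.join(parts)

# Toy rows using codes that appear in the counts above.
toy = pd.DataFrame({"increment_code": ["10+0", "15+15", "5+8"]})
timed = split_time_control(toy)
print(timed)
```

The numeric columns could then be binned into a few speed categories, so that rare codes such as 13+20 are grouped with similar time controls instead of standing alone.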

Finally, we previously mentioned that we want to make sure this dataset has a variety of players, and not just one or two players dominating the openings reflected in it. Before we change anything to reflect a unique set of players, we will first find the number of distinct players in the data.

# We have 20058 observations, so if every player appeared only once we would have 40116 unique players
df_playerIds = df[['white_id', 'black_id']]
# Series.append was removed in pandas 2.0, so stack the two columns with pd.concat instead
df_uniquePlayers = pd.concat([df_playerIds['white_id'], df_playerIds['black_id']]).unique()
print(len(df_uniquePlayers))

15635

With only 15635 unique players, it is clear that a portion of the games in the dataset involve players who appear more than once. While it is important to look at the opening data from unique players, it would also be interesting to look at the different openings that a single player plays, which we could find using this dataset.
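To see which players account for the repeated appearances, the number of games per player can be counted across both colors. This is a minimal sketch on made-up rows; the helper name `games_per_player` is an assumption, while `white_id` and `black_id` are the dataset's columns.

```python
import pandas as pd

def games_per_player(games: pd.DataFrame) -> pd.Series:
    """Number of games each username appears in, as white or black (hypothetical helper)."""
    ids = pd.concat([games["white_id"], games["black_id"]], ignore_index=True)
    # value_counts sorts from the most to the least frequent player.
    return ids.value_counts()

# Toy rows for illustration only.
toy = pd.DataFrame({
    "white_id": ["a-00", "ischia", "a-00"],
    "black_id": ["skinnerua", "a-00", "bourgris"],
})
counts = games_per_player(toy)
print(counts)
```

Players at the top of this count would be natural candidates for the per-player opening analysis mentioned above.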

The dtypes are properly interpreted: bool for the rated column, int64 for our quantitative variables, and object for the rest.

display(df.dtypes)

rated               bool
turns              int64
victory_status    object
winner            object
increment_code    object
white_id          object
white_rating       int64
black_id          object
black_rating       int64
opening_eco       object
opening_name      object
opening_ply        int64
dtype: object


	rated	turns	victory_status	winner	increment_code	white_id	white_rating	black_id	black_rating	opening_eco	opening_name	opening_ply
0	False	13	outoftime	white	15+2	bourgris	1500	a-00	1191	D10	Slav Defense: Exchange Variation	5
1	True	16	resign	black	5+10	a-00	1322	skinnerua	1261	B00	Nimzowitsch Defense: Kennedy Variation	4
2	True	61	mate	white	5+10	ischia	1496	a-00	1500	C20	King's Pawn Game: Leonardis Variation	3
3	True	61	mate	white	20+0	daniamurashov	1439	adivanov2009	1454	D02	Queen's Pawn Game: Zukertort Variation	3
4	True	95	mate	white	30+3	nik221107	1523	adivanov2009	1469	C41	Philidor Defense	5
...	...	...	...	...	...	...	...	...	...	...	...	...
20053	True	24	resign	white	10+10	belcolt	1691	jamboger	1220	A80	Dutch Defense	2
20054	True	82	mate	black	10+0	jamboger	1233	farrukhasomiddinov	1196	A41	Queen's Pawn	2
20055	True	35	mate	white	10+0	jamboger	1219	schaaksmurf3	1286	D00	Queen's Pawn Game: Mason Attack	3
20056	True	109	resign	white	10+0	marcodisogno	1360	jamboger	1227	B07	Pirc Defense	4
20057	True	78	mate	black	10+0	jamboger	1235	ffbob	1339	D00	Queen's Pawn Game: Mason Attack	3