The goal of this project is to understand how chess openings relate to variables such as the outcome of the match, the players' ratings, and the opening choices of individual players. Which openings are optimal has long been a subject of debate, and we believe that differences in the data across openings will provide insight. The question we will attempt to answer is how much the opening matters for winning once other variables, such as rating and time control, are taken into account.
The dataset we plan to work with is taken from Lichess. Each unit of observation is a game played by users from "Lichess Teams", with each player uniquely identified by username. Each record includes the time the game was played, the users who participated, and the exact moves that were played, which means the opening can be inferred directly and placed within the data.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/elenajy/elenajy.github.io/master/datascience/games.csv')
# Average rating: take the mean of the white and black rating columns, then average the two means
(df["black_rating"].mean() + df["white_rating"].mean()) / 2
The average rating in the dataset is around 1600. Although the games are not grandmaster level, this average reflects a slightly above-average player, which makes the findings more useful to a typical player. This matters because advice on a "useful" opening usually comes from grandmaster play, which may or may not carry over to the average player, who represents the larger demographic.
In addition, although the average is around 1600, the dataset covers a variety of ratings, which means we can also look at the openings used across the range of ratings provided.
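One way to look at openings across the rating range is to group games into rating bands first. The following is a minimal sketch on a toy table, not the project's actual method: the band edges and labels are assumptions chosen for illustration, and `games` is a hypothetical stand-in for the real data.

```python
import pandas as pd

# Toy stand-in for the games table (hypothetical values).
games = pd.DataFrame({
    "white_rating": [1100, 1500, 1650, 2100],
    "black_rating": [1200, 1450, 1750, 2000],
})

# Average the two players' ratings for each game.
games["mean_rating"] = (games["white_rating"] + games["black_rating"]) / 2

# Assumed band edges; right=False makes each band include its lower edge.
bins = [0, 1200, 1600, 2000, 3000]
labels = ["<1200", "1200-1599", "1600-1999", "2000+"]
games["rating_band"] = pd.cut(
    games["mean_rating"], bins=bins, labels=labels, right=False
)

# Number of games in each band, in band order.
band_counts = games["rating_band"].value_counts().sort_index()
print(band_counts)
```

With a band column in place, any later per-opening summary could be grouped by `rating_band` as well.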
While this information is useful, it would be beneficial to increase the sample size, since roughly 20,000 observations is a relatively small sample. It would also be interesting to include some higher-rated games to see how trends differ with the ratings of the players.
We plan on working on this project as a pair at least once a week over Discord, where we have held and will continue to hold most of our project conversations. We will also meet in person or over Zoom to discuss the project as we reach the goals we set for ourselves and to evaluate our progress and timeline. A Google Colab notebook has been set up so that we can work on the project individually, and a GitHub repository has been set up with our files, to which we both have access and the ability to commit and push changes.
We have loaded our dataset, found on Lichess, to the GitHub repository. To make the table slightly easier to read, we will remove the "moves" column, which mostly reflects later moves made in the match; the first moves are already known through the opening, which is what we are mainly concerned with.
# Drop the full move list; df is the frame loaded from the CSV above
df = df.drop("moves", axis=1)
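If the opening ever had to be recovered from the move list itself, it would need to happen before the column is dropped. The sketch below shows one possible approach on a toy table. It assumes the "moves" column holds space-separated moves in algebraic notation; the helper `first_plies` and the cutoff of four half-moves are hypothetical choices for illustration.

```python
import pandas as pd

# Toy stand-in; assumes "moves" is a space-separated string of moves.
games = pd.DataFrame({
    "moves": [
        "e4 e6 d4 d5 Nc3 Bb4",
        "e4 e5 Nf3 Nc6 Bb5 a6",
        "d4 d5 c4 e6",
    ]
})

def first_plies(moves: str, n_plies: int = 4) -> str:
    """Return the first n_plies half-moves of a game as one string."""
    return " ".join(moves.split()[:n_plies])

# Keep only the opening prefix of each game.
games["opening_prefix"] = games["moves"].apply(first_plies)
print(games["opening_prefix"].tolist())
```

Games that share a prefix can then be grouped together, at the cost of ignoring transpositions between openings.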
Looking at the dataset, there is a rated column that is True if the game was rated and False otherwise. For the purpose of this analysis we will focus on rated games only. About 83% of the games in this dataset are rated, so restricting to them should not affect the results too much.
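Restricting to rated games can be sketched as a boolean filter. This is an illustration on a toy table rather than the project's own code: `games`, `rated_share`, and `rated_games` are hypothetical names, and the sketch assumes the rated column is read as a boolean dtype.

```python
import pandas as pd

# Toy stand-in for the games table (hypothetical values).
games = pd.DataFrame({
    "rated": [True, False, True, True],
    "winner": ["white", "black", "draw", "white"],
})

# The mean of a boolean column is the fraction of True values.
rated_share = games["rated"].mean()

# Keep only rated games and renumber the rows.
rated_games = games[games["rated"]].reset_index(drop=True)

print(f"{rated_share:.0%} of games are rated; {len(rated_games)} rows kept")
```

Computing the share before filtering makes it easy to report how much data the restriction removes.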
Other columns that we will drop, because they do not affect our study of opening moves and game results, are created_at, which indicates when the game started, and last_move_at, which indicates when the game ended. At this point in our analysis we choose to keep white_id and black_id, the usernames of the respective players, because there is some analysis to be done on how often each player appears in this dataset; for example, the fact that player_a uses the French opening 1000 times does not mean the French is widely used. Finally, we also chose to drop the id column, since each unit of observation is already uniquely identified by its row number and the id column is therefore redundant.
# Drop the timestamp and id columns in a single call
df = df.drop(columns=['created_at', 'last_move_at', 'id'])
display(df)
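The concern about a single player inflating an opening's popularity can be checked by comparing the number of games per opening with the share of those games involving the most frequent player. The sketch below is only an illustration on a toy table: the `opening_name` column is an assumption (it is not one of the columns discussed above), and the player names are made up.

```python
import pandas as pd

# Toy stand-in; "opening_name" is an assumed column name.
games = pd.DataFrame({
    "white_id": ["player_x", "player_y", "player_z", "player_b", "player_d"],
    "black_id": ["player_a", "player_a", "player_a", "player_c", "player_e"],
    "opening_name": ["French Defense"] * 3 + ["Sicilian Defense"] * 2,
})

# One row per (opening, player) appearance, regardless of colour.
appearances = pd.concat([
    games[["opening_name", "white_id"]].rename(columns={"white_id": "player"}),
    games[["opening_name", "black_id"]].rename(columns={"black_id": "player"}),
])

games_per_opening = games.groupby("opening_name").size()

# Games played by the single most frequent player of each opening.
top_player_games = appearances.groupby("opening_name")["player"].agg(
    lambda s: s.value_counts().iloc[0]
)

summary = pd.DataFrame({
    "games": games_per_opening,
    "top_player_games": top_player_games,
})
summary["top_player_share"] = summary["top_player_games"] / summary["games"]
print(summary)
```

An opening whose `top_player_share` is close to 1 owes most of its game count to one player, which is the situation described for player_a and the French.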
One commonly discussed topic is the win rate of white versus black, and we will see whether that difference is visible in a small sample of average-rated games.
df.winner.value_counts().plot.bar()
(df.winner == "white").mean()  # Fraction of games won by white
(df.winner == "black").mean()  # Fraction of games won by black
(df.winner == "draw").mean()   # Fraction of games that ended in a draw
We notice that, with the draw rate at around 5%, white has a win rate of about 50% and black about 45%, so white wins noticeably more often than black. In chess, white moves first, which suggests that if each player plays optimally (a large assumption to make, of course) white has the advantage, since white dictates the direction the opening takes.
Another interesting observation concerns the different time controls that were played. When looking at the number of times each time control was played, we notice that 10+0 (a 10-minute timer with no increment) was played the most. This could affect the data, as this is a mid-range speed, which may allow for more comfortable openings.
df.increment_code.value_counts()
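Since time control is one of the variables of interest, outcomes can also be tabulated per time control. The following is a minimal sketch on a toy table, assuming the `increment_code` and `winner` columns used above; the values are made up and `outcome_by_tc` is a hypothetical name.

```python
import pandas as pd

# Toy stand-in for the games table (hypothetical values).
games = pd.DataFrame({
    "increment_code": ["10+0", "10+0", "10+0", "10+0", "5+5", "5+5"],
    "winner": ["white", "white", "black", "draw", "black", "white"],
})

# Each row is a time control; normalize="index" makes every row sum to 1,
# so the cells are outcome shares within that time control.
outcome_by_tc = pd.crosstab(
    games["increment_code"], games["winner"], normalize="index"
)
print(outcome_by_tc)
```

On the real data, rarely played time controls would have very few games per row, so their shares should be read with caution.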
Finally, we previously mentioned that we want to make sure this dataset has a variety of players, and not just one or two players dominating the openings it reflects. Before we change anything to reflect a unique set of players, we will first find the number of players in the data.
# we have 20058 observations, so if all the players are unique we would have 40116 unique players
df_playerIds = df[['white_id','black_id']]
# Series.append was removed in pandas 2.0; pd.concat stacks the two id columns instead
df_uniquePlayers = pd.concat([df_playerIds['white_id'], df_playerIds['black_id']]).unique()
print(len(df_uniquePlayers))
With only 15635 unique players, well below the 40116 we would see if no player repeated, a portion of the games in the dataset were clearly played by repeat players. While it is important to look at opening data from unique players, it would also be interesting to look at the different openings a single player plays, which we could find using this dataset.
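To see which players account for the repeats, the two id columns can be stacked and counted. This is a sketch on a toy table with made-up usernames, assuming the `white_id` and `black_id` columns described above; `appearances` and `repeat_players` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the games table (hypothetical usernames).
games = pd.DataFrame({
    "white_id": ["player_a", "player_b", "player_a", "player_c"],
    "black_id": ["player_b", "player_a", "player_d", "player_a"],
})

# Stack both colour columns and count how many games each player appears in.
appearances = pd.concat([games["white_id"], games["black_id"]]).value_counts()

# Players who appear in more than one game.
repeat_players = appearances[appearances > 1]

print(len(appearances), "unique players")
print(repeat_players)
```

The same counts could later be used to pick out frequent players whose individual opening choices are worth examining.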
The dtypes are properly interpreted: boolean for the rated column, int and float for our quantitative variables, and object for the rest.
display(df.dtypes)
%%shell
jupyter nbconvert --to html /content/milestone_1.ipynb