Building a Starting XI With All Players With the Same Name
I love the transfer window! So many possibilities. Like which players will Manchester City get on their budget of the GDP of a small country. Or which player will make the mandatory annual surprise move to the MLS. Or…how many Gabriels can a club fit in their squad?
With the signing of Gabriel Jesus, Arsenal now have 4 players named Gabriel in the squad. This got me thinking:
What’s the best starting XI you can create with all players having the same name?
Best is definitely subjective so to further specify the problem statement, let’s lay out some basic criteria:
- Male players from all eras are eligible - if only to get more names to pick from.
- The name we pick has to be in the names of each of our 11 players - but it can anywhere in the name (first, last, or middle). So Daniel James and James McGinn would both be eligible for James XI.
- Latin diacritics in the names can be simplified to their english ASCII versions. Michaël Cuisance is eligible for Michael XI.
- But, hypocorisms or shortened versions of the name aren’t allowed if that is the exact first name. So Mikel Arteta is not eligible for Michael FC but Dani Parejo is eligible for Daniel FC because Dani is just short for his name Daniel.
- The final XI has to be a reasonably functional XI - you can’t make an XI of just defenders or goalkeepers.
[Seems like a reasonable set of criteria but spoiler alert, even this will be challenging with the data we have.]
Scraping
To answer this question, we’d need data. Mainly player names - lots of them - but also some basic biography about them - leagues, teams, position, years active etc. Since I don’t have access to any API, I have no choice but to scrape.
I scoured the most common public hunting grounds - transfermarkt, whoscored, sofascore. In the end, Wikipedia and FbRef seem like the best options.
-
Wikipedia has a page here which contains lists of lists-of-pages of football players.
-
FbRef has a player index page here which points to pages where players are segregated by their intitals.
I decided to get the data from fbref as it would be the easiest and fastest to scrape and also has a little more biography data. The index page is close to ideal as a starting point for us. There’s some summarized data for the players (strong emphasis on summarized because it is very minimal) and that’s what we’ll capture.
Code cell
What we are doing here is pretty simple. We’ll use the page links from the index page, visit each link, and collect the player names and other details from there. We finally save it to a csv file. With the 2.5 second sleep we have, the entire script should take approximately 25 minutes.
The data for each player is summarized so what we end up collecting has some major limitations. Namely,
a) It doesn’t have full names. This is important because it gives us more options. For example, Luke Ayling isn’t eligible for David XI but Luke David Ayling definitely is.
b) It doesn’t have any league information. It does have team names though so we can attempt to work that part out.
c) There’s no information about the player’s career beyond how long they were active for. How successful they’ve been, how popular they are and so on.
d) The position tags are only basic. For example, we have only DEF for all defenders, including center-backs and fullbacks. This makes things impossible harder to aumotate.
[Unfortunately, this is probably the best we can do without either hitting each player page (which would take a ridiculous amount of time) or getting access to an API. It is still a decent starting point, and considering this is just a random 2 AM idea, I am happy enough.]
Time for some data cleaning and EDA.
Data Cleaning
Code cell
We’ve got data on almost 1,70,000 players.
The other_details
column contains data about the years the player played, the position, and their team(s) - club or national team. It is a column of lists but they are all represented as strings. Also, not all rows have a list of those exact 3 strings so we need to clean it up.
df["other_details"].apply(eval).apply(len).value_counts()
3 122609
2 42524
1 2665
0 1819
4 6
We’ll have to deal with all of these in a case-by-case basis.
Some of the other data cleaning steps to finish:
- Replace the latin diacritics with their ASCII equivalents
- Split the names out to check which names are the most common amongst players
- Split years to get starting and ending years
[Data cleaning isn’t the most fun thing in the world to do, so you can skip the next code cell to go straight to the results.]
Code cell
Let’s find the most common names amongst all players.
Code cell
name | first_name | last_name | total | |
---|---|---|---|---|
0 | David | 1234 | 40 | 1274 |
1 | Daniel | 1121 | 25 | 1146 |
2 | Jose | 1113 | 6 | 1119 |
3 | Juan | 942 | 0 | 942 |
4 | Luis | 813 | 33 | 846 |
5 | Kevin | 815 | 0 | 815 |
6 | Carlos | 763 | 49 | 812 |
7 | Martin | 638 | 154 | 792 |
8 | Thomas | 632 | 143 | 775 |
9 | Michael | 734 | 13 | 747 |
10 | Ivan | 652 | 10 | 662 |
11 | Diego | 632 | 2 | 634 |
12 | Marco | 603 | 2 | 605 |
13 | Paul | 545 | 29 | 574 |
14 | Lucas | 509 | 47 | 556 |
15 | Alex | 542 | 4 | 546 |
16 | Christian | 539 | 6 | 545 |
17 | Nicolas | 522 | 9 | 531 |
18 | Mohamed | 464 | 62 | 526 |
19 | Antonio | 461 | 35 | 496 |
20 | Jorge | 448 | 18 | 466 |
21 | Sebastian | 463 | 2 | 465 |
22 | Jonathan | 452 | 1 | 453 |
23 | Ali | 374 | 76 | 450 |
24 | Victor | 428 | 19 | 447 |
25 | John | 406 | 30 | 436 |
26 | Mario | 419 | 4 | 423 |
27 | Lee | 365 | 55 | 420 |
28 | Luca | 416 | 2 | 418 |
29 | Peter | 400 | 16 | 416 |
30 | Gonzalez | 2 | 412 | 414 |
31 | Gabriel | 365 | 42 | 407 |
32 | Cristian | 398 | 2 | 400 |
33 | Pedro | 369 | 26 | 395 |
34 | Andrea | 392 | 3 | 395 |
David and Daniel seem to be our overall best bets with their being more than 100 more Davids than Daniels. The most common last name seems to be Gonzalez (check out the last_name
column) followed by Lopez and Perez.
Before we go further, let’s check where Gabriel ranks amongst names.
merged.query("name == 'Gabriel'").index + 1
##48
Intuitively, it makes sense to pick the most common names to make our XI. We have some viable names now. The next step is to make the XI.
Note: If you’re a walking player encyclopedia yourself and know more than enough Daniels or Pedros to make a starting XI, feel free to skip the rest of the post. I’m not so I’ll try to find viable names from the data I have itself.
df.start_year.describe()
"""
count 139667.000000
mean 2011.917540
std 8.736951
min 1930.000000
25% 2007.000000
50% 2015.000000
75% 2018.000000
max 2022.000000
"""
Seems like the oldest players we have is from 1930. There’s a lot of missing data in there and overall the data is very right skewed - which does make sense as the more recent players are natuarlly going to be covered better.
df.position.value_counts(normalize=True)*100
MF 32.931221
DF 27.958792
FW 19.976878
GK 11.107614
FW,MF 3.829648
DF,MF 3.681958
DF,FW 0.446096
DF,GK 0.033896
GK,MF 0.018159
FW,GK 0.015737
Almost a third of all players we have are midfielders, followed by defenders and forwards. Note there are some of these mixed position strings (DF,MF/FW,GK) - important to know when we run our queries.
Now since we have players from all eras and all leagues across the world, I want some way to rank our players. We need this because once I pick a name - say “Clarence”, I want to get all the Clarences in our dataset to choose from for our final Clarence XI. You could go through all Clarences and get 11 names but I want to see what more can I do to have the best names rise up to the top.
We don’t have any player value data (player worth, wages, etc) or which leagues they’ve played in. To confound matters further, we don’t have data on how long they’ve played for each team either - just the names of the teams they’ve played for. Nonetheless, I’ll create a ghetto rating system based on the limited data we have. It doesn’t have to be perfect, it only has to better than the alphabetical order and throw the most obvious names at the top. Here’s what I ended up doing:
Decide on three tiers of teams:
- Tier 1: The top teams from the top five leagues
- Tier 2: The remaining teams from the top five leagues
- Tier 3: Rest of the teams
tier_1 = ["Real Madrid",
"Manchester United",
"Barcelona",
"Juventus",
"Chelsea",
"Liverpool",
"Bayern Munich",
"Arsenal",
"Paris Saint-Germain",
"Manchester City",
"Borussia Dortmund",
"AC Milan",
"Napoli",
"Ajax",
"Tottenham",
"Atlético Madrid",
]
tier_2 = ['Marseille', 'Leverkusen', 'Sevilla', 'Monaco','Rennes', 'Nice', 'RB Leipzig', 'Betis', 'Lazio', 'Union Berlin',
'Köln', 'Manchester Utd', 'West Ham', 'Nantes', 'Athletic Club', 'Lille', 'Hellas Verona', 'Leicester City',
'Mainz 05', 'Hoffenheim', 'Brighton', 'Wolves', 'Torino', 'Sassuolo', "M'Gladbach", 'Newcastle Utd',
'Crystal Palace', 'Valencia', 'Brest', 'Udinese', 'Eint Frankfurt', 'Wolfsburg', 'Bochum', 'Osasuna',
'Celta Vigo', 'Reims', 'Brentford', 'Bologna', 'Aston Villa', 'Montpellier', 'Augsburg', 'Rayo Vallecano',
'Elche', 'Espanyol', 'Angers', 'Empoli', 'Southampton', 'Getafe', 'Cádiz', 'Everton', 'Mallorca', 'Troyes',
'Granada', 'Leeds United', 'Stuttgart', 'Hertha BSC', 'Sampdoria', 'Lorient', 'Spezia', 'Clermont Foot',
'Burnley', 'Levante', 'Saint-Étienne', 'Arminia', 'Alavés', 'Metz', 'Bordeaux', 'Salernitana', 'Cagliari',
'Genoa', 'Venezia', 'Watford', 'Norwich City',
]
Now, I’ll just give each player a True/False flag for if they’ve played for any of the teams from tier_1
or tier_2
.
df['is_tier_1'] = False
df.loc[(df['squad_nation'].str.contains('|'.join(tier_1))) & (df['squad_nation'] == df['squad_nation']), "is_tier_1"] = True
df['is_tier_2'] = False
df.loc[(df['squad_nation'].str.contains('|'.join(tier_2))) & (df['squad_nation'] == df['squad_nation']), "is_tier_2"] = True
Creating the Starting XI
Finally, we decide on 11 players.
player_name = "David" ##let's try the most common name
player_df = df.query(f"first_name == '{player_name}' or last_name == '{player_name}'")
player_df.sort_values(["is_tier_1",
"is_tier_2",
"end_year",
"playing_time"], ascending=[False, False, False, False],
inplace=True
)
for position in ["GK", "DF", "MF", "FW"]:
filtered = player_df.loc[player_df['position'].str.contains(position).fillna(False)].player_name.tolist() ## this is a better way because of the mixed position strings
print(position, ": ", filtered[:15], "\n")
"""
GK : ['David Ospina', 'David de Gea', 'David Cobeno', 'David Seaman', 'David Raya', 'David Stockdale', 'David Marshall', 'David Martin', 'David Button', 'David Soria', 'David Gil', 'David Mitrovic', 'David Oberhauser', 'David Lozancic', 'David Yelldell']
DF : ['David Alaba', 'David Lomban', 'David Lopez', 'David Costas', 'David Rozehnal', 'David Giubilato', 'David Sommeil', 'David May', 'David Burrows', "David O'Leary", 'David Luiz', 'David Mateos', 'David Lee', 'David Brightwell', 'David Kiki']
MF : ['David Alaba', 'David Silva', 'David Lopez', 'David Villa', 'David Pizarro', 'David Odonkor', 'David Hellebuyck', 'David Ginola', 'David Hopkin', 'David Platt', 'David Rocastle', 'David White', 'David Babunski', 'David Luiz', 'David Concha']
FW : ['David Silva', 'David Barral', "David N'Gog", 'David Villa', 'David Bellion', 'David Odonkor', 'David Trezeguet', 'David Ginola', 'David White', 'David Concha', 'David Neres', 'David Drocco', 'David Sesa', 'David Raya', 'David Henen']
"""
The goalkeeper position is definitely stacked but lots of solid names in the other positions too. Here’s the final 11 names I picked
Code cell
David XI seems pretty solid!
Perhaps an even more interesting problem would be to combine two less popular names (with less than say, 100 occurences each) to build the best possible starting XI. Some positions might be better stacked in certain names (by random chance if not by nominative determinism) so it would be an interesting optimization problem to get the best possible XI with players from the least popular name pools. You’d need a way to evaluate the final XIs (perhaps something to do with player value) too so perhaps that’s a problem for a later blog post?