Dictionaries
When related data is scattered across several lists, querying it is not very intuitive. You have to write extra lines of code just to link the lists together, which makes the code unclear and hard to read.
Example :
pop = [30.55, 2.77, 39.21]
countries = ["afghanistan", "albania", "algeria"]
ind_alb = countries.index("albania")
pop[ind_alb]  # -> 2.77
We can use a dictionary instead, which gathers the keys and their values into one object :
world = {"afghanistan": 30.55, "albania": 2.77, "algeria": 39.21}
world["albania"]  # -> 2.77
dict_name[key] -> returns the value stored under that key
# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin', 'norway':'oslo' }
# Print out the keys in europe
print(europe.keys())
# Print out value that belongs to key 'norway'
print(europe['norway'])
Keys must be immutable and unique: each key can appear only once. If you add a new key/value pair for 'france' to the dictionary in the example above, you will just replace the existing value for 'france' with the new one. Lists cannot be used as keys, because they are mutable.
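As a quick check of these rules, the sketch below reuses a trimmed version of the europe dictionary. The 'lyon' value, the tuple key and the list key are illustrative assumptions, not part of the original example.

```python
europe = {'spain': 'madrid', 'france': 'paris'}

# Assigning to an existing key replaces its value; no second 'france' entry appears.
europe['france'] = 'lyon'
n_keys = len(europe)
print(n_keys)  # 2

# Immutable objects such as tuples are valid keys.
europe[('norway', 'sweden')] = 'scandinavia'

# Lists are mutable, hence unhashable, so Python rejects them as keys.
try:
    europe[['italy', 'greece']] = 'mediterranean'
    list_key_ok = True
except TypeError:
    list_key_ok = False
print(list_key_ok)  # False
```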
Some additional features :
#Add a new key/value pair to the world dictionary
world["sealand"] = 0.000027
#Check if the sealand key is in the dictionary
"sealand" in world  # -> True
#Update the value for sealand
world["sealand"] = 0.000028
#Delete the entry
del(world["sealand"])
List vs Dictionary :
# Dictionary of dictionaries
europe = {
    'spain': { 'capital':'madrid', 'population':46.77 },
    'france': { 'capital':'paris', 'population':66.03 },
    'germany': { 'capital':'berlin', 'population':80.62 },
    'norway': { 'capital':'oslo', 'population':5.084 }
}
# Print out the capital of France
print(europe['france']['capital'])
# Create sub-dictionary data
data = {'capital':'rome', 'population':59.83}
# Add data to europe under key 'italy'
europe['italy'] = data
# Print europe
print(europe)
Pandas
When your data has multiple columns with different data types, 2D Numpy arrays are a poor fit, because every element of an array must have the same type. In that case we use Pandas, a high-level data manipulation tool built on Numpy that stores data in data frames.
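To make the difference concrete, here is a minimal sketch with two made-up rows of mixed data. The variable names are illustrative; the point is only that the array ends up with one type for everything, while the data frame keeps one type per column.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype, so mixing text and numbers
# turns every element into a string.
mixed = np.array([["Brazil", 8.516], ["Russia", 17.10]])
print(mixed.dtype.kind)  # 'U' stands for unicode string

# A DataFrame stores each column with its own dtype.
df = pd.DataFrame({"country": ["Brazil", "Russia"], "area": [8.516, 17.10]})
print(df.dtypes)
```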
This is a data frame :
Now to create a data frame :
import pandas as pd
# Named brics_dict rather than dict to avoid shadowing the built-in dict type
brics_dict = {
    "country":["Brazil", "Russia", "India", "China", "South Africa"],
    "capital":["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    "area":[8.516, 17.10, 3.286, 9.597, 1.221],
    "population":[200.4, 143.5, 1252, 1357, 52.98]
}
brics = pd.DataFrame(brics_dict)
To set the row labels, assign them to the index. With that, you have created the data frame shown above :
brics.index = ["BR", "RU", "IN", "CH", "SA"]
DataFrame from CSV file
brics = pd.read_csv("path/to/brics.csv", index_col = 0)
If you don't specify index_col when reading the file, the rows are labelled with default integers (0, 1, 2, ...) and the column holding the labels is read as ordinary data.
Here is an example from the exercises :
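The effect of index_col can be tried without a file on disk. The sketch below reads a small, hypothetical CSV text through io.StringIO, standing in for brics.csv, whose actual content is not shown here.

```python
import io
import pandas as pd

# Hypothetical CSV content; the first (unnamed) column holds the row labels.
csv_text = """,country,capital
BR,Brazil,Brasilia
RU,Russia,Moscow
"""

# With index_col=0 the first column becomes the row labels.
with_labels = pd.read_csv(io.StringIO(csv_text), index_col=0)
# Without it, pandas numbers the rows and keeps the labels as a regular column.
default_index = pd.read_csv(io.StringIO(csv_text))

print(list(with_labels.index))    # ['BR', 'RU']
print(list(default_index.index))  # [0, 1]
print(list(default_index.columns))
```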
import pandas as pd
# Build cars DataFrame
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr = [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
cars_dict = { 'country':names, 'drives_right':dr, 'cars_per_cap':cpc }
cars = pd.DataFrame(cars_dict)
print(cars)
# Definition of row_labels
row_labels = ['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG']
# Specify row labels of cars
cars.index = row_labels
# Print cars again
print(cars)
Now to select data in pandas :
You can use column access with square brackets. This returns the column as a new Python object : a Series (a 1D labelled array).
brics["country"]
#Out[4]:
#BR          Brazil
#RU          Russia
#IN           India
#CH           China
#SA    South Africa
#Name: country, dtype: object
If you want to return a dataframe you have to add another layer of brackets :
brics[["country"]]
#Out[4]:
#         country
#BR        Brazil
#RU        Russia
#IN         India
#CH         China
#SA  South Africa
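The difference between one and two pairs of brackets can be checked directly by looking at the type and shape of the result. This is a minimal sketch on a two-row version of brics, shortened here for brevity.

```python
import pandas as pd

brics = pd.DataFrame(
    {"country": ["Brazil", "Russia"], "capital": ["Brasilia", "Moscow"]},
    index=["BR", "RU"],
)

single = brics["country"]    # one pair of brackets
double = brics[["country"]]  # two pairs of brackets

print(type(single).__name__)       # Series
print(type(double).__name__)       # DataFrame
print(single.shape, double.shape)  # (2,) (2, 1)
```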
You can also select multiple columns and return them in a new dataframe :
brics[["country", "capital"]]
You can as well select rows in your dataframe like this :
brics[1:4]
#Out[9]:
#    country    capital    area  population
#RU   Russia     Moscow  17.100       143.5
#IN    India  New Delhi   3.286      1252.0
#CH    China    Beijing   9.597      1357.0
This is not ideal, as rows can only be selected by their position and only via slicing. Instead, we can select rows by their label with the “loc” tool :
#This will return the row for Russia as a Pandas Series
brics.loc["RU"]
#This will return the row for Russia as a Data frame
brics.loc[["RU"]]
#To return multiple rows :
brics.loc[["RU", "IN", "CH"]]
#To return multiple rows and some specific columns :
brics.loc[["RU", "IN", "CH"], ["country", "capital"]]
#To return all rows and some specific columns :
brics.loc[:, ["country", "capital"]]
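One reason to prefer labels over positions: a label keeps pointing at the same row when the row order changes. The sketch below uses a three-row version of brics and a sort on area, both chosen purely for illustration.

```python
import pandas as pd

brics = pd.DataFrame(
    {"country": ["Brazil", "Russia", "India"], "area": [8.516, 17.10, 3.286]},
    index=["BR", "RU", "IN"],
)

# Reorder the rows by sorting on area (ascending): India, Brazil, Russia.
by_area = brics.sort_values("area")

# The same positional slice now returns a different row...
print(brics[1:2]["country"].tolist())    # ['Russia']
print(by_area[1:2]["country"].tolist())  # ['Brazil']

# ...while the label still finds Russia.
print(by_area.loc["RU", "country"])      # Russia
```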
Recap :
Square brackets
- Column access : brics[["country", "capital"]]
- Row access: only through slicing : brics[1:4]
loc (label-based)
- Row access : brics.loc[["RU", "IN", "CH"]]
- Column access : brics.loc[:, ["country", "capital"]]
- Row & Column access : brics.loc[["RU", "IN", "CH"], ["country", "capital"]]
Next we will see row access through iloc, which selects rows by their integer position instead of their label.
#The code below will select the row at position 1, so Russia.
brics.iloc[[1]]
#The code below will select multiple rows
brics.iloc[[1,2,3]]
#The code below will select multiple rows and multiple columns :
brics.iloc[[1,2,3], [0, 1]]
#All rows, specific columns :
brics.iloc[:, [0,1]]
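To confirm that loc and iloc are two routes to the same data, the sketch below rebuilds the brics data frame from the values given earlier and compares a label-based selection with its position-based equivalent.

```python
import pandas as pd

brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
        "area": [8.516, 17.10, 3.286, 9.597, 1.221],
        "population": [200.4, 143.5, 1252, 1357, 52.98],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Same rows and columns, addressed by label and then by integer position.
by_label = brics.loc[["RU", "IN", "CH"], ["country", "capital"]]
by_position = brics.iloc[[1, 2, 3], [0, 1]]

print(by_label.equals(by_position))  # True
```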