Dictionaries

When your data is scattered across multiple lists, querying it is not very intuitive. You have to write extra lines of code just to find the right index, which makes the code unclear and hard to read.

Example :

pop = [30.55, 2.77, 39.21]
countries = ["afghanistan", "albania", "algeria"]
ind_alb = countries.index("albania") 
pop[ind_alb]
-> 2.77

We can use a dictionary instead, which aggregates the keys and their values into one object :

world = {"afghanistan": 30.55, "albania": 2.77, "algeria": 39.21}
world["albania"]
-> 2.77

dict_name[key] -> returns the value stored under that key

# Definition of dictionary
europe = {'spain':'madrid', 'france':'paris', 'germany':'berlin', 'norway':'oslo' }

# Print out the keys in europe
print(europe.keys())

# Print out value that belongs to key 'norway'
print(europe['norway'])

Keys must be unique : each key can appear only once. If you add a new key/value pair for 'france' to the dictionary in the example above, you will just update the value for 'france' with the new value entered. Keys must also be immutable, so you cannot use lists as keys.
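A minimal sketch of both rules, reusing the capitals from the example above. The 'lyon' value and the coords dictionary are only there for illustration :

```python
# Keys are unique: repeating a key in the literal keeps only the last value
europe = {'spain': 'madrid', 'france': 'paris', 'france': 'lyon'}
print(europe)  # {'spain': 'madrid', 'france': 'lyon'}

# Assigning to an existing key updates it instead of adding a second entry
europe['france'] = 'paris'
print(len(europe))  # 2

# Keys must be immutable (hashable): a tuple works, a list does not
coords = {(48.85, 2.35): 'paris'}
try:
    bad = {['france', 'paris']: 1}
except TypeError as err:
    list_key_error = str(err)
    print("TypeError:", list_key_error)
```

If you need a key made of several parts, a tuple is the usual replacement for a list.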

Some additional features :

# Add a new key/value pair to the world dictionary
world["sealand"] = 0.000027

# Check if the sealand key is in the dictionary
"sealand" in world
-> True

# Update the value for sealand
world["sealand"] = 0.000028

# Delete the entry (del is a statement, so no parentheses are needed)
del world["sealand"]

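List vs Dictionary : a list is indexed by position (0, 1, 2, ...), a dictionary is indexed by unique keys. The values of a dictionary can be anything, including other dictionaries :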

# Dictionary of dictionaries
europe = { 'spain': { 'capital':'madrid', 'population':46.77 },
           'france': { 'capital':'paris', 'population':66.03 },
           'germany': { 'capital':'berlin', 'population':80.62 },
           'norway': { 'capital':'oslo', 'population':5.084 } }


# Print out the capital of France
print(europe['france']['capital'])

# Create sub-dictionary data
data = {'capital': 'rome', 'population': 59.83}

# Add data to europe under key 'italy'
europe['italy'] = data

# Print europe
print(europe)

Pandas

When your data has multiple columns with different data types, you cannot use 2D NumPy arrays, as they require all elements to have the same type. In that case we use pandas, a high-level data manipulation tool built on NumPy that stores data in data frames.
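To see the difference, here is a small sketch. The two-row sample is made up for the illustration; the point is only what happens to the types :

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single data type: mixing text and numbers
# silently converts every element to a string
mixed = np.array([["Brazil", 8.516], ["Russia", 17.10]])
print(mixed.dtype)

# A pandas DataFrame keeps one type per column, so "area" stays numeric
frame = pd.DataFrame({"country": ["Brazil", "Russia"], "area": [8.516, 17.10]})
print(frame.dtypes)
print(frame["area"].sum())
```

With the array, the areas are now text and you would have to convert them back before doing any math on them. With the data frame they are still floats.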

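This is a data frame (the brics table that we build below) :

#   country      capital   area   population
#BR Brazil       Brasilia  8.516  200.40
#RU Russia       Moscow    17.100 143.50
#IN India        New Delhi 3.286  1252.00
#CH China        Beijing   9.597  1357.00
#SA South Africa Pretoria  1.221  52.98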

Now to create a data frame :

import pandas as pd

# Avoid naming the variable "dict" : that would shadow Python's built-in dict type
brics_dict = {
    "country": ["Brazil", "Russia", "India", "China", "South Africa"],
    "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    "area": [8.516, 17.10, 3.286, 9.597, 1.221],
    "population": [200.4, 143.5, 1252, 1357, 52.98],
}

brics = pd.DataFrame(brics_dict)

Now, to add the row labels, you have to set the index. With that, the data frame is complete :

brics.index = ["BR", "RU", "IN", "CH", "SA"]

DataFrame from CSV file

brics = pd.read_csv("path/to/brics.csv", index_col = 0)

If you don't specify index_col when reading the file, pandas will number the rows 0, 1, 2, ... instead of using the labels in the first column as the index.
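A small sketch of the difference. To keep it self-contained it reads from an in-memory string instead of a real file, and the CSV content is a hypothetical two-row extract :

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk (hypothetical content);
# the first column has no header and holds the row labels
csv_text = """,country,capital
BR,Brazil,Brasilia
RU,Russia,Moscow
"""

with_labels = pd.read_csv(io.StringIO(csv_text), index_col=0)
without_labels = pd.read_csv(io.StringIO(csv_text))

print(list(with_labels.index))     # ['BR', 'RU']
print(list(without_labels.index))  # [0, 1]

# Without index_col the labels are not lost, they end up in a regular column
print(len(without_labels.columns))
```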

Here is an example from the exercises :

import pandas as pd

# Build cars DataFrame
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]
cars_dict = { 'country':names, 'drives_right':dr, 'cars_per_cap':cpc }
cars = pd.DataFrame(cars_dict)
print(cars)

# Definition of row_labels
row_labels = ['US', 'AUS', 'JPN', 'IN', 'RU', 'MOR', 'EG']

# Specify row labels of cars
cars.index = row_labels

# Print cars again
print(cars)

Now to select data in pandas :

You can use column access with square brackets. This returns the column as a new Python object : a Series (a 1D labelled array).

brics["country"]
 
#Out[4]:
#BR Brazil
#RU Russia
#IN India
#CH China
#SA South Africa
#Name: country, dtype: object 

If you want to return a dataframe you have to add another layer of brackets :

brics[["country"]] 

#Out[4]:
#   country
#BR Brazil
#RU Russia
#IN India
#CH China
#SA South Africa
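If you are not sure which of the two you got back, you can simply check the type. A minimal sketch on a shortened, two-row version of brics :

```python
import pandas as pd

# Shortened version of the brics data frame, just for this check
brics = pd.DataFrame(
    {"country": ["Brazil", "Russia"], "capital": ["Brasilia", "Moscow"]},
    index=["BR", "RU"],
)

single = brics["country"]    # one pair of brackets
double = brics[["country"]]  # two pairs of brackets

print(type(single).__name__)  # Series
print(type(double).__name__)  # DataFrame
print(double.shape)           # (2, 1)
```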

You can also select multiple columns and return them in a new dataframe :

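brics[["country", "capital"]]

You can also select rows in your data frame with a slice. The slice is position-based and the end is excluded, so this returns the rows at positions 1, 2 and 3 :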

brics[1:4] 

#Out[9]:
#   country capital   area   population
#RU Russia  Moscow    17.100 143.5
#IN India   New Delhi 3.286  1252.0
#CH China   Beijing   9.597  1357.0

This is obviously not ideal, as selecting rows only by position and only through slicing is limiting. Instead, we can select rows by their label with "loc" :

# This will return the row for Russia as a pandas Series
brics.loc["RU"]
# This will return the row for Russia as a data frame
brics.loc[["RU"]]
# To return multiple rows :
brics.loc[["RU", "IN", "CH"]]
# To return multiple rows and some specific columns :
brics.loc[["RU", "IN", "CH"], ["country", "capital"]]
# To return all rows and some specific columns :
brics.loc[:, ["country", "capital"]]

Recap :

Square brackets

  • Column access : brics[["country", "capital"]]
  • Row access: only through slicing : brics[1:4]

loc (label-based)

  • Row access : brics.loc[["RU", "IN", "CH"]]
  • Column access : brics.loc[:, ["country", "capital"]]
  • Row & Column access : brics.loc[["RU", "IN", "CH"], ["country", "capital"]]

Next, we will see row access through iloc, which selects rows by their integer position instead of their label.

# The code below will select the row at position 1, so Russia
brics.iloc[[1]]

# The code below will select multiple rows
brics.iloc[[1, 2, 3]]

# The code below will select multiple rows and multiple columns :
brics.iloc[[1, 2, 3], [0, 1]]

# All rows, specific columns :
brics.iloc[:, [0, 1]]
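To convince yourself that loc and iloc are two ways to reach the same data, you can compare their results. A small sketch that rebuilds brics so it can run on its own (passing index= to pd.DataFrame is just a shortcut for setting brics.index afterwards) :

```python
import pandas as pd

brics = pd.DataFrame(
    {
        "country": ["Brazil", "Russia", "India", "China", "South Africa"],
        "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
        "area": [8.516, 17.10, 3.286, 9.597, 1.221],
        "population": [200.4, 143.5, 1252, 1357, 52.98],
    },
    index=["BR", "RU", "IN", "CH", "SA"],
)

# Same rows and columns, once by label and once by position
by_label = brics.loc[["RU", "IN", "CH"], ["country", "capital"]]
by_position = brics.iloc[[1, 2, 3], [0, 1]]
print(by_label.equals(by_position))  # True

# Plain square-bracket slicing is position-based too
print(brics[1:4].equals(brics.iloc[1:4]))  # True
```

Labels are usually easier to read, positions are handy when you do not know or care about the labels.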


Brax

Dude in his 30s starting his digital notepad