Python Basics
You can execute commands in the Python shell, line by line.
You can also create Python scripts in text files with the extension .py.
You can create variables, do operations on them, and then use print() to generate output from the script:
    savings = 100
    growth_multiplier = 1.1
    result = savings * growth_multiplier
    print(result)
What is a variable? A specific, case-sensitive name. You call up the value through the variable name.
    # In the Python shell
    height = 1.79
    weight = 68.7
    height
    # prints out 1.79

    # Calculate BMI
    # BMI = weight / height^2
    height = 1.79
    weight = 68.7
    bmi = weight / height ** 2
    # **2 squares the value, **3 cubes it, and so on
    print(bmi)
Python Types
    type(bmi)
    # gives the type of the variable
    # in this case it will be a float
    day_of_week = 5
    type(day_of_week)
    # in this case it will be an int (integer)
    x = 'body mass index'
    y = 'this works too'
    type(y)
    # in this case it will be a str (string)
    z = True
    type(z)
    # this will be a bool (boolean, so either True or False)
    2 + 3
    # this will print 5 in the shell
    'ab' + 'cd'
    # this will print 'abcd'
    # Different types = different behaviour
    # To print numbers together with strings, call the function str()
    # on the numbers first, for example:
    print("I started with $" + str(savings) + " and now have $" + str(result) + ". Awesome!")
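As a small sketch of "different types, different behaviour", the snippet below reuses the savings example. The f-string on the last assignment is an alternative that these notes do not otherwise cover, shown here only for comparison.

```python
savings = 100
growth_multiplier = 1.1
result = savings * growth_multiplier

# + adds numbers but concatenates strings
total = 2 + 3
joined = "ab" + "cd"

# "text" + 100 would raise a TypeError, so convert numbers with str() first
message = "I started with $" + str(savings) + " and now have $" + str(result)

# An f-string performs the same conversion implicitly (alternative style)
message_f = f"I started with ${savings} and now have ${result}"

print(type(savings), type(result), type(message))
print(message)
```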
Python Lists
    # Python data types
    # float - real numbers
    # int - integer numbers
    # str - string, text
    # bool - True, False
    height = 1.73
    tall = True
    # Each variable represents a single value
    # Problem: we usually have many data points, as is typical in data science
    height1 = 1.73
    height2 = 1.68
    height3 = 1.71
    height4 = 1.89
    # We can store those in a list, written [a, b, c]
    fam = [1.73, 1.68, 1.71, 1.89]
    # A list names a collection of values
    # It can contain any type
    # It can contain different types
    fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89]
    # prints: ['liz', 1.73, 'emma', 1.68, 'mom', 1.71, 'dad', 1.89]
    fam2 = [["liz", 1.73], ["emma", 1.68], ["mom", 1.71], ["dad", 1.89]]
    # a list of lists holds the same data, grouped per person, which may be more convenient
    type(fam)
    # the type of fam is 'list'
    # Lists have their own specific functionality and behavior
Subsetting lists
    fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89]
    fam[3]
    # index 3 of the list gives the value 1.68, because the index
    # starts at 0; fam[0] selects the first value, 'liz'
    fam[-1]
    # this takes the last element of the list, so 1.89 in our example
    # you can also slice the list, taking a part of it in one go:
    fam[3:5]
    # this selects everything from index 3 (inclusive) to index 5 (exclusive)
    # In our example: [1.68, 'mom']
    # you can also leave out one side of the ':' to get either all the
    # elements before an index or all the elements from it onwards
    fam[:4]
    fam[4:]
    # the first takes all the elements before index 4 (exclusive):
    # ['liz', 1.73, 'emma', 1.68]
    # the second takes all the elements from index 4 (inclusive):
    # ['mom', 1.71, 'dad', 1.89]
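The slicing rules above (start inclusive, end exclusive) can be checked with a short sketch. The variable names `middle`, `first_part` and `second_part` are only illustrative.

```python
fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89]

middle = fam[3:5]        # indexes 3 and 4, index 5 is excluded
first_part = fam[:4]     # indexes 0 to 3
second_part = fam[4:]    # index 4 to the end

# Because the end is exclusive, the two halves fit together without overlap
rebuilt = first_part + second_part

print(middle)
print(rebuilt == fam)
```

A slice `fam[a:b]` contains `b - a` elements (as long as both indexes are within the list), which is one reason the exclusive end is convenient.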
List Manipulation
Changing list elements
    fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89]
    # to change an element, assign a new value to a specific index
    fam[7] = 1.86
    # or change multiple values at once by slicing
    fam[0:2] = ["lisa", 1.74]
Adding new elements to a list
    # You can add elements by concatenating lists with +
    fam_1 = fam + ["me", 1.79]
    # And you can remove elements from a list using the del statement
    del fam[2]
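One detail worth seeing in action: `+` builds a new list and leaves the original alone, while `del` changes the list in place and shifts the later elements to the left. A minimal sketch using the same `fam` list:

```python
fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89]

# + returns a new list; fam itself still has 8 elements
fam_1 = fam + ["me", 1.79]

# del removes "emma" in place; her height 1.68 now sits at index 2
del fam[2]
height_left_behind = fam[2]

print(len(fam_1), len(fam))
print(fam)
```

To remove a name together with its height, you could delete a slice instead, for example `del fam[2:4]` on the original list.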
List of lists
    # You can also build lists of lists
    # area variables (in square meters)
    hall = 11.25
    kit = 18.0
    liv = 20.0
    bed = 10.75
    bath = 9.50
    # house information as list of lists
    house = [["hallway", hall],
             ["kitchen", kit],
             ["living room", liv],
             ["bedroom", bed],
             ["bathroom", bath]]
Functions
    fam = [1.73, 1.68, 1.71, 1.89]
    print(max(fam))
    # The function max() returns the highest number in the list; you can also
    # store the result in a variable:
    tallest = max(fam)
    round(1.68, 1)
    # round() rounds the number to the number of decimals given in the
    # second parameter; in our example it is 1, so the number is rounded to 1.7
    # You can use help(round) to get instructions on how to use the function
    # Some other functions:
    # Create variables var1 and var2
    var1 = [1, 2, 3, 4]
    var2 = True
    # Print out type of var1
    print(type(var1))
    # Print out length of var1
    print(len(var1))
    # Convert var2 to an integer: out2
    out2 = int(var2)
    # Create lists first and second
    first = [11.25, 18.0, 20.0]
    second = [10.75, 9.50]
    # Paste together first and second: full
    full = first + second
    # Sort full in descending order: full_sorted
    full_sorted = sorted(full, reverse=True)
If you are doing a standard task, a function probably exists to do it!
List methods
    fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89]
    fam.index("mom")
    # returns the index of the element 'mom' in the list fam (here 4)
    fam.count(1.73)
    # returns the number of times the given element appears in the list
    fam.append('me')
    # adds a new element 'me' at the end of the list
    # Some additional methods:
    # append() adds an element to the list it is called on,
    # remove() removes the first element of a list that matches the input,
    # reverse() reverses the order of the elements in the list it is called on.
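The methods listed above fall into two groups: `index()` and `count()` only report something, while `append()`, `remove()` and `reverse()` change the list in place and return `None`. A short sketch, with the variable name `returned` chosen just for illustration:

```python
fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89]

# Methods that only look at the list
mom_index = fam.index("mom")
count_173 = fam.count(1.73)

# Methods that modify the list in place
fam.append("me")
fam.append(1.79)
fam.remove("emma")         # removes the first matching element only
returned = fam.reverse()   # reverses in place and returns None

print(mom_index, count_173, returned)
print(fam)
```

Because these methods return `None`, writing `fam = fam.reverse()` would throw the list away; call the method on its own line instead.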
Str methods
    sister = 'liza'
    sister.capitalize()
    # returns the string with its first character in upper case
    sister.replace("z", "sa")
    # returns the string with 'z' replaced by 'sa'
    sister.index('z')
    # returns 2, as this is the index of 'z' in 'liza'
    place = "poolhouse"
    # Use upper() on place: place_up
    place_up = place.upper()
    # upper() puts all the letters of place in upper case
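Unlike list methods such as `append()`, string methods never change the string they are called on; they return a new string. The sketch below stores each result in its own (illustratively named) variable to make that visible.

```python
sister = "liza"

capitalized = sister.capitalize()
replaced = sister.replace("z", "sa")
position = sister.index("z")

place = "poolhouse"
place_up = place.upper()

# The original strings are unchanged
print(sister, capitalized, replaced, position)
print(place, place_up)
```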
Packages
Functions and methods are powerful. A package is a directory of Python scripts, and each script is a module. A package gives you access to new functions, methods and types to use in your code. You have to install a package and then import it before you can use the functions and other tools built into it.
Some popular packages:
NumPy, Matplotlib, scikit-learn
Download pip:
    # In the terminal:
    python3 get-pip.py
    # Once you have pip (the package installer tool) you can install packages:
    pip3 install numpy
In the script, you then have to import the packages:
    import numpy as np
    np.array([1, 2, 3])
    # You imported the numpy package and can now use arrays!
    import numpy
    numpy.array([1, 2, 3])
The math package
    # Definition of radius
    r = 0.43
    # Import the math package
    import math
    # Calculate C
    C = 2 * math.pi * r
    # Calculate A
    A = math.pi * r ** 2
    # Build printout
    print("Circumference: " + str(C))
    print("Area: " + str(A))

    # Definition of radius
    r = 192500
    # Import radians function of math package
    from math import radians
    # Travel distance of Moon over 12 degrees. Store in dist.
    dist = r * radians(12)
    # Print out dist
    print(dist)
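The examples above use three import styles: a plain `import`, an `import ... as` alias, and `from ... import`. The sketch below applies all three to `math` (which ships with Python, so nothing needs installing) to show that they reach the same objects. The alias `m` is arbitrary.

```python
import math                    # access as math.pi
import math as m               # same module under a shorter alias
from math import pi, radians   # bring selected names in directly

r = 0.43

c_plain = 2 * math.pi * r
c_alias = 2 * m.pi * r
c_direct = 2 * pi * r

# radians() converts degrees to radians: 180 degrees is pi radians
half_turn = radians(180)

print(c_plain == c_alias == c_direct)
print(math.isclose(half_turn, pi))
```

The `from ... import` form is the shortest to type but hides where a name came from, which is why many scripts prefer the aliased form for large packages.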
Numpy
NumPy is a fundamental Python package for practicing data science efficiently.
You cannot do element-wise calculations directly on regular Python lists. For example:
    height = [1.73, 1.68, 1.71, 1.89, 1.79]
    weight = [65.4, 59.2, 63.6, 88.4, 68.7]
    weight / height ** 2
    # You get an error:
    # TypeError: unsupported operand type(s) for **: 'list' and 'int'
NumPy stands for Numerical Python. It offers an alternative to Python lists: the NumPy array. It can do calculations over entire arrays, easily and fast. You just have to install the numpy package with:
    # In the terminal
    pip3 install numpy
    import numpy as np
    height = [1.73, 1.68, 1.71, 1.89, 1.79]
    weight = [65.4, 59.2, 63.6, 88.4, 68.7]
    np_height = np.array(height)
    # np_height = array([ 1.73, 1.68, 1.71, 1.89, 1.79])
    np_weight = np.array(weight)
    # np_weight = array([ 65.4, 59.2, 63.6, 88.4, 68.7])
    bmi = np_weight / np_height ** 2
    print(bmi)
    # array([ 21.852, 20.975, 21.75 , 24.747, 21.441])
This time it worked! We could take the whole collection of data and perform the operations element by element across the arrays.
NumPy arrays can contain only one type of data
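To make "element by element" concrete, the sketch below computes the same BMI values twice: once with plain lists, pairing the values up by hand, and once with NumPy arrays. The list comprehension with `zip()` is an assumption about how one might do this without NumPy; it is not covered elsewhere in these notes.

```python
import numpy as np

height = [1.73, 1.68, 1.71, 1.89, 1.79]
weight = [65.4, 59.2, 63.6, 88.4, 68.7]

# Plain Python: pair each weight with its height and compute one value at a time
bmi_list = [w / h ** 2 for w, h in zip(weight, height)]

# NumPy: the same arithmetic written once, applied to every element
bmi_np = np.array(weight) / np.array(height) ** 2

print(np.allclose(bmi_list, bmi_np))
print(np.round(bmi_np, 3))
```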
    # If you do:
    np.array([1.0, "is", True])
    # then all the elements are converted to strings automatically
    # -> array(['1.0', 'is', 'True'], ...)
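You can inspect which single type NumPy settled on through the array's `dtype` attribute. The sketch below is illustrative; it checks `dtype.kind` (a one-letter category code) rather than the full dtype name, because the exact integer and string sizes can differ between platforms and NumPy versions.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed_numbers = np.array([1, 2.5, True])     # everything becomes a float
mixed_text = np.array([1.0, "is", True])     # everything becomes a string

# kind codes: 'i' = integer, 'f' = float, 'U' = Unicode string
print(ints.dtype.kind, mixed_numbers.dtype.kind, mixed_text.dtype.kind)
print(mixed_numbers)
print(mixed_text)
```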
Different types, different behaviors:
    python_list = [1, 2, 3]
    numpy_array = np.array([1, 2, 3])
    python_list + python_list
    # [1, 2, 3, 1, 2, 3]
    numpy_array + numpy_array
    # array([2, 4, 6])
Numpy Subsetting
    bmi
    # array([ 21.852, 20.975, 21.75 , 24.747, 21.441])
    bmi[1]
    # 20.975
    bmi > 23
    # array([False, False, False, True, False], dtype=bool)
    # returns an array of booleans saying whether each element is bigger than 23
    bmi[bmi > 23]
    # array([ 24.747])
    # returns the elements for which the condition is True
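Boolean subsetting is easier to follow when the mask gets its own name. The sketch below uses rounded BMI values from the example above; combining two conditions with `&` goes a step beyond these notes and is included only as an illustration (each condition needs its own parentheses).

```python
import numpy as np

bmi = np.array([21.852, 20.975, 21.75, 24.747, 21.441])

# Step 1: the comparison produces one boolean per element
mask = bmi > 23

# Step 2: indexing with the mask keeps only the True positions
high = bmi[mask]

# Two conditions combined element-wise with &
normal_band = bmi[(bmi > 21) & (bmi < 22)]

print(mask)
print(high)
print(normal_band)
```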
2D Numpy Arrays
    np_2d = np.array([[1.73, 1.68, 1.71, 1.89, 1.79],
                      [65.4, 59.2, 63.6, 88.4, 68.7]])
    np_2d.shape
    # Returns the number of rows and columns
    # in this case (2, 5): 2 rows and 5 columns, because there are 2 lists
    # of 5 elements each
Subsetting
    #            0      1      2      3      4
    # array([[  1.73,  1.68,  1.71,  1.89,  1.79],    0
    #        [ 65.4 , 59.2 , 63.6 , 88.4 , 68.7 ]])   1
    np_2d[0]
    # Returns the first row
    # array([ 1.73, 1.68, 1.71, 1.89, 1.79])
    np_2d[0][2]
    # Returns the element with index 2 of the first row
    # 1.71
    np_2d[0, 2]
    # Same result: returns 1.71
    np_2d[:, 1:3]
    # Returns columns 1 and 2 (index 3 is excluded) from both rows
    # array([[ 1.68, 1.71],
    #        [ 59.2 , 63.6 ]])
    np_2d[1, :]
    # Returns everything from the second row
    # array([ 65.4, 59.2, 63.6, 88.4, 68.7])
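In the `[row, column]` notation, each position takes either a single index or a slice, and the shape of the result tells you which you used. A brief sketch on the same array, with illustrative variable names:

```python
import numpy as np

np_2d = np.array([[1.73, 1.68, 1.71, 1.89, 1.79],
                  [65.4, 59.2, 63.6, 88.4, 68.7]])

element = np_2d[0, 2]        # single row, single column: one number
second_row = np_2d[1, :]     # single row, all columns: 1D array
middle_cols = np_2d[:, 1:3]  # all rows, columns 1 and 2: 2D array

print(element)
print(second_row.shape, middle_cols.shape)
```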
You can also multiply arrays element by element. Imagine you have a 2D array in which each row represents one player and the three columns hold the height, the weight and the age. Now you want to convert the height and the weight of all players from imperial units (inches and pounds) to metric ones (meters and kilograms).
You can create an array with the multiplying factors and simply multiply both arrays!
    # baseball is available as a regular list of lists
    # updated is available as 2D numpy array
    # Import numpy package
    import numpy as np
    # Create np_baseball (3 cols: height, weight, age)
    np_baseball = np.array(baseball)
    # Create numpy array: conversion
    # (inches -> meters, pounds -> kilograms, age unchanged)
    conversion = np.array([0.0254, 0.453592, 1])
    # Print out product of np_baseball and conversion
    print(np_baseball * conversion)
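Since `baseball` is not defined in these notes, here is a self-contained sketch with two made-up players (the numbers are invented for illustration, not real data). It shows how the three conversion factors are applied to every row: NumPy lines up the 3-element array with the 3 columns.

```python
import numpy as np

# Hypothetical sample: height (inches), weight (pounds), age (years)
baseball = [[74, 180, 23],
            [72, 210, 31]]

np_baseball = np.array(baseball)

# One factor per column: inches -> meters, pounds -> kilograms, age unchanged
conversion = np.array([0.0254, 0.453592, 1])

# Shapes (2, 3) and (3,): the factors are reused for each row
converted = np_baseball * conversion

print(converted.shape)
print(converted)
```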
Numpy : basic statistics
When you have huge amounts of data, it is impossible to spot problems or trends just by looking at the raw numbers. Summary statistics help here, and NumPy provides functions to compute them.
    # City-wide example
    import numpy as np
    np_city = ...  # Implementation left out
    np_city
    # the array contains thousands of rows, one for each citizen, with height
    # and weight, like this:
    # array([[1.64, 71.78],
    #        [1.37, 63.35],
    #        [1.6 , 55.09],
    #        ...,
    #        [2.04, 74.85],
    #        [2.04, 68.72],
    #        [2.01, 73.57]])
    np.mean(np_city[:, 0])
    # returns the mean of the first column over all rows
    # for example: 1.7472
    # the mean is the average
    np.median(np_city[:, 0])
    # returns the median of the first column over all rows
    # for example: 1.75
    # the median is the value with half of the people below it and half above
    np.corrcoef(np_city[:, 0], np_city[:, 1])
    # corrcoef() calculates correlation coefficients
    # array([[ 1. , -0.01802],
    #        [-0.01803, 1. ]])
    np.std(np_city[:, 0])
    # computes the standard deviation of the first column over all rows
    # for instance: 0.1992 in our dataset
    # You can also use sum(), sort(), and so on

    # Selecting specific values
    # Imagine you have one array with the heights
    # and another array with the position of each player (goalkeeper, defender, ...)
    # To get the heights of all the goalkeepers, you do:
    gk_heights = np_heights[np_positions == 'GK']
    # the comparison gives True/False for each position, and those booleans
    # select the matching values in the heights array
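Because `np_city`, `np_heights` and `np_positions` are not defined in these notes, the sketch below runs the same functions on a tiny made-up dataset (four invented people, with positions added just to show the filtering). The weights here are chosen to rise in step with the heights, so the correlation comes out close to 1.

```python
import numpy as np

# Hypothetical data: height (m) and weight (kg) for four people
np_city = np.array([[1.60, 55.0],
                    [1.70, 65.0],
                    [1.80, 75.0],
                    [1.90, 85.0]])
np_heights = np_city[:, 0]
np_positions = np.array(["GK", "D", "M", "GK"])   # made-up positions

mean_height = np.mean(np_heights)
median_height = np.median(np_heights)
std_height = np.std(np_heights)
corr = np.corrcoef(np_city[:, 0], np_city[:, 1])

# Heights of the goalkeepers only
gk_heights = np_heights[np_positions == "GK"]

print(mean_height, median_height, std_height)
print(corr)
print(gk_heights)
```

`np.corrcoef()` returns a 2x2 matrix: the diagonal is each column's correlation with itself, and the off-diagonal entries hold the correlation between the two columns.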
Generate data
    # Arguments for np.random.normal():
    #   distribution mean
    #   distribution standard deviation
    #   number of samples
    height = np.round(np.random.normal(1.75, 0.20, 5000), 2)
    weight = np.round(np.random.normal(60.32, 15, 5000), 2)
    np_city = np.column_stack((height, weight))
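A quick way to sanity-check the generated data is to look at its shape and confirm that the sample mean and standard deviation land near the values you asked for. The seed below is an arbitrary choice, added only so that repeated runs give the same numbers.

```python
import numpy as np

np.random.seed(42)  # arbitrary seed for reproducibility (an addition, not in the notes)

height = np.round(np.random.normal(1.75, 0.20, 5000), 2)
weight = np.round(np.random.normal(60.32, 15, 5000), 2)

# Stack the two 1D arrays side by side as columns
np_city = np.column_stack((height, weight))

sample_mean = np.mean(np_city[:, 0])
sample_std = np.std(np_city[:, 0])

print(np_city.shape)
print(sample_mean, sample_std)
```

With 5000 samples, the sample statistics should be close to, but not exactly, 1.75 and 0.20.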