This is my first time learning Python. I mostly use R in my projects and, for all I know, whatever can be done in Python can also be done in R. But as an aspiring data scientist, I shouldn't stick with R all the time. I enrolled in this course by Microsoft because it is centered on data science, though it suits anyone interested in Python, not just data scientists. Join me as I write down my takeaways here.
Version 3.x – https://www.python.org/downloads/
Python Script – Text Files .py
Basics
print(3 + 4) # add
print(4 - 3) # subtract
print(4 * 3) # multiply
print(4 / 2) # divide
print(4 ** 2) # exponent, 4²
print(4 % 2) # modulo
Variables
height = 1.79
weight = 68.7
bmi = weight / height ** 2
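The BMI line works because ** binds tighter than /, so the height is squared before the division happens. Here is a small sketch of that; the helper name bmi_from is my own and not part of the course.

```python
# Hypothetical helper for illustration; the name bmi_from is an assumption.
def bmi_from(weight_kg, height_m):
    """Return body mass index: weight divided by height squared."""
    # ** has higher precedence than /, so this is weight_kg / (height_m ** 2)
    return weight_kg / height_m ** 2

height = 1.79
weight = 68.7
bmi = bmi_from(weight, height)

# Explicit parentheses give the same result
same = weight / (height ** 2)
print(round(bmi, 2))
```

Wrapping the formula in a function is just a convenience; the one-liner above does the same thing.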
Types
type(bmi) # float
type(5) # int
type("body mass index") # str
type('this works too') # str
type(True) # bool
print(2 + 3) # 5
print('ab' + 'cd') # abcd
"I said " + ("Hey " * 2) + "Hey!" # 'I said Hey Hey Hey!'
str(5) # convert 5 to the string "5"
int(True) # convert True to 1
bool("True") # True, but only because the string is non-empty; bool("False") is True as well
float(1) # convert 1 to 1.0
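The conversions above have one trap worth seeing in action: bool() on a string checks whether it is empty, not what it says. A quick sketch, with variable names I made up for illustration:

```python
# Variable names here are illustrative, not from the course.
truthy_text = bool("False")   # True: any non-empty string is truthy
empty_text = bool("")         # False: only the empty string is falsy

# Strings must be converted before doing arithmetic with them
total = int("42") + 1         # 43
doubled = float("1.5") * 2    # 3.0

# Numbers must be converted before concatenating with a string
label = "height: " + str(1.79)
print(truthy_text, empty_text, total, doubled, label)
```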
Lists
fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89, ["a", "b"], ["c", "d"]] # can contain different types, even other lists
type(fam) # list
fam[3] # 1.68, zero-based indexing
fam[-1] # ["c", "d"], negative indexes count from the end
fam[-3] # 1.89
fam[3:5] # [1.68, "mom"], [start:end] with start inclusive and end exclusive
fam[:4] # indexes 0 to 3: ["liz", 1.73, "emma", 1.68]
fam[5:] # index 5 to the last: [1.71, "dad", 1.89, ["a", "b"], ["c", "d"]]
fam[0:2] = ["lisa", 1.74] # fam = ["lisa", 1.74, "emma", 1.68, "mom", 1.71, "dad", 1.89, ["a", "b"], ["c", "d"]]
fam = fam + ["me", 1.79] # re-assign to keep the result; fam + [...] on its own returns a new list and leaves fam unchanged
del(fam[2]) # removes "emma": ["lisa", 1.74, 1.68, "mom", 1.71, "dad", 1.89, ["a", "b"], ["c", "d"], "me", 1.79]
x = ["a", "b", "c"]
y = x
y[1] = "z" # x[1] is also "z" because y = x copied the reference to the list, not the values themselves
y = list(x) # or y = x[:]; either one creates a new list with the same elements
fam.index("mom") # finds "mom" and returns its index: 3 here, because "emma" was deleted above
fam.count(1.74) # counts the number of times 1.74 occurs in the list; returns 1
first = [11.25, 18.0, 20.0]
second = [10.75, 9.50]
full = first + second # paste together
full_sorted = sorted(full, reverse=True) # sort in descending order
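The reference-versus-copy behavior from the x and y lines is easy to get wrong, so here is a minimal sketch that puts both side by side. The names alias and x_copy are mine, chosen for illustration.

```python
x = ["a", "b", "c"]
alias = x          # a second name for the same list object
x_copy = list(x)   # a new list holding the same elements (a shallow copy)

alias[1] = "z"     # changes the one list that both x and alias refer to

print(x)           # ['a', 'z', 'c']
print(x_copy)      # ['a', 'b', 'c']
print(alias is x)  # True
print(x_copy is x) # False
```

The is operator checks whether two names point at the same object, which is a handy way to confirm whether you made a copy.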
Functions
max(full) # maximum value in the list: 20.0; max(fam) would raise a TypeError in Python 3 because fam mixes strings and numbers
round(1.68, 1) # round 1.68 to 1 decimal place, 1.7
round(1.68) # round to nearest whole number
help(round) # opens documentation of round function
len(fam) # length of list
Methods
Methods are functions that belong to an object; you call them on the object with dot notation, and which methods are available depends on the object's type.
sister = 'liz'
sister.capitalize() # 'Liz'
sister.replace("z", "sa") # 'lisa'
sister.index("z") # 2
fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89]
fam.index("mom") # 4
fam.append("me") # fam = ["liz", 1.73, "emma", 1.68, "mom", 1.71, "dad", 1.89, "me"]; fam is updated in place, no re-assignment needed
sister.upper() # 'LIZ'
sister.count("i") # 1
fam.reverse() # fam = ["me", 1.89, "dad", 1.71, "mom", 1.68, "emma", 1.73, "liz"]; fam is updated in place, no re-assignment needed
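One thing that tripped me up: string methods return a new string and leave the original alone, while list methods like append change the list in place and return None. A short sketch (the variable names shouted and result are my own):

```python
sister = "liz"
shouted = sister.upper()    # returns a new string
print(sister, shouted)      # liz LIZ (strings are immutable)

fam = ["liz", 1.73]
result = fam.append("me")   # modifies fam in place
print(fam)                  # ['liz', 1.73, 'me']
print(result)               # None, so never write fam = fam.append(...)
```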
Numpy
NumPy (Numerical Python) works efficiently with arrays. Once installed…
import numpy as np # np is the conventional alias; with a plain import numpy you would write numpy.array instead of np.array
np.array([1, 2, 3])
a = [1, 2, 3]
b = [4, 5, 6]
np_a = np.array(a)
np_b = np.array(b)
np_a / np_b ** 2 # can perform element-wise operations
np.array([1.0, "is", True]) # every element becomes a string because a NumPy array holds only one type
python_list = [1, 2, 3]
python_list + python_list # [1, 2, 3, 1, 2, 3]
numpy_array = np.array([1, 2, 3])
numpy_array + numpy_array # array([2, 4, 6])
np_a[1] # 2
np_a > 1 # array([False, True, True]); this needs the NumPy array, a > 1 on the plain list raises a TypeError
np_a[np_a > 1] # array([2, 3])
np_2d = np.array([[1, 2, 3, 4], [5, 6, 7, 8]]) # 2D array
np_2d.shape # returns the dimensions of the array; (2, 4) since 2 rows and 4 columns
np_2d[0] # array([1, 2, 3, 4])
np_2d[0][1] # 2
np_2d[0, 1] # 2
np_2d[:, 1:3] # all rows, columns 1 and 2: array([[2, 3], [6, 7]])
np_2d[1, :] # array([5, 6, 7, 8])
np_2d_another = np.array([[1, 1, 1, 1], [1, 1, 1, 1]])
np_2d + np_2d_another # array([[2, 3, 4, 5], [6, 7, 8, 9]])
np.mean(np_2d) # mean
np.median(np_2d) # median
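np.corrcoef(np_2d[0], np_2d[1]) # correlation matrix of the two rows; a constant array such as a row of np_2d_another has zero variance and gives nan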
np.std(np_2d) # standard deviation
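np.sum(np_a) # sum; faster than the built-in sum() on NumPy arrays
np.sort(np_a) # returns a sorted copy; usually faster than the built-in sorted() on NumPy arrays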
height = np.round(np.random.normal(1.75, 0.20, 5000), 2) # 1.75 distribution mean, 0.20 distribution standard deviation, 5000 number of samples
weight = np.round(np.random.normal(60.32, 15, 5000), 2)
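np_city = np.column_stack((height, weight)) # combine height and weight as columns; the arrays go in as one tuple
gk_heights = np_heights[np_positions == 'GK'] # filter one array with a boolean mask built from another; np_heights and np_positions are same-length arrays defined elsewhere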
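To tie the NumPy pieces together, here is a rough end-to-end sketch using the simulated height and weight data. I am assuming a seeded generator from np.random.default_rng so the run is repeatable (the notes above use np.random.normal directly), and the 1.90 cutoff for "tall" is an arbitrary value I picked.

```python
import numpy as np

# Assumption: a seeded generator for repeatable output
rng = np.random.default_rng(seed=0)
height = np.round(rng.normal(1.75, 0.20, 5000), 2)
weight = np.round(rng.normal(60.32, 15, 5000), 2)

# column_stack takes ONE argument: a tuple (or list) of arrays
np_city = np.column_stack((height, weight))
print(np_city.shape)  # (5000, 2)

# Element-wise arithmetic on whole columns
bmi = np_city[:, 1] / np_city[:, 0] ** 2

# Boolean mask built from one column, used to filter another
tall = np_city[:, 0] > 1.90          # arbitrary cutoff for illustration
tall_weights = np_city[tall, 1]

print(np.mean(np_city[:, 0]))        # close to 1.75
print(np.median(np_city[:, 1]))      # close to 60.32
corr = np.corrcoef(np_city[:, 0], np_city[:, 1])  # 2x2 correlation matrix
print(corr)
```

Since height and weight are drawn independently here, the off-diagonal correlation should come out near zero.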
Matplotlib
Package usually used for data visualization
import matplotlib.pyplot as plt # import matplotlib with plt as alias
year = [1950, 1970, 1990, 2010]
pop = [2.519, 3.692, 5.263, 6.972]
plt.plot(year, pop) # (horizontal, vertical) line plot
pop = [1.0, 1.262, 1.650] + pop # include these values too
year = [1800, 1850, 1900] + year # include the 3 years
plt.fill_between(year, pop, 0, color='green') # fill the area under the line with green
plt.xlabel('Year') # x axis label
plt.ylabel('Population') # y axis label
plt.title('World Population Projections') # title label
plt.yticks([0, 2, 4, 6, 8, 10]) # the ticks to display on the y axis
plt.yticks([0, 2, 4, 6, 8, 10], ['0', '2B', '4B', '6B', '8B', '10B']) # the second argument gives the tick labels
plt.show() # displays the plot; nothing appears on screen until this call
plt.scatter(year, pop) # scatter plot
plt.xscale('log') # put the x axis on a logarithmic scale
help(plt.hist) # help of function hist in module matplotlib.pyplot
plt.hist(pop, bins = 3) # histogram of pop with 3 bins
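plt.clf() # clear the current figure
plt.scatter(x = gdp_cap, y = life_exp, s = np.array(pop) * 2, c = col, alpha = 0.8) # another scatter plot: s sets marker sizes, c the colors, alpha the transparency; gdp_cap, life_exp and col are defined elsewhere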
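Here is how the plotting calls above might fit together in one script. Two things are my own assumptions: I select the non-interactive Agg backend so it runs without a display, and I save to a file in the temp directory (the filename is made up) instead of calling plt.show().

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: no display available, so render off-screen
import matplotlib.pyplot as plt

year = [1800, 1850, 1900, 1950, 1970, 1990, 2010]
pop = [1.0, 1.262, 1.650, 2.519, 3.692, 5.263, 6.972]

plt.plot(year, pop)                             # line plot: (horizontal, vertical)
plt.fill_between(year, pop, 0, color="green")   # shade the area under the line
plt.xlabel("Year")
plt.ylabel("Population")
plt.title("World Population")
plt.yticks([0, 2, 4, 6, 8, 10], ["0", "2B", "4B", "6B", "8B", "10B"])

# Hypothetical output path; in an interactive session use plt.show() instead
output_path = os.path.join(tempfile.gettempdir(), "world_population.png")
plt.savefig(output_path)
plt.clf()  # clear the figure so later plots start fresh
```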
Boolean Logic and Control Flow
x = 12
x > 5 and x < 15 # True
y = 5
y <= 7 or y > 13 # True
z = 4
z % 2 == 0 # z modulo 2 is 0 or z is divisible by 2
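The comparisons above only produce True or False; the control flow part is using them in if, elif and else. A minimal sketch, where the function describe and its labels are my own invention for illustration:

```python
# Hypothetical helper; the name and the labels are assumptions for illustration.
def describe(z):
    """Classify an integer using boolean expressions in if/elif/else."""
    if z % 2 == 0 and z > 0:
        return "positive even"
    elif z % 2 == 0:
        return "non-positive even"
    else:
        return "odd"

print(describe(4))   # positive even
print(describe(-2))  # non-positive even
print(describe(7))   # odd
```

Only the first branch whose condition is True runs, so the order of the checks matters.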
Pandas
Unlike NumPy arrays, a Pandas data frame can hold columns of different types.
import pandas as pd
brics = pd.read_csv("brics.csv") # load csv file
brics = pd.read_csv("brics.csv", index_col=0) # use the first column as the row index
brics["country"] # returns the column country
brics.country # returns the column country too
brics["on_earth"] = [True, True, True, True, True] # adding a column
brics["density"] = brics["population"] / brics["area"] * 10000000
brics.loc["BR"] # row access by index label
brics.loc["CH", "capital"] # element access; row, column
brics["capital"].loc["CH"] # element access; column, row
brics.loc["CH"]["capital"] # element access; row, then column
brics.loc["BR"] # returns a Series
brics.loc[["BR"]] # returns a DataFrame
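If you don't have brics.csv at hand, the same access patterns can be tried on a data frame built from a dictionary. The figures below are rough values I filled in for illustration (area in millions of km², population in millions), not the contents of the course file, and the density column here skips the scaling factor used above.

```python
import pandas as pd

# Illustrative, approximate values; not the actual contents of brics.csv
data = {
    "country": ["Brazil", "Russia", "India", "China", "South Africa"],
    "capital": ["Brasilia", "Moscow", "New Delhi", "Beijing", "Pretoria"],
    "area": [8.516, 17.10, 3.286, 9.597, 1.221],
    "population": [200.4, 143.5, 1252.0, 1357.0, 52.98],
}
brics = pd.DataFrame(data, index=["BR", "RU", "IN", "CH", "SA"])

row = brics.loc["BR"]                  # single label  -> Series
frame = brics.loc[["BR"]]              # list of labels -> DataFrame with one row
capital = brics.loc["CH", "capital"]   # row label, column label -> single value

# New column computed element-wise from two existing columns
brics["density"] = brics["population"] / brics["area"]

print(type(row).__name__, type(frame).__name__, capital)
print(brics)
```

The double brackets in brics.loc[["BR"]] are what keep the result a data frame instead of collapsing it to a series.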