Below are the differences between two revisions of the page.
python:first_course_statistics [2016/10/09 23:16] Francesco Beretta [Eruptions of the Old Faithful geyser (p.5)] |
python:first_course_statistics [2016/10/26 19:34] Beretta, Anna Letizia |
||
---|---|---|---|
Line 3: | Line 3: | ||
Read the following important documentation about: | Read the following important documentation about: | ||
- | * pandas: accessing dataframes (tables) | + | * pandas [[http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe|dataframes]] |
* [[http://matplotlib.org/api/pyplot_summary.html|matplotlib.pyplot]] | * [[http://matplotlib.org/api/pyplot_summary.html|matplotlib.pyplot]] | ||
Line 13: | Line 13: | ||
===== Histogram (p.5) ===== | ===== Histogram (p.5) ===== | ||
- | |||
- | FB: this script works fine ! | ||
<code python> | <code python> | ||
Line 31: | Line 29: | ||
\\ | \\ | ||
+ | |||
+ | ===== Boxplot (p. 6) ===== | ||
+ | |||
+ | <code python> | ||
+ | import matplotlib.pyplot as plt | ||
+ | import pandas as pd | ||
+ | gysr1_boxplot = pd.read_csv(r'...\geyser1.TAB', sep='\t') | ||
+ | data_gysr1 = gysr1_boxplot['Interval'] | ||
+ | plt.boxplot(data_gysr1) | ||
+ | ax = plt.gca() | ||
+ | ax.set_xlabel('222 cases') | ||
+ | ax.set_ylabel('Interruption time (minutes)') | ||
+ | ax.set_title('Box and Whisker Plot') | ||
+ | plt.show() | ||
+ | </code> | ||
+ | |||
+ | |||
+ | \\ | ||
+ | |||
+ | ===== ScatterPlot (p. 7) ===== | ||
+ | |||
+ | AB: Set both facecolor and edgecolor to change the fill and the outline of each dot together. Giving them different values colors the inside and the outline of each dot differently. | ||
+ | |||
+ | <code python> | ||
+ | import matplotlib.pyplot as plt | ||
+ | import pandas as pd | ||
+ | geysr1_scatterplot = pd.read_csv(r'...\geyser1.TAB', sep='\t') | ||
+ | geysr1_data_Xax = geysr1_scatterplot['Duration'] | ||
+ | geysr1_data_Yax = geysr1_scatterplot['Interval'] | ||
+ | plt.scatter(geysr1_data_Xax, geysr1_data_Yax, facecolor='y', edgecolor='y') | ||
+ | ax = plt.gca() | ||
+ | ax.set_xlabel('Eruption duration time (minutes)') | ||
+ | ax.set_ylabel('Interruption time (minutes)') | ||
+ | ax.set_title('Scatter Plot of INTERVAL vs DURATION') | ||
+ | plt.show() | ||
+ | </code> | ||
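As a minimal sketch of the AB note on face and edge colors, the following draws dots with a yellow fill and a black outline. The duration and interval values are made up for illustration and are not taken from geyser1.TAB; the `Agg` backend line is only an assumption so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); remove this line to open a window
import matplotlib.pyplot as plt

# Made-up values for illustration only (not the geyser1 data)
duration = [1.8, 2.0, 3.9, 4.2, 4.5]
interval = [52, 55, 78, 80, 85]

fig, ax = plt.subplots()
# facecolor sets the fill of each dot, edgecolor sets its outline
points = ax.scatter(duration, interval, facecolor='y', edgecolor='k')
ax.set_xlabel('Eruption duration time (minutes)')
ax.set_ylabel('Interruption time (minutes)')
ax.set_title('Dots with different fill and outline colors')
```

In an interactive session, calling `plt.show()` afterwards displays the figure.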
+ | |||
+ | |||
+ | \\ | ||
+ | |||
+ | |||
+ | ===== Descriptive statistics (p.9) ===== | ||
+ | |||
+ | Note: try different selections, e.g. the whole dataframe, the whole 'Duration' column, or only the rows where 'Duration' <= 3. | ||
+ | |||
+ | [[http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics|doc]] – [[http://www.marsja.se/pandas-python-descriptive-statistics/|example]] | ||
+ | |||
+ | <code python> | ||
+ | import pandas as pd | ||
+ | gysr1 = pd.read_csv('../geyser1.tab', sep='\t') | ||
+ | # Summary statistics of 'Duration' for eruptions lasting at most 3 minutes | ||
+ | print(gysr1.loc[gysr1['Duration'] <= 3, 'Duration'].describe()) | ||
+ | </code> | ||
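The selections suggested in the note can be compared side by side. The sketch below uses a small hypothetical dataframe (`gysr_demo`, with made-up values) in place of geyser1.tab, so the numbers it prints say nothing about the real data.

```python
import pandas as pd

# Hypothetical stand-in for geyser1.tab (values are made up)
gysr_demo = pd.DataFrame({
    'Duration': [1.8, 2.0, 3.0, 3.9, 4.2, 4.5],
    'Interval': [52, 55, 60, 78, 80, 85],
})

# Whole dataframe: one column of statistics per numeric column
whole_df = gysr_demo.describe()

# Whole population of a single variable
all_durations = gysr_demo['Duration'].describe()

# Only the rows where 'Duration' <= 3
short_durations = gysr_demo.loc[gysr_demo['Duration'] <= 3, 'Duration'].describe()

print(whole_df)
print(all_durations)
print(short_durations)
```

Each call returns count, mean, std, min, the quartiles and max, so the three results can be read against each other directly.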
+ | |||
+ | |||
+ | \\ | ||
+ | |||
+ | |||
+ | ===== Boxplot (p.9) ===== | ||
+ | |||
+ | Selecting rows in a dataframe: [[http://pandas.pydata.org/pandas-docs/stable/indexing.html#the-where-method-and-masking|doc]] / [[http://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas|example]] | ||
+ | |||
+ | <code python> | ||
+ | import matplotlib.pyplot as plt | ||
+ | import pandas as pd | ||
+ | gysr1 = pd.read_csv('../geyser1.tab', sep='\t') | ||
+ | # Split the rows at a duration of 3 minutes | ||
+ | gysr1_inf3 = gysr1.loc[gysr1['Duration'] <= 3] | ||
+ | gysr1_sup3 = gysr1.loc[gysr1['Duration'] > 3] | ||
+ | plt.boxplot([gysr1_inf3['Interval'], gysr1_sup3['Interval']]) | ||
+ | plt.xticks([1, 2], ['inf3', 'sup3'])  # label the two boxes | ||
+ | plt.show() | ||
+ | </code> | ||
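Row selection with a boolean mask can be tried without the data file. This is a small sketch on a hypothetical dataframe (`gysr_demo`, made-up values): the mask is built once and its negation gives the complementary group, so every row lands in exactly one of the two subsets.

```python
import pandas as pd

# Hypothetical stand-in for geyser1.tab (values are made up)
gysr_demo = pd.DataFrame({
    'Duration': [1.8, 2.0, 3.0, 3.9, 4.2, 4.5],
    'Interval': [52, 55, 60, 78, 80, 85],
})

# Boolean Series: True where the eruption lasted at most 3 minutes
mask = gysr_demo['Duration'] <= 3

inf3 = gysr_demo.loc[mask]    # rows where the mask is True
sup3 = gysr_demo.loc[~mask]   # rows where the mask is False

print(inf3['Interval'].median(), sup3['Interval'].median())
```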
+ | |||
+ | |||
+ | \\ | ||
+ | |||
====== International adoption rates (p.13) ====== | ====== International adoption rates (p.13) ====== | ||
+ | |||
+ | ===== Boxplot (p.14) ===== | ||
+ | |||
+ | <code python> | ||
+ | import matplotlib.pyplot as plt | ||
+ | import pandas as pd | ||
+ | adopt_data = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\adopt.TAB', sep='\t') | ||
+ | adopt1 = adopt_data['Visa91'] | ||
+ | plt.boxplot(adopt1) | ||
+ | ax = plt.gca() | ||
+ | ax.set_title('Box and Whisker Plot') | ||
+ | ax.set_xlabel('39 cases') | ||
+ | ax.set_ylabel('Number of visas in 1991') | ||
+ | plt.show() | ||
+ | </code> | ||
+ | |||
+ | |||
+ | \\ | ||
+ | |||
+ | |||
+ | ===== Histogram (p.14) ===== | ||
+ | |||
+ | <code python> | ||
+ | import matplotlib.pyplot as plt | ||
+ | import pandas as pd | ||
+ | adopt_data = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\adopt.TAB', sep='\t') | ||
+ | adopt1 = adopt_data['Visa91'] | ||
+ | plt.hist(adopt1) | ||
+ | plt.show() | ||
+ | </code> | ||
+ | |||
+ | |||
+ | \\ | ||
+ | |||
+ | ===== Histogram with Log (p.18) ===== | ||
+ | One possible approach (an assumption, not checked against the book's figure): plot the base-10 logarithm of the counts. This requires every Visa91 value to be positive. | ||
+ | <code python> | ||
+ | import numpy as np | ||
+ | import pandas as pd | ||
+ | import matplotlib.pyplot as plt | ||
+ | adopt = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\adopt.TAB', sep='\t') | ||
+ | # Base-10 log of the counts; assumes all Visa91 values are > 0 | ||
+ | adopt_loghist = np.log10(adopt['Visa91']) | ||
+ | ax = plt.gca() | ||
+ | # Passing log=True here would instead put the frequencies on a log scale | ||
+ | ax.hist(adopt_loghist, bins=10, color='r') | ||
+ | ax.set_xlabel('Log (Number of 1991 visas)') | ||
+ | ax.set_ylabel('Frequency') | ||
+ | ax.set_title('Histogram') | ||
+ | plt.show() | ||
+ | </code> | ||
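One possible way to get a histogram on a logarithmic horizontal axis is to keep the raw counts, pass logarithmically spaced bin edges to `hist`, and then switch the x axis to a log scale. The sketch below uses made-up counts rather than adopt.TAB, and the bin range (1 to 10000, two bins per decade) is an assumption chosen to cover them. The `Agg` backend line is likewise an assumption so that the sketch runs without a display.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend (assumption); remove this line to open a window
import matplotlib.pyplot as plt

# Made-up visa counts for illustration only (not the adopt.TAB data)
visas = np.array([5, 12, 40, 95, 150, 300, 700, 1200, 2600])

# Nine edges from 10**0 to 10**4: bins of equal width on a log axis
bins = np.logspace(0, 4, 9)

fig, ax = plt.subplots()
counts, edges, _ = ax.hist(visas, bins=bins, color='r')
ax.set_xscale('log')  # log scale on the value axis, frequencies stay linear
ax.set_xlabel('Number of 1991 visas (log scale)')
ax.set_ylabel('Frequency')
ax.set_title('Histogram')
```

With this variant the tick labels show the original counts, whereas taking the logarithm of the data first labels the axis in log units.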
+ | |||
+ | |||
+ | ===== Scatterplot (p. 17) ===== | ||
+ | <code python> | ||
+ | import matplotlib.pyplot as plt | ||
+ | import pandas as pd | ||
+ | adoption_scatterplot = pd.read_csv(r'...\adopt.TAB', sep='\t') | ||
+ | adopt_data_Xax = adoption_scatterplot['Visa88'] | ||
+ | adopt_data_Yax = adoption_scatterplot['Visa91'] | ||
+ | plt.scatter(adopt_data_Xax, adopt_data_Yax, facecolor='y', edgecolor='y') | ||
+ | ax = plt.gca() | ||
+ | ax.set_xlabel('Number of Visas in 1988') | ||
+ | ax.set_ylim([0,2700]) | ||
+ | ax.set_xlim([0,5000]) | ||
+ | ax.set_ylabel('Number of Visas in 1991') | ||
+ | ax.set_title('ScatterPlot of Visa91 vs Visa88') | ||
+ | plt.show() | ||
+ | </code> | ||
+ | |||
+ | |||
+ | \\ | ||
+ | |||
+ | |||
+ | ===== Scatterplot (p.18) ===== | ||
+ | <code python> | ||
+ | import matplotlib.pyplot as plt | ||
+ | import pandas as pd | ||
+ | adoption_scatterplot = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\adopt.TAB', sep='\t') | ||
+ | adopt_data_Xax = adoption_scatterplot['Visa91'] | ||
+ | adopt_data_Yax = adoption_scatterplot['Visa92'] | ||
+ | plt.scatter(adopt_data_Xax, adopt_data_Yax, facecolor='y', edgecolor='y') | ||
+ | ax = plt.gca() | ||
+ | ax.set_xlabel('Number of Visas in 1991') | ||
+ | ax.set_ylim([0,1800]) | ||
+ | ax.set_xlim([0,2700]) | ||
+ | ax.set_ylabel('Number of Visas in 1992') | ||
+ | ax.set_title('ScatterPlot of Visa92 vs Visa91') | ||
+ | plt.show() | ||
+ | </code> | ||
+ | |||