Below are the differences between two revisions of the page.
| Both previous revisions Previous revision Next revision | Previous revision | ||
|
python:first_course_statistics [2016/10/17 07:41] Francesco Beretta [ScatterPlot (p. 7)] |
python:first_course_statistics [2017/09/26 08:54] (Current version) Francesco Beretta [General instructions] |
||
|---|---|---|---|
| Line 5: | Line 5: | ||
| * pandas [[http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe|dataframes]] | * pandas [[http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe|dataframes]] | ||
| * [[http://matplotlib.org/api/pyplot_summary.html|matplotlib.pyplot]] | * [[http://matplotlib.org/api/pyplot_summary.html|matplotlib.pyplot]] | ||
| + | |||
| + | |||
| + | Get the data from [[http://people.stern.nyu.edu/jsimonof/Casebook/Data/ASCII/README.html|this site]]. | ||
| + | |||
| Save your scripts in a folder inside the data folder, naming the script folder 'my_scripts' or whatever you like. If 'my_scripts' is set as your [[python:generic_features#get_the_current_working_directory_address|current working directory]], then the data files are available at the address '../[data file]', for instance: '../geyser1.TAB' | Save your scripts in a folder inside the data folder, naming the script folder 'my_scripts' or whatever you like. If 'my_scripts' is set as your [[python:generic_features#get_the_current_working_directory_address|current working directory]], then the data files are available at the address '../[data file]', for instance: '../geyser1.TAB' | ||
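A minimal sketch of how the relative address above resolves, assuming the layout described (a script folder placed inside the data folder). The folder name `data_folder` is hypothetical and nothing is read from disk.

```python
import os
import os.path

# Hypothetical layout: data_folder/my_scripts is the current working directory
script_dir = os.path.join("data_folder", "my_scripts")

# '../geyser1.TAB' means: go up one level from the script folder, then take the file
data_file = os.path.normpath(os.path.join(script_dir, "..", "geyser1.TAB"))

print(os.getcwd())   # the real current working directory of this session
print(data_file)     # the data file sits directly in the data folder
```

If `pd.read_csv('../geyser1.TAB', sep='\t')` cannot find the file, printing `os.getcwd()` is a quick way to check whether the script folder really is the working directory.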
| Line 35: | Line 39: | ||
| import matplotlib.pyplot as plt | import matplotlib.pyplot as plt | ||
| import pandas as pd | import pandas as pd | ||
| - | gysr1_boxplot = pd.read_csv('...\geyser1.TAB', '\t') | + | gysr1_boxplot = pd.read_csv('.../geyser1.TAB', sep='\t') |
| data_gysr1 = gysr1_boxplot['Interval'] | data_gysr1 = gysr1_boxplot['Interval'] | ||
| plt.boxplot(data_gysr1) | plt.boxplot(data_gysr1) | ||
| Line 55: | Line 59: | ||
| import matplotlib.pyplot as plt | import matplotlib.pyplot as plt | ||
| import pandas as pd | import pandas as pd | ||
| - | geysr1_scatterplot = pd.read_csv('...\geyser1.TAB', '\t') | + | geysr1_scatterplot = pd.read_csv('.../geyser1.TAB', sep='\t') |
| geysr1_data_Xax = geysr1_scatterplot['Duration'] | geysr1_data_Xax = geysr1_scatterplot['Duration'] | ||
| geysr1_data_Yax = geysr1_scatterplot['Interval'] | geysr1_data_Yax = geysr1_scatterplot['Interval'] | ||
| Line 65: | Line 69: | ||
| plt.show() | plt.show() | ||
| </code> | </code> | ||
| + | |||
| + | |||
| + | \\ | ||
| + | |||
| + | |||
| + | ===== Descriptive statistics (p.9) ===== | ||
| + | |||
| + | Note: try different selections, e.g. one column for the whole population, only the rows where 'Duration' <= 3, or the whole dataframe. | ||
| + | |||
| + | [[http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics|doc]] – [[http://www.marsja.se/pandas-python-descriptive-statistics/|example]] | ||
| + | |||
| + | <code python> | ||
| + | import pandas as pd | ||
| + | gysr1 = pd.read_csv('../geyser1.TAB', sep='\t') | ||
| + | gysr1['Duration'][gysr1['Duration'] <= 3].describe() | ||
| + | </code> | ||
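The three selections suggested in the note can be sketched side by side. The values below are made up for illustration (the real ones come from geyser1.TAB); only the column names 'Duration' and 'Interval' are taken from the scripts on this page.

```python
import pandas as pd

# Made-up values for illustration only; the real data come from geyser1.TAB
gysr1 = pd.DataFrame({
    "Duration": [1.8, 2.0, 2.9, 3.0, 4.1, 4.5],
    "Interval": [55, 58, 62, 60, 80, 85],
})

# 1. One column, whole population
whole_column = gysr1["Duration"].describe()

# 2. Same column, restricted with a boolean mask to eruptions of at most 3 minutes
short_only = gysr1["Duration"][gysr1["Duration"] <= 3].describe()

# 3. Whole dataframe: one summary column per numeric variable
whole_frame = gysr1.describe()

print(whole_column)
print(short_only)
print(whole_frame)
```

`describe()` returns a Series for a single column and a DataFrame for the whole table, so individual figures can be picked out afterwards, e.g. `short_only['mean']`.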
| + | |||
| + | |||
| + | \\ | ||
| + | |||
| ===== Boxplot (p.9) ===== | ===== Boxplot (p.9) ===== | ||
| Line 78: | Line 102: | ||
| plt.boxplot([gysr1_inf3['Interval'],gysr1_sup3['Interval']], labels= ['inf3','sup3']) | plt.boxplot([gysr1_inf3['Interval'],gysr1_sup3['Interval']], labels= ['inf3','sup3']) | ||
| </code> | </code> | ||
| + | |||
| + | |||
| + | \\ | ||
| + | |||
| ====== International adoption rates (p.13) ====== | ====== International adoption rates (p.13) ====== | ||
| + | ===== Boxplot (p.14) ===== | ||
| + | |||
| + | <code python> | ||
| + | import matplotlib.pyplot as plt | ||
| + | import pandas as pd | ||
| + | adopt_data = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\adopt.TAB', sep='\t') | ||
| + | adopt1 = adopt_data['Visa91'] | ||
| + | plt.boxplot(adopt1) | ||
| + | ax = plt.gca() | ||
| + | ax.set_title('Box and Whisker Plot') | ||
| + | ax.set_xlabel('39 cases') | ||
| + | ax.set_ylabel('Number of visas in 1991') | ||
| + | plt.show() | ||
| + | </code> | ||
| + | |||
| + | |||
| + | \\ | ||
| + | |||
| + | |||
| + | ===== Histogram (p.14) ===== | ||
| + | |||
| + | <code python> | ||
| + | import matplotlib.pyplot as plt | ||
| + | import pandas as pd | ||
| + | adopt_data = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\adopt.TAB', sep='\t') | ||
| + | adopt1 = adopt_data['Visa91'] | ||
| + | plt.hist(adopt1) | ||
| + | plt.show() | ||
| + | </code> | ||
| + | |||
| + | |||
| + | \\ | ||
| + | |||
| + | ===== Histogram with Log (p.18) ===== | ||
| + | Passing log=True to hist() puts the frequencies, not the values, on a log scale. To get the logarithm of the values, take log10 before binning (this assumes every Visa91 value is positive). | ||
| + | <code Python> | ||
| + | import numpy as np | ||
| + | import pandas as pd | ||
| + | import matplotlib.pyplot as plt | ||
| + | adopt = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\adopt.TAB', sep='\t') | ||
| + | # base-10 log of the visa counts; assumes all Visa91 values are positive | ||
| + | adopt_loghist = np.log10(adopt['Visa91']) | ||
| + | ax = plt.gca() | ||
| + | # hist(..., log=True) would log-scale the frequencies instead of the values | ||
| + | ax.hist(adopt_loghist, bins=10, color='r') | ||
| + | ax.set_xlabel('Log (Number of 1991 visas)') | ||
| + | ax.set_ylabel('Frequency') | ||
| + | ax.set_title('Histogram') | ||
| + | plt.show() | ||
| + | </code> | ||
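Another possible route to a histogram on a logarithmic value axis is to keep the values in their original units and make the bin edges evenly spaced in log10. The sketch below uses made-up counts and hand-picked bounds (10^0 to 10^4), not the adoption data, and only computes the bins.

```python
import numpy as np

# Made-up positive counts for illustration; not the adoption data
values = np.array([3, 8, 15, 40, 90, 250, 600, 1200, 2500])

# Nine edges from 10**0 to 10**4 in half-decade steps give eight log-spaced bins;
# the bounds are chosen by hand so that they cover the made-up values
edges = np.logspace(0, 4, num=9)
counts, _ = np.histogram(values, bins=edges)

print(edges.round(1))
print(counts)
```

The same edges can be handed to matplotlib with `plt.hist(values, bins=edges)` followed by `plt.gca().set_xscale('log')`, so the bars have equal width on the log axis while the tick labels stay in numbers of visas.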
| + | |||
| + | |||
| + | ===== Scatterplot (p. 17) ===== | ||
| + | <code python> | ||
| + | import matplotlib.pyplot as plt | ||
| + | import pandas as pd | ||
| + | adoption_scatterplot = pd.read_csv('.../adopt.TAB', sep='\t') | ||
| + | adopt_data_Xax = adoption_scatterplot['Visa88'] | ||
| + | adopt_data_Yax = adoption_scatterplot['Visa91'] | ||
| + | plt.scatter(adopt_data_Xax, adopt_data_Yax, facecolor='y', edgecolor='y') | ||
| + | ax = plt.gca() | ||
| + | ax.set_xlabel('Number of Visas in 1988') | ||
| + | ax.set_ylim([0,2700]) | ||
| + | ax.set_xlim([0,5000]) | ||
| + | ax.set_ylabel('Number of Visas in 1991') | ||
| + | ax.set_title('ScatterPlot of Visa91 vs Visa88') | ||
| + | plt.show() | ||
| + | </code> | ||
| + | |||
| + | |||
| + | \\ | ||
| + | |||
| + | |||
| + | ===== Scatterplot (p.18) ===== | ||
| + | <code python> | ||
| + | import matplotlib.pyplot as plt | ||
| + | import pandas as pd | ||
| + | adoption_scatterplot = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\adopt.TAB', sep='\t') | ||
| + | adopt_data_Xax = adoption_scatterplot['Visa91'] | ||
| + | adopt_data_Yax = adoption_scatterplot['Visa92'] | ||
| + | plt.scatter(adopt_data_Xax, adopt_data_Yax, facecolor='y', edgecolor='y') | ||
| + | ax = plt.gca() | ||
| + | ax.set_xlabel('Number of Visas in 1991') | ||
| + | ax.set_ylim([0,1800]) | ||
| + | ax.set_xlim([0,2700]) | ||
| + | ax.set_ylabel('Number of Visas in 1992') | ||
| + | ax.set_title('ScatterPlot of Visa92 vs Visa91') | ||
| + | plt.show() | ||
| + | </code> | ||
| + | |||
| + | \\ | ||
| + | |||
| + | |||
| + | ====== The Performance of stock mutual funds (p. 21) ====== | ||
| + | |||
| + | |||
| + | |||
| + | |||
| + | |||
| + | |||
| + | \\ | ||
| + | |||
| + | ====== Predicting the sales and airplay of popular music (p. 23) ====== | ||
| + | |||
| + | |||
| + | |||
| + | |||
| + | \\ | ||
| + | |||
| + | ====== Another look at the "Old faithful" geyser and adoption visas (p.24) ====== | ||
| + | |||
| + | After changing the bins of both histograms: | ||
| + | the histogram is reliable for the "Old faithful" geyser but not for the adoption rates, where its appearance changes considerably with the choice of bins. | ||
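A small sketch of this kind of check: bin the same sample with several bin counts and compare the resulting frequencies. The sample is randomly generated and strongly right-skewed purely for illustration; it is not the adoption data, and the bin counts 5, 10 and 20 are arbitrary choices.

```python
import numpy as np

# Made-up, strongly right-skewed sample for illustration; not the adoption data
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=5.0, sigma=1.5, size=40)

# Same data, three different numbers of equal-width bins
for n_bins in (5, 10, 20):
    counts, edges = np.histogram(sample, bins=n_bins)
    print(n_bins, counts)
```

With skewed data most observations tend to pile into the first bin or two, so the printed shape can shift noticeably as the bin count changes. In matplotlib the equivalent check is to call `plt.hist(data, bins=n)` with a few values of `n`.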
| + | |||
| + | \\ | ||
| + | |||
| + | ====== Productivity versus quality in the assembly plant (p. 25) ====== | ||
| + | |||
| + | |||
| + | ===== Scatterplot of Productivity vs Quality (p. 26) ===== | ||
| + | <code Python> | ||
| + | import pandas as pd | ||
| + | import matplotlib.pyplot as plt | ||
| + | scatter_plot = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\prdq.TAB', sep='\t') | ||
| + | productivity_Y = scatter_plot['Producti'] | ||
| + | quality_X = scatter_plot['Quality'] | ||
| + | # x = quality (defects), y = productivity (hours), matching the axis labels | ||
| + | plt.scatter(quality_X, productivity_Y, color='r') | ||
| + | ax = plt.gca() | ||
| + | ax.set_xlabel('Assembly defects per 100 cars') | ||
| + | ax.set_ylabel('Hours per vehicle') | ||
| + | ax.set_title('Scatter Plot of Productivity VS Quality') | ||
| + | plt.show() | ||
| + | </code> | ||
| + | |||
| + | \\ | ||
| + | |||
| + | ===== Scatter Plot of PRODJAPN vs QUALJAPN (p. 27) ===== | ||
| + | |||
| + | <code Python> | ||
| + | import pandas as pd | ||
| + | import matplotlib.pyplot as plt | ||
| + | scatter_plot = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\prdq.TAB', sep='\t') | ||
| + | productivity_Y = scatter_plot['ProdJapn'] | ||
| + | quality_X = scatter_plot['QualJapn'] | ||
| + | # x = quality (defects), y = productivity (hours), matching the axis labels | ||
| + | plt.scatter(quality_X, productivity_Y, color='r') | ||
| + | ax = plt.gca() | ||
| + | ax.set_xlabel('Assembly defects per 100 cars (Japanese origin)') | ||
| + | ax.set_ylabel('Hours per vehicle (Japanese origin)') | ||
| + | ax.set_title('Scatter Plot of PRODJAPN VS QUALJAPN') | ||
| + | plt.show() | ||
| + | </code> | ||
| + | |||
| + | |||
| + | ===== Scatter Plot of PRODNONJ vs QUALNONJ (p. 27) ===== | ||
| + | <code Python> | ||
| + | import pandas as pd | ||
| + | import matplotlib.pyplot as plt | ||
| + | scatter_plot = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\prdq.TAB', sep='\t') | ||
| + | productivity_Y = scatter_plot['ProdNonJ'] | ||
| + | quality_X = scatter_plot['QualNonJ'] | ||
| + | # x = quality (defects), y = productivity (hours), matching the axis labels | ||
| + | plt.scatter(quality_X, productivity_Y, color='r') | ||
| + | ax = plt.gca() | ||
| + | ax.set_xlabel('Assembly defects per 100 cars (non-Japanese origin)') | ||
| + | ax.set_ylabel('Hours per vehicle (non-Japanese origin)') | ||
| + | ax.set_title('Scatter Plot of PRODNONJ VS QUALNONJ') | ||
| + | plt.show() | ||
| + | </code> | ||
| + | |||
| + | |||
| + | |||
| + | ===== Scatterplot of productivity VS quality (p. 28) ===== | ||
| + | <code python> | ||
| + | import pandas as pd | ||
| + | import matplotlib.pyplot as plt | ||
| + | scatter_plot = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\prdq.TAB', sep='\t') | ||
| + | productivity_Y = scatter_plot['Producti'] | ||
| + | quality_X = scatter_plot['Quality'] | ||
| + | # x = quality (defects), y = productivity (hours), matching the axis labels | ||
| + | plt.scatter(quality_X, productivity_Y, color='r') | ||
| + | ax = plt.gca() | ||
| + | ax.set_xlabel('Assembly defects per 100 cars') | ||
| + | ax.set_ylabel('Hours per vehicle') | ||
| + | ax.set_title('Scatter Plot of PRODUCTIVITY VS QUALITY') | ||
| + | plt.show() | ||
| + | </code> | ||
| + | |||
| + | |||
| + | ===== Productivity versus quality in the assembly plant (p.29) ===== | ||
| + | |||
| + | Note: indexing with data_comparison.loc[data_comparison['QualNonJ']] treats the column values as row labels, so it fails. The scripts below plot the per-group columns directly, assuming QualNonJ/QualJapn and ProdNonJ/ProdJapn hold the values of each group, with empty cells for plants of the other group. | ||
| + | |||
| + | <code python> | ||
| + | #1 Quality, non-Japanese vs Japanese plants | ||
| + | import matplotlib.pyplot as plt | ||
| + | import pandas as pd | ||
| + | data_comparison = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\prdq.TAB', sep='\t') | ||
| + | # dropna() removes the empty cells (assumed padding) so that boxplot only sees real values | ||
| + | non_japanese = data_comparison['QualNonJ'].dropna() | ||
| + | japanese = data_comparison['QualJapn'].dropna() | ||
| + | plt.boxplot([non_japanese, japanese], labels=['Non-japanese','Japanese']) | ||
| + | plt.show() | ||
| + | |||
| + | #2 Productivity, non-Japanese vs Japanese plants | ||
| + | import matplotlib.pyplot as plt | ||
| + | import pandas as pd | ||
| + | data_comparison = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\prdq.TAB', sep='\t') | ||
| + | non_japanese = data_comparison['ProdNonJ'].dropna() | ||
| + | japanese = data_comparison['ProdJapn'].dropna() | ||
| + | plt.boxplot([non_japanese, japanese], labels=['Non-japanese','Japanese']) | ||
| + | plt.show() | ||
| + | </code> | ||