# Wiki du Pôle histoire numérique LARHRA UMR5190

# General instructions

Get the data from this site.

Save your scripts in a folder inside the data folder, naming the script folder 'my_scripts' or similar. If 'my_scripts' is set as your current working directory, then the data files are available at '../[data file]', for instance '../geyser1.TAB'.
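
The folder layout described above can be sketched with `pathlib`, so that the relative path does not have to be typed by hand. This is only a minimal sketch: the helper name `data_file` and the `data/my_scripts` layout are assumptions made for illustration.

```python
from pathlib import Path


def data_file(name, script_dir=None):
    """Return the path of a data file stored one level above the script folder."""
    # Default to the current working directory, as the instructions assume.
    base = Path.cwd() if script_dir is None else Path(script_dir)
    return base.parent / name


# Hypothetical layout: data/my_scripts is the script folder.
print(data_file("geyser1.TAB", script_dir="data/my_scripts"))
```

The returned path can be passed directly to `pd.read_csv`.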

# Eruptions of the "Old Faithful" geyser (p.5)

## Histogram (p.5)

```python
import pandas as pd
import matplotlib.pyplot as plt

# read_csv already returns a DataFrame; the separator is passed by keyword
gys1 = pd.read_csv('../geyser1.TAB', sep='\t')
g_int = gys1['Interval']

ax = plt.gca()
ax.hist(g_int, bins=20, color='r')
ax.set_xlabel('Intereruption time (minutes)')
ax.set_ylabel('Frequency')
ax.set_title('Histogram')
plt.show()
```

## Boxplot (p. 6)

```python
import matplotlib.pyplot as plt
import pandas as pd

# The data file sits one level above the script folder ('../', not '.../')
gysr1_boxplot = pd.read_csv('../geyser1.TAB', sep='\t')
data_gysr1 = gysr1_boxplot['Interval']

plt.boxplot(data_gysr1)
ax = plt.gca()
ax.set_xlabel('222 cases')
ax.set_ylabel('Intereruption time (minutes)')
ax.set_title('Box and Whisker Plot')
plt.show()
```

## ScatterPlot (p. 7)

AB: Set both `facecolor` and `edgecolor` to change the color of the whole dot. You can also give the inside and the outline of each dot two different colors.

```python
import matplotlib.pyplot as plt
import pandas as pd

# The data file sits one level above the script folder ('../', not '.../')
geysr1_scatterplot = pd.read_csv('../geyser1.TAB', sep='\t')
geysr1_data_Xax = geysr1_scatterplot['Duration']
geysr1_data_Yax = geysr1_scatterplot['Interval']

# Same color for the inside and the outline of each dot
plt.scatter(geysr1_data_Xax, geysr1_data_Yax, facecolor='y', edgecolor='y')
ax = plt.gca()
ax.set_xlabel('Eruption duration time (minutes)')
ax.set_ylabel('Intereruption time (minutes)')
ax.set_title('Scatter Plot of INTERVAL vs DURATION')
plt.show()
```
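
The note about face and edge colors can be illustrated with two different colors per dot. This is a small sketch on invented points (not the geyser data), using the `facecolors` and `edgecolors` parameters of `Axes.scatter`.

```python
import matplotlib.pyplot as plt

# Invented points for illustration only.
x = [1.8, 2.0, 3.5, 4.1, 4.6]
y = [54, 51, 74, 80, 85]

fig, ax = plt.subplots()
# facecolors fills each dot, edgecolors draws its outline
dots = ax.scatter(x, y, facecolors='y', edgecolors='k')
ax.set_title('Yellow dots with black outlines')
```

Call `plt.show()` afterwards to display the figure.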

## Descriptive statistics (p.9)

Note: try different examples, e.g. the whole population (the whole dataframe) or only the rows where 'Duration' <= 3.

```python
import pandas as pd

gysr1 = pd.read_csv('../geyser1.TAB', sep='\t')
# Summary statistics of 'Duration' for the rows where Duration <= 3
print(gysr1.loc[gysr1['Duration'] <= 3, 'Duration'].describe())
```
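
To compare the whole population with subsets, as the note suggests, the summaries can be put side by side. The sketch below uses a small invented dataframe (the values are not from the geyser file); only the column names `Duration` and `Interval` come from the page.

```python
import pandas as pd

# Invented values for illustration only; not the geyser data.
toy = pd.DataFrame({
    "Duration": [1.8, 2.0, 2.9, 3.5, 4.1, 4.6],
    "Interval": [54, 51, 60, 74, 80, 85],
})

whole = toy["Interval"].describe()
short = toy.loc[toy["Duration"] <= 3, "Interval"].describe()
long_ = toy.loc[toy["Duration"] > 3, "Interval"].describe()

# Side-by-side table: one column per group
summary = pd.DataFrame({"all": whole, "<= 3": short, "> 3": long_})
print(summary)
```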

## Boxplot (p.9)

Selecting rows in a dataframe: doc / example

```python
import matplotlib.pyplot as plt
import pandas as pd

gysr1 = pd.read_csv('../geyser1.TAB', sep='\t')
# Split the rows by eruption duration
gysr1_inf3 = gysr1.loc[gysr1['Duration'] <= 3]
gysr1_sup3 = gysr1.loc[gysr1['Duration'] > 3]

plt.boxplot([gysr1_inf3['Interval'], gysr1_sup3['Interval']])
plt.xticks([1, 2], ['inf3', 'sup3'])  # one tick label per box
plt.show()
```
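
As an alternative to selecting the two groups one by one, the rows can be split in a single pass with `groupby` on a boolean key. This is a sketch on invented values, not the geyser data; the group names `inf3` and `sup3` follow the labels used above.

```python
import pandas as pd

# Invented values for illustration only.
toy = pd.DataFrame({
    "Duration": [1.8, 2.0, 2.9, 3.5, 4.1, 4.6],
    "Interval": [54, 51, 60, 74, 80, 85],
})

# Boolean key: True where Duration <= 3
is_short = toy["Duration"] <= 3
groups = {
    ("inf3" if key else "sup3"): grp["Interval"].tolist()
    for key, grp in toy.groupby(is_short)
}
print(groups)
```

The two lists can then be passed to `plt.boxplot([groups["inf3"], groups["sup3"]])`.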

# International adoption rates (p.13)

## Boxplot (p.14)

```python
import matplotlib.pyplot as plt
import pandas as pd

ax = plt.gca()
ax.set_title('Box and Whisker Plot')
ax.set_xlabel('39 cases')
ax.set_ylabel('Number of visas in 1991')
plt.show()
```

## Histogram (p.14)

```python
import matplotlib.pyplot as plt
import pandas as pd

plt.show()
```

## Histogram with Log (p.18)

I did not find a direct way to do this. Passing `log=True` to `hist` puts the frequencies, not the data, on a log scale, so one option is to take the logarithm of the data before building the histogram.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# adopt_loghist must hold the 1991 visa counts (all > 0);
# the loading step is not shown on this page.
# Base 10 is an assumption; the page does not say which logarithm is meant.
ax = plt.gca()
ax.hist(np.log10(adopt_loghist), bins=10, color='r')
ax.set_xlabel('Log (Number of 1991 visas)')
ax.set_ylabel('Frequency')
ax.set_title('Histogram')
plt.show()
```
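
Two possible ways to get a histogram on a logarithmic scale are sketched below with NumPy only: binning the log10-transformed values, or binning the raw values with log-spaced edges. The counts are invented, and both the base-10 logarithm and the bin edges are assumptions chosen for illustration.

```python
import numpy as np

# Invented positive counts for illustration only; not the adoption data.
visas = np.array([2, 5, 7, 20, 50, 60, 200, 400, 2000, 5000])

# Option 1: histogram of the log10-transformed values, equal-width bins.
log_edges = np.linspace(0, 4, 9)  # 8 bins covering 10**0 to 10**4
counts_log, _ = np.histogram(np.log10(visas), bins=log_edges)

# Option 2: raw values with log-spaced bin edges.
raw_edges = 10.0 ** log_edges
counts_raw, _ = np.histogram(visas, bins=raw_edges)

print(counts_log)
print(counts_raw)
```

For a plot, the second option corresponds to `ax.hist(visas, bins=raw_edges)` followed by `ax.set_xscale('log')`, which keeps the original units on the x axis.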

## Scatterplot (p. 17)

```python
import matplotlib.pyplot as plt
import pandas as pd

ax = plt.gca()
ax.set_xlabel('Number of Visas in 1988')
ax.set_ylim([0, 2700])
ax.set_xlim([0, 5000])
ax.set_ylabel('Number of Visas in 1991')
ax.set_title('ScatterPlot of Visa91 vs Visa88')
plt.show()
```

## Scatterplot (p.18)

```python
import matplotlib.pyplot as plt
import pandas as pd

ax = plt.gca()
ax.set_xlabel('Number of Visas in 1991')
ax.set_ylim([0, 1800])
ax.set_xlim([0, 2700])
ax.set_ylabel('Number of Visas in 1992')
ax.set_title('ScatterPlot of Visa92 vs Visa91')
plt.show()
```

# Another look at the "Old faithful" geyser and adoption visas (p.24)

I modified the bins of both histograms. The histogram is reliable for the “Old Faithful” geyser but not for the adoption rates: there, the appearance of the histogram changes quite a lot when the bins are changed.
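
The effect of the bin choice can be explored by computing the counts for several bin settings. The sketch below uses synthetic samples (a two-cluster sample of 222 values and a right-skewed sample of 39 values) as stand-ins; they are not the book's data, and the helper `bin_counts` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins, NOT the book's data:
# two separated clusters versus a strongly right-skewed sample.
bimodal = np.concatenate([rng.normal(55, 5, 80), rng.normal(80, 5, 142)])
skewed = rng.lognormal(mean=4, sigma=1.5, size=39)


def bin_counts(values, bin_choices=(5, 10, 20)):
    """Return {number of bins: counts per bin} for several bin settings."""
    return {b: np.histogram(values, bins=b)[0] for b in bin_choices}


for name, sample in [("bimodal", bimodal), ("skewed", skewed)]:
    for b, counts in bin_counts(sample).items():
        print(name, b, counts.tolist())
```

Comparing the printed counts across bin settings gives a rough idea of how stable the shape of each histogram is.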

# Productivity versus quality in the assembly plant (p. 25)

## Scatterplot of Productivity vs Quality (p. 26)

```python
import pandas as pd
import matplotlib.pyplot as plt

# Raw string so the backslashes in the Windows path are taken literally
scatter_plot = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\prdq.TAB', sep='\t')
productivity_Y = scatter_plot['Producti']
quality_X = scatter_plot['Quality']

# x = quality, y = productivity, matching the axis labels below
plt.scatter(quality_X, productivity_Y, color='r')
ax = plt.gca()
ax.set_xlabel('Assembly defects per 100 cars')
ax.set_ylabel('Hours per vehicle')
ax.set_title('Scatter Plot of Productivity VS Quality')
plt.show()
```

## Scatter Plot of PRODJAPN vs QUALJAPN (p. 27)

```python
import pandas as pd
import matplotlib.pyplot as plt

# Raw string so the backslashes in the Windows path are taken literally
scatter_plot = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\prdq.TAB', sep='\t')
productivity_Y = scatter_plot['ProdJapn']
quality_X = scatter_plot['QualJapn']

# x = quality, y = productivity, matching the axis labels below
plt.scatter(quality_X, productivity_Y, color='r')
ax = plt.gca()
ax.set_xlabel('Assembly defects per 100 cars (Japanese origin)')
ax.set_ylabel('Hours per vehicle (Japanese origin)')
ax.set_title('Scatter Plot of PRODJAPN VS QUALJAPN')
plt.show()
```

## Scatter Plot of PRODNONJ vs QUALNONJ (p. 27)

```python
import pandas as pd
import matplotlib.pyplot as plt

# Raw string so the backslashes in the Windows path are taken literally
scatter_plot = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\prdq.TAB', sep='\t')
productivity_Y = scatter_plot['ProdNonJ']
quality_X = scatter_plot['QualNonJ']

# x = quality, y = productivity, matching the axis labels below
plt.scatter(quality_X, productivity_Y, color='r')
ax = plt.gca()
ax.set_xlabel('Assembly defects per 100 cars (non-Japanese origin)')
ax.set_ylabel('Hours per vehicle (non-Japanese origin)')
ax.set_title('Scatter Plot of PRODNONJ VS QUALNONJ')
plt.show()
```

## Scatterplot of productivity VS quality (p. 28)

```python
import pandas as pd
import matplotlib.pyplot as plt

# Raw string so the backslashes in the Windows path are taken literally
scatter_plot = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\prdq.TAB', sep='\t')
productivity_Y = scatter_plot['Producti']
quality_X = scatter_plot['Quality']

# x = quality, y = productivity, matching the axis labels below
plt.scatter(quality_X, productivity_Y, color='r')
ax = plt.gca()
ax.set_xlabel('Assembly defects per 100 cars')
ax.set_ylabel('Hours per vehicle')
ax.set_title('Scatter Plot of PRODUCTIVITY VS QUALITY')
plt.show()
```
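
The scatter plots in this section differ only in the column names and labels, so they could be produced by one small helper. The function name `scatter_columns` is hypothetical and the numbers below are invented; only the column names `Quality` and `Producti` come from the page.

```python
import pandas as pd
import matplotlib.pyplot as plt


def scatter_columns(df, x_col, y_col, xlabel, ylabel, title, ax=None):
    """Scatter df[y_col] against df[x_col] and label the axes."""
    if ax is None:
        _, ax = plt.subplots()
    ax.scatter(df[x_col], df[y_col], color='r')
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.set_title(title)
    return ax


# Invented numbers for illustration only.
toy = pd.DataFrame({"Quality": [40, 60, 85, 100], "Producti": [17, 20, 25, 31]})
ax = scatter_columns(toy, "Quality", "Producti",
                     "Assembly defects per 100 cars", "Hours per vehicle",
                     "Scatter Plot of Productivity VS Quality")
```

Call `plt.show()` afterwards to display the figure; the same helper can be reused with the `ProdJapn`/`QualJapn` and `ProdNonJ`/`QualNonJ` columns.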

## Productivity versus quality in the assembly plant (p.29)

It worked the first time but then stopped working. A Windows error is possible, but the more likely cause is the row selection: `.loc[data_comparison['QualNonJ']]` looks up the column values as row labels instead of filtering rows. The per-group columns can be plotted directly:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Raw string so the backslashes in the Windows path are taken literally
data_comparison = pd.read_csv(r'D:\Python\Libri\A_Casebook_for_a_First_Course_in_Statistics_and_Data_Analysis_Datasets\Data\Tab\prdq.TAB', sep='\t')

# 1: quality by origin
# Assumes QualNonJ / QualJapn hold the quality values of each group,
# with missing values in the other rows (hence dropna)
plt.boxplot([data_comparison['QualNonJ'].dropna(), data_comparison['QualJapn'].dropna()])
plt.xticks([1, 2], ['Non-Japanese', 'Japanese'])
plt.show()

# 2: productivity by origin (same assumption for ProdNonJ / ProdJapn)
plt.boxplot([data_comparison['ProdNonJ'].dropna(), data_comparison['ProdJapn'].dropna()])
plt.xticks([1, 2], ['Non-Japanese', 'Japanese'])
plt.show()
```
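
The difference between looking up row labels and filtering rows can be seen on a toy dataframe. The values are invented, and the layout (group columns filled only for their own rows) is an assumption about how `prdq.TAB` is organised, not something the page confirms.

```python
import numpy as np
import pandas as pd

# Invented values; the column names mirror the ones used above.
toy = pd.DataFrame({
    "Quality": [45.0, 60.0, 82.0, 95.0],
    "QualJapn": [45.0, 60.0, np.nan, np.nan],
    "QualNonJ": [np.nan, np.nan, 82.0, 95.0],
})

# .loc with a column of values searches for rows LABELLED 45.0, 60.0, NaN, ...
# Recent pandas versions are expected to raise KeyError here.
try:
    toy.loc[toy["QualJapn"]]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# A boolean mask filters rows instead.
japanese = toy.loc[toy["QualJapn"].notna(), "Quality"]
non_japanese = toy.loc[toy["QualNonJ"].notna(), "Quality"]
print(lookup_failed, japanese.tolist(), non_japanese.tolist())
```
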
python/first_course_statistics.txt · Last modified: 2017/09/26 08:54 by Francesco Beretta
