Monday, October 17, 2022

Gathering Data

Gathering data is the first step in statistical analysis.

Say for example that you want to know something about all the people in France.

The population is then all of the people in France.

It is usually too much effort to gather information about every member of a population (e.g. all 67+ million people living in France). It is often much easier to collect data from a smaller group of that population and analyze that. This smaller group is called a sample.
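As a rough illustration, the sketch below draws a simple random sample from a population. The population here is made up (random ages, not real data about France), and the sizes and the seed are arbitrary choices.

```python
import random

# Hypothetical population: made-up ages for 100,000 people (not real data).
rng = random.Random(42)
population = [rng.randint(0, 90) for _ in range(100_000)]

# Draw a simple random sample of 1,000 members, without replacement.
sample = rng.sample(population, k=1_000)

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

print(f"Population mean age: {population_mean:.1f}")
print(f"Sample mean age:     {sample_mean:.1f}")
```

The two means should be close to each other, even though the sample is only 1% of the population.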

A representative sample

The sample needs to be similar to the whole population of France. It should have the same characteristics as the population. If you only include people named Jacques living in Paris who are 48 years old, the sample will not be similar to the whole population.

So for a good sample, you will need people from all over France, with different ages, professions, and so on.

If the members of the sample have similar characteristics (like age, profession, etc.) to the whole population of France, we say that the sample is representative of the population.
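One simple way to check representativeness is to compare how a characteristic is distributed in the sample and in the population. The sketch below does this for a made-up "region" label. The region names, sizes, and the `shares` helper are all hypothetical and only there for illustration.

```python
import random
from collections import Counter

rng = random.Random(0)

# Hypothetical population: each person is tagged with a made-up region label.
regions = ["North", "South", "East", "West"]
population = [rng.choice(regions) for _ in range(50_000)]


def shares(people):
    """Return the fraction of people that fall in each region."""
    counts = Counter(people)
    return {region: counts[region] / len(people) for region in regions}


# A random sample draws from the whole population.
random_sample = rng.sample(population, k=2_000)

# A biased sample only includes people from one region.
biased_sample = [p for p in population if p == "North"][:2_000]

print("Population:   ", shares(population))
print("Random sample:", shares(random_sample))
print("Biased sample:", shares(biased_sample))
```

The random sample should have region shares close to those of the population, while the biased sample consists of a single region and is not representative.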

A good representative sample is crucial for statistical methods.


Descriptive Statistics

The information (data) from your sample or population can be visualized with graphs or summarized by numbers. This will show key information in a simpler way than just looking at raw data. It can help us understand how the data is distributed.

Graphs can visually show the data distribution.

Examples of graphs include histograms, bar graphs, pie charts, and box plots.

Some graphs are closely connected to numerical summary statistics. The statistics are calculated first, and the graph is then drawn from them.

For example, a box plot visually shows the quartiles of a data distribution.

Quartiles are the values that split the sorted data into four equal-sized parts, or quarters. A quartile is one type of summary statistic.
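As a small sketch, quartiles can be computed with Python's standard `statistics` module. The data values are made up for illustration. Note that there are several conventions for calculating quartiles, so other tools may give slightly different numbers for the same data.

```python
import statistics

# Made-up data for illustration.
data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 18]

# With n=4, statistics.quantiles returns the three cut points Q1, Q2 and Q3
# that divide the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(data, n=4)

print("Q1 (first quartile): ", q1)
print("Q2 (median):         ", q2)
print("Q3 (third quartile): ", q3)
print("Interquartile range: ", q3 - q1)
```

A box plot draws its box from Q1 to Q3, with a line at Q2 (the median).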

Summary statistics

Summary statistics take a large amount of information and sum it up in a few key values.

These values are numbers calculated from the data, and they also describe the shape of the distribution. Each such number is an individual 'statistic'.
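To make this concrete, here is a minimal sketch that calculates a few common statistics using Python's standard `statistics` module. The data values are made up for illustration.

```python
import statistics

# Made-up data for illustration.
data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 18]

summary = {
    "mean": statistics.mean(data),        # the average value
    "median": statistics.median(data),    # the middle value of the sorted data
    "mode": statistics.mode(data),        # the most common value
    "range": max(data) - min(data),       # largest value minus smallest value
    "standard deviation": statistics.stdev(data),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Eleven raw values are reduced to five numbers that describe where the data is centered and how spread out it is.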

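Some important examples are the mean, median, and mode, the range and interquartile range, quartiles and percentiles, and the standard deviation and variance.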

Note: Descriptive statistics is often presented as a part of statistical analysis.

Descriptive statistics is also useful for guiding further analysis, giving insight into the data, and finding what is worth investigating more closely.

Statistical Inference

Statistics from the data in the sample are used to draw conclusions about the whole population. This is a type of statistical inference.


Probability theory is used to calculate the certainty that those statistics also apply to the population.


When using a sample, there will always be some uncertainty about what the data looks like for the population.


Uncertainty is often expressed as confidence intervals.


A confidence interval is a range of values calculated from the sample. It is a numerical way of showing where the true value of a statistic is likely to be for the population.
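As a rough sketch, an approximate 95% confidence interval for a population mean can be calculated from a sample like this. The heights are randomly generated stand-ins, not real measurements. The calculation assumes the sample is large enough for a normal approximation. For small samples, a t-distribution is usually used instead.

```python
import math
import random
import statistics

# Made-up sample: 200 randomly generated heights in cm (not real data).
rng = random.Random(7)
sample = [rng.gauss(170, 10) for _ in range(200)]

n = len(sample)
mean = statistics.mean(sample)

# Standard error of the mean: sample standard deviation divided by sqrt(n).
standard_error = statistics.stdev(sample) / math.sqrt(n)

# z-value for a 95% interval under the normal approximation (about 1.96).
z = statistics.NormalDist().inv_cdf(0.975)

lower = mean - z * standard_error
upper = mean + z * standard_error

print(f"Sample mean: {mean:.1f} cm")
print(f"Approximate 95% confidence interval: {lower:.1f} cm to {upper:.1f} cm")
```

A larger sample gives a smaller standard error and therefore a narrower interval.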


Hypothesis testing is another way of checking whether a statement about a population is true. More precisely, it checks how strongly the sample data supports or contradicts a hypothesis.


Some examples of statements or questions that can be checked with hypothesis testing:


Are people in the Netherlands taller than people in Denmark?

Do people prefer Pepsi or Coke?

Does a new medicine cure a disease?
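A question like the first one can be sketched in code with a permutation test. This is only one of several possible tests and not necessarily the one you would use in practice. The two groups below are randomly generated, made-up heights, not real data about any country.

```python
import random

rng = random.Random(1)

# Made-up height samples in cm for two hypothetical groups (not real data).
group_a = [rng.gauss(183, 7) for _ in range(100)]
group_b = [rng.gauss(181, 7) for _ in range(100)]


def mean(values):
    return sum(values) / len(values)


# The observed difference between the two sample means.
observed = mean(group_a) - mean(group_b)

# Permutation test: if there were no real difference between the groups,
# the group labels would not matter. Shuffle the labels many times and
# count how often a difference at least as large as the observed one appears.
pooled = group_a + group_b
n_a = len(group_a)
n_permutations = 2_000
at_least_as_extreme = 0

for _ in range(n_permutations):
    rng.shuffle(pooled)
    diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
    if abs(diff) >= abs(observed):
        at_least_as_extreme += 1

p_value = at_least_as_extreme / n_permutations

print(f"Observed difference in means: {observed:.2f} cm")
print(f"p-value: {p_value:.3f}")
```

A small p-value means that a difference this large would rarely show up by chance alone if the two groups were really the same.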

Note: Confidence intervals and hypothesis testing are closely related and describe the same things in different ways. Both are widely used in science.


Causal Inference

Causal inference is used to investigate whether one thing causes another.


For example: Does rain make plants grow?


If we think two things are related we can investigate to see if they correlate. Statistics can be used to find out how strong this relation is.
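The strength of such a relation is often measured with a correlation coefficient. The sketch below calculates the Pearson correlation coefficient for made-up rainfall and plant growth numbers. The values and the `pearson` helper are hypothetical and only for illustration.

```python
import math

# Made-up measurements for illustration: rainfall (mm) and plant growth (cm).
rainfall = [10, 20, 30, 40, 50, 60]
growth = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


r = pearson(rainfall, growth)
print(f"Correlation between rainfall and growth: {r:.3f}")
```

A value close to 1 or -1 means a strong linear relation, and a value close to 0 means a weak one. A strong correlation on its own does not show that one thing causes the other.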


Even if things are correlated, finding out if something is caused by other things can be difficult. It can be done with good experimental design or other special statistical techniques.


Note: Good experimental design is often difficult to achieve because of ethical concerns or other practical reasons.

SQL SELECT

 SELECT * FROM EMP