© Copyright 2001-2022, Python Software Foundation.
This page is licensed under the Python Software Foundation License Version 2.
Examples, recipes, and other code in the documentation are additionally licensed under the Zero Clause BSD License.
Last updated on Oct 09, 2022.
You should already know:
- Basic Python: Dataquest offers interactive lessons on Python and data science concepts.
A binary variable is a categorical variable that can take only one of two values. It is usually represented as a Boolean (True or False) or as an integer (0 or 1), where $0$ typically indicates that the attribute is absent and $1$ indicates that it is present.
Some examples of binary variables (attributes) are:
- Smoking is a binary variable with only two possible values: yes or no
- A medical test has two possible outcomes: positive or negative
- Gender is traditionally described as male or female
- Health status can be defined as diseased or healthy
- Company types may have two values: private or public
- E-mails can be assigned to one of two categories: spam or not spam
- Credit card transactions can be fraudulent or legitimate
In some applications, it is useful to construct a binary variable from other types of data. Any non-binary attribute that can be reduced to exactly two categories yields a binary variable. For example, the numerical variable age can be split into two groups: 'less than 30' and '30 or older'.
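The age split described above can be sketched in plain Python. The ages list and the THRESHOLD name are hypothetical, chosen only for illustration; the cut-off of 30 follows the example in the text.

```python
# Hypothetical ages, used only for illustration (not taken from any dataset)
ages = [22, 30, 45, 29, 61]

# Assumed cut-off: 'less than 30' maps to 0, '30 or older' maps to 1
THRESHOLD = 30

# A comparison yields a bool; int() turns it into the 0/1 coding
is_30_or_older = [int(age >= THRESHOLD) for age in ages]

print(is_30_or_older)  # [0, 1, 1, 0, 1]
```

Using `>=` rather than `>` decides which group the boundary value 30 falls into, so the choice of operator should match the stated definition of the two groups.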
Datasets used in machine learning applications often contain binary variables. Applications such as medical diagnosis, spam detection, facial recognition, and financial fraud detection commonly rely on them.
In Python, the Boolean data type (bool) is a binary variable: its only two values are $True$ and $False$.
```python
# Boolean data type
x = True
y = False
print(type(x), type(y))
```
Out:
```
<class 'bool'> <class 'bool'>
```
Additionally, the bool() function converts any object to a Boolean value. It returns $True$ for most values; the built-in values that convert to $False$ are:
- Empty objects (list, tuple, string, dictionary)
- Numeric zero (0, 0.0, 0j)
- None
- False itself
```python
# bool() applied to empty, zero, and non-empty values
print("Boolean value of an empty list is", bool([]))
print("Boolean value of zero is", bool(0))
print("Boolean value of number 10 is", bool(10))
print("Boolean value of an empty string is", bool(''))
print("Boolean value of a string is", bool('string'))
```
Out:
```
Boolean value of an empty list is False
Boolean value of zero is False
Boolean value of number 10 is True
Boolean value of an empty string is False
Boolean value of a string is True
```
Next, a real dataset named birthwt ('Risk Factors Associated with Low Infant Birth Weight') is loaded through the statsmodels library to observe binary variables. The dataset comes from the R package MASS, and get_rdataset downloads it, so an internet connection is required.
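```python
import statsmodels.api as sm

# Download the birthwt dataset from the R package MASS
dataset1 = sm.datasets.get_rdataset(dataname='birthwt', package='MASS')
df1 = dataset1.data  # pandas DataFrame holding the data
df1.head()
```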
The dataset description, obtained from dataset1.__doc__, lists the following columns:
- low : an indicator of whether the birth weight is less than 2.5 kg
- age : mother’s age in years
- lwt : mother’s weight in pounds at last menstrual period
- race : mother’s race (1 = white, 2 = black, 3 = other)
- smoke : smoking status during pregnancy
- ptl : number of previous premature labours
- ht : history of hypertension
- ui : presence of uterine irritability
- ftv : number of physician visits during the first trimester
- bwt : birth weight in grams
According to the dataset description, low, smoke, ht, and ui are binary variables. The pandas method value_counts() returns the count of each unique value in a column.
```python
# Count the occurrences of each value of smoke
df1['smoke'].value_counts()
```
Out:
```
0    115
1     74
Name: smoke, dtype: int64
```
In the following example, the numerical variable age is converted to a binary variable that indicates whether the mother is older than 30.
```python
# Convert a numerical variable to a binary variable
df1['new_age'] = df1['age'] > 30  # the comparison already yields a Boolean column
print('Type of the new variable:\n', type(df1['new_age'].iloc[0]), '\n')
print('Value Counts of the new variable:\n', df1['new_age'].value_counts())
```
Out:
```
Type of the new variable:
 <class 'numpy.bool_'>

Value Counts of the new variable:
 False    169
True      20
Name: new_age, dtype: int64
```
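When no dataset description is available, one rough way to spot candidate binary columns is to count the distinct values in each column. The sketch below uses a small hypothetical DataFrame rather than birthwt, so the values and the names toy, distinct_counts, and binary_columns are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only
toy = pd.DataFrame({
    "smoke": [0, 1, 1, 0, 0],
    "age": [19, 33, 20, 21, 18],
    "ui": [1, 0, 0, 1, 0],
})

# nunique() counts the distinct values per column (missing values excluded)
distinct_counts = toy.nunique()

# Columns with exactly two distinct values are candidate binary variables
binary_columns = distinct_counts[distinct_counts == 2].index.tolist()

print(binary_columns)  # ['smoke', 'ui']
```

A column with exactly two distinct values is only a candidate: a numerical column can hold two values by coincidence, so the result is best confirmed against the meaning of each variable.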