Types of Files

There are two types of files in Python: text files and binary files.
Let's look at how to perform different operations on files in Python.

Open/Close a File

Let's first look at opening a file. You can open a file using the open() function. The open() function takes two arguments: filename and mode. There are different access modes in which you can open a file. Let's look at each of them in detail.

Opening Text Files

We will assume there's a text file named "pythonisfun.txt" for all the modes we look at.

(1) The read mode: r

Using the read mode r, you can open a file for reading. The following statement opens the file pythonisfun.txt for reading and returns the file object f. f = open('pythonisfun.txt', 'r') If you don't specify the mode while opening a file, then it's set to read mode by default. If you specify a file that doesn't exist, you'll get a FileNotFoundError.

(2) The write mode: w

You can open a file for writing using the write mode, w. If the file already exists, the previous data is cleared and new data is written to it. If the file doesn't exist, Python creates a new file for you with the specified file name. The following statement opens the file pythonisfun.txt in write mode. f = open('pythonisfun.txt', 'w')

(3) The append mode: a

You can append data to a file by opening it in append mode, a. If the file doesn't exist, then Python creates a new file with the specified file name. Otherwise, it opens the existing file without clearing the previous data, and anything you write is added at the end. f = open('pythonisfun.txt', 'a')

Opening Binary Files

Opening binary files is very similar to opening text files. Let's assume there is a binary file named "selfie.jpg." We'll use it as an example for all the modes we will look at.

(1) The read mode: rb

You can open a binary file in read mode using rb as the second argument. The following statement opens the file selfie.jpg in read mode, returning the file object f.
f = open('selfie.jpg', 'rb')

(2) The write mode: wb

We can open a binary file in write mode using wb as the second argument. The following statement opens the file selfie.jpg in write mode. f = open('selfie.jpg', 'wb')

(3) The append mode: ab

You can append data to a binary file by opening it in append mode, ab. If the file doesn't exist, then Python creates a new binary file with the specified file name. f = open('selfie.jpg', 'ab') Now, let's look at closing a file. We should always close a file after performing any operation on it, because an open file holds on to system resources, and data written to it may not be flushed to disk until it's closed. For this, we can use the close() method as follows: f.close()

Write/Read a File

Now, let's look at writing data to a file. We saw how to open and close a file in the previous examples, but how do we add data to a file? For this, we first need to open the file in either write or append mode. We can then write data to the file using the write() or writelines() method.
The following example demonstrates the use of the write() and writelines() methods. For that, let's open the file in write mode first. f = open('pythonisfun.txt', 'w') If you run the program and open the pythonisfun.txt file, then the output starts with something like this: Hey We used \n in the example because, unlike the print() function, the write() and writelines() methods do not add a newline character at the end of strings automatically. Now let's look at reading data from a file. The following functions are used to read data from a file: read(), readline(), and readlines().

Watch Now This tutorial has a related video course created by the Real Python team. Watch it together with the written tutorial to deepen your understanding: Explore Your Dataset With Pandas

Do you have a large dataset that's full of interesting insights, but you're not sure where to start exploring it? Has your boss asked you to generate some statistics from it, but they're not so easy to extract? These are precisely the use cases where Pandas and Python can help you! With these tools, you'll be able to slice a large dataset down into manageable parts and glean insight from that information. In this tutorial, you'll learn how to:
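As a runnable sketch of the file operations above (using a temporary directory and the assumed file name pythonisfun.txt so no real files are touched):

```python
import os
import tempfile

# Work inside a temporary directory so no real files are touched.
workdir = tempfile.mkdtemp()
path = os.path.join(workdir, 'pythonisfun.txt')

# 'w' creates the file (or clears an existing one) and opens it for writing.
# write() takes a single string; writelines() takes a list of strings.
# Neither adds a newline for you, so include '\n' explicitly.
f = open(path, 'w')
f.write('Hey\n')
f.writelines(['Python is fun\n', 'Files are easy\n'])
f.close()

# 'a' appends to the file; the existing data is kept.
f = open(path, 'a')
f.write('Appended line\n')
f.close()

# 'r' (the default mode) opens the file for reading.
f = open(path, 'r')
first = f.readline()   # reads one line, including the trailing '\n'
rest = f.readlines()   # reads the remaining lines into a list
f.close()

f = open(path, 'r')
everything = f.read()  # reads the whole file as a single string
f.close()
print(everything)

# Opening a missing file for reading raises FileNotFoundError.
try:
    open(os.path.join(workdir, 'missing.txt'), 'r')
    raised = False
except FileNotFoundError:
    raised = True
```

In practice, you'd usually wrap these calls in a `with open(...) as f:` block, which closes the file for you automatically.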
You'll also learn about the differences between the main data structures that Pandas and Python use. To follow along, you can get all of the example code in this tutorial at the link below:

Setting Up Your Environment

There are a few things you'll need to get started with this tutorial. First is a familiarity with Python's built-in data structures, especially lists and dictionaries. For more information, check out Lists and Tuples in Python and Dictionaries in Python. The second thing you'll need is a working Python environment. You can follow along in any terminal that has Python 3 installed. If you want to see nicer output, especially for the large NBA dataset you'll be working with, then you might want to run the examples in a Jupyter notebook. The last thing you'll need is Pandas and other Python libraries, which you can install with pip:
You can also use the Conda package manager:
If you're using the Anaconda distribution, then you're good to go! Anaconda already comes with the Pandas Python library installed. The examples in this tutorial have been tested with Python 3.7 and Pandas 0.25.0, but they should also work in older versions. You can get all the code examples you'll see in this tutorial in a Jupyter notebook by clicking the link below: Let's get started!

Using the Pandas Python Library

Now that you've installed Pandas, it's time to have a look at a dataset. In this tutorial, you'll analyze NBA results provided by FiveThirtyEight in a 17MB CSV file. Create a script to download the data:
When you execute the script, it will save the dataset as a CSV file in your current working directory. Now you can use the Pandas Python library to take a look at your data: >>>
Here, you follow the convention of importing Pandas in Python with the pd alias. You can see how much data your dataset contains:
>>>
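Since the NBA CSV itself isn't reproduced here, a minimal sketch with a small stand-in dataset (made-up rows, same column names as the tutorial) might look like this:

```python
import io
import pandas as pd

# A tiny stand-in for the NBA CSV file used in the tutorial.
csv_data = io.StringIO(
    "team_id,fran_id,year_id,pts\n"
    "BOS,Celtics,2010,102\n"
    "MNL,Lakers,1949,84\n"
    "LAL,Lakers,2010,99\n"
)
nba = pd.read_csv(csv_data)

# .shape returns (rows, columns); len() returns the number of rows.
print(nba.shape)  # (3, 4)
print(len(nba))   # 3
```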
You use the Python built-in function len() to determine the number of rows. Now you know that there are 126,314 rows and 23 columns in your dataset. But how can you be sure the dataset really contains basketball stats? You can have a look at the first five rows with .head(): If you're following along with a Jupyter notebook, then you'll see a result like this: Unless your screen is quite large, your output probably won't display all 23 columns. Somewhere in the middle, you'll see a column of ellipses (...) marking the columns that were omitted from the display. To display all columns, you can adjust Pandas' display options: >>>
While it’s practical to see all the columns, you probably won’t need six decimal places! Change it to two: >>>
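A sketch of the two relevant display options (option names from Pandas' options system):

```python
import pandas as pd

# Show every column instead of truncating with ellipses,
# and display floats with two decimal places.
pd.set_option("display.max_columns", None)
pd.set_option("display.precision", 2)

print(pd.get_option("display.max_columns"))  # None
print(pd.get_option("display.precision"))    # 2
```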
To verify that you've changed the options successfully, you can execute .head() again: Now, you should see all the columns, and your data should show two decimal places: You can discover some further possibilities of .head() if you experiment with its parameters. Its counterpart .tail() displays the last rows instead. Here's how to print the last three lines of your dataset: Your output should look something like this: You can see the last three lines of your dataset with the options you've set above. Similar to the Python standard library, functions in Pandas also come with several optional parameters. Whenever you bump into an example that looks relevant but is slightly different from your use case, check out the official documentation. The chances are good that you'll find a solution by tweaking some optional parameters!

Getting to Know Your Data

You've imported a CSV file with the Pandas Python library and had a first look at the contents of your dataset. So far, you've only seen the size of your dataset and its first and last few rows. Next, you'll learn how to examine your data more systematically.

Displaying Data Types

The first step in getting to know your data is to discover the different data types it contains. While you can put anything into a list, the columns of a DataFrame contain values of a specific data type. You can display all columns and their data types with .info(): This will produce the following output: You'll see a list of all the columns in your dataset and the type of data each column contains. Here, you can see the data types int64 and object. Pandas uses the NumPy library to work with these types. The object data type is a catch-all: although you can store arbitrary Python objects in it, in practice it usually means the column contains strings.

Showing Basic Statistics

Now that you've seen what data types are in your dataset, it's time to get an overview of the values each column contains. You can do this with .describe(). This function shows you some basic descriptive statistics for all numeric columns:
>>>
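The inspection methods discussed above can be sketched on a small, made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["BOS", "MNL", "LAL", "NYK", "CHI"],
    "pts": [102, 84, 99, 95, 110],
})

print(df.head())      # first five rows (all of them here)
print(df.tail(3))     # last three rows
df.info()             # column names, non-null counts, and dtypes
print(df.describe())  # count, mean, std, min, quartiles, max for numeric columns
```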
Take a look at the output of .describe() and see which values look plausible for your dataset.

Exploring Your Dataset

Exploratory data analysis can help you answer questions about your dataset. For example, you can examine how often specific values occur in a column: >>>
It seems that a team named "Lakers" played more games than the Los Angeles Lakers alone. Find out which other team played under that franchise name: >>>
Indeed, the Minneapolis Lakers account for the remaining games. You can also find out when they played those games: >>>
It looks like the Minneapolis Lakers played between the years of 1948 and 1960. That explains why you might not recognize this team! You've also found out why the Boston Celtics team appears so often in the dataset. Similar to the .min() and .max() aggregations, you can also use .sum(): >>>
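These exploration steps can be sketched with made-up games (the values here are illustrative, not the real NBA figures):

```python
import pandas as pd

# Made-up games standing in for the real dataset.
nba = pd.DataFrame({
    "team_id": ["BOS", "MNL", "MNL", "LAL", "BOS"],
    "fran_id": ["Celtics", "Lakers", "Lakers", "Lakers", "Celtics"],
    "year_id": [1950, 1949, 1954, 2010, 2012],
    "pts": [100, 80, 85, 110, 95],
})

# How often does each franchise appear?
print(nba["fran_id"].value_counts())

# Which team IDs hide behind the "Lakers" franchise?
print(nba.loc[nba["fran_id"] == "Lakers", "team_id"].value_counts())

# When did the MNL team play?
mnl_years = nba.loc[nba["team_id"] == "MNL", "year_id"]
print(mnl_years.min(), mnl_years.max())  # 1949 1954

# Total points the Celtics scored in this toy dataset:
print(nba.loc[nba["fran_id"] == "Celtics", "pts"].sum())  # 195
```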
The Boston Celtics scored a total of 626,484 points. You've got a taste for the capabilities of a Pandas DataFrame. In the following sections, you'll expand on these techniques, but first, you'll zoom in and learn how this powerful data structure works.

Getting to Know Pandas' Data Structures

While a DataFrame provides functions that can feel quite intuitive, the underlying concepts are a bit trickier to understand, so it's worth getting to know Pandas' core data structures.

Understanding Series Objects

Python's most basic data structure is the list, which is also a good starting point for getting to know Series objects. Create a new Series object based on a list: >>>
You've used a plain Python list to create a Series object. A Series wraps two components: a sequence of values and a sequence of identifiers, which is the index. You can access these components with .values and .index, respectively: >>>
While Pandas builds on NumPy, a significant difference is in their indexing. Just like a NumPy array, a Pandas Series has an implicit, positional index. However, a Series can additionally have an index of arbitrary type, such as strings: >>>
Here, the index is a list of city names represented by strings. You may have noticed that Python dictionaries use string indices as well, and this is a handy analogy to keep in mind! You can use the code blocks above to distinguish between two types of Series: one with only a positional index, and one that additionally has a label index. Here's how to construct a Series with a label index from a Python dictionary: >>>
The dictionary keys become the index, and the dictionary values are the Series values. Just like dictionaries, Series objects also support .keys() and the in keyword: >>>
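A sketch of both kinds of Series construction (the city names and numbers are made up for illustration):

```python
import pandas as pd

# A Series built from a list gets an implicit positional index (0, 1, 2, ...).
revenues = pd.Series([5555, 7000, 1980])
print(revenues.values)
print(revenues.index)

# A Series can also carry an explicit label index, like city names.
city_revenues = pd.Series(
    [4200, 8000, 6500],
    index=["Amsterdam", "Toronto", "Tokyo"],
)

# Building a Series from a dict: keys become the index, values the data.
city_employee_count = pd.Series({"Amsterdam": 5, "Tokyo": 8})

# Like dictionaries, Series support .keys() and the `in` keyword.
print(city_employee_count.keys())
print("Tokyo" in city_employee_count)      # True
print("New York" in city_employee_count)   # False
```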
You can use these methods to answer questions about your dataset quickly.

Understanding DataFrame Objects

While a Series is a pretty powerful data structure, it has its limitations: you can only store one attribute per key. A DataFrame doesn't have this problem. If you've followed along with the Series examples, then you should already have two Series objects with cities as keys. You can combine these objects into a DataFrame: >>>
Note how Pandas replaced the missing value with NaN. The new DataFrame index is the union of the two Series indices: >>>
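A sketch of this combination, using the same made-up city Series as above:

```python
import pandas as pd

city_revenues = pd.Series(
    [4200, 8000, 6500],
    index=["Amsterdam", "Toronto", "Tokyo"],
)
city_employee_count = pd.Series({"Amsterdam": 5, "Tokyo": 8})

# The DataFrame index is the union of both Series indices, and the
# missing employee count for Toronto becomes NaN.
city_data = pd.DataFrame({
    "revenue": city_revenues,
    "employee_count": city_employee_count,
})
print(city_data)
print(city_data.index)  # union of the two indices
print(city_data.axes)   # row index (axis 0) and column index (axis 1)
```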
Just like a Series, a DataFrame also stores its values in a NumPy array: >>>
You can also refer to the two dimensions of a DataFrame as axes: >>>
The axis marked with 0 is the row index, and the axis marked with 1 is the column index. This terminology is important to know because you'll encounter several DataFrame methods that accept an axis parameter. A DataFrame is also a dictionary-like data structure, so it supports .keys() and the in keyword. However, for a DataFrame these relate to the columns, not the index: >>>
You can see these concepts in action with the bigger NBA dataset. Does it contain a column for the points scored, and what is it called? Because you didn't specify an index column when you read in the CSV file, Pandas has assigned a RangeIndex to the dataset: >>>
>>>
You can check the existence of a column with .keys() and the in keyword: >>>
The column you're looking for does exist, although under a shorter name than you might have guessed. As you use these methods to answer questions about your dataset, be sure to keep in mind whether you're working with a Series or a DataFrame.

Accessing Series Elements

In the section above, you've created a Pandas Series based on a Python list and compared the two data structures. Next, you'll see how to access its elements. You'll also learn how to use two Pandas-specific access methods: .loc and .iloc.
You'll see that these data access methods can be much more readable than the indexing operator.

Using the Indexing Operator

Recall that a Series has two indices: a positional index, which is always a RangeIndex, and an optional label index, which can contain any hashable objects. Next, revisit the Series with the city labels from earlier: >>>
You can conveniently access the values in a Series with both the label and positional indices: >>>
You can also use negative indices and slices, just like you would for a list: >>>
If you want to learn more about the possibilities of the indexing operator, then check out Lists and Tuples in Python.

Using .loc and .iloc

The indexing operator ([]) is convenient, but there's a caveat: what if the labels are also numbers? Say you have to work with a Series object like this: >>>
What will plain indexing with a number return here: the element with that label, or the element at that position? The good news is, you don't have to figure it out! Instead, to avoid confusion, the Pandas Python library provides two data access methods: .loc, which refers to the label index, and .iloc, which refers to the positional index.
These data access methods are much more readable: >>>
The following figure shows which elements .loc and .iloc refer to: Again, .loc points to the label index, while .iloc points to the positional index. It's easier to keep in mind the distinction between .loc and .iloc than it is to figure out what the indexing operator will return. Both methods also support slicing, but with an important difference:
>>>
If you compare this code with the image above, then you can see that a slice with .iloc excludes the closing element, just like slicing a list. On the other hand, .loc includes it: >>>
This code block says to return all elements with a label index between the two boundary labels, including both endpoints. You can also pass a negative positional index to .iloc: >>>
You start from the end of the Series and return an element counted from there. You can use the code blocks above to distinguish between two Series behaviors: using .iloc on a Series is similar to using [] on a list, while using .loc is similar to using [] on a dictionary.
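The .loc/.iloc distinction can be sketched with a small Series whose labels are themselves numbers (made-up values):

```python
import pandas as pd

# Labels are numbers here, so plain indexing would be ambiguous.
colors = pd.Series(
    ["red", "purple", "blue", "green", "yellow"],
    index=[1, 2, 3, 5, 8],
)

print(colors.loc[1])    # 'red'    -> label index
print(colors.iloc[1])   # 'purple' -> positional index

# Slicing: .iloc excludes the end, .loc includes it.
print(list(colors.iloc[1:3]))  # ['purple', 'blue']
print(list(colors.loc[3:8]))   # ['blue', 'green', 'yellow']

# Negative positions work with .iloc, counting from the end.
print(colors.iloc[-2])  # 'green'
```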
Be sure to keep these distinctions in mind as you access elements of your Series objects.

Accessing DataFrame Elements

Since a DataFrame consists of Series objects, you can use the very same tools to access its elements. The crucial difference is the additional dimension of the DataFrame: you use the indexing operator for the columns and the access methods .loc and .iloc for the rows.

Using the Indexing Operator

If you think of a DataFrame as a dictionary whose values are Series, then it makes sense that you can access its columns with the indexing operator: >>>
Here, you use the indexing operator to select a single column as a Series. If the column name is a string, then you can use attribute-style accessing with dot notation as well: >>>
There's one situation where accessing DataFrame elements with dot notation may not work or may lead to surprises: when a column name coincides with a DataFrame attribute or method name: >>>
The indexing operation still returns the correct column in that case, while attribute-style access returns the DataFrame attribute or method instead.

Using .loc and .iloc

Similar to a Series, a DataFrame also provides the .loc and .iloc data access methods: >>>
Each line of code selects a different row from the DataFrame. Alright, you've used .loc and .iloc on small data structures. Now, try them on the bigger NBA dataset. The second-to-last row is the row with the positional index of -2: >>>
You'll see the output as a Series object. For a DataFrame, the data access methods .loc and .iloc also accept a second parameter to select columns: >>>
Note that you separate the parameters with a comma (,). It's time to see the same construct in action with the bigger NBA dataset. First, define which rows you want to see, then list the relevant columns: >>>
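A minimal sketch of row and column selection with a made-up DataFrame (the label index and values are illustrative):

```python
import pandas as pd

games = pd.DataFrame(
    {
        "fran_id": ["Knicks", "Celtics", "Lakers", "Bulls"],
        "pts": [101, 95, 110, 99],
        "opp_pts": [98, 100, 102, 97],
    },
    index=[5555, 5556, 5557, 5558],
)

# Row selection: .loc uses the label index, .iloc the positional index.
print(games.loc[5556])   # the row labeled 5556
print(games.iloc[-2])    # the second-to-last row

# A second parameter selects columns: three rows, two columns.
print(games.loc[5556:5558, ["fran_id", "pts"]])
```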
You use .loc with a slice of row labels and a list of column names. You should see a small part of your quite huge dataset: The output is much easier to read! With data access methods like .loc and .iloc, you can select just the right subset of your DataFrame.

Querying Your Dataset

You've seen how to access subsets of a huge dataset based on its indices. Now, you'll select rows based on the values in your dataset's columns to query your data. For example, you can create a new DataFrame that contains only games played after a certain year: >>>
You still have all of the columns, but your new DataFrame only consists of the rows where your condition holds. You can also select the rows where a specific field is not null: >>>
This can be helpful if you want to avoid any missing values in a column. You can also use .notna() to achieve the same goal. You can even access values of the object data type as str and perform string methods on them: >>>
You use .str to access string methods like .endswith() and filter the rows by the text content of a column. You can combine multiple criteria and query your dataset as well. To do this, be sure to put each one in parentheses and use the logical operators & and |. Do a search for Baltimore games where both teams scored over 100 points. In order to see each game only once, you'll need to exclude duplicates: >>>
Here, you use a field that marks duplicated mirror entries to include each game only once. Your output should contain five eventful games: Try to build another query with multiple criteria. In the spring of 1992, both teams from Los Angeles had to play a home game at another court. Query your dataset to find those two games. Both teams have an ID starting with "LA". You can use .str to find them: >>>
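The querying techniques above can be sketched on a made-up DataFrame (the rows here are invented, not the real games):

```python
import pandas as pd

nba = pd.DataFrame({
    "team_id": ["LAL", "LAC", "BOS", "NYK"],
    "year_id": [1992, 1992, 2008, 1970],
    "pts": [104, 98, 110, 90],
    "opp_pts": [101, 105, 95, 102],
})

# Rows where a condition on a column holds:
modern = nba[nba["year_id"] > 1990]
print(len(modern))  # 3

# Multiple criteria: put each condition in parentheses, join with & or |.
high_scoring = nba[(nba["pts"] > 100) & (nba["opp_pts"] > 100)]
print(high_scoring)

# String methods via the .str accessor:
la_teams = nba[nba["team_id"].str.startswith("LA")]
print(la_teams)
```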
Your output should show two games on the day 5/3/1992: Nice find! When you know how to query your dataset with multiple criteria, you'll be able to answer more specific questions about your dataset.

Grouping and Aggregating Your Data

You may also want to learn other features of your dataset, like the sum, mean, or maximum value of a group of elements. Luckily, the Pandas Python library offers grouping and aggregation functions to help you accomplish this task. A Series has a number of methods for calculating descriptive statistics, for example: >>>
The first method returns the total of the Series' values, while the second returns their maximum. Remember, a column of a DataFrame is actually a Series object, so you can apply these same functions to the columns of your dataset: >>>
A DataFrame can have multiple columns, which introduces new possibilities for aggregations, like grouping: >>>
By default, Pandas sorts the group keys during the call to .groupby(). If you don't want that, then pass sort=False. You can also group by multiple columns: >>>
You can practice these basics with an exercise. Take a look at the Golden State Warriors' 2014-15 season. First, you can group by the playoffs field, then by the game result: >>>
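A sketch of grouping and aggregating, again with invented rows:

```python
import pandas as pd

nba = pd.DataFrame({
    "fran_id": ["Warriors", "Warriors", "Warriors", "Celtics"],
    "is_playoffs": [0, 0, 1, 0],
    "game_result": ["W", "L", "W", "W"],
    "pts": [110, 95, 108, 99],
})

# Sum the points per franchise:
print(nba.groupby("fran_id")["pts"].sum())

# Group by multiple columns, e.g. regular season vs. playoffs and result:
print(nba.groupby(["is_playoffs", "game_result"])["pts"].count())

# Pass sort=False if you don't need sorted group keys:
print(nba.groupby("fran_id", sort=False)["pts"].sum())
```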
In the examples above, you've only scratched the surface of the aggregation functions that are available to you in the Pandas Python library. To see more examples of how to use them, check out Pandas GroupBy: Your Guide to Grouping Data in Python.

Manipulating Columns

You'll need to know how to manipulate your dataset's columns in different phases of the data analysis process. You can add and drop columns as part of the initial data cleaning phase, or later based on the insights of your analysis. Create a copy of your original DataFrame to protect it from your manipulations: >>>
You can define new columns based on the existing ones: >>>
Here, you used existing columns and a vectorized operation to calculate the values of a new column. Next, inspect the new column with an aggregation: >>>
Here, you used an aggregation function such as .max() on your new column. You can also rename the columns of your dataset. It seems that some of the column names are too verbose, so go ahead and rename them now: >>>
Note that there's a new object: like several other data manipulation methods, .rename() returns a new DataFrame by default instead of modifying the original. Your dataset might contain columns that you don't need. For example, Elo ratings may be a fascinating concept to some, but you won't analyze them in this tutorial. You can delete the four columns related to Elo: >>>
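A sketch of these column manipulations with a small, made-up DataFrame:

```python
import pandas as pd

nba = pd.DataFrame({
    "pts": [110, 95],
    "opp_pts": [101, 105],
    "game_result": ["W", "L"],
    "elo_i": [1500.0, 1490.0],
})

# Work on a copy to protect the original DataFrame:
df = nba.copy()

# Define a new column from existing ones with a vectorized operation:
df["difference"] = df["pts"] - df["opp_pts"]
print(df["difference"].max())  # 9

# .rename() returns a new DataFrame by default:
renamed_df = df.rename(columns={"game_result": "result"})
print(renamed_df.columns.tolist())

# Drop columns you don't need; inplace=True modifies df directly:
df.drop("elo_i", axis=1, inplace=True)
print(df.columns.tolist())
```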
Remember, you added a new column and dropped four, so the number of columns of your DataFrame changed accordingly.

Specifying Data Types

When you create a new DataFrame, either by calling a constructor or by reading a CSV file, Pandas assigns a data type to each column based on its values.
Take another look at the columns of the dataset with .info(). You'll see the same output as before: ten of your columns have the data type object. Most of them should be treated as plain strings, but the column holding the game dates is a good candidate for conversion: >>>
Here, you use .to_datetime() to specify all game dates as datetime objects. Other columns contain text that is a bit more structured. The column describing the game location can have only three different values: >>>
Which data type would you use in a relational database for such a column? You would probably not use a varchar type, but rather an enum. Pandas provides the categorical data type for the same purpose: >>>
Run df.info() again afterward, and you'll notice that the memory usage of the converted column has dropped.
You'll often encounter datasets with too many text columns. An essential skill for data scientists to have is the ability to spot which columns they can convert to a more performant data type. Take a moment to practice this now. Find another column in the dataset that has a generic data type and could be converted to a more specific one:
>>>
To improve performance, you can convert it into a categorical column as well: >>>
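A sketch of the conversion and its effect on memory (the column name mirrors the tutorial; the values are repeated to make the savings visible):

```python
import pandas as pd

df = pd.DataFrame({"game_location": ["H", "A", "N", "H", "A", "H"] * 1000})

before = df["game_location"].memory_usage(deep=True)

# A column with few distinct values is a good candidate for the
# categorical data type, similar to an enum in a database.
df["game_location"] = pd.Categorical(df["game_location"])

after = df["game_location"].memory_usage(deep=True)
print(df["game_location"].dtype)  # category
print(before > after)             # True: the categorical column is smaller
```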
You can use df.info() to compare the memory usage before and after the conversion. As you work with more massive datasets, memory savings become especially crucial. Be sure to keep performance in mind as you continue to explore your datasets.

Cleaning Data

You may be surprised to find this section so late in the tutorial! Usually, you'd take a critical look at your dataset to fix any issues before you move on to a more sophisticated analysis. However, in this tutorial, you'll rely on the techniques that you've learned in the previous sections to clean your dataset.

Missing Values

Have you ever wondered why .info() shows how many non-null values a column contains? The reason is that missing values are common in real-world datasets. When you inspect the dataset, you'll see that one column contains far fewer non-null values than the others. This output shows that the notes column is empty for most of the games. Sometimes, the easiest way to deal with records containing missing values is to ignore them. You can remove all the rows with missing values using .dropna(): >>>
Of course, this kind of data cleanup doesn't make sense for your dataset here, because it would remove the vast majority of the rows. You can also drop problematic columns if they're not relevant for your analysis. To do this, use .dropna() again and provide the axis=1 parameter: >>>
Now, the resulting DataFrame contains all the games again, just without the mostly empty column. If there's a meaningful default value for your use case, then you can also replace the missing values with that: >>>
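The three strategies can be sketched with a tiny made-up DataFrame:

```python
import pandas as pd

games = pd.DataFrame({
    "pts": [110, 95, 102],
    "notes": ["at Chicago", None, None],
})

# Drop rows that contain any missing value:
rows_dropped = games.dropna()
print(len(rows_dropped))  # 1

# Or drop the problematic column instead (axis=1 works column-wise):
cols_dropped = games.dropna(axis=1)
print(cols_dropped.columns.tolist())  # ['pts']

# Or fill missing values with a meaningful default:
filled = games.copy()
filled["notes"] = filled["notes"].fillna("no notes at all")
print(filled["notes"].tolist())
```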
Here, you fill the empty values with a default string using .fillna().

Invalid Values

Invalid values can be even more dangerous than missing values. Often, you can perform your data analysis as expected, but the results you get are peculiar. This is especially important if your dataset is enormous or was filled in manually. Invalid values are often more challenging to detect, but you can implement some sanity checks with queries and aggregations. One thing you can do is validate the ranges of your data. For this, .describe() is quite handy. Recall that it shows the minimum and maximum of each numeric column. The range of years looks plausible. But how can the minimum of the points column be 0? Let's have a look at those games: >>>
This query returns a single row: It seems the game was forfeited. Depending on your analysis, you may want to remove it from the dataset.

Inconsistent Values

Sometimes a value would be entirely realistic in and of itself, but it doesn't fit with the values in the other columns. You can define some query criteria that are mutually exclusive and verify that these don't occur together. In the NBA dataset, the values of the fields for points, opponent points, and game result should be consistent with each other. You can check this by querying for combinations that should never occur, such as a win with fewer points than the opponent: >>>
Fortunately, both of these queries return an empty DataFrame. Be prepared for surprises whenever you're working with raw datasets, especially if they were gathered from different sources or through a complex pipeline. You might see rows where a team scored more points than their opponent, but still didn't win, at least according to your dataset! To avoid situations like this, make sure you add further data cleaning techniques to your Pandas and Python arsenal.

Combining Multiple Datasets

In the previous section, you've learned how to clean a messy dataset. Another aspect of real-world data is that it often comes in multiple pieces. In this section, you'll learn how to grab those pieces and combine them into one dataset that's ready for analysis. Earlier, you combined two Series objects into a DataFrame based on a shared index. Now, you'll take this one step further and combine whole datasets. Say you've managed to gather data on two more cities: >>>
This second DataFrame contains info on two more cities. You can add these cities to your existing city DataFrame using .concat(): >>>
Now, the new variable contains the data from both DataFrame objects. By default, concat() combines along axis=0; in other words, it appends rows. You can also use it to append columns by supplying the parameter axis=1: >>>
Note how Pandas added NaN for the missing values. If you want to combine only the entries that appear in both DataFrame objects, then you can set the join parameter to "inner": >>>
While it's most straightforward to combine data based on the index, it's not the only possibility. You can use .merge() to implement a join operation similar to the one from SQL: >>>
Here, you pass the left_on parameter to .merge() to indicate which column you want to join on. The result is a bigger DataFrame that contains not only city data, but also the data of the joined table for each city. Note that the result contains only the cities where the joining value is known and appears in the joined DataFrame. .merge() performs an inner join by default. If you want to include all cities in the result, then you need to specify the how parameter:
>>>
With this left join, you'll see all the cities, including those without matching data in the joined DataFrame:
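A sketch of both combination techniques, with made-up city and country data (the tutorial's own tables differ; for simplicity the join here uses `on="country"` with a country column on both sides):

```python
import pandas as pd

city_data = pd.DataFrame(
    {"revenue": [4200, 6500], "country": ["Holland", "Japan"]},
    index=["Amsterdam", "Tokyo"],
)
further_city_data = pd.DataFrame(
    {"revenue": [7000, 3400], "country": ["USA", "Spain"]},
    index=["New York", "Barcelona"],
)

# concat() appends rows by default (axis=0):
all_city_data = pd.concat([city_data, further_city_data])
print(len(all_city_data))  # 4

# merge() joins on column values, like a SQL join; it's an inner
# join by default, so cities without a matching country drop out:
countries = pd.DataFrame(
    {"country": ["Holland", "Japan"], "continent": ["Europe", "Asia"]}
)
inner = pd.merge(all_city_data, countries, on="country")
print(len(inner))  # 2

# how="left" keeps every city, filling missing country info with NaN:
left = pd.merge(all_city_data, countries, on="country", how="left")
print(len(left))  # 4
```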
Welcome back, New York & Barcelona!

Visualizing Your Pandas DataFrame

Data visualization is one of the things that works much better in a Jupyter notebook than in a terminal, so go ahead and fire one up. If you need help getting started, then check out Jupyter Notebook: An Introduction. You can also access the Jupyter notebook that contains the examples from this tutorial by clicking the link below: Include this line to show plots directly in the notebook: >>>
Both Series and DataFrame objects have a .plot() method, which is a wrapper around matplotlib.pyplot.plot(). By default, it creates a line plot. Visualize how many points your favorite team scored throughout the seasons: >>>
This shows a line plot with several peaks and two notable valleys around the years 2000 and 2010: You can also create other types of plots, like a bar plot: >>>
This will show the franchises with the most games played: The Lakers are leading the Celtics by a minimal edge, and there are six further teams with a game count above 5000. Now try a more complicated exercise. In 2013, the Miami Heat won the championship. Create a pie plot showing the count of their wins and losses during that season. Here's one solution: First, you define criteria to include only the Heat's games from 2013. Then, you create a plot in the same way as you've seen above: >>>
Here's what a champion pie looks like: The slice of wins is significantly larger than the slice of losses! Sometimes, the numbers speak for themselves, but often a chart helps a lot with communicating your insights. To learn more about visualizing your data, check out Interactive Data Visualization in Python With Bokeh.

Conclusion

In this tutorial, you've learned how to start exploring a dataset with the Pandas Python library. You saw how you could access specific rows and columns to tame even the largest of datasets. Speaking of taming, you've also seen multiple techniques to prepare and clean your data, by specifying the data type of columns, dealing with missing values, and more. You've even created queries, aggregations, and plots based on those. Now you can:
This journey using the NBA stats only scratches the surface of what you can do with the Pandas Python library. You can power up your project with Pandas tricks, learn techniques to speed up Pandas in Python, and even dive deep to see how Pandas works behind the scenes. There are many more features for you to discover, so get out there and tackle those datasets! You can get all the code examples you saw in this tutorial by clicking the link below: