Data Extraction and Data Manipulation Using Python

Whenever a new dataset arrives, the first step is to extract the data and manipulate it. This is the most important part, because it tells you the most about what the dataset contains. I have taken the IPL dataset as an example. Let's see how it's done.

First, we need to import libraries such as pandas and NumPy.
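A minimal sketch of the import step. The aliases pd and np are only the usual community convention, not something the dataset requires.

```python
# Conventional aliases for the two libraries.
import pandas as pd
import numpy as np

# Quick check that both libraries are installed and importable.
print(pd.__version__, np.__version__)
```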

After that, we need to read the dataset. I am using a variable named "df" to store it.
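Reading the data might look like the sketch below. To keep it runnable without the real file, it reads a few made-up rows from an in-memory string; the column names other than dismissal_kind are assumptions, and with the actual CSV on disk you would pass its path to pd.read_csv instead.

```python
import io
import pandas as pd

# Hypothetical stand-in for the IPL file. With a real file this would be
# something like pd.read_csv("path/to/file.csv").
csv_text = """batsman,batsman_runs,dismissal_kind
SK Raina,4,
SK Raina,0,caught
V Kohli,6,
"""

# Store the dataset in a variable named df.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)  # (number of rows, number of columns)
```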

After reading the dataset, we can look at its first 5 rows with the head function. We are not limited to 5 rows: head accepts the number of rows as an argument, so df.head(10) returns the first 10 rows of the dataset.
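Here is a small sketch of head on a toy frame of 20 made-up rows (the column names are placeholders, not the real IPL columns).

```python
import pandas as pd

# Toy frame with 20 rows standing in for the real dataset.
df = pd.DataFrame({"ball": range(1, 21), "runs": [i % 7 for i in range(20)]})

first_five = df.head()    # no argument: the first 5 rows
first_ten = df.head(10)   # any row count can be passed

print(first_five)
print(len(first_ten))
```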

Similarly, we can get the last 5 rows of the dataset with the tail function. It also accepts the number of rows we want as an argument.
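The same idea for tail, again on a toy frame with a placeholder column:

```python
import pandas as pd

# Toy frame with 20 rows standing in for the real dataset.
df = pd.DataFrame({"ball": range(1, 21)})

last_five = df.tail()     # no argument: the last 5 rows
last_three = df.tail(3)   # the last 3 rows

print(last_five)
print(last_three)
```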

Now, if you want to know the count, mean, standard deviation, minimum, median, and maximum of each column, pandas has a function named describe that reports all of these values. By default it covers only the numeric columns, and the median appears as the 50% row.
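A short sketch of describe on made-up values. The column names are assumptions for illustration only.

```python
import pandas as pd

# One numeric and one text column, with made-up values.
df = pd.DataFrame({
    "total_runs": [0, 1, 4, 6, 1],
    "batsman": ["A", "B", "A", "C", "B"],
})

# Only the numeric column shows up in the default summary.
summary = df.describe()
print(summary)

# Individual statistics can be read by row label; "50%" is the median.
median_runs = summary.loc["50%", "total_runs"]
mean_runs = summary.loc["mean", "total_runs"]
print(median_runs, mean_runs)
```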

If you want to know the data type of each column, pandas provides the dtypes attribute, which lists the data type of every column. (A single column has a dtype attribute instead.)
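For example, on a toy frame with hypothetical columns:

```python
import pandas as pd

# Made-up columns of three different kinds.
df = pd.DataFrame({
    "over": [1, 1, 2],
    "batsman": ["A", "B", "A"],
    "strike_rate": [120.5, 98.0, 133.3],
})

# dtypes is an attribute, so there are no parentheses.
print(df.dtypes)

# A single column exposes its type through dtype.
print(df["over"].dtype)
```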

If you want to know the names of all the columns, you can use the columns attribute.
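A quick sketch, using a one-row frame whose column names (apart from dismissal_kind) are assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "match_id": [1],
    "batsman": ["A"],
    "dismissal_kind": ["caught"],
})

# columns is an attribute that holds the column labels.
print(df.columns)

# Convert to a plain Python list if that is easier to work with.
column_names = df.columns.tolist()
print(column_names)
```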

These are the names of all the columns in the dataset.

If you want to know which columns have the object data type, you can filter the columns by their data type.
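One way to do this is select_dtypes; I am not claiming it is the only way. In this sketch the text columns are created with an explicit object dtype so that the example does not depend on how the installed pandas version infers text columns. The data is made up.

```python
import pandas as pd

df = pd.DataFrame({
    "over": [1, 2, 3],
    # Explicit object dtype, set here on purpose for the illustration.
    "batsman": pd.Series(["A", "B", "C"], dtype="object"),
    "dismissal_kind": pd.Series(["caught", None, "bowled"], dtype="object"),
})

# Keep only the object columns, then read off their names.
object_columns = df.select_dtypes(include="object").columns.tolist()
print(object_columns)
```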

As you can see, it returns the columns whose data type is object.

If you want to know whether the dataset contains any null values, you can use the isnull() function.
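isnull() returns a True/False value per cell, so it is usually combined with sum() or any() to get a summary. A sketch on made-up data (player_dismissed is an assumed column name):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "batsman": ["A", "B", "C", "D"],
    "player_dismissed": [np.nan, "B", np.nan, "D"],
})

# True where a value is missing, False otherwise.
null_mask = df.isnull()

# Number of missing values in each column.
nulls_per_column = df.isnull().sum()
print(nulls_per_column)

# Single yes/no answer for the whole dataset.
has_any_null = bool(df.isnull().values.any())
print(has_any_null)
```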

Since the dataset is about the IPL, let us explore something related to cricket. Suppose we want to know which type of dismissal has occurred most often in the IPL.

For that, we go to the dismissal_kind column and call the value_counts function, which gives us the frequency of each type of dismissal.
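A sketch with a handful of made-up dismissals; the real column has far more rows. Missing values (balls with no dismissal) are left out of the counts by default.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "dismissal_kind": ["caught", "bowled", "caught", np.nan,
                       "run out", "caught", "bowled"],
})

# Frequencies, sorted from most to least common.
dismissal_counts = df["dismissal_kind"].value_counts()
print(dismissal_counts)

# The first label is the most frequent type of dismissal.
most_common = dismissal_counts.index[0]
print(most_common)
```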

Now, if we want to know which batsman has been dismissed most often in the IPL, we can use the same value_counts function; only the column name changes.

As you can see, Suresh Raina has been dismissed the most times.

Now, if you want to know how many fours and sixes have been hit in the IPL, you can count them with a simple for loop.
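The loop could look something like this. The column name batsman_runs is an assumption, the values are made up, and the sketch treats every 4 or 6 scored off a ball as a boundary.

```python
import pandas as pd

# Runs scored by the batsman on each ball (made-up values).
df = pd.DataFrame({"batsman_runs": [0, 4, 1, 6, 4, 2, 6, 6, 0, 4]})

fours = 0
sixes = 0
for runs in df["batsman_runs"]:
    if runs == 4:
        fours += 1
    elif runs == 6:
        sixes += 1

print(fours, sixes)

# The same count without a loop, by summing a boolean comparison.
fours_vectorized = int((df["batsman_runs"] == 4).sum())
sixes_vectorized = int((df["batsman_runs"] == 6).sum())
```

On a large dataset the comparison-and-sum version is usually faster than the explicit loop, but both give the same numbers.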

Now, if we want to access the data by position, we can use iloc, which stands for integer location. It selects rows by their integer position, which matches the default index as long as the index has not been changed.
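A sketch of slicing with iloc on a toy frame. The 200:300 slice is my assumption about the kind of call used here; note that the end position is excluded, as in normal Python slicing.

```python
import pandas as pd

# 1000 rows with the default 0..999 index.
df = pd.DataFrame({"ball": range(1000)})

# Rows at positions 200 up to, but not including, 300.
subset = df.iloc[200:300]
print(subset.head())

# A single cell: first row, first column.
first_value = df.iloc[0, 0]
```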

As you can see, it returns all the rows with indexes between 200 and 300.

If we want to access rows by a non-default index, such as one created by the user, we can use loc. It selects rows by label, so it also works with non-integer indexes.
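A small sketch of loc with a user-created index. The names and runs are made up, and using the batsman column as the index is just one illustrative choice. Unlike iloc, a label slice with loc includes both endpoints.

```python
import pandas as pd

df = pd.DataFrame({
    "batsman": ["A", "B", "C", "D"],
    "runs": [30, 45, 12, 60],
}).set_index("batsman")  # the index is now made of text labels

# One row, selected by its label.
one_row = df.loc["B"]
print(one_row)

# Label slice: both "B" and "C" are included.
label_slice = df.loc["B":"C"]
print(label_slice)
```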

That is all I have on this topic for now. Next time I will try to explore it in more depth.

Thank you.
