I'm trying to create a dataframe from my data (scores between chemicals and proteins) with pandas in python.
I want my dataframe to first display the proteins that have the most occurences, so I previously sorted my data. But when I make the dataframe it does not get the expected result.
Here's a sample of my data :
chemicals prots scores
CID000000006 10116.ENSRNOP00000003921 196
CID000000051 10116.ENSRNOP00000003921 246
CID000000085 10116.ENSRNOP00000003921 196
CID000000119 10116.ENSRNOP00000003921 247
CID000000134 10116.ENSRNOP00000008952 159
CID000000135 10116.ENSRNOP00000008952 157
CID000000174 10116.ENSRNOP00000008952 439
CID000000175 10116.ENSRNOP00000001021 858
CID000000177 10116.ENSRNOP00000004027 760
As you can see "10116.ENSRNOP00000003921" is the protein with the most occurences in my data.
So i'd like to get something like :
10116.ENSRNOP00000003921 10116.ENSRNOP00000008952
CID000000006 196
CID000000051 246
CID000000085 196
CID000000119 247
CID000000134 159
CID000000135 157
CID000000174 439
And Here's my code :
import pandas as pd
df_rat= pd.read_csv("dt_matrix_rat.csv",sep="\t", header=True)
df_rat.columns = ['chemicals','proteins','scores']
df_rat1 = df_rat.pivot(index='chemicals', columns='proteins', values='scores')
df_rat1.to_csv("rat_matrix.csv", sep='\t', index=True )
I think you need sort_values
of notnull
of sum
and get index to cols
. Lasy use subset
:
df1 = df.pivot(index='chemicals', columns='proteins', values='scores')
cols = df1.notnull().sum(axis=0).sort_values(ascending=False).index
print cols
Index([u'10116.ENSRNOP00000003921', u'10116.ENSRNOP00000008952',
u'10116.ENSRNOP00000004027', u'10116.ENSRNOP00000001021'],
dtype='object', name=u'proteins')
print df1[cols]
proteins 10116.ENSRNOP00000003921 10116.ENSRNOP00000008952 \
chemicals
CID000000006 196.0 NaN
CID000000051 246.0 NaN
CID000000085 196.0 NaN
CID000000119 247.0 NaN
CID000000134 NaN 159.0
CID000000135 NaN 157.0
CID000000174 NaN 439.0
CID000000175 NaN NaN
CID000000177 NaN NaN
proteins 10116.ENSRNOP00000004027 10116.ENSRNOP00000001021
chemicals
CID000000006 NaN NaN
CID000000051 NaN NaN
CID000000085 NaN NaN
CID000000119 NaN NaN
CID000000134 NaN NaN
CID000000135 NaN NaN
CID000000174 NaN NaN
CID000000175 NaN 858.0
CID000000177 760.0 NaN
Or reindex_axis
:
print df1.reindex_axis(cols, axis=1)
proteins 10116.ENSRNOP00000003921 10116.ENSRNOP00000008952 \
chemicals
CID000000006 196.0 NaN
CID000000051 246.0 NaN
CID000000085 196.0 NaN
CID000000119 247.0 NaN
CID000000134 NaN 159.0
CID000000135 NaN 157.0
CID000000174 NaN 439.0
CID000000175 NaN NaN
CID000000177 NaN NaN
proteins 10116.ENSRNOP00000004027 10116.ENSRNOP00000001021
chemicals
CID000000006 NaN NaN
CID000000051 NaN NaN
CID000000085 NaN NaN
CID000000119 NaN NaN
CID000000134 NaN NaN
CID000000135 NaN NaN
CID000000174 NaN NaN
CID000000175 NaN 858.0
CID000000177 760.0 NaN
Collected from the Internet
Please contact [email protected] to delete if infringement.
Comments