
Hello and thanks for your time and consideration. I am developing a Jupyter Notebook in the Google Cloud Platform / Datalab. I have created a Pandas DataFrame and would like to write this DataFrame to Google Cloud Storage (GCS) and/or BigQuery. I have a bucket in GCS and have, via the following code, created the following objects:

import gcp
import gcp.storage as storage

# Legacy Datalab "gcp" API: look up the default project and check that the bucket exists.
project = gcp.Context.default().project_id
bucket_name = 'steve-temp'
bucket_path = bucket_name
bucket = storage.Bucket(bucket_path)
bucket.exists()

I have tried various approaches based on the Google Datalab documentation but continue to fail. Thanks.

I've developed a python package specifically designed for transferring data from one location (e.g. a pandas.DataFrame) to another one (e.g. BigQuery or Storage): google-pandas-load.readthedocs.io/en/latest/. Moreover it has 100% test coverage. – augustin-barillec Dec 5, 2019 at 9:38

Uploading to Google Cloud Storage without writing a temporary file and only using the standard GCS module

from google.cloud import storage
import os
import pandas as pd

# Only needed if you're running this code locally; on GCP the default
# application credentials are picked up automatically.
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = r'/your_GCP_creds/credentials.json'

# Use lists (not sets) so the values keep their column order.
df = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])

client = storage.Client()
bucket = client.get_bucket('my-bucket-name')
bucket.blob('upload_test/test.csv').upload_from_string(df.to_csv(), 'text/csv')
If you only want to push the file to a bucket on GCS then this is a more suitable solution. It can also be used in case you want to push out JSON format: bucket.blob('upload_test/test.json').upload_from_string(df.to_json(), 'text/json') – Amjad Desai Feb 22, 2021 at 20:56

Hi @Theo, could you please elaborate a bit more on this: "without writing a temporary"? Thanks! – RafalK Mar 29 at 9:46

@RafalK a few years ago when I wrote this there were a few examples of how to do this, but they always included the intermediate step of writing the data frame to disk temporarily and then uploading the file, or used another package to simplify the process. Writing to disk is unnecessary and I didn't want to add another package, so this solution avoids both constraints and is probably a bit faster as well. – Theo Mar 30 at 12:37

This is hilariously simple. Just make sure to also install gcsfs as a prerequisite (though it'll remind you anyway). If you're coming here in 2020 or later, just skip the complexity and do this. – bsplosion Dec 16, 2021 at 4:59

Is there a way to make a saved file publicly accessible directly by passing an argument? – Danish Bansal Jan 28, 2022 at 9:31

It is not working for me. I created an Ubuntu server and ran pip install pandas fsspec gcsfs. I am able to read a CSV file using pd.read_csv('gs://BUCKET_PATH') but not able to write. – Shiv Krishna Jaiswal Apr 9, 2022 at 3:30
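For reference, here is a minimal sketch of the gcsfs-backed shortcut those comments refer to. The bucket path is a placeholder, and it assumes credentials are available in the environment (e.g. via GOOGLE_APPLICATION_CREDENTIALS or the default GCP service account):

import pandas as pd  # requires the gcsfs package to be installed for gs:// URLs

df = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])

# pandas hands the gs:// path to fsspec/gcsfs, which streams the CSV straight to the bucket.
df.to_csv('gs://my-bucket-name/upload_test/test.csv', index=False)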
from datalab.context import Context
import google.datalab.storage as storage
import google.datalab.bigquery as bq
import pandas as pd
# Dataframe to write
simple_dataframe = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6]], columns=['a', 'b', 'c'])
sample_bucket_name = Context.default().project_id + '-datalab-example'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/Hello.txt'
bigquery_dataset_name = 'TestDataSet'
bigquery_table_name = 'TestTable'
# Define storage bucket
sample_bucket = storage.Bucket(sample_bucket_name)
# Create storage bucket if it does not exist
if not sample_bucket.exists():
    sample_bucket.create()
# Define BigQuery dataset and table
dataset = bq.Dataset(bigquery_dataset_name)
table = bq.Table(bigquery_dataset_name + '.' + bigquery_table_name)
# Create BigQuery dataset
if not dataset.exists():
    dataset.create()
# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_data(simple_dataframe)
table.create(schema = table_schema, overwrite = True)
# Write the DataFrame to GCS (Google Cloud Storage)
%storage write --variable simple_dataframe --object $sample_bucket_object
# Write the DataFrame to a BigQuery table
table.insert(simple_dataframe)

I used this example, and the _table.py file from the datalab GitHub site as a reference. You can find other datalab source code files at this link.

Just a note: I believe you need to execute the %%storage commands in a separate cell from the Python code? – dartdog Mar 31, 2016 at 16:13

It depends on whether you want to execute a line magic or cell magic command. For cell magic it is %%storage, for line magic it is %storage. It's ok to use line magic commands in the same cell as other code. Cell magic commands must be in a separate cell from other code. – Anthonios Partheniou Mar 31, 2016 at 16:43

Thanks very much Anthonios... I was able to successfully create all of the objects (e.g., the table and the schema are in my Project/Dataset in BQ). However, no rows were actually written to the table and no error messages were generated. – EcoWarrior Mar 31, 2016 at 19:45

A populated table was generated in the Jupyter Notebook after table.Insert_data(out) and this line was at the bottom of that table: (rows: 0, edw-p19090000:ClickADS2.ADS_Logit1) – EcoWarrior Mar 31, 2016 at 19:52
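To make the distinction in those comments concrete, a rough sketch; it assumes the datalab storage magic takes the same --variable/--object arguments in both forms:

# Line magic (single %): can share a cell with ordinary Python code.
%storage write --variable simple_dataframe --object $sample_bucket_object

and, as a cell magic in a cell of its own with no other code:

%%storage write --variable simple_dataframe --object $sample_bucket_object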

Writing a Pandas DataFrame to BigQuery

Update on @Anthonios Partheniou's answer.
The code is a bit different now, as of Nov. 29, 2017.

To define a BigQuery dataset

Pass a tuple containing project_id and dataset_id to bq.Dataset.

# define a BigQuery dataset    
bigquery_dataset_name = ('project_id', 'dataset_id')
dataset = bq.Dataset(name = bigquery_dataset_name)

To define a BigQuery table

Pass a tuple containing project_id, dataset_id and the table name to bq.Table.

# define a BigQuery table    
bigquery_table_name = ('project_id', 'dataset_id', 'table_name')
table = bq.Table(bigquery_table_name)

Create the dataset/table and write to the table in BQ

# Create BigQuery dataset
if not dataset.exists():
    dataset.create()
# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_data(dataFrame_name)
table.create(schema = table_schema, overwrite = True)
# Write the DataFrame to a BigQuery table
table.insert(dataFrame_name)

Since 2017, pandas has had a DataFrame-to-BigQuery function, pandas.DataFrame.to_gbq.

The documentation has an example:

import pandas_gbq as gbq
gbq.to_gbq(df, 'my_dataset.my_table', projectid, if_exists='fail')

The if_exists parameter can be set to 'fail', 'replace' or 'append'.

See also this example.
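Equivalently, the DataFrame method can be called directly. A minimal sketch, assuming the pandas-gbq package is installed; the dataset, table and project names below are placeholders:

import pandas as pd

df = pd.DataFrame({'a': [1, 4], 'b': [2, 5], 'c': [3, 6]})

# Append to (or create) the table; authentication follows the usual Google credential lookup.
df.to_gbq('my_dataset.my_table', project_id='my-project-id', if_exists='append')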

To save a parquet file in GCS, authenticating with a service account:

df.to_parquet("gs://<bucket-name>/file.parquet",
              storage_options={"token": "<path-to-gcs-service-account-file>"})

I have a slightly simpler solution for the task using Dask. You can convert your DataFrame to a Dask DataFrame, which can be written to CSV on Cloud Storage directly:

import dask.dataframe as dd
import pandas

df  # your Pandas DataFrame

# Convert to a single-partition Dask DataFrame and write it straight to GCS.
# 'gcs' is assumed to be an already-created gcsfs.GCSFileSystem object supplying the credentials.
ddf = dd.from_pandas(df, npartitions=1, sort=True)
ddf.to_csv('gs://YOUR_BUCKET/ddf-*.csv', index=False, sep=',', header=False,
           storage_options={'token': gcs.session.credentials})

I think you need to load it into a plain bytes variable and use %%storage write --variable $sample_bucketpath (see the doc) in a separate cell... I'm still figuring it out, but that is roughly the inverse of what I needed to do to read a CSV file in. I don't know if it makes a difference on write, but I had to use BytesIO to read the buffer created by the %%storage read command... Hope it helps, let me know!
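For what it's worth, a rough sketch of the read path described above, assuming the datalab storage magic accepts --object and --variable on read and hands the content back as raw bytes (the object path is a placeholder):

from io import BytesIO
import pandas as pd

# Pull the object into a variable, then wrap the raw bytes so pandas can parse them.
%storage read --object gs://my-bucket/data.csv --variable csv_bytes
df_from_gcs = pd.read_csv(BytesIO(csv_bytes))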
