printf(Hello World!\n)
tech blog

views: 1456

pqxx::stream_to() is even better

03-10-2020

It’s time to talk about the limitations of COPY.

From PostgreSQL’s own documentation:

COPY with a file name instructs the PostgreSQL server to directly read from or write to a file. The file must be accessible to the server and the name must be specified from the viewpoint of the server. When STDIN or STDOUT is specified, data is transmitted via the connection between the client and the server.

https://www.postgresql.org/docs/9.2/sql-copy.html

What that means is that the code we wrote earlier will choke if you use it to push local data to a remote server.

void pushRoads2(std::string& filename){
 
  //If dbHost is not localhost, we're going to have some issues!
  const std::string connStr("user=" + dbUser + " password=" + dbPasswd + " host=" + dbHost + " dbname=" + dbName);
 
  try {
    pqxx::connection conn(connStr);
    pqxx::work txn(conn);
    std::string cpyCmd = "COPY osm_ways (id, highway, way) FROM '" + filename + "' csv delimiter E'\t'";
    txn.exec(cpyCmd);
    txn.commit();
  }
  catch (const std::exception& e) {
    cout << "PQXX exception: " << e.what() << endl;
  }
 
  return;
}

error 1: COPY-ing to a remote database

Staring push . . .
ERROR:  could not open file "/data/DBOUT_small/atlas_conurb_print.csv" for reading: No such file or directory

... ERROR!

Here’s that error in practice. The file is obviously there; I just COPYed it up to my local database using the exact same code, so I also know the format is good.

The problem is that it does not exist on the remote machine. COPY (when called this way from libpqxx) looks for a file *local to the database server*. My remote server, of course, has no such file.

Additionally, even when trying to use the COPY command to push data to a local database, we can run into issues with database user privileges:

error 2: COPY on localhost without superuser privileges

Starting push. . .
PQXX exception: ERROR:  must be superuser or a member of the pg_read_server_files role to COPY from a file
HINT:  Anyone can COPY to stdout or from stdin. psql's \copy command also works for anyone.

Done!  Push to DB took 0.006621 seconds.

Now that we understand the problem, what are some things we can try?

Bad solutions:

You might not have permissions on the remote server, so copying files around or compiling code there aren’t always an option. Plus, it’s a huge hassle.

OK solution:

cat /path/to/file.csv | psql --host=remote.machine.com -U remote_user -d remote_db -c "COPY osm_ways (id, highway, way) FROM STDIN DELIMITER E'\t'"

This will definitely get your data up there, but it requires popping out of your C++ program to run commands directly through the shell. Fine in some instances, but it kind of defeats the point.

Better solution:

Here’s how our COPY code looks when it’s re-written to use stream_to():

void pushRoads3(std::string& filename){
 
  const std::string connStr("user=" + dbUser + " password=" + dbPasswd + " host=" + dbHost + " dbname=" + dbName);
 
  try {
    pqxx::connection conn(connStr);
    pqxx::work txn(conn);
 
    ifstream csvFile(filename);
    std::string line;
 
    //Specify your target columns, just like you do with the normal COPY command
    std::vector<string> tableCols = {"id", "highway", "way"};
 
    //Pass an open pqxx::transaction type (txn), a destination table name, and a vector of column names
    pqxx::stream_to stream( txn, "osm_ways", tableCols );
    while( std::getline(csvFile, line) ){
      std::vector<string> vals = split(line, '\t');
 
      string id = vals[0];
      string hwy = vals[1];
      string way = vals[2];
 
      //You can't use a vector here.  Create a std::tuple with the values you are going to stream in.  Tuple takes an arbitrary number of values.
      stream << std::make_tuple( id, hwy, way );
    }
    stream.complete(); //close the stream
    txn.commit();
  }
  catch (const std::exception& e) {
    cout << "PQXX exception: " << e.what() << endl;
  }
 
  return;
}

success!

Starting push. . .
Done!  Push to DB took 7.2294 seconds.

As I understand it, stream_to is just a pqxx wrapper around COPY. It replaces the deprecated tablestream, and runs in a similar timeframe. My push with steam_to took almost exactly the same amount of time as my simpler COPY version, so I’m fairly happy with the solution overall.

(For those who need to get data out of a database, there is an equivalent method called stream_from).