

PostgreSQL's COPY command is amazing

03-03-2020

Let’s start off with a hypothetical, shall we?

You have a large dataset. At least several GB of data, tens of millions of lines of CSV. It’s not huge, but it’s not insignificant either.

Q: What is the *best* way of getting lots of data into a PostgreSQL database?

For our problem today, I'll be using geospatial data derived from OpenStreetMap. I've extracted the osm_id, highway type, and geometry from a large set of roads in North America, converted it to SRID 4326, and saved it as a tab-delimited CSV file:

sample data (osm_ways.dat):

45805679        residential     SRID=4326;LINESTRING(-159.395507277812 21.9631000970925,-159.395650977812 21.9632712970925,-159.395728877812 21.9634491970925,-159.395748277812 21.9635620970924,-159.395751877812 21.9636870970924,-159.395741777812 21.9637586970924,-159.395697877812 21.9638595970924,-159.395617077812 21.9639678970924,-159.395516877812 21.9640527970924,-159.395284477812 21.9641947970924,-159.394294577812 21.9647673970923,-159.393551577812 21.9652254970922,-159.392944577812 21.9656067970922,-159.392305877812 21.9659907970921,-159.391668677813 21.9663948970921)
45809747        service SRID=4326;LINESTRING(-159.395779977812 21.967449997092,-159.395789977812 21.967389997092,-159.395789977812 21.967336997092,-159.395789977812 21.967179997092,-159.395778777812 21.967133097092,-159.395756277812 21.967098697092)
379872841       trunk   SRID=4326;LINESTRING(-159.388419777813 21.9696987970917,-159.388790377813 21.9695644970917,-159.390155077813 21.9691278970918,-159.391336077813 21.9687556970918,-159.391899177812 21.9685880970918,-159.392315677812 21.9684602970918,-159.392820077812 21.9682945970919,-159.393508477812 21.9680701970919,-159.394306977812 21.9677606970919,-159.394882277812 21.967494497092,-159.395756277812 21.967098697092,-159.395824577812 21.967069497092,-159.395892977812 21.967035997092,-159.396829977812 21.9665983970921,-159.397820577812 21.9661419970921,-159.398510377812 21.9658081970922,-159.399338077811 21.9654119970922,-159.399913977811 21.9650855970922,-159.400394177811 21.9648104970923)
. . .

Note: for this example, I'm using a slightly smaller file than what I normally run. My sample contains 431,390 lines and weighs in at 163 MB. There is no header, though I could specify one if I wanted to.

table structure (destination table):

CREATE TABLE osm_ways ( id bigint, highway text, way geometry(LineString,4326) );
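
Note: the geometry(LineString,4326) column type is provided by PostGIS, so you'll need that extension installed in your database first (CREATE EXTENSION postgis;).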

Now it’s time to plug all of that data into our database. The normal, most intuitive approach is to open the CSV file, iterate over every line and perform an INSERT for each new row of data. For this example, I’m going to be using libpqxx to connect to PostgreSQL:

#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include <pqxx/pqxx>

void pushRoads(std::string& filename){

  // dbUser, dbPasswd, dbHost and dbName are defined elsewhere
  const std::string connStr("user=" + dbUser + " password=" + dbPasswd + " host=" + dbHost + " dbname=" + dbName);

  try {
    pqxx::connection conn(connStr);
    pqxx::work txn(conn);

    std::ifstream csvFile(filename);
    std::string line;
    while( std::getline(csvFile, line) ){
      // split() breaks the line into fields on tabs (see the sketch after this function)
      std::vector<std::string> vals = split(line, '\t');

      std::string id  = vals[0];
      std::string hwy = vals[1];
      std::string way = vals[2];

      // txn.quote() escapes and single-quotes the string values for us
      std::string insertRoad = "INSERT INTO osm_ways (id, highway, way) VALUES ("
                               + id + ", " + txn.quote(hwy) + ", " + txn.quote(way) + ")";
      pqxx::result result = txn.exec(insertRoad);
    }
    txn.commit();
  }
  catch (const std::exception& e) {
    std::cout << "PQXX exception: " << e.what() << std::endl;
  }

  return;
}
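
By the way, split() isn't part of the C++ standard library. Here's a minimal sketch of the kind of helper I'm assuming above; it just breaks a line into fields on a single-character delimiter:

#include <sstream>
#include <string>
#include <vector>

// Minimal split helper: breaks a line into fields on a single-character delimiter.
std::vector<std::string> split(const std::string& line, char delim){
  std::vector<std::string> fields;
  std::stringstream ss(line);
  std::string field;
  while( std::getline(ss, field, delim) ){
    fields.push_back(field);
  }
  return fields;
}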

Not bad. This code is easy to understand and relatively quick to write. But it does have one issue: it’s sloooooooooooow.

Starting push. . .
Done!  Push to DB took 79.3649 seconds.
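
(In case you're wondering where those timings come from: a simple std::chrono stopwatch around the push call is enough. The wrapper below is just an illustrative sketch, not part of the loading code itself.)

#include <chrono>
#include <iostream>

// Illustrative timing wrapper: a steady_clock stopwatch around the push function
void timedPush(std::string& filename){
  std::cout << "Starting push. . ." << std::endl;
  auto start = std::chrono::steady_clock::now();
  pushRoads(filename);
  auto stop = std::chrono::steady_clock::now();
  std::chrono::duration<double> elapsed = stop - start;
  std::cout << "Done!  Push to DB took " << elapsed.count() << " seconds." << std::endl;
}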

Luckily, PostgreSQL provides a faster way of ingesting data. COPY allows you to stream your data directly into a destination table. It does, however, come with a few caveats:

- COPY is a bit finicky about formatting: you have to tell it exactly which delimiter and header conventions your file uses, or the load will fail.
- The server-side COPY ... FROM 'file' form reads the file from the database server's filesystem, so the file has to be visible to the server process, and you need the privileges to read it there.

Here’s our loading code, reworked to use COPY:

void pushRoads2(std::string& filename){

  const std::string connStr("user=" + dbUser + " password=" + dbPasswd + " host=" + dbHost + " dbname=" + dbName);

  try {
    pqxx::connection conn(connStr);
    pqxx::work txn(conn);

    // Note: if your CSV file has a header, tell COPY to ignore it
    // with "csv header" instead of just "csv".
    // Server-side COPY: filename is a path on the database server.
    std::string cpyCmd = "COPY osm_ways (id, highway, way) FROM '" + filename + "' csv";
    txn.exec(cpyCmd);
    txn.commit();
  }
  catch (const std::exception& e) {
    std::cout << "PQXX exception: " << e.what() << std::endl;
  }

  return;
}

error:

Starting push. . .
PQXX exception: ERROR:  extra data after last expected column
CONTEXT:  COPY osm_ways, line 1: "164960067     trunk   SRID=4326;LINESTRING(-159.400394177811 21.9648104970923,-159.399841077811 21.9650312..."

Oops, that didn’t go well. What happened?

Remember when I said that COPY is a bit finicky when it comes to delimiters? Our file is tab-delimited, rather than comma-delimited, so we need to tell PostgreSQL how to read it correctly. Here's the change we need to make:

//Be careful with delimiters!
//E'\t' tells PostgreSQL that our file is tab-delimited.
//You would use ',' for normal CSV files (that's also the default).
//Note the double backslash: we want PostgreSQL to receive the two
//characters \t, not a literal tab embedded in the C++ string.
std::string cpyCmd = "COPY osm_ways (id, highway, way) FROM '" + filename + "' csv delimiter E'\\t'";

After that little change, let’s try running it again:

Starting push. . .
Done!  Push to DB took 7.83385 seconds.

All done! 400,000+ rows of geospatial data in under 10 seconds, courtesy of COPY. The speedup makes sense: instead of parsing, planning, and executing 431,390 individual INSERT statements, the server ingests the whole file as a single streamed command.

That’s all for now. Please note that this method works well for uploading local files onto a local database, but there are some more issues we’ll need to resolve to make this work remotely. Next time, I’ll be going over several ways to solve the remote COPY problem.